Synthetic Data for Machine Learning

What is synthetic data?

Synthetic data refers to artificially generated samples derived from real cases, with the goal of retaining their statistically descriptive features. A synthetic dataset aims to replace real data, either to preserve data privacy or to provide more samples than the original contains. Synthetic data is not made-up data, just as a restored picture is not a new image. By analyzing synthetic data, we can discover patterns that may not be apparent in real data. For example, if we have a low-resolution picture with an object in the bottom-right corner that we cannot clearly identify, a restoration tool may allow us to recognize that the object is a dog. In a similar way, synthetic data generation algorithms can help us understand the relationships between variables in tabular data, even if those relationships are not clear in the original data.

Why is synthetic data important for NextBrain?

The main barrier to bringing machine learning technologies to a significant percentage of users is data. To produce accurate predictions, most algorithms used to solve classification or regression problems require a huge amount of data. However, not all users have access to a large amount of data (what is known as “Big Data”). On the contrary, the majority of users, whether in business, professional practice, or academia, deal with a limited amount of data. Accessing data is expensive and time-consuming.

To overcome this barrier, more data must be made available to users. There are two possible solutions. The first is to provide access to external data sources that users can draw on to make decisions; we already do this at NextBrain by providing several connectors. The second is, quite literally, to invent the data. But how can we “invent” data? Technologies available today make it possible. Say we have a spreadsheet of data that describes a problem we want to solve, and the table has 20 rows and 10 columns. Machine learning technologies require more data than this: with so few rows, any algorithm can only do so much, and the conclusions we draw will be questionable. But consider creating another table based on this one, with 300 rows and 10 columns. With this larger table, algorithms can produce more reliable results.

How do we do this magic?

Generative Adversarial Networks, or GANs, are the technology at the heart of these generative applications. GANs were introduced by Ian Goodfellow and colleagues in 2014 (Goodfellow et al., 2014). The idea was to engineer two separate neural networks and pit them against each other. The first neural network starts out by generating new data that is statistically similar to the input data. The second neural network is tasked with identifying which data is artificially created and which is not. Both networks continuously compete with one another: the first tries to trick the second, and the second tries to figure out what the first is doing. The game ends when the second network can no longer ‘discriminate’ whether data comes from the first network's output or from the original data. We call the first network the generator and the second network the discriminator.
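The alternating game described above can be sketched in a few lines of Python. This is only a toy illustration, not NextBrain's architecture: it assumes one-dimensional data, a linear generator G(z) = a*z + b, and a logistic discriminator D(x) = sigmoid(w*x + c). The function name train_toy_gan and all of its parameters are hypothetical, and the generator uses the common non-saturating variant of the loss.

```python
import math
import random


def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def train_toy_gan(real, steps=2000, batch=16, lr=0.05, seed=0):
    """Alternate discriminator and generator updates on 1-D data.

    Generator:      G(z) = a * z + b, with noise z ~ N(0, 1)
    Discriminator:  D(x) = sigmoid(w * x + c), the estimated P(x is real)
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters (illustrative starting values)
    w, c = 0.0, 0.0  # discriminator parameters

    for _ in range(steps):
        xs = [rng.choice(real) for _ in range(batch)]
        zs = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fakes = [a * z + b for z in zs]

        # Discriminator step: gradient ascent on
        # log D(x) + log(1 - D(G(z))), i.e. learn to tell real from fake.
        gw = gc = 0.0
        for x in xs:
            d = sigmoid(w * x + c)
            gw += (1.0 - d) * x
            gc += 1.0 - d
        for f in fakes:
            d = sigmoid(w * f + c)
            gw -= d * f
            gc -= d
        w += lr * gw / batch
        c += lr * gc / batch

        # Generator step: gradient ascent on log D(G(z)),
        # i.e. learn to make the discriminator call fakes real.
        ga = gb = 0.0
        for z in zs:
            d = sigmoid(w * (a * z + b) + c)
            ga += (1.0 - d) * w * z
            gb += (1.0 - d) * w
        a += lr * ga / batch
        b += lr * gb / batch

    return (a, b), (w, c)
```

Calling train_toy_gan(values) on a list of numbers returns the generator parameters (a, b) and the discriminator parameters (w, c); new synthetic values can then be drawn as a * z + b with z sampled from a standard normal. A model this small is not guaranteed to settle, and a real tabular GAN replaces both linear models with neural networks.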

At NextBrain we have released our own GAN architecture based on a Wasserstein GAN (Arjovsky et al., 2017). We have developed a special architecture suitable for training with a very small number of samples.

The most critical step in generating synthetic data is to check its similarity or “closeness” to real data. At NextBrain we have made a strong effort to develop cutting-edge tools for this comparison, so that our synthetic data can replace original data samples with confidence (Marin, 2022).
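As a rough illustration of what a closeness check can look like, the sketch below compares each numeric column of a real table with the same column of a synthetic table using the two-sample Kolmogorov-Smirnov statistic. This is an assumption made for illustration, not the evaluation method of Marin (2022) or of NextBrain's tools; the function names and the dict-of-lists table layout are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    r, s = sorted(real), sorted(synthetic)
    gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for x in r + s:
        cdf_r = bisect_right(r, x) / len(r)
        cdf_s = bisect_right(s, x) / len(s)
        gap = max(gap, abs(cdf_r - cdf_s))
    return gap


def compare_columns(real_table, synthetic_table):
    """Return {column name: KS statistic} for columns present in both
    tables. Each table is a dict mapping a column name to a list of numbers."""
    return {
        name: ks_statistic(real_table[name], synthetic_table[name])
        for name in real_table
        if name in synthetic_table
    }
```

Values near 0 suggest that a synthetic column follows the same distribution as the real one. Note that this check looks at one column at a time, so it says nothing about whether relationships between columns are preserved; a fuller evaluation would also need to compare those.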

References:

Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. International Conference on Machine Learning, 214–223.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27.

Marin, J. (2022). An experimental study on Synthetic Tabular Data Evaluation. arXiv preprint arXiv:2211.10760.