The main barrier to bringing machine learning technologies to most users is data. To make accurate predictions, the algorithms used to solve classification or regression problems require a large volume of data. However, not all users have access to large amounts of data (what is known as "Big Data"). On the contrary, the majority of users, whether from the business world, a professional activity, or academia, deal with a limited amount of data, because accessing data is expensive and time-consuming.
To overcome this barrier, more data must be made available to users. There are two possible solutions. The first is to provide access to external data sources that users can draw on to make decisions; we already do this at NextBrain by providing several connectors. The second is, quite literally, to invent the data. But how are we going to "invent" data? It is possible: technologies now exist that make it feasible. Say we have a spreadsheet describing a problem we want to solve, with 20 rows and 10 columns. Machine learning technologies require more data than this: with so few samples, any algorithm can only do so much, and the conclusions we draw will be questionable. But consider creating another table based on this one, with 300 rows and the same 10 columns. Thanks to this, we can now obtain far more realistic results from our algorithms.
How do we do this magic?
Generative Adversarial Networks, or GANs, are the technology at the heart of these generative applications. GANs were introduced by Ian Goodfellow in 2014. The idea is to engineer two separate neural networks and pit them against each other. The first network generates new data that is statistically similar to the input data. The second network is tasked with identifying which data is artificially created and which is not. The two networks continuously compete: the first tries to trick the second, and the second tries to figure out what the first is doing. The game ends when the second network can no longer "discriminate" between data coming from the first network and the original data. We call the first network the generator and the second the discriminator.
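The adversarial loop described above can be sketched with a toy example: a two-parameter generator learning to mimic a one-dimensional Gaussian against a logistic-regression discriminator. Everything here (the affine generator, the target distribution, the learning rate) is an illustrative assumption for clarity, not NextBrain's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Real data: samples from N(5, 1). The generator maps noise z ~ N(0, 1)
# through an affine function G(z) = a*z + b, so it must learn a ~ 1, b ~ 5.
REAL_MEAN, REAL_STD = 5.0, 1.0

# Generator parameters (a, b); discriminator D(x) = sigmoid(w*x + c)
# estimates the probability that x came from the real data.
a, b = 1.0, 0.0
w, c = 0.0, 0.0
lr, batch = 0.05, 64

for _ in range(2000):
    # --- Discriminator step: push D(real) toward 1 and D(fake) toward 0 ---
    x_real = rng.normal(REAL_MEAN, REAL_STD, batch)
    x_fake = a * rng.normal(0.0, 1.0, batch) + b
    p_real = sigmoid(w * x_real + c)
    p_fake = sigmoid(w * x_fake + c)
    # Gradients of the cross-entropy loss w.r.t. the pre-sigmoid scores
    gs_real, gs_fake = p_real - 1.0, p_fake
    w -= lr * np.mean(gs_real * x_real + gs_fake * x_fake)
    c -= lr * np.mean(gs_real + gs_fake)

    # --- Generator step: trick D into scoring fakes as real ---
    z = rng.normal(0.0, 1.0, batch)
    x_fake = a * z + b
    gs = (sigmoid(w * x_fake + c) - 1.0) * w  # d(-log D(fake)) / d(x_fake)
    a -= lr * np.mean(gs * z)
    b -= lr * np.mean(gs)

# After training, generated samples should cluster near the real mean of 5.
fake = a * rng.normal(0.0, 1.0, 10000) + b
print(float(np.mean(fake)))
```

Even in this tiny setting the characteristic dynamic appears: the discriminator first learns to separate the two distributions, and the generator then shifts its output until the discriminator can no longer tell them apart.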
At NextBrain we have released our own GAN architecture based on the Wasserstein GAN (Arjovsky et al., 2017). We have developed a special architecture suitable for being trained with a very small number of samples.
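In the Wasserstein GAN of Arjovsky et al., the discriminator is replaced by a "critic" that outputs an unconstrained score rather than a probability, and the original paper keeps the critic approximately 1-Lipschitz by clipping its weights. The sketch below shows those generic WGAN loss functions; it illustrates the published recipe, not NextBrain's proprietary architecture:

```python
import numpy as np

def critic_loss(real_scores, fake_scores):
    # The critic maximizes E[f(real)] - E[f(fake)],
    # i.e. it minimizes the negation of that gap.
    return np.mean(fake_scores) - np.mean(real_scores)

def generator_loss(fake_scores):
    # The generator tries to raise the critic's score on its samples.
    return -np.mean(fake_scores)

def clip_weights(weights, clip=0.01):
    # Weight clipping is the original paper's way of enforcing the
    # Lipschitz constraint; later variants use a gradient penalty instead.
    return [np.clip(w, -clip, clip) for w in weights]

real = np.array([1.0, 2.0, 3.0])   # critic scores on real samples
fake = np.array([0.0, 1.0])        # critic scores on generated samples
print(critic_loss(real, fake))     # 0.5 - 2.0 = -1.5
print(generator_loss(fake))        # -0.5
```

Because the critic's score gap approximates the Wasserstein distance between the real and generated distributions, its loss remains informative even when the two distributions barely overlap, which is exactly the regime of very small datasets.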