The field of data science and machine learning is growing every day. New models and algorithms appear constantly, and they require huge amounts of data for training and testing. Deep learning models, which are gaining popularity, are especially data-intensive. Collecting that much data for each new problem statement is a tedious, time-consuming and expensive process. Because the data is gathered from real scenarios, it also raises security and privacy concerns. Most of it is protected by privacy laws and regulations, which hampers the sharing and movement of data between organizations, or sometimes between departments within the same organization, delaying experiments and product testing. So the question arises: how can this problem be solved? How can data be made more accessible and open without compromising anyone's privacy?
The solution to this problem is synthetic data.
So what is synthetic data?
By definition, synthetic data is data that is artificially or algorithmically generated to closely resemble the structure and underlying statistical properties of real data. If the synthesis is good, the result is indistinguishable from real data.
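To make the definition concrete, here is a minimal sketch of the idea. We treat a small sample of hypothetical "real" values (customer ages, invented for illustration) as the source, fit simple summary statistics to it, and then sample synthetic values from the fitted distribution. Real synthesizers model far richer structure, but the principle is the same: the synthetic data mimics the statistics of the real data without copying any individual record.

```python
import random
import statistics

random.seed(0)

# Hypothetical "real" data: a sample of customer ages (invented values).
real_ages = [random.gauss(40, 12) for _ in range(1000)]

# Fit simple summary statistics to the real data...
mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

# ...and draw brand-new synthetic values from the fitted distribution.
# No synthetic value is a copy of a real record, yet the overall
# statistical shape is preserved.
synthetic_ages = [random.gauss(mu, sigma) for _ in range(1000)]

print(round(statistics.mean(synthetic_ages), 1))  # close to the real mean
```

A production-grade synthesizer would capture correlations between many columns, not just one marginal distribution, but this is the core resemblance property the definition refers to.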
How many different types of synthetic data can there be?
The answer to this question is very open, as data can take many forms, but mainly we have
- Text data
- Audio and visual data (images, videos and audio clips)
- Tabular data
Use cases of synthetic data for machine learning
We will discuss use cases for the three types of synthetic data mentioned above.
- Using synthetic text data for training NLP models
Synthetic data has applications in the field of natural language processing. For example, Amazon's Alexa AI team uses synthetic data to complete the training set for its NLU (natural language understanding) system. This gives the system a solid foundation for learning new languages before sufficient consumer-interaction data exists.
- Using synthetic data to train vision algorithms
Let's discuss a popular use case here. Suppose we want to develop an algorithm to detect or count the number of faces in an image. Real photographs contain actual people's faces, so privacy policies limit how such data can be used. Instead, we can use a GAN or another generative network to produce realistic human faces, i.e. faces that don't exist in the real world, and train the model on those. Another advantage is that we can generate as much data as we want from these algorithms without infringing anyone's privacy.
Another use case is reinforcement learning in a simulated environment. Suppose we want to test a robotic arm designed to grab an object and place it in a box. A reinforcement learning algorithm is designed for this purpose. We have to do experiments to test it because that’s how the reinforcement learning algorithm learns. Setting up an experiment in a real-world scenario is quite expensive and time-consuming, which limits the number of different experiments we can run. But if we do the experiments in the simulated environment, setting up the experiment is relatively inexpensive because it won’t require a robotic arm prototype.
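The simulated-environment idea above can be sketched in a few lines. The toy environment below is a hypothetical stand-in for a real robotics simulator (the class name, reward values and one-dimensional state are all invented for illustration): the "arm" moves along a single axis, and an episode ends when it reaches the object. A random policy interacts with it; thousands of such episodes cost essentially nothing compared with runs on a physical prototype.

```python
import random

class ToyGraspEnv:
    """Hypothetical, minimal stand-in for a robotic-arm simulator.

    The arm moves along one axis; the episode succeeds when the arm
    reaches the object's position.
    """

    def __init__(self, size=10):
        self.size = size
        self.reset()

    def reset(self):
        self.arm = 0
        self.target = random.randrange(1, self.size)
        return self.arm

    def step(self, action):
        """Apply an action (-1 = left, +1 = right) and return (state, reward, done)."""
        self.arm = max(0, min(self.size - 1, self.arm + action))
        done = self.arm == self.target
        reward = 1.0 if done else -0.01  # small step cost, success bonus
        return self.arm, reward, done

# A random policy interacting with the simulated environment.
random.seed(1)
env = ToyGraspEnv()
state, done, steps = env.reset(), False, 0
while not done and steps < 1000:
    state, reward, done = env.step(random.choice([-1, 1]))
    steps += 1
print(done, steps)
```

A real RL setup would replace the random policy with a learning algorithm (e.g. Q-learning or a policy gradient), but the economics are the same: every experiment in the simulator is cheap and repeatable.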
- Using synthetic tabular data

Tabular synthetic data is artificially generated data that mimics real-world data stored in tables, structured in rows and columns. These tables can contain any kind of data, such as a music playlist: for each song, your music player keeps a lot of information, like its name, the singer, its duration and its genre. They can also hold financial records like bank transactions, stock prices, etc.
Synthetic tabular data related to banking transactions is used to train models and design algorithms to detect fraudulent transactions. Past stock price data can be used to train and test models to predict future stock prices.
One of the significant advantages of using synthetic data in machine learning is that the developer has control over the data; they can modify it as needed to test any idea and experiment with it. In the meantime, a developer can evaluate the model on synthesized data and get a clear idea of how it will perform on real data. If the developer instead waits for real data, acquisition can take weeks or even months, slowing the development and innovation of technology.
We are now ready to discuss how synthetic data helps solve data privacy issues.
Many industries depend on data generated by their customers for innovation and development, but this data contains personally identifiable information (PII), and privacy laws strictly regulate its processing. For example, the General Data Protection Regulation (GDPR) prohibits uses that were not explicitly consented to at the time the organization collected the data. Synthetic data, however, very closely resembles the underlying structure of real data while guaranteeing that no individual present in the real data can be re-identified from it. As a result, the processing and sharing of synthetic data is much less regulated, resulting in faster development, easier access to data and more innovation.
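One very basic sanity check behind the re-identification claim can be sketched as follows: verify that no synthetic record is an exact copy of a real record. This is only illustrative; the record values are invented, and real privacy evaluation uses much stronger tests (distance-to-closest-record analysis, membership-inference attacks, differential-privacy guarantees), not just exact-match filtering.

```python
# Hypothetical records, invented for illustration only.
real_records = [("alice", 34, "12 Oak St"), ("bob", 41, "7 Elm St")]
synthetic_records = [("cst_001", 35, "zone_a"), ("cst_002", 40, "zone_b")]

# Exact-match disclosure check: the synthetic set must not reproduce
# any row of the real table verbatim.
leaked = set(real_records) & set(synthetic_records)
assert not leaked, f"synthetic data reproduces real records: {leaked}"
print("no exact-match leakage found")
```

Passing this check alone does not make a dataset privacy-safe; it simply rules out the most blatant failure mode of a synthesizer that memorizes its training rows.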
Synthetic data has many significant advantages. It gives ML developers control over their experiments and increases development speed because data becomes more accessible. It promotes collaboration on a larger scale since the data is freely shareable. In addition, synthetic data protects the privacy of the individuals behind the real data.
Vineet Kumar is an intern consultant at MarktechPost. He is currently pursuing his BS from Indian Institute of Technology (IIT), Kanpur. He is a machine learning enthusiast. He is passionate about research and the latest advances in deep learning, computer vision and related fields.