Data is the new boss. You're not just a number anymore; you're a line of code. Even our morning refreshments now come from data companies that happen to sell coffee. Clever as that is, all that data must be accurate, complete, consistent, and secure. For the large-scale enterprise, data integrity testing is how those principles get validated.
Testing evaluates your data's quality and accessibility and surfaces potential issues long before they become real problems.
Real-World Versus Synthetic Data
Thanks to the GDPR, any use of data must also respect privacy rights. For testing, "real-world" data pulled from production can be used, as can other forms. However, that practice is falling out of favor as compliance standards make using real-world data increasingly burdensome.
What’s an alternative? Generating synthetic data that mimics “real” data is one, provided the test data represents the real thing.
Gartner points to synthetic data as one of the top trends in the future of Data Science and Machine Learning (DSML). Last month, at its Data & Analytics Summit in Sydney, Gartner included the following statement about data-centric AI in a press release.
“The use of generative AI to create synthetic data is one area that is rapidly growing, relieving the burden of obtaining real-world data so machine learning models can be trained effectively. By 2024, Gartner predicts 60 percent of data for AI will be synthetic to simulate reality [and] future scenarios.”
Gartner adds that a mere one percent of data testing was synthetic in 2021.
How Is Synthetic Data Created?
In January, an MIT Sloan article by Brian Eastwood discussed the merits of synthetic data. He notes that synthetic data has the same mathematical properties as the real-world data set it replaces, without containing the same information.
According to Eastwood's research, synthetic data is generated by taking a relational database, creating a generative machine learning model for it, and using that model to generate a second set of data.
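The idea can be sketched in miniature. Below, a simple multivariate Gaussian stands in for the generative model (real systems use richer models such as copulas or GANs); the column names and numbers are illustrative assumptions, not from the article.

```python
import numpy as np

def synthesize(real: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Fit a multivariate Gaussian to numeric tabular data and sample
    new rows that share its means and correlations but contain none of
    the original records. A stand-in for a real generative model."""
    rng = np.random.default_rng(seed)
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Hypothetical "real" table: two correlated columns (say, age and spend).
rng = np.random.default_rng(42)
age = rng.normal(40, 10, 1000)
spend = 2 * age + rng.normal(0, 5, 1000)
real = np.column_stack([age, spend])

fake = synthesize(real, 1000)
# The synthetic table preserves the statistical shape of the original
# (compare the column correlations) while sharing no actual rows.
print(np.corrcoef(real.T)[0, 1], np.corrcoef(fake.T)[0, 1])
```

The point is that downstream tests and models see the same statistical structure either way, so the synthetic copy can stand in for production data.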
Kalyan Veeramachaneni, principal research scientist with MIT’s Schwarzman College of Computing, adds an analogy.
“You can take a phone number and break it down. When you resynthesize it, you’re generating a completely random number that doesn’t exist,” he said. “But you can make sure it still has the properties you need, such as exactly ten digits or even a specific area code.”
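Veeramachaneni's phone-number analogy is easy to make concrete. This is a minimal sketch, not his actual tooling; the function name and defaults are assumptions.

```python
import random
from typing import Optional

def resynthesize_phone(area_code: Optional[str] = None,
                       seed: Optional[int] = None) -> str:
    """Generate a completely random ten-digit phone number that keeps
    the structural properties of a real one (length, and optionally a
    specific area code) while corresponding to no actual record."""
    rng = random.Random(seed)
    prefix = area_code if area_code else str(rng.randrange(200, 1000))
    rest = "".join(str(rng.randrange(10)) for _ in range(7))
    return prefix + rest

number = resynthesize_phone(area_code="617", seed=1)
print(number)  # always ten digits, always starting with 617
```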
Doubling Down On Data Accuracy and Compliance
The LMS data integrity team uses synthetic testing to ensure accuracy for its global partner in the travel and leisure industry. They run synthetic tests daily, months ahead of guest bookings, so potential issues can be addressed as many as six months before the booking date.
MIT researchers showed that data sets can be generalized, with personal info such as credit card, bank account, birth dates, and ID numbers removed as required by GDPR privacy compliance.
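In practice, generalization means dropping direct identifiers and coarsening quasi-identifiers before any data leaves production. A minimal sketch, assuming a hypothetical record layout (the field names are illustrative, not from the MIT work):

```python
from datetime import date

# Hypothetical guest record; field names are assumptions for illustration.
record = {
    "customer_id": "C-1042",
    "credit_card": "4111111111111111",
    "bank_account": "DE89370400440532013000",
    "birth_date": date(1985, 6, 14),
    "booking_total": 412.50,
}

PII_FIELDS = {"customer_id", "credit_card", "bank_account"}

def generalize(rec: dict) -> dict:
    """Drop direct identifiers and generalize quasi-identifiers:
    here, a full birth date is coarsened to a birth year, in the
    spirit of GDPR data minimization."""
    out = {k: v for k, v in rec.items() if k not in PII_FIELDS}
    if "birth_date" in out:
        out["birth_year"] = out.pop("birth_date").year
    return out

safe = generalize(record)
print(safe)  # {'booking_total': 412.5, 'birth_year': 1985}
```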
Additionally, they advise encryption of the synthetic data as a measure of doubling down on compliance. Massive fines for violating compliance have many in the data science world justifiably cautious.
If You Think You Are, You're Not
Under the Accountability heading on its home page, the GDPR offers the following advice.
“If you think you are compliant with the GDPR but can’t show how, then you’re not GDPR compliant.”
Veeramachaneni offered one more analogy to bring the spirit of data integrity to light.
“Data today is treated like the computer lab of yesteryear: Access is restricted – and so are opportunities for college students, professional developers, and data scientists to test new ideas. With far fewer necessary limitations on who can use it, synthetic data can provide these opportunities.”
Data Integrity Characteristics
- Completeness: To what degree is the data fully available in the database?
- Accuracy: Is the data in the right form, and is it correct and true?
- Consistency: Consistency of data can be low level (i.e., customer contact info is formatted in the same way) or high level (different groups are using the same dataset).
- Timeliness: How near to real-time is data being collected? Old data is often not useful.
- Compliance: Does the data meet compliance standards, such as data privacy and other regulations?
Courtesy of Talend, Inc.
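Most of these characteristics can be checked automatically. The sketch below is one illustrative way to score a record against four of them (compliance is largely a policy question and is omitted); the field names, regex, and thresholds are assumptions, not part of Talend's definitions.

```python
import re
from datetime import datetime, timedelta, timezone

def check_record(rec: dict, required: set, max_age: timedelta) -> dict:
    results = {}
    # Completeness: every required field is present and non-empty.
    results["complete"] = all(rec.get(f) not in (None, "") for f in required)
    # Accuracy: the email field has the expected form.
    results["accurate"] = bool(
        re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", rec.get("email", ""))
    )
    # Consistency: phone numbers stored in one canonical format (digits only).
    results["consistent"] = rec.get("phone", "").isdigit()
    # Timeliness: the record was collected recently enough to be useful.
    results["timely"] = datetime.now(timezone.utc) - rec["collected_at"] <= max_age
    return results

rec = {
    "email": "guest@example.com",
    "phone": "6175551234",
    "collected_at": datetime.now(timezone.utc) - timedelta(days=2),
}
report = check_record(rec, required={"email", "phone"},
                      max_age=timedelta(days=30))
print(report)
```

Running checks like these on every load, as the daily synthetic tests described above do, is what turns the list of characteristics into an enforceable standard.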