Legislation and Government Policy

Building Artificial Intelligence with ‘Artificial’ Data: Fake it Until You Make it? (Part I)


Dr. Deborshi Barat*


While artificial intelligence (AI) relies on the availability of vast datasets, the legality of using copyrighted material and/or protected personal information is unclear. Accordingly, this note explores the possibility of reconciling the social need to train AI models with protectionist regimes (involving intellectual property or privacy laws) by using ‘artificially’ generated information. This is Part I of a two-part article.

I.        Background

The success of artificial intelligence (“AI”) systems and machine learning (“ML”) models is ultimately reliant on data. Such data must be both easily and abundantly available. The recent rise in AI/ML adoption has been driven, in large part, by (i) an exponential rise in hitherto-limited computational power; as well as (ii) an unprecedented surge in data access via the internet and digital platforms, accompanied by a growing array of socioeconomic processes and institutions that now operate through the prolific use of stored and/or networked information.

However, the legality of using copyrighted material for training AI/ML algorithms is still unclear. Further, existing intellectual property laws may be ill-equipped to address AI-generated creations and works – even if these stem from, or arise pursuant to the intervention of, human prompts. Nevertheless, future AI regulations should, at a minimum, protect individuals against instances of potential harm. Such harms may involve breaches of personal data protection law (e.g., India’s Digital Personal Data Protection Act, 2023 (“DPDP Act”)) and related privacy infringements. In addition, second-order harms may include violations of intellectual property rights (“IPRs”).

In light of the above, this note explores the possibility of reconciling (1) the developmental needs of AI/ML models – which remain reliant on vast datasets for continued growth – with (2) protectionist legal regimes that seek to secure individual and commercial rights related to data. To that end, based on recent trends in India, this note suggests the use of ‘artificially’ generated information for training AI models. After discussing the conceptual and technical underpinnings of such data, along with its potential use-cases and its advantages relative to ‘real’ data, the note reaches a cautious conclusion – one that accounts for the various risks, both known and unknown, that attend data deployment in general.

The IndiaAI Program

On October 13, 2023, the seven expert groups that were constituted by India’s Ministry of Electronics and Information Technology (“MeitY”) for the purpose of deliberating upon the core goals and design of a national AI program (“IndiaAI”) submitted the first edition of their formal report (the “AI Report”). IndiaAI aims to address certain perceived deficiencies in the country’s AI ecosystem with respect to a set of focus areas (“pillars”). These areas include the following: computing infrastructure; data; AI financing; research and innovation; skilling; and the institutional capacity for data. 

Among various other components, a key pillar of the IndiaAI program is the India Datasets Platform (“IDP”). The IDP is a large collection of anonymized datasets to be used by Indian researchers for the purpose of training multi-parameter models.

The AI Report

The AI Report is intended to serve as a roadmap for the development of India’s AI ecosystem, including in respect of its intersection with: (i) governance (e.g., the application of AI technologies in government processes, as well as improved decision-making, efficiency and transparency); (ii) intellectual property; (iii) hardware and software infrastructure related to computation (e.g., the ‘India AI Computer Platform’, a public-private partnership projected to create substantial capacity for graphics processing units (“GPUs”) for start-ups and researchers); along with (iv) ethics (e.g., the responsible deployment of AI systems to ensure fairness, accountability and social benefits).

In addition, the AI Report describes certain operational aspects with regard to establishing centers of excellence, as well as an institutional framework for the purpose of governing the collection, management, processing and storage of data by the National Data Management Office (“NDMO”).

IDP

The IDP aims to leverage data to fuel the development and capabilities of AI in the country, enabling better insights, superior predictions and more intelligent decision-making. Accordingly, it seeks to provide a foundation for dataset sharing, analysis, collaboration and monetization among both dataset providers and consumers, including for the purpose of contributing towards the growth of India’s AI ecosystem. Built on an open-source architecture, the IDP is – at its core – a unified and interoperable national exchange platform for stakeholders to upload, browse through, and consume datasets, metadata, user-created data artefacts and application programming interfaces (“APIs”) in a safe and standardized way.

Nevertheless, a well-defined legal and regulatory framework needs to be established for the purpose of governing the operation of the IDP, as well as to ensure compliance with data privacy laws and information security along with IPRs – even while allowing for flexibility and future adaptation.

Referring to a survey conducted by McKinsey & Company in December 2021 (which had shown that almost three out of every five organizations were using AI in at least one business function), the AI Report also explores the need to identify how AI can serve businesses to foster better integration – including through possible use-cases across marketing, sales, customer services, security, data, technology and other processes.

Along with descriptions of such domain-specific use-cases, the AI Report states that:

“Computers can artificially create synthetic data to perform certain operations. The synthetic data is usually used to test new products and tools, validate models, and satisfy AI needs. Companies can simulate not yet encountered conditions and take precautions accordingly with the help of synthetic data. They also overcome the privacy limitations as it doesn’t expose any real data. Thus, synthetic data is a smart AI solution for companies to simulate future events and consider future possibilities…” [emphasis added]

With regard to the excerpt referenced above, the next section will examine the conceptual and technical underpinnings of ‘synthetic data.’

II.        Synthetic Data

At the core of AI’s transformative potential lies the quality of data used for training AI models. Traditional approaches rely on large datasets. However, this approach is often fraught with challenges related to privacy, accessibility and bias.

In this context, synthetic data represents an innovative solution. Synthetic data is generated artificially by computer simulations or algorithms. Accordingly, such datasets can be designed to mimic real-world scenarios – even augmenting or replacing data collected from the physical world – but without containing actual, identifiable information. Various techniques – including statistical modelling, generative adversarial networks (“GANs”) and rule-based algorithms – are employed to create synthetic datasets.

Generating Synthetic Data

There are various methods for generating synthetic data – including those based on transforming collected data.

Method 1

The idea behind this particular method is to compute the principal statistical characteristics of the original dataset and create a synthetic one in its stead, with similar characteristics. It involves four main steps.

Step 1

The first is data preparation: i.e., cleaning the collected data to remove errors, ensuring that all fields in the dataset use consistent coding schemes, and confirming that data from multiple sources is mapped to a consistent set of data types.
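As an illustration, a minimal data-preparation sketch might look like the following. It assumes a tabular dataset loaded with pandas; the column names (“gender”, “income”, “age”) and the cleaning rules are purely hypothetical.

```python
# Hypothetical data-preparation step (Step 1): clean errors, harmonize
# coding schemes, and cast fields to consistent data types.
import pandas as pd

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    # Drop incomplete records and obviously erroneous ones (e.g., negative incomes).
    df = df.dropna().copy()
    df = df[df["income"] >= 0]

    # Harmonize inconsistent coding schemes across sources
    # (e.g., "M" / "Male " / "male" -> "male").
    df["gender"] = (
        df["gender"].astype(str).str.strip().str.lower()
          .replace({"m": "male", "f": "female"})
    )

    # Map fields from multiple sources to consistent data types.
    df["age"] = df["age"].astype(int)
    df["income"] = df["income"].astype(float)
    return df
```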

Step 2

The second step involves developing a data generator to produce synthetic data based on manipulations of the collected data. The generator’s algorithm computes the metrics for the collected data and then sets the parameters that will be used to generate synthetic data. To maintain logical consistency, certain characteristics of the original dataset – such as valid value ranges and the relationships between fields – may need to be preserved as constraints.
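A highly simplified generator along these lines is sketched below. It treats per-column means and the covariance matrix as the “metrics” of the collected (numeric) data and samples synthetic records from a Gaussian model with those parameters; the Gaussian model, and the clipping of values to observed ranges as a logical-consistency constraint, are illustrative assumptions rather than a prescribed method.

```python
# Hypothetical generator (Step 2): fit summary statistics of the collected
# numeric data, then sample synthetic records with similar characteristics.
import numpy as np
import pandas as pd

def fit_parameters(real: pd.DataFrame) -> dict:
    return {
        "columns": list(real.columns),
        "mean": real.mean().to_numpy(),   # per-column means
        "cov": real.cov().to_numpy(),     # covariance between columns
        "min": real.min().to_numpy(),     # observed lower bounds
        "max": real.max().to_numpy(),     # observed upper bounds
    }

def generate(params: dict, n_rows: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(params["mean"], params["cov"], size=n_rows)
    # Logical-consistency constraint: keep values within the observed ranges.
    samples = np.clip(samples, params["min"], params["max"])
    return pd.DataFrame(samples, columns=params["columns"])
```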

Step 3

The third step is computing, for the synthetic data, the same metrics that were computed for the collected data.

Step 4

In the last step, the metrics of the collected and the synthetic data, respectively, are compared using a discriminator. This step assesses the utility of the synthetic dataset by determining whether its statistical properties are similar to those of the original set. If the comparison concludes that the synthetic data is too different from the collected data, the generation parameters are adjusted and new synthetic data is generated. The process iterates until acceptable synthetic data is produced. Such utility comparisons can be formalized using similarity metrics that are repeatable and automated. Where privacy concerns arise, a privacy assurance assessment can be added to ensure that privacy risks remain below a certain benchmark. 
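Continuing the hypothetical sketches above, Steps 3 and 4 can be expressed as a simple, automated comparison-and-regeneration loop. The similarity test (relative differences in means and standard deviations against a 5% tolerance) and the iteration budget are arbitrary illustrative choices; a real discriminator and privacy assurance assessment would be more sophisticated.

```python
# Hypothetical metric comparison (Steps 3 and 4): compute the same metrics
# for the synthetic data, compare them with the collected data's metrics,
# and regenerate with adjusted parameters until the result is acceptable.
import numpy as np

def metrics(df):
    return {"mean": df.mean().to_numpy(), "std": df.std().to_numpy()}

def is_similar(real_m, synth_m, tolerance=0.05):
    # A repeatable, automated utility check on means and standard deviations.
    for key in ("mean", "std"):
        rel_diff = np.abs(real_m[key] - synth_m[key]) / (np.abs(real_m[key]) + 1e-9)
        if np.any(rel_diff > tolerance):
            return False
    return True

def generate_until_acceptable(real, max_iterations=20):
    params = fit_parameters(real)                   # Step 2 (see sketch above)
    real_m = metrics(real)
    for seed in range(max_iterations):
        synthetic = generate(params, n_rows=len(real), seed=seed)
        if is_similar(real_m, metrics(synthetic)):  # Steps 3 and 4
            return synthetic
        # Otherwise, adjust the generation parameters (here, only the random
        # seed) and produce a new synthetic dataset.
    raise RuntimeError("No acceptable synthetic dataset within the iteration budget")
```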

Method 2

An alternative group of generation methods reduces the amount or quality of the collected data needed to achieve a given result, or changes its features. For example, GANs employ two neural networks (i.e., deep learning algorithms) by pitting them against each other in an adversarial fashion in a zero-sum game. The first network, the generator, produces synthetic data without directly using the collected data. The generated data is then sent to the second neural network – the discriminator – which is trained on collected data. The discriminator compares the synthetic data with the collected data – creating a propensity score – and determines which parts of the data give away the fact that it is artificial. The result is then fed back to the generator. A good synthetic model is created when the discriminator is unable to distinguish between the collected and synthetic datasets.
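The adversarial loop can be illustrated with the following compact sketch in PyTorch. It is a toy example rather than a production synthesizer: the “collected” dataset is a random stand-in, and the network sizes, learning rates and training schedule are arbitrary assumptions.

```python
# Toy GAN for tabular data: a generator and a discriminator trained
# against each other in an adversarial, zero-sum fashion.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_features, noise_dim = 4, 8
real_data = torch.randn(1024, n_features)  # stand-in for a collected dataset

# Generator: produces synthetic records from random noise,
# without directly using the collected data.
generator = nn.Sequential(
    nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, n_features)
)

# Discriminator: scores how likely each record is to be real (a propensity score).
discriminator = nn.Sequential(
    nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid()
)

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2000):
    # Train the discriminator to tell collected data apart from synthetic data.
    real_batch = real_data[torch.randint(0, len(real_data), (64,))]
    fake_batch = generator(torch.randn(64, noise_dim)).detach()
    d_loss = (loss_fn(discriminator(real_batch), torch.ones(64, 1))
              + loss_fn(discriminator(fake_batch), torch.zeros(64, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Train the generator to fool the discriminator (the adversarial feedback).
    fake_batch = generator(torch.randn(64, noise_dim))
    g_loss = loss_fn(discriminator(fake_batch), torch.ones(64, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# Training aims for a point where the discriminator can no longer reliably
# distinguish collected records from synthetic ones (scores near 0.5).
synthetic = generator(torch.randn(1024, noise_dim)).detach()
```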

Method 3

Furthermore, some types of synthetic data can be generated without using collected data directly. This is done through a data simulator. The simulator generates synthetic data based on a set of rules which determine the relationships between relevant data attributes. Such simulators have become an important tool for training and testing ML algorithms.
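A minimal rule-based simulator might look like the sketch below. The attributes and the rules linking them (for example, income rising with age, and a credit limit fixed as a fraction of income) are purely hypothetical.

```python
# Hypothetical rule-based simulator (Method 3): records are generated from
# declared rules about attributes and their relationships, with no
# collected data involved.
import random

def simulate_customer(rng: random.Random) -> dict:
    age = rng.randint(18, 80)
    # Rule: income increases, on average, with age.
    income = rng.gauss(mu=30_000 + 1_000 * (age - 18), sigma=10_000)
    # Rule: credit limit is a fixed fraction of income and never negative.
    credit_limit = max(0.0, 0.2 * income)
    return {"age": age, "income": round(income, 2),
            "credit_limit": round(credit_limit, 2)}

rng = random.Random(42)
dataset = [simulate_customer(rng) for _ in range(1_000)]
```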

III.       Advantages of Synthetic Data

Diversifying Training Datasets

One primary advantage of synthetic data is its ability to create diverse datasets. Such generated data can represent various demographic factors, geographic locations, and contextual nuances – providing a more comprehensive training environment. In many cases, real-world data may be limited or lack diversity, hindering the ability of AI models to generalize well to new situations. Synthetic data provides a solution by introducing a broader range of scenarios.

Relatedly, synthetic data serves as a valuable tool for augmenting existing datasets, making AI models more robust. By introducing variations in the training data, models become more adaptable to real-world complexities. Furthermore, optimal results may even require the use of such synthetic data that does not reflect real-world conditions. For example, a dataset used to train autonomous vehicles may stimulate more effective learning if it contains an unrealistically high number of risky situations.

Overcoming Data Scarcity

In scenarios where obtaining sufficient real-world data is challenging or expensive, synthetic data serves as a valuable resource. This aspect is particularly beneficial when data collection is resource-intensive or time-consuming: e.g., in industries like healthcare – where sensitive patient data is limited, or in emerging technologies like autonomous vehicles – where vast amounts of diverse training data are required. In such situations, synthetic data fills the gap, accelerating the deployment of AI models.

Privacy Preservation

One of the critical advantages of synthetic data is its inherent ability to preserve privacy – given current concerns over data protection laws and stringent compliance requirements under overarching regimes like the EU’s General Data Protection Regulation (“GDPR”) or India’s DPDP Act. Because it does not reproduce real personal information – including sensitive data in contexts like healthcare and finance – synthetic data substantially reduces the risk of exposure, providing a privacy-compliant alternative without sacrificing the efficacy of the AI model.

Cost-Efficiency and Accessibility

Generating synthetic data can be more cost-effective and efficient than collecting and managing large volumes of real-world data. This is particularly significant for smaller companies or start-ups with budget constraints. Consequently, synthetic data reduces the barriers to entry for those organizations that seek to leverage AI but lack access to extensive and expensive datasets.

Synthetic data also reduces the resources needed to prepare raw data for analysis – i.e., cleaning, labelling and organizing it. In particular, manual labelling is often costly, time-consuming, and error-prone. By labelling and organizing the data automatically during the generation process, synthetic data combines data collection and preparation – creating data that is fit for purpose. This is especially important for ML algorithms, where the scale of datasets can reach millions of data points.
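The point about automatic labelling can be illustrated with a small hypothetical sketch: because the generator knows which rule produced each record, the label is attached at the moment of creation, and no separate annotation step is required. The fraud rule and amounts below are illustrative assumptions.

```python
# Hypothetical labelled-data generation: each record carries its label
# from the moment it is created, so no manual annotation is needed.
import random

def generate_labelled_transaction(rng: random.Random) -> dict:
    is_fraud = rng.random() < 0.05  # rule: 5% of generated records are fraudulent
    amount = rng.uniform(5_000, 50_000) if is_fraud else rng.uniform(10, 2_000)
    return {"amount": round(amount, 2), "label": int(is_fraud)}

rng = random.Random(7)
labelled_data = [generate_labelled_transaction(rng) for _ in range(10_000)]
```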

Synthetic data can also reduce storage costs. After all, if a synthetic dataset can be easily recreated, its user does not need to store data for future use. Further, the generation of synthetic data for a specific purpose reduces the amount of redundant information that might otherwise be included in the dataset.

Addressing Bias and Ethical Concerns

Bias in AI models has been a longstanding concern, often reflecting the biases present in the training data. Synthetic data presents an opportunity to mitigate this issue by allowing careful control over the characteristics of the generated datasets. By consciously designing synthetic data to be diverse and unbiased, developers can contribute to more ethical AI systems. For instance, the use of synthetic data can facilitate the creation of more representative datasets that include rare events and extreme scenarios, thereby reducing bias in predictive models.

Customization and Control

Synthetic data allows for precise customization of datasets, providing control over the characteristics and features of the generated data. This level of control enables tailored training scenarios, aligning with specific use cases. In addition, ML algorithms can be trained on synthetic data to increase their accuracy before using them in the real world. However, certain concerns remain even in respect of synthetic data deployment. The next section explores some such concerns.


*Deborshi Barat is a Counsel at S&R Associates, New Delhi. His areas of practice include regulatory and policy matters. Previously, he was an Associate Professor at the Jindal Global Law School. He holds a Ph.D. from The Fletcher School of Law and Diplomacy, Tufts University.

Read Part II here: https://lawschoolpolicyreview.com/2023/12/17/building-artificial-intelligence-with-artificial-data-fake-it-until-you-make-it-part-ii/