Sarrah Darugar and Mustafa Rajkotwala

Abstract: The Digital Personal Data Protection Act, 2023 excludes publicly available personal data under Section 3(c)(ii) and conditionally exempts research processing under Section 17(2)(b). This article analyses their interaction in hybrid AI workflows, rejects blanket permission for unrestricted AI training, and calls for clearer statutory limits on large-scale use of publicly available personal data.
Introduction
The Digital Personal Data Protection Act, 2023 (“DPDP Act” or “the Act”) contains a quiet but consequential carve-out. Section 3(c)(ii) excludes from its scope any personal data that a data principal has made, or caused to be made, publicly available. On a plain reading, information such as public social media posts, parliamentary debates, court records, and other publicly accessible materials falls entirely outside the Act. Where this exclusion applies, the Act imposes no statutory obligations on processing, and the rights ordinarily available to data principals do not arise.
The difficulty posed by this provision does not lie in uncertainty about its legal effect. Section 3(c)(ii) is clear and unqualified in its operation. What the statute does not explain, however, is why publicly available personal data has been excluded from regulation or what policy objective this exclusion is intended to serve. While the Digital Personal Data Protection Rules operationalise several conditional exemptions under the Act, they do not meaningfully narrow or explain the public data exclusion in Section 3(c)(ii), leaving its implications for large-scale AI processing largely intact.
This silence has particular significance in an artificial intelligence (“AI”) driven environment. AI systems routinely collect publicly accessible data at scale, combine it across sources, and reuse it to generate outputs far removed from the context of original disclosure. These dynamics are particularly evident in the training of large language models and other foundation models, which rely on large-scale ingestion of publicly accessible data and are designed for persistent, cross-context deployment.
At the same time, the Act provides a separate, conditional exemption for research, archiving, and statistical processing under Section 17(2)(b), which relaxes obligations only where personal data is not used to take decisions about individuals and prescribed safeguards are followed.
This article argues that public availability alone cannot operate as a blanket permission for unrestricted AI training. Where AI workflows combine data that falls within Section 3(c)(ii) with personal data that does not, or where processing framed as research is later used in ways that affect decisions about individuals, Section 3(c)(ii) cannot be relied on for the entire pipeline. In such cases, the processing that remains within the Act’s scope must be assessed against the conditions of Section 17(2)(b).
I. An Express but Unexplained Public Data Exclusion
Section 3(c)(ii) excludes from the Act’s scope personal data that a data principal has made, or caused to be made, publicly available. In the absence of an express legislative rationale or internal limiting criteria, the provision must be interpreted using settled principles of statutory interpretation, including purposive and harmonious construction and a contextual reading of the Act as a whole. Its effect is clear: where it applies, the Act imposes no substantive or procedural limits on processing. The exclusion is not qualified by scale, purpose, duration, or downstream use, and the statute offers no explanation for why public availability is treated as sufficient to justify complete withdrawal from data protection regulation.
Three features of the Act underscore the significance of this silence.
First, the Act uses categorical threshold exclusions. Unlike regimes that regulate all personal data and adjust obligations through lawful bases and safeguards, the Act removes entire classes of personal data from regulation. Publicly available personal data is excluded in full, not subject to reduced or conditional duties.
Second, the provision leaves key questions unanswered. It does not clarify what it means to have “made or caused to be made” data publicly available, whether platform defaults reflect meaningful agency, how long public status persists, or how scraped, cached, indexed, or derivative datasets should be treated. Nor does it explain how public data should be assessed in workflows that mix excluded and non-excluded personal data.
Third, the exclusion shifts responsibility for downstream harms to other legal regimes that are poorly suited to address the systemic risks posed by large-scale aggregation, automated inference, and model-driven decision-making. While such harms may, in theory, be addressed through constitutional remedies, sector-specific regulation, intellectual property law, or general civil and criminal law, these tools offer fragmented and limited oversight in practice.
The resulting uncertainty is reflected in parliamentary responses. In August 2024, the Minister of State in the Ministry of Electronics and Information Technology (“MeitY”) stated in the Rajya Sabha that the scraping of publicly available user data remains subject to the Information Technology Act, the Information Technology Rules, and the DPDP Act, including consent and transparency obligations. This position sits uneasily with the plain text of Section 3(c)(ii) and underscores continuing uncertainty within the executive about the scope and effect of the public data exclusion.
The result is an unusual design outcome: the clearer the statutory exclusion, the less the data protection framework has to say at precisely the point where personal data is most easily aggregated, reused, and operationalised by AI systems.
Where AI-driven processing uses only personal data that a data principal has made, or caused to be made, publicly available, Section 3(c)(ii) places that processing outside the Act. In practice, many AI workflows are hybrid. Once the pipeline also involves personal data that does not fall within Section 3(c)(ii), the Act applies to the processing of that in-scope personal data, and obligations cannot be avoided by pointing to public availability at the input stage. In such cases, “re-entry” occurs not because the exclusion is displaced, but because its factual conditions are not met across the entire processing chain. For the purposes of this analysis, “hybrid workflows” refers to processing chains that involve both personal data excluded under Section 3(c)(ii) and personal data that is not so excluded, or that begin as non-decisional research but later move into decisional deployment. The Act does not provide an explicit framework for assessing such hybrid AI workflows; accordingly, this article uses “re-entry” as a descriptive concept to analyse when the factual conditions for exclusion or exemption cease to be satisfied, rather than as a settled doctrinal rule.
The Digital Personal Data Protection Rules (“the Rules”) do little to narrow or explain Section 3(c)(ii). They do not define “made publicly available,” add limiting criteria based on context or duration, or address platform defaults, scraping, caching, indexing, republication, or revocability. The Rules therefore leave intact the deregulatory effect of the exclusion.
By contrast, where the Act provides conditional exemptions, particularly Section 17(2)(b) for research, archiving, and statistical processing, the Rules reflect boundedness, safeguards, and oversight, including the prohibition on decision-specific use. Nothing in the Rules suggests that large-scale or persistent AI model training is presumptively covered by the research exemption. Read together, the Rules keep Section 3(c)(ii) intact for genuinely public-only processing, but ensure that the Act’s oversight applies where processing involves non-exempt personal data or exceeds the limits of exemptions such as Section 17(2)(b).
II. Research and Archiving Under Section 17(2)(b)
While Section 3(c)(ii) operates as a categorical exclusion, Section 17(2)(b) adopts a different regulatory technique. It provides a conditional exemption for research, archiving, and statistical processing, but keeps such processing within the Act’s framework and subject to limits on decisional use and compliance with prescribed safeguards. Accordingly, Section 17(2)(b) cannot be treated as a general safe harbour for AI-driven processing simply because a system is described as “research” or incorporates publicly available data. Unlike Section 3(c)(ii), Section 17(2)(b) applies only where the Act already governs the processing and does not bring public-only processing back within the Act.
The research exemption relaxes obligations only where personal data is not used to take decisions specific to a Data Principal and where prescribed safeguards are followed. It therefore assumes boundedness, proportionality, and restraint in downstream use, and does not function as a general authorisation for large-scale or persistent extraction of personal data into AI systems where these conditions are not met.
Large-scale AI training is often described as research, particularly in academic or exploratory contexts. That characterisation is not determinative under the Act. It is a legal claim that matters only where the Act applies and, even then, must satisfy the conditions in Section 17(2)(b), including the prohibition on decisional use and compliance with safeguards. In practice, many forms of AI training struggle to meet these requirements.
A. Why AI Training Fits Uneasily Within the Research Exemption
Traditional research and archiving are typically limited in scope, based on defined datasets, clear objectives, and relatively stable audiences. AI training operates differently. Large language models are trained on vast, heterogeneous datasets drawn from multiple sources and are designed for reuse across diverse applications. Once personal data is absorbed into a trained model, it cannot be meaningfully isolated or removed. The model continues to encode patterns derived from that data long after the original sources are no longer visible or correctable. In functional terms, AI training creates durable digital infrastructure rather than discrete or time-bound research outputs, challenging the assumptions underlying the research exemption.
B. How AI-Driven Processing Creates Harm
AI-driven processing introduces specific risks even when personal data use is framed as research. AI systems detach data from its original context and enable aggregation at unprecedented scale, revealing sensitive attributes such as behavioural patterns, socio-economic status, political preferences, or health-related inferences. These inferences may be inaccurate, biased, or discriminatory, yet can have real consequences for individuals who lack visibility into how they are generated and have limited ability to contest them. Such harms are also persistent, since trained models cannot easily be corrected or selectively amended. These risks arise regardless of whether the underlying data was publicly available and directly implicate the decisional and safeguard limits built into Section 17(2)(b).
C. Decisional Use and the Limits of the Research Exemption
Section 17(2)(b) draws a clear boundary: where personal data is used to take decisions specific to a Data Principal, the exemption no longer applies. This boundary must be interpreted functionally. AI systems that profile, score, rank, or recommend may not issue final determinations, but they materially shape outcomes that affect individuals’ rights and interests, especially where such decisional use is a reasonably foreseeable and intended outcome of the training process.
Where such decisional influence exists, reliance on the research exemption is no longer justified. The processing remains subject to the Act without the benefit of Section 17(2)(b), and obligations attach to any non-exempt personal data in the workflow, even if some inputs were publicly available or initially framed as research. In this limited sense, hybrid workflows and downstream deployment trigger “re-entry” because the exemption falls away and ordinary obligations apply to in-scope data.
D. Commercial AI Training and the Limits of the Research Exemption
Section 17(2)(b) does not distinguish between commercial and non-commercial entities. Its application turns on whether the statutory conditions for exemption are met. In principle, commercial entities may undertake research-oriented processing that qualifies for the exemption. In practice, however, large-scale AI training conducted for commercial deployment often sits uneasily within this framework, particularly where systems are designed from the outset for reuse, productisation, and cross-context deployment.
The exemption does not apply where personal data is used to take decisions specific to a Data Principal. While the training phase may be characterised as research, commercially deployed models are intended to generate outputs that rank, recommend, profile, or otherwise shape outcomes affecting individuals. Once integrated into products or services, the processing pipeline can no longer be treated as non-decisional. Continued reliance on the research exemption then fails, not because the processing is commercial, but because the statutory conditions are no longer satisfied.
Commercial AI training therefore frequently strains the bounded and purpose-specific assumptions of Section 17(2)(b). Where non-exempt personal data is involved and the processing produces scale, persistence, or downstream impact on individuals, reliance on the research exemption is no longer plausible, and the in-scope data must be assessed under the Act’s ordinary obligations.
III. Constitutional Proportionality and Public Data
The Supreme Court’s decision in Justice K.S. Puttaswamy (Retd.) v. Union of India affirms that privacy interests do not disappear merely because information is accessible in public. Public availability may reduce expectations of confidentiality, but it does not eliminate constitutional protection against arbitrary, discriminatory, or disproportionate use of personal data. At the same time, constitutional proportionality is not a free-standing licence to disregard a clear statutory exclusion. Unless Section 3(c)(ii) is read down or struck down on constitutional grounds, courts and regulators must apply it as enacted, while interpreting its scope narrowly to avoid disproportionate outcomes.
Read in this light, Section 3(c)(ii) cannot be treated as a self-sufficient policy justification for large-scale AI reuse of public data. Although the provision excludes certain processing at the statutory threshold, constitutional scrutiny continues to apply where public data is aggregated, repurposed, or used to produce outcomes that affect individuals. The harms identified in Puttaswamy arise not from exposure alone, but from how information is processed, combined, and operationalised. Even where the Act does not apply to public-only processing, constitutional limits remain relevant to state action and to the design of future legislative and regulatory frameworks governing large-scale data use.
IV. Comparative Regulatory Models
Looking across jurisdictions highlights what is distinctive about India’s approach. Many data protection regimes do not treat publicly available personal data as automatically outside regulation. Instead, public access is treated as one factor in assessing risk and determining what safeguards are required.
European Union: Under the General Data Protection Regulation, 2016 (“GDPR”), there is no blanket exclusion for publicly available personal data. Controllers must still identify a lawful basis for processing, comply with core principles such as purpose limitation and data minimisation, and provide transparency, even where the data comes from public sources. Public availability may influence expectations or the balancing exercise under legitimate interests, but it does not remove the data from the GDPR’s scope. As a result, large-scale scraping or reuse of public data can still trigger accountability duties, and high-risk uses may be restricted through sectoral rules and enforcement, particularly where profiling or automated decision-making is involved.
United States: The United States follows a more permissive but fragmented model. In the absence of a comprehensive federal data protection law, reuse of publicly accessible information is often treated as broadly lawful in private-to-private contexts. Constraints instead arise from specific legal pockets, including sectoral privacy laws, consumer protection enforcement against unfair or deceptive practices, tort and contract law, intellectual property and computer misuse rules, and constitutional limits on state action. This makes scraping and repurposing public data easier than in the EU, but oversight is uneven and unpredictable, and individual protections are limited and fragmented. In practice, public data often becomes low-friction input for model training and profiling, with patchy regulatory control.
Other jurisdictions: Several systems chart a “middle path,” neither excluding public data outright nor treating it as freely usable. Instead, they apply combinations of accountability duties, reasonableness standards, and purpose- or risk-based limits.
United Kingdom: Under the UK GDPR, personal data does not lose protection merely because it is publicly available. Regulatory guidance on web scraping and AI focuses on lawful basis, transparency, and fairness, and in particular on whether the later use of the data is within individuals’ reasonable expectations, not just on whether the data was easy to access.
Singapore: Singapore’s Personal Data Protection Act, 2012 does not exclude publicly available personal data from regulation. Public availability operates only as an exception to consent, defined as data that is generally available to the public or observable by reasonably expected means at a public place or event. Even where consent is not required, organisations remain subject to reasonableness and other core obligations, particularly for large-scale or high-impact uses.
Canada: Canada’s privacy laws keep publicly available personal information within the privacy framework, permitting collection, use, or disclosure without consent only for specific categories defined by regulation and generally only where the processing relates directly to the purpose for which the information was made public.
India’s position is structurally distinct. The Act excludes publicly available personal data at the threshold under Section 3(c)(ii), placing such processing entirely outside the Act once the exclusion applies. At the same time, it retains a conditional, safeguard-based model for public interest processing through Section 17(2)(b), which operates only where the Act already applies and relaxes obligations for research, archiving, and statistical purposes subject to clear limits, including the bar on decision-specific use.
The coherence of India’s framework therefore turns on how “made publicly available” is interpreted and on how regulators address large-scale AI reuse that alters the context and impact of public data. Without clear definitions and principled guidance on hybrid workflows, the model risks drifting toward a permissive approach to public-data processing by default, even as global practice increasingly treats the reuse of public data as a question of accountability and downstream effects rather than access alone.
V. Policy Implications and Regulatory Direction
To prevent Section 3(c)(ii) from becoming a backdoor to deregulation, the Data Protection Board of India (“Board”) should clearly define what qualifies as personal data that has been “made publicly available.” This guidance should focus on whether data was made public through a person’s meaningful choice, rather than through platform defaults or mere technical visibility, and should address the treatment of scraped, cached, indexed, or republished data so that temporary or unintended exposure does not result in permanent loss of protection. While the Board cannot extend the Act to processing that Parliament has placed outside its scope, it can narrow and clarify the category of public data and explain when hybrid workflows and research-framed processing remain within the Act.
It is also relevant to note that MeitY has issued non-binding guidance on the development and deployment of artificial intelligence systems. While such AI guidelines may encourage responsible practices, they do not alter the scope or operation of the DPDP Act. In particular, soft-law instruments cannot narrow or expand the statutory exclusion under Section 3(c)(ii), nor can they substitute for the conditional safeguards embedded in Section 17(2)(b). To the extent that large-scale AI training raises risks linked to aggregation, inference, or downstream decisional use, these concerns must be addressed through the Act’s interpretive framework and, where necessary, legislative amendment, rather than through advisory guidance alone. Ultimately, however, any move to impose substantive limits on AI training that relies exclusively on publicly available personal data would require legislative intervention, since the Board cannot rewrite or override the categorical exclusion enacted in Section 3(c)(ii).
Regulation of AI training that uses public data should turn on risk and real-world impact, not solely on public accessibility at the input stage. Where AI systems collect public data at scale, combine it across sources, or generate inferences that affect individuals, reliance on Section 3(c)(ii) cannot be assumed for the entire pipeline, particularly where non-exempt personal data is processed or the system is used to take decisions about individuals. In such cases, entities should not be able to shelter behind a research characterisation under Section 17(2)(b) and must comply with the Act’s ordinary obligations for in-scope data.
Finally, the Board should make clear that labelling AI training as “research” does not, by itself, limit regulatory responsibility. Where AI systems are persistent, widely deployed, or influence outcomes for individuals, regulatory focus should rest on the risks created by the processing, while encouraging practical safeguards such as data minimisation, limits on reuse, de-identification where possible, avoidance of high-risk public sources, and the use of synthetic data. These measures support innovation without treating publicly available personal data as a free and consequence-free resource.
Conclusion
Section 3(c)(ii) of the Act is clear in what it does but silent on why it does it. By excluding publicly available personal data at the threshold, the Act removes data protection safeguards without explaining why public availability should justify that result or how the exclusion is meant to be limited. The effect is unusual: the more widely personal data is exposed, the less the data protection law has to say about how it may be used.
This problem is most visible in AI-driven processing. Once Section 3(c)(ii) applies, the Act places no limits on the scale of data collection, aggregation, inference, or downstream use. Any constraints must then come from other legal regimes that are poorly suited to address the systemic risks created by automated systems. Treating publicly available personal data as free input for AI training blurs the line between access and use and weakens the safeguards that the Act applies in other contexts, particularly under Section 17(2)(b).
A sensible application of the Act therefore requires pushing back against the idea that public availability equals deregulation. Public data should not be treated as free or consequence-free input for AI systems. Where AI workflows involve non-exempt personal data, or move from non-decisional to decisional use, they remain subject to the Act because the statutory conditions for exclusion or exemption are not met. Within the limits of the statute, the role of the Data Protection Board of India is to clarify these limits through interpretive guidance on what counts as “made publicly available,” how hybrid workflows are treated, and when research exemptions apply, while recognising that stricter controls on AI systems trained only on public data will ultimately require legislative action.
Sarrah Darugar is a final-year student at Bharati Vidyapeeth University’s New Law College, Pune, with an interest in litigation and technology policy-related matters.
Mustafa Rajkotwala is a lawyer based in Mumbai, India. He graduated from NALSAR University of Law, Hyderabad, and advises on regulatory and policy-related matters.
