Ninth Circuit Weighs DMCA Section 1202(b) in AI Training and Copilot Case

The Ninth Circuit’s decision to take interlocutory review of a narrow but consequential question under Section 1202(b) of the Digital Millennium Copyright Act — whether plaintiffs must plead an identicality link between protected works and generative‑AI outputs to state a DMCA claim against Microsoft and OpenAI — has quietly become one of the most important test cases for how existing copyright law will be applied to modern machine‑learning systems.

Background​

The dispute arises from a consolidated action brought by software developers against GitHub, Microsoft and affiliated AI developers who built the Codex family of models and GitHub Copilot. Plaintiffs allege their publicly posted code was used to train these models and that the AI assistants can and sometimes do output code that reproduces or meaningfully replicates their copyrighted work. In addition to traditional copyright and contract claims, the plaintiffs asserted a cause of action under DMCA Section 1202(b) — the provision that forbids the intentional removal or alteration of copyright management information (CMI) when that removal is likely to induce, enable, facilitate, or conceal infringement. The district court dismissed the Section 1202(b) claim on plausibility grounds, reasoning the complaint failed to identify any concrete instance where a Copilot output was identical to a plaintiff’s code; the Ninth Circuit accepted interlocutory review of that precise legal question.
This appeal does more than decide one pleading issue. It asks whether a statute drafted for the era of digital media distribution can be stretched to address the displacement of metadata through mass automated ingestion for training models, and whether downstream, blended model outputs — which may not be byte‑for‑byte identical to any training example — can trigger remedies tied to the loss of attribution or licensing metadata.

What Section 1202(b) does and why it matters​

The text and traditional function of Section 1202(b)​

Section 1202(b) makes it unlawful to "intentionally remove or alter" CMI — author names, copyright notices, license terms and similar attribution data — when a defendant knows or has reasonable grounds to know that doing so will induce, enable, facilitate, or conceal infringement. Historically, courts have applied Section 1202 in situations where someone copies or redistributes digital media and strips away visible or embedded attribution, thereby making it easier to pass off works or to avoid license obligations. The statute functions as an anti‑tampering rule that protects information about provenance, not the underlying copyright itself.

The doctrinal tension with generative AI​

The late‑1990s drafting context for Section 1202(b) presumed a discrete, human‑readable copy that could be traced to a source. Modern machine learning upends that assumption: models ingest massive corpora, learn statistical patterns, and generate novel outputs that blend many inputs. The key doctrinal tension is whether the removal of CMI during ingestion (for example, by scraping public code that had license headers or author tags) is actionable when the alleged downstream harm is a generative output that is not literally a copy but may nonetheless replicate protectable expression or cause practical displacement. The Ninth Circuit’s framing of the identicality question gets to the heart of that tension.

The procedural posture and what the district court held​

Judge Jon S. Tigar dismissed the Section 1202(b) count for failing to plead a plausible causal connection between any removed CMI and an actual infringement: the complaint lacked a concrete example of an identical reproduction of a plaintiff’s code by Copilot, and the court viewed that absence as fatal to the DMCA theory as pleaded. Importantly, the court did not dismiss the entire case — copyright, breach‑of‑license and related claims remain alive, and the district court allowed discovery tied to those claims to proceed. The interlocutory appeal focuses narrowly on whether a rigid identicality requirement is necessary to maintain a Section 1202(b) claim in the context of model training.

The technical layer: memorization, mixing, and the mechanics of risk​

How models like Codex and Copilot are trained​

Codex‑class models are trained on vast corpora of public source code, indexed by crawlers and dataset curators. These models internalize statistical correlations over tokens and can, in certain circumstances, reconstruct sequences that were present in the training set. The phenomenon known as memorization is real: empirical research has shown that models can reproduce verbatim training data in response to particular prompts, especially when sequences are long, unique, or repeated in the training corpus.

When verbatim reproduction is likelier​

  • Short, highly repetitive code snippets are easier to reproduce verbatim.
  • Blocks that appear across many repositories — repeated boilerplate — increase exposure.
  • Prompts that contain exact fragments of training data can coax models into continuations that mirror original sequences.
  • Training pipelines that include stale caches or mirrored content can persist exposures for material that was later deleted or privatized.
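
To make the memorization concern concrete, one crude way to measure verbatim overlap between a model output and a known source snippet is token n‑gram containment. This is an illustrative sketch only, not how any vendor's suppression filters actually work, and the 8‑token window is an arbitrary choice:

```python
def ngram_overlap(output: str, source: str, n: int = 8) -> float:
    """Fraction of the source's token n-grams that appear verbatim in the output.

    A crude proxy for verbatim memorization; the window size of 8 tokens
    is an arbitrary illustrative choice, not a legal or engineering standard.
    """
    def ngrams(text: str) -> set:
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    source_grams = ngrams(source)
    if not source_grams:
        return 0.0  # source too short to form any n-gram
    return len(source_grams & ngrams(output)) / len(source_grams)
```

A score of 1.0 means every n‑gram of the source appears verbatim in the output; long, unique sequences that score high are exactly the cases the memorization research flags.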

Why "not identical" does not equal "not harmful"​

Even when a model output is not a byte‑for‑byte clone, it can be functionally equivalent, substantively similar, or practically interchangeable. For software, small differences in whitespace or variable names can mask substantial reproduction of logic and structure. Plaintiffs argue that the absence of CMI — license terms, author credit, or usage restrictions — is meaningful precisely because model outputs can be substantively similar and widely reused without attribution, thereby creating real economic and reputational injury. The district court, by contrast, emphasized the plaintiffs’ burden to show a concrete causal link tied to a specific reproduction.
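
The point about superficial differences can be illustrated with a short sketch: normalizing a Python snippet's identifiers and whitespace before comparison makes two variable‑renamed copies compare equal. This is a heuristic for illustration only, not a test for substantial similarity:

```python
import io
import keyword
import tokenize

def normalized_tokens(code: str) -> list:
    """Python token stream with identifiers mapped to positional placeholders,
    so variable renames and whitespace reflow do not change the result.

    Illustrative only: it ignores the many other ways code can be
    restructured while preserving the same logic.
    """
    names = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # Same placeholder for every occurrence of the same identifier.
            out.append(names.setdefault(tok.string, f"ID{len(names)}"))
        elif tok.type == tokenize.INDENT:
            out.append("<indent>")
        elif tok.type == tokenize.DEDENT:
            out.append("<dedent>")
        elif tok.type in (tokenize.NL, tokenize.COMMENT, tokenize.ENDMARKER):
            continue  # drop purely cosmetic tokens
        else:
            out.append(tok.string)
    return out
```

Under this normalization, `def add(a, b): return a + b` and the same function with every name changed produce identical token streams, which is precisely why "not identical" does not settle the question of reproduction.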

The plaintiffs’ theory: how the DMCA claim is supposed to work here​

Plaintiffs present a twofold theory:
  • Training on repositories that contained manifest CMI effectively decoupled that information from the content when ingested into training corpora; CMI was thus “removed or altered” in the technical sense required by Section 1202(b).
  • Model outputs — even when not literal clones — can reproduce copyrighted expression or be practically substituted for original code, meaning the absence of CMI caused or facilitated infringement in downstream uses.
Under that theory, plaintiffs seek remedial tools that are not purely about copying but about maintaining provenance and licensing continuity in an environment where automated reuse is the default. The theory is bolstered by third‑party empirical work and real‑world incidents showing that Copilot has, on occasion, exposed private or formerly public content — so the risk is not purely speculative.

The defense: concreteness, pleading rules, and policy arguments​

Microsoft and OpenAI have advanced three core defenses:
  • Lack of concreteness: The plaintiffs’ complaint fails the plausibility standard because it lacks any concrete, verifiable example of Copilot generating an identical copy of the plaintiffs’ works.
  • Technical rarity of memorization: Defendants point to engineering safeguards and empirical evidence that verbatim memorization is not typical model behavior in response to ordinary prompts.
  • Policy and innovation concerns: Imposing broad DMCA liability could chill development and deployment of useful tools built on public code and create unpredictable exposure for platforms that index and serve developer artifacts.
The companies asked the Ninth Circuit to affirm dismissal of the DMCA claim on these grounds, contending that permitting Section 1202(b) to cover blended, non‑identical outputs would be a doctrinal stretch better left to Congress than to the courts.

Possible Ninth Circuit outcomes and their likely ripple effects​

The appellate court’s ruling is likely to fall into one of three doctrinal shapes — and each has different industry consequences.

1. Narrow ruling (affirming the district court)​

If the Ninth Circuit requires a high degree of similarity — essentially an identicality threshold — for Section 1202(b) claims tied to model outputs, the result will limit DMCA exposure for model builders. That outcome would preserve plaintiffs’ other claims (copyright and license breach), but it would reduce the statutory tools available to creators seeking remedies for attribution loss in training pipelines. Industry reaction would likely be relief among platform operators and companies that rely on public code for model training.

2. Broad ruling (rejecting identicality)​

If the court accepts a broader reading of Section 1202(b), holding that CMI removal at the ingestion stage can be actionable even when downstream outputs are not literal clones, the effect would be transformative. Model builders would face increased legal exposure, which would accelerate licensing demand from rights holders, force tighter provenance and filtering mechanisms, and encourage contractual and engineering changes across ingestion pipelines. The decision could prompt rapid market adjustments — paid licensing regimes, provenance standards, and new compliance tooling.

3. Middle path (remand for fact development)​

A more pragmatic outcome is a remand for discovery. The panel could decline to announce a sweeping doctrinal rule and instead require the district court to develop the factual record about how models were trained, whether specific outputs can be tied to particular inputs, and what practical role CMI actually played (or did not play) in the pipeline. That middle approach would preserve the judicial role in fact development and reduce the risk of premature, law‑wide pronouncements. It would also keep discovery open on training corpora and filtering methods, a prospect Microsoft and OpenAI want to avoid because it risks exposing trade secrets.

Practical consequences for key stakeholders​

For developers and open‑source maintainers​

  • Assume permanence: Treat public posting as effectively irreversible with respect to indexing and training uses.
  • Embed machine‑readable CMI: Use consistent license headers and explicit attribution metadata that are easy to parse by ingestion pipelines.
  • Security hygiene: Never commit secrets or credentials to repositories; rotate keys promptly if they are accidentally exposed.
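
One widely used form of machine‑readable CMI is the SPDX convention, which places a `SPDX-License-Identifier:` comment near the top of each source file. A minimal, illustrative audit script for maintainers might flag files missing such a tag; the file extensions and ten‑line window here are arbitrary choices, not part of the SPDX standard:

```python
from pathlib import Path

SPDX_TAG = "SPDX-License-Identifier:"

def files_missing_spdx(repo_root: str,
                       suffixes=(".py", ".js", ".ts", ".go")) -> list:
    """Return source files whose first 10 lines carry no SPDX license tag.

    The extension list and line window are illustrative choices; adapt
    them to the repository's actual layout and languages.
    """
    missing = []
    for path in sorted(Path(repo_root).rglob("*")):
        if path.suffix not in suffixes or not path.is_file():
            continue
        head = path.read_text(errors="ignore").splitlines()[:10]
        if not any(SPDX_TAG in line for line in head):
            missing.append(path)
    return missing
```

Run against a repository root, the function returns the files a maintainer should annotate so ingestion pipelines have parseable license metadata to preserve.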

For enterprises and product teams​

  • Audit AI assistants: Before enabling Copilot or similar tools in production, conduct risk assessments and evaluate configuration options that minimize verbatim output.
  • Contractual protections: Negotiate vendor terms that specify training data policies, warranty and indemnity provisions, and obligations around provenance and takedowns.
  • Operational controls: Adopt DLP, CI/CD scans, and access restrictions to reduce the chance that proprietary or restricted code will be ingested inadvertently.

For platform operators and model builders​

  • Preserve provenance where possible: Design ingestion pipelines that store or link CMI to training artifacts and implement traceable provenance records.
  • Transparency and opt‑out tools: Offer opt‑out mechanisms for authors and clear, verifiable removal or remediation processes for material that should not be included in training sets.
  • Engineering mitigations: Strengthen deduplication and suppression layers to reduce memorization; document those engineering choices publicly as part of a compliance posture.
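
Deduplication and provenance retention can be combined rather than traded off. The sketch below deduplicates snippets by a normalized content hash while recording every origin that supplied each snippet; the `(source_url, license_id, text)` tuple layout and whitespace normalization are hypothetical illustrations, not any vendor's actual pipeline:

```python
import hashlib

def dedup_with_provenance(snippets):
    """Deduplicate snippets by normalized content hash, keeping a
    provenance map from each hash to every origin that supplied it.

    `snippets` is an iterable of (source_url, license_id, text) tuples;
    this layout is an assumption for illustration.
    """
    kept = {}        # hash -> canonical text (first seen wins)
    provenance = {}  # hash -> list of (source_url, license_id)
    for source_url, license_id, text in snippets:
        # Collapse whitespace so trivially reformatted copies collide.
        digest = hashlib.sha256(" ".join(text.split()).encode()).hexdigest()
        kept.setdefault(digest, text)
        provenance.setdefault(digest, []).append((source_url, license_id))
    return list(kept.values()), provenance
```

The design point is that dropping duplicate text need not drop its CMI: the provenance map preserves every license and source that accompanied a snippet, even when only one copy enters the training set.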

Critical analysis: strengths, weaknesses and litigation dynamics​

Strengths of the plaintiffs’ posture​

  • Empirical plausibility: Real‑world incidents and research demonstrating memorization and exposure of private content give the plaintiffs a credible narrative that the risk is not hypothetical.
  • Policy resonance: Courts and policymakers are increasingly receptive to the idea that web‑scale scraping without attribution imposes economic and reputational harms on creators. The DMCA claim dovetails with that policy conversation.
  • Discovery leverage: If the appeal results in remand, plaintiffs can obtain discovery into training corpora and filtering systems that may produce the concrete evidence the district court said was lacking.

Weaknesses and legal obstacles for plaintiffs​

  • Pleading burden: Federal pleading standards require plausibility; without a concrete example of infringement tied to the removed CMI, a DMCA claim is vulnerable to dismissal. The district court emphasized this point.
  • Doctrinal fit: Section 1202(b) was not drafted with machine learning in mind. Extending it to training ingestion and non‑literal outputs raises statutory interpretation problems that appellate courts may be reluctant to resolve without clearer legislative guidance.

Strengths of Microsoft and OpenAI’s defense​

  • Procedural weaponry: The defense can leverage pleading standards and standing doctrine to attack the claim before discovery, limiting exposure and preserving trade secrets.
  • Public policy argument: Framing broad DMCA liability as an innovation choke point is a politically potent defense that can influence judicial restraint.

Weaknesses and exposure for Microsoft and OpenAI​

  • Technical slipups: Documented incidents where Copilot resurfaced private or removed data undercut categorical statements that memorization is negligible. Those incidents strengthen the plaintiffs’ practical risk story.
  • Regulatory and reputational risk: A narrow judicial holding in favor of defendants could nonetheless spur legislative or regulatory responses that impose provenance obligations or new compliance regimes.

Litigation strategy and likely next moves​

  • Plaintiffs will likely press for discovery if the Ninth Circuit remands, seeking logs, training manifests, deduplication thresholds, and any examples of outputs resembling plaintiffs’ code.
  • Defendants will press for a dispositive appellate holding that limits Section 1202(b) claims in the ML context; if unsuccessful, they will emphasize protective orders and trade‑secret safeguards in discovery.
  • The industry will watch closely; even an outcome short of broad liability could produce contract rewrites, takedown tooling, and new provenance standards in product terms and open‑source license practices.

Practical checklist: what to do now​

  • For maintainers: audit public repositories for robust license headers, add machine‑readable metadata, and document publication timestamps.
  • For enterprises: run a risk assessment for any code‑generation features; update procurement terms to require vendor disclosure of training sources.
  • For platform operators: install provenance logging and provide verifiable removal and opt‑out mechanisms for creators.
  • For policymakers and trade groups: convene multi‑stakeholder working groups to define practical provenance standards that balance attribution with innovation.
These steps do not resolve the statutory questions, but they reduce near‑term risk and demonstrate good‑faith operational changes that mitigate both legal exposure and public‑policy backlash.

What to watch in the Ninth Circuit’s opinion​

  • Textual reasoning: Does the panel read Section 1202(b) narrowly in light of its drafting context, or does it adopt a flexible interpretation that adapts the statute to technological change?
  • Causation framing: Will the court require a tight causal chain between CMI removal and an identified infringement, or accept a more probabilistic theory that recognizes blended reproductions as actionable?
  • Remedies and scope: If the court sides with plaintiffs, will it limit remedies to case‑specific injunctions and fact‑driven relief, or suggest industry‑wide obligations for provenance and metadata retention?
The answers will matter not only to the parties but to thousands of developers, enterprises, and platform operators whose business models and engineering architectures depend on how the law balances attribution, reuse, and innovation.

Conclusion​

The Ninth Circuit’s willingness to decide whether Section 1202(b) can reach the ingestion practices behind Codex and GitHub Copilot is a recognition that longstanding statutory frameworks must be tested against modern AI capabilities. A narrow holding will provide breathing room for model builders but leave important non‑DMCA claims to proceed; a broad holding will push the industry toward stronger provenance guarantees, licensing regimes, and possibly new compliance markets. A remand will encourage fact development and slow doctrinal change while preserving the potential for sweeping remedies later.
For creators, developers and enterprises, the lesson is immediate: metadata and provenance matter more than ever. For courts and lawmakers, the case underscores a choice — adapt existing statutes thoughtfully and with clear limits, or ask Congress to write rules that explicitly reconcile machine learning with attribution, compensation and the public interest. The Ninth Circuit’s ruling will not be the final word on AI training data liability, but wherever it lands it will mark a significant waypoint in how the law governs the modern creative commons.

Source: Law360 9th Circ. Mulls DMCA Claim Against Microsoft And OpenAI - Law360
 
