MediaNews Group Sues OpenAI and Microsoft for News Content Used to Train AI

Nine regional newspapers owned or managed by MediaNews Group have filed a sweeping 119‑page federal copyright complaint in the U.S. District Court for the Southern District of New York, accusing OpenAI and Microsoft of harvesting millions of copyrighted news articles to train the large language models behind ChatGPT, Copilot and related generative AI products. The papers are seeking “in excess of $10 billion” in damages.

Background and overview

The complaint, filed on November 26, 2025, names nine titles — the Los Angeles Daily News, The San Diego Union‑Tribune, the San Bernardino Sun, the Boston Herald, the Hartford Courant, The Morning Call, the Boulder Daily Camera, the Daily Press and The Virginian‑Pilot — and runs to 119 pages, according to reporting based on the filing.
This action is the latest major publisher suit targeting the business practices of AI platform companies and follows a shift toward aggressive discovery in similar cases during 2024 and 2025. The new complaint is explicitly framed as both a copyright enforcement case and an economic‑harm claim: publishers allege not only unauthorized copying but also that AI outputs displace readers and siphon subscription and ad revenue.
The filings continue a pattern: publishers coordinating legal strategies while litigating how courts should treat web‑scale training of large language models (LLMs) and the commercial exploitation of model outputs. MediaNews Group’s suit is separate from an April 2024 MediaNews/Tribune‑related complaint covering a different set of titles; that earlier matter remains active on a different procedural track.

What the complaint alleges

The core allegation is straightforward in its legal framing: OpenAI and Microsoft built and commercialized generative AI products by copying and training on copyrighted journalism without permission, then monetized the resulting models without paying the publishers. The plaintiffs seek injunctive relief and monetary damages exceeding $10 billion.
Lawyers for the newspapers argue the defendants “pay for chips, computers and programmers — but steal the raw material for GAI products — valuable well‑written content — from hard‑working journalists,” a rhetorical framing designed to underline both the literal copying claim and the economic displacement theory.
Legally, the complaint is expected to plead multiple counts typical of this litigation genre:
  • Direct and contributory copyright infringement claims;
  • Requests for injunctive relief to stop further ingestion or use of the plaintiffs’ works for model training; and
  • Potential state‑law tort claims or related statutory theories depending on the detailed pleading.
The suit also presses a damages theory tied to market displacement: plaintiffs assert that when AI assistants generate answers or summaries sourced from reporting, potential readers are less likely to visit the publishers’ websites, causing lost subscriptions and ad impressions. That dual model — copying plus displacement — underpins the plaintiffs’ valuation and damages demand.
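To make the copying‑plus‑displacement theory concrete, the sketch below shows one way such a damages model could be structured. It is purely illustrative: every rate and dollar figure is an assumption for exposition, not a number taken from the complaint.

```python
# Purely illustrative structure of a displacement damages model: lost visits
# times revenue per visit, plus lost subscription conversions. Every figure
# below is a made-up assumption and does not come from the complaint.

def displacement_loss(monthly_ai_answers: int,
                      displacement_rate: float,
                      revenue_per_visit: float,
                      sub_conversion_rate: float,
                      sub_lifetime_value: float) -> float:
    """Estimated monthly revenue lost when AI answers substitute for site visits."""
    lost_visits = monthly_ai_answers * displacement_rate
    ad_loss = lost_visits * revenue_per_visit
    subscription_loss = lost_visits * sub_conversion_rate * sub_lifetime_value
    return ad_loss + subscription_loss

# Hypothetical inputs: 5M AI answers per month drawing on the papers' reporting,
# 40% of which replace a visit, $0.01 ad revenue per visit, a 0.5% subscription
# conversion rate, and $120 subscriber lifetime value.
print(f"${displacement_loss(5_000_000, 0.4, 0.01, 0.005, 120):,.0f} per month")
```

The real dispute will be over the inputs: how often AI answers actually draw on the plaintiffs’ reporting, and how many of those answers genuinely replace a visit.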

Why this matters: economic and legal stakes

For local and regional newspapers, the stakes are existential. Many operate on thin margins and depend on referral traffic and subscriptions to sustain investigative reporting and local coverage. Publishers argue that machine‑generated summaries or direct answers offered by AI assistants represent a new distribution channel that bypasses pageviews and subscription prompts, worsening an already precarious business model.
The defendants, for their part, assert that training on broadly available web content is lawful and that model training involves statistical learning rather than wholesale verbatim copying. They also point to product features that can drive traffic to original sources when attribution is present. Those defenses will collide with the plaintiffs’ technical discovery requests aimed at proving direct ingestion and downstream output reuse.
The outcome has potential ripple effects:
  • A plaintiff victory could force large AI developers to negotiate licensing deals with news organizations or to design provenance and opt‑out systems for publishers.
  • A defense win could entrench broader reuse of web content for model training without payment, reshaping incentives across media and tech.
Either track would affect product design, licensing markets, platform economics and public policy.

The discovery battleground: what publishers will seek

Earlier cases in this litigation landscape have turned on discovery — the technical documents that show what training data was used and how models respond to inputs. The MediaNews Group complaint and analogous suits have made clear that plaintiffs will demand:
  • Inventories of training corpora and sampling methods;
  • Retention logs showing which datasets were kept or deleted;
  • Model output logs and usage telemetry that can tie specific outputs to publisher content; and
  • Internal communications about dataset choices, deletions and legal advice.
Recent rulings in related matters indicate courts are willing to order production of internal communications where state of mind and dataset decisions are material. For example, a magistrate judge in a separate Authors Guild matter recently ordered OpenAI to produce communications about why it deleted certain book datasets, undercutting a blanket privilege claim in that context. Plaintiffs view those rulings as precedents that increase the likelihood of probing, far‑reaching discovery in copyright suits against AI vendors.
Technically, proving the plaintiffs’ case will require connecting the dots between ingestion and output — a difficult task because training generally scrambles individual documents into statistical weights. Plaintiffs plan to use logging, sampling, and provenance techniques, plus expert testimony and output sampling, to show that LLMs can reproduce or closely paraphrase copyrighted works in ways that harm publishers’ markets. Defendants will counter with arguments about transformation, public availability, and the infeasibility of licensing every web publisher.
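As a rough illustration of what output sampling might look like, the sketch below scores how much of an article’s text reappears verbatim in a model’s output using word n‑gram overlap. This is a generic near‑duplicate heuristic, not a method described in the filing, and the model call referenced in the comments is hypothetical.

```python
# Illustrative sketch of output sampling: score how much of an article's text
# reappears verbatim in a model's output, using word n-gram overlap. This is a
# generic near-duplicate heuristic, not a method taken from the complaint.

def word_ngrams(text: str, n: int = 8) -> set[str]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(article: str, model_output: str, n: int = 8) -> float:
    """Fraction of the article's n-grams that also appear verbatim in the output."""
    article_grams = word_ngrams(article, n)
    if not article_grams:
        return 0.0
    return len(article_grams & word_ngrams(model_output, n)) / len(article_grams)

if __name__ == "__main__":
    # In practice an expert would prompt the model with an article's headline or
    # lead (via some hypothetical `generate_continuation` helper) and score the
    # continuation; here two hard-coded strings stand in for that step.
    article = ("The city council voted 7-2 on Tuesday night to approve the "
               "revised transit budget after months of contentious public debate")
    output = ("According to local reporting, the city council voted 7-2 on Tuesday "
              "night to approve the revised transit budget after months of "
              "contentious public debate over fare increases")
    print(f"verbatim 8-gram overlap: {overlap_ratio(article, output):.2f}")
```

High ratios across a large sample of articles would support the reproduction claim; consistently low ratios would support the defendants’ statistical‑learning framing.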

Strengths of the publishers’ case

  • Concrete damages framing: The complaint quantifies harm and links alleged copying to revenue streams that OpenAI and Microsoft monetize. A clear monetary figure, “in excess of $10 billion,” focuses attention on measurable commercial consequences.
  • Favorable discovery posture: Recent judicial decisions in related cases have signaled willingness to allow discovery into how models were trained, including compelling internal communications where appropriate. Those developments increase the plaintiffs’ chances of obtaining technical evidence.
  • Policy momentum and public sympathy: Lawmakers and the public are increasingly wary of technology that can disintermediate news organizations, giving publishers a sympathetic backdrop for both litigation and potential legislative remedies.
These strengths make the litigation a plausible vehicle for compelling document production and increasing negotiation leverage, even if a full trial and final damages award remain years away.

Weaknesses and legal hurdles for plaintiffs

  • The fair‑use inquiry is complex: Courts weigh purpose, nature, amount, and market effect. Defendants will argue that model training is transformative and therefore protected as fair use — a fact‑specific defense that has succeeded in some contexts.
  • Tracing problem: Demonstrating that models were trained on specific copyrighted articles — and that outputs reproduce those articles in a way that violates copyright — is technically challenging. LLMs do not store documents intact, and distinguishing permissible summarization from infringing reproduction requires nuanced technical evidence.
  • Procedural expense and timeline: Litigation of this complexity is slow and costly. Even a favorable ruling may take years to convert into meaningful revenue for strained newsrooms. Plaintiffs face the risk that prolonged litigation could bankrupt smaller publishers before recovery.
Because these weaknesses are also factual, much will turn on the quality of discovery and the technical experts both sides marshal.

Broader legal and policy implications

If courts ultimately adopt a restrictive view of web‑scale training without permission, the practical impact will be significant. AI developers would face higher compliance costs, potential rulings requiring provenance tracking or opt‑out mechanisms, and incentives to negotiate licensing deals with major publishers. Conversely, narrow rulings favoring defendants could leave publishers with limited remedies and encourage alternative responses such as paywalls, technical hardening and commercial licensing offers.
Legislatively, Congress and regulators are already watching. Potential outcomes include:
  • Standards for data provenance and model transparency;
  • A statutory pathway for licensing or revenue sharing for news content used to train commercial models; or
  • Regulatory guidance on automated agents and the liability that arises when they redistribute news content.
Any such policy response will need to balance creators’ rights, innovation incentives and the practicalities of model development at web scale.

Technical reality: what training typically entails and why tracing is hard

Large language models are trained on extremely large corpora that mix licensed datasets, public web crawls and other sources. Training converts text into learned parameters rather than stored copies, which complicates direct attribution of any downstream output to a single input article. Plaintiffs will have to show either:
  • that specific articles were present in training sets and that outputs reproduce them in a protected manner; or
  • that the models’ use of the articles caused commercial harm by supplanting readers.
Defendants, for their part, will emphasize that training is statistical and that outputs are emergent, not verbatim replicas. Courts will need to reconcile these technical realities with copyright doctrine. The upcoming discovery fights will focus on logs, dataset inventories and the provenance pipelines that show how data was collected, sampled, and retained.
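If discovery does produce dataset inventories, one simple (and deliberately simplified) way to test the first prong would be to normalize and hash article text and look for matches in a disclosed index of document hashes. The file format and helper names below are assumptions for illustration; real provenance work would also need near‑duplicate matching (for example shingling or MinHash), because training copies are rarely byte‑identical to the published article.

```python
# Illustrative sketch: test whether specific articles appear verbatim in a
# disclosed training-data inventory. Assumes discovery yielded a text file of
# per-document SHA-256 content hashes, one per line (a hypothetical format).

import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting changes don't break matches."""
    return " ".join(text.lower().split())

def content_hash(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def load_corpus_hashes(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def articles_in_corpus(articles: dict[str, str], corpus_hashes: set[str]) -> list[str]:
    """Return IDs of articles whose normalized text exactly matches a corpus hash."""
    return [article_id for article_id, body in articles.items()
            if content_hash(body) in corpus_hashes]
```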

What publishers can do, and are doing, outside the courtroom

Publishers are deploying a mixed strategy:
  • Technical hardening: moving full text behind server‑side gates, tightening session controls, and implementing bot detection to reduce the risk of automated scraping and agent‑driven access (a rough sketch of this gating pattern appears below).
  • Commercial offers: exploring machine‑readable licensing or API deals so AI platforms can legally access high‑value reporting while paying publishers.
  • Industry coordination: aligning through trade groups and coordinated litigation to raise the transactional cost of unauthorized ingestion and to press for public policy responses.
These layered defenses aim to reduce the near‑term risk and create leverage for post‑litigation licensing conversations.
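As a very rough sketch of the server‑side gating idea in the first bullet above, the toy Flask handler below serves only a short preview to unauthenticated or bot‑like requests and the full text to subscribers. The user‑agent substrings, routes and in‑memory article store are placeholders, not a production bot defense, which in practice relies on maintained crawler lists, robots.txt directives and behavioral detection.

```python
# Toy sketch of server-side gating for article text. All names, routes and the
# blocked user-agent list are illustrative placeholders, not production advice.

from flask import Flask, jsonify, request, session

app = Flask(__name__)
app.secret_key = "replace-with-a-real-secret"

ARTICLES = {"demo": "Full article text would live in a real content store..."}

# Substrings seen in some self-identified AI crawlers; real deployments use
# maintained lists plus behavioral detection rather than a hard-coded tuple.
BLOCKED_AGENT_SUBSTRINGS = ("gptbot", "ccbot", "claudebot")

def looks_like_ai_crawler(user_agent: str) -> bool:
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_SUBSTRINGS)

@app.route("/article/<article_id>")
def article(article_id: str):
    body = ARTICLES.get(article_id, "")
    if looks_like_ai_crawler(request.headers.get("User-Agent", "")):
        # Automated access gets a licensing pointer instead of the content.
        return jsonify({"error": "automated access requires a license"}), 403
    if not session.get("subscriber"):
        # Anonymous readers get a short preview rather than the full text.
        return jsonify({"id": article_id, "preview": body[:300]})
    return jsonify({"id": article_id, "body": body})
```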

Risks of overly broad remedies

Courts and policymakers must be mindful of unintended consequences. Overbroad injunctions or rigid licensing mandates could:
  • Increase compliance costs, disproportionately harm startups and academic researchers, and chill innovation;
  • Create fragmentation in data access that raises the cost of model research and development; and
  • Incentivize evasive behavior by bad actors if policy solutions are slow or inconsistent across jurisdictions.
A balanced remedy will need to protect creators while preserving legitimate avenues for model training, research, and noncommercial use.

How the case may proceed and what to watch next

  • Early motions: Expect immediate motions to stay, strike or limit discovery, and disputes over the scope of logging and internal communications. Recent rulings in other cases suggest courts will not reflexively shield internal dataset decisions from discovery.
  • Discovery deluge: If courts permit broad discovery, the key battleground will be dataset inventories, deletion logs, model output telemetry, and communications about dataset retention and legal advice. These documents will determine how strong the plaintiffs’ tracing and willfulness theories are.
  • Settlement pressure: Given the commercial stakes and reputational risk, the parties may face pressure to negotiate licensing, revenue sharing, or gating arrangements prior to a merits trial. Media companies and tech vendors both have incentives to avoid protracted public technical disclosure.
  • Appellate and policy consequences: Whatever the district court decides, final resolution may take years and could seed important appellate precedent or spur legislative reforms.

Practical takeaways for Windows and enterprise readers

  • Enterprises using AI assistants should be aware of evolving legal exposure tied to how those models were trained and how their outputs are used in commercial workflows. The litigation underscores the need for careful supplier due diligence and contract clauses that address training data provenance.
  • Developers and ISVs building on Copilot, ChatGPT or other AI services should incorporate attribution, provenance checks and configurable grounding to limit risks of redistributing copyrighted news content. Designing conservative defaults for browsing, memory and content reuse will reduce legal and reputational hazards (a minimal example of this pattern follows this list).
  • Publishers should prioritize layered defenses: technical access controls, business models that monetize machine access (APIs/licenses), and participation in coordinated legal or policy efforts to shape the rules of engagement.
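As one minimal example of the conservative defaults described in the second bullet above, the sketch below only returns an AI‑generated summary when the source domain appears on an operator‑maintained licensed allowlist, and always carries attribution. The allowlist contents and record shape are assumptions for illustration, not part of any vendor’s actual API.

```python
# Sketch of a conservative grounding policy: summaries are only surfaced for
# licensed sources and always carry attribution; otherwise the caller should
# fall back to a plain link. The allowlist and record shape are illustrative.

from dataclasses import dataclass
from urllib.parse import urlparse

# Populated from the operator's actual licensing agreements (placeholder here).
LICENSED_DOMAINS = {"example-newswire.com"}

@dataclass
class GroundedAnswer:
    text: str
    source_url: str
    source_title: str

def build_answer(summary: str, source_url: str, source_title: str) -> GroundedAnswer | None:
    """Return an attributed answer only if the source domain is licensed; None otherwise."""
    domain = urlparse(source_url).netloc.lower().removeprefix("www.")
    if domain not in LICENSED_DOMAINS:
        return None
    return GroundedAnswer(text=summary, source_url=source_url, source_title=source_title)

if __name__ == "__main__":
    answer = build_answer("Two-sentence summary of the story...",
                          "https://www.example-newswire.com/story",
                          "Example headline")
    print(answer)
```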

Caveats and unverifiable claims

Several assertions in public reporting and legal filings are framed as allegations and remain unproven until tested in discovery or at trial. For instance:
  • The precise composition of the training corpora and whether specific publisher articles were ingested remains a factual claim to be established through discovery; it is not independently verified at this stage.
  • Statements about how often AI outputs reproduce publisher content, or the exact magnitude of readership “displacement,” are central to the damages theory but will require empirical proof; the complaint sets a damages number in excess of $10 billion, which is a legal claim rather than an adjudicated fact.
Where judicial orders have already compelled production of internal documents in related cases, that progress is meaningful; however, the existence of discovery orders in other matters does not automatically prove the factual accuracy of the publishers’ allegations here.

Conclusion

The MediaNews Group complaint adds a consequential chapter to the broader struggle over how society values journalism in the age of generative AI. The suit crystallizes two connected claims — unauthorized ingestion of copyrighted content for model training, and economic displacement when AI outputs substitute for direct readership — while leaning on a discovery strategy that has proven effective in related litigation.
What follows will be highly technical, expensive and precedent‑setting: judges will wrestle with the mechanics of machine training, lawyers will fight over logs and communications, and newsrooms will await either a legal vindication that yields licensing leverage or a legal defeat that forces them to double down on technical and commercial defenses. The case is likely to shape commercial negotiations, product design, and public policy for years to come.

Source: Chicago Tribune, “9 more newspapers sue OpenAI, Microsoft, alleging stolen content used in AI apps”