Some of the world’s largest AI labs stand accused of quietly harvesting recorded music to teach their models how to sing, riff and mimic the voices and styles of living artists, and a newly public dossier compiled by the International Confederation of Music Publishers (ICMP) has reignited a long-running fight over whether that practice is illegal, avoidable, or simply the cost of building modern generative AI.
Background
The ICMP, a Brussels-based trade body representing major and independent music publishers worldwide, says it has spent two years assembling evidence that AI developers scraped copyrighted songs and lyric databases from public platforms, open repositories and leaked datasets, then fed millions of those works into the training sets for commercial and research systems. The claim, presented to journalists on an exclusive basis, alleges the unauthorised use of songs and catalogues by household names including Beyoncé, Bob Dylan and The Weeknd. ICMP’s director general describes the scale as “tens of millions of works” and labels the practice “the largest IP theft in human history,” language that has drawn renewed attention from regulators, labels and lawmakers.
The dossier implicates multiple models and companies, from research projects like OpenAI’s Jukebox to production systems and developer stacks such as Google’s Gemini, Microsoft’s Copilot, Meta’s Llama 3 and Anthropic’s Claude. It points to a mixture of public crawling, GitHub-hosted repositories, dataset aggregators and leaked lists used to assemble large audio and lyric corpora. ICMP says it shared the evidence with Billboard and with policymakers, and has also published statements and technical resources urging AI firms to stop unlicensed scraping. (ra.co, icmpmusic.com)
What ICMP says it found
The core allegation
ICMP’s complaint is twofold: first, that AI developers have actively collected copyrighted music and lyric text en masse; second, that these materials were ingested and used to train models that now power commercial products and developer APIs, without licensing or permission. ICMP’s materials reportedly include lists of URLs (YouTube, Spotify and public GitHub links), private dataset manifests tied to music-making startups, and model output analyses which the trade body says demonstrate memorisation or replication of copyrighted lyrics and musical structure. (musictech.com, ra.co)
Models and datasets named
The investigation names a range of systems and sources:
- Research datasets and tools such as OpenAI’s Jukebox, which OpenAI has previously acknowledged was trained on a large set of songs, along with public datasets like AudioSet and references to MusicLM. (openai.com, ra.co)
- Commercial and open models including Google’s Gemini, Meta’s Llama 3, Microsoft Copilot, Anthropic’s Claude and various audio‑generation startups (Suno, Udio). (musictech.com, ra.co)
- GitHub repositories, leaked manifests and crawler logs that ICMP says show direct links to platform URLs for millions of musical works.
Why the claim matters: copyright, training data, and the law
Copyright basics and machine learning
Copyright law protects musical compositions (the underlying melody and lyrics) and sound recordings (the specific recorded performance). Training a model typically requires making copies of the source audio or text to compute weights during optimisation; whether those temporary or permanent copies infringe depends on jurisdiction, context, and legal doctrine such as fair use in the United States or specific exceptions in EU law. The point of tension is whether ingesting copyrighted works into a model is a use that requires a licence, or a transformative computational activity that courts treat differently. Recent rulings and agency guidance have not settled the question universally, which is why industry groups and courts are engaged in multiple cases testing those boundaries. (reuters.com, icmpmusic.com)
Precedent and ongoing litigation
A number of high-profile legal actions over training data and outputs have already shaped the landscape:
- Publishers and labels have sued model builders over lyric reproduction and alleged memorised outputs. Some settlements and injunctions have led to “guardrails” that prevent chatbots from outputting full lyrics. Reuters covered an early settlement between Anthropic and music publishers over lyrics, which constrained what models may output.
- News publishers and authors have likewise pursued claims against language model creators for the use of press and book content, producing mixed judicial outcomes that leave room for further appeals and clarifications.
How scraping and dataset construction actually happens
Typical pipelines and weak spots
AI teams and data-curation projects often combine many sources to assemble audio corpora:
- Web crawlers (Common Crawl and targeted scrapers) harvest audio and metadata from public platforms where content is accessible.
- Public datasets and research corpora are reused or extended (AudioSet, openly available music corpora).
- Aggregators and GitHub repositories sometimes collate links and mirrors; leaked manifests or private dataset dumps occasionally surface on developer forums. (openai.com, ra.co)
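The weak spot in pipelines like these is that links from many sources get merged with no record of where they came from. As a rough illustration, the sketch below audits a manifest of URLs against a list of reserved platforms. The manifest format (a plain list of URLs), the `RESERVED_DOMAINS` set and the function names are all assumptions made for this example; they do not describe any tool named in the ICMP dossier.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical list of platforms whose content needs a licence before use.
RESERVED_DOMAINS = {"youtube.com", "spotify.com"}


def normalised_host(url: str) -> str:
    """Return the URL's host, lower-cased and without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_reserved(url: str) -> bool:
    """True if the host is a reserved domain or a subdomain of one."""
    host = normalised_host(url)
    return any(host == d or host.endswith("." + d) for d in RESERVED_DOMAINS)


def audit_manifest(urls: list[str]) -> dict:
    """Split manifest entries into flagged and clear, with per-host counts."""
    flagged = [u for u in urls if is_reserved(u)]
    clear = [u for u in urls if not is_reserved(u)]
    counts = Counter(normalised_host(u) for u in flagged)
    return {
        "flagged": flagged,
        "clear": clear,
        "flagged_by_domain": dict(counts),
    }
```

A domain check of this kind only catches direct platform links. Mirrors and re-uploads would need content-level matching such as audio fingerprinting, which is outside the scope of this sketch.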
Technical implications for model behaviour
Models trained on such material will not store audio files verbatim, but they will internalise statistical patterns: melodic contours, timbral cues, chord progressions and lyric phrasing. That internalised representation enables generation that can mimic style and occasionally reproduce phrases or short lyric lines verbatim. Whether that reproduction crosses into infringement is a legal and empirical question; plaintiffs focus on instances where outputs are near-verbatim or where derivative works could undercut market value.
Company responses and public statements
When ICMP provided its dossier to journalists, many of the named companies either did not respond to the reporting or issued guarded public comments emphasising lawful, publicly available training sources or existing content controls. OpenAI, for example, has previously disclosed that Jukebox, a research project, was trained on a dataset of 1.2 million songs, but the company emphasises the research-only, non-commercial nature of that release and notes it has not focused on commercial music generation as a primary product line. Other companies have repeatedly argued that training on publicly accessible data can be consistent with fair use and that they are building safeguards (output filters, watermarking, lyric-guardrails) to reduce copying. (openai.com, techcrunch.com)
At the same time, music industry actors point to court filings (for example in suits against Anthropic and others) showing outputs that allegedly reproduce copyrighted lyrics or that models had access to large lyric corpora during training. These filings have become central evidence in emerging cases. (ra.co, reuters.com)
Strengths of the ICMP case — why the allegations are powerful
- Scale and specificity: ICMP claims not only general scraping but a catalogue‑level compilation of URLs and dataset manifests tied to identifiable commercial works, strengthening the hypothesis that the copies were deliberate and large.
- Industry backing: ICMP represents major global publishers with legal and forensic resources; coordinated pressure from Universal, Sony and others increases the chances of litigation and regulatory action that can compel discovery.
- Multiple corroborating threads: Independent reporting and prior lawsuits have already revealed similar patterns — lyric copying in model outputs, dataset leaks tied to Udio and Suno, and GitHub manifests containing music links — creating a web of overlapping evidence. (ra.co, musictech.com)
Weaknesses and legal hurdles for rightsholders
- Proving ingestion in discovery: AI firms can argue that training relied on huge, heterogeneous datasets where the presence of any one file is not dispositive; plaintiffs must show the specific works were ingested and that outputs materially replicate them. That often requires deep technical discovery into model logs, training manifests and data provenance, and courts tend to grant access to such material only after prolonged disputes.
- Transformative use and fair‑use defenses: Some defendants may rely on doctrines that treat training as a transformative or ephemeral computational act not covered by traditional copying rules. Courts have not uniformly accepted that view, but it remains a litigable line of defence.
- Jurisdictional complexity: Models are trained across borders, datasets are mirrored internationally, and claims may need to be brought in multiple venues with different copyright doctrines (U.S. fair use vs EU exclusive rights and exceptions), complicating enforcement.
Policy and technical responses available now
Regulatory levers
- The EU AI Act and documentation requirements: The EU AI Act requires builders of large general-purpose models to maintain technical documentation about training data, a change the music industry views as a major enforcement lever because it can force disclosure of datasets and provenance. ICMP has pushed for robust implementation of these rules.
- National legislatures and copyright offices: Agencies in the U.S. and elsewhere are examining how copyright law applies to model training. Additional guidance or legislative clarifications (opt‑in or opt‑out frameworks, licensing presumptions) could radically alter industry practice.
Commercial and technical mitigation
- Licensing marketplaces and data‑clearing houses: Several proposals advocate for an industry‑wide marketplace where AI firms must license musical works for training, with transparent accounting and payment to composers and rights holders. Such a market would create an auditable provenance chain and reduce litigation risk.
- Provenance, watermarking and dataset audit tooling: Technical controls — audio watermarks, signed manifests, and dataset scanning tools that can detect copyrighted recordings — can block or flag problematic ingestion. These are imperfect but improving.
- Output‑level guardrails: Firms can (and some already do) implement mechanisms that block verbatim lyric output, refuse to model unique vocal timbres on request, or label generated music with provenance metadata. These mitigations reduce downstream harm but do not resolve the ingestion question. (reuters.com, techcrunch.com)
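To make the output-level guardrail idea concrete, here is a minimal sketch of a verbatim-overlap check based on shared word n-grams. It assumes a hypothetical in-memory list of protected texts and an arbitrary 5-word window and 0.5 threshold; none of this reflects how any named vendor actually filters output.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lower-case the text and return its set of word n-grams."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(candidate: str, protected: str, n: int = 5) -> float:
    """Fraction of the candidate's n-grams that also occur in the protected text."""
    cand = word_ngrams(candidate, n)
    if not cand:
        # Candidate is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(cand & word_ngrams(protected, n)) / len(cand)


def should_block(candidate: str, protected_texts: list[str],
                 n: int = 5, threshold: float = 0.5) -> bool:
    """Block when the candidate overlaps heavily with any one protected text."""
    return any(
        overlap_ratio(candidate, p, n) >= threshold for p in protected_texts
    )
```

The threshold trades false positives against missed copies, and a check like this says nothing about melodic or stylistic similarity. It also leaves the ingestion question untouched, as noted above.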
What this means for creators, platforms and Windows users
- For creators and publishers: The ICMP case strengthens bargaining power. Publishers should document catalogue metadata, register rights in machine‑readable formats, and consider industry‑wide licensing or reservation systems to assert opt‑out preferences programmatically. RightsAndAI and similar initiatives already offer mechanisms to register reservations against scraping.
- For platforms and developers: Companies building music generation features or using audio datasets should adopt strict provenance audits, implement manual review of datasets, and negotiate licenses where possible. Legal exposure from unlicensed training can be existential given statutory remedies and the multiplicative effect of millions of works. (reuters.com, openai.com)
- For consumers (including Windows users deploying AI on PCs): Tools that produce music may become more limited in their ability to imitate specific artists unless their vendors secure licences or adopt strict guardrails. Developers shipping apps should push for clear model provenance and vendor warranties before using generative models in commercial products.
Scenarios to watch next
- High‑stakes litigation: Expect the ICMP dossier to inform or accelerate lawsuits and depositions; discovery can compel model builders to disclose data manifests or logs that either substantiate or undermine ICMP’s claims.
- Regulatory enforcement: The EU and other jurisdictions may use documentation obligations to audit models and impose penalties for non‑compliance.
- Commercial settlements and licensing schemes: A practical outcome could be negotiated licensing frameworks that let AI firms pay for training rights — either per‑model, per‑work, or via subscriptions — reducing litigation risk while ensuring creators are paid.
Critical analysis: strengths, risks and the path to resolution
The ICMP dossier is a credible escalation because it combines technical artifacts, industry knowledge and political muscle. The music industry’s aim is pragmatic: to convert an unregulated scraping economy into one where licences, transparency and payments are normative. That objective is both commercially sensible and legally defensible.
But the path to an unambiguous legal victory is uncertain. Defendants will contest both factual proof (did a specific model ingest a specific recording?) and legal theory (does training constitute an infringing use?). Courts will need to weigh complex evidence from model logs, dataset manifests, and expert testimony about memorisation vs pattern learning. Technical differences between sample-based audio models and symbolic or lyric models will also matter: audio waveform learning raises distinct issues from text-based lyric ingestion. (openai.com, reuters.com)
Operationally, the biggest near‑term win for rightsholders is procedural: stronger documentation requirements, enforcement of robots.txt and platform terms, and the use of machine‑readable reservation systems can force AI builders either to stop scraping or to negotiate. That is likely to produce faster real‑world outcomes than waiting for heavyweight adjudication in multiple courts.
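For a crawler, honouring robots.txt means checking it before every fetch. The sketch below uses Python’s standard `urllib.robotparser` on an inline robots.txt body so that it runs without network access. The `ExampleAIBot` user agent, the rules and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt body. "ExampleAIBot" is a hypothetical crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_body: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
blocked = may_fetch(parser, "ExampleAIBot", "https://example.com/music/track.mp3")
allowed = may_fetch(parser, "OtherBot", "https://example.com/music/track.mp3")
```

Here `blocked` is `False` because the site disallows everything for `ExampleAIBot`, while `allowed` is `True` for a crawler covered only by the wildcard rules. robots.txt is advisory, which is why the paragraph above pairs it with platform terms and reservation systems.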
From the AI company perspective, the options are also pragmatic:
- Invest in licensed datasets and provenance tooling,
- Build safer output filters and watermarking, and
- Press for clearer legislative rules that permit some training uses with compensation or opt‑out regimes.
Practical recommendations for developers, companies and WindowsForum readers
- For app developers using generative audio: insist on vendor attestations and dataset provenance before shipping features that mimic human voices or stylistic trademarks. Archive prompts, timestamps and model versions.
- For music creators and publishers: register your works in machine‑readable registries and use reservation/opt‑out tools offered by industry initiatives to make scraping easier to detect and legally contest.
- For enterprise buyers: request contractual warranties about training data provenance and indemnities that cover training‑data litigation; assume vendor indemnities may have carve‑outs and negotiate accordingly.
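The advice to archive prompts, timestamps and model versions can be as simple as an append-only log with one JSON line per generation. The following is a minimal sketch; the field names, the `GenerationRecord` type and the choice of a SHA-256 digest for tamper-evidence are assumptions for illustration, not an established standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class GenerationRecord:
    prompt: str
    model_name: str
    model_version: str
    created_at: str  # ISO 8601 timestamp in UTC


def make_record(prompt: str, model_name: str, model_version: str,
                now: Optional[datetime] = None) -> GenerationRecord:
    """Build a record, stamping it with the current UTC time by default."""
    stamp = now or datetime.now(timezone.utc)
    return GenerationRecord(prompt, model_name, model_version, stamp.isoformat())


def record_digest(record: GenerationRecord) -> str:
    """SHA-256 over a canonical JSON form, so later edits are detectable."""
    payload = json.dumps(asdict(record), sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def to_log_line(record: GenerationRecord) -> str:
    """Serialise the record plus its digest as a single JSON line."""
    entry = {**asdict(record), "sha256": record_digest(record)}
    return json.dumps(entry, sort_keys=True, ensure_ascii=False)
```

Each line returned by `to_log_line` can be appended to a log file. Prompts may contain personal or confidential data, so retention and access policies still need to be decided separately.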
Conclusion
The ICMP’s dossier marks a turning point in the public debate about generative AI and music. It reframes what has often been a diffuse set of complaints into a consolidated allegation: that significant quantities of copyrighted music have been collected and used to teach models without proper licences. The claim is backed by industry actors with the resources to litigate and lobby, and it intersects with evolving rules in the EU and pressing court cases in the U.S.
The practical fallout will be a mix of litigation, regulatory pressure, technical safeguards and commercial deals. For creators, the moment offers leverage to secure new revenue channels and protections; for AI builders, it’s a demand for operational discipline and legal clarity. For users and developers, especially those building on Windows and in the wider software ecosystem, the immediate lesson is to assume that provenance matters, that training data will be scrutinised, and that the era of unchecked web scraping for commercial AI training is entering a phase of legal and commercial accountability. (musictech.com, openai.com)
Source: DJ Mag AI tech companies accused of illegally scraping copyrighted music in ICMP investigation
