LINGUA Launches Open Call to Expand European Language Datasets, Including Ukrainian

ChatGPT · Oct 13, 2025

Microsoft’s AI for Good Lab has opened a new door for Europe’s smaller languages with LINGUA — an open call and funding program designed to expand high-quality, openly licensed datasets for underrepresented European languages, explicitly including Ukrainian. The program offers up to $50,000 per project, multi‑year Azure compute credits, and technical collaboration with academic partners EPFL and ETH Zürich, and it arrives amid growing European concern that today’s large language models (LLMs) reflect an overwhelmingly English‑centric web and worldview.

Background

Most modern LLMs were trained on vast amounts of web content that skews heavily toward English. That imbalance creates real-world deficits: poorer model performance for many European languages, weaker support for culturally specific concepts, and a competitive disadvantage for local businesses and researchers who need reliable language tools. Microsoft and European institutions have framed this as both a cultural and commercial problem — one that requires investment in data, compute, and local expertise to fix. The LINGUA open call sits inside Microsoft’s AI for Good Lab and is coordinated with the APERTUS project led by EPFL and ETH Zürich as part of a broader European push to strengthen multilingual AI capacity.

What LINGUA is (and what it's not)

LINGUA is an open call to collect and publish high‑quality speech and text datasets for low‑resource European languages.
The program focuses on openly licensed datasets intended to be reusable by researchers, open‑source developers, and industry, in deliberate contrast with closed, proprietary corpora.
LINGUA provides cash grants, Azure compute credits, and technical partnership rather than building models itself; the objective is to fill the upstream data gap so open models and tools can perform better for those languages.

The facts you need to know (verified)

Program owner: Microsoft AI for Good Lab (program page and open call).
Partners: Coordination with the APERTUS project (EPFL & ETH Zürich) and consultation with European bodies; technical collaboration opportunities are explicitly stated.
Funding: Selected projects can receive up to $50,000 for data collection; larger projects can be considered case‑by‑case.
Compute: Azure compute credits are available for up to two years to support processing and baseline experiments.
Timeline: Submissions opened in late September 2025; the deadline is November 11, 2025 (23:59 CET); awardees are to be announced on January 20, 2026.
Eligibility: NGOs, universities, research institutes, startups, cultural organizations and consortia are encouraged to apply, especially projects committed to open licensing and community engagement.

These items are confirmed on Microsoft’s official LINGUA page and independently reported by multiple outlets covering the announcement.

Why LINGUA matters for Ukrainian language tech

Closing a material gap in training data

Ukraine’s linguistic heritage and contemporary web presence are critical inputs for models to understand local idioms, legal texts, historical references, news, and technical domains in Ukrainian. By funding dataset creation rather than proprietary model hosting, LINGUA directly targets the bottleneck that prevents fair model performance: a shortage of high‑quality, ethically collected corpora.

Benefits for ecosystem actors

For universities and research labs: a clearer path to publish corpora that support reproducible research and local benchmarking.
For cultural institutions: an opportunity to digitize archives, oral histories, and regionally unique materials that often never reach model training pipelines.
For startups and SMEs: better natural language understanding (NLU), speech recognition, and TTS support in Ukrainian improves product localisation and customer experience.
For community groups and NGOs: funding to document dialects, minority varieties, and oral traditions that otherwise risk exclusion from the AI era.

Complementary tools already supporting Ukrainian ecosystems

Services such as high‑quality translation tools and enterprise copilots are increasingly supporting Ukrainian; projects that produce open datasets will improve those tools’ underlying models and reduce reliance on closed datasets or suboptimal cross‑language transfer. Independent reviews and local product rollouts show Ukrainian locale support becoming a practical requirement in enterprise stacks.

What LINGUA funds — and what applicants must plan for

Funded activities (typical)

Large‑scale speech collection (speaker diversity, dialect coverage)
Text digitization and annotation (OCR, cleaning, metadata)
Creation of aligned corpora for MT and speech‑to‑text pipelines
Licensing and legal clearance activities (copyright review, consent procurement)
Baseline experiments and release of evaluation suites

In practice, successful proposals should include

A clear, replicable data‑collection protocol (sampling strategy, consent forms, metadata schema).
A plan for open licensing and distribution (license choice, hosting plan, DOI or persistent URL).
A sustainability plan (how the dataset will be maintained, annotated, and governed post‑release).
Risk mitigation for PII and copyright (redaction workflows, exclusion of sensitive content from released datasets).
Community engagement (local collaboration, compensation for speakers/creators, review by language experts).
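
The protocol items above (metadata schema, consent, license choice) can be captured in a machine-readable record that is checked before anything is released. The following is a minimal sketch, not a LINGUA requirement: the field names, the `RecordingMetadata` class, and the list of accepted license identifiers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Assumption: example SPDX license identifiers; a real project would pick its own list.
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0"}


@dataclass
class RecordingMetadata:
    """Hypothetical per-recording metadata record for a speech corpus."""
    recording_id: str
    language_tag: str          # BCP 47 tag, e.g. "uk" or "uk-UA"
    region: str                # free-text region or dialect area
    speaker_id: str            # pseudonymous ID, never a real name
    consent_form_version: str  # which consent text the speaker signed
    consent_given: bool
    license: str
    recording_conditions: str  # e.g. "quiet room, phone microphone"

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the record passes."""
        problems = []
        if not self.consent_given:
            problems.append("missing consent")
        if self.license not in ALLOWED_LICENSES:
            problems.append(f"unrecognized license: {self.license}")
        if not self.language_tag:
            problems.append("missing language tag")
        return problems
```

Running `validate()` over every record before publication gives reviewers a simple, auditable gate: a recording without documented consent or with an unexpected license never reaches the release.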

Risks, trade‑offs, and governance challenges

LINGUA’s ambition is laudable, but the program — and applicants — must navigate real and non‑trivial risks.

1) Copyright and licensing complications

Collecting text corpora often encounters copyrighted material. Open licensing requires either (a) content originally created and released under an open license, (b) explicit permission from rights holders, or (c) content suitable for lawful text‑and‑data‑mining exemptions where they exist. European legal regimes and the EU AI Act complicate assumptions about reuse. Proposers should budget legal review and rights clearance.

2) Consent, privacy, and PII

Speech corpora risk including personally identifiable information. Datasets must be collected under explicit, documented consent with clear notices on reuse and redistribution. Plans for anonymization, PII redaction, and secure storage should be central to proposals. The program’s emphasis on ethical data practices means projects without robust privacy safeguards will struggle to pass review.
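
As one illustration of a redaction step for transcripts, the sketch below masks email addresses and phone-like digit runs with placeholder tokens. The patterns and the `redact_pii` helper are assumptions made for this example; they are deliberately coarse, will miss names and addresses, and can over-match other digit sequences such as dates, so they complement human review rather than replace it.

```python
import re

# Assumption: simple illustrative patterns, not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(redact_pii("Пишіть на olena@example.org або +380 44 123 45 67."))
# prints: Пишіть на [EMAIL] або [PHONE].
```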

3) Community ownership vs. extractive practices

Well‑intentioned data collection can become extractive if communities see no benefit from, or control over, how their voices and texts are used. Projects must center community governance: fair compensation, transparent benefit sharing, and local stewardship for downstream resources. Academic or corporate publishing without community involvement risks cultural harm, and recent scholarship warns of techno‑linguistic bias and epistemic injustice when development excludes speakers’ epistemologies.

4) Sustainability and maintenance

A one‑time dataset dump is not enough. Datasets need ongoing curation, bug fixes, and updates — especially speech datasets where annotation layers and quality checks evolve. Proposals should explain who will maintain data, how updates are handled, and how the resource will be hosted long‑term.

5) Vendor dependence and cloud lock‑in

Azure compute credits are valuable, but teams should design experiments and release artifacts in ways that do not create permanent dependence on a single vendor. Ensure deliverables (data files, metadata, code) remain portable and publicly hostable outside proprietary platforms.
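
One low-cost way to keep a release portable is to ship a checksum list alongside the files, so that any mirror outside the original cloud account can be verified byte for byte. The sketch below is an assumption about how a team might do this, not something the program prescribes; the helper names are hypothetical, and the hash-then-two-spaces-then-path layout mirrors what the `sha256sum` tool prints.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large audio files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_lines(release_dir: Path) -> list[str]:
    """Return one '<sha256>  <relative path>' line per file, in sorted order."""
    lines = []
    for path in sorted(p for p in release_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(release_dir).as_posix()  # relative paths stay valid on any host
        lines.append(f"{sha256_of(path)}  {rel}")
    return lines
```

Writing these lines to a file in the release lets downstream users confirm that a copy hosted on an academic repository matches the version processed on Azure.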

Practical recommendations for Ukrainian applicants

A short technical and ethical checklist (apply this to your proposal)

Define target modalities: speech (audio + transcriptions), text (cleaned corpora), aligned pairs (MT) or TTS resources.
Specify speaker sampling: age, gender, accent, regional dialects, socio‑economic representation, and a plan to recruit under‑represented speakers.
Use standard metadata schemas (e.g., language tags, locale, recording conditions) to maximize reuse.
Choose an open license compatible with reuse (explain your choice and its implications).
Embed a human‑in‑the‑loop policy for annotation to maintain quality and cultural accuracy.
Include objective evaluation metrics and a proposed baseline model to demonstrate dataset utility.
Budget for legal review and data governance costs (copyright clearance, consent processing).
Document data retention, access controls, and PII redaction procedures.
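
For speech datasets, a common objective metric to report with a baseline is word error rate (WER): the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The sketch below is a plain-Python illustration that assumes whitespace tokenization and no text normalization; a real evaluation suite would document its own normalization rules for Ukrainian (casing, punctuation, apostrophe variants).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c"))
# prints: 0.5
```

In the usage line, one substitution (`b` to `x`) and one deletion (`d`) against four reference words give 2 / 4 = 0.5.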

Recommended license and hosting approach

Prefer permissive open licenses that permit research and commercial use (but pay attention to contributor wishes and cultural sensitivities). Explicitly document any restrictions for sensitive material.
Host datasets on recognized academic or open‑data repositories with persistent identifiers to ensure discoverability and reproducibility.

How to use Azure credits strategically

Use credits for preprocessing-heavy tasks: noise reduction, forced alignment, and large‑scale annotation pipelines.
Run baseline experiments to produce model‑ready artifacts (e.g., standardized JSONL, wav+transcript pairs), then publish the artifacts independently from Azure to reduce lock‑in risk.
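
The "wav+transcript pairs" and "standardized JSONL" artifacts mentioned above can be produced with nothing beyond the Python standard library, which helps keep the output independent of any cloud platform. This is a minimal sketch under stated assumptions: each `.wav` file sits next to a same-named `.txt` transcript, and the `audio` and `text` field names and the `build_manifest` helper are hypothetical choices rather than a prescribed format.

```python
import json
from pathlib import Path


def build_manifest(data_dir: Path, manifest_path: Path) -> int:
    """Write one JSON object per line for each .wav that has a matching .txt transcript.

    Returns the number of pairs written.
    """
    count = 0
    with manifest_path.open("w", encoding="utf-8") as out:
        for wav in sorted(data_dir.glob("*.wav")):
            txt = wav.with_suffix(".txt")
            if not txt.exists():
                continue  # skip audio that has no transcript yet
            record = {
                "audio": wav.name,  # file name only, so the manifest works wherever the data is hosted
                "text": txt.read_text(encoding="utf-8").strip(),
            }
            # ensure_ascii=False keeps Cyrillic text readable in the output file
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Because the manifest stores file names rather than absolute paths or storage URLs, the same file remains valid after the dataset is copied to a repository outside Azure.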

Program strengths — why this could be a turning point

Upstream focus: Funding data collection addresses the root cause of underperformance for low‑resource languages rather than building isolated, closed models.
Academic collaboration: EPFL and ETH Zürich involvement signals a commitment to rigorous data standards and open science.
Open licensing requirement: Public datasets accelerate research, reproducibility, and independent model development across Europe.
Compute and visibility: Azure credits and Microsoft’s channels can help projects reach scale and find downstream partners.

Strategic pitfalls to avoid

Treating datasets as one‑off deliverables rather than community resources that require stewardship.
Ignoring rights clearance and privacy compliance in the rush to collect large volumes of text or audio.
Over-reliance on cloud credits as a substitute for sustainable compute arrangements and long‑term hosting.
Failing to involve language communities as co‑authors and stewards of the data.

How this fits into the European AI picture

The LINGUA program is one element in a broader European push for AI sovereignty, multilingual capacity, and ethical standards — themes the EU has prioritized in recent policy and funding decisions. Europe’s AI agenda is currently focused on building local capabilities, responsible governance, and public‑interest datasets to avoid dependence on non‑European corpora and models. LINGUA's alignment with those goals makes it more than a philanthropic gesture; it is a tactical effort to rebalance the AI training data ecosystem for European languages.

A pragmatic 6‑step applicant playbook

Rapid scoping: Define the language coverage, modalities, and clear community partners (2 weeks).
Legal clearance plan: Budget and timeline for copyright and consent paperwork (2–4 weeks).
Sampling and collection protocol: Pilot a small, high‑quality sample to test annotation and consent workflows (4 weeks).
Baseline experiments: Use Azure credits to process and validate data; produce an internal evaluation suite (4–8 weeks).
Open release & documentation: Publish with rich metadata, quality reports, and reproducible scripts.
Sustainability & outreach: Identify long‑term hosting and community stewards; publish an adoption roadmap for researchers and industry.

Final assessment

LINGUA is a meaningful, well‑scoped intervention aimed at a genuine structural problem: the systematic under‑representation of many European languages in AI training data. The combination of open licensing, academic partnership, and direct support for dataset creation positions the program to produce durable public goods that strengthen research and product ecosystems — particularly for Ukrainian language technology.
That said, success is not automatic. The program’s effectiveness will hinge on how applicants and Microsoft manage legal and privacy constraints, community participation, long‑term funding, and technical standards. Projects that embed ethical governance, transparent licensing, and sustainable stewardship into their design will be the ones that turn LINGUA’s short‑term grants into lasting infrastructure.

Quick reference — verified program details

Funding per project: up to $50,000 (larger grants considered case‑by‑case).
Compute: Azure credits for up to two years.
Program owner and partners: Microsoft AI for Good Lab (owner), with EPFL and ETH Zürich, in coordination with the APERTUS project and the Council of Europe.
Deadline to apply: November 11, 2025, 23:59 CET.
Award announcement: January 20, 2026.

Microsoft’s LINGUA call is a rare example of a large technology company funding the upstream public infrastructure that language technologies need. For Ukrainian researchers, cultural institutions, and civic technologists, it is a practical funding window to accelerate dataset creation — but the moment also demands discipline: rigorous consent, clear licensing, community governance, and a plan for what comes after the grant. The right projects will not only improve model performance; they will create publicly owned digital resources that anchor Ukraine’s language and culture in the AI systems that will increasingly shape society.

Source: dev.ua Microsoft launches LINGUA initiative to help increase the presence of the Ukrainian language in artificial intelligence models
