SURF DPIA Update: Copilot Education Risks Downgraded; Hallucinations and 18‑Month Retention Remain

SURF’s updated Data Protection Impact Assessment (DPIA) has moved two of four previously flagged high privacy risks for Microsoft 365 Copilot in education down to medium, but important hazards — notably AI inaccuracy (hallucinations) and an 18‑month retention window for pseudonymized telemetry — remain unresolved and carry material implications for universities and research institutes across Europe.

Background​

SURF, the Dutch cooperative that manages IT services and policy for universities, colleges and research institutes, commissioned a DPIA into Microsoft 365 Copilot after the feature’s integration into Office apps raised questions about how generative AI would interact with institutional documents, mail and calendars. The first DPIA, published in December 2024, identified four “high” risk areas. After months of negotiation and technical exchanges with Microsoft and the Dutch government’s Strategic Supplier Management (SLM), an updated assessment announced in September 2025 downgrades two of those risks to medium and rates the other two lower still, leaving a still‑significant residual risk profile for the paid education Copilot subscription that accesses organizational SharePoint, OneDrive and Exchange content.
SURF’s DPIA is unusual for its depth and its sector focus: the update spans hundreds of pages of technical testing and legal analysis, and it explicitly scoped the assessment to the paid M365 Copilot Education add‑on (the license variant that allows Copilot to read institutional content via Microsoft Graph). That scoping intentionally excluded free consumer Copilot experiences and under‑18 users (Microsoft’s paid EDU add‑on is not offered to accounts for minors).

What the updated DPIA examined​

Technical scope and methodology​

  • The review covered the paid education Copilot add‑on across desktop and web clients (Windows, macOS and browsers), with active traffic interception used to map data flows, cookies and telemetry events during typical workflows.
  • The assessors catalogued a large telemetry surface: testing recorded 208 distinct telemetry event types tied to Copilot activity, and the full assessment runs to roughly 217 pages, reflecting a very detailed technical and legal examination.
  • The evaluation separated content data (prompts, document text, Copilot replies, and Graph‑fetched content) from diagnostic/telemetry data (service events, identifiers and operational logs), because the two categories raise different privacy and GDPR concerns.

High‑level conclusions of the update​

  • Microsoft implemented a package of mitigation measures and documentation improvements that reduced the severity of several original findings; as a result, SURF downgraded two of the four original “high” risks to medium/amber.
  • The two remaining medium‑rated concerns are:
      • Inaccurate or fabricated personal data returned by Copilot in replies (hallucination risk).
      • Extended retention (18 months) of required service data and telemetry, even when pseudonymized — a period SURF considers excessive under GDPR’s data‑minimization and storage‑limitation principles.

Microsoft’s mitigation steps — what changed, and where gaps remain​

Transparency and DSAR handling​

Microsoft has improved documentation around Required Service Data and diagnostic telemetry, and it says it has clarified how it responds to Data Subject Access Requests (DSARs) that ask for diagnostic fields, including why some DSAR fields come back empty (Microsoft’s explanation is that those fields are simply not collected or sent). The company also committed to provide more accessible diagnostic information to customers when they exercise data subject rights. These changes helped SURF reduce the severity of some previously flagged risks.

Controls and contractual commitments​

Microsoft strengthened contractual and operational commitments for the education add‑on, including clarifications about the distinction between customer content that is not used for model training and telemetry that flows to service operations. SURF documented the vendor’s commitments but also emphasised that promises must be verifiable in deployment before the highest risk flags can be removed.

The Workplace Harms filter — insufficient documentation and controls​

Microsoft added a Workplace Harms filter intended to prevent Copilot from generating outputs that could harm employees (for example, inferences about performance or emotional state). However, Microsoft’s public documentation on this filter is sparse — a few sentences describing purpose and scope — and SURF judged that customers lack sufficient detail and control to understand or tune the filter for institutional contexts. The Microsoft safety and privacy documentation acknowledges workplace harms as a category, but the DPIA flags a lack of customer‑facing controls and rigorous documentation for definitions and severity scales.

The two remaining medium risks, in depth​

1) Inaccurate personal data generation — why “hallucinations” matter in education​

Generative models are probabilistic: they compose plausible text grounded in statistical patterns, not guaranteed facts. SURF’s testers reproduced scenarios where Microsoft 365 Copilot invented personal‑data‑related facts, for example fabricating academic paper titles or misattributing content. One test reported that Copilot suggested several distinct “recent scientific papers” yet attached the same source reference to all of them, a single article drawn from the ten SharePoint documents visible to it, producing an output that blends plausibility and provenance in a misleading way. When such outputs contain names, roles, or evaluative statements, they can cause reputational and legal harm in academic processes (admissions, grading, hiring, disciplinary action). SURF emphasises that the chat‑style UI increases the risk that users will treat outputs as established facts rather than probabilistic text completion.
Microsoft has pledged UI “friction” improvements aimed at nudging users to verify AI output before accepting it, and other internal mitigations, but the DPIA notes these changes are incompletely documented and not yet deployed across customer tenants. Until such controls appear and are demonstrably effective, the risk that Copilot will surface inaccurate personal data remains non‑trivial.

2) Telemetry retention and re‑identification risk​

The updated DPIA draws attention to Microsoft’s 18‑month retention policy for certain pseudonymized diagnostic data (Required Service Data and telemetry). SURF’s assessment cautions that even pseudonymized logs, when combined with granular telemetry (timestamps, tenant IDs, document references, device fingerprints), can allow individuals to be re‑identified through correlation with other datasets, especially in university or lab contexts where cohorts are small and records unique.
Microsoft argues the longer retention supports essential service functions (debugging, abuse detection, continuity), and it provides mechanisms — for example, account deletion or tenant termination — that can shorten retention in practice. SURF concludes, however, that an 18‑month default retention likely exceeds what is strictly necessary under GDPR’s storage limitation and data minimization principles for many educational use cases. The assessors recommend shorter default windows and clearer customer controls to delete or limit telemetry retention.
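To make the correlation concern concrete, the toy example below uses entirely synthetic data; the column names and the idea of joining telemetry against SharePoint version history are illustrative assumptions, not SURF’s test methodology. It shows how pseudonymized events collapse back to named individuals once a second dataset shares the same quasi‑identifiers.

```python
# Illustrative linkage attack on pseudonymized telemetry (synthetic data).
# Column names and datasets are hypothetical; the point is that quasi-identifiers
# (timestamps, document references) can re-identify people in small cohorts.
import pandas as pd

# Pseudonymized diagnostic events: no names, but precise context survives.
telemetry = pd.DataFrame({
    "pseudonym":  ["u_93af", "u_93af", "u_41c2"],
    "doc_id":     ["thesis_draft_07", "grant_review_02", "grant_review_02"],
    "event_hour": ["2025-03-03 14:00", "2025-03-04 09:00", "2025-03-04 11:00"],
})

# Auxiliary data that is often obtainable inside an institution: SharePoint
# version history showing which named account edited which document, and when.
version_history = pd.DataFrame({
    "editor":     ["a.devries", "b.jansen", "a.devries"],
    "doc_id":     ["thesis_draft_07", "grant_review_02", "grant_review_02"],
    "event_hour": ["2025-03-03 14:00", "2025-03-04 11:00", "2025-03-04 09:00"],
})

# Joining on the shared quasi-identifiers collapses pseudonyms back to names.
linked = telemetry.merge(version_history, on=["doc_id", "event_hour"])
print(linked[["pseudonym", "editor"]].drop_duplicates())
# In a three-person research group, "u_93af" is now unambiguously a.devries.
```

The longer the retention window, the more auxiliary datasets accumulate that can be joined in this way, which is why SURF treats the 18‑month default as a risk amplifier rather than a neutral operational detail.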

Why this matters to institutions — practical legal and operational consequences​

European educational institutions operate under GDPR and must satisfy obligations around lawful basis, purpose limitation, transparency, and data subject rights. SURF’s DPIA makes clear that embedding Copilot into workflows without rigorous controls creates exposure across several fronts:
  • Data subject rights friction: independent exercise of access, rectification and erasure rights becomes harder if diagnostic logs are opaque or retained beyond necessity.
  • Decision pipeline risk: using Copilot‑generated content in assessments, HR decisions or research summaries risks embedding inaccurate or unverified claims into high‑stakes outcomes.
  • Reputational and regulatory risk: supervisors in Europe are actively scrutinising AI systems and automated decision‑making; institutions that ignore SURF’s guidance may face complaints, audits or enforcement action.

Recommended institutional controls (what SURF and the DPIA recommend)​

SURF’s guidance shifts responsibility to institutions to govern Copilot use carefully. The DPIA — and accompanying operational notes — recommend a layered, defence‑in‑depth approach:
  • Disable Bing web search in Copilot deployments where external web queries are not required.
  • Restrict Copilot access by role: enable the feature only for tightly scoped user groups in pilots and grant access via role‑based controls.
  • Prohibit high‑risk inputs: create policies forbidding pasting of sensitive personal data or special‑category data into prompts without prior redaction.
  • Require human verification: mandate explicit human sign‑off on any Copilot output used for administrative, HR, academic evaluation or research decisions.
  • Logging and monitoring: ingest Copilot events into institutional SIEM/Purview logs to create an independent audit trail and detect abnormal use patterns (a minimal polling sketch follows these recommendations).
  • Complaint sharing and crowdsourced quality monitoring: SURF asks institutions that find inaccurate personal data or filtering problems to report complaints centrally so issues can be triaged and patterns identified across the sector.
Shorter retention windows for diagnostic telemetry and more granular DSAR outputs are key vendor commitments institutions should demand contractually. SURF plans to reassess Microsoft’s implementations in six months; that timetable gives institutions a near‑term decision point to revisit pilots and widen or restrict deployments.
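For the logging and monitoring recommendation, one practical route is the Office 365 Management Activity API, which exposes tenant audit records to registered applications; Copilot interactions are reported to surface under the Audit.General content type with operation names such as "CopilotInteraction", though the exact names should be confirmed against Microsoft’s current audit schema. The sketch below is a minimal polling loop, assuming an Entra ID app registration with the ActivityFeed.Read permission, an already‑started Audit.General subscription, and placeholder values for tenant and token; the hand‑off to a SIEM is left as a print statement.

```python
# Minimal sketch: pull recent audit events from the Office 365 Management
# Activity API and keep Copilot-related ones for the institutional SIEM.
# Assumes an Entra ID app with ActivityFeed.Read, a bearer token for
# https://manage.office.com, and a subscription to Audit.General started
# once beforehand via the /subscriptions/start endpoint. The
# "CopilotInteraction" operation name should be verified against the
# current Microsoft audit schema.
import requests

TENANT_ID = "<your-tenant-guid>"                         # placeholder
ACCESS_TOKEN = "<token for https://manage.office.com>"   # placeholder
BASE = f"https://manage.office.com/api/v1.0/{TENANT_ID}/activity/feed"
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

def copilot_events(start_time: str, end_time: str) -> list[dict]:
    """Return audit records whose operation looks Copilot-related."""
    # List the content blobs available for the general audit feed.
    listing = requests.get(
        f"{BASE}/subscriptions/content",
        params={"contentType": "Audit.General",
                "startTime": start_time, "endTime": end_time},
        headers=HEADERS, timeout=30)
    listing.raise_for_status()

    events = []
    for blob in listing.json():
        # Each blob URI returns a JSON array of individual audit records.
        records = requests.get(blob["contentUri"], headers=HEADERS, timeout=30).json()
        events.extend(r for r in records if "Copilot" in r.get("Operation", ""))
    return events

if __name__ == "__main__":
    for event in copilot_events("2025-09-01T00:00:00", "2025-09-01T23:59:59"):
        # Forward to the SIEM of choice here (Sentinel, Splunk, Elastic, ...).
        print(event.get("CreationTime"), event.get("UserId"), event.get("Operation"))
```

Keeping an institution‑controlled copy of these events matters precisely because Microsoft’s own telemetry retention and DSAR outputs are the contested points: an independent log lets the institution answer access requests and investigate incidents on its own timetable.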

Broader industry context and competing pressures​

European sovereignty, the CLOUD Act and vendor assurances​

SURF’s cautious stance reflects broader European debates about sovereignty, US extraterritorial law and the desirability of alternatives to US cloud providers. Microsoft has tried to address these concerns through initiatives like the EU Data Boundary and specific European commitments (including local processing and governance guarantees), and it says it will contest conflicting government demands where legally possible. Those assurances are meaningful, but national governments and consortia continue to explore sovereign options (for example, domestic models or European AI projects) to reduce vendor dependency.

Commercial context — Copilot and advertising​

Microsoft has been explicit about integrating Copilot into its broader product and commercial strategies, including advertising experiments in Copilot experiences and sustained investment in cloud and AI. Microsoft’s financial reporting shows continued strength in cloud and growing search/advertising lines — a reminder that Copilot is not only a productivity feature but also a commercial surface that will evolve rapidly. Public earnings and company reporting show that search and news advertising has been a meaningful revenue line (it was reported at roughly $12.6 billion in the fiscal year ending June 2024, and Microsoft’s fiscal commentary for 2025 highlights growth in search ad revenue as AI features scale). Those commercial incentives can shape product design and telemetry needs; institutions must therefore align procurement and governance with commercial realities.
Note: some reporting has claimed larger headline advertising numbers tied to Microsoft’s overall ad strategy; SURF’s DPIA does not hinge on the precise advertising dollar figure, but readers should treat single revenue claims with caution and verify against company financial documents.

Strengths of SURF’s approach — and where it matters​

  • Sector focus and rigor: the DPIA targets the idiosyncratic data mixes of education and research — a context where small cohorts, sensitive research data and academic reputations create higher harm potential. That sector specificity makes the analysis operationally useful rather than theoretical.
  • Iterative vendor engagement: SURF negotiated concrete vendor commitments and obtained expanded documentation and DSAR clarifications; this demonstrates a viable governance pathway short of blanket bans.
  • Practical, actionable controls: the guidance focuses on controls institutions can implement immediately (tenant settings, disabling Bing search, role gating) rather than abstract prohibitions.

Risks and unresolved weaknesses​

  • Unverifiable vendor promises: documentation and contractual commitments only reduce risk when independently testable; SURF emphasises the need for follow‑up audits and operational verification.
  • Telemetry granularity and re‑identification: even pseudonymized telemetry can be combined to re‑identify individuals; the 18‑month retention period amplifies that exposure unless stronger minimization or customer controls are offered.
  • UI design incentivising blind trust: chat‑style outputs presented inline in Word or Outlook create automation bias; UI friction and provenance tokens are sensible mitigations, but their rollout and effectiveness remain to be seen.
  • Scope creep and feature rollout: Copilot features continue to be extended across devices and surfaces; automatic installs, new vision or sharing affordances, and advertising experiments can each widen the telemetry surface and complicate governance. Institutions should expect ongoing product changes and plan governance accordingly.

Practical checklist for IT leaders and decision‑makers​

  • Inventory: Identify which tenants and user groups are eligible for the paid Copilot EDU add‑on and which datasets would be visible to Copilot (SharePoint sites, shared mailboxes, Teams channels).
  • Policy: Draft an AI usage policy that restricts Copilot for high‑risk processes (grading, research provenance, HR), mandates human verification, and defines escalation workflows for suspected hallucinations.
  • Controls: Disable Bing web search for Copilot where external search is unnecessary; enforce role-based access (a group-based licensing sketch follows this checklist); implement DLP rules to prevent sensitive document exposure to Copilot.
  • Logging & retention: Capture Copilot events in institutional logs, define a retention policy aligned to legal necessity, and press Microsoft contractually for a shorter default telemetry retention or stronger tenant controls.
  • Pilot & measure: Start with a small pilot, measure false positive/false negative rates for provenance and hallucinations, and require human sign‑off in any workflow that affects people.
  • Community reporting: Share complaints about inaccurate personal data centrally (SURF requests this) so systemic issues can be detected and remedied.
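One way to implement the role‑gating item in the Controls entry above is group‑based licensing through Microsoft Graph, so that pilot membership alone determines who receives the Copilot EDU add‑on. The sketch below calls the assignLicense action on a pilot group; the group ID and SKU ID are placeholders (the SKU can be looked up via /subscribedSkus), and token acquisition with the appropriate directory and licensing permissions is assumed to happen elsewhere.

```python
# Minimal sketch: scope the Copilot EDU add-on to a pilot group via
# Microsoft Graph group-based licensing. Group ID and SKU ID are
# placeholders; token acquisition (e.g., MSAL client credentials with the
# appropriate directory/licensing permissions) is assumed to happen elsewhere.
import requests

GRAPH = "https://graph.microsoft.com/v1.0"
ACCESS_TOKEN = "<graph token>"                          # placeholder
PILOT_GROUP_ID = "<object-id-of-copilot-pilot-group>"   # placeholder
COPILOT_SKU_ID = "<skuId-of-the-copilot-edu-add-on>"    # look up via /subscribedSkus

def assign_copilot_to_pilot_group() -> None:
    """Attach the Copilot license to the pilot group; members inherit it."""
    response = requests.post(
        f"{GRAPH}/groups/{PILOT_GROUP_ID}/assignLicense",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}",
                 "Content-Type": "application/json"},
        json={
            "addLicenses": [{"skuId": COPILOT_SKU_ID, "disabledPlans": []}],
            "removeLicenses": [],
        },
        timeout=30)
    response.raise_for_status()
    print("Copilot EDU add-on now follows membership of the pilot group.")

if __name__ == "__main__":
    assign_copilot_to_pilot_group()
```

Managed this way, widening or narrowing the pilot is a group‑membership change rather than a per‑user licensing exercise, which keeps the audit trail simple and makes it easy to pause the rollout at SURF’s six‑month reassessment point.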

Where to watch next​

  • SURF’s promised reassessment in six months will be the first real test of whether Microsoft’s mitigations — and their operational deployment — materially reduce the remaining medium risks. Institutions should align internal review cycles to that timetable.
  • Microsoft’s continued expansion of Copilot surfaces (vision, taskbar sharing, automatic app installs, advertising experiments) will change telemetry and governance needs; stay current with vendor admin controls and Message Center updates.
  • European regulatory and sovereign‑AI efforts (local models and EU Data Boundary initiatives) will shape procurement options and acceptable contractual baselines for data residency and access. Monitor regulatory guidance on algorithmic systems and profiling.

Conclusion​

SURF’s updated DPIA is a pragmatic, sector‑focused intervention: it recognises vendor progress and downgrades two formerly high risks — but it also places the onus squarely on educational institutions to govern Copilot use carefully while Microsoft completes promised mitigations. The two persistent medium risks — inaccurate personal data in replies and an 18‑month retention window for pseudonymized diagnostic data — are not theoretical; they have concrete legal and reputational consequences in academic settings where personal records, research integrity and small cohort sizes amplify harm potential.
In practice, the DPIA offers a roadmap: controlled pilots, layered technical controls, shorter telemetry retention where possible, mandatory human verification and a sector‑wide complaints channel. For institutions that want the productivity gains of Copilot without unnecessary exposure, the message is clear: adopt cautiously, enforce strictly, and demand verifiable vendor commitments — because promises on paper alone will not protect students, staff or research subjects from the real‑world effects of AI inaccuracy and opaque telemetry.

Source: PPC Land SURF downgrades Microsoft Copilot education risks to medium while privacy concerns persist