When the question is how much of your life an AI chatbot keeps, the short and verifiable answer is: it varies wildly — and right now Microsoft’s Copilot stands out as the one that, by design and contract, collects the smallest and least sensitive set of user data.
Background
AI chatbots exploded into consumer and enterprise life as near‑ubiquitous assistants. Alongside their utility came a predictable tradeoff: model quality and personalization often correlate with the volume and variety of data a vendor ingests. That tradeoff is not a trivial privacy question: some vendors record only minimal telemetry, while others list two dozen data types, from precise location to photos and messages, in their privacy disclosures. Recent comparative reporting shows these differences starkly.
This article explains the data‑collection behaviors of the major players — Microsoft Copilot, OpenAI’s ChatGPT, Google Gemini, China‑based DeepSeek, and Alibaba/Alibaba Cloud’s Qwen family — verifies the most load‑bearing claims against multiple independent summaries, analyzes the strengths and risks of each approach, and gives actionable guidance for Windows users who want to reduce their exposure by using browser versions, enterprise controls, or local models.
Overview: What “collects the least” actually means
Privacy claims are multidimensional. Saying an app “collects the least” can mean:
- It gathers fewer types of personal data (fewer categories listed in app reports).
- It collects less sensitive data (no access to photos, messages, or contacts).
- It promises not to use your prompts for model training.
- It restricts sharing with advertisers and third parties, or enforces tenant‑level governance in enterprise contracts.
Not every vendor checks all boxes. A responsible comparison weighs all four axes and validates vendor statements against published privacy policies and app store privacy reports. Independent summaries and technical writeups of the vendors’ policies line up on the broad picture: some vendors are explicit about human review, some publicly publish training opt‑out mechanisms, and one (Microsoft Copilot) emphasizes minimal collection and enterprise non‑training guarantees.
Microsoft Copilot — the least intrusive by design
What Copilot says it collects
Microsoft positions Copilot as an enterprise‑first assistant embedded into Microsoft 365. Its documentation and privacy language emphasize that Copilot’s use of your data is tenant‑scoped and contextual: it draws on data within your Microsoft 365 tenant to generate answers rather than contributing your prompts to a public training corpus. That separation is repeatedly highlighted in reporting and vendor documentation.
Why this matters: non‑training and compliance guarantees
Two practical consequences follow:
- Non‑training guarantees: Copilot’s stated practice is not to pull user prompts into foundation‑model training datasets for general LLM improvement, a meaningful difference from many consumer chatbots, which use prompts to refine models unless you opt out.
- Enterprise controls and compliance: Copilot is offered with governance controls and contractual compliance that map to organizational standards such as FedRAMP, HIPAA, and SOC, making it the pragmatic choice for regulated environments. This enterprise posture reduces the risk for businesses handling sensitive data.
Strengths and tradeoffs
- Strengths: tight Microsoft ecosystem integration (Word, Excel, Teams), tenant grounding for data, explicit enterprise non‑training language, and admin controls that limit exposure.
- Tradeoffs: Copilot’s privacy posture couples you to Microsoft services and policies; it’s best when used inside a managed Microsoft 365 tenant. For consumer users outside that ecosystem, the benefits shrink.
Google Gemini — powerful, but data‑hungry
What the reports show
Gemini’s app‑store privacy report and policy disclosures are notably expansive. Reporting that aggregates Gemini’s declared data categories lists over twenty types — including browsing history, contacts, emails, photos, precise location, search history, texts, and videos. Google also documents that human reviewers may examine chat content to evaluate quality, and it surfaces controls for deleting or disabling Gemini activity in account settings.
Strengths and risks
- Strengths: Gemini offers broad multimodal capabilities (images, video, live camera features, and direct Workspace integration), and Google provides transparent user controls for history and activity deletion.
- Risks: The breadth of declared data types raises exposure — both to accidental leakage and to policy pressure when content is handled by human reviewers. For privacy‑conscious users, that combination is a warning sign even while Google’s control surfaces are better documented than many competitors.
OpenAI ChatGPT — a middle ground, with public training exposure
ChatGPT and OpenAI occupy a middle ground. Historically, OpenAI used de‑identified user prompts to improve its models unless data‑usage settings were adjusted, and public reporting consistently notes that consumer‑facing ChatGPT conversations can feed broader training datasets unless an enterprise agreement specifies otherwise. This model‑training behavior contrasts with Copilot’s tenant‑scoped approach.
Strengths include broad ecosystem adoption, a large plugin and API ecosystem, and rapid feature development. The principal privacy tradeoff is model‑training exposure: prompts from casual ChatGPT use are more likely to feed model training than prompts kept inside a tenant‑scoped Copilot deployment.
DeepSeek — extremely affordable, jurisdictional questions
What’s being claimed and what’s verified
DeepSeek — a China‑origin entrant that gained rapid attention for aggressive pricing and high performance in certain benchmarks — is reported to collect extensive user data and to operate under a legal and regulatory posture that could permit broader state access. Several summaries describe DeepSeek’s free or very low‑cost consumer tiers and its open‑sourced R1 model for those who want to self‑host. These same summaries also flag geopolitically sensitive points: being China‑based may subject services or hosted data to Chinese law, raising additional vectors of concern for companies and privacy‑minded users.
Unverifiable or vendor‑claimed numbers
Public claims about DeepSeek’s development cost, parameter counts, or assertions that its launch erased hundreds of billions of dollars in chip‑vendor market cap should be treated with caution. Independent verification of vendor‑level training‑cost and parameter claims is often unavailable; multiple reports explicitly warn readers to treat these figures as vendor statements or market‑analysis inferences until third‑party audits appear.
Practical risk assessment
- If you use DeepSeek, evaluate legal and contractual exposure carefully.
- For enterprises: perform legal, export‑control, and supply‑chain due diligence before integrating it into regulated pipelines.
- For individual users: weigh the cost advantages against jurisdictional and data‑access concerns.
Qwen — short app‑store reports, longer policy implications
Qwen (from Alibaba/Alibaba Cloud) shows an interesting disconnect between its short app‑store privacy report (which lists only device ID and basic app interactions) and a fuller privacy policy that includes more expansive terms. Apple and Google require vendors to self‑report privacy categories to the app stores, but those disclosures are self‑declared and not independently audited. Several summaries highlight this mismatch and recommend reading the full privacy policy on the vendor site rather than relying solely on the app‑store summary.
How vendors handle human review and moderation
Across vendors, it’s now commonplace to find explicit language about human reviewers reading chat content for quality, safety, and model improvement. Google is notably transparent about this and guides users to account settings for disabling certain activity logging. The presence of human review adds a privacy vector that cannot be removed by technical countermeasures alone — only by vendor policy changes or opting out via provided controls. Microsoft and other enterprise vendors offer stronger contractual routes to limit or exclude human review for tenant data, but consumer tiers may still be exposed.
Running models locally: the privacy‑first alternative
If you want maximum privacy, running an LLM locally is the cleanest solution — but it’s not frictionless.
Tools and practical paths
- Ollama: a desktop app that supports local model hosting on Windows, macOS, and Linux. Using Ollama or similar tools, you can run models on your own hardware and avoid sending prompts to vendor cloud services; see the sketch after this list.
- DeepSeek R1: an open‑source variant that community posts and walkthroughs document installing locally. Self‑hosting an R1‑style model can lower vendor exposure, but it requires compute and technical know‑how.
- On‑device AI: manufacturers and GPU vendors increasingly ship on‑device features (for example, Nvidia’s Chat With RTX demos), enabling private inference without a network. These are practical for certain use cases but generally lack the scale of cloud models.
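To make the privacy benefit concrete, here is a minimal sketch of querying a locally hosted model through Ollama’s local REST API, which listens on localhost:11434 by default. It assumes Ollama is installed and running and that a model has already been downloaded, e.g. with `ollama pull deepseek-r1`; the model tag is illustrative, so check Ollama’s model library for current names.

```python
# Minimal sketch: query a locally hosted model via Ollama's REST API.
# Assumes Ollama is running on its default port and the model was pulled
# beforehand (e.g. `ollama pull deepseek-r1`; the tag is illustrative).
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def ask_local_model(prompt: str, model: str = "deepseek-r1") -> str:
    """Send a prompt to the local model; the request never leaves this machine."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one complete JSON reply instead of a token stream
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["response"]

print(ask_local_model("Summarize the privacy benefits of local inference."))
```

Because the endpoint is loopback‑only by default, prompts and responses stay on your hardware: no cloud retention, no human review, no vendor telemetry.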
What you need for local inference
- A capable CPU/GPU and ample RAM — some modern LLMs require tens of gigabytes of memory, or a consumer‑grade GPU, for usable performance (a rough sizing sketch follows this list).
- Local runtime tools (Ollama, local containers, or vendor open‑source runtimes).
- Appropriate model weights and licenses — verify licenses for commercial use.
- Patience for setup and an understanding that local models may be smaller (less capable) than cloud variants.
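As a rule of thumb for the memory requirement above: the weights alone occupy roughly parameter count times bytes per weight, and quantization shrinks that footprint. A hedged back‑of‑the‑envelope sketch (figures are ballpark estimates, not vendor specifications):

```python
# Back-of-the-envelope sizing: memory needed just to hold model weights.
# Real usage is higher (KV cache, activations, runtime overhead), so treat
# these numbers as lower-bound ballpark estimates.

def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only footprint: parameters x bytes per weight, in decimal GB."""
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

for params, bits in [(7, 16), (7, 4), (70, 4)]:
    print(f"{params}B parameters at {bits}-bit: "
          f"~{weight_memory_gb(params, bits):.1f} GB for weights alone")
# Prints roughly 14.0, 3.5, and 35.0 GB: a 4-bit 7B model fits on a consumer
# GPU, while a 70B model needs high-end or multi-GPU hardware.
```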
Benefits and limits
- Benefits: prompts never leave your device, no human reviewers or cloud storage, and full control of model updates and data handling.
- Limits: lower model capacity (unless you have high‑end hardware), less up‑to‑date web grounding, and more hands‑on maintenance.
Practical guidance: reduce your exposure today
Whether you prefer Copilot, Gemini, ChatGPT, or a local solution, the following steps reduce risk.
- Use browser‑based versions when possible. App‑store privacy reports sometimes overstate sensor access because native apps request OS‑level permissions; browser sessions limit access to device sensors and may not transmit the same telemetry.
- For consumer chatbots, assume prompts are used for improvement unless an enterprise non‑training contract exists. If you share regulated data, move to a vendor enterprise plan with contractual non‑training clauses or self‑host.
- Disable chat history and review account activity controls if the vendor offers them (Gemini and Google make these settings visible). Doing this reduces the chance of future human review or long‑term retention.
- Don’t paste personal identifiers, credentials, or health/finance data into consumer chatbots.
- Use tenant‑scoped Copilot or enterprise ChatGPT/Anthropic offerings for work involving sensitive PII or regulated data — these plans can provide contractual protections and compliance assurances.
- For the highest privacy, evaluate local LLMs (Ollama, DeepSeek R1) or on‑device inference options; budget for hardware and maintenance.
Critical analysis: strengths, policy gaps, and systemic risks
Microsoft Copilot — strong enterprise privacy posture, limited to Microsoft ecosystem
Copilot’s enterprise grounding and non‑training guarantees are real strengths for organizations that can adopt it. The tradeoff is vendor coupling: organizations become dependent on Microsoft’s governance and contractual model, and that coupling is by design. Copilot’s privacy posture wins on paper and in practice for tenant data; that’s why many Windows‑centric organizations see it as the default safe option.
Google Gemini — rich data, transparent controls, but a broader attack surface
Gemini’s broad data collection enables advanced multimodal features. Google’s transparency about deletion and human review is positive, but the sheer volume of listed categories increases attack surface and regulatory exposure. Users must weigh functionality against this expanded collection list.
OpenAI ChatGPT — convenience vs. training exposure
ChatGPT remains the most generalist consumer choice. For users who need strong privacy, the default public training posture is a drawback. OpenAI offers enterprise solutions that change the terms, but casual users should assume their prompts can contribute to model improvement unless they specifically opt into other arrangements.
DeepSeek and Qwen — geopolitical and transparency questions
DeepSeek’s performance/price value proposition is compelling, but jurisdictional questions and vendor assertions about training costs or market impacts remain only partially verifiable. Qwen’s short app reports versus its longer policies illustrate the importance of checking full privacy policies, not just app‑store summaries. Both vendors underscore a broader market reality: not all privacy risk is technical; legal and geopolitical contexts matter.
Systemic risk factors
- Human reviewers: across vendors, human review is often part of the lifecycle; disabling history or buying enterprise contracts are the only reliable counters.
- App‑store privacy reports: these are self‑reported and not independently audited; they are a starting point but not definitive.
- Vendor claims: be skeptical of dramatic vendor figures (training cost, market‑cap impacts); they are often unverified and should be treated as marketing until audited.
A checklist for privacy‑conscious Windows users
- Prefer Microsoft Copilot inside a managed Microsoft 365 tenant for work that needs protection.
- For creative or multimodal tasks that require Gemini features, don’t store or submit sensitive PII; disable history where possible.
- Assume ChatGPT public training unless you’re under an enterprise agreement that says otherwise.
- If you need absolute confidentiality for high‑risk workloads, self‑host or use on‑device inference with tools such as Ollama.
- Avoid mobile apps if you can use browser/web versions to limit sensor and OS‑level telemetry; read full privacy policies before trusting app‑store summaries.
Conclusion
The landscape is simple to describe but complex to act in: Microsoft Copilot currently offers the clearest route to minimal data collection for tenant data because it isolates prompts inside tenant governance and does not, by policy, feed them into public foundation‑model training. Google Gemini collects a broad set of data types and is transparent about human review and activity controls. OpenAI’s ChatGPT is broadly convenient but more likely to contribute prompts to model training in consumer tiers unless enterprise contracts say otherwise. China‑based entrants like DeepSeek complicate the picture further with jurisdictional and verification concerns.
For those seeking the highest privacy: either run models locally (Ollama, self‑hosted R1 variants) or place sensitive use under enterprise contracts that explicitly forbid training and human‑review exposure. For everyone else, treat consumer chatbots as convenience tools: useful, but never a safe place for secrets.
This is not a static field: vendor policies, training practices, and product models evolve rapidly. Always verify the most current privacy policy and enterprise terms before uploading regulated or sensitive data to any chatbot.
Source: PCMag Australia, “ChatGPT, Copilot, DeepSeek, or Gemini: Which AI Chatbot Collects the Least of Your Data?”