Zscaler’s claim that its cloud sees “over half a trillion transactions a day” has suddenly become more than a brag about scale: it is now the center of a fresh privacy controversy. External reports and researcher commentary interpreted CEO remarks to mean Zscaler uses customer logs and full URLs from that data stream to train its AI systems, while Zscaler’s own statements insist it does not use customer-identifiable data for model training. (sdxcentral.com, zscaler.com)
Background
Zscaler is a US-based cloud security vendor best known for its Zero Trust Exchange platform, which inspects and routes enterprise traffic to apply security controls, data loss prevention, and threat detection. The company publicly disclosed that its platform crossed a milestone of roughly 500 billion daily transactions in 2024 and has repeatedly flagged the volume of telemetry its cloud processes as a competitive strength.

In August 2025 the issue exploded into public view after media coverage summarized CEO comments and security researchers reacted strongly to language suggesting Zscaler leverages the massive log stream to power AI features. Independent reporting interpreted the CEO’s remarks as saying Zscaler uses transactional logs — including structured and unstructured elements and full URLs — as training material for internal AI models. Other outlets likewise reported the company’s claim of trillions of logs used to train defensive AI. (sdxcentral.com, thestack.technology)
Zscaler responded with a published statement asserting a narrower approach: it says customer proprietary or personal data is not used to train shared AI models, and that only aggregated, non-identifying metadata or platform signals are used to improve detection models — while each customer’s raw logs remain inside tenant boundaries. The company’s blog framed this as “data containment” and said sensitive information never leaves a tenant for model training.
What was actually said — and why it matters
The CEO’s remarks and press summaries
At a Cloud Security Alliance event and on earnings calls, Zscaler CEO Jay Chaudhry referenced the platform’s data volume and said the company leverages the platform’s signals to build AI capabilities. Several reports paraphrased his comments as: Zscaler processes “over 500 billion transactions per day” and “leverages proprietary logs” to train models, with phrasing that highlighted full transaction logs and URLs. That wording sparked alarm because logs at this scale can and do contain sensitive indicators: URLs with tokens, query parameters, cloud storage object names, internal hostnames, and other identifiers. (sdxcentral.com, thestack.technology)
The company’s formal position
Within a published company blog post and corporate communications, Zscaler clarified that it does not use customer personal data or proprietary tenant content to train shared AI models. Instead, Zscaler says it uses metadata and aggregated signals — technical telemetry such as traffic patterns, risk scores, and anonymized patterns — to improve platform-wide models, while maintaining data isolation per tenant. That distinction is central to the debate: the difference between (A) training on raw, tenant-linked logs (full URLs, raw request/response bodies and content), and (B) training on aggregated, de-identified signals derived across tenants.
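To make the (A)/(B) distinction concrete, here is a minimal sketch in Python. It is illustrative only: the record fields, the feature set, and the bucketing are assumptions made for the example, not a description of Zscaler’s actual pipeline.

```python
from urllib.parse import urlsplit

# (A) A hypothetical raw transaction log record: tenant-linked and high fidelity.
# A single record like this can carry secrets such as signed URLs, document
# names, and internal hostnames.
raw_record = {
    "tenant_id": "acme-corp",
    "user": "j.doe@acme-corp.example",
    "url": "https://storage.example.com/acme-finance/q3-forecast.xlsx?sig=eyJhbGciOi...",
    "bytes_out": 48213,
    "verdict": "allowed",
}

def derive_training_signal(record: dict) -> dict:
    """(B) Derive an aggregated, non-identifying signal from a raw record.

    Only coarse technical features survive; tenant, user, path, and query
    string are deliberately dropped. This is the kind of "metadata only"
    behavior customers should ask vendors to demonstrate.
    """
    host = urlsplit(record["url"]).hostname or ""
    return {
        "dst_domain_category": "cloud-storage" if "storage" in host else "other",
        "bytes_out_bucket": "10k-100k" if 10_000 <= record["bytes_out"] < 100_000 else "other",
        "verdict": record["verdict"],
    }

print(derive_training_signal(raw_record))
# {'dst_domain_category': 'cloud-storage', 'bytes_out_bucket': '10k-100k', 'verdict': 'allowed'}
```

Whether a vendor’s pipeline really stops at (B), and never retains records like (A) for model development, is exactly the point customers are asking Zscaler to evidence.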
Why researchers reacted strongly
Security researchers and frustrated administrators reacted because the phrasing used in earnings calls and media reports — mention of “proprietary logs,” “full URLs,” and “complete logs” — reads to many like an admission of training on high-fidelity customer records. The perceived risk is not just theoretical: if raw logs or insufficiently sanitized URL strings were used in model training and those models later produced outputs that inadvertently regurgitated training data, that could expose organization-specific facts. The outcry reflects concern over contractual transparency, regulatory compliance, and practical attack surfaces created when telemetry is repurposed beyond operational security uses. (thestack.technology, sdxcentral.com)
Verifiable technical and contractual points
The public record includes both operational facts and corporate assertions. These are the most load-bearing, verifiable items and where they stand today:
- Zscaler’s platform processes massive telemetry — the company and its filings assert volumes in the hundreds of billions of transactions per day. That scale is independently reported and repeatedly claimed in Zscaler material. (zscaler.com, sec.gov)
- Reporting by multiple outlets summarized CEO remarks that the company uses “complete logs,” including full URLs, and that those logs feed a “massive data lake” used to train AI models powering platform features. These reports are based on direct quotes and public earnings or event commentary. (thestack.technology, sdxcentral.com)
- Zscaler’s corporate blog and public statements explicitly claim that customer personal or proprietary information is not used to train shared AI models, and that metadata and aggregated signals (non-identifying) are the source for training. This is Zscaler’s official containment position.
- Zscaler’s SEC filings and investor materials describe how logs are handled, encrypted in transit, and stored to customer-chosen destinations with isolation in a multi-tenant architecture, while also noting the company’s analytics and threat detection operate on streaming logs. These filings are helpful to understand the contractual and architectural promises around data residency and access.
Analysis: strengths, risks, and ambiguous boundaries
Strengths of Zscaler’s approach (if implemented as claimed)
- Network-effect detection gains: Using aggregated platform-wide signals to detect global threats is a long-standing security model. If Zscaler truly trains models on non-identifying metadata and patterns, customers benefit from signals seen across thousands of tenants without direct data sharing.
- Scale for threat hunting: A dataset capturing the shape of half a trillion daily transactions provides the statistical power to detect rare or novel threat patterns that smaller deployments would miss.
- Operational efficiency: Centralized models that improve over time can deliver better zero-day detection and faster response — a key value proposition for inline security clouds.
- Architectural options: Zscaler claims tenancy isolation, encryption-in-transit, and the ability for customers to choose log storage regions — important controls for regulatory compliance if enforced correctly. (zscaler.com, sec.gov)
Key risks and failure modes
- Ambiguity in language: The difference between “metadata” and “complete logs” is not merely marketing nuance. Definitions matter. If a URL path contains an access token, document ID, or other secret, treating that as “metadata” without removing secrets can leak sensitive identifiers into training datasets (see the redaction sketch after this list).
- Re-identification risks: Even “de-identified” data can be re-identified when combined with other signals. URLs and unstructured fields sometimes include organization-specific tokens, usernames, file names, or cloud-object URLs that tie back to a tenant.
- Model memorization and exposure: Modern generative models can memorize and reproduce training inputs. If raw or insufficiently sanitized logs were used in model training, there’s a risk that model outputs could echo tenant-specific content under some prompts or hallucinations.
- Contractual and regulatory exposure: Customers in regulated sectors (healthcare, finance, defense) often require strict contractual assurances that their data will not be used for secondary purposes like model training. A mismatch between marketing language, earnings call phrasing, and contractual commitments could create liability or breach-of-contract risk.
- Insider and access control risk: Training pipelines and model-development environments require access to data. Inadequate separation between operational logs and R&D assets can expand the attack surface to include developer access, third-party vendors, or misconfigured storage.
- Transparency and auditability gaps: Customers need cryptographic and contractual proof about what was used to train shared models. Without verifiable attestations (e.g., cryptographic hashes of training sets, model provenance docs), customer trust is fragile. (thestack.technology, sdxcentral.com)
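To illustrate the first two risks above, the following is a minimal redaction sketch. The parameter names, the opaque-segment heuristic, and the example URL are assumptions made for illustration; they are not Zscaler’s documented controls, and any production sanitization list would need to be far broader.

```python
import re
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Query parameters that commonly carry secrets or identifiers (illustrative list).
SENSITIVE_PARAMS = {"sig", "signature", "token", "access_token", "sas", "key", "code"}
# Path segments that look like GUIDs or long opaque tokens (document IDs, signed blob names).
OPAQUE_SEGMENT = re.compile(
    r"^[A-Za-z0-9_-]{20,}$"
    r"|^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)

def redact_url(url: str) -> str:
    """Strip likely secrets and identifiers from a URL, keeping only coarse structure."""
    parts = urlsplit(url)
    # Redact sensitive query parameter values but keep the parameter names.
    query = urlencode(
        [(k, "REDACTED" if k.lower() in SENSITIVE_PARAMS else v)
         for k, v in parse_qsl(parts.query, keep_blank_values=True)]
    )
    # Replace opaque path segments with a placeholder.
    path = "/".join(
        "REDACTED" if OPAQUE_SEGMENT.match(seg) else seg
        for seg in parts.path.split("/")
    )
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))

print(redact_url(
    "https://storage.example.com/acme-finance/3f9c2a1b-77aa-4c2e-9d11-0b2f6e8a1c55/report.xlsx"
    "?sig=eyJhbGciOiJIUzI1NiJ9&expires=1735689600"
))
# https://storage.example.com/acme-finance/REDACTED/report.xlsx?sig=REDACTED&expires=1735689600
```

Note that even after this pass the bucket name still points at a specific organization, which is the re-identification problem in practice: static redaction lists are never exhaustive, so customers should ask for attestations about the pipeline rather than rely on the word “metadata”.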
Governance and legal considerations
Regulatory context
- EU GDPR: Personal data used for model training triggers GDPR obligations (legal basis, DPIAs, data minimization, and potential rights requests). If any logs included IP addresses, usernames, or other identifiers retained for training, GDPR exposures could follow.
- Sector rules: HIPAA (healthcare), GLBA (financial), and defense contracting regulations (DFARS) impose specific constraints. Customers in those sectors must verify vendors’ data handling at contract and technical levels.
- State privacy laws: Emerging U.S. state privacy laws (e.g., CCPA-style statutes) raise additional consent and transparency obligations when customer data is used for secondary processing.
Contractual mitigations customers should demand
- Explicit model-training clauses: Prohibit any use of customer-identifiable data for third-party model training unless explicitly consented to in writing.
- Data usage and retention policies: State retention periods for logs used in analytics and training, and whether customers can opt out.
- Customer-managed keys and dedicated logging: Options for single-tenant deployments or BYOK (bring-your-own-key) models reduce provider access risk.
- Audit and attestation rights: The ability to audit model training pipelines or receive independent attestations that datasets were sanitized and only aggregated signals were used.
- Breach and output redaction obligations: Commitments that if model outputs are found to leak training data, immediate remediation and disclosure obligations are triggered.
Practical recommendations for IT teams and security architects
- Review contracts and DPAs now. Confirm explicit language about model training, telemetry reuse, and opt-out controls.
- Ask for technical detail and proof:
- What exactly is the “metadata” used for training?
- Are URLs and query strings sanitized? Which patterns are redacted?
- Where are training pipelines located, who has access, and how are datasets stored?
- Use customer-managed keys and regional log storage when available to limit provider-side access surface.
- Apply data classification and DLP before data reaches inline inspection services:
- Where possible, prevent downstream exposure of high-risk tokens or secrets in URLs or headers.
- Implement URL rewriting or token-masking at the proxy/enforcement node for high-risk flows.
- Negotiate right-to-audit clauses, and require an independent third-party attestation that training data is de-identified and non-recoverable.
- If operating in a highly regulated environment, consider single-tenant or on-prem alternatives for the most sensitive traffic, or require contractual guarantees that training pipelines are physically and logically isolated.
- Monitor model outputs and any copilot-style features for potential leakage; treat generative outputs as audit artifacts, as in the sketch below.
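On the last point, one lightweight way to treat generative outputs as audit artifacts is to scan them for planted canary strings and tenant-specific identifiers. The sketch below is generic: the canary values and the document-ID format are hypothetical and would need to match markers your organization actually seeds into its own traffic.

```python
import re

# Markers that should never appear in a vendor model's output for our tenant:
# planted canary tokens, internal hostnames, and anything resembling internal doc IDs.
CANARIES = ["CANARY-7f3a9d", "intranet.acme-corp.local"]
DOC_ID_PATTERN = re.compile(r"\bACME-DOC-\d{6}\b")  # hypothetical internal ID format

def audit_model_output(output_text: str) -> list[str]:
    """Return any findings that suggest a model output echoes tenant data."""
    findings = [c for c in CANARIES if c in output_text]
    findings += DOC_ID_PATTERN.findall(output_text)
    return findings

# Example: flag and escalate any output that trips the audit.
sample = "Similar incidents were seen at intranet.acme-corp.local (ref ACME-DOC-004217)."
hits = audit_model_output(sample)
if hits:
    print(f"Potential training-data leakage, escalate to vendor: {hits}")
```

Canaries only catch what you plant, so this complements, rather than replaces, the contractual and attestation controls above.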
Assessing Zscaler’s public assurances
Zscaler’s blog post is explicit: the company says it does not use customer proprietary or personal data to train its models, and that signals used for training are limited to aggregated telemetry that cannot be mapped back to a tenant. If that promise is followed in practice, implemented with robust technical controls and independent attestations, it addresses the main concerns.

But the public relations gap remains. Statements during earnings calls and in the press that describe “complete logs” and “full URLs” feeding a data lake undercut that assurance: customers and regulators will reasonably ask for concrete technical mechanisms and proof that those raw logs are not the ones used for training. The problem is resolvable, but only with transparent, verifiable controls and contractual clarity. (zscaler.com, thestack.technology)
What to watch next
- Third-party audits and attestations: The quickest path to restoring confidence is independent verification. Customers and auditors should seek SOC 2 / ISO attestations that specifically cover model-training pipelines and dataset provenance.
- Regulatory inquiries: Privacy regulators in Europe and data-protection authorities may request explanations or DPIAs if customers complain; any enforcement actions would raise the stakes for all inline security clouds.
- Customer-driven contract changes: Expect procurement teams to add model-training clauses and technical controls into SOWs (statements of work) and DPAs (data processing agreements).
- Industry precedent: How other security cloud vendors handle telemetry and AI training will matter; the sector may converge on standard language about what constitutes permissible training signals and what must be excluded.
Bottom line
- Zscaler operates at enormous scale and claims to leverage platform signals to power AI defenses — a capability that, if implemented carefully, can be beneficial for customers.
- Public reporting and earnings-call phrasing created understandable alarm by implying raw logs (including full URLs) were used to train models. Independent reports relayed those statements and researchers amplified the privacy implications. (sdxcentral.com, thestack.technology)
- Zscaler’s corporate response asserts a more limited practice: no use of customer personal or proprietary data for training shared models, and usage limited to aggregated, non-identifying signals. That assurance reduces legal and privacy risk if accompanied by demonstrable technical controls and contractual guarantees.
- For customers, the prudent path is defensive: obtain contractual clarity, insist on technical attestations, apply in-line DLP and token-masking where feasible, and require explicit audit rights. These steps preserve the benefits of a large security cloud’s threat intelligence while minimizing the legal and privacy tail-risk of model training pipelines.
Quick checklist for security teams (actionable)
- Demand explicit model-training language in vendor agreements.
- Require proof of data sanitization: what fields are redacted from URLs/headers before any training extraction.
- Insist on customer-managed encryption keys and regional data residency controls.
- Negotiate right-to-audit and third-party attestations for training datasets and ML pipelines.
- Enforce internal DLP to prevent sensitive tokens or PII from entering inline traffic streams.
- Maintain monitoring for hallucinations or outputs that could indicate memorized training data.
Zscaler’s episode is a timely reminder that in the AI era, language matters. Vendors and executives must be precise about what data they use, how it’s sanitized, and what contractual promises customers are buying. Customers, in turn, must translate marketing into enforceable technical and contractual controls — because scale without clarity can create systemic exposure for organizations operating under strict privacy and compliance obligations.
Source: BornCity ZScaler uses customer logs for AI training | Born's Tech and Windows World