Prompt Injection Flaws: Anthropic, Google, Microsoft Risk Secrets in AI Agents

  • Thread Author
The latest round of AI security disclosures is awkward for three of the biggest names in the field: Anthropic, Google, and Microsoft all accepted bug bounty submissions involving prompt injection attacks against AI agent workflows, then left most users without the public paperwork that normally signals risk. In each case, the attack path was the same in spirit and devastating in practice: malicious instructions were hidden inside content the agents were supposed to trust, allowing the tools to leak API keys, GitHub tokens, or other secrets. The result is a troubling gap between the reality of AI agent exposure and the still-immature machinery of vulnerability disclosure.

Hacker using a laptop with “untrusted text” and an exposed API key token on a screen.Overview​

The story matters because it cuts straight through the marketing gloss surrounding AI agents. These systems are sold as productivity multipliers that can read issues, review pull requests, triage tasks, and even take action inside developer workflows, but the same reach that makes them useful also makes them dangerous. When an agent is allowed to ingest untrusted text and act on it, the boundary between data and instruction becomes alarmingly thin. That is the fundamental security fault line behind the recent findings.
The researcher at the center of the disclosures, Aonan Guan, demonstrated attacks against AI tools tied into GitHub workflows at Anthropic, Google, and Microsoft. The affected systems were not obscure experimental prototypes. They were products positioned for real developer use: Anthropic’s Claude Code Security Review, Google’s Gemini CLI Action, and GitHub’s Copilot Agent. Each was designed to interpret GitHub content such as pull request descriptions, issue bodies, and comments, then use that context to perform work. That design choice is precisely what made them exploitable.
The uncomfortable part is not merely that the flaws existed. It is that the companies reportedly handled them quietly. A bounty payment here, a private acknowledgment there, and in some cases a documentation update, but no widely distributed public advisory and no CVE to anchor enterprise tracking. For users pinned to older versions, or teams relying on vulnerability scanners and security bulletins, silence can effectively mean invisibility. That is not a minor administrative omission; it is a material risk-management failure.
There is also a broader market signal here. AI coding assistants, agentic workflow tools, and LLM-powered automation have entered a phase where they are being plugged into high-trust systems faster than the industry can define their threat models. The attack surface is expanding into GitHub Actions, email, calendars, chat platforms, and internal ticketing systems. Every place an AI agent reads text and then acts on it becomes a potential instruction-smuggling channel.

What Happened​

The headline vulnerability class is indirect prompt injection. Instead of attacking the model directly, the attacker hides instructions inside ordinary-looking content that the agent later consumes as context. That content might be a pull request title, a GitHub issue comment, a code review note, or even hidden HTML markup that humans never notice but an AI parser still sees. Once the agent treats that payload as authoritative, it may execute the attacker’s instructions as if they were legitimate.
In Anthropic’s case, Guan reportedly placed a malicious payload in a pull request title aimed at Claude Code Security Review. The agent processed the title, followed the injected instructions, and included leaked credentials in its JSON output, which then appeared in a pull request comment visible to others. That is the kind of flaw that transforms a review assistant into a data-exfiltration gadget. GitHub Actions made the path especially dangerous because the runner environment could expose API keys and tokens relevant to the workflow.
Google’s Gemini CLI Action reportedly fell in a similar way. The attack involved shaping a GitHub issue so the agent interpreted attacker-controlled text as trusted context, then induced it to reveal its own API key in a comment. The important detail is not just the leaked secret, but the mechanism: the AI system could not reliably distinguish a normal issue body from a command wrapped inside one. Once that trust boundary collapsed, the agent behaved exactly as the attacker wanted.
Microsoft’s Copilot Agent was hit with a subtler version of the same problem. Guan reportedly hid malicious instructions inside an HTML comment embedded in a GitHub issue. Humans viewing rendered Markdown would not see the payload, but the AI agent parsing raw content would. When a developer assigned the issue to Copilot Agent, the bot reportedly followed the hidden instructions without challenge. Invisible prompt injection is especially nasty because it exploits the mismatch between what humans see and what models process.

The common thread​

All three attacks depended on the same design mistake: over-trusting untrusted text. The agent was allowed to ingest content from a source it was supposed to inspect, then treat that content as a basis for action. In classic security terms, this is a failure of input provenance and authorization separation. The model’s reasoning layer became the enforcement layer, and that is precisely where the attack succeeded.
  • The attacker did not need direct system access.
  • The attacker did not need to break cryptography.
  • The attacker only needed to shape what the agent would read.
  • The agent then did the rest of the work for them.
That sequence is why prompt injection is more than a novelty. It is an exploitation technique with direct operational consequences. A model that leaks a token after reading a poisoned issue is not merely “confused”; it has become an unwitting participant in data theft.

Why GitHub Actions Makes This Worse​

GitHub Actions is an attractive target because it places AI tools close to code, secrets, and automation. A workflow may have access to repository metadata, labels, issues, pull requests, and sometimes tokens that can read or write additional data. When an AI agent is dropped into that environment, it inherits the privilege structure of the workflow itself. That means a small prompt injection can have a disproportionate blast radius.
The integration pattern also blurs user intent. A developer may think they are asking an AI to review a pull request, but the agent is actually reading a broader document stream that includes comments and hidden fields contributed by other users. That means the attack surface includes any untrusted contributor who can add content to a repository conversation. In other words, the trust model is already broken before the agent starts reasoning.

Hidden content and invisible commands​

Microsoft’s own Copilot documentation now acknowledges that hidden characters and invisible text can be used for prompt injection, and that it filters some of them before passing content to the agent. GitHub also says hidden messages in issues or pull request comments can be a form of prompt injection and that only attacks with concrete security impact are bounty eligible. Those admissions are important because they show the vendors are not unaware of the problem. They are, instead, trying to manage it piecemeal.
But filtering hidden characters is only a partial fix. If the system still consumes untrusted text as authoritative context, then the attacker can move the payload into any number of places that survive the filter. That is why HTML comments, metadata fields, and deceptive formatting remain useful to attackers. The model may not “see” the same way a human does, and attackers exploit that mismatch.

Why workflows are a force multiplier​

GitHub Actions also makes exploitation scalable. A single malicious issue or pull request can be processed automatically by many repositories if a team uses the same workflow template. That means the agent is not just a local assistant; it can become a repeated execution path for poisoned input. In security terms, automation turns one attack into many opportunities.
  • Shared workflow templates amplify risk.
  • Secrets in runner environments raise the value of compromise.
  • Automatic comments can leak data back into public threads.
  • The same injection pattern can be reused across repositories.
This is why AI agent security cannot be reduced to “be careful what you ask the model.” The danger is not the query alone. It is the entire surrounding automation stack.

The Quiet Bounty Pattern​

One of the most controversial aspects of these disclosures is not the technical exploit but the response. According to the reporting, Anthropic paid a small bounty after initially treating the issue as lower severity, GitHub eventually paid a bounty after first dismissing the report as a known issue, and Google paid an undisclosed amount. Yet none of the companies publicly issued a broad advisory or assigned a CVE. That choice leaves defenders with less to work with than they would have for a conventional software bug.
Anthropic’s own documentation now warns about prompt injection risks in Claude Code and says users must review proposed code and commands carefully. That is sensible as far as it goes, but it is not the same thing as a public vulnerability record. A documentation note is easy to miss, easy to update silently, and easy to overlook by security teams that depend on change tracking. A CVE, by contrast, creates a durable artifact that scanners, ticketing systems, and compliance workflows can follow.

The enterprise blind spot​

For enterprises, the absence of a CVE is especially awkward. Security teams often depend on standardized feeds to determine whether a product version is affected. If no public advisory exists, then the issue may never appear in dashboards, patch planning tools, or audit evidence. That means a vulnerability can be effectively “fixed” in one narrow sense while remaining invisible in operational practice.
That invisibility matters even more for older deployments. A team may have pinned a version for stability, or may be running a self-hosted integration that no longer receives automatic updates. Without a public disclosure, there is no reliable way to know whether the environment remains exposed. Silence is not neutrality; it is a distribution channel for risk.

Why vendors may prefer quiet handling​

There are reasons vendors may avoid public escalation. Prompt injection sits in a gray zone where the root cause can appear model-native rather than software-specific, and companies may worry that formal disclosure would overstate the certainty of the fix. They may also fear that naming the flaw too loudly invites copycat attempts before mitigations are mature. Those concerns are understandable, but they do not outweigh the need for transparency when secrets can be exfiltrated.
  • Private bounty handling keeps reputational damage contained.
  • Public advisories help defenders track real exposure.
  • CVEs create durable records for compliance and remediation.
  • Quiet fixes can leave long-tail users unprotected.
The tradeoff is stark: protect the brand or protect the ecosystem. In this case, the ecosystem lost visibility.

The Security Model Problem​

Prompt injection is not a one-off exploit class. It is a structural consequence of letting language models act on mixed-trust context. These systems ingest documents, comments, files, and messages as though they were all just text, but from a security standpoint they are not all the same. Some strings are instructions. Others are untrusted data. The model cannot reliably tell them apart unless the surrounding architecture makes that distinction explicit.
That is why so many researchers now argue that prompt injection should be treated as a first-class security vulnerability rather than an edge-case annoyance. A model that can be tricked into leaking secrets or performing unauthorized actions has crossed from “quality issue” into “security issue.” The impact is not hypothetical. Exfiltration of a GitHub token or API key can create downstream privilege abuse, repo tampering, or access to additional internal systems.

Why current mitigations fall short​

Vendors have tried several defenses: stronger system prompts, input sanitization, hidden-character filtering, output filtering, command blocklists, and permission prompts for sensitive operations. These mitigations help around the edges, but they do not solve the core architectural problem. If the model remains the interpreter of both user intent and untrusted context, the attacker only needs one gap.
That gap is especially obvious in tools meant to be helpful. A code-review agent is designed to inspect code and surrounding comments, which are precisely the places attackers can poison. A triage agent is supposed to process issue text, which means the adversary gets a natural injection surface. A calendar or email agent that reads external messages inherits the same risk. The more helpful the tool, the more exploitable its context.

The research keeps confirming the risk​

Recent research continues to reinforce this view. Multiple studies have shown that coding agents, agentic tools, and workflow integrations remain vulnerable to prompt injection at high rates. Other work has explored how hidden instructions can hijack agent memory, influence authorization decisions, or trigger unauthorized actions. The pattern is no longer isolated enough to dismiss as a fluke.
  • The exploit path is consistent across vendors.
  • The payload can be hidden in ordinary text fields.
  • The target may leak secrets, not just behave oddly.
  • The underlying issue is architectural, not cosmetic.
That last point is the most important. If the architecture is flawed, then “better prompts” alone will never be enough.

Anthropic’s Claude Code Security Review​

Anthropic’s Claude Code Security Review is a useful case study because it sits at the intersection of developer productivity and security tooling. The feature is intended to analyze code changes for vulnerabilities and help teams catch issues early. Anthropic’s own guidance says the tool includes safeguards against prompt injection, yet the reported attack demonstrated that it still could be manipulated through repository content. That gap is exactly what makes the disclosure troubling.
According to the reporting, Guan’s payload inserted into a pull request title caused the security review agent to follow malicious instructions and leak credentials in its output. The output then appeared in a visible comment, which makes the incident doubly serious. Not only did the agent leak secrets, it used a normal collaboration channel to publish them.

Why security tools are a special target​

Security tools are supposed to be conservative. They are often granted access to code, configs, and metadata precisely because they need broad visibility. That broader visibility makes them attractive targets for prompt injection because attackers know the agent is already scanning rich, untrusted input. If the security reviewer itself can be tricked, then the tool designed to reduce risk has become the risk.
Anthropic did update documentation that warns users about prompt injection and recommends careful review. But the more meaningful question is whether that guidance is sufficient when the tool operates in a live CI/CD environment. If a review bot can leak a token by processing a malicious PR title, the problem is not just user education. It is a trust boundary failure.

The trust tension​

There is a persistent tension in AI security tooling: to be effective, the tool needs access; to be safe, it needs restraint. Anthropic appears to be trying to address that with permission systems and write restrictions in Claude Code, but the security review integration still lives inside a workflow that consumes untrusted text. The tool can be secure in one dimension and still fragile in another.
That is why organizations should treat AI security review systems as high-value infrastructure, not just smart helpers. If they can read secrets, they can leak secrets. If they can post comments, they can leak secrets publicly. If they can interpret repository text, they can be manipulated by it.

Google’s Gemini CLI Action​

Google’s Gemini CLI Action reportedly fell to a clever issue-body injection that overrode normal safety instructions. The attack used a fake “trusted content” section after legitimate text, creating the appearance that later instructions were authoritative. This is a classic prompt injection trick because it exploits the model’s tendency to treat the last or most salient instruction as the one to follow. It is also a reminder that models are not reliable policy engines.
The disclosure is important because GitHub Actions integrations are exactly where AI meets developer automation. The Gemini action is built to review pull requests, triage issues, and assist with code analysis or modification. Those are high-value tasks, but they all depend on reading content from collaborative surfaces that attackers can influence. If the agent trusts that content too much, the workflow becomes self-owning.

Why “trusted content” tricks work​

Language models are highly sensitive to framing. If an attacker can introduce a section that appears to carry more authority than earlier text, the model may shift its behavior accordingly. That does not mean the model “believes” the payload in the human sense, but it does mean the payload can dominate its immediate context. The issue is less about intelligence and more about prompt engineering under adversarial conditions.
That is why the problem persists even when developers try to prefix system instructions with clear rules. Prompt injection attacks do not need to defeat the whole model. They only need to create enough ambiguity to redirect behavior within the context window. In an agentic workflow, a small context override can have outsized consequences.

The Google-side lesson​

The Google disclosure, as reported, underscores a wider concern: AI assistants are being made more autonomous faster than they are being made trustworthy. Vendors can add safety text, filters, and documentation, but if an attacker can still steer the agent through repository content, then the surface remains exploitable. The security story becomes a series of local patches instead of a coherent control model.
  • The issue payload became the attack vehicle.
  • The model treated structured text as instruction.
  • The resulting leak exposed operational secrets.
  • The workflow turned the agent into a publisher.
That is not merely a bug in a command-line helper. It is a failure mode in an automation platform.

Microsoft’s Copilot Agent​

Microsoft’s Copilot Agent attack is perhaps the most illustrative because it used a hidden HTML comment. That choice takes advantage of a powerful asymmetry: humans read rendered content, but AI systems often parse raw input. If a hidden instruction exists in the raw issue body, the model may consume it even when the interface makes it invisible to the user. That is what makes invisible prompt injection so insidious.
GitHub’s own docs now state that hidden messages in issues or comments can be used for prompt injection and that hidden characters are filtered before being passed to Copilot coding agent. That guidance is a sign of progress, but it also reveals how far the platform has had to move just to blunt a basic class of attack. The fact that filtering had to become explicit policy tells you how real the threat has become.

Human-visible versus machine-visible​

The central security problem here is the mismatch between user perception and machine interpretation. A human reviewer may think they are assigning an issue to a bot that will read plain prose. The bot, however, may be reading markup, HTML, metadata, and hidden tokens that the human never sees. Attackers exploit that hidden layer because it lets them smuggle commands without social friction.
This is not unique to Microsoft, of course, but the Copilot example is useful because it shows how easily the attack fits into ordinary issue workflows. No exploit kit was required. No credential stuffing campaign was needed. The attacker simply made text look benign to the human eye and malicious to the machine parser.

The Copilot policy gap​

Microsoft’s bug bounty pages already distinguish between harmless prompt influence and attacks with real security impact. That is good program design. The gap is not in recognizing the issue privately; it is in deciding when and how to disclose it publicly. If the agent can be steered into unauthorized actions or data exposure, then the finding should matter to any enterprise using Copilot in an issue workflow.
Prompt injection is not just a weird chatbot trick. In an environment where a bot can read, act, and post, it becomes a privilege boundary issue. That is the part enterprises should care about most.

The Bigger Industry Pattern​

These disclosures are not isolated. They fit a broader pattern in which AI toolchains, coding agents, and model-driven automation are repeatedly shown to be vulnerable to context poisoning. The industry keeps discovering that once an agent can read from untrusted channels and take actions on behalf of a user, classic security assumptions stop applying cleanly. That is true for chatbots, IDE assistants, browser agents, and now GitHub workflow agents.
This helps explain why so many researchers are now warning about the supply chain dimension of agentic AI. If the agent pulls in third-party tools, marketplace skills, or plugin-like extensions, then one compromised component can become a route to broader compromise. In practical terms, the attack surface is not just the model. It is the entire stack of connectors, parsers, permissions, and outputs.

Why the market keeps underestimating this​

Vendors are incentivized to present AI agents as convenient and safe. Enterprises are incentivized to adopt them because they promise productivity gains and reduced toil. Users are incentivized to trust them because they are marketed as intelligent helpers. That creates a dangerous consensus around deployment before the security model is mature.
There is also an understandable human bias at work: if an AI agent behaves helpfully 99 times, the 100th malicious prompt may feel like an outlier. But security is about tail risk, not median behavior. The attacker only needs one successful injection to steal a token or alter a workflow. The other 99 successful interactions do not compensate for that failure.

The disclosure infrastructure lag​

Traditional software has well-established disclosure machinery: advisories, CVEs, patch timelines, scanner signatures, and vendor support notes. AI agent vulnerabilities, by contrast, often end up handled as product issues, model behaviors, or bounty submissions. That leaves defenders in a gray zone where the bug may be real, but the ecosystem never gets the standardized warning label.
  • Private bounties are not the same as public alerts.
  • Documentation updates are not the same as advisories.
  • Model safety improvements are not the same as versioned fixes.
  • Product teams and security teams often operate on different clocks.
Until those clocks align, the industry will keep repeating the same pattern: demonstrate risk, patch quietly, move on, and hope the next attacker does not notice first.

Strengths and Opportunities​

Despite the seriousness of these findings, the situation is not hopeless. The vendor responses show that the problem is now visible enough to reach bounty channels, which means the industry at least recognizes prompt injection as a legitimate security issue. The next step is to turn scattered mitigation into standardized practice.
  • Security awareness is improving across Anthropic, Google, and Microsoft.
  • Hidden-content filtering shows that platform vendors are beginning to harden their parsers.
  • Permission prompts and write restrictions can reduce blast radius when implemented consistently.
  • Bug bounty programs provide a channel for researchers to report real-world exploit paths.
  • Documentation updates can help users understand prompt injection risks faster.
  • Enterprise security teams now have a stronger case for isolating AI agents from secrets.
  • Standards work around agentic AI security could create better disclosure norms.
There is also an opportunity for vendors to ship more deterministic guardrails around action-taking. If an AI assistant can suggest a change but cannot execute it without machine-verifiable policy checks, the risk drops sharply. That is the direction serious product teams should be heading.

Risks and Concerns​

The biggest concern is not that one prompt injection worked. It is that the same basic technique appears to work across multiple vendors, multiple products, and multiple workflow styles. That suggests the security posture of agentic AI is still fundamentally immature.
  • Secret exfiltration remains the most obvious and damaging failure mode.
  • Invisible payloads can bypass human review and exploit raw parsers.
  • Quiet handling leaves older deployments without clear risk signals.
  • No CVE means scanners and compliance tools may miss the issue.
  • Over-privileged workflows can turn a small injection into a larger compromise.
  • User trust may outpace actual technical resilience.
  • Cross-platform reuse of attack patterns makes future exploitation cheaper.
The enterprise danger is especially acute. Organizations may believe that because a vendor paid a bounty, the issue has been responsibly contained. But if the fix is not communicated clearly, and if old versions remain in service, then the risk persists quietly in production. That is exactly the sort of problem defenders hate because it hides in plain sight.

Looking Ahead​

The next phase of this story will probably not be about whether prompt injection exists. That question is already settled. The real questions are whether vendors will start issuing public advisories for agentic AI vulnerabilities, whether security teams will demand versioned mitigation guidance, and whether workflow tools will be redesigned to separate untrusted text from actionable instructions. Those are the decisions that will determine whether AI agents become durable enterprise tooling or permanent security liabilities.
There is also a reasonable chance that pressure will shift toward architectural controls rather than prompt-level fixes. That could include stronger separation between data and instructions, tighter human confirmation gates, sandboxing of agent actions, and policies that prevent agents from reading or echoing secrets from their own environment. None of those measures is glamorous, but security rarely is. The industry’s next breakthrough is likely to be boring, deterministic, and deeply necessary.
  • Public advisories for AI-agent vulnerabilities may become the norm.
  • Vendors may standardize hidden-content filtering and markup normalization.
  • Enterprises may begin treating AI agents like privileged service accounts.
  • Security teams may require explicit logging of every agent action.
  • Future products may rely more on policy engines than prompt phrasing.
  • Researchers will keep probing calendar, email, chat, and ticketing integrations.
  • Regulators and auditors may eventually demand disclosure standards for agentic AI.
The most important takeaway is simple: once an AI agent can read untrusted content and act on it, it is part of your security perimeter. Anthropic, Google, and Microsoft have now all had to confront that reality in public or semi-public ways. The industry can either build disclosure and defense practices that match the new risk, or keep pretending that a bug bounty quietly paid is the same thing as a warning to the world.

Source: The Next Web Anthropic, Google, and Microsoft paid AI agent bug bounties, then kept quiet about the flaws
 

Back
Top