Cisco’s latest security sweep has found that many of the most widely used open-weight large language models are alarmingly easy to manipulate with a small series of crafted prompts — and multi-turn (conversation) attacks are the most effective vector, producing success rates two to ten times higher than single-prompt attempts, with some models failing more than nine times out of ten in Cisco’s tests.
Open-weight models — sometimes called open-source or “open-weight” LLMs — are model weights that researchers, developers, and organizations can download, modify, and fine-tune. They power everything from research prototypes to production chatbots and on-premises inference stacks because they offer flexibility, lower cost, and transparency compared with closed commercial APIs. Cisco’s AI Defense team used its AI Validation platform to do a controlled, black-box assessment of several popular open-weight models and concluded that systemic vulnerabilities make many of them poor candidates for unguarded production use. The Cisco assessment examined eight widely used models: Alibaba Qwen3-32B, DeepSeek v3.1, Google Gemma 3-1B-IT, Meta Llama 3.3-70B-Instruct, Microsoft Phi-4, Mistral Large-2 (Large-Instruct-2047), OpenAI GPT-OSS-20b, and Zhipu GLM 4.5-Air. It ran automated adversarial prompts and engineered conversational (multi-turn) sequences to measure whether the models would produce disallowed outputs, leak system or private context, or otherwise be steered into unsafe behavior.
Open-weight models will remain a powerful and useful class of tools. The responsible path forward is not to abandon them but to adopt a security-first deployment model: test adversarially, choose models with alignment and safety evidence, enforce least privilege on data flows, and require human oversight for any action with tangible impact. Only with those layered controls can organizations keep the benefits of open models while reducing their operational risks.
Source: Petri IT Knowledgebase Cisco Warns of Major Flaws in Popular Open-Weight AI Models
Background / Overview
Open-weight models — sometimes called open-source or “open-weight” LLMs — are model weights that researchers, developers, and organizations can download, modify, and fine-tune. They power everything from research prototypes to production chatbots and on-premises inference stacks because they offer flexibility, lower cost, and transparency compared with closed commercial APIs. Cisco’s AI Defense team used its AI Validation platform to do a controlled, black-box assessment of several popular open-weight models and concluded that systemic vulnerabilities make many of them poor candidates for unguarded production use. The Cisco assessment examined eight widely used models: Alibaba Qwen3-32B, DeepSeek v3.1, Google Gemma 3-1B-IT, Meta Llama 3.3-70B-Instruct, Microsoft Phi-4, Mistral Large-2 (Large-Instruct-2047), OpenAI GPT-OSS-20b, and Zhipu GLM 4.5-Air. It ran automated adversarial prompts and engineered conversational (multi-turn) sequences to measure whether the models would produce disallowed outputs, leak system or private context, or otherwise be steered into unsafe behavior. What Cisco found — headline results
- Multi-turn attacks dominated. Across every tested model, multi-turn jailbreak and prompt-injection strategies were far more successful than isolated single prompts. Cisco reported multi-turn success rates that were between 2× and 10× higher than single-turn baselines.
- Mistral Large-2 was the most vulnerable in the study. In Cisco’s multi-turn tests, Mistral Large‑2 reached a 92.78% attack success rate in the scenarios evaluated — an exceptionally high figure that highlights the practical risk of dialogue-based exploitation.
- Alignment philosophy matters. Models developed with a capability-first approach — where safety is left mainly to downstream integrators — tended to show larger gaps between single- and multi-turn robustness. Models with heavier alignment emphasis by the lab showed smaller single/multi-turn gaps, though they were not immune.
How multi-turn adversarial attacks work
Multi-turn attacks exploit the conversational memory and context-handling behavior of chat-style models. Instead of trying to coerce a model with a single malicious prompt, an attacker:- Builds a sequence of seemingly benign or constrained prompts that gradually change the model’s internal context and prompt framing.
- Uses persona, roleplay, incremental escalation, or decomposition strategies to rephrase or split a harmful request into smaller parts that are each acceptable on their own but which combine to produce the forbidden output.
- Exploits response patterns and the model’s willingness to be helpful to reframe refusals into compliant instructions (for example, “explain for research”, “as a fictional scenario”, or “summarize the following code” tricks).
Comparative vulnerability analysis — model-by-model patterns
Cisco’s report (and contemporaneous industry coverage) identifies a pattern rather than a single failing: design priorities and alignment investments shape how models hold up to multi-turn adversarial pressure.- Models emphasizing capability and extensibility (examples: some Meta and Alibaba releases) often ship with fewer baked-in refusals and rely on integrators to implement safety. Those models showed dramatic multi-turn gaps.
- Models built with stronger safety alignment guardrails (examples: some Google and OpenAI releases in the broader ecosystem) showed more balanced performance between single- and multi-turn attacks, though they were still vulnerable in many scenarios.
Real-world threats: data leakage, manipulation, and malicious code generation
Cisco emphasized several high-risk threat categories that were consistently successful in their testing:- Sensitive-data exfiltration. Prompt sequences that coax a model to reveal system prompts, hidden instructions, or private context can leak API keys, PII, or proprietary text if such data appears in the model’s context or retrieval layer. This matters for enterprise RAG (retrieval-augmented generation) systems and chat assistants with privileged context.
- Misinformation and manipulation. Multi-step jailbreaks can produce persuasive disinformation or reframed outputs that look authoritative, increasing the risk of reputational harm and coordinated misinformation campaigns.
- Malicious code synthesis. Attack chains that gradually refine the model’s instruction can succeed in producing harmful code or step-by-step illicit instructions while circumventing single-shot content filters.
Methodology, scope, and important caveats
Cisco’s tests used the AI Validation platform in a black-box mode: evaluators did not assume privileged access to model internals or deployed guardrails, reflecting a realistic attacker posture for many production setups. The test suite included many threat categories (data exfiltration, code gen, bias/ethics bypasses, and more) and measured attack success rate (ASR) across single-turn and multi-turn sequences. Key caveats and limitations to bear in mind:- Model versions matter. The specific weights, tuning, and safety patches applied by labs can change over time; a model vulnerable today can be hardened tomorrow and vice versa. Cisco’s results are a snapshot of the tested versions and settings.
- Deployment context matters. Many production systems add retrieval filters, moderation layers, DLP, and human review. Cisco tested open-weight models in black-box settings and noted that downstream guardrails materially affect real-world risk profiles.
- Reproducibility vs. representativeness. Automated adversarial testing is powerful, but real attack success in the wild depends on attacker resources, ability to target specific deployments (for example, those that pass sensitive context to the model), and incremental discovery of optimal multi-turn patterns. Independent reviewers emphasize that automated lab tests are necessary but not sufficient to predict every production outcome.
Cisco’s recommendations — what defenders should do now
Cisco and independent analysts converge on a layered defense posture. The study’s recommendations are practical and immediately actionable:- Conduct continuous adversarial testing and red‑teaming before and after deployment. Treat multi-turn, adaptive attacks as mandatory checks rather than optional stress tests.
- Implement context-aware guardrails that protect across the entire conversational context — not just single prompts. Guardrails must monitor evolving context, not only the immediate user query.
- Enforce runtime monitoring and anomaly detection to flag unusual conversational patterns, rapid escalation attempts, or repeated reframing strategies.
- Prefer model choices and vendor offerings with documented safety alignment, accessible model cards, and transparent security assessments. But do not rely on vendor labeling alone — require independent testing.
- Use least-privilege controls, DLP, and identity-based access for any model endpoints that can influence sensitive systems (ticketing, code commits, document production, or external communications).
- Inventory all AI endpoints and map which services send sensitive data to models.
- Add adversarial prompt tests to CI/CD and pre-deployment gates.
- Enforce token and credential rotation; log and alert on unusual model call patterns.
- Insert human‑in‑the‑loop approvals for agentic outputs that perform actions.
- Maintain an incident playbook for AI misbehavior, including data exfiltration response steps.
Critical analysis — strengths, gaps, and the risk outlook
Strengths of Cisco’s work- Breadth and automation. Cisco’s platform allowed large-scale, repeatable, and adaptive multi-turn testing that reflects realistic attacker behavior better than many single-shot jailbreak lists.
- Actionable framing. The findings focus on operational consequences (data exfiltration, manipulation, unsafe code generation) that security teams can mitigate with engineering and policy changes.
- Version and deployment drift. Results will vary with upstream patches, downstream safety layers, and the exact fine-tuning applied by a deployer — so lab numbers don’t directly translate to every production deployment. This is why continuous, in‑house adversarial testing is essential.
- Potential for alarmism if misread. Publicizing high ASRs can encourage defenders to panic or to ban open-weight models wholesale, which would be an overreaction; the right response is measured — adopt layered defenses, not blanket avoidance.
What enterprises and Windows-focused IT teams should do — a practical playbook
- Treat models as untrusted endpoints until proven otherwise. Assume any assistant could be coaxed into unsafe behavior and design retrieval, data access, and action flows accordingly.
- Harden the ingress and egress pipelines. Sanitize retrieved documents, anonymize or redact PII before sending to models, and apply DLP to responses.
- Require per-request provenance and immutable logs. Capture which model, which version, what prompt template, and what retrieval documents were used for every inference. This is essential for auditing and incident response.
- Use canary deployments and model‑version gating. Route a small percentage of traffic through new model versions and monitor ASRs and anomaly signals before wider rollout.
- Enforce human approval for any automated action. If an assistant can create tickets, send emails, or change configuration, require human confirmation for high-risk scopes.
- Contract & procurement controls. Require vendors to disclose alignment approaches, red-team histories, dataset provenance, and third-party security audits in procurement documents.
- Run an inventory of all model endpoints and classify data sensitivity.
- Block any model endpoints that handle regulated data until the above mitigations are in place.
- Add multi-turn adversarial tests to pre-deployment checks.
- Deploy model‑agnostic guardrails and response filters (refusal models, post-processing filters).
- Roll out human-in-loop gating for agentic actions.
Broader implications: supply chain, governance, and policy
Cisco’s findings join a stream of research (dataset poisoning, backdoor injections, side-channel leaks) that reframes LLM risk as an operational supply-chain and governance problem. Open-weight models are valuable precisely because they are modifiable — that same openness increases the attack surface for malicious fine-tuning and stealthy dataset manipulation. Regulatory and procurement frameworks should therefore treat model providers and managed-hosting vendors as security vendors: require provenance, red-team disclosure, and the right to independent audits where the models are used in safety-critical contexts. A policy angle to emphasize: organizations that deploy open-weight models in regulated industries must insist on contractual guarantees about non-training of tenant data, explicit deletion semantics, and incident notification SLAs. These legal levers matter because they align vendor incentives and create recourse when a model causes harm.Conclusion
Cisco’s assessment is a wake-up call for any organization treating an LLM as a simple drop-in component. The core takeaways are plain and actionable: multi-turn adversarial attacks are not a theoretical curiosity — they are a pragmatic threat that dramatically increases attack success rates against open-weight models. That reality demands a shift from single-prompt safety checks to continuous, context-aware security engineering: regular adversarial testing, runtime monitoring, DLP, provenance, and human oversight.Open-weight models will remain a powerful and useful class of tools. The responsible path forward is not to abandon them but to adopt a security-first deployment model: test adversarially, choose models with alignment and safety evidence, enforce least privilege on data flows, and require human oversight for any action with tangible impact. Only with those layered controls can organizations keep the benefits of open models while reducing their operational risks.
Source: Petri IT Knowledgebase Cisco Warns of Major Flaws in Popular Open-Weight AI Models