AI models block 87% of single-turn attacks, but only 8% when attackers persist

One malicious prompt gets blocked; ten prompts get through. That gap is the difference between passing benchmark tests and withstanding real-world attacks, and it's a gap most enterprises ignore.
When attackers send a single malicious request, open-weight AI models hold up well, blocking attacks 87% of the time on average. But when those same attackers spread the attempt across a conversation, probing, reframing, and backtracking over multiple exchanges, the math reverses. Attack success rates climb from 13% to as high as 92%.
For CISOs evaluating open models for enterprise deployment, the implications are immediate: the models that power your customer-facing chatbots, internal copilots, and autonomous agents can pass single-turn security tests while failing catastrophically under sustained adversarial pressure.
"Many of these models have started to improve a little," DJ Sampath, senior vice president of Cisco's AI Software Platforms Group, told VentureBeat. "When you attack them once, with single-turn attacks, they are able to protect themselves. But when you go from single-turn to multi-turn, these models suddenly start to display vulnerabilities where the attacks are successful almost 80% of the time in some cases."
Why conversations break open models
The Cisco AI Threat Research and Security team found that open-weight models that block single-turn attacks collapse under the weight of conversational persistence. Their recently published study shows that jailbreak success rates increase almost tenfold when attackers prolong the conversation.
The results, published in "Death by a Thousand Prompts: Open Model Vulnerability Analysis" by Amy Chang, Nicholas Conley, Harish Santhanalakshmi Ganesan, and Adam Swanda, quantify what many security researchers have long suspected but been unable to prove at scale.
Cisco's research proves it, and it shows that treating multi-turn attacks as an extension of single-turn vulnerabilities misses the point entirely. The gap between them is categorical, not a matter of degree.
The research team evaluated eight open-weight models: Alibaba (Qwen3-32B), DeepSeek (v3.1), Google (Gemma-3-1B-IT), Meta (Llama 3.3-70B-Instruct), Microsoft (Phi-4), Mistral (Large-2), OpenAI (GPT-OSS-20B), and Zhipu AI (GLM-4.5-Air). Using a black-box methodology, running tests without any knowledge of the models' internals, which is exactly how real-world attackers operate, the team measured what happens when persistence replaces one-off attacks.
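In practice, that methodology boils down to comparing two measurements over the same model endpoint: the share of isolated prompts that succeed, and the share of scripted multi-turn dialogues in which a refusal eventually breaks. The sketch below is a minimal illustration of that comparison, not Cisco's actual harness; `query_model` and `is_refusal` are hypothetical placeholders for any chat-style API wrapper and refusal judge.

```python
# Minimal sketch of a black-box single-turn vs. multi-turn evaluation loop.
# Hypothetical placeholders: `query_model` wraps any chat-style API and returns
# the assistant's reply; `is_refusal` is a stand-in judge. Not Cisco's harness.
from typing import Callable

Message = dict[str, str]  # {"role": "user" | "assistant", "content": "..."}


def single_turn_asr(prompts: list[str],
                    query_model: Callable[[list[Message]], str],
                    is_refusal: Callable[[str], bool]) -> float:
    """Share of isolated adversarial prompts the model answers instead of refusing."""
    successes = sum(
        0 if is_refusal(query_model([{"role": "user", "content": p}])) else 1
        for p in prompts
    )
    return successes / len(prompts)


def multi_turn_asr(scripted_dialogues: list[list[str]],
                   query_model: Callable[[list[Message]], str],
                   is_refusal: Callable[[str], bool]) -> float:
    """Share of scripted multi-turn dialogues in which the refusal eventually breaks."""
    successes = 0
    for turns in scripted_dialogues:
        history: list[Message] = []
        for turn in turns:
            history.append({"role": "user", "content": turn})
            reply = query_model(history)
            history.append({"role": "assistant", "content": reply})
            if not is_refusal(reply):  # the defense collapsed mid-conversation
                successes += 1
                break
    return successes / len(scripted_dialogues)
```

The only structural difference between the two measurements is that the multi-turn loop keeps appending to the conversation history, which is precisely the surface the study shows these models fail to defend.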
The researchers note: "Single-turn attack success rates (ASR) average 13.11% because models can more easily detect and reject isolated adversarial inputs. Multi-turn attacks, by contrast, exploit conversational persistence to achieve an average ASR of 64.21% [roughly a 5x increase], with some models like Alibaba Qwen3-32B reaching an ASR of 86.18% and Mistral Large-2 reaching an ASR of 92.78%." Mistral's figure is up from a single-turn ASR of 21.97%, a gap of more than 70 percentage points.
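Spelled out as arithmetic, using the averages quoted above and Mistral's per-model figures:

$$
\frac{\overline{\mathrm{ASR}}_{\text{multi-turn}}}{\overline{\mathrm{ASR}}_{\text{single-turn}}} = \frac{64.21\%}{13.11\%} \approx 4.9\times,
\qquad
92.78\% - 21.97\% = 70.81 \text{ percentage points (Mistral Large-2)}.
$$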
Results define the gap
The research team offers a succinct assessment of how open models hold up under attack: "This escalation, ranging from 2x to 10x, stems from the models' inability to maintain contextual defenses during extended dialogues, allowing attackers to refine prompts and bypass protections."
Figure 1: Single-round attack success rates (blue) versus multi-round attack success rates (red) across the eight models tested. The gap ranges from 10 percentage points (Google Gemma) to more than 70 percentage points (Mistral, Llama, Qwen). Source: Cisco AI Defense
The five techniques that make persistence deadly
The research tested five multi-turn attack strategies, each exploiting a different aspect of conversational persistence (a schematic example follows the list):
- Information decomposition and reassembly: breaks a harmful query into innocuous components spread across turns, then reassembles them. Against Mistral Large-2, this technique achieved 95% success.
- Contextual ambiguity: introduces vague framing that confuses safety classifiers, achieving 94.78% success against Mistral Large-2.
- Crescendo attacks: escalate requests gradually over turns, starting harmless and turning harmful, reaching 92.69% success against Mistral Large-2.
- Role-play and persona adoption: establishes fictional contexts that normalize harmful outputs, achieving up to 92.44% success against Mistral Large-2.
- Refusal reframing: repackages rejected requests with different justifications until one gets through, reaching up to 89.15% success against Mistral Large-2.
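For defensive red-teaming, each of these strategies reduces to a scripted sequence of turns that can be fed to a harness like the one sketched above. The placeholder below shows only the shape of a crescendo-style script; the content is deliberately generic, and a real test suite would pair each dialogue with a rubric for judging whether the defense held.

```python
# Shape of a crescendo-style multi-turn test script (placeholder content only).
# A real red-team suite scripts domain-specific turns and pairs each dialogue
# with a rubric for judging whether the model's refusal held.
crescendo_script = [
    "Innocuous opening question that establishes a plausible context.",
    "Follow-up that narrows the topic while staying within policy.",
    "Reframed request that leans on the context built up in earlier turns.",
    "Final escalation that the model should still refuse.",
]

# Fed into the earlier sketch:
# multi_turn_asr([crescendo_script], query_model, is_refusal)
```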
What makes these techniques effective isn't sophistication, it's familiarity. They mirror the way humans naturally converse: building context, clarifying requests, and reframing when an initial approach fails. The models are not vulnerable to exotic attacks. They are vulnerable to persistence itself.
Table 2: Attack success rate by technique across all models. The consistency across techniques means enterprises cannot defend against just one of them. Source: Cisco AI Defense
The open-weight safety paradox
The research arrives at a critical inflection point. Open-source and open-weight models have become fundamental to innovation across the cybersecurity industry. Whether it's speeding time to market for startups, reducing vendor lock-in for enterprises, or enabling customization that proprietary models can't match, open source has become the go-to foundation for most cybersecurity startups.
The paradox is not lost on Cisco. The company's Foundation-Sec-8B model, built specifically for cybersecurity applications, is distributed as open weights on Hugging Face. Cisco isn't simply criticizing competitors' models; it is acknowledging a systemic vulnerability that affects the entire open-weight ecosystem, including models it ships itself. The message is not "avoid open-weight models." It is "understand what you are deploying and add the appropriate guardrails."
Sampath is blunt about the implications: "Open source has its own drawbacks. When you start deploying an open model, you need to think through the safety implications and make sure you are consistently putting the right kinds of guardrails around the model."
Table 1: Attack success rates and security gaps across all models tested. Gaps greater than 70 percentage points (Qwen at +73.48%, Mistral at +70.81%, Llama at +70.32%) mark the priority candidates for additional guardrails before deployment. Source: Cisco AI Defense.
Why lab philosophy shapes security outcomes
The security gap Cisco uncovered maps directly to how AI labs approach alignment.
The research spells out the pattern: "Capability-focused models (e.g., Llama) demonstrated the highest multi-turn gaps, with Meta explaining that developers are 'in the driver's seat to adapt safety to their use case' after training. Models with a strong alignment focus (e.g., Google Gemma-3-1B-IT) demonstrated a more balanced profile across the single- and multi-turn strategies deployed against them, reflecting a focus on 'rigorous safety protocols' and a 'low level' of misuse risk."
Capability-first labs produce the largest gaps. Meta's Llama shows a 70.32-percentage-point gap. Mistral's model card for Large-2 acknowledges that the model "has no moderation mechanism," and it shows a gap of 70.81 points. Alibaba's Qwen technical reports do not address safety or security at all, and the model shows the largest gap of the group at 73.48 points.
Safety-focused labs produce smaller gaps. Google's Gemma documentation points to "rigorous safety protocols" and targets a "low level" of misuse risk. The result is the smallest gap of the group at 10.53 points, with more balanced performance across single- and multi-turn scenarios.
Models optimized for capability and flexibility tend to ship with less built-in safety. That is a design choice, and for many enterprise use cases it's the right one. But enterprises must recognize that "capability first" often means "safety second," and budget accordingly.
Where attacks are most successful
Cisco tested 102 distinct sub-threat categories. The top 15 achieved high success rates across all models, suggesting that targeted defensive measures could deliver disproportionate security improvements.
Figure 4: The 15 most vulnerable sub-threat categories, ranked by average attack success rate. Malicious infrastructure operations topped the list at 38.8%, followed by gold smuggling (33.8%), network attack operations (32.5%) and investment fraud (31.2%). Source: Cisco AI Defense.
Figure 2: Attack success rates across 20 threat categories and all eight models. Malicious code generation shows elevated rates across models (ranging from 3.1% to 43.1%), while pattern extraction attempts show near-zero success except against Microsoft Phi-4. Source: Cisco AI Defense.
Security as a key to driving AI adoption
Sampath views security not as an obstacle but as the mechanism that enables adoption: "The way enterprise security leaders think about this is: 'I want to unleash the productivity of all my users. Everyone is clamoring to use these tools. But I need good security guardrails in place, because I don't want to end up in a Wall Street Journal story,'" he told VentureBeat.
Sampath continued: "If we have the ability to detect prompt injection attacks and block them, then I will be able to unlock and accelerate the adoption of AI in a fundamentally different way."
What defenders need
The study highlights six essential capabilities that enterprises should prioritize:
- Contextual guardrails that maintain state across an entire conversation (a minimal sketch follows this list)
- Model-independent runtime protections
- Continuous red-teaming that targets multi-turn strategies
- Hardened system prompts designed to resist instruction overrides
- Comprehensive logging for forensic visibility
- Threat-specific mitigations for the top 15 sub-threat categories identified in the research
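To make the first item concrete, the sketch below shows a conversation-level guardrail that scores each request in the context of the accumulated dialogue rather than in isolation. It is an illustrative sketch only, not Cisco AI Defense's implementation; `score_turn` stands in for any context-aware risk classifier, and the threshold and decay values are arbitrary placeholders.

```python
# Illustrative sketch of a stateful, conversation-level guardrail.
# `score_turn` is a placeholder for any context-aware risk classifier;
# the threshold and decay values are arbitrary and would need tuning.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ConversationGuardrail:
    block_threshold: float = 0.8          # refuse above this risk level
    decay: float = 0.9                    # how slowly earlier turns stop counting
    cumulative_risk: float = 0.0
    history: list[str] = field(default_factory=list)

    def check(self, user_message: str,
              score_turn: Callable[[str, list[str]], float]) -> str:
        """Return 'allow' or 'block' using the whole conversation, not just this turn."""
        turn_risk = score_turn(user_message, self.history)
        # Persistence matters: risk accumulates across turns instead of resetting.
        self.cumulative_risk = self.decay * self.cumulative_risk + turn_risk
        self.history.append(user_message)
        if turn_risk >= self.block_threshold or self.cumulative_risk >= self.block_threshold:
            return "block"
        return "allow"
```

A stateless filter would look only at the per-turn risk and would miss a crescendo that keeps every individual message below the threshold; carrying the cumulative risk across turns is what addresses the single-turn versus multi-turn gap the research describes.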
The action window
Sampath warns against waiting: "Many people are in wait-and-see mode, waiting for AI to settle. That's not the right way to think about it. Every two weeks something dramatic happens that resets the landscape. Pick a partner and start doubling down."
As the report's authors conclude: "The 2-10x superiority of multi-turn attacks over single-turn attacks, model-specific weaknesses, and high-risk threat categories require urgent action."
To repeat: one prompt gets blocked, ten prompts get through. That math won't change until enterprises stop testing single-turn defenses and start securing entire conversations.



