Benchmarks Find ‘DeepSeek-V3-0324 Is More Vulnerable Than Qwen2.5-Max’

April 4, 2025

With its latest stable release dated January 28, 2025, Qwen2.5-Max is a Mixture-of-Experts (MoE) language model developed by Alibaba. Like other language models, Qwen2.5-Max is capable of generating text, understanding different languages, and performing advanced logic. According to recent benchmarks, it is also more secure than DeepSeek-V3-0324.

Using Recon to scan for vulnerabilities

A team of analysts with Protect AI, the company behind a red teaming and security vulnerability scanning tool known as Recon, recently used their platform to compare the security of Qwen2.5-Max against that of DeepSeek-V3.

The team’s analysis reads, in part: “We observed that DeepSeek-V3-0324 is more vulnerable than Qwen2.5-Max, with Recon achieving an almost 25% higher attack success rate (ASR).”

While it may be more secure than its competition, Qwen2.5-Max isn’t exactly perfect. According to the team’s tests, the AI model is most susceptible to prompt injection attacks, as these represented almost 48% of all successful cyberattacks against Qwen2.5-Max. Evasion and jailbreak attacks proved to be less successful, with an approximate ASR of 40% for both.
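The article does not describe how Recon scores its runs. As a minimal sketch, assuming ASR is simply the share of attempted attacks in a category that succeed, a per-category rate could be computed as follows. The function name, the input layout, and the sample outcomes are all hypothetical and are not taken from Recon.

```python
from collections import defaultdict


def attack_success_rate(results):
    """Return {category: ASR in percent} from (category, succeeded) pairs.

    Assumption: ASR = successful attacks / attempted attacks, per category.
    """
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for category, succeeded in results:
        attempts[category] += 1
        if succeeded:
            successes[category] += 1
    return {c: 100.0 * successes[c] / attempts[c] for c in attempts}


# Hypothetical outcomes, invented purely for illustration.
sample = [
    ("prompt injection", True),
    ("prompt injection", False),
    ("jailbreak", False),
    ("jailbreak", False),
    ("evasion", True),
]

rates = attack_success_rate(sample)
for category, rate in rates.items():
    print(f"{category}: {rate:.1f}% ASR")
```

A lower ASR means fewer attacks got through, which is why the lower-scoring model is the more secure one in this comparison.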

Exposing vulnerabilities in DeepSeek-V3

Recon uses a comprehensive Attack Library to scan current-gen AI models and identify vulnerabilities across six specific categories:

Evasion techniques
System prompt leaks
Prompt injection attacks
AI jailbreak attempts
General safety controls
Adversarial suffix resistance

In addition to simulated cyberattacks, Recon also assesses the AI models’ resistance to generating potentially harmful or illegal content. For example, during adversarial suffix resistance tests, Recon attempts to manipulate the AI model into producing harmful or illegal content.

The Protect AI team ran Recon against both Qwen2.5-Max and DeepSeek-V3, with the former boasting a lower attack success rate (ASR) across a variety of attacks, including jailbreaks, prompt injection, and evasion techniques.

Qwen2.5-Max had a 47% ASR against prompt injection attacks, compared to DeepSeek-V3’s notably higher 77%. Against evasion techniques, Qwen2.5-Max scored a 39.4% ASR, while DeepSeek-V3 scored 69.2%. Both AI models displayed similar results across other simulated cyberattacks.
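To make the size of these gaps concrete, the short sketch below subtracts the two models’ quoted ASR figures in each category. The numbers are the ones reported above. The dictionary layout and the helper name are illustrative assumptions.

```python
# ASR figures (percent) as quoted in the article.
asr = {
    "prompt injection": {"Qwen2.5-Max": 47.0, "DeepSeek-V3": 77.0},
    "evasion": {"Qwen2.5-Max": 39.4, "DeepSeek-V3": 69.2},
}


def asr_gap(scores):
    """DeepSeek-V3 ASR minus Qwen2.5-Max ASR, in percentage points."""
    return round(scores["DeepSeek-V3"] - scores["Qwen2.5-Max"], 1)


gaps = {category: asr_gap(scores) for category, scores in asr.items()}
for category, gap in gaps.items():
    print(f"{category}: DeepSeek-V3 ASR is {gap} points higher")
```

In both categories the difference works out to roughly 30 percentage points.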

Analyzing DeepSeek-V3’s strengths

Despite its security weaknesses, DeepSeek-V3-0324 still outperforms Qwen2.5-Max in several different benchmarks. Unlike the ASR, a higher score in these tests actually indicates better performance.

Benchmark	DeepSeek-V3-0324	Qwen2.5-Max
MMLU-Pro	81.2	75.9
GPQA Diamond	68.4	59.1
MATH-500	94.0	90.2
AIME 2024	59.4	39.6
LiveCodeBench	49.2	39.2

According to these benchmarks, DeepSeek-V3-0324’s strengths include general language understanding (MMLU-Pro), advanced topics such as biology, physics, and chemistry (GPQA Diamond), mathematics (MATH-500 and AIME 2024, the American Invitational Mathematics Examination), and coding (LiveCodeBench).
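As a quick check on the table above, this sketch computes DeepSeek-V3-0324’s lead on each benchmark and picks out the largest one. The scores are copied from the table. The variable names are illustrative assumptions.

```python
# (DeepSeek-V3-0324, Qwen2.5-Max) scores copied from the benchmark table.
scores = {
    "MMLU-Pro": (81.2, 75.9),
    "GPQA Diamond": (68.4, 59.1),
    "MATH-500": (94.0, 90.2),
    "AIME 2024": (59.4, 39.6),
    "LiveCodeBench": (49.2, 39.2),
}

# Lead of DeepSeek-V3-0324 over Qwen2.5-Max, in points.
margins = {name: round(ds - qw, 1) for name, (ds, qw) in scores.items()}
largest = max(margins, key=margins.get)

for name, margin in margins.items():
    print(f"{name}: +{margin}")
print(f"Largest lead: {largest}")
```

The lead is narrowest on MATH-500 and widest on AIME 2024.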

roosho Senior Engineer (Technical Services)

I am Rakib Raihan RooSho, Jack of all IT Trades. You got it right. Good for nothing. I try a lot of things and fail more than that. That's how I learn. Whenever I succeed, I note that in my cookbook. Eventually, that became my blog.
