Path-driven red-teaming for agent skills

SkillAttack

SkillAttack is an automated red-teaming framework with three stages: (1) analyzing skills to identify vulnerabilities and attack surfaces, (2) inferring attack paths and constructing adversarial prompts in parallel across surfaces, and (3) iteratively refining paths and prompts based on execution feedback, forming a closed-loop search that converges toward successful exploitation.
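The three stages can be pictured as a small search loop. The sketch below is illustrative only and is not SkillAttack's implementation: `analyze`, `build_prompt`, and `execute` are hypothetical callables standing in for the analysis, prompt-construction, and execution components, and the round limit is an assumed parameter.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    surface: str      # attack surface identified in stage 1
    prompt: str       # last adversarial prompt tried on this surface
    feedback: str     # execution feedback returned for that prompt
    success: bool     # whether the unsafe behavior was triggered


def refine_loop(
    surface: str,
    build_prompt: Callable[[str, str], str],
    execute: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Attempt:
    """Stages 2 and 3 for one surface: build a prompt, run it, refine on feedback."""
    feedback = ""
    attempt = Attempt(surface, "", "", False)
    for _ in range(max_rounds):
        prompt = build_prompt(surface, feedback)   # uses feedback from the last round
        success, feedback = execute(prompt)
        attempt = Attempt(surface, prompt, feedback, success)
        if success:
            break                                  # stop once the attack succeeds
    return attempt


def red_team(
    skill_text: str,
    analyze: Callable[[str], List[str]],
    build_prompt: Callable[[str, str], str],
    execute: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> List[Attempt]:
    """Stage 1 finds surfaces; each surface is then searched in parallel."""
    surfaces = analyze(skill_text)
    with ThreadPoolExecutor() as pool:
        return list(
            pool.map(lambda s: refine_loop(s, build_prompt, execute, max_rounds), surfaces)
        )
```

Threads are used here only because the stand-in callables are assumed to be I/O-bound; any other form of parallelism across surfaces would fit the description equally well.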

Attack Success Rate

ASR across 10 models and 3 datasets

Risk Type Distribution

Risk categories across the three datasets

Case Studies

Full attack process walkthroughs

Results

Attack Success Rate

SkillAttack's attack success rate (ASR) across 10 frontier LLMs on three evaluation datasets. Higher values indicate greater skill vulnerability.

Model	Injected – Obvious	Injected – Contextual	Hot100
Average

ASR color scale: Low, Medium, High, Very High
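As a minimal sketch of how a per-model, per-dataset ASR table could be computed, assume ASR is the percentage of attacked cases in which the unsafe behavior was triggered. The page does not spell out the denominator, so that definition, the record layout, and the function name `attack_success_rate` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attack_success_rate(
    records: Iterable[Tuple[str, str, bool]],
) -> Dict[Tuple[str, str], float]:
    """Map (model, dataset) to ASR in percent.

    Each record is (model, dataset, success); the layout is hypothetical.
    """
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    wins: Dict[Tuple[str, str], int] = defaultdict(int)
    for model, dataset, success in records:
        totals[(model, dataset)] += 1
        wins[(model, dataset)] += int(success)
    return {key: 100.0 * wins[key] / totals[key] for key in totals}
```

A column average like the table's Average row would then be the mean of these values over models for a fixed dataset.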

Analysis

Risk Type Distribution

Proportion of each unsafe behavior category observed across the three datasets. The distribution reveals distinct risk profiles per dataset type.
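The proportions described above can be sketched as a simple normalized count. The category labels passed in are placeholders, since the page does not list the risk categories, and `risk_distribution` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Iterable


def risk_distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Return the share of each unsafe-behavior category among observed labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # no observations, so no distribution to report
    return {category: count / total for category, count in counts.items()}
```

Running this once per dataset gives the per-dataset profiles that the section compares.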

Case

Case Studies

Three representative cases from each dataset. Each case walks through the full attack path: analyzing the skill to identify vulnerable operations, constructing an adversarial prompt that steers the agent along the inferred path, and judging whether the unsafe behavior was triggered, with traceable evidence from the execution trajectory. The process iterates until the attack succeeds.
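The judging step with traceable evidence can be illustrated with a small sketch. It assumes the trajectory is a list of step descriptions and uses a plain predicate, `is_unsafe`, as a stand-in for whatever judge the framework actually uses; `Verdict` and `judge` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    triggered: bool      # whether any step showed the unsafe behavior
    evidence: List[int]  # indices of the trajectory steps that support the verdict


def judge(trajectory: List[str], is_unsafe: Callable[[str], bool]) -> Verdict:
    """Flag unsafe steps and keep their positions so the verdict stays traceable."""
    hits = [i for i, step in enumerate(trajectory) if is_unsafe(step)]
    return Verdict(triggered=bool(hits), evidence=hits)
```

Keeping step indices rather than a bare yes/no is what lets a case study point back to the exact place in the trajectory where the behavior occurred.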