Path-driven red-teaming for agent skills

SkillAttack

SkillAttack is an automated red-teaming framework with three stages: (1) analyzing skills to identify vulnerabilities and attack surfaces, (2) inferring attack paths and constructing adversarial prompts in parallel across surfaces, and (3) iteratively refining paths and prompts based on execution feedback, forming a closed-loop search that converges toward successful exploitation.
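The three stages can be read as a per-surface search loop. The sketch below is only an illustration of that control flow, not SkillAttack's implementation: every name in it (`analyze_skill`, `infer_attempt`, `execute`, `refine`, `skill_attack`, the `risky` flag, the `max_rounds` budget) is hypothetical. The `execute` function is a toy stand-in for running a real agent, and a thread pool is assumed for the parallelism across surfaces.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Attempt:
    surface: str     # attack surface this attempt targets
    path: list[str]  # inferred attack path (sequence of steps)
    prompt: str      # adversarial prompt steering the agent along the path
    round: int = 0   # number of refinements applied so far


def analyze_skill(skill: dict) -> list[str]:
    """Stage 1 (stub): treat every operation flagged as risky as an attack surface."""
    return [op["name"] for op in skill["operations"] if op.get("risky")]


def infer_attempt(surface: str) -> Attempt:
    """Stage 2 (stub): propose an initial path and prompt for one surface."""
    return Attempt(surface, [f"invoke:{surface}"], f"Please run {surface}.")


def execute(attempt: Attempt) -> dict:
    """Toy stand-in for an agent run: 'succeeds' once the path has three steps."""
    success = len(attempt.path) >= 3
    return {"success": success, "feedback": "done" if success else "refused"}


def refine(attempt: Attempt, feedback: str) -> Attempt:
    """Stage 3 (stub): extend the path and revise the prompt using feedback."""
    path = attempt.path + [f"workaround:{feedback}:{attempt.round}"]
    prompt = f"{attempt.prompt} (revised after: {feedback})"
    return Attempt(attempt.surface, path, prompt, attempt.round + 1)


def attack_surface(surface: str, max_rounds: int) -> Attempt | None:
    """Closed loop for one surface: execute, check, refine, repeat."""
    attempt = infer_attempt(surface)
    for _ in range(max_rounds):
        result = execute(attempt)
        if result["success"]:
            return attempt
        attempt = refine(attempt, result["feedback"])
    return None  # budget exhausted without a successful exploit


def skill_attack(skill: dict, max_rounds: int = 5) -> dict[str, Attempt]:
    """Run the loop for all surfaces in parallel; keep only successful attempts."""
    surfaces = analyze_skill(skill)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda s: attack_surface(s, max_rounds), surfaces))
    return {s: r for s, r in zip(surfaces, results) if r is not None}
```

In a real system, `execute` would run the agent with the skill installed and a judge would decide success from the execution trajectory. The toy version only makes the loop's termination behavior easy to follow.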

Attack Success Rate

SkillAttack's attack success rate (ASR) across 10 frontier LLMs on three evaluation datasets. Higher values indicate greater skill vulnerability.

Table: ASR by model. Columns: Model, Injected – Obvious, Injected – Contextual, Hot100, Average. Legend: Low, Medium, High, Very High.
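For readers who want to reproduce a table of this shape from their own runs, here is a minimal sketch. It rests on two assumptions not stated above: that ASR is the percentage of attacked cases in which the unsafe behavior was triggered, and that the Average column is the unweighted mean over the datasets. The function name `attack_success_rate` and the record format are hypothetical.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Build {model: {dataset: ASR in %, ..., "Average": mean ASR}}.

    records: iterable of (model, dataset, success) tuples, one per attacked case.
    Assumption: ASR = successful cases / attacked cases, as a percentage.
    """
    counts = defaultdict(lambda: [0, 0])  # (model, dataset) -> [successes, total]
    for model, dataset, success in records:
        counts[(model, dataset)][0] += int(success)
        counts[(model, dataset)][1] += 1

    table = defaultdict(dict)
    for (model, dataset), (hits, total) in counts.items():
        table[model][dataset] = 100.0 * hits / total

    # Assumption: "Average" is the unweighted mean across datasets.
    for row in table.values():
        row["Average"] = sum(row.values()) / len(row)
    return dict(table)


# Made-up records for a placeholder model, only to show the call shape.
demo = [
    ("model-a", "Hot100", True),
    ("model-a", "Hot100", False),
    ("model-a", "Injected – Obvious", True),
]
table = attack_success_rate(demo)
```

A case-weighted average would be an equally reasonable choice when dataset sizes differ; the unweighted mean is used here only to keep the sketch short.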

Risk Type Distribution

Proportion of each unsafe behavior category observed across the three datasets. The distribution reveals distinct risk profiles per dataset type.
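A distribution like this can be tallied from labeled attack outcomes. The sketch below assumes each successful attack has been assigned exactly one risk type; the function name `risk_distribution` and the category labels in the example are hypothetical placeholders, not the categories used in the evaluation.

```python
from collections import Counter


def risk_distribution(cases):
    """Return {dataset: {risk_type: proportion}} from (dataset, risk_type) pairs.

    Assumption: each case carries exactly one risk-type label, so the
    proportions within a dataset sum to 1.
    """
    by_dataset = {}
    for dataset, risk_type in cases:
        by_dataset.setdefault(dataset, Counter())[risk_type] += 1

    return {
        dataset: {risk: n / sum(counter.values()) for risk, n in counter.items()}
        for dataset, counter in by_dataset.items()
    }


# Placeholder labels, only to show the call shape.
demo_cases = [
    ("Hot100", "category_a"),
    ("Hot100", "category_a"),
    ("Hot100", "category_a"),
    ("Hot100", "category_b"),
    ("Injected – Obvious", "category_b"),
]
distribution = risk_distribution(demo_cases)
```

Normalizing within each dataset, rather than across all cases, is what makes the per-dataset risk profiles directly comparable.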

Case Studies

Three representative cases from each dataset. Each case follows the full attack path: analyzing the skill to identify vulnerable operations, constructing an adversarial prompt that steers the agent along the inferred path, and judging whether the unsafe behavior was triggered, using traceable evidence from the execution trajectory. These steps repeat until the attack succeeds.