arXiv

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

June 2, 2026 · Su Wang, Pin Qian, Yihang Chen, Junxian You, Xiaoyuan Wang, Xiaochong Jiang, Lifei Liu, Haoran Yu, Jingzhou Xu

Abstract:

Large Language Model (LLM) agents are increasingly dependent on skills contributed by the community to broaden their operational capabilities. This study addresses a fundamental safety challenge in agentic AI systems: the potential for individually safe skills to combine into unsafe installed skill sets. To investigate this, we introduce SkillReact, a framework for measuring compositional security that comprises three distinct elements: a deterministic static-composition benchmark, an action-based exploitability harness, and a two-rater LLM-assisted pipeline for human adjudication.

Our analysis covers 1,520 skills from ClawHub. Of these, 651 passed individual inspection, yielding 211,575 distinct pairs. The benchmark flagged 22.25% of these pairs as structural candidates for risk. We interpret this raw rate as the ceiling for a recall-oriented scanner and calibrate it against human judgment. In a pattern-stratified audit, approximately one in five flagged pair-pattern hits was confirmed as a genuine compositional risk, giving a population-weighted validity of 18.2%, our primary finding. This suggests that roughly 14,000 genuine risk memberships exist within a single registry. Notably, per-skill scanning misses these risks by design, because each skill in a flagged pair is safe in isolation.
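The pair count above follows directly from choosing 2 of the 651 passing skills. The sketch below reproduces that arithmetic and illustrates, on a toy registry, what a pairwise structural check might look like. The flagged-pair count is derived here from the quoted 22.25% rate and is not a figure reported in the abstract. The skill names, capability tags, `RISKY_PATTERN`, and `flag_pairs` are hypothetical and are not SkillReact's actual patterns or API.

```python
from itertools import combinations
from math import comb

# Figures quoted in the abstract.
PASSED_SKILLS = 651
FLAGGED_RATE = 0.2225

# Number of unordered pairs among the skills that passed inspection.
total_pairs = comb(PASSED_SKILLS, 2)

# Derived estimate (not reported in the abstract): pairs flagged at 22.25%.
flagged_pairs = round(total_pairs * FLAGGED_RATE)

# Hypothetical toy registry: each skill maps to a set of capability tags.
toy_skills = {
    "fetch-helper": {"download"},
    "script-runner": {"execute"},
    "markdown-lint": {"read"},
}

# Hypothetical risk pattern, loosely modeled on the download-then-execute
# chain mentioned in the abstract.
RISKY_PATTERN = {"download", "execute"}


def flag_pairs(skills, pattern):
    """Return pairs whose combined capabilities cover the pattern
    although neither skill covers it alone."""
    flagged = []
    for a, b in combinations(sorted(skills), 2):
        covered_jointly = pattern <= (skills[a] | skills[b])
        covered_alone = pattern <= skills[a] or pattern <= skills[b]
        if covered_jointly and not covered_alone:
            flagged.append((a, b))
    return flagged


toy_flagged = flag_pairs(toy_skills, RISKY_PATTERN)
print(total_pairs, flagged_pairs, toy_flagged)
```

In the toy registry only the `fetch-helper` and `script-runner` pair is flagged: neither skill covers the pattern on its own, which is the kind of case a per-skill scanner cannot see.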

Subsequently, an action-based harness evaluated when these candidates translate into model-issued tool calls. The results indicated that realization is gated by the host model’s disposition. On an anchor-conditioned dropper subset, Haiku-4-5 issued the dropper-stage tool call in all 39 direct-prompt trials (comprising 36 full download-then-execute chains and 3 download-only instances). In contrast, Opus-4-7 halted at the download stage, and Sonnet-4-6 refused the request entirely. A control experiment, which kept the request fixed while varying only the installed skills, revealed that compliance was highest when no skills were installed. This demonstrates that while composition determines which capabilities are reachable, the host model ultimately decides whether to utilize them. These findings underscore the necessity of install-time compositional checks and capability isolation as essential complements to traditional per-skill scanning.
