
agent-evaluation

Enables the design and implementation of evaluation systems for AI agents, enhancing their performance through structured benchmarks and grading.

Security score: 47/100

The agent-evaluation skill was audited on Mar 7, 2026. The audit found 7 security issues across 2 threat categories, 3 of them high-severity. Review the findings below before installing.

Security Issues

high, line 454: Direct command execution function call
Source: SKILL.md, line 454: exec(code) # In sandbox

high, line 400: Eval function call (arbitrary code execution)
Source: SKILL.md, line 400: 2. Run eval (expect failure)

high, line 402: Eval function call (arbitrary code execution)
Source: SKILL.md, line 402: 4. Run eval (expect pass)

medium, line 67: Python subprocess execution
Source: SKILL.md, line 67: result = subprocess.run(

low, line 429: External URL reference
Source: SKILL.md, line 429: - [Anthropic: Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)

low, line 430: External URL reference
Source: SKILL.md, line 430: - [SWE-bench](https://www.swebench.com/)

low, line 431: External URL reference
Source: SKILL.md, line 431: - [WebArena](https://webarena.dev/)
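To illustrate why the first finding is rated high-severity: a bare exec(code) call runs arbitrary Python with the caller's full privileges, and a comment such as "# In sandbox" provides no actual confinement. The sketch below is hypothetical (the `code` string and `captured` dict are illustrative stand-ins, not taken from SKILL.md) and shows untrusted input escaping the intended evaluation logic.

```python
import os

# Hypothetical demonstration: `code` stands in for untrusted,
# model-generated input that the skill would pass to exec().
captured = {}

# Instead of running the intended evaluation code, the string reaches
# into the host process: here it grabs the working directory, but it
# could just as easily read files or environment variables.
code = "import os; captured['leak'] = os.getcwd()"

exec(code)  # Nothing in exec() restricts what this string can do

# The "sandboxed" code wrote straight into our namespace.
assert captured["leak"] == os.getcwd()
```

Genuine isolation requires an OS-level boundary (a separate process, container, or VM with resource limits), which is why even subprocess.run, as in the medium-severity finding, still warrants review of what arguments it receives.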