
evaluating-llms-harness

Evaluates LLMs using 60+ benchmarks for model quality assessment, widely adopted in academic and industry settings.

Security score: 54/100

The evaluating-llms-harness skill was audited on Feb 28, 2026, and the audit found 6 security issues across 2 threat categories, including 1 critical. Review the findings below before installing.

Security Issues

Critical (SKILL.md, line 184): Eval function call, arbitrary code execution
    Source, line 184: Avoid for frequent eval (too slow):
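The critical finding concerns Python's eval(), which executes whatever expression it is given and so turns any attacker-controlled string into code. Where a skill only needs to parse literal values (numbers, strings, lists, dicts), ast.literal_eval is the standard safe substitute; the parse_value helper below is an illustration, not code from SKILL.md:

```python
import ast

def parse_value(text):
    """Parse a literal Python value without executing code.

    Unlike eval(), ast.literal_eval accepts only plain literals and
    raises ValueError or SyntaxError on anything else, so expressions
    with function calls or imports are rejected rather than run."""
    return ast.literal_eval(text)

# Plain literals parse normally.
print(parse_value("[1, 2, 3]"))  # [1, 2, 3]

# Hostile input is rejected instead of executed.
try:
    parse_value("__import__('os').system('rm -rf /')")
except (ValueError, SyntaxError):
    print("rejected")  # rejected
```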
Medium (SKILL.md, line 198): System command execution
    Source, line 198: os.system(f"./eval_checkpoint.sh checkpoints step-{step}")
Medium (SKILL.md, line 215): System command execution
    Source, line 215: os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
Medium (SKILL.md, line 198): Python os.system command execution
    Source, line 198: os.system(f"./eval_checkpoint.sh checkpoints step-{step}")
Medium (SKILL.md, line 215): Python os.system command execution
    Source, line 215: os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
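The medium findings all flag os.system() calls that interpolate variables into a shell string; if step or checkpoint_path ever contains shell metacharacters (";", "&&", backticks), extra commands can run. The usual mitigation is subprocess.run with an argument list and no shell, so every value is passed as a single literal argument. The sketch below demonstrates this with a harmless echo through the Python interpreter rather than the real lm_eval command, whose flags are taken from the report, not verified:

```python
import subprocess
import sys

def run_eval(checkpoint_path: str) -> str:
    """Build the command as an argument list and run it without a shell,
    so metacharacters in checkpoint_path cannot inject commands.

    Here the "command" is a trivial Python echo; in SKILL.md it would be
    ["lm_eval", "--model", "hf", "--model_args",
     f"pretrained={checkpoint_path}", ...] built the same way."""
    cmd = [
        sys.executable, "-c",
        "import sys; print(sys.argv[1])",
        f"pretrained={checkpoint_path}",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout.strip()

# A hostile path passes through as literal text and is never executed:
print(run_eval("model; rm -rf /"))  # pretrained=model; rm -rf /
```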
Low (SKILL.md, line 487): External URL reference
    Source, line 487: - Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (uses this harness)