evaluating-llms-harness
Evaluates LLMs using 60+ benchmarks for model quality assessment and comparison; widely adopted in both academic and industry settings.
Security score: 54/100
The evaluating-llms-harness skill was audited on Feb 21, 2026, and the audit found 6 security issues across 2 threat categories, including 1 critical. Review the findings below before installing.
Security Issues
critical (SKILL.md, line 184): Eval function call, arbitrary code execution
    184 | Avoid for frequent eval (too slow):
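The critical finding matters because eval() executes arbitrary Python, so any untrusted string reaching it becomes code execution. As an illustration of the usual mitigation (this sketch is not taken from SKILL.md), untrusted strings that are expected to contain only data can be parsed with ast.literal_eval instead:

```python
import ast

# eval() on untrusted input is arbitrary code execution, e.g.
# eval("__import__('os').system('echo pwned')") runs a shell command.
# ast.literal_eval accepts only Python literals (numbers, strings,
# lists, tuples, dicts, sets, booleans, None) and raises otherwise.
safe_value = ast.literal_eval("[1, 2, 3]")
print(safe_value)  # [1, 2, 3]

try:
    ast.literal_eval("__import__('os').system('echo pwned')")
except ValueError:
    print("rejected non-literal input")
```

This does not make eval-heavy code fast or safe in general; it only covers the common case where the string should have been plain data all along.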
medium (SKILL.md, line 198): System command execution
    198 | os.system(f"./eval_checkpoint.sh checkpoints step-{step}")
medium (SKILL.md, line 215): System command execution
    215 | os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
medium (SKILL.md, line 198): Python os.system command execution
    198 | os.system(f"./eval_checkpoint.sh checkpoints step-{step}")
medium (SKILL.md, line 215): Python os.system command execution
    215 | os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
low (SKILL.md, line 487): External URL reference
    487 | - Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (uses this harness)
Install this skill with one command:
    /learn @dicklesworthstone/evaluation-lm-evaluation-harness

GitHub stars: 508
Category: development
Updated: March 29, 2026
Tags: openclaw, api, ml-ai-engineer, data-scientist, data-analyst, researcher, product-manager, huggingface, development, data analytics, education research, product
Dicklesworthstone/pi_agent_rust