evaluating-llms-harness

Evaluates LLMs on academic benchmarks such as MMLU and GSM8K to support model quality assessment and comparison.

Security score: 54/100

The evaluating-llms-harness skill was audited on May 17, 2026 and we found 6 security issues across 2 threat categories, including 1 critical. Review the findings below before installing.

Security Issues

critical line 192

Eval function call - arbitrary code execution

Source: SKILL.md

192	Avoid for frequent eval (too slow):

medium line 206

System command execution

Source: SKILL.md

206	os.system(f"./eval_checkpoint.sh checkpoints step-{step}")

medium line 223

System command execution

Source: SKILL.md

223	os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")

medium line 206

Python os.system command execution

Source: SKILL.md

206	os.system(f"./eval_checkpoint.sh checkpoints step-{step}")

medium line 223

Python os.system command execution

Source: SKILL.md

223	os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
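The four medium findings above all flag `os.system` called with an f-string, which hands the interpolated value to a shell. One possible way to avoid that, sketched here as an illustration and not as the skill author's fix, is to pass an argument list to `subprocess.run` so no shell parses the input. The helper names `checkpoint_eval_argv`, `lm_eval_argv`, and `run`, and the `extra_args` parameter, are hypothetical. The flags elided as `...` in the flagged line are left to the caller.

```python
import subprocess
from typing import Sequence


def checkpoint_eval_argv(step: int) -> list[str]:
    """Build the argument list for the script named in the line 206 finding."""
    # Accept only a non-negative integer so the value cannot carry shell syntax.
    if isinstance(step, bool) or not isinstance(step, int) or step < 0:
        raise ValueError(f"step must be a non-negative integer, got {step!r}")
    return ["./eval_checkpoint.sh", "checkpoints", f"step-{step}"]


def lm_eval_argv(checkpoint_path: str, extra_args: Sequence[str] = ()) -> list[str]:
    """Build the argument list for the lm_eval call in the line 223 finding.

    The flagged line elides its remaining flags ("..."); this sketch does not
    guess them, so the caller passes them in through extra_args.
    """
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={checkpoint_path}",
        *extra_args,
    ]


def run(argv: Sequence[str]) -> int:
    """Run a command without a shell and return its exit code."""
    # With a list and the default shell=False, characters such as ';' or
    # '$(...)' inside an argument reach the program literally.
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(list(argv), check=True).returncode
```

Calling `run(checkpoint_eval_argv(step))` would replace the `os.system` call quoted at line 206. This removes shell interpretation only; the called program still sees whatever path it is given, so validating `checkpoint_path` remains a separate concern.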

low line 495

External URL reference

Source: SKILL.md

495	- Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (uses this harness)

Scanned on May 17, 2026

GitHub stars: 185.0K

Category: data analytics

Updated: July 28, 2026

openclaw api data-scientist ml-ai-engineer researcher marketing-analyst product-manager data analytics development education research marketing product

NousResearch/hermes-agent