evaluating-llms-harness
Evaluates LLMs using 60+ benchmarks for model quality assessment and comparison; widely adopted in both academic and industry settings.
Security score: 54/100
The evaluating-llms-harness skill was audited on Feb 21, 2026, and the audit found 6 security issues across 2 threat categories, including 1 critical. Review the findings below before installing.
Security Issues
critical (SKILL.md, line 184): Eval function call, arbitrary code execution
    184 | Avoid for frequent eval (too slow):
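The critical finding matters because eval() executes arbitrary Python, so any untrusted string reaching it becomes code execution. As an illustration of the usual mitigation (this sketch is not taken from SKILL.md), untrusted strings that are expected to contain only data can be parsed with ast.literal_eval instead:

```python
import ast

# eval() on untrusted input is arbitrary code execution, e.g.
# eval("__import__('os').system('echo pwned')") runs a shell command.
# ast.literal_eval accepts only Python literals (numbers, strings,
# lists, tuples, dicts, sets, booleans, None) and raises otherwise.
safe_value = ast.literal_eval("[1, 2, 3]")
print(safe_value)  # [1, 2, 3]

try:
    ast.literal_eval("__import__('os').system('echo pwned')")
except ValueError:
    print("rejected non-literal input")
```

This does not make eval-heavy code fast or safe in general; it only covers the common case where the string should have been plain data all along.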
medium (SKILL.md, line 198): System command execution
    198 | os.system(f"./eval_checkpoint.sh checkpoints step-{step}")
medium (SKILL.md, line 215): System command execution
    215 | os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
medium (SKILL.md, line 198): Python os.system command execution
    198 | os.system(f"./eval_checkpoint.sh checkpoints step-{step}")
medium (SKILL.md, line 215): Python os.system command execution
    215 | os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")
low (SKILL.md, line 487): External URL reference
    487 | - Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard (uses this harness)
Install this skill with one command:
    /learn @dicklesworthstone/evaluation-lm-evaluation-harness

GitHub stars: 508
Category: development
Updated: March 29, 2026
Tags: openclaw, api, ml-ai-engineer, data-scientist, data-analyst, researcher, product-manager, huggingface, development, data analytics, education research, product
Dicklesworthstone/pi_agent_rust