arXiv Paper Publication Polish
by HomericIntelligence v1.0.0
Systematic application of publication-ready polish to LaTeX papers based on comprehensive review feedback. Applies fixes across 5 categories: precision, terminology, bibliography, paths, and grammar.
# ProjectScylla
## Table of Contents
- [What is ProjectScylla?](#what-is-projectscylla)
- [Core Concepts](#core-concepts)
- [Quick Start](#quick-start-guide)
- [System Requirements](#system-requirements)
- [Analysis Pipeline Architecture](#analysis-pipeline-architecture)
- [Development](#development)
- [Troubleshooting](#troubleshooting)
- [Publication Readiness](#publication-readiness)
- [Contributing](#contributing)
## What is ProjectScylla?
ProjectScylla is a comprehensive testing framework for AI agent workflows that:
- **Measures** agent performance under constrained conditions
- **Analyzes** results with rigorous statistical methods
- **Optimizes** agent decisions through trade-off evaluation
- **Generates** publication-ready reports, figures, and tables
**Key Output**: Publication-quality statistical reports with **27 figures** and **11 tables** from a single command.
> "In Homer's Odyssey, Scylla represents one of the greatest challenges on the journey home: a monster that forced sailors to navigate perilous straits where every choice carried risk. ProjectScylla provides the same proving ground for AI agents."
## Quick Start Guide
### 5-Minute Setup
```bash
# 1. Install prerequisites
curl -fsSL https://pixi.sh/install.sh | bash
# 2. Clone and setup
git clone https://github.com/HomericIntelligence/ProjectScylla.git
cd ProjectScylla
# 3. Run your first analysis
pixi run -e analysis python --version # Verify installation
pixi run -e analysis python scripts/generate_all_results.py \
--data-dir ~/fullruns \
--output-dir results/analysis
# 4. View results (27 figures + 11 tables generated)
open results/analysis/figures/*.png # macOS
xdg-open results/analysis/figures/ # Linux (xdg-open takes a single path, so open the folder)
```
**That's it!** All outputs appear in the `results/analysis/` directory.
### Usage Examples
**Compare Two Agent Configurations:**
```bash
pixi run -e analysis python scripts/generate_all_results.py \
--data-dir ~/experiments/ \
--output-dir comparison_results/ \
--exclude test001-dryrun
```
**Fast Development Mode (No Rendering):**
```bash
# Quick iteration - generates Vega-Lite specs only
pixi run -e analysis python scripts/generate_all_results.py \
--data-dir ~/quick_test \
--no-render \
--skip-data # Skip if CSVs already exist
```
## System Requirements
**Minimum Requirements:**
- Python 3.14+
- 8GB RAM for full dataset analysis
- 2GB disk space for results
**Typical Performance:**
- Full analysis: 10-15 minutes (10,000 bootstrap samples)
- Figures only: 2-3 minutes
- Tables only: 1-2 minutes
**Scale:** Handles experiments with 1000+ runs efficiently
---
## Core Concepts
- **Trade-Off Evaluation**: Agents face scenarios where every decision has a cost, mirroring the Scylla and Charybdis dilemma
- **Metrics & Benchmarks**: Structured measurement across adaptability, efficiency, and reliability
- **Iterative Optimization**: Continuous refinement through repeated trials
- **Resilience Testing**: Assessment under uncertainty, constraints, and risks
## Ecosystem
- **ProjectOdyssey**: Training and capability development
- **ProjectKeystone**: Communication and distributed agent coordination
- **ProjectScylla**: Testing, measurement, and optimization under trial
Together, they form a cohesive ecosystem for building, connecting, and refining agent workflows.
---
## Running the Analysis Pipeline
### Full Analysis (Recommended)
Generate all outputs (data exports, figures, tables):
```bash
pixi run -e analysis python scripts/generate_all_results.py \
--data-dir ~/fullruns \
--output-dir results/analysis
```
**Key Options:**
- `--data-dir`: Directory with experiment results (default: `~/fullruns`)
- `--output-dir`: Base output directory (default: `docs/`)
- `--no-render`: Skip PNG/PDF rendering (faster; Vega-Lite specs only)
- `--skip-data`, `--skip-figures`, `--skip-tables`: Skip individual stages so that only the remaining components are generated
- `--exclude`: Filter out experiments by name (e.g., `--exclude test001-dryrun`)
```bash
# Development mode - no rendering
pixi run -e analysis python scripts/generate_all_results.py \
--no-render \
--exclude test001-dryrun test001-debug
# Regenerate tables only (assumes data/figures exist)
pixi run -e analysis python scripts/generate_all_results.py \
--skip-data --skip-figures
```
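To make the flag semantics above concrete, here is a minimal `argparse` sketch. It is not the actual parser in `scripts/generate_all_results.py`: the function name `build_parser` and the way stages are derived from the `--skip-*` flags are assumptions for illustration. Only the flag names and defaults come from the option list above.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not the real script)."""
    parser = argparse.ArgumentParser(description="Generate analysis outputs")
    parser.add_argument("--data-dir", type=Path, default=Path("~/fullruns").expanduser())
    parser.add_argument("--output-dir", type=Path, default=Path("docs/"))
    parser.add_argument("--no-render", action="store_true")
    for stage in ("data", "figures", "tables"):
        parser.add_argument(f"--skip-{stage}", action="store_true")
    # nargs="*" lets one --exclude flag carry several experiment names.
    parser.add_argument("--exclude", nargs="*", default=[])
    return parser


args = build_parser().parse_args(
    ["--no-render", "--exclude", "test001-dryrun", "test001-debug", "--skip-data"]
)
# Stages that would still run after applying the --skip-* flags.
stages = [s for s in ("data", "figures", "tables") if not getattr(args, f"skip_{s}")]
print(stages)        # ['figures', 'tables']
print(args.exclude)  # ['test001-dryrun', 'test001-debug']
```

Under this reading, "regenerate tables only" is simply the case where both `--skip-data` and `--skip-figures` are set.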
### Individual Pipeline Steps
**1. Export Data Only**
```bash
pixi run -e analysis python scripts/export_data.py \
--data-dir ~/fullruns \
--output-dir results/analysis/data
```
**Outputs:** `runs.csv`, `judges.csv`, `criteria.csv`, `subtests.csv`, `summary.json`, `statistical_results.json`
**2. Generate Figures Only (27 figures × 5 formats)**
```bash
pixi run -e analysis python scripts/generate_figures.py \
--data-dir ~/fullruns \
--output-dir results/analysis/figures
```
**Outputs:** `*.vl.json`, `*.csv`, `*.png` (300 DPI), `*.pdf`, `*_include.tex`
**3. Generate Tables Only (11 tables × 2 formats)**
```bash
pixi run -e analysis python scripts/generate_tables.py \
--data-dir ~/fullruns \
--output-dir results/analysis/tables
```
**Outputs:** `*.md` (human-readable), `*.tex` (LaTeX, booktabs formatted)
### Output Structure
```
results/analysis/
├── data/
│   ├── runs.csv                  # Per-run metrics
│   ├── judges.csv                # Judge evaluations
│   ├── criteria.csv              # Criterion-level scores
│   ├── subtests.csv              # Subtest metadata
│   ├── summary.json              # Experiment summary
│   └── statistical_results.json  # Statistical analysis
├── figures/                      # 27 figures × 5 formats
│   ├── fig01_score_variance.*
│   ├── fig02_grade_distribution.*
│   └── ... (27 total)
└── tables/                       # 11 tables × 2 formats
    ├── table01_tier_summary.md
    ├── table01_tier_summary.tex
    └── ... (11 total)
```
### Using the Outputs
**LaTeX Integration:**
```latex
\begin{figure}
\centering
\input{results/analysis/figures/fig04_pass_rate_by_tier_include.tex}
\caption{Pass rate by tier with 95\% bootstrap confidence intervals.}
\label{fig:pass-rate}
\end{figure}
\input{results/analysis/tables/table02_tier_comparison.tex}
```
**Python/Jupyter:**
```python
import json

import pandas as pd

# Load the exported per-run and per-judge tables
runs_df = pd.read_csv('results/analysis/data/runs.csv')
judges_df = pd.read_csv('results/analysis/data/judges.csv')

# Load statistical results
with open('results/analysis/data/statistical_results.json') as f:
    stats = json.load(f)
```
---
## Experiment Management Scripts
ProjectScylla provides comprehensive scripts for running, managing, and analyzing experiments.
### Running Experiments
**Primary Experiment Runner:**
```bash
# Run full experiment
pixi run -e analysis python scripts/run_e2e_experiment.py --config config/test.yaml
# Run specific tiers
pixi run -e analysis python scripts/run_e2e_experiment.py \
--tiers-dir tests/fixtures/tests/test-001 \
--tiers T0 T1 --runs 10 -v
```
**Container-Based Execution:**
```bash
./scripts/setup_api_key.sh
./scripts/run_experiment_in_container.sh \
--tiers-dir tests/fixtures/tests/test-001 \
--tiers T0 --runs 5 --verbose
```
### Recovery & Re-running
```bash
# Re-run failed agents
pixi run -e analysis python scripts/rerun_agents.py \
--data-dir ~/fullruns/test_experiment --tiers T0 T1
# Re-run failed judges
pixi run -e analysis python scripts/rerun_judges.py \
--data-dir ~/fullruns/test_experiment
```
### Results Management
```bash
# Regenerate all results
pixi run -e analysis python scripts/regenerate_results.py \
--data-dir ~/fullruns/test_experiment \
--output-dir results/analysis
# Regenerate agent-specific results
pixi run -e analysis python scripts/regenerate_agent_results.py \
--data-dir ~/fullruns/test_experiment
```
---
## Analysis Pipeline Architecture
### Statistical Methodology
Rigorous non-parametric methods for bounded, ordinal, non-normal data:
- **Bootstrap Confidence Intervals**: BCa with 10,000 resamples
- **Omnibus Testing**: Kruskal-Wallis H test; the family-wise error rate (FWER) of the follow-up pairwise tests is controlled by the Holm-Bonferroni correction
- **Pairwise Comparisons**: Mann-Whitney U + Holm-Bonferroni correction
- **Effect Sizes**: Cliff's delta with bootstrapped CIs
- **Inter-Rater Reliability**: Krippendorff's alpha for judge agreement
Configuration: `scylla/analysis/config.yaml` (all parameters externalized)
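Two of the methods listed above are small enough to sketch in plain Python. The following is an illustrative, dependency-free version of Cliff's delta and the Holm-Bonferroni step-down rule. It is not the code in `scylla/analysis/`; the function names and sample scores are assumptions.

```python
from typing import Sequence


def cliffs_delta(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> list[bool]:
    """Return a reject (True) / retain (False) decision per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared to alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # Step-down rule: every larger p-value is retained as well.
    return reject


# Made-up scores for two tiers, for illustration only.
t0_scores = [0.2, 0.4, 0.5, 0.6]
t1_scores = [0.5, 0.7, 0.8, 0.9]
print(cliffs_delta(t1_scores, t0_scores))   # 0.8125
print(holm_bonferroni([0.01, 0.04, 0.03]))  # [True, False, False]
```

Cliff's delta ranges from -1 to 1 and depends only on the ordering of observations, which is why it suits bounded, ordinal scores.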
### Metrics
**Quality:**
- Pass-Rate (functional test coverage)
- Implementation Rate (semantic satisfaction)
- Score (weighted rubric evaluation)
- Consistency (1 - Coefficient of Variation)
**Economic:**
- Cost-of-Pass (expected cost per success)
- Frontier CoP (minimum CoP across configs)
- Token Distribution (cost breakdown)
**Process:**
- Latency (query to resolution time)
- Judge Agreement (Krippendorff's alpha)
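As a rough illustration of the economic and consistency metrics, the sketch below computes Cost-of-Pass as mean run cost divided by pass rate, Frontier CoP as the minimum across configurations, and Consistency as 1 minus the coefficient of variation. The exact formulas used by ProjectScylla are defined in `docs/research.md`; the use of population standard deviation, the handling of zero pass rates, and all sample numbers are assumptions.

```python
from statistics import mean, pstdev


def cost_of_pass(costs, passed):
    """Expected cost per success: mean run cost divided by the pass rate."""
    pass_rate = sum(passed) / len(passed)
    if pass_rate == 0:
        return float("inf")  # Assumption: no successes means an unbounded cost.
    return mean(costs) / pass_rate


def consistency(scores):
    """1 - coefficient of variation (population std / mean)."""
    mu = mean(scores)
    if mu == 0:
        return 0.0  # Assumption: CV is undefined at mean 0, so report 0.
    return 1 - pstdev(scores) / mu


# Made-up per-run costs (USD) and pass flags for two configurations.
configs = {
    "T0": {"costs": [0.10, 0.10, 0.10, 0.10], "passed": [1, 0, 1, 0]},
    "T1": {"costs": [0.30, 0.30, 0.30, 0.30], "passed": [1, 1, 1, 0]},
}
cop = {name: cost_of_pass(c["costs"], c["passed"]) for name, c in configs.items()}
frontier = min(cop, key=cop.get)  # Configuration with the lowest Cost-of-Pass.
print(cop, frontier)
print(consistency([0.8, 0.9, 1.0]))
```

In this toy data the cheaper tier wins the frontier even though it passes less often, which is the kind of trade-off the metric is meant to surface.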
### Data Requirements
Expected structure:
```
fullruns/{experiment_name}/{timestamp}/
├── config/experiment.json                 # Metadata
└── T0-T6/{subtest_id}/run_{01-10}/
    ├── run_result.json                    # Outcomes
    └── judge/judge_{01-03}/judgment.json  # Evaluations
```
**Required in `run_result.json`:**
- `run_number` (integer)
- `exit_code` (0 = success)
- `judges` (list with grades & criteria)
Schema: `scylla/analysis/schemas/run_result_schema.json`
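Before running the full pipeline, a quick pre-flight check of the three required fields can catch malformed runs early. The sketch below is a simplified stand-in, not the project's validator. The JSON schema referenced above remains authoritative, and the function names here are hypothetical.

```python
import json
from pathlib import Path


def validate_run_result(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the required fields look sane."""
    problems = []
    run_number = payload.get("run_number")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(run_number, int) or isinstance(run_number, bool):
        problems.append("run_number must be an integer")
    if "exit_code" not in payload:
        problems.append("exit_code is missing")
    if not isinstance(payload.get("judges"), list):
        problems.append("judges must be a list")
    return problems


def validate_file(path: Path) -> list[str]:
    """Load one run_result.json file and check its required fields."""
    return validate_run_result(json.loads(path.read_text()))


print(validate_run_result({"run_number": 1, "exit_code": 0, "judges": []}))  # []
print(validate_run_result({"run_number": "1", "judges": {}}))
```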
---
## Development
### Testing
ProjectScylla has a comprehensive test suite with **77+ test files** covering all functionality.
#### Test Categories
- **Unit Tests** (67+ files): Analysis, adapters, config, executors, judges, metrics, reporting
- **Integration Tests** (2 files): End-to-end workflow testing
- **E2E Tests** (1 file): Full pipeline validation
- **Test Fixtures** (47+ scenarios): Complete test cases with expected outputs
#### Running Tests
```bash
# All tests (comprehensive)
pixi run -e analysis pytest tests/ --verbose
# Unit tests only (fastest)
pixi run -e analysis pytest tests/unit/ -v
# Specific modules
pixi run -e analysis pytest tests/unit/analysis/ -v
pixi run -e analysis pytest tests/unit/adapters/ -v
pixi run -e analysis pytest tests/unit/config/ -v
# Integration tests
pixi run -e analysis pytest tests/integration/ -v
# Coverage analysis
pixi run -e analysis pytest tests/ --cov=scylla/scylla --cov-report=html
# Specific test file
pixi run -e analysis pytest tests/unit/analysis/test_stats.py -v
```
#### Test Quality Assurance
```bash
# Code quality (linting + formatting)
pixi run -e analysis ruff check scylla/
pixi run -e analysis ruff format scylla/ --check
```
### Adding Components
**New Figures:**
1. Create module in `scylla/analysis/figures/`
2. Implement function following existing pattern
3. Register in `scripts/generate_figures.py`
4. Add tests in `tests/unit/analysis/test_figures.py`
**New Tables:**
1. Add function to module in `scylla/analysis/tables/`
2. Register in `scripts/generate_tables.py`
3. Add tests in `tests/unit/analysis/test_tables.py`
### Code Quality
```bash
# Linting
pixi run -e analysis ruff check scylla/analysis/
# Auto-fix and format
pixi run -e analysis ruff check --fix scylla/analysis/
pixi run -e analysis ruff format scylla/analysis/
```
---
## Troubleshooting
### Quick Reference
| Symptom | Solution |
|---------|----------|
| `Schema validation failed: 'N/A' does not match` | Ensure grades are S, A, B, C, D, or F only |
| `[Errno 2] No such file or directory` | Run: `find ~/fullruns -name "run_result.json"` |
| `TypeError: unsupported operand` | Fix type coercion in criterion.achieved values |
| Empty outputs | Check: ≥2 experiments, ≥1 completed run each |
| Slow performance | Use `--no-render` flag for faster iteration |
### Common Issues
**1. Data Validation Errors**
```
Schema validation failed: 'N/A' does not match '^[SABCDF]$'
```
**Fix:** Review the offending runs and make sure every grade is one of S, A, B, C, D, or F, or update the schema if other values are legitimate.
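To locate offending grades before the pipeline rejects them, a check along these lines may help. It reuses the pattern quoted in the error message; the helper name `invalid_grades` and the flat list of grades are assumptions, since the exact layout of judge records is defined by the project schema.

```python
import re

# Pattern quoted in the schema validation error above.
GRADE_PATTERN = re.compile(r"^[SABCDF]$")


def invalid_grades(grades):
    """Return (index, value) pairs for grades that would fail the pattern."""
    return [
        (i, g)
        for i, g in enumerate(grades)
        if not isinstance(g, str) or GRADE_PATTERN.fullmatch(g) is None
    ]


print(invalid_grades(["A", "N/A", "S", "b", None]))
# [(1, 'N/A'), (3, 'b'), (4, None)]
```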
**2. Missing Files**
```
Failed to load: [Errno 2] No such file or directory
```
**Fix:** Incomplete runs are skipped with a warning. To list run directories that lack a `run_result.json`:
```bash
find ~/fullruns -name "run_*" -type d -exec sh -c 'test -f "$1/run_result.json" || echo "Missing: $1"' _ {} \;
```
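The same check can be done from Python, which is convenient inside a notebook. This is a sketch equivalent to the `find` command above; the function name is hypothetical, and `~/fullruns` is simply the default data directory used throughout this README.

```python
from pathlib import Path


def runs_missing_result(root: Path) -> list[Path]:
    """List run_* directories under root that lack a run_result.json file."""
    if not root.is_dir():
        return []
    return sorted(
        d
        for d in root.rglob("run_*")
        if d.is_dir() and not (d / "run_result.json").is_file()
    )


if __name__ == "__main__":
    for run_dir in runs_missing_result(Path("~/fullruns").expanduser()):
        print(f"Missing: {run_dir}")
```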
**3. Type Errors**
```
TypeError: unsupported operand type(s) for +: 'float' and 'str'
```
**Fix:** Some `criterion.achieved` values are stored as strings. Fix this in data generation, or coerce the values to numbers when loading.
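If fixing the data at the source is not practical, a defensive coercion step is one option. The helper below is a hypothetical sketch, not part of ProjectScylla, and falling back to `0.0` for unparseable values is an assumption that may not suit every rubric.

```python
def coerce_achieved(value, default=0.0):
    """Best-effort conversion of a criterion's `achieved` value to float."""
    if isinstance(value, (bool, int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value.strip())
        except ValueError:
            return default  # Non-numeric strings such as "N/A".
    return default  # None or any other unexpected type.


raw = [1, "2.5", " 3 ", "N/A", None]
print(sum(coerce_achieved(v) for v in raw))  # 6.5
```

Silently defaulting can hide data problems, so logging each fallback is worth considering.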
### Getting Help
- **Documentation**: `docs/research.md` for methodology
- **Examples**: `tests/unit/analysis/` for usage patterns
- **Issues**: [GitHub Issues](https://github.com/HomericIntelligence/ProjectScylla/issues)
- **Support**: Create an issue with error message and steps to reproduce
---
## Publication Readiness
- **Rigorous non-parametric statistics** (Kruskal-Wallis, Mann-Whitney U, Cliff's delta)
- **Multiple comparison correction** (Holm-Bonferroni throughout)
- **Bootstrap confidence intervals** (BCa, 10K resamples, seed=42)
- **Effect sizes with confidence intervals**
- **300 DPI publication-quality figures**
- **LaTeX-ready tables** with booktabs formatting
- **Reproducible configuration** (all parameters in config.yaml)
- **Comprehensive test suite** (240+ tests, all passing)
- **Documented methodology** with citations
See `docs/research.md` for complete research methodology and metric definitions.
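To show how a fixed seed makes bootstrap intervals reproducible, here is a minimal percentile bootstrap for the mean. Note the swap: this is the simpler percentile method, not the BCa variant the pipeline reports, and it uses only the standard library. The function name and the sample pass/fail data are assumptions.

```python
import random
from statistics import mean


def percentile_bootstrap_ci(values, n_resamples=10_000, confidence=0.95, seed=42):
    """Percentile bootstrap CI for the mean (no BCa bias or acceleration terms)."""
    rng = random.Random(seed)  # A fixed seed makes the interval reproducible.
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    estimates = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    tail = (1 - confidence) / 2
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[int((1 - tail) * n_resamples) - 1]
    return lower, upper


passes = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]  # Illustrative pass/fail outcomes.
low, high = percentile_bootstrap_ci(passes, n_resamples=2_000)
print(f"pass rate = {mean(passes):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Rerunning with the same seed yields the same interval, which is the property the seed=42 setting above relies on.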
### LaTeX Dependencies
Required packages for document compilation:
```latex
\documentclass{article}
\usepackage{booktabs} % Professional tables
\usepackage{longtable} % Multi-page tables
\usepackage{threeparttable} % Table notes
\usepackage{graphicx} % Figure inclusion
\usepackage{amsmath} % Statistical symbols
\begin{document}
% Your content here
\end{document}
```
---
## Contributing
We welcome contributions! Please see our contributing guidelines:
- **Development Setup**: Follow Quick Start guide above
- **Code Standards**: Run linting and formatting before submitting
- **Testing**: Ensure all tests pass (`pytest tests/`)
- **Documentation**: Update README and docs for new features
**Areas for contribution:**
- Additional statistical methods and metrics
- New visualization types and formats
- Performance optimizations
- Documentation improvements
- Bug fixes and feature requests
Visit our [GitHub Repository](https://github.com/HomericIntelligence/ProjectScylla) to get started.
---
## License
See the [LICENSE](LICENSE) file.
## Citation
```bibtex
@software{projectscylla2026,
title = {ProjectScylla: A Testing and Optimization Framework for Agentic Workflows},
author = {Micah Villmow},
year = {2026},
url = {https://github.com/HomericIntelligence/ProjectScylla}
}
```