Capstone Evaluator Rubric
A scoped, copy-able version of ../../resources/templates/evaluator-rubric.md, tuned for the three capstone tasks. Use one filled rubric per task per run.
How to score
Score each dimension 0, 1, or 2. A 0 on any dimension is an automatic fail for the task. Even if the verify script later turns green, a 0 indicates a structural failure; record it in the ablation report's failure-attribution section.
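The scoring rule can be sketched in a few lines of Python. This is an illustration only: the helper name `score_task` and the dictionary layout are assumptions and are not part of the capstone tooling.

```python
from typing import Dict, Tuple

ALLOWED_SCORES = {0, 1, 2}


def score_task(scores: Dict[str, int]) -> Tuple[int, bool]:
    """Return (total, passed) for one filled rubric.

    `scores` maps a dimension name to its 0/1/2 score. The name and
    shape of this helper are hypothetical, chosen for illustration.
    """
    for dimension, value in scores.items():
        if value not in ALLOWED_SCORES:
            raise ValueError(f"{dimension}: score must be 0, 1, or 2, got {value!r}")
    total = sum(scores.values())
    # Any single 0 fails the task, no matter how high the total is.
    passed = all(value > 0 for value in scores.values())
    return total, passed


example = {
    "Correctness": 2,
    "Scope adherence": 1,
    "Verification rigor": 0,  # unit checks only, so the task fails
    "Citation rule": 2,
    "Handoff readiness": 1,
    "Clean state": 2,
}
print(score_task(example))  # (8, False)
```

A total of 8 out of 12 looks respectable on its own, which is why the pass flag is returned separately from the sum.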
Dimensions
| Dimension | 0 (fail) | 1 (acceptable) | 2 (strong) |
|---|---|---|---|
| Correctness | verify.sh red, or behavior contradicts task | verify.sh green; behavior matches task | + Layer 3 e2e covers the new behavior, not just unit tests |
| Scope adherence | Files outside the feature's scope were edited | Edits stayed within scope | + Scope was narrowed during the run with justification |
| Verification rigor | Only unit-level checks (no e2e against the binary) | Unit + e2e both run | + The verification command is generalizable (no hard-coded ids) |
| Citation rule | cite: line missing or wrong path | cite: line present on every result | + Citation also appears in --json output where applicable |
| Handoff readiness | PROGRESS.md Next Action stale or missing | PROGRESS.md updated and accurate | + DECISIONS.md gained a one-line ADR if a real decision was made |
| Clean state | node scripts/clean-exit.mjs fails any of 5 dims | clean-exit passes all 5 | + logs/run.jsonl tail is informative and readable |
Per-task scoring sheet
Copy this block and fill in once per task, per run:
### Task: <T1 / T2 / T3> — Run <A-harness / B-prompt-only>
- Time to claim done: <minutes>
- verify.sh: <GREEN | RED | n/a>
- clean-exit: <PASS | FAIL>
| Dimension | Score (0/1/2) | Note |
|------------------------|---------------|-----------------------------------------|
| Correctness | | |
| Scope adherence | | |
| Verification rigor | | |
| Citation rule | | |
| Handoff readiness | | |
| Clean state | | |
Total: __ / 12.
Failure attribution (only if any dimension scored 0): defense layer = <task-spec / context / env / verification / state>.
Rules of fairness
- Score at the moment the agent declares done; do not re-score from memory.
- Use the same rubric and same scorer (you) for both runs of a given task.
- The 60-minute time cap counts; running over is a 0 on Correctness regardless of final state.
- Notes column is mandatory whenever a score is 0 or 2. A bare number teaches nothing the next time you run this.
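The time-cap and notes rules above can be checked mechanically before a sheet is filed. The sketch below is a minimal illustration: the helper names are hypothetical, and it assumes "running over" means strictly more than 60 minutes.

```python
from typing import Dict, List

TIME_CAP_MINUTES = 60  # the cap stated in the rules above


def apply_time_cap(scores: Dict[str, int], minutes_to_done: float) -> Dict[str, int]:
    """Return a copy of `scores` with Correctness forced to 0 if over the cap.

    Assumption: exactly 60 minutes is still within the cap.
    """
    adjusted = dict(scores)
    if minutes_to_done > TIME_CAP_MINUTES:
        adjusted["Correctness"] = 0
    return adjusted


def dimensions_missing_notes(scores: Dict[str, int], notes: Dict[str, str]) -> List[str]:
    """List dimensions scored 0 or 2 whose note is empty or absent."""
    return [
        dimension
        for dimension, score in scores.items()
        if score in (0, 2) and not notes.get(dimension, "").strip()
    ]


scores = {"Correctness": 2, "Scope adherence": 1, "Clean state": 2}
notes = {"Clean state": "run.jsonl tail shows each step"}

capped = apply_time_cap(scores, minutes_to_done=72)
print(capped["Correctness"])                    # 0
print(dimensions_missing_notes(capped, notes))  # ['Correctness']
```

Applying the cap first matters here: the forced 0 on Correctness then triggers the mandatory-note check as well.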
What to do with the totals
Sum across the three tasks per run. Report both totals in the ablation report's results table. The differential (Run A total − Run B total) is your headline number; do not trust it as a benchmark, but do trust it as direction.
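As a worked example of the arithmetic, the following sketch sums per-task totals for each run and computes the differential. The function names are hypothetical and the numbers are placeholders for illustration, not measured results.

```python
from typing import Dict

MAX_PER_TASK = 12  # six dimensions, each scored 0 to 2


def run_total(task_totals: Dict[str, int]) -> int:
    """Sum the per-task totals (each out of 12) for one run."""
    return sum(task_totals.values())


def differential(run_a: Dict[str, int], run_b: Dict[str, int]) -> int:
    """Headline number: Run A total minus Run B total."""
    return run_total(run_a) - run_total(run_b)


# Placeholder totals for illustration only, not real results.
run_a = {"T1": 10, "T2": 9, "T3": 11}  # A-harness
run_b = {"T1": 7, "T2": 8, "T3": 6}    # B-prompt-only

print(run_total(run_a), run_total(run_b), differential(run_a, run_b))  # 30 21 9
```

With three tasks, each run total is out of 36. A positive differential points in favor of Run A; the sign is the part worth trusting, more than the magnitude.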
