Capstone Evaluator Rubric

A scoped, copyable version of ../../resources/templates/evaluator-rubric.md, tuned for the three capstone tasks. Fill in one rubric per task per run.

How to score

Score each dimension 0, 1, or 2. A 0 on any dimension is an automatic fail for the task. Even if `verify.sh` later turns green, a 0 indicates a structural failure; record it in the failure-attribution section of the ablation report.
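The scoring rule can be sketched in a few lines of JavaScript. This is a minimal illustration only: `scoreTask` and the shape of its input are hypothetical and are not part of the capstone scripts; the dimension names are taken from the rubric.

```javascript
// Hypothetical helper for illustration; not part of the capstone repo.
const DIMENSIONS = [
  "Correctness",
  "Scope adherence",
  "Verification rigor",
  "Citation rule",
  "Handoff readiness",
  "Clean state",
];

// scores: object mapping each dimension name to 0, 1, or 2.
function scoreTask(scores) {
  const failed = [];
  let total = 0;
  for (const dim of DIMENSIONS) {
    const s = scores[dim];
    if (![0, 1, 2].includes(s)) {
      throw new RangeError(`${dim}: score must be 0, 1, or 2 (got ${s})`);
    }
    if (s === 0) failed.push(dim); // any 0 is an automatic fail
    total += s;
  }
  return {
    total,
    max: DIMENSIONS.length * 2, // 6 dimensions x 2 points
    pass: failed.length === 0,
    failed,
  };
}

// Made-up scores, purely to show the shape of the result.
const example = scoreTask({
  "Correctness": 2,
  "Scope adherence": 1,
  "Verification rigor": 0,
  "Citation rule": 1,
  "Handoff readiness": 2,
  "Clean state": 1,
});
console.log(example);
```

The task above still receives a numeric total, but `pass` is false because one dimension scored 0; the `failed` list names the dimensions that need a failure-attribution entry.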

Dimensions

| Dimension          | 0 (fail)                                              | 1 (acceptable)                              | 2 (strong)                                                         |
|--------------------|-------------------------------------------------------|---------------------------------------------|--------------------------------------------------------------------|
| Correctness        | `verify.sh` red, or behavior contradicts task         | `verify.sh` green; behavior matches task    | + Layer 3 e2e covers the new behavior, not just unit tests         |
| Scope adherence    | Files outside the feature's `scope` were edited       | Edits stayed within `scope`                 | + Scope was narrowed during the run with justification             |
| Verification rigor | Only unit-level checks (no e2e against the binary)    | Unit + e2e both run                         | + The verification command is generalizable (no hard-coded ids)    |
| Citation rule      | `cite:` line missing or wrong path                    | `cite:` line present on every result        | + Citation also appears in `--json` output where applicable        |
| Handoff readiness  | `PROGRESS.md` Next Action stale or missing            | `PROGRESS.md` updated and accurate          | + `DECISIONS.md` gained a one-line ADR if a real decision was made |
| Clean state        | `node scripts/clean-exit.mjs` fails any of 5 dims     | `clean-exit` passes all 5                   | + `logs/run.jsonl` tail is informative and readable                |

Per-task scoring sheet

Copy this block and fill in once per task, per run:

### Task: <T1 / T2 / T3> — Run <A-harness / B-prompt-only>

- Time to claim done: <minutes>
- verify.sh: <GREEN | RED | n/a>
- clean-exit: <PASS | FAIL>

| Dimension              | Score (0/1/2) | Note                                    |
|------------------------|---------------|-----------------------------------------|
| Correctness            |               |                                         |
| Scope adherence        |               |                                         |
| Verification rigor     |               |                                         |
| Citation rule          |               |                                         |
| Handoff readiness      |               |                                         |
| Clean state            |               |                                         |

Total: __ / 12.
Failure attribution (only if any dimension scored 0): defense layer = <task-spec / context / env / verification / state>.

Rules of fairness

- Score at the moment the agent declares done; do not re-score from memory.
- Use the same rubric and same scorer (you) for both runs of a given task.
- The 60-minute time cap counts; running over is a 0 on Correctness regardless of final state.
- The Note column is mandatory whenever a score is 0 or 2. A bare number teaches nothing the next time you run this.
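Two of these rules are mechanical enough to check in code. The sketch below is an assumption-laden illustration, not an existing script: `applyFairnessRules`, the row shape, and the wording of the appended note are all hypothetical.

```javascript
// Hypothetical helper for illustration; not part of the capstone repo.
const TIME_CAP_MINUTES = 60;

// minutes: time to claim done.
// rows: array of { dimension, score, note } taken from a filled scoring sheet.
function applyFairnessRules(minutes, rows) {
  const overCap = minutes > TIME_CAP_MINUTES;
  const adjusted = rows.map((row) => {
    if (overCap && row.dimension === "Correctness") {
      // Running over the cap forces Correctness to 0 regardless of final state.
      const note = [row.note, `ran over the ${TIME_CAP_MINUTES}-minute cap`]
        .filter(Boolean)
        .join("; ");
      return { ...row, score: 0, note };
    }
    return { ...row };
  });
  // A note is mandatory whenever a score is 0 or 2.
  const missingNotes = adjusted
    .filter((row) => (row.score === 0 || row.score === 2) && !(row.note || "").trim())
    .map((row) => row.dimension);
  return { rows: adjusted, missingNotes };
}

// Made-up sheet: the agent took 72 minutes and one strong score has no note.
const checked = applyFairnessRules(72, [
  { dimension: "Correctness", score: 2, note: "e2e covers the new behavior" },
  { dimension: "Scope adherence", score: 1, note: "" },
  { dimension: "Clean state", score: 2, note: "" },
]);
console.log(checked.rows[0], checked.missingNotes);
```

A non-empty `missingNotes` list means the sheet is not finished yet; fill in those notes before recording the total.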

What to do with the totals

Sum the three task totals for each run (maximum 36 per run). Report both totals in the ablation report's results table. The differential (Run A total − Run B total) is your headline number; treat it as an indication of direction, not as a benchmark.
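As a worked example of the arithmetic, the sketch below sums per-task totals for each run and computes the differential. The numbers are invented for illustration and `runTotal` is a hypothetical helper; only the T1/T2/T3 labels and the 12-point per-task maximum come from the rubric.

```javascript
// Hypothetical helper for illustration; not part of the capstone repo.
function runTotal(taskTotals) {
  return Object.values(taskTotals).reduce((sum, t) => sum + t, 0);
}

// Invented per-task totals (each out of 12), NOT real results.
const runA = { T1: 10, T2: 9, T3: 11 }; // Run A: harness
const runB = { T1: 7, T2: 8, T3: 6 };   // Run B: prompt-only

const maxPerRun = Object.keys(runA).length * 12; // 3 tasks x 12 points
const totalA = runTotal(runA);
const totalB = runTotal(runB);
const differential = totalA - totalB; // Run A total minus Run B total

const sign = differential >= 0 ? "+" : "";
console.log(`Run A: ${totalA} / ${maxPerRun}`);
console.log(`Run B: ${totalB} / ${maxPerRun}`);
console.log(`Differential: ${sign}${differential}`);
```

With these invented inputs the totals are 30 and 21, so the differential is +9 in favor of Run A.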
