Skip to content

Capstone Evaluator Rubric

A scoped, copy-able version of ../../resources/templates/evaluator-rubric.md, tuned for the three capstone tasks. Use one filled rubric per task per run.

How to score

Score each dimension 0, 1, or 2. A 0 on any dimension is an automatic fail for the task — even if the verify script later turns green, a 0 indicates a structural failure that should be noted in the ablation report's failure-attribution section.

Dimensions

Dimension0 (fail)1 (acceptable)2 (strong)
Correctnessverify.sh red, or behavior contradicts taskverify.sh green; behavior matches task+ Layer 3 e2e covers the new behavior, not just unit tests
Scope adherenceFiles outside the feature's scope were editedEdits stayed within scope+ Scope was narrowed during the run with justification
Verification rigorOnly unit-level checks (no e2e against the binary)Unit + e2e both run+ The verification command is generalizable (no hard-coded ids)
Citation rulecite: line missing or wrong pathcite: line present on every result+ Citation also appears in --json output where applicable
Handoff readinessPROGRESS.md Next Action stale or missingPROGRESS.md updated and accurate+ DECISIONS.md gained a one-line ADR if a real decision was made
Clean statenode scripts/clean-exit.mjs fails any of 5 dimsclean-exit passes all 5+ logs/run.jsonl tail is informative and readable

Per-task scoring sheet

Copy this block and fill in once per task, per run:

md
### Task: <T1 / T2 / T3> — Run <A-harness / B-prompt-only>

- Time to claim done: <minutes>
- verify.sh: <GREEN | RED | n/a>
- clean-exit: <PASS | FAIL>

| Dimension              | Score (0/1/2) | Note                                    |
|------------------------|---------------|-----------------------------------------|
| Correctness            |               |                                         |
| Scope adherence        |               |                                         |
| Verification rigor     |               |                                         |
| Citation rule          |               |                                         |
| Handoff readiness      |               |                                         |
| Clean state            |               |                                         |

Total: __ / 12.
Failure attribution (only if any dimension scored 0): defense layer = <task-spec / context / env / verification / state>.

Rules of fairness

  • Score at the moment the agent declares done; do not re-score from memory.
  • Use the same rubric and same scorer (you) for both runs of a given task.
  • The 60-minute time cap counts; running over is a 0 on Correctness regardless of final state.
  • Notes column is mandatory whenever a score is 0 or 2. A bare number teaches nothing the next time you run this.

What to do with the totals

Sum across the three tasks per run. Report both totals in the ablation report's results table. The differential (Run A total − Run B total) is your headline number; do not trust it as a benchmark, but do trust it as direction.