Pairs with: all 12 lectures and Project 06 — Complete Harness Capstone. Time: 2 – 3 hrs. Difficulty: Advanced. Prerequisites: Module 10 checkpoint (./verify.sh and node scripts/clean-exit.mjs both exit 0).

Module 11. Capstone — Ablation Study

Why this module

You have built the full harness around noted-cli. The remaining question is how much it actually matters. The honest way to answer that is an ablation: take the same agent, give it the same task list, and run it twice — once against the harness intact, once with the harness stripped — then measure the difference.

This is the project-defining artifact of the course. By the end you will have an ablation-report.md and a quality-document.md you can show to a colleague, a manager, or a future you who has forgotten why any of this matters.

Concepts

  • Ablation — controlled comparison where you remove one variable (the harness) and hold everything else fixed (model, task, scoring rubric, evaluator).
  • Quality document — a recurring snapshot of the codebase scored by domain and by harness layer. Run after each session; the trend is the signal.
  • Verified completion rate (VCR), recap — the share of tasks whose verification passes, and the principal metric for "did the harness help." A 0/3 vs. a 3/3 across the same task list is the headline.
  • Reproducibility checklist — the small ritual that keeps the ablation fair: same task list, same model, same allotted time, same evaluator rubric, fresh repo state for each run.
  • Failure attribution — for each red task, name which of the five defense layers the failure traces to. Without attribution, an ablation produces a number; with attribution, it produces learning.
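As a minimal sketch of how VCR can be computed from a run, assume one record per task with a `verifyExitCode` field. That record shape is hypothetical and not part of the course files; the only rule taken from the module is that a task counts when its verification command exits 0.

```javascript
// Hypothetical record shape: one entry per task in a run.
// `verifyExitCode` is the exit status of that task's verification command.
function verifiedCompletionRate(records) {
  const verified = records.filter((r) => r.verifyExitCode === 0).length;
  return { verified, total: records.length, label: `${verified}/${records.length}` };
}

const runA = [
  { id: "T1", verifyExitCode: 0 },
  { id: "T2", verifyExitCode: 0 },
  { id: "T3", verifyExitCode: 1 },
];

console.log(`VCR(A): ${verifiedCompletionRate(runA).label}`); // prints "VCR(A): 2/3"
```

Counting exit codes rather than the agent's own "done" claims keeps the metric tied to evidence.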

→ Read Project 06 for the protocol this module is shaped after, scaled down for noted-cli.

Lab

Step 1 — Choose three tasks the agent will run

Pick three tasks of medium difficulty. Each must be achievable by an agent in 30 – 60 minutes. The suggested set is below, but feel free to swap:

  1. T1 — noted ask --top 5: support a --top <n> flag; respect existing citation rule.
  2. T2 — noted import --since <iso>: re-import only files modified after a given ISO timestamp; preserve other state.
  3. T3 — noted index --bigrams: extend the index to include bigram tokens alongside unigrams; stay within the existing JSON schema.
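To make "bigram tokens alongside unigrams" in T3 concrete, here is a small illustration. It is a sketch under assumptions: the tokenizer (lowercase, split on non-alphanumerics) and the space-joined bigram format are illustrative choices, not the actual behavior of noted-cli's indexer.

```javascript
// Assumption: tokens are lowercase words split on non-alphanumeric characters.
function unigrams(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Adjacent token pairs, joined with a single space (an illustrative format).
function bigrams(tokens) {
  const pairs = [];
  for (let i = 0; i + 1 < tokens.length; i += 1) {
    pairs.push(`${tokens[i]} ${tokens[i + 1]}`);
  }
  return pairs;
}

const tokens = unigrams("Harness engineering matters");
console.log(JSON.stringify(bigrams(tokens)));
// prints ["harness engineering","engineering matters"]
```

A text with n tokens yields n - 1 bigrams, so a single-token note contributes none.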

Write the task list in runs/tasks.md (commit it; both runs reference the same file):

sh
mkdir -p runs
cat > runs/tasks.md <<'EOF'
# Capstone task set

Each task has a verification command. The harness run will encode each
into `feature_list.json`; the no-harness run is given only this file.

## T1 — noted ask --top <n>
Behavior: `noted ask "<q>" --top 5` returns up to 5 results, each with the
required `cite:` line. `--top 0` exits 2.
Verify: ./bin/noted ask alpha --top 5 | grep -c '^\[' is between 1 and 5.

## T2 — noted import --since <iso>
Behavior: only ingest files whose mtime > <iso>; previously-imported notes
remain present.
Verify: scripted in scripts/test-import-since.mjs (you write this file).

## T3 — noted index --bigrams
Behavior: index.json gains a `bigrams` array; `ask` is unchanged.
Verify: ./bin/noted index --bigrams && jq -e '.bigrams | length > 0' .noted/index.json.
EOF
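T2's verification script is left for you to write. One possible way to structure its core check is a pure function that can be tested without touching the file system. Everything here is a hypothetical sketch: the `files`, `before`, and `after` shapes are assumptions, and a real `scripts/test-import-since.mjs` would still need to gather them by running `./bin/noted` and reading file mtimes.

```javascript
// files:  [{ id, mtimeMs }] candidate files on disk (hypothetical shape)
// before: note ids present before `noted import --since <iso>`
// after:  note ids present after it
// Returns a list of human-readable problems; an empty list means the check passed.
function checkImportSince({ files, sinceIso, before, after }) {
  const cutoff = Date.parse(sinceIso);
  if (Number.isNaN(cutoff)) throw new Error(`invalid ISO timestamp: ${sinceIso}`);

  const beforeSet = new Set(before);
  const afterSet = new Set(after);
  const problems = [];

  // Previously imported notes must remain present.
  for (const id of before) {
    if (!afterSet.has(id)) problems.push(`lost previously imported note ${id}`);
  }

  for (const f of files) {
    const modifiedAfterCutoff = f.mtimeMs > cutoff;
    // Files modified after the cutoff must be ingested.
    if (modifiedAfterCutoff && !afterSet.has(f.id)) problems.push(`missed ${f.id}`);
    // Files at or before the cutoff must not be newly ingested.
    if (!modifiedAfterCutoff && afterSet.has(f.id) && !beforeSet.has(f.id)) {
      problems.push(`ingested stale ${f.id}`);
    }
  }
  return problems;
}

const problems = checkImportSince({
  files: [
    { id: "old.md", mtimeMs: Date.parse("2024-01-01T00:00:00Z") },
    { id: "new.md", mtimeMs: Date.parse("2024-03-01T00:00:00Z") },
  ],
  sinceIso: "2024-02-01T00:00:00Z",
  before: ["old.md"],
  after: ["old.md", "new.md"],
});
console.log(problems.length === 0 ? "T2 OK" : problems.join("\n")); // prints "T2 OK"
```

A wrapper script would exit non-zero when the returned list is not empty, so the check behaves like the other verification commands.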

Step 2 — Set the rules of the comparison

Write runs/protocol.md:

md
# Ablation protocol

- Same agent (model + tool surface) for both runs.
- 60 minutes per task, hard cap.
- Fresh repo state at each run start: `git clean -fdx && git checkout .` then `./init.sh`.
- Evaluator scores each task per `evaluator-rubric.md`.
- A run records: time-to-claim-done, verification result, evaluator score,
  failure attribution if not green.
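The fields a run records can be captured in a small validated structure. This is a sketch, not a course-provided format: the function name and field names are hypothetical, while the 60-minute cap comes from the protocol and the 0 to 10 rubric range from the report template.

```javascript
const TIME_CAP_MINUTES = 60; // hard cap per task, from the protocol

// Builds one record per task per run and rejects entries that break the protocol.
// `minutesToClaimDone` is null when the agent never claimed done before the cap.
function makeRunRecord({ run, task, minutesToClaimDone, verifyGreen, rubricScore, attribution = null }) {
  if (minutesToClaimDone !== null && minutesToClaimDone > TIME_CAP_MINUTES) {
    throw new Error(`${task}: ${minutesToClaimDone} min exceeds the ${TIME_CAP_MINUTES}-minute cap`);
  }
  if (typeof rubricScore !== "number" || rubricScore < 0 || rubricScore > 10) {
    throw new Error(`${task}: rubric score must be between 0 and 10`);
  }
  if (!verifyGreen && !attribution) {
    throw new Error(`${task}: a red task needs a failure attribution`);
  }
  return { run, task, minutesToClaimDone, verifyGreen, rubricScore, attribution };
}

const record = makeRunRecord({
  run: "A",
  task: "T1",
  minutesToClaimDone: 42,
  verifyGreen: true,
  rubricScore: 8,
});
console.log(JSON.stringify(record));
```

Rejecting a red task without an attribution at record time makes it harder to skip that step later.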

Step 3 — Snapshot the harness

sh
git tag m10-checkpoint
git archive --format=tar -o runs/harness-snapshot.tar HEAD

The tar archive lets you reset both runs to an identical starting state even if the tag is lost; tags live only in your local repository unless you push them.

Step 4 — Run A: with the harness

sh
git checkout -b run-A-harness
./init.sh

Open a fresh agent session. Prompt:

Read AGENTS.md, then read runs/tasks.md. Implement T1, T2, T3 in order. For each task, run pnpm wip activate <id>, implement, run ./verify.sh, then pnpm wip pass <id>. Update PROGRESS.md and commit between tasks.

Record everything in runs/A-harness/:

  • timing per task,
  • final feature_list.json,
  • last 50 lines of logs/run.jsonl,
  • evaluator-rubric.md filled in for each task.
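Capturing the last 50 lines of `logs/run.jsonl` can be sketched as a pure function over the file's text. The event fields in the example are hypothetical, since the log schema is not shown in this module; reading the file and writing the excerpt into `runs/A-harness/` is left to the caller.

```javascript
// Returns the last `n` non-empty lines of a JSONL log, parsed into objects.
// Lines that fail to parse are kept as { unparsed: <raw line> } rather than dropped,
// so a corrupted log entry stays visible in the recorded evidence.
function tailJsonl(text, n = 50) {
  if (n <= 0) return [];
  const lines = text.split("\n").filter((line) => line.trim() !== "");
  return lines.slice(-n).map((line) => {
    try {
      return JSON.parse(line);
    } catch {
      return { unparsed: line };
    }
  });
}

// Hypothetical log content for illustration only.
const sampleLog = [
  '{"event":"activate","task":"T1"}',
  '{"event":"verify","exit":0}',
  '{"event":"pass","task":"T1"}',
].join("\n") + "\n";

console.log(JSON.stringify(tailJsonl(sampleLog, 2)));
// prints [{"event":"verify","exit":0},{"event":"pass","task":"T1"}]
```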

Step 5 — Run B: without the harness

sh
git checkout main
git checkout -b run-B-prompt-only
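# Strip the harness. --ignore-unmatch keeps git rm from aborting if a listed path is not tracked (for example, an ignored logs/):
git rm -rf --ignore-unmatch AGENTS.md docs/ feature_list.json scripts/wip.mjs scripts/check-boundaries.mjs sprint-contract.md evaluator-rubric.md verify.sh init.sh logs PROGRESS.md DECISIONS.md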
git commit -q -m "ablation: strip harness for run B baseline"

Open a fresh agent session. Prompt:

Read runs/tasks.md and implement T1, T2, T3. You can run pnpm test if it exists; otherwise verify however you can.

Record runs/B-prompt-only/ in the same shape as Run A.
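Before starting the Run B session, it may help to confirm that nothing from the harness is still present. The sketch below checks a list of file paths (for instance, the lines printed by `git ls-files`) against the paths removed in the strip step. The function name is hypothetical, and treating `docs/` and `logs/` as directory prefixes is an assumption.

```javascript
// Paths removed by the strip step; entries ending in "/" are treated as directory prefixes.
const HARNESS_PATHS = [
  "AGENTS.md", "docs/", "feature_list.json", "scripts/wip.mjs",
  "scripts/check-boundaries.mjs", "sprint-contract.md", "evaluator-rubric.md",
  "verify.sh", "init.sh", "logs/", "PROGRESS.md", "DECISIONS.md",
];

// Returns every path in `files` that belongs to the harness, in input order.
function harnessLeaks(files) {
  return files.filter((file) =>
    HARNESS_PATHS.some((p) => (p.endsWith("/") ? file.startsWith(p) : file === p))
  );
}

// Hypothetical file listing for illustration.
const tracked = ["bin/noted", "runs/tasks.md", "feature_list.json", "docs/architecture.md"];
console.log(JSON.stringify(harnessLeaks(tracked)));
// prints ["feature_list.json","docs/architecture.md"]
```

An empty result is the precondition for a fair Run B; anything else means the baseline is contaminated.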

Step 6 — Score and compare

For each task in each run, fill the rubric. Then write ablation-report.md:

md
# Ablation Report — noted-cli

## Setup

- Agent: <model name + version>
- Date: <date>
- Tasks: T1, T2, T3 from `runs/tasks.md`
- Time cap: 60 min/task
- Evaluator: `evaluator-rubric.md`

## Results

| Task | Run A (harness) | Run B (no harness) |
|------|-----------------|--------------------|
| T1   | done in <m>m, rubric <0–10>, verify GREEN | done in <m>m, rubric <0–10>, verify <GREEN/RED> |
| T2   | …               | …                  |
| T3   | …               | …                  |

VCR(A): X/3.   VCR(B): Y/3.

## Failure attribution (Run B reds)

For each task that did not turn green in Run B, name the defense layer:

- Task <id>: layer <task-spec / context / env / verification / state>.
- ...

## Observations

- <2-4 sentences naming the qualitative difference: scope creep, false
  declarations, missing citations, etc.>

## Caveats

- Sample size of one. The ablation shows *direction*, not effect size.
- Both runs share the same human reading the rubric; bias is possible.
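The failure attribution section of the report can be generated from a short list of red tasks. This is an illustrative sketch: the layer names are taken from the template, while the function name and the input shape are hypothetical.

```javascript
// Layer names as they appear in the report template.
const LAYERS = ["task-spec", "context", "env", "verification", "state"];

// Counts red tasks per layer and rejects layer names outside the template's list.
function tallyAttribution(redTasks) {
  const counts = Object.fromEntries(LAYERS.map((layer) => [layer, 0]));
  for (const { task, layer } of redTasks) {
    if (!LAYERS.includes(layer)) throw new Error(`${task}: unknown layer "${layer}"`);
    counts[layer] += 1;
  }
  return counts;
}

// Hypothetical Run B outcome for illustration.
const reds = [
  { task: "T2", layer: "verification" },
  { task: "T3", layer: "state" },
];

for (const { task, layer } of reds) {
  console.log(`- Task ${task}: layer ${layer}.`);
}
console.log(JSON.stringify(tallyAttribution(reds)));
// prints {"task-spec":0,"context":0,"env":0,"verification":1,"state":1}
```

With only three tasks the tally is small, but the same shape works if you later extend the task list.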

Step 7 — Update the quality document

Copy ../../resources/templates/quality-document.md into your repo as quality-document.md. Score noted-cli along its layers:

  • Instructions (AGENTS.md + topic docs)
  • Tools (CLI verbs + scripts)
  • Environment (Node 20 + init.sh)
  • State (.noted/, feature_list.json, PROGRESS.md, DECISIONS.md)
  • Feedback (verify.sh, logs/, rubric)

Give each layer a score and one sentence of justification. The artifact is durable: re-score every few sessions and watch the trend.
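Watching the trend can be as simple as comparing the oldest and newest snapshot per layer. The sketch below assumes a hypothetical snapshot shape and numeric scores; the template's actual scoring scale is not shown here, so the numbers are illustrative only.

```javascript
// snapshots: oldest first, each { session, scores: { <layer>: <number>, ... } } (hypothetical shape).
// Returns the change per layer between the first and last snapshot.
function layerTrend(snapshots) {
  if (snapshots.length < 2) return {};
  const first = snapshots[0].scores;
  const last = snapshots[snapshots.length - 1].scores;
  const trend = {};
  for (const layer of Object.keys(last)) {
    // Only compare layers that were scored in both snapshots.
    if (typeof first[layer] === "number" && typeof last[layer] === "number") {
      trend[layer] = last[layer] - first[layer];
    }
  }
  return trend;
}

const history = [
  { session: 1, scores: { Instructions: 6, Feedback: 7 } },
  { session: 4, scores: { Instructions: 8, Feedback: 5 } },
];
console.log(JSON.stringify(layerTrend(history)));
// prints {"Instructions":2,"Feedback":-2}
```

A negative value points at the layer to revisit first.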

Step 8 — Final commit

sh
git checkout main
git checkout -- .   # discard uncommitted changes to tracked files; main still has the harness intact
git add .
git commit -q -m "module-11: capstone ablation report and quality document"
git tag course-complete

Verification

sh
test -f ablation-report.md && \
test -f quality-document.md && \
grep -q "VCR(A)" ablation-report.md && \
grep -qE "## (Failure attribution|Observations)" ablation-report.md && \
./verify.sh >/dev/null 2>&1 && \
node scripts/clean-exit.mjs >/dev/null 2>&1 && \
echo "M11 OK — course complete"

Expected:

M11 OK — course complete

Common pitfalls

  • Sneaking the harness into Run B. If you let yourself "just paste in feature_list.json," the ablation is dead. Use the git rm -rf step exactly.
  • Letting the agent decide when each run is done. The 60-minute cap is the protocol. If the agent says "done" early, run the verification immediately and stop.
  • Rubric scoring after the fact, against your memory. Score each task at the moment the agent declares done. Memory drifts.
  • Treating the result as a benchmark. It is one ablation on one repo on one day. The direction is the lesson; the magnitude is anecdotal.
  • Skipping the failure attribution. Without it, the report is a scoreboard. With it, the report tells the next person which layer to invest in.

Next

You have a working noted-cli repo with a full harness, an ablation report, and a quality document. Pick one:

  • Apply the harness to your real project. The 12 modules' patterns transfer 1:1. Start with ../../resources/templates/ and follow the substitutions noted in AUTHORING.md.
  • Read the source lectures end-to-end. You have built the ideas; the lectures will sharpen the reasoning you need to defend them in a code review.
  • Re-run the ablation against a different model. Keep the harness and swap the agent, then check whether the gap holds.

If you only remember three things from the course:

  1. Most agent failures are harness-induced; the model is rarely the bottleneck.
  2. A feature is not done until ./verify.sh exits 0 and the evidence is recorded.
  3. Every session ends at clean state across all five dimensions, or the next session pays the bill.

Welcome to harness engineering.