Pairs with: all 12 lectures and Project 06 — Complete Harness Capstone. Time: 2 – 3 hrs. Difficulty: Advanced. Prerequisites: Module 10 checkpoint (./verify.sh and node scripts/clean-exit.mjs both exit 0).
Module 11. Capstone — Ablation Study
Why this module
You have built the full harness around noted-cli. The remaining question is how much it actually matters. The honest way to answer that is an ablation: take the same agent, give it the same task list, and run it twice — once against the harness intact, once with the harness stripped — then measure the difference.
This is the project-defining artifact of the course. By the end you will have an ablation-report.md and a quality-document.md you can show to a colleague, a manager, or a future you who has forgotten why any of this matters.
Concepts
- Ablation — controlled comparison where you remove one variable (the harness) and hold everything else fixed (model, task, scoring rubric, evaluator).
- Quality document — a recurring snapshot of the codebase scored by domain and by harness layer. Run after each session; the trend is the signal.
- Verified completion rate (VCR), recap: the headline metric for "did the harness help," counted as tasks that pass verification out of tasks attempted. A 0/3 vs. a 3/3 across the same task list is the headline.
- Reproducibility checklist — the small ritual that keeps the ablation fair: same task list, same model, same allotted time, same evaluator rubric, fresh repo state for each run.
- Failure attribution — for each red task, name which of the five defense layers the failure traces to. Without attribution, an ablation produces a number; with attribution, it produces learning.
→ Read Project 06 for the protocol this module is shaped after, scaled down for noted-cli.
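The two numbers this module revolves around, VCR and failure attribution, can be sketched as a pair of small functions. This is a minimal illustration, not part of the course harness: the record shape (`task`, `verify`, `layer`) and the function names are assumptions, and the sample data is made up. The five layer names are taken from the report template later in this module.

```javascript
// Hypothetical record shape: { task, verify: "GREEN" | "RED", layer? }.
// Layer names follow the failure-attribution line of the report template.
const LAYERS = ["task-spec", "context", "env", "verification", "state"];

// Verified completion rate: tasks whose verification passed, over tasks attempted.
function vcr(records) {
  const green = records.filter((r) => r.verify === "GREEN").length;
  return { green, total: records.length, label: `${green}/${records.length}` };
}

// Group red tasks by attributed layer, and flag any red task that has
// no valid attribution so it cannot pass unexplained.
function attributeFailures(records) {
  const byLayer = {};
  const unattributed = [];
  for (const r of records.filter((rec) => rec.verify !== "GREEN")) {
    if (LAYERS.includes(r.layer)) {
      (byLayer[r.layer] ??= []).push(r.task);
    } else {
      unattributed.push(r.task);
    }
  }
  return { byLayer, unattributed };
}

// Made-up sample data, for illustration only.
const runB = [
  { task: "T1", verify: "GREEN" },
  { task: "T2", verify: "RED", layer: "verification" },
  { task: "T3", verify: "RED" },
];
console.log(vcr(runB).label); // 1/3
console.log(attributeFailures(runB));
```

The `unattributed` list is the useful part: a red task that lands there means the report is still a scoreboard for that task.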
Lab
Step 1 — Choose three tasks the agent will run
Pick three tasks of moderate difficulty, each achievable by an agent in 30 – 60 minutes. The set below is a suggestion; feel free to swap:
- T1 — noted ask --top 5: support a --top <n> flag; respect the existing citation rule.
- T2 — noted import --since <iso>: re-import only files modified after a given ISO timestamp; preserve other state.
- T3 — noted index --bigrams: extend the index to include bigram tokens alongside unigrams; stay within the existing JSON schema.
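To make the expected behavior of each task concrete, here is a hedged sketch of the core logic as pure functions. None of this is noted-cli's actual code: the function names, the `{ path, mtimeMs }` file shape, and the tokenization rule are all assumptions. Only "`--top 0` exits 2", "mtime > <iso>", and "bigrams alongside unigrams" come from the task descriptions.

```javascript
// T1: validate --top <n>. The task list only says `--top 0` exits 2;
// rejecting negatives and non-integers the same way is an assumption.
function parseTop(value) {
  const n = Number(value);
  if (!Number.isInteger(n) || n < 1) return { ok: false, exitCode: 2 };
  return { ok: true, top: n };
}

// T2: keep only files whose mtime is strictly after the ISO timestamp.
// `files` is a hypothetical array of { path, mtimeMs }.
function modifiedSince(files, iso) {
  const cutoff = Date.parse(iso);
  if (Number.isNaN(cutoff)) throw new Error(`invalid ISO timestamp: ${iso}`);
  return files.filter((f) => f.mtimeMs > cutoff);
}

// T3: bigrams are adjacent pairs of unigram tokens.
// Lowercasing and splitting on non-alphanumerics is an assumed tokenizer.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

function bigrams(tokens) {
  const pairs = [];
  for (let i = 0; i + 1 < tokens.length; i++) {
    pairs.push(`${tokens[i]} ${tokens[i + 1]}`);
  }
  return pairs;
}

console.log(parseTop("5"));                          // { ok: true, top: 5 }
console.log(parseTop("0"));                          // { ok: false, exitCode: 2 }
console.log(bigrams(tokenize("Alpha beta, gamma"))); // [ 'alpha beta', 'beta gamma' ]
```

Keeping these as pure functions is a design choice for the sketch: each one can be checked without touching the filesystem, which is what a verification command ultimately needs.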
Write the task list in runs/tasks.md (commit it; both runs reference the same file):
sh
mkdir -p runs
cat > runs/tasks.md <<'EOF'
# Capstone task set
Each task has a verification command. The harness run will encode each
into `feature_list.json`; the no-harness run is given only this file.
## T1 — noted ask --top <n>
Behavior: `noted ask "<q>" --top 5` returns up to 5 results, each with the
required `cite:` line. `--top 0` exits 2.
Verify: ./bin/noted ask alpha --top 5 | grep -c '^\[' is between 1 and 5.
## T2 — noted import --since <iso>
Behavior: only ingest files whose mtime > <iso>; previously-imported notes
remain present.
Verify: scripted in scripts/test-import-since.mjs (you write this file).
## T3 — noted index --bigrams
Behavior: index.json gains a `bigrams` array; `ask` is unchanged.
Verify: ./bin/noted index --bigrams && jq -e '.bigrams | length > 0' .noted/index.json.
EOF
Step 2 — Set the rules of the comparison
Write runs/protocol.md:
md
# Ablation protocol
- Same agent (model + tool surface) for both runs.
- 60 minutes per task, hard cap.
- Fresh repo state at each run start: `git clean -fdx && git checkout .` then `./init.sh`.
- Evaluator scores each task per `evaluator-rubric.md`.
- A run records: time-to-claim-done, verification result, evaluator score,
failure attribution if not green.
Step 3 — Snapshot the harness
sh
git tag m10-checkpoint
git archive --format=tar -o runs/harness-snapshot.tar HEAD
The tar archive lets you reset both runs to an identical starting state without depending on git tags surviving local-only history.
Step 4 — Run A: with the harness
sh
git checkout -b run-A-harness
./init.sh
Open a fresh agent session. Prompt:
Read AGENTS.md, then read runs/tasks.md. Implement T1, T2, T3 in order. For each task, run pnpm wip activate <id>, implement, run ./verify.sh, then pnpm wip pass <id>. Update PROGRESS.md and commit between tasks.
Record everything in runs/A-harness/:
- timing per task,
- final feature_list.json,
- last 50 lines of logs/run.jsonl,
- evaluator-rubric.md filled in for each task.
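A per-task record is only useful for comparison if it is complete. The following sketch checks one record against the fields the protocol asks for (time-to-claim-done, verification result, evaluator score, failure attribution if not green). The field names are hypothetical, and the 0 to 10 rubric range is taken from the report template's `<0–10>` placeholder.

```javascript
const TIME_CAP_MIN = 60; // from the protocol: 60 minutes per task, hard cap

// Returns a list of problems; an empty list means the record is complete.
// Field names (minutesToClaimDone, verify, rubricScore, attribution) are assumptions.
function checkRecord(record) {
  const problems = [];
  const minutes = record.minutesToClaimDone;
  if (!Number.isFinite(minutes) || minutes < 0 || minutes > TIME_CAP_MIN) {
    problems.push("time-to-claim-done missing or over the cap");
  }
  if (record.verify !== "GREEN" && record.verify !== "RED") {
    problems.push("verification result must be GREEN or RED");
  }
  const score = record.rubricScore;
  if (!Number.isFinite(score) || score < 0 || score > 10) {
    problems.push("rubric score must be between 0 and 10");
  }
  if (record.verify === "RED" && !record.attribution) {
    problems.push("red task without failure attribution");
  }
  return problems;
}

// Made-up records, for illustration only.
console.log(checkRecord({ task: "T1", minutesToClaimDone: 42, verify: "GREEN", rubricScore: 8 }));
console.log(checkRecord({ task: "T2", minutesToClaimDone: 75, verify: "RED", rubricScore: 3 }));
```

Running the same check over both runs' records keeps the two result sets in the same shape, which is what makes the later side-by-side table meaningful.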
Step 5 — Run B: without the harness
sh
git checkout main
git checkout -b run-B-prompt-only
# Strip the harness:
git rm -rf AGENTS.md docs/ feature_list.json scripts/wip.mjs scripts/check-boundaries.mjs sprint-contract.md evaluator-rubric.md verify.sh init.sh logs PROGRESS.md DECISIONS.md
git commit -q -m "ablation: strip harness for run B baseline"
Open a fresh agent session. Prompt:
Read runs/tasks.md and implement T1, T2, T3. You can run pnpm test if it exists; otherwise verify however you can.
Record runs/B-prompt-only/ in the same shape as Run A.
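Before starting the Run B session, it can help to confirm that nothing from the harness is still tracked. A minimal sketch, assuming you feed it the tracked file list (for example the lines printed by `git ls-files`): the path list is copied from the `git rm -rf` command above, while the function name and the sample file names are hypothetical.

```javascript
// Paths removed by the `git rm -rf` step in Step 5.
const HARNESS_PATHS = [
  "AGENTS.md", "docs/", "feature_list.json", "scripts/wip.mjs",
  "scripts/check-boundaries.mjs", "sprint-contract.md", "evaluator-rubric.md",
  "verify.sh", "init.sh", "logs", "PROGRESS.md", "DECISIONS.md",
];

// Return every tracked file that is a harness path or lives under one.
function harnessLeaks(trackedFiles) {
  return trackedFiles.filter((file) =>
    HARNESS_PATHS.some((p) => {
      const dirPrefix = p.endsWith("/") ? p : `${p}/`;
      return file === p || file.startsWith(dirPrefix);
    })
  );
}

// Hypothetical file list, for illustration only.
const tracked = ["bin/noted", "runs/tasks.md", "docs/cli.md", "verify.sh", "scripts/clean-exit.mjs"];
console.log(harnessLeaks(tracked)); // [ 'docs/cli.md', 'verify.sh' ]
```

An empty result is the condition for a fair baseline; anything else means the harness has leaked into Run B.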
Step 6 — Score and compare
For each task in each run, fill the rubric. Then write ablation-report.md:
md
# Ablation Report — noted-cli
## Setup
- Agent: <model name + version>
- Date: <date>
- Tasks: T1, T2, T3 from `runs/tasks.md`
- Time cap: 60 min/task
- Evaluator: `evaluator-rubric.md`
## Results
| Task | Run A (harness) | Run B (no harness) |
|------|-----------------|--------------------|
| T1 | done in <m>m, rubric <0–10>, verify GREEN | done in <m>m, rubric <0–10>, verify <GREEN/RED> |
| T2 | … | … |
| T3 | … | … |
VCR(A): X/3. VCR(B): Y/3.
## Failure attribution (Run B reds)
For each task that did not turn green in Run B, name the defense layer:
- Task <id>: layer <task-spec / context / env / verification / state>.
- ...
## Observations
- <2-4 sentences naming the qualitative difference: scope creep, false
declarations, missing citations, etc.>
## Caveats
- Sample size of one. The ablation shows *direction*, not effect size.
- Both runs share the same human reading the rubric; bias is possible.
Step 7 — Update the quality document
Copy ../../resources/templates/quality-document.md into your repo as quality-document.md. Score noted-cli along its layers:
- Instructions (AGENTS.md + topic docs)
- Tools (CLI verbs + scripts)
- Environment (Node 20 + init.sh)
- State (.noted/, feature_list.json, PROGRESS.md, DECISIONS.md)
- Feedback (verify.sh, logs/, rubric)
A score per layer plus one sentence of justification each. The artifact is durable — re-score every few sessions and watch the trend.
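Since the trend is the signal, comparing two snapshots layer by layer is the natural operation on this document. The sketch below assumes each snapshot is a plain object mapping layer name to a numeric score; that shape, the scoring scale, and the sample numbers are all illustrative assumptions, not something the template prescribes.

```javascript
// Layer names come from the list above.
const QUALITY_LAYERS = ["Instructions", "Tools", "Environment", "State", "Feedback"];

// Compare two snapshots and report the per-layer change.
function layerTrend(previous, current) {
  return QUALITY_LAYERS.map((layer) => ({
    layer,
    from: previous[layer],
    to: current[layer],
    delta: current[layer] - previous[layer],
  }));
}

// Made-up scores, for illustration only.
const earlier = { Instructions: 6, Tools: 7, Environment: 8, State: 5, Feedback: 4 };
const latest = { Instructions: 7, Tools: 7, Environment: 8, State: 6, Feedback: 7 };

for (const row of layerTrend(earlier, latest)) {
  if (row.delta !== 0) console.log(`${row.layer}: ${row.from} -> ${row.to}`);
}
```

A layer whose delta stays flat or negative across several sessions is a candidate for the next round of harness work.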
Step 8 — Final commit
sh
git checkout main
git checkout -- . # back to the harness intact
git add .
git commit -q -m "module-11: capstone ablation report and quality document"
git tag course-complete
Verification
sh
test -f ablation-report.md && \
test -f quality-document.md && \
grep -q "VCR(A)" ablation-report.md && \
grep -qE "## (Failure attribution|Observations)" ablation-report.md && \
./verify.sh >/dev/null 2>&1 && \
node scripts/clean-exit.mjs >/dev/null 2>&1 && \
echo "M11 OK — course complete"
Expected:
M11 OK — course complete
Common pitfalls
- Sneaking the harness into Run B. If you let yourself "just paste in feature_list.json," the ablation is dead. Use the git rm -rf step exactly.
- Letting the agent decide when each run is done. The 60-minute cap is the protocol. If the agent says "done" early, run the verification immediately and stop.
- Rubric scoring after the fact, against your memory. Score each task at the moment the agent declares done. Memory drifts.
- Treating the result as a benchmark. It is one ablation on one repo on one day. The direction is the lesson; the magnitude is anecdotal.
- Skipping the failure attribution. Without it, the report is a scoreboard. With it, the report tells the next person which layer to invest in.
Next
You have a working noted-cli repo with a full harness, an ablation report, and a quality document. Pick one:
- Apply the harness to your real project. The 12 modules' patterns transfer 1:1. Start with ../../resources/templates/ and follow the substitutions noted in AUTHORING.md.
- Read the source lectures end-to-end. You have built the ideas; the lectures will sharpen the why you can use to defend them in a code review.
- Re-run the ablation against a different model. Same harness, different agent; if the harness is what drives the result, the gap should hold.
If you only remember three things from the course:
- Most agent failures are harness-induced; the model is rarely the bottleneck.
- A feature is not done until ./verify.sh exits 0 and the evidence is recorded.
- Every session ends at clean state across all five dimensions, or the next session pays the bill.
Welcome to harness engineering.
