Pairs with: all 12 lectures and Project 06 — Complete Harness Capstone. Time: 2 – 3 hrs. Difficulty: Advanced. Prerequisites: Module 10 checkpoint (./verify.sh and node scripts/clean-exit.mjs both exit 0).

Module 11. Capstone — Ablation Study

Why this module

You have built the full harness around noted-cli. The remaining question is how much it actually matters. The honest way to answer that is an ablation: take the same agent, give it the same task list, and run it twice — once against the harness intact, once with the harness stripped — then measure the difference.

This is the project-defining artifact of the course. By the end you will have an ablation-report.md and a quality-document.md you can show to a colleague, a manager, or a future you who has forgotten why any of this matters.

Concepts

Ablation: a controlled comparison where you remove one variable (the harness) and hold everything else fixed (model, task, scoring rubric, evaluator).
Quality document: a recurring snapshot of the codebase scored by domain and by harness layer. Run after each session; the trend is the signal.
Verified completion rate (VCR), recap: the share of tasks whose verification passes, and the principal metric for "did the harness help." A 0/3 versus a 3/3 across the same task list is the headline.
Reproducibility checklist: the small ritual that keeps the ablation fair: same task list, same model, same allotted time, same evaluator rubric, fresh repo state for each run.
Failure attribution: for each red task, name which of the five defense layers the failure traces to. Without attribution, an ablation produces a number; with attribution, it produces learning.

→ Read Project 06 for the protocol this module is shaped after, scaled down for noted-cli.
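
The reproducibility checklist above can be checked mechanically. The following is a minimal sketch, assuming you describe each run as a plain object; the field names (`model`, `taskList`, `minutesPerTask`, `rubric`) and the value `"model-x"` are illustrative placeholders, not a schema the course defines.

```javascript
// Hypothetical run descriptors: field names are illustrative only.
// Returns the fields that differ between two runs, ignoring the one
// variable the ablation is allowed to change.
function uncontrolledDifferences(runA, runB, allowed = ["harness"]) {
  const keys = new Set([...Object.keys(runA), ...Object.keys(runB)]);
  const diffs = [];
  for (const key of keys) {
    if (allowed.includes(key)) continue;
    // Compare by serialized value so nested settings are covered too.
    if (JSON.stringify(runA[key]) !== JSON.stringify(runB[key])) diffs.push(key);
  }
  return diffs.sort();
}

const runA = {
  harness: true,
  model: "model-x", // placeholder name
  taskList: "runs/tasks.md",
  minutesPerTask: 60,
  rubric: "evaluator-rubric.md",
};
const runB = { ...runA, harness: false };

// An empty array means only the harness changed between the two runs.
console.log(uncontrolledDifferences(runA, runB));
```

If the returned list is not empty, the comparison is no longer an ablation of the harness alone, and the result should be reported with that caveat.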

Lab

Step 1 — Choose three tasks the agent will run

Pick three tasks of medium difficulty, each one completable by an agent in 30 to 60 minutes. The set below is a suggestion; feel free to swap:

T1 — noted ask --top 5: support a --top <n> flag; respect existing citation rule.
T2 — noted import --since <iso>: re-import only files modified after a given ISO timestamp; preserve other state.
T3 — noted index --bigrams: extend the index to include bigram tokens alongside unigrams; stay within the existing JSON schema.
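
T2 hinges on one predicate: is a file's modification time after the cutoff? A sketch of that predicate follows, in a form you could reuse when writing the T2 verification script. It assumes each file is described as `{ path, mtimeMs }`, with `mtimeMs` in epoch milliseconds (the unit Node's `fs.statSync(path).mtimeMs` reports); the function name is hypothetical and says nothing about how noted-cli stores its import state.

```javascript
// Assumed input shape: [{ path: string, mtimeMs: number }, ...]
function selectModifiedSince(files, iso) {
  const cutoff = Date.parse(iso);
  if (Number.isNaN(cutoff)) {
    throw new Error(`not a valid ISO timestamp: ${iso}`);
  }
  // Strictly greater than, matching "mtime > <iso>": a file modified
  // exactly at the cutoff is not re-imported.
  return files.filter((f) => f.mtimeMs > cutoff).map((f) => f.path);
}
```

Testing the boundary (one file just before, one exactly at, one just after the cutoff) is the cheapest way to catch an off-by-one in the agent's implementation.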

Write the task list in runs/tasks.md (commit it; both runs reference the same file):

mkdir -p runs
cat > runs/tasks.md <<'EOF'
# Capstone task set

Each task has a verification command. The harness run will encode each
into `feature_list.json`; the no-harness run is given only this file.

## T1 — noted ask --top <n>
Behavior: `noted ask "<q>" --top 5` returns up to 5 results, each with the
required `cite:` line. `--top 0` exits 2.
Verify: ./bin/noted ask alpha --top 5 | grep -c '^\[' is between 1 and 5.

## T2 — noted import --since <iso>
Behavior: only ingest files whose mtime > <iso>; previously-imported notes
remain present.
Verify: scripted in scripts/test-import-since.mjs (you write this file).

## T3 — noted index --bigrams
Behavior: index.json gains a `bigrams` array; `ask` is unchanged.
Verify: ./bin/noted index --bigrams && jq -e '.bigrams | length > 0' .noted/index.json.
EOF
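
The T1 verification line is stated in prose ("is between 1 and 5"), so it helps to pin it down as a check that returns pass or fail. A minimal sketch, assuming result lines begin with `[` as the `grep -c '^\['` in the task file implies; the function names are hypothetical, and the real output format of `noted ask` is not shown here.

```javascript
// Mirrors `grep -c '^\['`: counts output lines that begin with "[".
function countBracketLines(output) {
  return output.split("\n").filter((line) => line.startsWith("[")).length;
}

// T1 passes when the count is at least 1 and at most the requested --top value.
function verifyTop(output, top) {
  const count = countBracketLines(output);
  return count >= 1 && count <= top;
}
```

In a real script you would capture the CLI's stdout and pass it in as `output`; an empty result and a result longer than `--top` both fail.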

Step 2 — Set the rules of the comparison

Write runs/protocol.md:

# Ablation protocol

- Same agent (model + tool surface) for both runs.
- 60 minutes per task, hard cap.
- Fresh repo state at each run start: `git clean -fdx && git checkout .` then `./init.sh`.
- Evaluator scores each task per `evaluator-rubric.md`.
- A run records: time-to-claim-done, verification result, evaluator score,
  failure attribution if not green.
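
The last bullet of the protocol describes a record per task. One way to keep both runs in the same shape is to build each record through a small validating function. This is a sketch under assumptions: the property names are hypothetical, the layer names are taken from the attribution line in the report template, and the 0 to 10 rubric range and 60-minute cap come from this module.

```javascript
// Layer names as listed in the report template's attribution line.
const LAYERS = ["task-spec", "context", "env", "verification", "state"];

function makeTaskRecord({ task, minutesToClaimDone, verify, rubricScore, attribution = null }) {
  if (!["GREEN", "RED"].includes(verify)) {
    throw new Error(`verify must be GREEN or RED, got ${verify}`);
  }
  if (minutesToClaimDone < 0 || minutesToClaimDone > 60) {
    throw new Error("time must fall within the 60-minute cap");
  }
  if (rubricScore < 0 || rubricScore > 10) {
    throw new Error("rubric score must be between 0 and 10");
  }
  // The protocol requires an attribution for every task that is not green.
  if (verify === "RED" && !LAYERS.includes(attribution)) {
    throw new Error(`RED task ${task} needs an attribution from: ${LAYERS.join(", ")}`);
  }
  return {
    task,
    minutesToClaimDone,
    verify,
    rubricScore,
    attribution: verify === "GREEN" ? null : attribution,
  };
}
```

Refusing to create a RED record without an attribution enforces the protocol at the moment of scoring rather than at report time.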

Step 3 — Snapshot the harness

git tag m10-checkpoint
git archive --format=tar -o runs/harness-snapshot.tar HEAD

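The tar archive lets you reset both runs to an identical starting state even if the tag is lost, since tags live only in your local history unless you push them. Keep a copy of the archive, and later of each run's records under runs/, outside the working tree or committed: the protocol's git clean -fdx deletes untracked and ignored files.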

Step 4 — Run A: with the harness

git checkout -b run-A-harness
./init.sh

Open a fresh agent session. Prompt:

Read AGENTS.md, then read runs/tasks.md. Implement T1, T2, T3 in order. For each task, run pnpm wip activate <id>, implement, run ./verify.sh, then pnpm wip pass <id>. Update PROGRESS.md and commit between tasks.

Record everything in runs/A-harness/:

timing per task,
final feature_list.json,
last 50 lines of logs/run.jsonl,
evaluator-rubric.md filled in for each task.
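
Capturing the last 50 lines of a JSONL log is easy to get subtly wrong (blank trailing lines, fewer than 50 entries). A minimal sketch, assuming only that `logs/run.jsonl` holds one JSON object per line; the function name is hypothetical and the event fields are whatever your harness writes.

```javascript
// Returns the last n non-empty lines of a JSONL string, parsed into objects.
function tailJsonl(text, n = 50) {
  const lines = text.split("\n").filter((line) => line.trim() !== "");
  // slice(-0) would return everything, so handle n <= 0 explicitly.
  const tail = n > 0 ? lines.slice(-n) : [];
  return tail.map((line) => JSON.parse(line));
}
```

In a script you would read the file first, for example with `fs.readFileSync("logs/run.jsonl", "utf8")`, and write the result into `runs/A-harness/`. A parse error here is itself worth recording: it means the log was truncated mid-write.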

Step 5 — Run B: without the harness

git checkout main
git checkout -b run-B-prompt-only
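# Strip the harness (--ignore-unmatch skips any path git does not track,
# so one untracked entry such as logs does not abort the whole command):
git rm -rf --ignore-unmatch AGENTS.md docs/ feature_list.json scripts/wip.mjs scripts/check-boundaries.mjs sprint-contract.md evaluator-rubric.md verify.sh init.sh logs PROGRESS.md DECISIONS.md
rm -rf logs   # also remove it from disk if it was untracked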
git commit -q -m "ablation: strip harness for run B baseline"

Open a fresh agent session. Prompt:

Read runs/tasks.md and implement T1, T2, T3. You can run pnpm test if it exists; otherwise verify however you can.

Record the results in runs/B-prompt-only/, using the same layout as Run A.

Step 6 — Score and compare

For each task in each run, fill in the rubric. Then write ablation-report.md:

# Ablation Report — noted-cli

## Setup

- Agent: <model name + version>
- Date: <date>
- Tasks: T1, T2, T3 from `runs/tasks.md`
- Time cap: 60 min/task
- Evaluator: `evaluator-rubric.md`

## Results

| Task | Run A (harness) | Run B (no harness) |
|------|-----------------|--------------------|
| T1   | done in <m>m, rubric <0–10>, verify GREEN | done in <m>m, rubric <0–10>, verify <GREEN/RED> |
| T2   | …               | …                  |
| T3   | …               | …                  |

VCR(A): X/3.   VCR(B): Y/3.

## Failure attribution (Run B reds)

For each task that did not turn green in Run B, name the defense layer:

- Task <id>: layer <task-spec / context / env / verification / state>.
- ...

## Observations

- <2-4 sentences naming the qualitative difference: scope creep, false
  declarations, missing citations, etc.>

## Caveats

- Sample size of one. The ablation shows *direction*, not effect size.
- Both runs share the same human reading the rubric; bias is possible.
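
The VCR line and the results rows in the template can be derived from the per-task records rather than typed by hand. A sketch, assuming each record is an object with hypothetical fields `task`, `minutes`, `rubric`, and `verify` ("GREEN" or "RED"):

```javascript
// Verified completion rate: tasks whose verification is GREEN, over all tasks.
function vcr(records) {
  const green = records.filter((r) => r.verify === "GREEN").length;
  return `${green}/${records.length}`;
}

// One cell of the results table, in the template's wording.
function resultCell(r) {
  return `done in ${r.minutes}m, rubric ${r.rubric}, verify ${r.verify}`;
}

// One markdown table row comparing the same task across both runs.
function resultsRow(task, a, b) {
  return `| ${task} | ${resultCell(a)} | ${resultCell(b)} |`;
}
```

Note that VCR counts only the verification result. A task with a high rubric score and a RED verification still counts as not completed.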

Step 7 — Update the quality document

Copy ../../resources/templates/quality-document.md into your repo as quality-document.md. Score noted-cli along its layers:

Instructions (AGENTS.md + topic docs)
Tools (CLI verbs + scripts)
Environment (Node 20 + init.sh)
State (.noted/, feature_list.json, PROGRESS.md, DECISIONS.md)
Feedback (verify.sh, logs/, rubric)

Give each layer a score and a one-sentence justification. The artifact is durable: re-score every few sessions and watch the trend.
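
"Watch the trend" amounts to comparing two snapshots layer by layer. A minimal sketch, assuming you keep each snapshot as an object mapping layer name to a numeric score; the scale is whatever the template prescribes, and the function name is hypothetical.

```javascript
// Per-layer change between two quality snapshots.
// A layer missing from the previous snapshot gets null (no trend yet).
function layerTrend(previous, current) {
  const trend = {};
  for (const layer of Object.keys(current)) {
    trend[layer] = layer in previous ? current[layer] - previous[layer] : null;
  }
  return trend;
}
```

A negative delta on a single layer is a more useful signal than the overall total, because it points at where to look first.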

Step 8 — Final commit

git checkout main
git checkout -- .   # back to the harness intact
git add .
git commit -q -m "module-11: capstone ablation report and quality document"
git tag course-complete

Verification

test -f ablation-report.md && \
test -f quality-document.md && \
grep -q "VCR(A)" ablation-report.md && \
grep -qE "## (Failure attribution|Observations)" ablation-report.md && \
./verify.sh >/dev/null 2>&1 && \
node scripts/clean-exit.mjs >/dev/null 2>&1 && \
echo "M11 OK — course complete"

Expected:

M11 OK — course complete
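
The check above confirms that the report has the right sections, but an untouched template would also pass it. A heuristic sketch for spotting placeholders that were never filled in; the patterns are assumptions based on the template in Step 6, and the first one will also flag legitimate angle-bracket text such as `--top <n>`, so treat hits as prompts to review rather than failures.

```javascript
// Returns every leftover template placeholder found in the report text.
function findUnfilledPlaceholders(reportText) {
  const patterns = [
    /<[^<>]{1,200}>/g,       // <date>, <m>, <GREEN/RED>, ...
    /VCR\([AB]\): [XY]\/3/g, // VCR(A): X/3 with the letter still in place
    /…/g,                    // table cells still holding an ellipsis
  ];
  return patterns.flatMap((p) => reportText.match(p) ?? []);
}
```

An empty array means none of the known placeholders remain.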

Common pitfalls

Sneaking the harness into Run B. If you let yourself "just paste in feature_list.json," the ablation is dead. Use the git rm -rf step exactly.
Letting the agent decide when each run is done. The 60-minute cap is the protocol. If the agent says "done" early, run the verification immediately and stop.
Rubric scoring after the fact, against your memory. Score each task at the moment the agent declares done. Memory drifts.
Treating the result as a benchmark. It is one ablation on one repo on one day. The direction is the lesson; the magnitude is anecdotal.
Skipping the failure attribution. Without it, the report is a scoreboard. With it, the report tells the next person which layer to invest in.
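
To turn attributions into a statement about which layer to invest in, tally the red tasks by layer. A sketch, assuming each red task is recorded as `{ task, layer }` with the layer names from the report template; the function name is hypothetical.

```javascript
// Counts red tasks per defense layer, most frequent first
// (ties broken alphabetically by layer name).
function tallyByLayer(redTasks) {
  const counts = {};
  for (const { layer } of redTasks) {
    counts[layer] = (counts[layer] ?? 0) + 1;
  }
  return Object.entries(counts).sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0])
  );
}
```

With three tasks the tally is tiny, but the same function applies unchanged if you repeat the ablation with a longer task list or another model.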

Next

You have a working noted-cli repo with a full harness, an ablation report, and a quality document. Pick one:

Apply the harness to your real project. The 12 modules' patterns transfer 1:1. Start with ../../resources/templates/ and follow the substitutions noted in AUTHORING.md.
Read the source lectures end-to-end. You have built the ideas; the lectures will sharpen the reasoning you need to defend them in a code review.
Re-run the ablation against a different model. Same harness, different agent — the gap should hold.

If you only remember three things from the course:

Most agent failures are harness-induced; the model is rarely the bottleneck.
A feature is not done until ./verify.sh exits 0 and the evidence is recorded.
Every session ends at clean state across all five dimensions, or the next session pays the bill.

Welcome to harness engineering.
