Pairs with: Lecture 01 — Why Capable Agents Still Fail. Time: ~45 min. Difficulty: Basic. Prerequisites: Module 00 checkpoint (./bin/noted --help exits 0).

Module 01. Why Models Need Harnesses

Why this module

Strong models do not produce reliable execution on their own. Lecture 01 calls the gap between benchmark performance and real-world reliability the capability gap. The cheapest way to feel this gap is to give the same agent the same task twice — once with no harness, once with a one-line spec — and watch the difference. That is what you will do here.

By the end of this module you will have an empirical baseline on your own machine, a numbered table of results, and an instinct for the failure modes the next ten modules close.

Concepts

A few terms from Lecture 01 you will use the rest of the course:

  • Capability gap — the delta between a model scoring well on a benchmark and it shipping reliably against a real codebase. Empirically wide: SWE-bench solve rates and real-engineering "first-try-correct" rates routinely differ by 30 – 40 percentage points.
  • Verification gap — the agent's confidence is not the same as the work being correct. An agent will frequently report "done" while leaving a feature broken end-to-end.
  • Five defense layers — task specification, context provision, execution environment, verification feedback, state management. Every failure can be attributed to one of the five. This module exercises only "task specification"; the other four are deliberately absent.
  • Diagnostic loop — execute, observe, attribute the failure to a layer, fix that layer, re-execute. We start using it today.
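
One way to keep the diagnostic loop honest is to write down each attribution as you make it. The sketch below is a hypothetical helper: the function name, the log path, and the tab-separated line format are assumptions, not part of the course tooling. It appends one line per loop iteration and rejects any layer name outside the five listed above.

```sh
# Hypothetical helper: log_attribution LOGFILE RUN LAYER NOTE
# Appends "run<TAB>layer<TAB>note" to LOGFILE. LAYER must be one of the
# five defense layers from Lecture 01; anything else is rejected.
log_attribution() {
  logfile=$1; run=$2; layer=$3; note=$4
  case "$layer" in
    "task specification"|"context provision"|"execution environment"|\
    "verification feedback"|"state management")
      printf '%s\t%s\t%s\n' "$run" "$layer" "$note" >> "$logfile"
      ;;
    *)
      echo "unknown layer: $layer" >&2
      return 1
      ;;
  esac
}
```

For example, `log_attribution course-runs/M01-attributions.tsv A "task specification" "no --json flag"` would record one observation for Run A (the file name is only a suggestion).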

→ Read Lecture 01 for the long-form treatment, the citations, and the detailed argument.

Lab

You will run two versions of the same task against your noted-cli repo from Module 00. The task is small on purpose — about ten minutes of human work. Use whichever AI agent you have at hand (Claude Code, Cursor, Codex, etc.). If you do not have one, do the runs yourself and play the role of "agent that follows the prompt literally" — the comparison still works.

Step 1 — Define the task once, in writing

Create course-runs/M01-task.md inside your noted-cli/ repo (yes, inside the project — Module 12 uses it):

sh
mkdir -p course-runs
cat > course-runs/M01-task.md <<'EOF'
TASK: Add a `noted version` subcommand that prints the version from package.json
and exits 0. If the user runs `noted version --json`, output {"version":"<v>"}.

DONE WHEN:
- ./bin/noted version prints exactly the version string from package.json + newline
- ./bin/noted version --json prints valid JSON with a "version" key
- both invocations exit 0
- ./bin/noted whatever still exits 2
EOF

Step 2 — Run A: prompt-only

Open a fresh agent session (or terminal window). Paste only the line below as the prompt. Do not point it at the task file, do not show it package.json, do not let it open cli.ts. Just the line.

Add a version subcommand to the noted CLI.

Time it. Stop the run when the agent declares "done" or after 15 minutes, whichever comes first. Then run, in your project root:

sh
./bin/noted version 2>&1; echo "exit: $?"
./bin/noted version --json 2>&1; echo "exit: $?"
./bin/noted whatever 2>&1; echo "exit: $?"

Record what you see, verbatim, in a file inside a new directory:

sh
mkdir -p course-runs/M01-A-prompt-only
# paste outputs and notes here
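
To capture the output verbatim rather than retyping it, the three commands can be wrapped in a small function. This is only a sketch: `capture_run` and the `output.txt` file name are hypothetical choices, not names the course requires.

```sh
# Hypothetical helper: capture_run OUTDIR [CLI]
# Runs the three check commands and writes their combined stdout/stderr
# plus exit codes to OUTDIR/output.txt (an assumed file name).
capture_run() {
  outdir=$1
  cli=${2:-./bin/noted}
  mkdir -p "$outdir"
  {
    for args in "version" "version --json" "whatever"; do
      echo "\$ $cli $args"
      code=0
      # $args is intentionally unquoted so "version --json" splits in two.
      "$cli" $args 2>&1 || code=$?
      echo "exit: $code"
    done
  } > "$outdir/output.txt"
}
```

Calling `capture_run course-runs/M01-A-prompt-only` from the project root would record Run A; the same function works for Run B with a different directory.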

Step 3 — Reset the repo

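sh
# Commit course-runs/ first; `git stash -u` would otherwise stash the task file and Run A records too.
git add course-runs
git commit -q -m "module-01: task and Run A records" -- course-runs
git stash -u || true
git checkout -- .
git clean -fd

You are back at the Module 00 checkpoint, with course-runs/ committed so Run B can still read the task file.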
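
Before starting Run B it is worth confirming that the reset actually took. A minimal sketch, assuming a hypothetical function name `tree_is_clean`; it relies on `git status --porcelain` printing nothing when there are no modified, staged, or untracked files.

```sh
# Hypothetical helper: succeeds only if the working tree has no modified,
# staged, or untracked files (ignored files are not reported).
tree_is_clean() {
  status=$(git status --porcelain) || return 1
  if [ -n "$status" ]; then
    echo "working tree is not clean:" >&2
    echo "$status" >&2
    return 1
  fi
}
```

Run `tree_is_clean && echo "ready for Run B"` in the project root; any leftover file from Run A is listed on stderr instead.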

Step 4 — Run B: one-line spec

Open another fresh agent session. This time the prompt is:

Read course-runs/M01-task.md and complete the task. Verify by running the three commands listed under DONE WHEN before declaring complete.

Time it. Stop on "done" or 15 minutes. Then run the same three commands as Step 2 and capture the output in course-runs/M01-B-with-spec/.
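
If your agent does not report elapsed time, a wall-clock stamp before and after each run is enough for the results table. The sketch below uses epoch seconds from `date +%s` (widely supported, though not strictly POSIX); `format_elapsed` is a hypothetical name.

```sh
# Hypothetical helper: format_elapsed START END
# START and END are epoch seconds; prints the difference as "MmSSs".
format_elapsed() {
  secs=$(( $2 - $1 ))
  printf '%dm%02ds\n' $(( secs / 60 )) $(( secs % 60 ))
}

# Usage around a run:
#   start=$(date +%s)
#   ... run the agent until it says "done" or 15 minutes pass ...
#   format_elapsed "$start" "$(date +%s)"
```

For instance, `format_elapsed 100 165` prints `1m05s`.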

Step 5 — Compare

Fill in this table inside course-runs/M01-results.md:

md
| Run | Time to "done" | version exit code | version --json valid JSON | unknown still exits 2 | Files agent edited |
|-----|----------------|-------------------|---------------------------|-----------------------|--------------------|
| A   |                |                   |                           |                       |                    |
| B   |                |                   |                           |                       |                    |

You will almost always see Run A finish faster but with one or more verification cells failing. Run B is slower but passes all three. That delta — same model, same code, same human — is the harness effect.
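
The three verification cells can also be filled in mechanically. The following is a sketch under stated assumptions, not the course's grader: `pkg_version` and `check_done_when` are hypothetical names, the version is pulled from package.json with `sed` rather than a real JSON parser, and the `--json` check is a string comparison after stripping whitespace rather than a full JSON parse. Command substitution drops trailing newlines, so the "+ newline" part of DONE WHEN is not checked here.

```sh
# Hypothetical helper: print the first "version" value found in a
# package.json. Assumes a conventional file layout; not a JSON parser.
pkg_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: check_done_when CLI PACKAGE_JSON
# Prints one PASS/FAIL line per criterion; returns non-zero on any FAIL.
check_done_when() {
  cli=$1
  want=$(pkg_version "$2")
  rc=0

  code=0
  out=$("$cli" version 2>/dev/null) || code=$?
  if [ "$code" -eq 0 ] && [ "$out" = "$want" ]; then
    echo "PASS version"
  else
    echo "FAIL version"; rc=1
  fi

  code=0
  out=$("$cli" version --json 2>/dev/null) || code=$?
  # Strip whitespace so pretty-printed output also matches.
  compact=$(printf '%s' "$out" | tr -d '[:space:]')
  if [ "$code" -eq 0 ] && [ "$compact" = "{\"version\":\"$want\"}" ]; then
    echo "PASS version --json"
  else
    echo "FAIL version --json"; rc=1
  fi

  code=0
  "$cli" whatever >/dev/null 2>&1 || code=$?
  if [ "$code" -eq 2 ]; then
    echo "PASS unknown command exits 2"
  else
    echo "FAIL unknown command exits 2"; rc=1
  fi

  return "$rc"
}
```

Run it as `check_done_when ./bin/noted package.json` after each run and copy the PASS/FAIL lines into the table. Because the JSON check only accepts a single `version` key, it is stricter than DONE WHEN, which asks for any valid JSON containing that key.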

Step 6 — Note the layer

In M01-results.md, add one paragraph under the table answering: Which of the five defense layers explains the difference between Run A and Run B? The answer is "task specification," but write out how you can tell from the outputs.

Step 7 — Commit

sh
git add .
git commit -q -m "module-01: A/B harness comparison run"

Verification

sh
test -f course-runs/M01-task.md && \
test -f course-runs/M01-results.md && \
grep -q "task specification" course-runs/M01-results.md && \
echo "M01 OK"

Expected:

M01 OK

This does not check that Run A failed and Run B passed — your specific runs may vary. It checks that you wrote the task, recorded results, and attributed the gap to a defense layer.

Common pitfalls

  • Helping the agent in Run A. The whole point is the prompt is underspecified. Resist clarifying.
  • Comparing different models. Use the same agent and the same model for both runs. The variable is the spec, not the model.
  • Skipping the reset. If you do not reset between runs, Run B starts from a half-broken Run A and the comparison is meaningless.
  • Treating "looks right" as passing. If version --json prints version: 0.1.0 (no JSON braces) and you write "passed," you have just demonstrated the verification gap on yourself.
  • Forgetting unknown still exits 2. A common Run A regression is that the agent rewrites the unknown-command branch and breaks Module 00's checkpoint. Agents will keep introducing this kind of regression throughout the course, so the harness has to defend against it.

Next

Module 02 — Anatomy of a Harness. You start filling in the four subsystems Run A was missing.