1
The plan flies the plane. CLAUDE.md is the co-pilot.
Tight prompt → 9 different mds converge to roughly the same code. Loose prompt → the md becomes the structural decision-maker. Your md isn't a style guide. It's a fallback that activates when your prompt is weak.
2
Fewer rules → more stability.
The empty CLAUDE.md had the lowest run-to-run stdev on both real experiments — 0.12 on scraper, 0.00 on shop. Every rule you add is a new place for variance to enter.
3
All 9 mds produce good code.
26 of 27 free-form builds shipped. The differences between mds are noticeable, but every single one is good enough to use. Pick the one whose shape you like.
4
The best md combines both styles — kept minimal.
Declarative mds (Karpathy, Codex) won on stability. Flow-aware mds (mine) had real advantages too: v2 scored 2.81 on the runs that built, ahead of Karpathy's 2.70. The sweet spot is pairing declarative conventions (docstrings, type hints) with a few high-value flow rules (plan first, write tests for regressions), then stopping there. v3, v4, v5 lost because they kept adding.
The winning recipe combines declarative conventions with a few high-value flow rules — kept minimal.
①
Declare conventions, not procedures
Say what to follow: docstrings, type hints, naming, file layout. Don't tell the agent how to think. This is where Karpathy and Codex won.
②
Add a few flow rules — but stop early
A short flow nudge helps: plan first, write tests for regressions and new functionality. v2 (mine) scored 2.81 on builds that ran. v3/v4/v5 lost because they kept adding rules. Pick 2–3 rules and stop; a sketch combining ① and ② follows this list.
③
Write tighter prompts
The 237-line spec in Report 2 cut variant spread from 0.87 → 0.43. A clear prompt beats any md.
④
State the need, not the method
Tell the agent what the user needs. Let it choose how. The mds that prescribed how to think (v3, v4, v5) ranked lowest.
⑤
Read first, change narrowly
Tell it to change only what's needed — after reading enough of the codebase to know what "needed" means. Surgery without anatomy is butchery.
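Put together, the recipe stays short. A minimal sketch of ① + ②, illustrative wording only (it assumes a Python-flavoured project and copies none of the nine tested variants):

    # CLAUDE.md
    ## Conventions (declarative)
    - Public functions get docstrings and type hints.
    - Match the existing naming and file layout; don't add new top-level directories.
    ## Flow (2–3 rules, no more)
    - For multi-file changes, outline a short plan before writing code.
    - Add a test for every bug fix and every new behavior.

Everything in the first half is a convention the agent can satisfy or not; only the second half touches flow, and it stops at two rules.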
Two CLAUDE.mds. Same one-sentence prompt: "Build me an online shop with a web interface for selling shoes." Same agent, same model. The structural differences come entirely from the md.
9 mds, 3 experiments, 378 LLM judgements. Per-variant verdicts.
Works well
v7 — OpenAI Codex AGENTS.md
Most consistent across all three regimes
Evidence: Top-3 quality on every experiment (2.83 / 2.37 / 2.74). Best stability on shop (stdev 0.03). 3rd best on scraper (stdev 0.17).
Why: Narrow, specific rules (mostly Rust/codex-rs conventions) that don't try to control conversational flow — so they can't conflict with user intent.
Recommended default. If you must pick one md to copy, this is it.
Works well
v1 — Karpathy rules
Highest quality + lowest code volume
Evidence: Best score on cell tests (2.85) and scraper (2.46), top-3 on shop (2.70). Produced 580 lines on shop — half of what most variants wrote.
Why: "Behavioral guidelines to reduce common LLM mistakes" — declarative and short. Doesn't tell the agent how to think, just what to avoid.
Pick this if you value simplicity and trust Claude to choose sensible defaults. Run-to-run stdev on shop (0.22) is a touch higher than v0's or v7's.
Works well
v0 — empty (no md)
Most stable across runs, every time
Evidence: Lowest run-to-run stdev on both scraper (0.12) and shop (perfect 0.00 — same 2.72 three times). #2 on shop quality.
Why: No rules = no rule conflicts = no surprising behavior. The agent falls back to its training defaults, which are reproducible.
The reproducibility champion. Use this when you care more about "same output every time" than "best possible output." Loses on planned tasks where v1/v7 have a small edge.
Mixed
v4 — Dory's AGENTS_full (1027 lines)
Length didn't help
Evidence: Middling quality everywhere (2.81 / 2.31 / 2.61). 1027 lines of rules and the score sits below 50-line Karpathy on every regime.
Why: The agent isn't reading a constitution — long mds create more places for instructions to conflict, more rules to half-apply.
Trim aggressively. There's no quality reward for verbosity at this length.
Mixed
v8 — shanraisshan
Competitive but unstable on planned tasks
Evidence: Top-tier on cell (2.83) and shop (2.59) quality. But scraper stdev = 0.60 — same md, same prompt, three runs swinging by a full point.
Why: Worth digging into separately. The rules look reasonable, but something about them amplifies run-to-run noise on multi-file builds.
Useful as a starting point, but run any eval at N=3 before trusting a single comparison.
Works well
v2 — mine (AGENTS_light, 57 lines)
Highest quality when it builds — and the only md that knew when to pause
Evidence: 2.84 on cell tests, a hair behind v1's 2.85. Top score on shop when it builds (2.83 on r1, 2.78 on r3). On 1 of 3 shop runs it paused before writing code, citing its own "plan-first" rule.
Why: The pause is a feature. Asked to build a vague shop with no plan, the agent followed the md, recognised the ambiguity, and asked for one. Re-prompting the same conversation broke the gate and produced its highest score yet. That's the md doing its job, not failing.
If you adopt this approach, build in a clear escape hatch — a phrase or signal the user can use to confirm "yes, I really mean go". The pause is valuable; the friction of recovering shouldn't be.
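One way to phrase that gate plus escape hatch, hypothetical wording rather than a quote from v2:

    ## Flow
    - If the request is ambiguous, reply with a short plan and wait for confirmation before writing code.
    - If the user says "go ahead" (or explicitly asks to skip planning), build immediately; don't ask again.

The second rule is the sanctioned override: one message from the user confirms "go" without breaking the gate.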
Caveats
v6 — HumanLayer CLAUDE.md
Mode-collapsed once
Evidence: On shop r1, picked Node + committed node_modules/ (596 dep files, 65k lines vendored). r2 and r3 didn't repeat the mistake. Highest stdev among non-refusing variants (0.80 on shop, 0.58 on scraper).
Why: The best guess is that a rule somewhere encourages "go fast, ship runnable code" without a counter-rule about diff hygiene. Small prompt change, big swing.
Add explicit .gitignore guidance and dependency-handling rules if you adopt this.
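A minimal sketch of that guidance, assumed wording that is not part of the HumanLayer md:

    ## Repo hygiene
    - Never commit dependency or build directories (node_modules/, dist/, target/, __pycache__/).
    - Create or update .gitignore to cover them before the first commit.
    - Keep generated and vendored files out of version control so diffs stay reviewable.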
Caveats
v3 — Dory's AGENTS_medium (147 lines) · v5 — medium+karpathy
More rules, more code, lower quality (on planned tasks)
Evidence: v3 scraper: 2.11. v5 scraper: 2.04. Both at the bottom. Both also produce the most code on the scraper (2,563 / 2,517 lines vs Karpathy's 2,134).
Why: Verbose mds tend to push agents toward verbose code: more files, more abstractions, more places for the rubric to dock points.
If you're tempted to add rules to fix a single bad output, resist. Tight prompts beat extensive mds.
①
Length doesn't predict quality. 0 lines (v0) → 50 lines (v1 Karpathy) → 1027 lines (v4 full) all produced similar mean quality. The longest md cluster (v3, v4, v5) actually scored worse than the shortest on the planned scraper. More rules ≠ better code.
②
Hard procedural rules backfire under loose prompts. v2's combination of "always plan first" + "rules win conflicts with the user" caused the only refusal in the experiment. If your md installs gates ("never X without approval"), expect them to fire when prompts get vague.
③
Declarative beats procedural. The two top-3-on-everything mds (Karpathy, Codex) describe what to avoid and what conventions to follow — not how to think or when to ask. Behavioral guardrails don't conflict; flow-control rules do.
④
No md is best across all regimes. Out of 9 variants, only v0 (empty) and v7 (Codex) appeared in top-3 on multiple stability and quality dimensions across both planned and free-form. Every other md wins on one regime and loses on another. If you're picking an md, pick for your specific task profile, not from a leaderboard.