Research · CLAUDE.md eval
Research · AI Will Replace You · 2026-05-07

Karpathy vs 8 CLAUDE.MDs

Karpathy's CLAUDE.md vs the internet's most-liked CLAUDE.md / AGENTS.md files — plus a few of my own.
The results aren't what you'd expect.

9
CLAUDE.md variants tested
empty · Karpathy · Codex · HumanLayer · light · medium · full · merged · shanraisshan
126
code builds
9 variants × 14 task-runs (8 cell + 3 scraper + 3 shop)
378
LLM judgements
Opus 4.7 + Sonnet 4.6 + Haiku 4.5
27
live smoke tests
install → boot → curl → grep for shoes

The arc — three experiments, one curve

Each experiment is the same 9 mds, same agent, judged by the same rubric. The only thing that changes between them is how much planning is in the prompt. Click any step to drill in.

The reveal

Your CLAUDE.md is co-authoring every reply.

When your prompts carry the plan, the md is a small style nudge. When your prompts are vague, the md fills the gap — sometimes well, sometimes by stopping the conversation.

v0 (empty CLAUDE.md) ranked 7th of 9 on the planned task. It ranked 2nd of 9 on the free-form one. That single rank-flip refutes "always use Karpathy's md" or any other universal-best advice.

One md refused to write code on 1 of 3 runs. Same prompt, same agent — its rules said "always plan first" and won against the user's "do not ask clarifying questions; build it." Re-sending the prompt verbatim broke the gate.

The fix isn't a better md. It's a clearer prompt.

Variant spread vs prompt specificity
Cell tests
0.14
Planned project
0.43
Free-form
0.87
← tighter prompt  ·  looser prompt →

Watch · read · subscribe