Does your CLAUDE.md matter? A recent paper said no. So I tested 9 popular CLAUDE.md / AGENTS.md files across three increasingly loose tasks.
The answer is: it depends on how much of a plan you give the agent.
Each experiment uses the same 9 mds, the same agent, and the same judging rubric. The only thing that changes between them is how much planning is in the prompt.
Experiment 1: 8 surgical tasks (4 from Dory's repo, 4 synthetic rule-tests). Each task has a clear before/after, so the agent has nowhere to go off-script.
Experiment 2: Build a Serbian real-estate scraper from a detailed plan. Multiple files, vision API, dedup, filters. The plan does the structural work.
Experiment 3: A single-sentence prompt: "Build me an online shop with a web interface for selling shoes." The CLAUDE.md is now the only thing differentiating the agents.
Finally, the cross-experiment view: variance numbers across all three experiments in one table. Lines added, files touched, quality scores, refusal counts: the whole curve.
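For concreteness, here is a minimal sketch of the kind of harness this setup implies. It is not the rig actually used here: the mds/ directory layout, the AGENT_CMD invocation (Claude Code's non-interactive -p flag is one candidate), the git-diff metrics, and the "no files written means refusal" heuristic are all assumptions for illustration.

```python
# Hypothetical harness sketch: run each CLAUDE.md variant against one prompt,
# several times each, and collect crude spread metrics. The agent command,
# directory layout, and refusal heuristic are assumptions, not the author's setup.
import json
import re
import shutil
import statistics
import subprocess
import tempfile
from pathlib import Path

AGENT_CMD = ["claude", "-p"]             # assumed non-interactive agent CLI; swap in your own
MDS = sorted(Path("mds").glob("*.md"))   # the 9 CLAUDE.md variants (hypothetical layout)
RUNS = 3

def fresh_workdir() -> Path:
    """Empty git repo per run, so the diff reflects only what the agent wrote."""
    d = Path(tempfile.mkdtemp(prefix="mdtest-"))
    subprocess.run(["git", "init", "-q"], cwd=d, check=True)
    return d

def run_once(md: Path, prompt: str) -> dict:
    workdir = fresh_workdir()
    shutil.copy(md, workdir / "CLAUDE.md")             # install the variant under test
    subprocess.run(AGENT_CMD + [prompt], cwd=workdir)  # let the agent build in the sandbox
    subprocess.run(["git", "add", "-A"], cwd=workdir, check=True)
    diff = subprocess.run(
        ["git", "diff", "--cached", "--shortstat", "--", ".", ":(exclude)CLAUDE.md"],
        cwd=workdir, capture_output=True, text=True,
    ).stdout
    files = int(re.search(r"(\d+) file", diff).group(1)) if "file" in diff else 0
    added = int(re.search(r"(\d+) insertion", diff).group(1)) if "insertion" in diff else 0
    # crude heuristic: nothing written beyond CLAUDE.md itself counts as a refusal
    return {"md": md.stem, "files": files, "added": added, "refused": files == 0}

def summarize(records: list[dict]) -> dict:
    """Per-md mean and spread of lines added, plus refusal counts, across runs."""
    by_md: dict[str, list[dict]] = {}
    for r in records:
        by_md.setdefault(r["md"], []).append(r)
    return {
        md: {
            "added_mean": statistics.mean(r["added"] for r in rs),
            "added_stdev": statistics.pstdev(r["added"] for r in rs),
            "refusals": sum(r["refused"] for r in rs),
        }
        for md, rs in by_md.items()
    }

if __name__ == "__main__":
    prompt = "Build me an online shop with a web interface for selling shoes."
    records = [run_once(md, prompt) for md in MDS for _ in range(RUNS)]
    print(json.dumps(summarize(records), indent=2))
```

The point of the sketch is the loop structure (md × prompt × repeated runs) behind the variance table; real quality scores would need a rubric judge layered on top of these mechanical counts.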
Your CLAUDE.md is co-authoring every reply.
When your prompts carry the plan, the md is a small style nudge. When your prompts are vague, the md fills the gap — sometimes well, sometimes by stopping the conversation.
v0 (empty CLAUDE.md) ranked 7th of 9 on the planned task. It ranked 2nd of 9 on the free-form one. That single rank-flip refutes "always use Karpathy's md" or any other universal-best advice.
One md refused to write code on 1 of 3 runs. Same prompt, same agent: its rules said "always plan first" and overrode the user's "do not ask clarifying questions; build it." Re-sending the prompt verbatim got past the gate.
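For context, the kind of rule that creates this gate looks roughly like the excerpt below. This is a hypothetical illustration, not a quote from the md in question.

```markdown
## Workflow
- ALWAYS write out a plan and wait for approval before writing any code.
- If the request is ambiguous, stop and ask clarifying questions instead of guessing.
```

Written that absolutely, the rule can outrank a one-off instruction in the prompt, which is exactly the failure mode above.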
The fix isn't a better md. It's a clearer prompt.