Auto-Research Pattern Requires Upfront Eval Criteria

evaluation methodology tooling pattern r&d

1 What happened

The auto-research pattern (Karpathy-style iterative prompting) was adopted for the LeanIX catalog extraction skill. Initial attempts without clear eval criteria produced iterations that "improved" subjectively but couldn't be measured. Once we defined binary pass/fail criteria (schema completeness, mode detection accuracy, validation rules), the loop converged in 10 iterations to a 10/10 score. Without criteria, the same loop had run 15+ iterations without converging.

2 What we learned

The auto-research pattern is powerful, but only when paired with explicit, binary evaluation criteria defined before the first iteration. "Make it better" loops diverge; "does it pass these 10 checks?" loops converge. The eval criteria file should be written as a separate artifact and version-controlled alongside the target, which also makes the criteria themselves auditable and improvable.
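
To make "binary criteria defined before the first iteration" concrete, here is a minimal Python sketch of a loop that stops only when every check passes. It is an illustration under stated assumptions, not the actual criteria file used for the extraction skill: the field names, the allowed mode values, the three checks, and the generate callback are all hypothetical.

```python
from typing import Callable, Dict, List

# A criterion is a named check that returns True (pass) or False (fail).
Check = Callable[[dict], bool]

# Hypothetical required fields and mode values, chosen only for illustration.
REQUIRED_FIELDS = {"name", "type", "mode"}
ALLOWED_MODES = {"full", "incremental"}

# Written before the first iteration and kept under version control.
CRITERIA: Dict[str, Check] = {
    "schema_complete": lambda out: REQUIRED_FIELDS.issubset(out),
    "mode_detected": lambda out: out.get("mode") in ALLOWED_MODES,
    "name_non_empty": lambda out: bool(str(out.get("name", "")).strip()),
}


def evaluate(output: dict) -> Dict[str, bool]:
    """Run every criterion against one candidate output."""
    return {name: bool(check(output)) for name, check in CRITERIA.items()}


def run_loop(
    generate: Callable[[List[str]], dict],
    max_iterations: int = 10,
) -> List[Dict[str, bool]]:
    """Call generate() with the names of the failed checks until all pass.

    generate stands in for whatever produces the next candidate
    (for example, a revised prompt run against the model).
    Returns the per-iteration results so progress can be measured.
    """
    history: List[Dict[str, bool]] = []
    failed: List[str] = []
    for _ in range(max_iterations):
        results = evaluate(generate(failed))
        history.append(results)
        failed = [name for name, ok in results.items() if not ok]
        if not failed:
            break  # every check passed: the loop has converged
    return history
```

Because each check is pass/fail, the score for an iteration is simply the count of True values in its result, and the list of failed names tells the next iteration exactly what to fix. Keeping CRITERIA in its own file lets reviewers diff changes to the criteria separately from changes to the target.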

3 Applies to

Any future use of the auto-research skill: always write the eval criteria first.