LLM Evaluation Loop
Why This Matters
Guidelines are hypotheses about what the model needs to know. Some work well. Others get ignored or misinterpreted. Without a feedback loop, you can't tell the difference.
An evaluation loop tracks which guidelines are effective, which are failing, and how to iterate, turning guideline writing from guesswork into a systematic process.
What to Include
- How to track LLM mistakes: categories, frequency, patterns
- How to trace mistakes to guidelines: was the guideline missing, unclear, or ignored?
- When to update guidelines: triggers for revision
- Prompt pattern library: recording what works for reuse
How to Write It
- Start simple. A shared document or issue label where the team records "the LLM got this wrong." No tooling needed.
- Categorize failures. Wrong framework, wrong file location, wrong test type, security violation, architecture violation: each category maps to a guideline.
- Review monthly. Look at failure patterns. If one category dominates, that guideline needs work (see the tally sketch after this list).
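To make the monthly review mechanical rather than anecdotal, a short script can pull the tagged issues and tally them per category. The sketch below is one possible approach, assuming failures are GitHub issues that carry both the "llm-quality" label and one category label; the repository name, the category-to-guideline mapping, and the GITHUB_TOKEN environment variable are illustrative placeholders, not part of any prescribed setup.

```python
"""Monthly tally of "llm-quality" issues, grouped by failure category.

A minimal sketch: it assumes failures are GitHub issues carrying the
"llm-quality" label plus one category label. REPO, the category mapping,
and GITHUB_TOKEN are placeholders. Requires the third-party `requests`
package; pagination and retries are omitted for brevity.
"""
import os
from collections import Counter

import requests

REPO = "your-org/your-repo"   # placeholder
TRACKING_LABEL = "llm-quality"
CATEGORIES = {                # category label -> guideline to revisit
    "wrong-framework": "Tech Stack",
    "wrong-location": "Directory Structure",
    "wrong-test-type": "Testing Strategy",
    "security-issue": "Security Basics",
    "style-mismatch": "Coding Standards",
    "scope-creep": "Project Scope",
}


def fetch_issues() -> list[dict]:
    """Fetch open and closed issues that carry the tracking label."""
    resp = requests.get(
        f"https://api.github.com/repos/{REPO}/issues",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        params={"labels": TRACKING_LABEL, "state": "all", "per_page": 100},
        timeout=30,
    )
    resp.raise_for_status()
    # The issues endpoint also returns pull requests; skip those.
    return [i for i in resp.json() if "pull_request" not in i]


def tally(issues: list[dict]) -> Counter:
    """Count issues per failure-category label."""
    counts: Counter = Counter()
    for issue in issues:
        for label in issue["labels"]:
            if label["name"] in CATEGORIES:
                counts[label["name"]] += 1
    return counts


if __name__ == "__main__":
    for category, n in tally(fetch_issues()).most_common():
        print(f"{n:3d}  {category:18s} -> revisit the {CATEGORIES[category]} guideline")
```

Run at the end of each month, this gives the raw counts for the review; the top category points at the guideline that needs rewriting or expanding.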
Example
## LLM Evaluation
Track failures in: GitHub issues with label "llm-quality"
Failure categories:
- wrong-framework → Tech Stack guideline
- wrong-location → Directory Structure guideline
- wrong-test-type → Testing Strategy guideline
- security-issue → Security Basics guideline
- style-mismatch → Coding Standards guideline
- scope-creep → Project Scope guideline
Monthly review:
- Count failures by category
- Top category → rewrite or expand that guideline
- If a guideline is followed consistently → it's working, leave it
- If a guideline is consistently ignored → it may be too vague, too buried, or conflicting with another guideline
Prompt patterns:
- Record prompts that produce consistently good results
- Share in docs/prompts/ as reusable snippets
- Retire prompts that stop working (model updates)
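A prompt pattern library does not need tooling either; a flat folder of Markdown snippets is enough. As a hypothetical illustration, the sketch below assumes one file per pattern under docs/prompts/, with the first line as a short title and the word "retired" somewhere in the file marking prompts that stopped working after a model update.

```python
"""List reusable prompt patterns stored under docs/prompts/.

A hypothetical layout: one Markdown file per pattern, first line is the
title, and the word "retired" anywhere in the file flags prompts that no
longer work. Adjust to whatever convention your team actually picks.
"""
from pathlib import Path

PROMPT_DIR = Path("docs/prompts")


def list_patterns() -> None:
    for path in sorted(PROMPT_DIR.glob("*.md")):
        text = path.read_text(encoding="utf-8")
        first_line = text.splitlines()[0] if text.strip() else ""
        title = first_line.lstrip("# ").strip() or "(untitled)"
        status = "retired" if "retired" in text.lower() else "active"
        print(f"[{status}] {path.name}: {title}")


if __name__ == "__main__":
    list_patterns()
```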
Common Mistakes
Not tracking at all. Without data, guideline updates are based on gut feeling. Even a simple tally is better than nothing.
Blaming the model instead of the guideline. When the LLM makes a mistake, the first question should be "is there a guideline for this?" followed by "is the guideline clear enough?" The model follows instructions; if it's not following yours, the instructions may be the problem.
Never retiring guidelines. Some guidelines become unnecessary as the model improves or the project stabilizes. Keeping them wastes token budget.
Tool-Specific Notes
- Claude Code: CLAUDE.md supports iterative refinement. Use /init to regenerate it periodically and compare the result with your manually maintained version.
- All tools: This is a process guideline, not a tool configuration. It works the same regardless of which LLM tool you use.