I’ve covered LLM behavioral failure modes, context management, and decision processes in previous posts. In this one, I’ll take a step forward and ask: how can we optimize the prompt itself to get better output from the model?
“Write better prompts” is advice everyone knows but nobody concretizes. In this post, I’ll try to unpack the mechanism behind the cliche as I understand and use it: how LLMs retrieve knowledge, why generic prompts fall short, and how the knowledge anchor approach addresses this.
| Quick Reference | |
|---|---|
| Scope | Domain-specific prompt optimization |
| Source count | 30+ academic studies and official documentation (2021-2026) |
| Practical focus | Knowledge anchor definition, structured format comparison, optimization steps |
| Related concepts | LLM Failure Modes, Decision Gate |
How Do LLMs Retrieve Knowledge?
When you tell a language model “design a website,” the model activates knowledge clusters associated with “website” and “design” from billions of tokens in its training data. This activation is broad but shallow: CSS, HTML, JavaScript, React, WordPress, Wix, accessibility, performance, SEO, all lightly activated at once. None deeply activated.
When you tell the same model “design an e-commerce checkout page applying Nielsen’s 10 heuristics,” the activation narrows but deepens. Jakob Nielsen’s usability heuristics are among the most widely used evaluation frameworks in interface design: ten core principles including visibility of system status, match between system and real world, error prevention, and consistency and standards [1]. When the model encounters this reference, it activates these specific principles and their applications from its training data. The knowledge cluster is concentrated, and the model focuses on that region.
This is not just an intuitive observation. Studies examining knowledge storage mechanisms in the transformer architecture support this picture.
The knowledge neuron hypothesis was introduced in Dai et al.’s (2022) study published at ACL [2]: specific neurons express factual knowledge, and their activation correlates positively with the corresponding facts. Meng et al. (2022) extended this finding at NeurIPS [3]: factual knowledge is mediated by middle-layer feed-forward modules when processing subject tokens. In other words, names and concepts in the prompt (subject tokens) are the triggers of knowledge retrieval.
However, recent research adds nuance to this picture. Niu et al. (2024), in a study published at ICLR 2024, showed that the knowledge neuron thesis is an oversimplification [4]: knowledge emerges not from individual neurons but from coordinated activation across distributed components. These two views do not contradict each other; they are valid at different scales: individual neurons contribute to factual associations, but knowledge retrieval cannot be reduced to a single neuron and requires cross-layer coordination. Zheng et al. (2024) classified attention heads in a four-stage framework [5]: Knowledge Recalling, In-Context Identification, Latent Reasoning, Expression Preparation. The first stage is directly triggered by references in the prompt.
The practical implication is simple: what you write in the prompt determines which knowledge clusters the model activates. Generic terms produce broad but superficial activation. Specific names, theories, and framework references trigger deep, concentrated knowledge clusters in the training data.
Why Do Generic Prompts Fail?
When I analyzed over 1,400 prompts collected from various platforms, I found the majority shared common weaknesses. I scored each prompt on four criteria, 0-2 each: role definition, explicit constraints, output format, domain-specific reference usage. On a 0-8 scale, 5 and above was classified as “above average,” 3 and below as “below average.” By these criteria, the vast majority of prompts fell below the midpoint (4).
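The rubric above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for the original analysis: the class and field names are hypothetical, the per-criterion scores are still assigned by hand, and the "midpoint" label for a total of 4 is an assumption, since the text only names the two outer bands.

```python
from dataclasses import dataclass

# The four criteria from the text, each scored 0-2 by a human reviewer.
CRITERIA = ("role_definition", "explicit_constraints", "output_format", "domain_reference")


@dataclass(frozen=True)
class PromptScore:
    role_definition: int
    explicit_constraints: int
    output_format: int
    domain_reference: int

    def __post_init__(self) -> None:
        # Reject anything outside the 0-2 range used by the rubric.
        for name in CRITERIA:
            value = getattr(self, name)
            if value not in (0, 1, 2):
                raise ValueError(f"{name} must be 0, 1, or 2, got {value}")

    @property
    def total(self) -> int:
        return sum(getattr(self, name) for name in CRITERIA)

    @property
    def band(self) -> str:
        if self.total >= 5:
            return "above average"
        if self.total <= 3:
            return "below average"
        return "midpoint"  # assumed label for a total of exactly 4


# Hypothetical scores for a bare one-line prompt and a fully structured one.
bare = PromptScore(0, 0, 0, 0)
structured = PromptScore(2, 2, 2, 2)
print(bare.total, bare.band)
print(structured.total, structured.band)
```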
Common traits of the most successful prompts (top 18%):
- Specific research or framework references
- Structured format (XML, JSON, or similar delimiters)
- Explicit constraints (MUST, NEVER, ALWAYS format)
- Defined output format expectations
The difference is not just “more detailed instructions.” It’s a structural difference.
Before/After
Generic prompt:
Write a product description for an e-commerce site.
Optimized with knowledge anchors:
<role>
You are a conversion-focused UX writer.
You apply Krug's "Don't Make Me Think" and Cialdini's persuasion principles.
</role>
<instructions>
Write an e-commerce product description.
</instructions>
<constraints>
- MUST: Benefit-first structure (lead with benefits, not features)
- MUST: Scannable format (bullet points, bold key terms)
- NEVER: Jargon or technical terms (target audience: general consumer)
- MAX: 150 words
</constraints>
<output_format>
Title + 2-3 sentence hook + bullet point benefits + CTA
</output_format>
The first prompt produces “any product description.” The second activates Krug’s usability principles and Cialdini’s persuasion framework, the output format is defined, and constraints are clear.
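If prompts like the second one are assembled repeatedly, a small helper keeps the sections consistent. The sketch below is one possible shape, assuming the four tags used in the example; the function names and the set of allowed constraint keywords are illustrative choices, not part of any library.

```python
from typing import List, Tuple

# Constraint keywords used in the example prompt (assumed to be the full set).
ALLOWED_KINDS = {"MUST", "NEVER", "ALWAYS", "MAX"}


def tag(name: str, body: str) -> str:
    """Wrap body in <name>...</name>, each delimiter on its own line."""
    return f"<{name}>\n{body.strip()}\n</{name}>"


def build_prompt(
    role: str,
    instructions: str,
    constraints: List[Tuple[str, str]],
    output_format: str,
) -> str:
    """Assemble a tagged prompt from its four parts."""
    for kind, _ in constraints:
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown constraint kind: {kind}")
    constraint_lines = "\n".join(f"- {kind}: {rule}" for kind, rule in constraints)
    return "\n".join(
        [
            tag("role", role),
            tag("instructions", instructions),
            tag("constraints", constraint_lines),
            tag("output_format", output_format),
        ]
    )


prompt = build_prompt(
    role="You are a conversion-focused UX writer.",
    instructions="Write an e-commerce product description.",
    constraints=[("MUST", "Benefit-first structure"), ("MAX", "150 words")],
    output_format="Title + 2-3 sentence hook + bullet point benefits + CTA",
)
print(prompt)
```

Keeping constraints as (keyword, rule) pairs makes it easy to reject a misspelled keyword before the prompt is sent.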
Prompt Quality and LLM Behavioral Degradations
Good prompts don’t just produce better output; they also minimize LLM behavioral degradations:
- Hallucination: The Chain-of-Verification (CoVe) prompt technique reduces hallucination by directing the model to verify its own output through validation questions [6]
- Sycophancy: Question formulation directly affects sycophancy. Saying “analyze this” instead of “isn’t this wrong?” produces analysis rather than the model’s agreement reflex [7]
- Context rot: Adding unnecessary information to a prompt doesn’t just make the relevant parts harder to find; it actively causes harm. Irrelevant padding is a noise source that disrupts information retrieval [8]
- Instruction attenuation: Instructions lose their effect in long sessions. The Forget-Me-Not technique reduces the average 39% performance drop in multi-turn conversations through strategic instruction re-injection [9]
Common theme: the structure of the prompt shapes the model’s behavior. An unstructured, ambiguous prompt invites degradation modes.
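As a rough illustration of instruction re-injection, the sketch below repeats the standing instructions at a fixed interval in a chat history. It is not the published procedure: the interval, the reminder wording, and the use of extra system messages mid-conversation are all assumptions, and some chat APIs only accept a system message at the start.

```python
from typing import Dict, List

Message = Dict[str, str]


def with_reinjection(
    history: List[Message],
    instructions: str,
    every_n_user_turns: int = 5,
) -> List[Message]:
    """Return a new message list with the instructions repeated after every N user turns."""
    if every_n_user_turns < 1:
        raise ValueError("every_n_user_turns must be at least 1")
    out: List[Message] = [{"role": "system", "content": instructions}]
    user_turns = 0
    for message in history:
        if message["role"] == "user":
            user_turns += 1
            # Insert a reminder just before user turn N+1, 2N+1, and so on.
            if user_turns > 1 and (user_turns - 1) % every_n_user_turns == 0:
                out.append(
                    {
                        "role": "system",
                        "content": f"Reminder of the standing instructions:\n{instructions}",
                    }
                )
        out.append(dict(message))  # copy so the caller's history is not mutated
    return out
```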
What Is a Knowledge Anchor?
A knowledge anchor is a named reference that activates specific knowledge clusters in LLM training data. It can be a theory, framework, researcher, or methodology name.
When you say “do form validation,” the model produces generic form validation output. When you say “apply Luke Wroblewski’s inline validation research,” the model activates that research’s specific findings: instant feedback, positive validation, error message timing.
A knowledge anchor is not retrieval-augmented generation (RAG), although it aims at a similar result: more accurate, less hallucinated outputs. RAG pulls information from an external database and presents it to the model as context. Knowledge anchors require no external data: they use specific references to make the model retrieve knowledge already present in its training data more effectively. The approach also differs from few-shot prompting: few-shot provides the model with example input/output pairs, whereas knowledge anchors activate a specific region of the model’s knowledge by naming it.
Research shows that this type of grounding measurably reduces hallucination. Shuster et al.’s (2021) foundational study showed that retrieval-grounded generation significantly reduces knowledge hallucination [10]. Garber et al. (2024) showed that hallucination measurably decreases when named entities extracted from knowledge graphs are used as grounding sources [11]. Google Research’s FACTS benchmark (2025) uses the ratio of “hallucinated named entities” (names not appearing in source documents) as its primary error metric [12].
Four Categories
Organizing knowledge anchors into four categories provides an effective structure:
| Category | Function | Example |
|---|---|---|
| Core Principles | Foundational design/architecture principles | Nielsen’s heuristics, SOLID principles, Tufte’s data visualization rules |
| Anti-Patterns | Common mistakes to avoid | Premature optimization, God object, N+1 query |
| Key Metrics | Measurable success criteria | Core Web Vitals, OWASP Top 10, DORA metrics |
| Domain-Specific Anchors | Deep domain references | Wroblewski (form UX), Kimball (dimensional modeling), Cialdini (persuasion) |
Stacking anchors from all four categories into every prompt is not efficient. In my practical experience, 3-4 anchors is the sweet spot: one core principle + one anti-pattern + 1-2 domain-specific anchors. More than that adds noise to the context and dilutes the model’s focus.
Structured Prompting Formats
“Structured prompts outperform plain text” is no longer up for debate. Schulhoff et al.’s (2024) systematic survey covering 1,565 papers defines structured formatting (delimiters, XML tags, JSON schemas) as a meta-technique: a foundational building block that improves all other prompting strategies [13].
However, the “which format?” question doesn’t have a simple answer.
XML
Anthropic’s Claude models are trained to recognize XML tag structures [14]. Semantic tags like <role>, <context>, <constraints>, and <output_format> separate concerns. XML is strong at combining free text with structured instructions.
<role>Senior frontend engineer</role>
<context>React 19, Tailwind CSS, TypeScript project</context>
<instructions>Create a form validation component</instructions>
<constraints>
- MUST: Wroblewski inline validation pattern
- NEVER: Alert/confirm dialog
</constraints>
JSON
OpenAI pushes Structured Outputs with JSON Schema in GPT-4o and o3 models [15]. The response_format parameter guarantees schema adherence, required keys, and enum validation. Strong for API integration and programmatic output parsing.
{
  "role": "Senior frontend engineer",
  "context": "React 19, Tailwind CSS, TypeScript",
  "task": "Form validation component",
  "constraints": {
    "must": ["Wroblewski inline validation"],
    "never": ["Alert/confirm dialog"]
  }
}
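The JSON above structures the input. For the output side, the following sketch shows what a schema for Structured Outputs might look like. The outer `response_format` shape follows my understanding of OpenAI's Chat Completions documentation and should be checked against the current docs; the schema name and its fields are hypothetical. The `check_reply` helper is only a local sanity check for flat schemas, not a JSON Schema validator.

```python
import json

# Hypothetical schema for a model reply describing a form validation plan.
PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "component": {"type": "string"},
        "validation_style": {"type": "string", "enum": ["inline", "on_submit"]},
        "uses_alert_dialog": {"type": "boolean"},
    },
    "required": ["component", "validation_style", "uses_alert_dialog"],
    "additionalProperties": False,
}

# Assumed shape of the response_format argument; verify against the official docs.
RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {"name": "form_validation_plan", "strict": True, "schema": PLAN_SCHEMA},
}


def check_reply(raw: str, schema: dict) -> dict:
    """Check required keys, unexpected keys, and enum membership for a flat schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = [key for key in schema["required"] if key not in data]
    extra = [key for key in data if key not in schema["properties"]]
    if missing or extra:
        raise ValueError(f"missing keys: {missing}, unexpected keys: {extra}")
    for key, spec in schema["properties"].items():
        if "enum" in spec and data[key] not in spec["enum"]:
            raise ValueError(f"{key} must be one of {spec['enum']}")
    return data


reply = '{"component": "EmailField", "validation_style": "inline", "uses_alert_dialog": false}'
print(check_reply(reply, PLAN_SCHEMA))
```

A local check like this is still useful with providers or models that do not enforce the schema during decoding.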
YAML
YAML has an advantage in human readability. Preferred in prompt templates, config files, and multi-step workflow definitions. Its indent-based structure provides natural hierarchy.
role: Senior frontend engineer
context: React 19, Tailwind CSS, TypeScript
task: Form validation component
constraints:
  must:
    - Wroblewski inline validation
  never:
    - Alert/confirm dialog
Format Selection: Model-Specific Preferences
He et al.’s (2024) experiments on GPT-3.5-turbo and GPT-4 showed that prompt format alone can create a performance difference of up to 40% [16]. The difference is especially pronounced in smaller models: GPT-3.5-turbo prefers JSON, while GPT-4 performs best with Markdown and is more format-resilient. Cross-model format transfer is low (IoU generally below 0.2).
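IoU (intersection over union) measures how much two sets overlap: the size of their intersection divided by the size of their union. The sketch below assumes the metric is applied to the sets of formats on which each model performs best, which is my reading of the statement above; the two sets are made-up placeholders, not data from the study.

```python
def iou(a: set, b: set) -> float:
    """Intersection over union of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical top-performing formats for two models (illustrative only).
top_formats_model_a = {"json", "yaml", "plain"}
top_formats_model_b = {"markdown", "plain", "xml", "html"}

# One shared format out of six distinct formats gives a low overlap.
print(round(iou(top_formats_model_a, top_formats_model_b), 3))
```

A value near 0 means a format that works well for one model tells you little about what works for another, which is why per-model testing matters.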
Elnashar et al. (2025) compared formats across GPT-4o, Claude, and Gemini [17]: JSON has highest accuracy for complex/nested data, YAML balances readability and efficiency, CSV/Prefix excels in token efficiency for flat data.
| Model | Recommended Format | Source |
|---|---|---|
| Claude Sonnet 4 / Opus 4 | XML tags | Anthropic official documentation [14] |
| GPT-4o / o3 | JSON Schema + Structured Outputs | OpenAI official documentation [15] |
| Gemini 2.5 Flash / Pro | Format-agnostic, Markdown/JSON preferred | Google documentation |
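One way to act on this table is to keep the prompt content in a neutral structure and serialize it per model family. The sketch below assumes a flat spec of string fields; the family keys and the Markdown fallback are illustrative choices, not an official mapping.

```python
import json

# Hypothetical mapping from model family to the format recommended in the table.
PREFERRED_FORMAT = {"claude": "xml", "gpt": "json", "gemini": "markdown"}


def render(spec: dict[str, str], model_family: str) -> str:
    """Serialize a flat prompt spec in the format preferred by the model family."""
    fmt = PREFERRED_FORMAT.get(model_family, "markdown")  # assumed default
    if fmt == "xml":
        return "\n".join(f"<{key}>{value}</{key}>" for key, value in spec.items())
    if fmt == "json":
        return json.dumps(spec, indent=2)
    return "\n\n".join(f"## {key}\n{value}" for key, value in spec.items())


spec = {
    "role": "Senior frontend engineer",
    "context": "React 19, Tailwind CSS, TypeScript",
    "task": "Form validation component",
}
print(render(spec, "claude"))
```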
Format Sensitivity in Open-Source Models
Open-source models are far more sensitive to format changes than closed-source models. This is not just an intuitive observation; it’s backed by research:
- Up to 76 accuracy points of difference: the performance gap caused by format changes in LLaMA-2-13B in few-shot settings [18]
- Parameter count is not the solution: The POSIX study showed that increasing parameters or instruction tuning alone does not reduce sensitivity. Even a single few-shot example creates a dramatic difference [19]
- Structural task gap: The StructEval benchmark revealed that open-source models (Llama-3-8B, Qwen2.5-7B) show a wider performance gap compared to closed-source models on complex structural tasks [20]
Each model family has its own format preferences:
| Model | Format | Note |
|---|---|---|
| DeepSeek V3 | Markdown + XML + 3-tiered prompt (System/Developer/User) | R1: empty system prompt, skip few-shot |
| Kimi K2.5 | Structured headers, tables | Tool calling is automatic |
| Llama 4 | Custom header tokens, JSON schema | ipython role for tool results |
| Qwen 3 | ChatML format, <think> block | /think and /no_think inline switch |
| Mistral Large | [INST] template, Markdown + XML | JSON Schema mode > plain JSON mode |
Common finding: regardless of format, structured prompts outperform plain text. Constrained decoding (Outlines, vLLM guided_json) improves JSON reliability in open-source models [21].
How to Do Domain-Specific Optimization?
A five-step optimization flow to turn theory into practice:
Step 1: Role Injection
Assign a domain-specific expert role. Not “you’re a developer,” but “you’re a senior frontend engineer specializing in accessibility and performance who optimizes Lighthouse scores.”
The role determines the perspective from which the model evaluates subsequent instructions.
Step 2: Anchor Stacking (Max 3-4)
Select relevant knowledge anchors. Use tag matching: match terms in the prompt with anchor tags.
In my practical experience, the optimal combination is:
- 1 Core Principle (guidance)
- 1 Anti-Pattern (constraint)
- 1-2 Domain-Specific Anchors (depth)
More creates noise. Each additional anchor steals the model’s attention from existing anchors.
Step 3: Constraint Formatting
Use explicit constraints instead of ambiguous instructions:
- MUST: Required behaviors
- NEVER: Prohibited behaviors
- ALWAYS: Rules that apply in every case
Instead of “write short,” use “MAX: 150 words, 3 sentences per paragraph.” Instead of “write secure code,” use “NEVER: innerHTML with user input, MUST: parameterized queries, ALWAYS: input validation at system boundary.”
Step 4: Output Spec
Define the expected output format. Without an output format definition, the model falls back to its default (typically long, unstructured paragraphs).
<output_format>
Markdown. H2 headings, bullet points.
Each section max 100 words. Code blocks syntax highlighted.
End with 3-item action items list.
</output_format>
Step 5: Self-Check Gate
Add a verification checklist at the end of the prompt:
<success_criteria>
- [ ] Does the output conform to the specified format?
- [ ] Are all MUST constraints met?
- [ ] Are no NEVER constraints violated?
- [ ] Are knowledge anchor references concretely applied?
</success_criteria>
This prompts the model to check its own output before delivering it. It is an additional defense layer against instruction attenuation, though not a guarantee that every item was actually verified.
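Constraints that can be checked mechanically do not have to rely on the model's self-assessment. The sketch below verifies the output spec from Step 4 (H2 headings, at most 100 words per section, a closing 3-item list) after the response arrives. The function name, the regular expression, and the treatment of bullet lines as words are my own simplifications.

```python
import re


def check_output(markdown: str, max_words_per_section: int = 100, action_items: int = 3) -> list[str]:
    """Return a list of violations of the output spec; an empty list means it passed."""
    problems = []
    # Split on H2 heading lines; drop whatever precedes the first heading.
    sections = re.split(r"^## .*$", markdown, flags=re.MULTILINE)[1:]
    if not sections:
        problems.append("no H2 headings found")
    for index, body in enumerate(sections, start=1):
        words = len(body.split())
        if words > max_words_per_section:
            problems.append(f"section {index} has {words} words (max {max_words_per_section})")
    # Count the bullet lines at the very end of the document.
    lines = [line.strip() for line in markdown.strip().splitlines() if line.strip()]
    trailing = 0
    for line in reversed(lines):
        if line.startswith(("- ", "* ")):
            trailing += 1
        else:
            break
    if trailing != action_items:
        problems.append(f"expected {action_items} closing action items, found {trailing}")
    return problems


sample = "## Summary\nShort text.\n\n## Action items\n- one\n- two\n- three\n"
print(check_output(sample))
```

A non-empty result can be fed back to the model as a correction request or routed to a human reviewer.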
The Prompt Forge Approach
I want to share how I apply the principles described in this post to my own workflow.
Analysis
I collected over 1,400 prompts from various platforms and evaluated them on four criteria: (1) role definition, (2) explicit constraints, (3) output format definition, (4) domain-specific reference usage. Each criterion was scored 0-2; on a 0-8 total scale, 5 and above was classified as “above average,” 3 and below as “below average.”
Results: the vast majority of prompts fell in the below average category. The common traits of the top 18%: specific references, structured format, explicit constraints. Exactly the three fundamentals covered in this post.
Approach
Based on these findings, I created curated knowledge anchor files for four domains:
- Frontend: UI/UX, React, CSS, accessibility, performance (Nielsen, Krug, Wroblewski, WCAG)
- Backend: Django, SOLID, DDD, OWASP, database optimization (Fowler, Evans, Martin)
- Data: Data visualization, analytics, SEO (Tufte, Kimball, GA4)
- Infra: CAP theorem, SRE, DORA metrics, 12-Factor (Nygard, Google SRE)
Each anchor file contains the domain’s 20-30 most effective references. Each reference has tags (e.g., tags: [form, validation, input, ux]). When optimizing a prompt, terms in the prompt are matched against tags, and the 3-4 most appropriate anchors are selected.
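The tag matching step can be sketched as a simple overlap score with a per-category quota. This is an illustration under assumptions, not the Prompt Forge implementation: the anchor list is a tiny made-up sample (only the Wroblewski tags come from the example above), and the quota mirrors the 1 core + 1 anti-pattern + 1-2 domain combination.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    name: str
    category: str  # "core", "anti_pattern", or "domain"
    tags: frozenset


# Hypothetical sample of an anchor file.
ANCHORS = [
    Anchor("Nielsen's 10 usability heuristics", "core", frozenset({"ui", "ux", "usability", "form"})),
    Anchor("Wroblewski inline validation", "domain", frozenset({"form", "validation", "input", "ux"})),
    Anchor("Krug, Don't Make Me Think", "domain", frozenset({"ux", "navigation", "copy"})),
    Anchor("God object", "anti_pattern", frozenset({"class", "architecture", "component"})),
    Anchor("Kimball dimensional modeling", "domain", frozenset({"warehouse", "dimension", "fact"})),
]

QUOTA = {"core": 1, "anti_pattern": 1, "domain": 2}


def select_anchors(prompt: str, anchors=ANCHORS, quota=QUOTA) -> list:
    """Pick the best-matching anchors per category, at most 4 with the default quota."""
    terms = set(re.findall(r"[a-z0-9]+", prompt.lower()))
    # Score each anchor by how many of its tags appear in the prompt.
    scored = sorted(
        ((len(anchor.tags & terms), anchor) for anchor in anchors),
        key=lambda pair: pair[0],
        reverse=True,
    )
    remaining = dict(quota)
    chosen = []
    for score, anchor in scored:
        if score > 0 and remaining.get(anchor.category, 0) > 0:
            chosen.append(anchor)
            remaining[anchor.category] -= 1
    return chosen


picked = select_anchors("Create a form validation component with good input UX")
print([anchor.name for anchor in picked])
```

Exact word matching is the simplest option; stemming or synonym lists would catch more prompts at the cost of more false matches.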
Results
With the same task and the same model, the difference between anchor-backed structured prompts and generic prompts is consistent in my workflow: more specific, more accurate, less hallucinated outputs. The difference is especially pronounced in:
- Complex technical decisions (architecture pattern selection, library comparison)
- Domain-specific best practice application (accessibility, security, performance)
- Structured output generation (ADR, spec, test plan)
Prompt Forge is available as an open-source Claude Code skill on GitHub.
Structure Cannot Be Single-Layered
The optimization techniques covered in this post are not sufficient on their own. As I emphasized in the LLM behavioral failure modes post, defense must work across three layers:
| Layer | This Post’s Counterpart | What It Provides |
|---|---|---|
| Prompt | Knowledge anchors, structured format, constraints | Guides model behavior |
| Architectural | RAG, guardrails, constrained decoding, schema enforcement | Sets structural boundaries |
| Operational | Self-check gate, human-in-the-loop, monitoring | Controls output |
Prompt optimization is a strong starting point, but not a solution by itself. Even the best prompt is not reliable without verification.
Related Posts
- LLM Behavioral Failure Modes: The degradation modes that this post’s prompt techniques aim to address
- Decision Gate: The Missing Piece of Vibe Coding: Systematic decision-making for AI recommendations
- ADR, OpenSpec and Spec-Driven Development: Spec-first approach as an anchor point against task drift
- Context Management in Claude Code: Context window optimization and context engineering
Footnotes
1. Nielsen, J. (1994). 10 Usability Heuristics for User Interface Design. Nielsen Norman Group. A widely used usability evaluation framework in interface design.
2. Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., Wei, F. (2022). Knowledge Neurons in Pretrained Transformers. ACL 2022. A study showing that specific neurons express factual knowledge and their activation shows positive correlation with corresponding facts.
3. Meng, K., Bau, D., Andonian, A., Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS 2022. A study showing factual knowledge is localized in middle-layer feed-forward modules when processing subject tokens, using causal tracing.
4. Niu, J., Liu, A., Zhu, Z., Penn, G. (2024). What does the Knowledge Neuron Thesis Have to do with Knowledge? ICLR 2024. A study showing the knowledge neuron thesis is an oversimplification and knowledge emerges from coordinated activation across distributed components.
5. Zheng, Z., Wang, Y., Huang, Y., Song, S., Tang, B., Xiong, F., Li, Z. (2024). Attention Heads of Large Language Models: A Survey. arXiv. A survey classifying attention heads into Knowledge Recalling, In-Context Identification, Latent Reasoning, and Expression Preparation stages.
6. Dhuliawala, S., Komeili, M., Xu, J., et al. (2023). Chain-of-Verification Reduces Hallucination in Large Language Models. ACL 2024. A 4-step verification process for reducing hallucination.
7. Sharma, M., Tong, M., et al. (2023). Towards Understanding Sycophancy in Language Models. Anthropic / Oxford. A study showing preference models favor sycophantic responses.
8. Hong, K., Troynikov, A., Huber, J. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Research. An 18-model analysis showing irrelevant information actively causes harm.
9. Laban, P., Hayashi, H., Zhou, Y., Neville, J. (2025). LLMs Get Lost In Multi-Turn Conversation. arXiv. Average 39% performance drop in multi-turn and the Forget-Me-Not re-injection technique.
10. Shuster, K., Poff, S., Chen, M., Kiela, D., Weston, J. (2021). Retrieval Augmentation Reduces Hallucination in Conversation. EMNLP Findings. A foundational study showing retrieval-grounded generation reduces knowledge hallucination.
11. Garber, G., et al. (2024). Can Knowledge Graphs Reduce Hallucinations in LLMs? A Survey. NAACL 2024. A survey showing named entity grounding measurably reduces hallucination.
12. Anil, R., et al. (2025). The FACTS Grounding Leaderboard. Google Research. A grounding benchmark using hallucinated named entities ratio as primary metric.
13. Schulhoff, S., et al. (2024). The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv. A comprehensive survey of 58 prompting techniques and 1,565 papers, defining structured formatting as a “meta-technique.”
14. Anthropic (2026). Use XML Tags to Structure Your Prompts. Anthropic Docs. Official documentation stating Claude is trained to recognize XML tag structures.
15. OpenAI (2026). Structured Outputs. OpenAI Docs. Structured Outputs documentation providing JSON Schema adherence guarantees.
16. He, J., Rungta, M., Koleczek, D., Sekhon, A., Wang, F. X., Hasan, S. (2024). Does Prompt Formatting Have Any Impact on LLM Performance? arXiv. A study showing format changes can create up to 40% performance difference.
17. Elnashar, A., White, J., Schmidt, D. (2025). Enhancing Structured Data Generation with GPT-4o. Frontiers in AI. JSON, YAML, CSV format comparison across GPT-4o, Claude, and Gemini.
18. Sclar, M., et al. (2024). Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design. arXiv. 76 accuracy point format sensitivity in LLaMA-2-13B in few-shot settings.
19. Samarawickrama, S., et al. (2024). POSIX: A Prompt Sensitivity Index For Large Language Models. arXiv. A study showing parameter scaling or instruction tuning alone does not reduce sensitivity.
20. StructEval (2025). Benchmarking LLMs’ Capabilities to Generate Structural Outputs. arXiv. Open-source models show wider performance gap than closed-source on complex structural tasks.
21. Geng, S., Cooper, H., et al. (2025). Generating Structured Outputs from Language Models: Benchmark and Studies. arXiv. JSONSchemaBench study showing constrained decoding speeds up generation by 50% and improves downstream quality by up to 4%.
- 01 LLM knowledge storage is distributed: not single neurons, but coordinated activation patterns
- 02 Prompt format alone can create up to 40% performance difference
- 03 Knowledge anchors are named references that activate specific knowledge clusters in LLM training data
- 04 Structured prompting is a meta-technique: it improves all other prompting strategies
- 05 Open-source models are far more sensitive to format changes than closed-source; even a single few-shot example dramatically reduces the gap
- 06 Good prompts minimize LLM behavioral degradations (hallucination, sycophancy, context rot, instruction attenuation)
What is a knowledge anchor?
A knowledge anchor is a named reference that activates specific knowledge clusters in LLM training data. It can be a theory, framework, researcher, or methodology name. Saying 'apply Wroblewski's inline validation research' instead of 'do form validation' causes the model to activate that research's specific findings.
Should I use XML, JSON, or YAML?
There are model-specific preferences: Claude is optimized for XML tags, GPT-4o performs best with JSON Schema, YAML is advantageous for human-editable prompt templates. Common finding: regardless of format, structured prompts outperform plain text.
Why does prompt format matter more for open-source models?
Research shows open-source models can exhibit up to 76 accuracy point differences from format changes. Closed-source models (GPT-4+) are more format-resilient due to RLHF tuning. Even a single few-shot example in open-source dramatically reduces sensitivity.
How many knowledge anchors should I use?
In my practical experience, 3-4 anchors is the sweet spot. More than that adds noise to the context and dilutes the model's focus. One core principle + one anti-pattern + 1-2 domain-specific anchors is a balanced combination.
Do good prompts reduce LLM errors?
Yes. Research shows structured prompts reduce hallucination (CoVe technique), sycophancy (question formulation), instruction attenuation (Forget-Me-Not re-injection), and context rot (avoiding unnecessary information).