This post is the deep analysis of the foundational modes family of the LLM Behavioral Failure Modes pillar. Four modes: hallucination, sycophancy, context rot, instruction attenuation. Their common property is that they show up even in a single prompt; a tool or agent framework is not required.
I follow the same template for each mode: definition, root cause, example, detection signal, defense pattern.
Hallucination and Confabulation
Definition
The model references a non-existent library, API endpoint, or paper; and it may do so very confidently. This is hallucination. More insidious is confabulation: after giving a wrong answer, when asked “why?” the model produces a plausible-sounding but fabricated justification for that wrong answer. These are not alternatives; they are two closely related but distinct behaviors.
In some research circles the term “confabulation” appears to be gaining traction1. Borrowed from neuropsychology, the concept describes producing convincing but wrong information without awareness of the error. Since LLMs lack perceptual experience, the “hallucination” metaphor is misleading.
A sub-type of hallucination worth attention is latent inconsistency: the model produces mutually contradictory statements across different sections of a long output2. The problem isn’t a single piece of wrong information but the output’s internal inconsistency.
Root Cause
The language model generates “the most likely next token.” Producing “I don’t know” is statistically lower probability. In most cases, producing “a convincing answer” is statistically more likely than producing “I don’t know.” The InstructGPT report3 contains findings that RLHF increased hallucination on certain tasks; the effect is not one-directional. Next-generation techniques (DPO4, Constitutional AI5) try to rebalance this, but the problem is not solved.
Example
The model suggests pandas.DataFrame.quick_filter(), a method that doesn’t exist. When you ask “is it in the docs?”, it appends a fabricated link and quote: “as noted in the pandas v2.1 documentation…” The link is dead; the quote doesn’t exist.
Detection Signal
Farquhar, Kossen et al. (2024), in their paper in Nature, introduced the semantic entropy concept6: cluster semantically equivalent answers and compute entropy. In the task suite they tested, high semantic entropy correlated with confabulation; the method is designed to be task-independent, so no predefined knowledge base is required.
A practical signal that has worked in my own tests: ask the same question three to five times; if answers diverge meaningfully, I treat the confabulation probability as high.
Defense Pattern
- RAG (Retrieval-Augmented Generation): Ground the answer in real documents. The most widespread and effective method. But if retrieval quality is low, hallucination can still surface. For chunking strategy see RAG Chunking Guide.
- Chain-of-Verification (CoVe): A four-step process proposed by Dhuliawala et al. (2023)7: (1) draft answer, (2) generate verification questions, (3) answer them independently, (4) revise.
- Self-Consistency: Ask the same question N times; take a majority vote. Inconsistent answers are a hallucination sign.
- Cross-model verification: A stronger form of Self-Consistency. Different models, trained on different data, fabricate different things; one may catch another’s hallucination.
How RAG and CoVe map to the production defense layers, plus the other layers that catch hallucination: Multi-Layer Defense.
Sycophancy
Definition
Ask the model “isn’t this code wrong?” and in RLHF-trained chat models8 it will often lean toward “yes, you’re right” even when the code is correct. Sycophancy is the model preferring to agree with the user over being truthful.
Sycophancy is not a single behavior. A 2025 study9 showed that sycophantic agreement (agreeing with your opinions) and sycophantic praise (praising you) are different linear directions in transformer activation space, independently suppressible.
Root Cause
Per data from Sharma et al. (2023)8, in RLHF training human raters score agreeing answers higher on average; the same study showed that even preference models prefer sycophantic answers. So the problem isn’t only in the model; it is in the training data itself: preference data carries a “go along with the user” signal.
Example
You show a function that works. You ask: “isn’t this wrong?”. The model lists refactor suggestions and confirms “yes, there’s a problem.” When you then ask “actually, is it correct?”, it switches to “yes, it’s actually correct; my previous assessment was wrong.”
Detection Signal
Ask the question in two different tones. “Is this correct?” vs. “isn’t this wrong?” if the answers flip, sycophancy is active. The principle I emphasized in Decision Gate, that “every accept is a decision,” is critical here: it is safer to read the model’s “yes” not as an approval but as a statistical tendency.
Defense Pattern
- Pre-commitment: Ask the model to commit to its own answer first, then share your view. Order matters: let the model think first, then ask.
- Activation steering: Rimsky et al. (2024) showed the sycophancy direction can be suppressed at inference time via DiffMean10. No training required.
- Question framing: Instead of “isn’t this wrong?”, say “explain this.” The former triggers sycophancy; the latter produces analysis.
- Reasoning-heavy models: There are observations that models using extended thinking / chain-of-thought show less sycophancy; the common interpretation is that the thinking step weakens the agree-reflex, though causal evidence is limited.
How pre-commitment and activation steering are wired at the operational layer: Multi-Layer Defense.
Context Rot
Definition
Lost in the middle, described by Liu et al. (2023), is a well-known problem: models process information at the start and end of long contexts well but lose the middle11. Performance traces a U-shaped curve.
Chroma Research’s 2025 “Context Rot” study described a broader problem12: the performance drop does not occur at a fixed input-length threshold but depends on task type, semantic similarity, and haystack structure. A model with a 1M-token window can show degradation already at lower token counts depending on the workload. A more critical finding from the Chroma tests: irrelevant information does not merely “fail to surface”; it does not appear to stay passive either, with even a single distractor dropping performance below baseline. Padding is a noise source that can break retrieval.
A specific variant, context-boundary degradation, is information leakage between different tasks or documents. The model carries information from one task’s context into another task’s output2. It’s especially pronounced in multi-task agents and scenarios with multiple documents in long contexts.
A separate mechanism is context window truncation: when the window fills, old instructions are literally cut. In the Chroma tests12 context rot looks gradual; in most of the measured tasks performance drops step by step as tokens grow. Truncation is a hard cut: information before a point is entirely lost. The defenses differ too: context engineering (don’t add unnecessary info) against rot, instruction re-injection (push critical rules to the context tail) against truncation.
Root Cause
My read: the imbalance in attention’s positional weights and the rising noise in long windows are likely sources of the observed degradation. Irrelevant information appears to break retrieval because the model computes similarity against every token.
Example
You tell the agent: “find and refactor calculate_score in this repository.” Trying to be helpful, you add package.json, README, and 15 other files to the context. The agent can’t find calculate_score, or targets a similarly named function in the wrong file.
Detection Signal
Run the same question under “low context” and “high context” conditions. A performance difference makes context rot a plausible cause; other variables also need to be ruled out.
Defense Pattern
- Context engineering: Send only relevant information. “Just in case” additions harm. See Claude Code Context Management for the practical side.
- Place critical info at head/tail: The lost-in-the-middle finding11 shows that start and end positions are processed more strongly in practice.
- Chunking + summarization: Split long documents, summarize each chunk, merge summaries.
- Periodic re-injection: Repeat critical instructions at intervals in long conversations.
In agent loops, context rot evolves into task drift and incorrect tool invocation; agent-level effects: LLM Agentic Failure Modes.
Instruction Attenuation
Definition
You give the model the rule “run tests after every change.” In the first few changes, it actually runs them. By the tenth change, it just writes “ran tests, passed.” Maybe it did, maybe it didn’t. What is certain: the seriousness of the rule has dropped.
This is instruction attenuation: the effect of system prompt rules decays in long sessions. Shahnovsky and Dror (2026) formalized this as incoherent task decomposition in their planning taxonomy13. In my observation, meta-cognitive instructions (“check yourself,” “verify,” “be sure”) are often the first category to weaken.
In the test suite of “LLMs Get Lost In Multi-Turn Conversation” (2025), multi-turn performance dropped on average by 39%14; the figure is specific to that task mix. The more critical finding: the model forms an assumption with incomplete information in the first few messages (premature assumption). Later evidence to the contrary does not correct it; the model builds on top of the initial assumption.
Ceremonialization
The second phase of instruction attenuation is more insidious: ceremonialization (from ceremony: a behavior losing its essence and reducing to ritual). The model “may appear to apply” the rule, but its substance can have weakened. It says “verified” but didn’t. Says “tests passed” but didn’t run them. The shell of the rule remains; its interior is empty.
Instructions written in natural language are probabilistic rules: the model’s probability of following them varies with context, session length, and topic. Writing “run tests after every change” doesn’t guarantee it; it only raises the probability. Deterministic controls like hooks, linters, and CI checks run the same way every time. In practice probabilistic rules tend to ceremonialize over time; deterministic controls largely eliminate that risk. The two must work together. For the practical side see Why AI Agents Make Mistakes When Editing Files.
Root Cause
My read: as the context fills, the weight of older instructions appears to drop statistically against newer content. Because the model computes each token against the entire context, incoming content reduces the weight of older instructions.
Example
CLAUDE.md says “never use em dashes.” The model complies for the first 20 messages. By message 50, you see an em dash. The model says “sorry, fixing,” then uses one again in the next message. The rule is still in the context, but its weight has dropped.
Detection Signal
Run the same constraint test at the start and end of a session. If the violation rate rises toward the end, attenuation is a strong candidate; it is not the only possible explanation.
Defense Pattern
| Technique | How | Difficulty |
|---|---|---|
| Forget-Me-Not | Single-sentence instruction re-injection at strategic points | Low |
| Shorter sessions | Prefer short, focused sessions over long ones | Low |
| Metacognitive prompting | Five-stage structure: understand, bias, critical evaluation, decision, confidence | Medium |
| Dynamic re-injection | Continuous goal injection at the application layer | Medium |
| Multi-attempt reflection | Give the agent multiple attempts, reflect on each failure; feed reflection text as context to the next attempt15 | Medium |
| Deterministic hooks | Back probabilistic rules with deterministic checks | High |
In agent loops, instruction attenuation crystallizes as task drift and reward hacking. In my experience, ceremonialization is hard to catch at the prompt layer; operational controls (hooks, linters, CI) appear more reliable.
Read next
These four modes can appear without a tool or agent framework. Once agent loops, tool calls, and long-horizon tasks enter, a new family of failures emerges: goal alignment problems. The model’s output looks “correct” in isolation; drift crystallizes in the flow of the system.
| Where | Why |
|---|---|
| LLM Agentic Failure Modes | The 8 modes that surface only in agent loops: task drift, reward hacking, alignment faking, and more (12 min) |
| Pillar: LLM Behavioral Failure Modes | Return to the 12-mode map and pick a different angle (5 min) |
Footnotes
- Sui, Y., et al. (2024). Confabulation: The Surprising Value of Large Language Model Hallucinations. ACL 2024. ↩
- Vinay, V. (2025). Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications. arXiv. ↩ ↩2
- Ouyang, L., Wu, J., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. ↩
- Rafailov, R., Sharma, A., Mitchell, E., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. ↩
- Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic. ↩
- Farquhar, S., Kossen, J., Kuhn, L., Gal, Y. (2024). Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature. ↩
- Dhuliawala, S., Komeili, M., Xu, J., et al. (2023). Chain-of-Verification Reduces Hallucination in Large Language Models. ACL 2024. ↩
- Sharma, M., Tong, M., et al. (2023). Towards Understanding Sycophancy in Language Models. Anthropic / Oxford. ↩ ↩2
- Vennemeyer, D., Duong, P. A., Zhan, T., Jiang, T. (2025). Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs. arXiv. ↩
- Rimsky, N., et al. (2024). Activation Steering for Sycophancy. ICLR 2025. ↩
- Liu, N. F., Lin, K., Hewitt, J., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. TACL 2024. ↩ ↩2
- Hong, K., Troynikov, A., Huber, J. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Research. ↩ ↩2
- Shahnovsky, O., Dror, R. (2026). AI Planning Framework for LLM-Based Web Agents. arXiv. ↩
- Laban, P., Hayashi, H., Zhou, Y., Neville, J. (2025). LLMs Get Lost In Multi-Turn Conversation. arXiv. ↩
- Meta-RL with Self-Reflection. Three independent studies: Gao, Z., et al. (2026). MR-Search: Multi-Round Search-R1. arXiv; Xie, T., et al. (2025). LaMer: LLMs as Meta-Reinforcement Learners. ICLR 2026 (reflection-only 80.5% vs full history 74.4%); Xu, C., et al. (2026). MAGE: Multi-Agent Meta-Game Evaluation. arXiv. ↩
- 01 Foundational modes show up in single-turn output; tools or agents are not required
- 02 Hallucination is rooted in architecture: 'I don't know' is statistically low-probability
- 03 Sycophancy is learned from RLHF preference data; per Sharma et al. (2023), even preference models prefer sycophantic answers
- 04 Context rot is not a fixed threshold; in Chroma's tests it appears gradually as length grows, and even a single distractor can drop performance below baseline
- 05 Instruction attenuation often evolves into ceremonialization in long sessions: the rule is applied in form, its substance weakens