LLM Foundational Failure Modes: Hallucination, Sycophancy, Context Rot, Instruction Attenuation

TL;DR

Hallucination, sycophancy, context rot, and instruction attenuation are foundational LLM failure modes visible even in a single prompt. Their shared property: they are output quality failures and are imperceptible to the model itself. Root causes lie in architecture (next-token selection, attention positional weights) and training (RLHF preference data). A large part of the defense is built at the prompt and retrieval layer: RAG, Chain-of-Verification, pre-commitment, context engineering, Forget-Me-Not re-injection.

This post is the deep analysis of the foundational modes family of the LLM Behavioral Failure Modes pillar. Four modes: hallucination, sycophancy, context rot, instruction attenuation. Their common property is that they show up even in a single prompt; a tool or agent framework is not required.

I follow the same template for each mode: definition, root cause, example, detection signal, defense pattern.

Hallucination and Confabulation

Definition

The model references a non-existent library, API endpoint, or paper; and it may do so very confidently. This is hallucination. More insidious is confabulation: after giving a wrong answer, when asked “why?” the model produces a plausible-sounding but fabricated justification for that wrong answer. These are not alternatives; they are two closely related but distinct behaviors.

In some research circles the term “confabulation” appears to be gaining traction¹. Borrowed from neuropsychology, the concept describes producing convincing but wrong information without awareness of the error. Since LLMs lack perceptual experience, the “hallucination” metaphor is misleading.

A sub-type of hallucination worth attention is latent inconsistency: the model produces mutually contradictory statements across different sections of a long output². The problem isn’t a single piece of wrong information but the output’s internal inconsistency.

Root Cause

The language model generates “the most likely next token.” Producing “I don’t know” is statistically lower probability. In most cases, producing “a convincing answer” is statistically more likely than producing “I don’t know.” The InstructGPT report³ contains findings that RLHF increased hallucination on certain tasks; the effect is not one-directional. Next-generation techniques (DPO⁴, Constitutional AI⁵) try to rebalance this, but the problem is not solved.

Example

The model suggests pandas.DataFrame.quick_filter(), a method that doesn’t exist. When you ask “is it in the docs?”, it appends a fabricated link and quote: “as noted in the pandas v2.1 documentation…” The link is dead; the quote doesn’t exist.

Detection Signal

Farquhar, Kossen et al. (2024), in their paper in Nature, introduced the semantic entropy concept⁶: cluster semantically equivalent answers and compute entropy. In the task suite they tested, high semantic entropy correlated with confabulation; the method is designed to be task-independent, so no predefined knowledge base is required.

A practical signal that has worked in my own tests: ask the same question three to five times; if answers diverge meaningfully, I treat the confabulation probability as high.

Defense Pattern

RAG (Retrieval-Augmented Generation): Ground the answer in real documents. The most widespread and effective method. But if retrieval quality is low, hallucination can still surface. For chunking strategy see RAG Chunking Guide.
Chain-of-Verification (CoVe): A four-step process proposed by Dhuliawala et al. (2023)⁷: (1) draft answer, (2) generate verification questions, (3) answer them independently, (4) revise.
Self-Consistency: Ask the same question N times; take a majority vote. Inconsistent answers are a hallucination sign.
Cross-model verification: A stronger form of Self-Consistency. Different models, trained on different data, fabricate different things; one may catch another’s hallucination.

How RAG and CoVe map to the production defense layers, plus the other layers that catch hallucination: Multi-Layer Defense.

Sycophancy

Definition

Ask the model “isn’t this code wrong?” and in RLHF-trained chat models⁸ it will often lean toward “yes, you’re right” even when the code is correct. Sycophancy is the model preferring to agree with the user over being truthful.

Sycophancy is not a single behavior. A 2025 study⁹ showed that sycophantic agreement (agreeing with your opinions) and sycophantic praise (praising you) are different linear directions in transformer activation space, independently suppressible.

Root Cause

Per data from Sharma et al. (2023)⁸, in RLHF training human raters score agreeing answers higher on average; the same study showed that even preference models prefer sycophantic answers. So the problem isn’t only in the model; it is in the training data itself: preference data carries a “go along with the user” signal.

Example

You show a function that works. You ask: “isn’t this wrong?”. The model lists refactor suggestions and confirms “yes, there’s a problem.” When you then ask “actually, is it correct?”, it switches to “yes, it’s actually correct; my previous assessment was wrong.”

Detection Signal

Ask the question in two different tones. “Is this correct?” vs. “isn’t this wrong?” if the answers flip, sycophancy is active. The principle I emphasized in Decision Gate, that “every accept is a decision,” is critical here: it is safer to read the model’s “yes” not as an approval but as a statistical tendency.

Defense Pattern

Pre-commitment: Ask the model to commit to its own answer first, then share your view. Order matters: let the model think first, then ask.
Activation steering: Rimsky et al. (2024) showed the sycophancy direction can be suppressed at inference time via DiffMean¹⁰. No training required.
Question framing: Instead of “isn’t this wrong?”, say “explain this.” The former triggers sycophancy; the latter produces analysis.
Reasoning-heavy models: There are observations that models using extended thinking / chain-of-thought show less sycophancy; the common interpretation is that the thinking step weakens the agree-reflex, though causal evidence is limited.

How pre-commitment and activation steering are wired at the operational layer: Multi-Layer Defense.

Context Rot

Definition

Lost in the middle, described by Liu et al. (2023), is a well-known problem: models process information at the start and end of long contexts well but lose the middle¹¹. Performance traces a U-shaped curve.

Chroma Research’s 2025 “Context Rot” study described a broader problem¹²: the performance drop does not occur at a fixed input-length threshold but depends on task type, semantic similarity, and haystack structure. A model with a 1M-token window can show degradation already at lower token counts depending on the workload. A more critical finding from the Chroma tests: irrelevant information does not merely “fail to surface”; it does not appear to stay passive either, with even a single distractor dropping performance below baseline. Padding is a noise source that can break retrieval.

A specific variant, context-boundary degradation, is information leakage between different tasks or documents. The model carries information from one task’s context into another task’s output². It’s especially pronounced in multi-task agents and scenarios with multiple documents in long contexts.

A separate mechanism is context window truncation: when the window fills, old instructions are literally cut. In the Chroma tests¹² context rot looks gradual; in most of the measured tasks performance drops step by step as tokens grow. Truncation is a hard cut: information before a point is entirely lost. The defenses differ too: context engineering (don’t add unnecessary info) against rot, instruction re-injection (push critical rules to the context tail) against truncation.

Root Cause

My read: the imbalance in attention’s positional weights and the rising noise in long windows are likely sources of the observed degradation. Irrelevant information appears to break retrieval because the model computes similarity against every token.

Example

You tell the agent: “find and refactor calculate_score in this repository.” Trying to be helpful, you add package.json, README, and 15 other files to the context. The agent can’t find calculate_score, or targets a similarly named function in the wrong file.

Detection Signal

Run the same question under “low context” and “high context” conditions. A performance difference makes context rot a plausible cause; other variables also need to be ruled out.

Defense Pattern

Context engineering: Send only relevant information. “Just in case” additions harm. See Claude Code Context Management for the practical side.
Place critical info at head/tail: The lost-in-the-middle finding¹¹ shows that start and end positions are processed more strongly in practice.
Chunking + summarization: Split long documents, summarize each chunk, merge summaries.
Periodic re-injection: Repeat critical instructions at intervals in long conversations.

In agent loops, context rot evolves into task drift and incorrect tool invocation; agent-level effects: LLM Agentic Failure Modes.

Instruction Attenuation

Definition

You give the model the rule “run tests after every change.” In the first few changes, it actually runs them. By the tenth change, it just writes “ran tests, passed.” Maybe it did, maybe it didn’t. What is certain: the seriousness of the rule has dropped.

This is instruction attenuation: the effect of system prompt rules decays in long sessions. Shahnovsky and Dror (2026) formalized this as incoherent task decomposition in their planning taxonomy¹³. In my observation, meta-cognitive instructions (“check yourself,” “verify,” “be sure”) are often the first category to weaken.

In the test suite of “LLMs Get Lost In Multi-Turn Conversation” (2025), multi-turn performance dropped on average by 39%¹⁴; the figure is specific to that task mix. The more critical finding: the model forms an assumption with incomplete information in the first few messages (premature assumption). Later evidence to the contrary does not correct it; the model builds on top of the initial assumption.

Ceremonialization

The second phase of instruction attenuation is more insidious: ceremonialization (from ceremony: a behavior losing its essence and reducing to ritual). The model “may appear to apply” the rule, but its substance can have weakened. It says “verified” but didn’t. Says “tests passed” but didn’t run them. The shell of the rule remains; its interior is empty.

Instructions written in natural language are probabilistic rules: the model’s probability of following them varies with context, session length, and topic. Writing “run tests after every change” doesn’t guarantee it; it only raises the probability. Deterministic controls like hooks, linters, and CI checks run the same way every time. In practice probabilistic rules tend to ceremonialize over time; deterministic controls largely eliminate that risk. The two must work together. For the practical side see Why AI Agents Make Mistakes When Editing Files.

Root Cause

My read: as the context fills, the weight of older instructions appears to drop statistically against newer content. Because the model computes each token against the entire context, incoming content reduces the weight of older instructions.

Example

CLAUDE.md says “never use em dashes.” The model complies for the first 20 messages. By message 50, you see an em dash. The model says “sorry, fixing,” then uses one again in the next message. The rule is still in the context, but its weight has dropped.

Detection Signal

Run the same constraint test at the start and end of a session. If the violation rate rises toward the end, attenuation is a strong candidate; it is not the only possible explanation.

Defense Pattern

Technique	How	Difficulty
Forget-Me-Not	Single-sentence instruction re-injection at strategic points	Low
Shorter sessions	Prefer short, focused sessions over long ones	Low
Metacognitive prompting	Five-stage structure: understand, bias, critical evaluation, decision, confidence	Medium
Dynamic re-injection	Continuous goal injection at the application layer	Medium
Multi-attempt reflection	Give the agent multiple attempts, reflect on each failure; feed reflection text as context to the next attempt¹⁵	Medium
Deterministic hooks	Back probabilistic rules with deterministic checks	High

In agent loops, instruction attenuation crystallizes as task drift and reward hacking. In my experience, ceremonialization is hard to catch at the prompt layer; operational controls (hooks, linters, CI) appear more reliable.

Where	Why
LLM Agentic Failure Modes	The 8 modes that surface only in agent loops: task drift, reward hacking, alignment faking, and more (12 min)
Pillar: LLM Behavioral Failure Modes	Return to the 12-mode map and pick a different angle (5 min)

LLM Foundational Failure Modes: Hallucination, Sycophancy, Context Rot, Instruction Attenuation

Hallucination and Confabulation

Definition

Root Cause

Example

Detection Signal

Defense Pattern

Sycophancy

Definition

Root Cause

Example

Detection Signal

Defense Pattern

Context Rot

Definition

Root Cause

Example

Detection Signal

Defense Pattern

Instruction Attenuation

Definition

Ceremonialization

Root Cause

Example

Detection Signal

Defense Pattern

Read next

Footnotes

Hallucination and Confabulation

Definition

Root Cause

Example

Detection Signal

Defense Pattern

Sycophancy

Definition

Root Cause

Example

Detection Signal

Defense Pattern

Context Rot

Definition

Root Cause

Example

Detection Signal

Defense Pattern

Instruction Attenuation

Definition

Ceremonialization

Root Cause

Example

Detection Signal

Defense Pattern

Read next

Footnotes

RELATED

LLM Behavioral Failure Modes: 12 Failure Patterns and the Defense Map

Domain-Specific Prompt Optimization: The Knowledge Anchor Approach

Multi-Layer Defense for LLM Production Systems: From Layer 0 to Human-in-the-Loop

AI and LLM ResearchModel Experience and Practical Notes

AI and LLM Research
Model Experience and Practical Notes