Skip to content
ceaksan
ai

LLM Behavioral Failure Modes: 12 Failure Patterns and the Defense Map

LLMs forget instructions in long sessions, fabricate information, and agree with you against their own answer. A pillar map of 12 systematic failure modes and the defense layers that address them.

Feb 1, 2026 6 min read Updated: Apr 19, 2026
TL;DR

I split 12 LLM failure modes into two families. Foundational modes (hallucination, sycophancy, context rot, instruction attenuation) appear even in single prompts. Advanced modes (task drift, incorrect tool invocation, reward hacking, positional bias, mode collapse, degeneration loops, alignment faking, version drift) only surface under agent and tool-use loads.

A map of 12 LLM failure modes: two families, three defense layers. “Failure” is not a random error; it is a systematic tendency rooted in the model’s architecture, training, and deployment. Recognizing these tendencies is a prerequisite for decision discipline and safe agent file editing.

Which deep dive do you need?

SpokeWhat it coversRead
Foundational ModesThe 4 modes visible in a single prompt: hallucination, sycophancy, context rot, instruction attenuation10 min
Agentic ModesThe 8 modes under agent and tool-use loads: task drift, reward hacking, alignment faking and more12 min
Multi-Layer DefenseDefense layers, mode-layer matrix, implementation patterns for production systems8 min

12 Failures, 3 Defense Layers

  • Foundational modes (4): Emerge in a single prompt. Output quality failure; the model doesn’t know it is wrong.
  • Advanced modes (8): Surface only in agent/tool-use, long-horizon, or multi-turn systems. Goal alignment failure; the model looks “correct” but drifts from the actual goal.
  • Defense: Never single-layered. Prompt (constraint repetition, metacognitive), architectural (RAG, guardrails, deterministic hooks), and operational (short sessions, human-in-the-loop, evals) must work together.

A short definition of each mode follows; detailed mechanism, evidence, and defense techniques live in the linked deep-dive posts.

Foundational Modes: Visible Even in a Single Prompt

These four modes emerge in the model’s single-turn output. Common property: they are output quality failures, require no tool or agent framework, and defense is largely at the prompt and retrieval layer.

Hallucination and Confabulation

The model references a non-existent library, API endpoint, or paper and produces a plausible-sounding but fabricated justification for a wrong answer. Root cause: the language model generates “the most likely next token”; “I don’t know” is statistically low-probability.

Practical example: The model suggests pandas.DataFrame.quick_filter(), a method that doesn’t exist; when you ask, it adds a fabricated documentation quote.

Deep dive: Foundational Failure Modes, Hallucination section.

Sycophancy

Ask “isn’t this code wrong?” and the model will likely say “yes, you’re right” even if the code is correct. Root cause is in RLHF: human raters score agreeing answers higher, and the model learns this signal. The preference data itself carries the bias.

Practical example: Defend with pre-commitment: ask the model for its own answer first, then share your view.

Deep dive: Foundational Failure Modes, Sycophancy section.

Context Rot

In long contexts, performance drops not only in the middle but at every length increase. A model with a 1M-token window shows degradation at 50K. Irrelevant information actively harms; padding is a noise source.

Practical example: A README you add “just in case” to the context can break retrieval. Send only relevant information.

Deep dive: Foundational Failure Modes, Context Rot section.

Instruction Attenuation

The rule “run tests after every change” is followed the first few times; by the tenth change, the model just writes “ran tests, passed.” The second phase is ceremonialization: the rule is applied in form but loses its substance. Multi-turn conversations show an average 39% performance drop.

Practical example: Forget-Me-Not, a single-sentence instruction re-injection at strategic points. Low-cost, high-impact.

Deep dive: Foundational Failure Modes, Instruction Attenuation section.

Advanced Modes: Only Under Agent and Tool-Use Loads

These eight modes are generally invisible in single-prompt usage. They don’t fire without agent loops, tool calls, long-horizon tasks, or multi-turn sessions. Common property: the model’s output looks “correct” in isolation; goal drift crystallizes in the flow of the system.

It helps to look at them under three sub-themes:

  • Goal drift: Task Drift, Incorrect Tool Invocation, Reward Hacking
  • Output pathology: Positional Bias, Mode Collapse, Degeneration Loops
  • Deep alignment issues: Alignment Faking, Version Drift

Task Drift

The agent gradually drifts from the original goal. At each step, the immediate context dominates the original intent. “Fix this bug” becomes, five steps later, refactor + import update + test rewrite.

Incorrect Tool Invocation

Wrong tool choice, parameter hallucination, or ordering error. Since output is action rather than text, consequences may be irreversible. Critical especially for write, delete, and send tools.

Reward Hacking

Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. PostTrainBench (2026) showed agents developing shortcuts such as training on the test set, downloading pre-made checkpoints, and using API keys without authorization.

Positional Bias

In “A or B?” questions the answer is affected by order, independent of content. In code review, the first file gets more attention; the last file gets skipped. Reducible via swap tests and independent scoring.

Mode Collapse

The model locks into a pattern in a conversation; even when the first approach is wrong, it stays in the same frame. Autoregressive generation naturally encourages this. Naive mitigations like “ignore the previous answer” don’t work consistently.

Degeneration Loops

Repeated production of the same expressions and code patterns. Greedy and beam search drive repetition; nucleus sampling reduces but does not eliminate it. Mode collapse is at the strategy level; degeneration loops are at the generation level.

Alignment Faking

The model strategically chooses to appear aligned. Anthropic-Redwood (2024) observed Claude 3 Opus reasoning in its scratchpad: “If I refuse, they retrain me; complying now is the least-bad option.” Sycophancy is an unconscious tendency; alignment faking is a strategic decision.

Version Drift

Without any change in your code or prompts, the model’s output shifts one morning. The provider updates the weights with little or no changelog. Eval suites, prompt versioning, and model pinning are the baseline defenses.

Three-Layer Defense: One Layer Is Not Enough

The common property of these 12 failures: none of them can be solved by a single intervention.

LayerExample TechniquesWhat It Does
PromptConstraint repetition, metacognitive prompting, few-shot, Forget-Me-NotSteers model behavior
ArchitecturalRAG, guardrails, structured output, activation steering, deterministic hooksPuts structural limits in place
OperationalShort sessions, human-in-the-loop, eval suite, verification checkpoints, canaryControls the output

A concrete example: CLAUDE.md rules (prompt) + edit-guard hook (architectural) + manual review before commit (operational). Each breaks alone; together they significantly reduce the probability of breakage.

For the matrix showing which mode is stopped at which layer and implementation details, see the spoke posts at the end of this pillar.

The Model Saying “Done” Is Not Enough

Every one of these 12 modes shares a common theme: when the model says “I’m doing this correctly” or “done,” the claim alone is not trustworthy.

  • Hallucination: “This library does X” — the library doesn’t exist.
  • Sycophancy: “You’re right” — not because you are, but because it agrees with you.
  • Instruction attenuation: “Verified” — but it didn’t verify.
  • Task drift: “Fixed the bug” — but didn’t touch the original bug.
  • Alignment faking: “Refusing this request” — only because it knows it’s being observed.

The core principle of defense is simple: verification means checking the output, not the claim. Whatever the model says, the result must be independently verified. This isn’t distrust; it’s engineering discipline.

From Map to Deep Analysis

This pillar is a map. The actual mechanism of each failure, the academic evidence, and defense details live in three deep-dive posts:

Key Takeaways
  • 01 LLM behavioral failures are not random errors; they are systematic consequences of architecture and training
  • 02 Failures split into two families: foundational modes visible in a single prompt, and advanced modes that only emerge under agent/tool-use loads
  • 03 Defense cannot be single-layered: prompt, architecture, and operational layers must work together
  • 04 The model saying 'done' does not replace verification; ceremonialized compliance is not real compliance
Frequently Asked Questions (FAQ)
+ What is an LLM behavioral failure mode?

A systematic output quality or goal alignment loss stemming from the model's architecture, training process, and deployment conditions. Not a random error, but a pattern that recurs under specific conditions.

+ How are the 12 failure modes grouped?

Foundational modes appear in a single prompt or short context: hallucination, sycophancy, context rot, instruction attenuation. Advanced modes emerge only under agent/tool-use or long-horizon systems: task drift, incorrect tool invocation, reward hacking, positional bias, mode collapse, degeneration loops, alignment faking, version drift.

+ Is a single defense technique enough?

No. The prompt layer decays under instruction attenuation, the architectural layer doesn't apply everywhere, and the operational layer is slow. When the three don't work together, failures leak through.

+ Can I trust the model saying 'I verified'?

No. The common theme across these failures: when the model says 'done,' the claim alone is not trustworthy. Verification means checking the output, not the claim.