ceaksan

Context Engineering for AI Coding Agents: From Static Documents to a Living Ecosystem

CLAUDE.md and architecture.md are not enough. A four-layer context engineering ecosystem combining semantic code search, knowledge base, decision governance, and learning loops. Based on real project experience.

Feb 15, 2026 10 min read
TL;DR

A single CLAUDE.md file is not enough for an AI agent to work correctly. I built a four-layer ecosystem: (1) static references (CLAUDE.md + architecture.md), (2) JIT semantic search (mcp-code-search + dnomia-knowledge), (3) decision governance (/court + ADRs), (4) learning loops (forge retro). In this post I explain what each layer does, how I applied it in a real project (3,600-line architecture.md, 155 ADRs), and why tools like Repomix became unnecessary.

The biggest productivity loss when working with AI coding agents is not writing code; it is re-introducing the project every session. A CLAUDE.md file is a good start, but it’s not enough. In this post, I explain the four-layer context engineering ecosystem I built through real project experience.

Problem: The Agent Starts From Scratch Every Session

When working on a large project with Claude Code, Cursor, or GitHub Copilot, you experience this cycle:

  1. Agent starts scanning code
  2. Makes wrong inferences due to surface-level scanning (confuses similar names in different modules)
  3. Context window starts filling with tool outputs
  4. You correct, agent applies the correction
  5. Session ends, context is lost
  6. Next session: go to 1

Research from 2025-2026 [1] shows that models perform significantly better when fed structured, persistent reference points than when they rely on repo scanning. This gave rise to the “context engineering” discipline: maximizing the signal-to-noise ratio in the agent’s context window.

But a single CLAUDE.md file doesn’t solve this alone. You need to systematically manage what is where in the project, why it’s that way, and when to access that information.

Four-Layer Ecosystem

The structure I built through experimentation in a real SaaS project (event tracking platform, monorepo, 992 source files, 155 ADRs):

| Layer | What | How | When |
| --- | --- | --- | --- |
| 1. Static references | CLAUDE.md + architecture.md | Loaded automatically at session start | Always |
| 2. JIT search | mcp-code-search + dnomia-knowledge | Semantic search when agent needs it | On demand |
| 3. Decision governance | /court + ADRs | Before new features or architectural decisions | At decision time |
| 4. Learning loop | forge retro | After completing work | At work completion |

Each layer feeds the next. Static references are the agent’s starting point, JIT search is the deepening tool, decision governance ensures consistency, and the learning loop keeps the entire system current.

Layer 1: Static References

CLAUDE.md: Giving Instructions to the Agent

CLAUDE.md is the file the agent automatically reads at the start of every session. Its content is “what to do, how to do it” instructions:

  • Performance rules (“don’t use barrel imports”, “use Promise.all for parallel calls”)
  • Workflow rules (“enter plan mode”, “write spec”, “create ADR”)
  • Deployment commands (fully executable strings)
  • Boundaries (do/don’t lists)

CLAUDE.md’s job is to manage the agent. Not to describe the project’s structure.

architecture.md: Describing the Project’s Reality

Discovering this distinction took me several sessions. I tried putting project structure, module maps, data flow into CLAUDE.md. The file grew, readability dropped, and the agent started confusing which information was a rule versus a reference.

Solution: CLAUDE.md for agent instructions, architecture.md for project reference. They complement each other but live in separate files.

architecture.md contents:

| Section | What It Describes |
| --- | --- |
| Stack and Dependencies | Technology stack with exact version numbers |
| Monorepo Structure | Directory tree, file sizes, responsibilities |
| Module Map | Each module’s responsibility, dependencies, key files |
| Data Flow | How data flows through the system (edge to DB, DB to destination) |
| Data Model | Structured summary of the Prisma schema |
| Infrastructure | Platform, region, ports, proxy, tunnel information |
| Architectural Decisions | Major decisions and why they were made (with ADR references) |
| Performance Rules | Actual application state (applied/not applied) |
| Code Hotspots | Most frequently changed files based on git change frequency |
| Related Notes | Cross-references to vault and repo documentation |
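
The “Code Hotspots” section ranks files by git change frequency. In my case that ranking came from Repomix, but a similar list could be produced directly from git history. A minimal sketch, assuming a local git checkout; the function names are hypothetical and not part of any tool mentioned here:

```python
import subprocess
from collections import Counter


def count_changes(log_text: str) -> Counter:
    """Count how often each path appears in `git log --name-only` output."""
    return Counter(line.strip() for line in log_text.splitlines() if line.strip())


def hotspots(repo_dir: str = ".", top: int = 10) -> list[tuple[str, int]]:
    """Return the `top` most frequently changed files in a git repository."""
    # --pretty=format: suppresses the commit header, so only file paths remain.
    log_text = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return count_changes(log_text).most_common(top)
```

Calling `hotspots(".", top=20)` in the repo root would give a candidate list to paste into architecture.md and review by hand.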

In my project, this file reached ~3,600 lines. That sounds like a lot, but the agent doesn’t read the entire file every time; it jumps to the section it needs, and the heading hierarchy makes that easy.
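
To illustrate why the heading hierarchy matters, here is a rough sketch of pulling a single section out of a large Markdown file. It assumes ATX-style headings (`#`, `##`, ...) and ignores edge cases such as `#` lines inside fenced code; `read_section` is a hypothetical helper, not something the agent literally runs:

```python
import re


def read_section(markdown: str, title: str) -> str:
    """Return the heading line and body of the section named `title`.

    The section ends at the next heading of the same or a higher level.
    """
    lines = markdown.splitlines()
    heading = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
    start, level = None, 0
    for i, line in enumerate(lines):
        match = heading.match(line)
        if not match:
            continue
        if start is None:
            # Still looking for the requested section.
            if match.group(2).lower() == title.lower():
                start, level = i, len(match.group(1))
        elif len(match.group(1)) <= level:
            # A sibling or parent heading closes the section.
            return "\n".join(lines[start:i]).strip()
    return "\n".join(lines[start:]).strip() if start is not None else ""
```

With a function like this, a question about ports or tunnels only costs the tokens of the “Infrastructure” section, not the whole file.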

Repomix Experience: Why It Became Unnecessary

While writing architecture.md, I also tried Repomix. Repomix is a tool that packages the codebase into a single Markdown file, extracting function signatures with Tree-sitter.

Results:

| Mode | File Count | Tokens | Assessment |
| --- | --- | --- | --- |
| Full (entire repo) | 992 | 1.9M | 10x the context window |
| Compressed (entire repo) | 992 | 1.25M | Still unusable |
| Compressed (TS/TSX only) | 489 | 90K | Theoretically fits but half the context |
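
The assessments follow from simple arithmetic on the token counts in the Repomix results above. The sketch below assumes a 200,000-token context window; that figure is my assumption for illustration (it is consistent with “10x” and “half the context”), so adjust it for your model:

```python
CONTEXT_WINDOW = 200_000  # assumed window size, not a measured value

# Token counts taken from the Repomix results above.
repomix_runs = {
    "Full (entire repo)": 1_900_000,
    "Compressed (entire repo)": 1_250_000,
    "Compressed (TS/TSX only)": 90_000,
}


def window_ratio(tokens: int, window: int = CONTEXT_WINDOW) -> float:
    """How many context windows a dump of `tokens` tokens would occupy."""
    return tokens / window


for mode, tokens in repomix_runs.items():
    print(f"{mode}: {window_ratio(tokens):.2f}x of the window")
```

Under that assumption the full dump is about 9.5 windows, the compressed dump about 6.25, and the TS/TSX-only dump about 0.45 of one window, before any conversation or tool output is added.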

Repomix’s directory tree and git-change-count ranking were useful. I extracted hotspot files from there. But its main value is in tools that “can’t read files” (ChatGPT web interface, web Claude). Claude Code already reads files directly. Structured architecture.md + JIT search provides more targeted information than Repomix’s flat dump.

The Practical Impact of Separation

Before architecture.md, the agent’s investigation of the Inngest proxy worker structure took ~20 minutes (SSH attempts, API endpoint guessing, port scanning). After architecture.md, the agent directly reads the relevant section: container name, ports, tunnel routes, env vars. All in one place.

Similarly, answering “why did we switch to Neon?” used to require scanning 155 ADRs. Now architecture.md’s “Architectural Decisions” section has a summary with the ADR-127 reference, and the agent opens the ADR file if needed.

Layer 2: JIT Semantic Search

Static references are loaded every session, but not all information fits in static files. In a 992-file codebase, the answer to “where is consent checking done?” might be spread across 4-5 different files. Writing this into architecture.md would bloat the file, and having the agent grep for it every time consumes the context window.

Solution: a semantic search layer where the agent can pull only relevant information on demand.

mcp-code-search: Semantic Search for Code

mcp-code-search is a semantic code search server that connects to Claude Code via MCP (Model Context Protocol).

How it works:

Directory scan -> Tree-sitter AST parse -> Chunk (function/class/method) -> Embed (jina-v2-base-code) -> LanceDB -> Hybrid search

When the agent asks for “authentication middleware” or a “rate limiting implementation”, the server matches on meaning rather than on literal strings, as grep would. It finds the “rate-limiter.ts” file, but it can also surface files with unrelated names that use the “token bucket” pattern.
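
The “Chunk (function/class/method)” step in the pipeline above can be sketched with Python’s standard `ast` module. This is a stand-in for illustration only: mcp-code-search uses Tree-sitter and handles many languages, while this sketch only parses Python, and the chunk field names are my own:

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per function, method, or class."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(
                {
                    "name": node.name,
                    "kind": type(node).__name__,
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                    # Exact source text of the definition, ready to embed.
                    "text": ast.get_source_segment(source, node),
                }
            )
    # ast.walk is breadth-first, so restore source order.
    return sorted(chunks, key=lambda chunk: chunk["start_line"])
```

Note that a class chunk contains the text of its methods, so the method chunks overlap with it. Whether to keep both granularities is a design choice for the indexer.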

| Feature | Detail |
| --- | --- |
| Chunking | Tree-sitter AST (40+ languages) |
| Embedding | jina-embeddings-v2-base-code (768 dim, code-focused) |
| Storage | LanceDB (local, zero network) |
| Search | Hybrid: vector similarity + FTS, RRF merge (k=60) |
| Incremental indexing | Hash-based, only changed files |
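
Hash-based incremental indexing means a file is re-embedded only when its content fingerprint changes. A minimal sketch of the idea, assuming SHA-256 digests and an in-memory state dictionary; the function names and the `*.py` pattern are hypothetical and do not reflect the tool’s actual implementation:

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used as a change fingerprint."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_reindex(root: Path, known: dict[str, str], pattern: str = "*.py"):
    """Compare current digests with `known`.

    Returns (to_index, removed, current): paths that are new or changed,
    paths that disappeared, and the digest map to store for the next run.
    """
    current = {
        str(path.relative_to(root)): file_digest(path)
        for path in sorted(root.rglob(pattern))
        if path.is_file()
    }
    to_index = [p for p, digest in current.items() if known.get(p) != digest]
    removed = [p for p in known if p not in current]
    return to_index, removed, current
```

On the first run `known` is empty, so every file is indexed. Later runs only touch files whose digest differs from the stored one.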

Why grep isn’t enough: For “where is consent checking done?”, grep searches for the word consent and returns 47 results. mcp-code-search approaches the same question semantically and returns the 5-10 most relevant chunks. At most 10 snippets enter the context window instead of 47 raw matches.
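
The hybrid search listed among the features merges vector and full-text results with reciprocal rank fusion (RRF, k=60). A minimal sketch of that merge step, assuming each backend returns an ordered list of chunk IDs; the function name and sample IDs are hypothetical, and this is not mcp-code-search’s actual code:

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) to an item's score, with ranks
    starting at 1. Items ranked well by several lists rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["chunk_a", "chunk_b", "chunk_c"]
fts_hits = ["chunk_b", "chunk_d", "chunk_a"]
print(rrf_merge([vector_hits, fts_hits]))
```

Here `chunk_b` ends up first because it ranks near the top of both lists, even though neither backend needed to agree on raw scores. That independence from score scales is the usual reason to pick RRF for hybrid search.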

dnomia-knowledge: Semantic Search for Knowledge Base

dnomia-knowledge is a knowledge management MCP server that indexes Markdown, MDX, and code files.

How it differs from mcp-code-search:

| | mcp-code-search | dnomia-knowledge |
| --- | --- | --- |
| Focus | Code files | Markdown + code + web content |
| Embedding | jina-v2-base-code (code-focused) | multilingual-e5-base (multilingual) |
| Storage | LanceDB | SQLite + FTS5 + sqlite-vec |
| Chunking | AST-based (function/class) | Heading-based (## and ###) |
| Extra features | Find similar code | Knowledge graph, web indexing |
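
Heading-based chunking, as opposed to AST-based chunking, can be sketched in a few lines. This version starts a new chunk at every `##` or `###` heading and keeps any text before the first such heading as a preamble chunk. It is an illustration under my own assumptions, not dnomia-knowledge’s code:

```python
import re

# Matches exactly two or three leading '#' characters followed by a title.
HEADING = re.compile(r"^(#{2,3})\s+(.+?)\s*$")


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into chunks that each start at a ## or ### heading."""
    chunks = []
    current = {"heading": "", "level": 0, "lines": []}

    def flush() -> None:
        # Skip an empty preamble, keep everything else.
        if current["heading"] or any(line.strip() for line in current["lines"]):
            chunks.append(
                {
                    "heading": current["heading"],
                    "level": current["level"],
                    "text": "\n".join(current["lines"]).strip(),
                }
            )

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            current = {"heading": match.group(2), "level": len(match.group(1)), "lines": []}
        else:
            current["lines"].append(line)
    flush()
    return chunks
```

Each chunk would then be embedded and stored with its heading, so a search hit can point back to a specific section of a note.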

They work together: the agent directs architectural questions to dnomia-knowledge and implementation questions to mcp-code-search. Both are connected via MCP, and the agent decides which is more appropriate. dnomia-knowledge also tracks developer interaction: which files are read most and which searches return zero results. It uses this data to apply an interaction boost that personalizes search results.

Progressive Disclosure: Revealing Information Gradually

Even a 1M token context window degrades in performance when filled with too much information [2]. That’s why “pull what’s needed” beats “load everything”.

In the ecosystem, this works as follows:

  1. Session start: CLAUDE.md + architecture.md loaded automatically (static, always needed)
  2. First question: Agent reads the relevant section from architecture.md (jump-to pointer)
  3. Deepening: Agent sends semantic query to mcp-code-search or dnomia-knowledge (JIT)
  4. Decision needed: Agent opens the ADR file (on-demand)

At each step, only the needed information enters the context. This is the opposite of Repomix’s “put everything in one file” approach.

Layer 3: Decision Governance

Knowing the codebase structure and being able to search isn’t enough. If you can’t answer “why are we using Inngest instead of Redis?”, the agent might one day want to add Redis. Or try to solve a problem using a method you previously evaluated and rejected.

/court: Structured Evaluation

/court is an evaluation skill that runs before new features or architectural decisions. It applies the Decision Gate framework’s 8 criteria and delivers a verdict of GO, DEFER, or KILL.

But /court’s real value for context engineering: every decision is recorded as an ADR. The answer to “why did we make this decision?” doesn’t get lost in session-based context. When the agent returns to the same topic in the future, it can read the previous evaluation and its rationale.

ADRs: Living Constraints

Architecture Decision Records are usually thought of as passive logs. But in the agent ecosystem, they’re active constraints:

  • Architectural boundaries: “No direct DB queries from the collect worker, must go through Hyperdrive” (ADR-062)
  • Data processing constraints: “PII cannot be logged in plaintext, AES-256-GCM + blind index” (ADR-137)
  • Rejected alternatives: “Redis cache evaluated, rejected due to operational burden” (ADR-128)

The agent sees the summary in architecture.md’s “Architectural Decisions” section. If detail is needed, it opens the ADR file. This prevents the same discussion from recurring.
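
One way to picture ADRs as active constraints is as a small index that a proposal gets checked against. The sketch below reuses the three ADR summaries above; the `kind` labels, keyword lists, and substring matching are hypothetical simplifications. In practice the agent reads the summaries and reasons over them rather than matching keywords:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ADR:
    number: int
    kind: str  # hypothetical labels: "boundary", "data", "rejected"
    summary: str
    keywords: tuple[str, ...]  # hypothetical trigger words for the demo


ADR_INDEX = [
    ADR(62, "boundary",
        "No direct DB queries from the collect worker, must go through Hyperdrive",
        ("collect worker", "db query", "hyperdrive")),
    ADR(137, "data",
        "PII cannot be logged in plaintext, AES-256-GCM + blind index",
        ("pii", "plaintext", "blind index")),
    ADR(128, "rejected",
        "Redis cache evaluated, rejected due to operational burden",
        ("redis",)),
]


def relevant_adrs(proposal: str, index: list[ADR] = ADR_INDEX) -> list[ADR]:
    """Return the ADRs whose keywords appear in the proposal text."""
    text = proposal.lower()
    return [adr for adr in index if any(kw in text for kw in adr.keywords)]


for adr in relevant_adrs("Add a Redis cache in front of the API"):
    print(f"ADR-{adr.number:03d} ({adr.kind}): {adr.summary}")
```

The example prints the ADR-128 entry, which is the point: the rejected alternative surfaces before anyone spends a session re-evaluating it.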

My project has 155 ADRs. Instead of writing each one into architecture.md, I summarized 12 major decisions in 4 categories and added the ADR index as a reference. The agent starts from the summary, deepens if needed.

“Kernel of Truth” Workflow

I didn’t write all 155 ADRs from scratch. Most were created with the “Kernel of Truth” pattern: I write one sentence (“We switched to Neon because there was a Docker port bypass security vulnerability”), the agent expands it into a structured ADR format. Writing effort is minimal, but the decision record is permanent.

Layer 4: Learning Loop

The first three layers provide information. The fourth layer keeps information current.

Forge Retro: Extracting Patterns from Completed Work

The last step of the Forge pipeline, /retro, extracts permanent patterns from completed features:

  1. Read the court decision (why did we GO?)
  2. Examine the implementation (what changed?)
  3. Check critique findings (what issues came up?)
  4. If a permanent pattern exists, add it to CLAUDE.md or architecture.md
  5. Clean up temporary information

Critical point: retro doesn’t only grow the knowledge base, it also prunes it. When a pattern has repeated 3 times, it says “this should be a rule” and adds it to CLAUDE.md. When something was a temporary workaround, it deletes it. Upsert logic, not append.
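
The upsert-not-append idea can be sketched as a pure function over a rule set. The threshold of 3 comes from the paragraph above; the dictionary shape, keys, and field names are hypothetical, since /retro works on Markdown files rather than Python dictionaries:

```python
PROMOTION_THRESHOLD = 3  # "repeated 3 times" from the text above


def retro_update(rules: dict[str, str], observations: list[dict]) -> dict[str, str]:
    """Return a new rule set after applying retro observations.

    - Temporary workarounds are removed (pruning).
    - Patterns seen at least PROMOTION_THRESHOLD times are written under
      their key, replacing any older wording (upsert, not append).
    - Everything else leaves the rules untouched.
    """
    updated = dict(rules)
    for obs in observations:
        key = obs["key"]
        if obs.get("temporary"):
            updated.pop(key, None)
        elif obs.get("occurrences", 0) >= PROMOTION_THRESHOLD:
            updated[key] = obs["text"]
    return updated
```

Because each rule lives under a stable key, a refined wording overwrites the old one instead of piling up next to it, which is what keeps the file from growing with every retro.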

Closing the Loop

Session start: CLAUDE.md + architecture.md loaded
    |
Working: deepening via JIT search
    |
Decision: evaluation via /court -> ADR record
    |
Work complete: pattern extraction via /retro
    |
Update: CLAUDE.md / architecture.md updated
    |
Next session: current references loaded

With each iteration, references become slightly more accurate, slightly more current. The agent does slightly less discovery and slightly more production each session.

Research Foundation

I didn’t invent this ecosystem from scratch. Findings compiled from 2025-2026 research (academic papers, Gemini Deep Research, GPT-4o analysis, Kimi K2.5 research) formed the foundation:

Codified Context approach [1]: A three-layer system tested in a 108,000-line C# project (Hot Memory, Domain Expert Agents, Cold Memory). Critical finding: documentation is infrastructure, it requires maintenance like code.

AGENTS.md ecosystem: Different config files for different AI tools (CLAUDE.md, AGENTS.md, .cursorrules) but all serving the same purpose: giving the agent structured context. “Nearest-Wins” model: root file provides global standards, subdirectory files provide local guidance.
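
The “Nearest-Wins” lookup can be sketched as a walk from the edited file’s directory up to the repo root. This is an illustration of the model as described, assuming one guidance file name per directory; how each tool actually resolves and merges these files may differ:

```python
from pathlib import Path


def nearest_guidance(file_path: Path, repo_root: Path, name: str = "AGENTS.md") -> list[Path]:
    """Return guidance files for `file_path`, nearest directory first.

    The first entry is the most local guidance; the last is the root
    file with global standards.
    """
    root = repo_root.resolve()
    current = file_path.resolve().parent
    found = []
    while True:
        candidate = current / name
        if candidate.is_file():
            found.append(candidate)
        # Stop at the repo root, or at the filesystem root as a safety net.
        if current == root or current == current.parent:
            break
        current = current.parent
    return found
```

A tool following this model would apply the first entry where it conflicts with later ones and fall back to the root file for everything else.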

Hybrid approach consensus: All sources converge on the same point: “what” is auto-generated (schema, types, dependency graph), “why” is human-written (design decisions, constraints, trade-offs). Together, they give the agent the full picture.

Progressive disclosure: Opening information only at the moment of need rather than loading it all at once. Jump-to pointers, executable search commands, nested overrides.

What Worked and What Didn’t in Practice

| Investment | Result |
| --- | --- |
| architecture.md (~3,600 lines) | Agent’s discovery time dropped significantly. Especially for infrastructure questions (ports, proxy, tunnel), direct reference instead of trial-and-error. |
| ADR index (155 decisions) | Recurring discussions ended. Being able to say “we evaluated this before” is very valuable. |
| mcp-code-search | More accurate than grep, especially for “find all places that do X” queries. |
| dnomia-knowledge | Very useful for vault notes and documentation search. Complementary when combined with code search. |
| /court | In a codebase audit, 6 out of 28 tasks were eliminated or deferred. Filters bad ideas early. |
| Repomix | Became unnecessary except for hotspot analysis. Claude Code can already read files. |
| Forge retro | Not enough data yet (new). Concept is correct, too early for impact measurement. |

Template

If you want to adapt this structure to your own project, here’s the minimum starting set:

Small projects (single service, 50-100 files):

  • CLAUDE.md (rules + commands)
  • A single-page structure summary in architecture.md is sufficient

Medium projects (monorepo, 200-500 files):

  • CLAUDE.md + architecture.md (separate files)
  • mcp-code-search (semantic code search)
  • ADRs (for major decisions)

Large projects (500+ files, multiple services):

  • All four layers
  • architecture.md with module map, data flow, infrastructure, architectural decisions
  • dnomia-knowledge (documentation + code search)
  • /court + forge pipeline

In every case: CLAUDE.md gives instructions, architecture.md describes reality. This distinction is fundamental. If you want to start with a structured template rather than writing architecture.md from scratch, check out the Living Architecture template I derived from these experiences.

References

Open Source Tools

Academic and Industry Sources

  • Codified Context: Infrastructure for AI Agents in a Complex Codebase (arxiv.org/html/2602.20478v1)
  • AgenticAKM: Agentic Architecture Knowledge Management (arxiv.org/html/2602.04445v1)
  • AGENTS.md Standard (aihero.dev)
  • C4 Model (c4model.com)
  • Repomix (github.com/yamadashy/repomix)

Footnotes

  1. “Codified Context: Infrastructure for AI Agents in a Complex Codebase” (arxiv.org/html/2602.20478v1). A three-layer system tested in a 108,000-line C# project.
  2. Context window performance degradation at higher fill ratios has been observed across multiple benchmarks. “Lost in the Middle” (Liu et al., 2023) was among the first studies to document this phenomenon.
Key Takeaways
  • 01 CLAUDE.md tells the agent what to do, architecture.md describes the project's actual structure. They serve different purposes and cannot replace each other.
  • 02 Instead of loading all information into the context window, pulling only what's needed via JIT (just-in-time) semantic search is more efficient.
  • 03 Structured evaluation before decisions (/court) and recording decisions as ADRs ensures the agent remains consistent in the future.
  • 04 Without a learning loop, the ecosystem stagnates. Forge retro extracts permanent patterns from completed work and updates the rules, closing the loop.
Frequently Asked Questions (FAQ)
What is context engineering?

The discipline of maximizing the signal-to-noise ratio in an AI agent's context window. Giving the agent the right information, at the right time, in the right format. A systematic infrastructure approach beyond prompt engineering.

What is the difference between CLAUDE.md and architecture.md?

CLAUDE.md gives instructions to the agent: 'don't use barrel imports', 'use Promise.all for parallel calls'. architecture.md describes the project's reality: which module is where, how data flows, which decisions were made and why. One says 'how to work', the other says 'what you're working with'.

Are tools like Repomix unnecessary?

For tools that can read files like Claude Code, yes. Repomix packages a 992-file repo into 1.9M tokens, 10x the context window. Compressing the entire repo still yields 1.25M tokens, and even the compressed TS/TSX-only version is 90K tokens, about half the context. Structured architecture.md + JIT search provides the same information much more efficiently. Repomix is still useful for tools that can't read files (ChatGPT web, web Claude) to provide repo context.

How much effort does building this ecosystem require?

Writing architecture.md from scratch takes a few sessions (in my case ~3,600 lines). But the agent itself helps: scanning and summarizing ADRs, analyzing the codebase to produce module maps. After the initial investment, maintenance cost is low because only the relevant section gets updated with each major change.

Is this much structure necessary for a solo developer?

It's even more necessary for a solo developer. If you had a team, someone could tell you 'we made that decision for this reason'. Working solo, the AI agent is your teammate, but it starts from scratch every session. This ecosystem prevents the agent from rediscovering the project every time.