The Problem: Skills Work, Memory Doesn’t
When working with Claude Code, each skill does its job well. /court evaluates decisions, /review examines code, /plan creates structure. But there’s no connection between them.
I make a decision, come back to the same topic two weeks later, and can’t remember why I made it. An issue found during critique resurfaces three features later in the same pattern. At the end of a sprint, I repeat the same mistakes because I never noted what worked and what didn’t.
The problem isn’t the skills. The problem is that skills are session-scoped: when a session ends, its context vanishes, and the next session starts from zero. Worse, a session may have progressed entirely on the wrong context.
Result: A Feature’s Journey Through the Pipeline
Before diving into technical details, I want to show how a feature progresses through Forge. Say I need to add a new payment integration.
The first step happens outside the pipeline: I create a spec using plan mode and interactive mode. It covers what needs to be done, why, the constraints, and which files will be affected. This spec becomes the pipeline’s input.
0. Spec preparation (pre-pipeline)
-> Spec created via plan mode + interactive mode
-> What, why, constraints, affected files defined
1. /court paddle-recurring-billing
Sent the spec to Gemini, Kimi, and Claude
-> Evaluation from three different perspectives
-> Verdict: GO (7.2/10)
-> forge/decisions/2026-03-05_paddle-recurring-billing.md written
2. /implement paddle-recurring-billing
-> Verified court GO and spec from memory
-> Wrote execution plan, waited for my approval
-> Developed in worktree, ran tests
-> forge/active/paddle-recurring-billing_implement.md written
3. /critique
-> Read git diff (not the implementer's self-assessment)
-> Found 3 MUST_FIX, 2 NICE_TO_HAVE
-> Verdict: REJECTED
-> forge/active/paddle-recurring-billing_critique.md written
4. /implement (again, with critique feedback)
-> Fixed MUST_FIX items
-> New critique: PASS
5. /retro paddle-recurring-billing
-> Wrote "Paddle webhook signature verification must always be the first step" pattern to core-rules/workflow.md
-> Checked for conflicts with existing rules
-> Moved active notes to archive
Every step read the previous steps’ output from memory. No skill started from scratch. The pattern that emerged in retro will be checked by critique in every future Paddle implementation.
Why Something New
There are existing ways to solve this problem. But each has a limit in my development workflow.
Enterprise SDLC Tools
Jira, Linear, Notion workflows. Designed for teams, high overhead for a solo developer. Opening tickets, managing boards, and updating statuses create friction instead of value for one person. I already have a knowledge management system built on Obsidian; adding a separate project management tool on top means synchronization overhead between two sources. That said, these tools can be integrated easily via MCP.
Multi-Agent Systems
Autonomous systems where multiple agents trigger each other. Powerful in theory, but two issues arise in a solo context: loss of control and debugging difficulty. When one agent produces faulty output, other agents use it as input, creating an error chain. A single person can struggle to trace and fix that chain.
Manual Note-Taking
Writing down every decision and every review by hand. It works, but it requires discipline and isn’t structural: connecting notes, detecting contradictions, and cleaning out outdated information are all manual. Over time the note pile grows while its usability drops. Carrying the notes around and keeping them current while working remotely also takes significant effort past a certain point.
| Approach | Strength | Weakness (solo context) |
|---|---|---|
| Enterprise SDLC | Structural, traceable | Overhead for one person |
| Multi-agent | Autonomous, parallel | Loss of control, hard to debug |
| Manual notes | Simple, zero dependencies | Not structural, hard to maintain |
| Forge | Structural, memory-backed, HITL | Claude Code dependency |
The Choice: Four Skills, Shared Memory, Human-in-the-Loop
Forge’s design principles:
- Human-in-the-loop. I trigger every step. Skills don’t call each other, no dynamic routing. Predictability comes before cleverness.
- Memory replaces handoffs. Skills don’t pass context to each other. They write to and read from a shared Obsidian vault. Any skill can access any prior decision.
- Adversarial review is mandatory. Critique can’t say “everything looks good.” A minimum finding count is structurally enforced.
- Retro prunes memory. The knowledge base doesn’t grow forever. Lasting patterns are extracted, the rest is archived.
- Sequential, not parallel. Solo context. One skill at a time, full attention, no split-context cost.
Pipeline
Spec preparation happens outside the pipeline. Plan mode and interactive mode define what needs to be done, why, and the constraints. The resulting spec becomes court’s input.
spec (pre-pipeline, plan + interactive mode)
-> court (evaluate)
-> implement (develop)
-> critique (challenge)
-> retro (consolidate)
-> court (next decision...)
If critique returns REJECTED, the pipeline loops back to implement. This cycle repeats until critique returns PASS.
SPEC -> DECIDED -> IMPLEMENTING -> CRITIQUING -> PASS/REJECTED -> RETRO -> ARCHIVED
                        ^                                |
                        |            REJECTED            |
                        +--------------------------------+
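The lifecycle above can be read as a small transition table. What follows is only an illustrative sketch: Forge keeps state in note frontmatter rather than in code, and the names `TRANSITIONS`, `can_move`, and `walk` are hypothetical. The `--hotfix` bypass and the DEFER/KILL verdicts are deliberately left out.

```python
# Allowed status transitions, transcribed from the lifecycle line above.
# Hypothetical helper: Forge itself stores status in markdown frontmatter.
TRANSITIONS = {
    "SPEC": {"DECIDED"},
    "DECIDED": {"IMPLEMENTING"},
    "IMPLEMENTING": {"CRITIQUING"},
    "CRITIQUING": {"PASS", "REJECTED"},
    "REJECTED": {"IMPLEMENTING"},  # the critique loop
    "PASS": {"RETRO"},
    "RETRO": {"ARCHIVED"},
    "ARCHIVED": set(),
}


def can_move(current: str, target: str) -> bool:
    """Return True if the lifecycle allows moving from current to target."""
    return target in TRANSITIONS.get(current, set())


def walk(history: list[str]) -> bool:
    """Check that every consecutive pair in a status history is allowed."""
    return all(can_move(a, b) for a, b in zip(history, history[1:]))
```

A history that goes through one rejection, such as the payment example earlier, passes `walk`; skipping straight from SPEC to IMPLEMENTING does not.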
Court: Evaluation
The pipeline’s entry point. The prepared spec is evaluated here. Scoring across eight criteria: Benefit, Necessity, Burden, Conflict, Performance, Security, Bottleneck, Currency.
Multi-perspective input is provided through Gemini and Kimi MCPs. Each AI evaluates the same spec from a different angle. Result: GO, DEFER, or KILL.
The decision, along with its rationale and dissent, is written to the forge/decisions/ folder. Any skill can read this decision later.
Implement: Development
Development starts after court gives GO. But first, a gate:
- Read court decision from memory and verify GO
- Read spec from memory
- Write execution plan and wait for user approval
- After approval, develop in worktree
- Run tests
- Write memory note (changes, files, decisions, test results)
- Present commit message draft
The --hotfix flag can bypass court. But this decision is logged and audited during retro: “Should this hotfix have gone through court?”
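The gate at the top of that list can be sketched as a single check. This is a minimal illustration, not Forge’s actual code: the `verdict` frontmatter key on the court note, the `GateResult` type, and the choice to let a hotfix skip the spec check as well are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateResult:
    allowed: bool
    reason: str
    audit_in_retro: bool = False  # set for hotfixes, which retro reviews later


def implement_gate(decision: Optional[dict], spec_found: bool,
                   hotfix: bool = False) -> GateResult:
    """Decide whether /implement may start.

    `decision` is the court note's frontmatter as a dict, or None if no
    note was found. The `verdict` key is an assumed field name.
    """
    if hotfix:
        # Court is bypassed, but the bypass is recorded for the retro audit.
        return GateResult(True, "hotfix: court bypassed", audit_in_retro=True)
    if decision is None:
        return GateResult(False, "no court decision found in memory")
    if decision.get("verdict") != "GO":
        return GateResult(False, f"court verdict is {decision.get('verdict')}, not GO")
    if not spec_found:
        return GateResult(False, "spec not found in memory")
    return GateResult(True, "court GO and spec verified")
```

Returning a reason alongside the boolean keeps the human in the loop: a blocked run says why it was blocked instead of failing silently.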
Critique: Adversarial Review
Runs after implement finishes. Examines the code like a senior auditor. Structural rules:
- “Everything looks good” is forbidden
- Minimum 2 risks, 1 performance issue, 1 edge case required
- Every finding with file:line reference
- Form your own opinion from git diff before reading the implementer’s self-assessment
- Classify every finding as MUST_FIX or NICE_TO_HAVE
- Don’t suggest architectural changes, architecture is court’s job
Stack-specific checks: N+1 queries and missing select_related for Django; hook rules and dependency arrays for React; missing indexes and JSONB queries for PostgreSQL.
Pre-mortem analysis: “This code went to production, something broke. What broke?” 10x load, malicious input, external service failure, context loss after 6 months.
Verdict: Any MUST_FIX means REJECTED. Otherwise PASS.
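The structural rules and the verdict rule are mechanical enough to express as a check. The sketch below assumes findings are already collected as records; the `Finding` type and the kind labels `risk`, `performance`, and `edge_case` are hypothetical names, not Forge’s format.

```python
import re
from dataclasses import dataclass

# Minimum findings per kind, from the structural rules above.
MINIMUMS = {"risk": 2, "performance": 1, "edge_case": 1}
SEVERITIES = ("MUST_FIX", "NICE_TO_HAVE")
LOCATION = re.compile(r"^[^:\s]+:\d+$")  # file:line, e.g. billing/webhooks.py:42


@dataclass
class Finding:
    kind: str      # "risk", "performance" or "edge_case" (assumed labels)
    severity: str  # "MUST_FIX" or "NICE_TO_HAVE"
    location: str  # file:line reference
    note: str


def validate(findings: list[Finding]) -> list[str]:
    """Return rule violations; an empty list means the report is well-formed."""
    problems = []
    for kind, minimum in MINIMUMS.items():
        count = sum(1 for f in findings if f.kind == kind)
        if count < minimum:
            problems.append(f"need at least {minimum} {kind} finding(s), got {count}")
    for f in findings:
        if f.severity not in SEVERITIES:
            problems.append(f"unclassified finding at {f.location}")
        if not LOCATION.match(f.location):
            problems.append(f"missing file:line reference: {f.location!r}")
    return problems


def verdict(findings: list[Finding]) -> str:
    """Any MUST_FIX means REJECTED; otherwise PASS."""
    problems = validate(findings)
    if problems:
        raise ValueError("; ".join(problems))
    return "REJECTED" if any(f.severity == "MUST_FIX" for f in findings) else "PASS"
```

Note that an empty report never yields PASS here: it fails validation first, which is the code equivalent of forbidding “everything looks good.”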
Retro: Consolidation
Runs after a feature is completed. Does three things:
Pattern extraction: Extracts lasting rules from the implement and critique cycle. Things like “Paddle webhook signature verification must always be the first step.” These rules are written to stack-based files: core-rules/react.md, core-rules/django.md, core-rules/postgres.md, core-rules/workflow.md.
Conflict detection: If a new pattern conflicts with an existing rule, it detects it. Doesn’t resolve automatically, presents it to me: keep the existing rule, replace with the new one, or scope both by context.
Cleanup: Flags active notes older than 30 days for archive or deletion. The knowledge base doesn’t grow, it gets pruned. Upsert, not append.
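The cleanup step boils down to an age filter. A minimal sketch, assuming each active note’s date is already known (for example from the filename prefix); `stale_notes` is a hypothetical helper, and it only flags notes, leaving the archive-or-delete decision to the human.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=30)  # threshold stated in the cleanup rule above


def stale_notes(notes: dict[str, date], today: date) -> list[str]:
    """Return names of active notes older than 30 days, oldest first."""
    stale = [name for name, written in notes.items() if today - written > MAX_AGE]
    return sorted(stale, key=lambda name: notes[name])


# Example with made-up note names and dates.
active = {
    "paddle-recurring-billing_implement.md": date(2026, 3, 5),
    "paddle-recurring-billing_critique.md": date(2026, 3, 11),
}
flagged = stale_notes(active, today=date(2026, 4, 10))
```

Here “older than 30 days” is read strictly, so a note written exactly 30 days ago is not flagged yet.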
Memory Structure
All skills write to and read from a shared Obsidian vault. Access is provided through basic-memory MCP 1.
forge/
  decisions/      <- /court outputs (ADRs)
  active/         <- /implement and /critique notes (work in progress)
  core-rules/     <- lasting patterns from /retro
    react.md
    django.md
    postgres.md
    workflow.md
  archive/        <- completed work, moved by /retro
File naming: YYYY-MM-DD_[feature-name]_[stage].md
Every note carries status in its frontmatter: status: DECIDED, status: IMPLEMENTING, status: CRITIQUING, verdict: PASS. Any skill can query any note’s status.
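To make the naming convention and the status frontmatter concrete, here is a small standard-library sketch of how a note could be read. It assumes the frontmatter is a flat `key: value` block between `---` lines, as Obsidian uses, and it treats the stage suffix as optional; `parse_name` and `read_frontmatter` are hypothetical helpers, not part of Forge or basic-memory.

```python
import re
from datetime import date

# YYYY-MM-DD_[feature-name]_[stage].md, with the stage treated as optional.
NOTE_NAME = re.compile(
    r"^(?P<day>\d{4}-\d{2}-\d{2})_(?P<feature>[^_.]+)(?:_(?P<stage>[^_.]+))?\.md$"
)


def parse_name(filename: str):
    """Split a note filename into date, feature and stage, or return None."""
    m = NOTE_NAME.match(filename)
    if m is None:
        return None
    return {
        "date": date.fromisoformat(m["day"]),
        "feature": m["feature"],
        "stage": m["stage"],  # None when the filename has no stage suffix
    }


def read_frontmatter(text: str) -> dict:
    """Read flat key: value pairs from a leading --- ... --- block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as no frontmatter


note = "---\nstatus: CRITIQUING\nverdict: PASS\n---\n\n## Findings\n"
meta = read_frontmatter(note)
```

Any skill that wants to know where a feature stands only needs `meta["status"]` or `meta["verdict"]`; the body of the note stays free-form markdown.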
This approach converges with Karpathy’s later LLM Wiki proposal 2: raw sources + LLM-maintained markdown wiki + workflow schema. Forge’s skill-shared vault with status frontmatter is the skill-driven instance of the same idea.
Ecosystem
Forge works standalone but reaches its full potential with other tools.
| Tool | Role in Forge | Required? |
|---|---|---|
| decision-gate 3 | Court skill, pipeline entry | Yes (for court) |
| dnomia-knowledge 1 | Shared memory layer | No (auto-memory fallback exists) |
| mcp-code-search 4 | Codebase analysis for critique | No (grep/glob is sufficient) |
| chief-of-staff 5 | Upstream task triage | No (manual selection is enough) |
| edit-guard 6 | Safe file editing during implement | No (3+ edit rule is enough) |
Which combination makes sense by context:
| Context | Recommended stack |
|---|---|
| Quick feature (< 1 day) | forge + decision-gate |
| New project setup | forge + decision-gate + dnomia-knowledge + mcp-code-search |
| Daily operations | chief-of-staff + forge + dnomia-knowledge |
| Large refactoring | forge + mcp-code-search + edit-guard |
| Minimal (pipeline only) | forge standalone (auto-memory fallback) |
What We Consciously Rejected
During Forge’s design, some proposals from Gemini and Kimi were rejected. Each was a good idea but didn’t fit my constraints.
| Proposal | Source | Why rejected |
|---|---|---|
| Event-driven architecture | Kimi | Over-engineering for a sequential pipeline |
| State machine class | Kimi | Markdown + folder structure is sufficient |
| 70 agent personas | agency-agents 7 | 4 skills for solo dev, not 70 agents |
| Dynamic routing | Gemini + agency-agents | Routing stays with the human, not the AI |
| Rule DAG (YAML) | Kimi | Markdown + conflict check is simpler |
| Separate /metrics skill | Kimi | Premature optimization, let the need emerge |
| “Angry engineer” persona 8 | Gemini | Structural constraints beat roleplay |
What Forge Doesn’t Do
- No autonomous agent chaining. Skills don’t call each other.
- No real-time agent negotiation. Sequential pipeline, not a debate simulator.
- No dynamic routing. Pipeline order is fixed.
- No complex state management. State lives in memory notes, not a database.
- Not enterprise SDLC. 4 skills, not 70 agents. Designed for one person to ship product.
Setup
git clone https://github.com/ceaksan/forge.git
cd forge
./install.sh
install.sh symlinks skills to ~/.claude/skills/forge/. Updates are automatic.
# Usage
/court [topic] # Evaluation
/implement [feature-name] # Development
/implement --hotfix [description] # Quick fix, court bypassed
/critique # Adversarial review
/retro [feature-name] # Consolidation
Closing
Forge isn’t the right solution for everyone. For team-based projects, Jira or Linear workflows are more appropriate. For simple bug fixes, pipeline overhead is unnecessary.
In my case, working solo, I needed a structure to carry decision rationale, review findings, and learnings across sessions. Forge is that structure. Four skills, shared memory, human-in-the-loop.
The project is open source on GitHub 9.
I covered how Forge fits into the broader picture in the Context Engineering Ecosystem post.
Footnotes
1. dnomia-knowledge: basic-memory MCP configuration and Obsidian vault integration
2. Karpathy, LLM Wiki. Shared a few weeks after Forge was published: a persistent, LLM-maintained markdown wiki as a RAG alternative with a three-layer architecture (sources, wiki, schema).
3. decision-gate: Multi-AI perspective evaluation tribunal
4. mcp-code-search: Local semantic code search MCP server
5. chief-of-staff: Local AI assistant that prepares daily workflow
6. edit-guard: Claude Code file editing safety plugin
7. agency-agents: 70+ agent prompt collection with NEXUS orchestration system
8. Gemini proposed assigning an “angry engineer” persona to the /critique skill. The idea: if the agent acts like an angry engineer, it would produce harsher reviews. Rejected because roleplay yields inconsistent results. Structural constraints (mandatory finding output, pre-mortem analysis) were chosen instead.
9. Forge GitHub repository
- 01 Cross-skill memory sharing ensures every step has access to prior decisions
- 02 Adversarial review is mandatory: critique must find at least 2 risks, 1 performance issue, 1 edge case
- 03 Retro prunes the knowledge base into lasting patterns rather than letting it grow indefinitely
- 04 Human-in-the-loop: every step requires human approval, skills never trigger each other
+ What is Forge?
A memory-backed decision-to-delivery pipeline designed for solo developers. Four steps running as Claude Code skills: evaluation (court), development (implement), adversarial review (critique), and knowledge consolidation (retro). Spec preparation happens pre-pipeline via plan mode and interactive mode. Every step reads from and writes to a shared Obsidian vault.
+ How does Forge differ from a normal code review?
A normal code review works in isolation, unaware of prior decisions. Forge's critique skill is pipeline-aware: it reads the court decision and implement notes from memory. It also enforces adversarial output: saying 'looks good' is structurally forbidden, and a minimum of 4 findings (2 risks, 1 performance issue, 1 edge case) is mandatory.
+ Is Forge an autonomous agent system?
No. Human-in-the-loop: the human triggers every step manually. Skills don't call each other, there's no dynamic routing. Predictability comes before cleverness. State lives in shared memory (Obsidian vault), not in a database.
+ Which projects does it integrate with?
decision-gate (court skill), dnomia-knowledge (memory layer), mcp-code-search (code analysis for critique), chief-of-staff (upstream task triage), and edit-guard (safe file editing). Only decision-gate is needed, and only for court; the rest are optional, and Forge works standalone too.
+ What does retro do?
Extracts lasting patterns from completed features, checks for conflicts with existing rules, and flags notes older than 30 days for cleanup. The knowledge base doesn't grow, it gets pruned. Upsert logic, not append.