Forge: The Pipeline That Stops You From Coding Before Deciding

Q: What is Forge?

A memory-backed decision-to-delivery pipeline designed for solo developers. Four steps running as Claude Code skills: evaluation (court), development (implement), adversarial review (critique), and knowledge consolidation (retro). Spec preparation happens pre-pipeline via plan mode and interactive mode. Every step reads from and writes to a shared Obsidian vault.

Q: How does Forge differ from a normal code review?

A normal code review works in isolation, unaware of prior decisions. Forge's critique skill is pipeline-aware: it reads the court decision and implement notes from memory. It also enforces adversarial output: saying 'looks good' is structurally forbidden, and a minimum of 4 findings (2 risks, 1 performance issue, 1 edge case) is mandatory.

Q: Is Forge an autonomous agent system?

No. Forge is human-in-the-loop: the human triggers every step manually. Skills don't call each other, and there's no dynamic routing. Predictability comes before cleverness. State lives in shared memory (Obsidian vault), not in a database.

Q: Which projects does it integrate with?

decision-gate (court skill), dnomia-knowledge (memory layer), mcp-code-search (code analysis for critique), chief-of-staff (upstream task triage), and edit-guard (safe file editing). None are required; Forge works standalone too.

Q: What does retro do?

Extracts lasting patterns from completed features, checks for conflicts with existing rules, and flags notes older than 30 days for cleanup. The knowledge base doesn't grow, it gets pruned. Upsert logic, not append.

TL;DR

Claude Code skills work in isolation, decisions are session-scoped, and learnings from past mistakes vanish. Forge solves this with a memory-backed human-in-the-loop pipeline: spec preparation (pre-pipeline), /court (evaluate), /implement (develop), /critique (challenge), /retro (consolidate). Every skill reads from and writes to a shared Obsidian vault, so no step starts from scratch. Open source.

The Problem: Skills Work, Memory Doesn’t

When working with Claude Code, each skill does its job well. /court evaluates decisions, /review examines code, /plan creates structure. But there’s no connection between them.

I make a decision, come back to the same topic two weeks later, and can’t remember why I made it. An issue found during critique resurfaces three features later in the same pattern. At the end of a sprint, I repeat the same mistakes because I never noted what worked and what didn’t.

The problem isn’t the skills. The problem is that skills are session-scoped. A session ends, context vanishes, and the next session starts from zero. Worse, a session can run its whole course on the wrong context without anything flagging it.

Result: A Feature’s Journey Through the Pipeline

Before diving into technical details, I want to show how a feature progresses through Forge. Say I need to add a new payment integration.

The first step happens outside the pipeline: I create a spec using plan mode and interactive mode. What needs to be done, why, constraints, which files will be affected. This spec becomes the pipeline’s input.

0. Spec preparation (pre-pipeline)
   -> Spec created via plan mode + interactive mode
   -> What, why, constraints, affected files defined

1. /court paddle-recurring-billing
   -> Sent the spec to Gemini, Kimi, and Claude
   -> Evaluation from three different perspectives
   -> Verdict: GO (7.2/10)
   -> forge/decisions/2026-03-05_paddle-recurring-billing.md written

2. /implement paddle-recurring-billing
   -> Verified court GO and spec from memory
   -> Wrote execution plan, waited for my approval
   -> Developed in worktree, ran tests
   -> forge/active/paddle-recurring-billing_implement.md written

3. /critique
   -> Read git diff (not the implementer's self-assessment)
   -> Found 3 MUST_FIX, 2 NICE_TO_HAVE
   -> Verdict: REJECTED
   -> forge/active/paddle-recurring-billing_critique.md written

4. /implement (again, with critique feedback)
   -> Fixed MUST_FIX items
   -> New critique: PASS

5. /retro paddle-recurring-billing
   -> Wrote "Paddle webhook signature verification must always be the first step"
      pattern to core-rules/workflow.md
   -> Checked for conflicts with existing rules
   -> Moved active notes to archive

Every step read the previous steps’ output from memory. No skill started from scratch. The pattern that emerged in retro will be checked by critique in every future Paddle implementation.

Why Something New

There are existing ways to solve this problem. But each has a limit in my development workflow.

Enterprise SDLC Tools

Jira, Linear, Notion workflows. Designed for teams, high overhead for a solo developer. Opening tickets, managing boards, and updating statuses create friction instead of value for one person. I already have a knowledge management system built on Obsidian, so adding a separate project management tool on top means synchronization overhead between two sources of truth. That said, these tools can be integrated via MCP if needed.

Multi-Agent Systems

Autonomous systems where multiple agents trigger each other. Powerful in theory, but two issues in a solo context: loss of control and debugging difficulty. When one agent produces faulty output, other agents use it as input, creating an error chain. A single person can struggle to trace and fix that chain.

Manual Note-Taking

Writing down every decision, every review by hand. It works but requires discipline and isn’t structural. Connecting notes, detecting contradictions, and cleaning out outdated information are all manual processes. Over time the note pile grows but usability drops. Keeping notes current by hand while working remotely also demands more and more effort past a certain point.

Approach	Strength	Weakness (solo context)
Enterprise SDLC	Structural, traceable	Overhead for one person
Multi-agent	Autonomous, parallel	Loss of control, hard to debug
Manual notes	Simple, zero dependencies	Not structural, hard to maintain
Forge	Structural, memory-backed, HITL	Claude Code dependency

The Choice: Four Skills, Shared Memory, Human-in-the-Loop

Forge’s design principles:

Human-in-the-loop. I trigger every step. Skills don’t call each other, no dynamic routing. Predictability comes before cleverness.
Memory replaces handoffs. Skills don’t pass context to each other. They write to and read from a shared Obsidian vault. Any skill can access any prior decision.
Adversarial review is mandatory. Critique can’t say “everything looks good.” A minimum finding count is structurally enforced.
Retro prunes memory. The knowledge base doesn’t grow forever. Lasting patterns are extracted, the rest is archived.
Sequential, not parallel. Solo context. One skill at a time, full attention, no split-context cost.

Pipeline

Spec preparation happens outside the pipeline. Plan mode and interactive mode define what needs to be done, why, and the constraints. The resulting spec becomes court’s input.

spec (pre-pipeline, plan + interactive mode)
  -> court (evaluate)
    -> implement (develop)
      -> critique (challenge)
        -> retro (consolidate)
          -> court (next decision...)

If critique returns REJECTED, it loops back to implement. This cycle can repeat until PASS is achieved.

SPEC -> DECIDED -> IMPLEMENTING -> CRITIQUING -> PASS/REJECTED -> RETRO -> ARCHIVED
                       ^                |
                       |   REJECTED     |
                       +----------------+
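
The transitions in the diagram above can be written down as a plain lookup table. This is only an illustration under my own assumptions: Forge keeps state in note frontmatter and folders, not in code, and the names `ALLOWED`, `can_move`, and `walk` are hypothetical. The --hotfix path and the DEFER/KILL verdicts are left out because the diagram doesn't show them.

```python
# Illustrative sketch only: Forge stores state in markdown notes, not in code.
# ALLOWED, can_move, and walk are hypothetical names.
ALLOWED = {
    "SPEC": {"DECIDED"},
    "DECIDED": {"IMPLEMENTING"},
    "IMPLEMENTING": {"CRITIQUING"},
    "CRITIQUING": {"PASS", "REJECTED"},
    "REJECTED": {"IMPLEMENTING"},  # loop back with critique feedback
    "PASS": {"RETRO"},
    "RETRO": {"ARCHIVED"},
    "ARCHIVED": set(),
}


def can_move(current: str, target: str) -> bool:
    """Return True if the diagram has an arrow from current to target."""
    return target in ALLOWED.get(current, set())


def walk(path: list[str]) -> bool:
    """Return True if every consecutive pair in path is an allowed transition."""
    return all(can_move(a, b) for a, b in zip(path, path[1:]))
```

A feature that is rejected once and then passes would follow SPEC, DECIDED, IMPLEMENTING, CRITIQUING, REJECTED, IMPLEMENTING, CRITIQUING, PASS, RETRO, ARCHIVED, and `walk` accepts exactly that kind of path.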

Court: Evaluation

The pipeline’s entry point. The prepared spec is evaluated here. Scoring across eight criteria: Benefit, Necessity, Burden, Conflict, Performance, Security, Bottleneck, Currency.

Multi-perspective input is provided through Gemini and Kimi MCPs. Each AI evaluates the same spec from a different angle. Result: GO, DEFER, or KILL.

The decision, along with its rationale and dissent, is written to the forge/decisions/ folder. Any skill can read this decision later.

Implement: Development

Development starts after court gives GO. But first, a gate:

Read court decision from memory and verify GO
Read spec from memory
Write execution plan and wait for user approval
After approval, develop in worktree
Run tests
Write memory note (changes, files, decisions, test results)
Present commit message draft

The --hotfix flag can bypass court. But this decision is logged and audited during retro: “Should this hotfix have gone through court?”
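
The gate at the top of this list can be sketched as a small check against the decision note. This is a minimal sketch, not the skill's actual logic: I'm assuming the court note stores its result in a flat frontmatter field named `verdict`, and `read_frontmatter` and `implement_gate` are hypothetical helpers.

```python
def read_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` lines between the leading '---' markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def implement_gate(decision_note: str, hotfix: bool = False) -> tuple[bool, str]:
    """Decide whether implement may start, and say why."""
    if hotfix:
        # The bypass is allowed but recorded so retro can audit it later.
        return True, "court bypassed via --hotfix; flag for retro audit"
    # Assumption: the court result lives in a field called `verdict`.
    verdict = read_frontmatter(decision_note).get("verdict", "")
    if verdict == "GO":
        return True, "court verdict is GO"
    return False, f"blocked: court verdict is {verdict or 'missing'}"
```

Returning a reason alongside the boolean keeps the gate explainable: the same string can go straight into the implement note.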

Critique: Adversarial Review

Runs after implement finishes. Examines the code like a senior auditor. Structural rules:

“Everything looks good” is forbidden
Minimum 2 risks, 1 performance issue, 1 edge case required
Every finding with file:line reference
Form your own opinion from git diff before reading the implementer’s self-assessment
Classify every finding as MUST_FIX or NICE_TO_HAVE
Don’t suggest architectural changes; architecture is court’s job

Stack-specific checks: N+1 queries, missing select_related for Django. Hook rules, dependency arrays for React. Missing indexes, JSONB queries for PostgreSQL.

Pre-mortem analysis: “This code went to production, something broke. What broke?” 10x load, malicious input, external service failure, context loss after 6 months.

Verdict: Any MUST_FIX means REJECTED. Otherwise PASS.
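
The structural rules and the verdict rule above are mechanical enough to check in code. Here is a minimal sketch, assuming findings are collected as simple records; the `Finding` class, the `kind` labels, and both function names are hypothetical and not part of the skill itself.

```python
import re
from dataclasses import dataclass

# Minimum counts taken from the structural rules above.
MINIMUMS = {"risk": 2, "performance": 1, "edge_case": 1}
SEVERITIES = {"MUST_FIX", "NICE_TO_HAVE"}
FILE_LINE = re.compile(r"^[^\s:]+:\d+$")  # e.g. "path/to/file.py:42"


@dataclass
class Finding:
    kind: str      # "risk", "performance", or "edge_case" (hypothetical labels)
    severity: str  # "MUST_FIX" or "NICE_TO_HAVE"
    location: str  # file:line reference
    note: str


def structural_errors(findings: list[Finding]) -> list[str]:
    """List every structural rule the critique output breaks."""
    errors = []
    for kind, minimum in MINIMUMS.items():
        count = sum(1 for f in findings if f.kind == kind)
        if count < minimum:
            errors.append(f"need at least {minimum} {kind} finding(s), got {count}")
    for f in findings:
        if not FILE_LINE.match(f.location):
            errors.append(f"finding lacks a file:line reference: {f.location!r}")
        if f.severity not in SEVERITIES:
            errors.append(f"unknown severity: {f.severity!r}")
    return errors


def verdict(findings: list[Finding]) -> str:
    """Any MUST_FIX means REJECTED; otherwise PASS."""
    return "REJECTED" if any(f.severity == "MUST_FIX" for f in findings) else "PASS"
```

An empty findings list fails `structural_errors` on all three minimums, which is the point: a review with nothing to say is treated as an invalid review, not as a clean one.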

Retro: Consolidation

Runs after a feature is completed. Does three things:

Pattern extraction: Extracts lasting rules from the implement and critique cycle. Things like “Paddle webhook signature verification must always be the first step.” These rules are written to stack-based files: core-rules/react.md, core-rules/django.md, core-rules/postgres.md, core-rules/workflow.md.

Conflict detection: Retro detects when a new pattern conflicts with an existing rule. It doesn’t resolve the conflict automatically; it presents it to me with three options: keep the existing rule, replace it with the new one, or scope both by context.

Cleanup: Flags active notes older than 30 days for archive or deletion. The knowledge base doesn’t grow, it gets pruned. Upsert, not append.
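
The cleanup step can be sketched as a date comparison. This is a minimal sketch under one assumption: that a note's age is read from a `YYYY-MM-DD` prefix in its file name. The skill may use a different signal (frontmatter or file timestamps), and `stale_notes` is a hypothetical helper.

```python
from datetime import date, timedelta


def stale_notes(filenames: list[str], today: date, max_age_days: int = 30) -> list[str]:
    """Return the notes whose date prefix is more than max_age_days old."""
    cutoff = today - timedelta(days=max_age_days)
    flagged = []
    for name in filenames:
        try:
            written = date.fromisoformat(name[:10])
        except ValueError:
            continue  # no usable date prefix; leave the note for manual review
        if written < cutoff:
            flagged.append(name)
    return flagged
```

The function only flags; it never moves or deletes anything, which matches the human-in-the-loop rule that archive-or-delete stays my decision.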

Memory Structure

All skills write to and read from a shared Obsidian vault. Access is provided through basic-memory MCP ¹.

forge/
  decisions/        <- /court outputs (ADRs)
  active/           <- /implement and /critique notes (work in progress)
  core-rules/       <- lasting patterns from /retro
    react.md
    django.md
    postgres.md
    workflow.md
  archive/          <- completed work, moved by /retro

File naming: YYYY-MM-DD_[feature-name]_[stage].md

Every note carries status in its frontmatter: status: DECIDED, status: IMPLEMENTING, status: CRITIQUING, verdict: PASS. Any skill can query any note’s status.

This approach converges with Karpathy’s later LLM Wiki proposal²: raw sources + LLM-maintained markdown wiki + workflow schema. Forge’s skill-shared vault with status frontmatter is the skill-driven instance of the same idea.

Ecosystem

Forge works standalone but reaches its full potential with other tools.

Tool	Role in Forge	Required?
decision-gate ³	Court skill, pipeline entry	Yes (for court)
dnomia-knowledge ¹	Shared memory layer	No (auto-memory fallback exists)
mcp-code-search ⁴	Codebase analysis for critique	No (grep/glob is sufficient)
chief-of-staff ⁵	Upstream task triage	No (manual selection is enough)
edit-guard ⁶	Safe file editing during implement	No (3+ edit rule is enough)

Which combination makes sense by context:

Context	Recommended stack
Quick feature (< 1 day)	forge + decision-gate
New project setup	forge + decision-gate + dnomia-knowledge + mcp-code-search
Daily operations	chief-of-staff + forge + dnomia-knowledge
Large refactoring	forge + mcp-code-search + edit-guard
Minimal (pipeline only)	forge standalone (auto-memory fallback)

What We Consciously Rejected

During Forge’s design, some proposals from Gemini and Kimi were rejected. Each was a good idea but didn’t fit my constraints.

Proposal	Source	Why rejected
Event-driven architecture	Kimi	Over-engineering for a sequential pipeline
State machine class	Kimi	Markdown + folder structure is sufficient
70 agent personas	agency-agents ⁷	4 skills for solo dev, not 70 agents
Dynamic routing	Gemini + agency-agents	Human-in-the-loop, not AI
Rule DAG (YAML)	Kimi	Markdown + conflict check is simpler
Separate `/metrics` skill	Kimi	Premature optimization, let the need emerge
“Angry engineer” persona ⁸	Gemini	Structural constraints beat roleplay

What Forge Doesn’t Do

No autonomous agent chaining. Skills don’t call each other.
No real-time agent negotiation. Sequential pipeline, not a debate simulator.
No dynamic routing. Pipeline order is fixed.
No complex state management. State lives in memory notes, not a database.
Not enterprise SDLC. 4 skills, not 70 agents. Designed for one person to ship product.

Setup

git clone https://github.com/ceaksan/forge.git
cd forge
./install.sh

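install.sh symlinks skills to ~/.claude/skills/forge/. Because the skills are symlinks into the cloned repo rather than copies, updates are automatic: pulling the repo is enough.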

# Usage
/court [topic]                    # Evaluation
/implement [feature-name]        # Development
/implement --hotfix [description] # Quick fix, court bypassed
/critique                        # Adversarial review
/retro [feature-name]            # Consolidation

Closing

Forge isn’t the right solution for everyone. For team-based projects, Jira or Linear workflows are more appropriate. For simple bug fixes, pipeline overhead is unnecessary.

In my case, working solo, I needed a structure to carry decision rationale, review findings, and learnings across sessions. Forge is that structure. Four skills, shared memory, human-in-the-loop.

The project is open source on GitHub ⁹.

I covered how Forge fits into the broader picture in the Context Engineering Ecosystem post.

Footnotes

¹ dnomia-knowledge: basic-memory MCP configuration and Obsidian vault integration
² Karpathy, LLM Wiki. Shared a few weeks after Forge was published: a persistent, LLM-maintained markdown wiki as a RAG alternative with a three-layer architecture (sources, wiki, schema).
³ decision-gate: Multi-AI perspective evaluation tribunal
⁴ mcp-code-search: Local semantic code search MCP server
⁵ chief-of-staff: Local AI assistant that prepares daily workflow
⁶ edit-guard: Claude Code file editing safety plugin
⁷ agency-agents: 70+ agent prompt collection with NEXUS orchestration system
⁸ Gemini proposed assigning an “angry engineer” persona to the /critique skill. The idea: if the agent acts like an angry engineer, it would produce harsher reviews. Rejected because roleplay yields inconsistent results. Structural constraints (mandatory finding output, pre-mortem analysis) were chosen instead.
⁹ Forge GitHub repository

Version Info

v0.3.0

Changelog

v0.3.0 (2026-03-19): dnomia-knowledge integration, trace-informed review, prompt-optimize bridging

v0.2.0 (2026-02-15): Hybrid swarm critique design, review pipeline improvements

v0.1.0 (2026-01-15): Initial release: prompt optimization pipeline

Key Takeaways

01 Cross-skill memory sharing ensures every step has access to prior decisions
02 Adversarial review is mandatory: critique must find at least 2 risks, 1 performance issue, 1 edge case
03 Retro prunes the knowledge base into lasting patterns rather than letting it grow indefinitely
04 Human-in-the-loop: the human triggers every step manually, and skills never trigger each other
