ceaksan

Decision Gate v2: Multi-AI Spec Tribunal

Is a single AI's evaluation sufficient? An open-source Claude Code skill that runs the Decision Gate framework with multiple independent AIs in adversarial mode: /court. Add Gemini and Kimi as jurors, prevent rubber-stamping.

Jan 29, 2026 · 1 min read
TL;DR

Asking a single AI 'is this good?' is like grading your own exam: having one AI both propose and evaluate creates confirmation bias. The /court skill has multiple independent AIs (Claude, plus optionally Gemini and Kimi) evaluate Decision Gate's 8 criteria in adversarial mode. If any juror FAILs on Security or Conflict, the verdict is an automatic KILL. In a real codebase audit, 6 out of 28 tasks were eliminated or deferred through this system.

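| Quick Reference   | Value                                |
| ----------------- | ------------------------------------ |
| Setup             | Single file, zero dependencies       |
| Number of jurors  | 1-3 (Claude + optional Gemini, Kimi) |
| Knockout criteria | Security, Conflict                   |
| Verdict outputs   | GO / DEFER / KILL                    |
| Source            | ceaksan/decision-gate                |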

In the Decision Gate article, I defined an 8-criteria evaluation framework. The framework worked, but there was a problem: the evaluator and the proposer were the same AI. This is a structural confirmation bias.

With this article, I turned Decision Gate into an open-source tool. Then, to make it more comprehensive and consistent, I shaped it into /court: a Claude Code skill that runs multiple, preferably independent, AIs in adversarial mode to evaluate specs.

The Problem: Single AI, Single Perspective

When you have an AI write a spec and then ask the same AI “is this spec good?”, you will most likely get a “yes, it’s good” response. This is the rubber-stamp problem.

Rubber stamp, in its figurative sense, describes a person or institution that approves any decision automatically, without deep examination or questioning. It is a term used for “token” institutions or individuals who legitimize the orders of others (superiors, other actors, etc.).

I ran into this during a real codebase audit. When the same errors and problematic approaches keep recurring (typically by the third iteration), I stop and analyze the approach itself. At that stage I stay involved myself, but I also bring in AIs with different training backgrounds in parallel. Had a single AI evaluated every decision, all 28 would most likely have come back “GO.” Instead, I had 3 independent AIs (Claude, Gemini, and Kimi, the models I found consistent and current at the time) apply the Decision Gate criteria separately, with this result:

| Result | Count | Meaning                  |
| ------ | ----- | ------------------------ |
| GO     | 22    | Implement                |
| DEFER  | 4     | Not now, move to backlog |
| KILL   | 2     | Don’t do it              |

6 tasks were eliminated or deferred. That avoided work which would otherwise have consumed a significant portion of the day for no benefit.

Why Does Rubber-Stamping Happen?

Typical traps LLMs fall into when evaluating:

  1. Confirmation bias: Tendency to find their own suggestions “good” within the context of their training data
  2. Sunk cost fallacy: “So much has already been written, canceling would be wasteful” thinking
  3. Optimism bias: “Nothing will go wrong” assumption, ignoring edge cases
  4. Consistency pressure: Tendency to stay loyal to previous decisions, despite errors
  5. Task-driven evaluation: “Let me quickly finish the task, reach a result, and move on to the next step” thinking

The solution: separate evaluation from proposal and get multiple independent perspectives. This approach aligns with the decision management principles I discussed in the ADR and OpenSpec article: documenting decisions and opening them to independent evaluation.

/court: Multi-AI Tribunal

/court is a Claude Code skill. It is triggered by typing /court and performs a Decision Gate evaluation on a spec file or topic.

Architecture

/court docs/specs/my-feature.md
        |
        v
  1. Spec Quality Gate
     (Is there sufficient information?)
        |
        v
  2. Claude evaluation
     (8 criteria, orchestrator)
        |
        v
  3. Gemini evaluation (MCP)
     (Adversarial mode, security/performance focus)
        |
        v
  4. Kimi evaluation (MCP)
     (Adversarial mode, architecture/necessity focus)
        |
        v
  5. Synthesis + veto check
        |
        v
  6. Verdict: GO / DEFER / KILL

Juror Roles

Each AI carries a different perspective:

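| AI     | Role                              | Focus                                             |
| ------ | --------------------------------- | ------------------------------------------------- |
| Claude | Orchestrator + Judge              | All 8 criteria, final verdict                     |
| Gemini | Security & Performance Specialist | Deep technical review, adversarial mode           |
| Kimi   | Architecture & Necessity Analyst  | Wide context analysis (128K), structural concerns |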

Adversarial Protocol

External jurors operate with this system prompt:

You are a hostile senior engineer reviewing a proposal. Your job is to find problems, not to approve. You must identify at least one concrete concern or explicitly state “no concerns found after adversarial review”.

This structurally prevents rubber-stamping. The juror’s job is not to approve but to aggressively question.
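A minimal sketch of how that prompt could be paired with a spec, and how a juror reply could be checked against the rule in its last sentence. Only the prompt text comes from the article; `build_juror_request`, `follows_protocol`, and the request shape are hypothetical names chosen for illustration, not the skill's actual implementation.

```python
# System prompt quoted from the article (straight quotes used for the inner phrase).
ADVERSARIAL_PROMPT = (
    "You are a hostile senior engineer reviewing a proposal. "
    "Your job is to find problems, not to approve. "
    "You must identify at least one concrete concern or explicitly state "
    '"no concerns found after adversarial review".'
)

NO_CONCERNS = "no concerns found after adversarial review"


def build_juror_request(spec_text: str) -> dict:
    """Pair the adversarial system prompt with the spec (hypothetical request shape)."""
    return {"system": ADVERSARIAL_PROMPT, "user": spec_text}


def follows_protocol(reply: str, concerns: list[str]) -> bool:
    """Accept a reply only if it lists a concern or contains the explicit statement."""
    return bool(concerns) or NO_CONCERNS in reply.lower()


print(follows_protocol("Looks fine to me.", []))
```

Under these assumptions a bare "looks fine" reply is rejected, which is the point of the protocol: approval has to be stated explicitly after an adversarial pass.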

Veto System

Security and Conflict criteria are defined as must-meet (knockout). If any juror FAILs on either of these:

  • Regardless of other scores
  • Regardless of how high the average is
  • The verdict is an automatic KILL

This is a non-negotiable rule. Security and compliance are not subjects for compromise.
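The knockout rule can be sketched in a few lines. This is an illustration under the assumption that each juror reports PASS or FAIL per must-meet criterion; the data layout and the name `veto_triggered` are hypothetical.

```python
MUST_MEET = ("Security", "Conflict")  # knockout criteria named in the article


def veto_triggered(juror_results: dict[str, dict[str, str]]) -> bool:
    """Return True if any juror reports FAIL on any must-meet criterion."""
    return any(
        results.get(criterion) == "FAIL"
        for results in juror_results.values()
        for criterion in MUST_MEET
    )


# Hypothetical juror output: one FAIL is enough, scores are never consulted.
results = {
    "Claude": {"Security": "PASS", "Conflict": "PASS"},
    "Gemini": {"Security": "FAIL", "Conflict": "PASS"},
}
verdict = "KILL" if veto_triggered(results) else "continue to scoring"
print(verdict)  # KILL
```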

Graceful Degradation

MCP tools may not always be available. /court handles this with graceful degradation:

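| Situation      | Operating Mode                               |
| -------------- | -------------------------------------------- |
| All available  | Full 3-juror tribunal                        |
| Gemini only    | 2-juror tribunal                             |
| Kimi only      | 2-juror tribunal                             |
| None available | Solo evaluation (explicitly noted in output) |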
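The degradation table maps onto a small selection function. A sketch, assuming availability of each MCP juror is already known as a boolean; `operating_mode` is a hypothetical name and the returned strings are illustrative.

```python
def operating_mode(gemini_available: bool, kimi_available: bool) -> str:
    """Pick the tribunal mode from which optional jurors are reachable."""
    jurors = ["Claude"]  # Claude is always present as orchestrator
    if gemini_available:
        jurors.append("Gemini")
    if kimi_available:
        jurors.append("Kimi")
    if len(jurors) == 1:
        # The article says solo mode is explicitly noted in the output.
        return "Solo evaluation"
    return f"{len(jurors)}-juror tribunal ({', '.join(jurors)})"


print(operating_mode(True, False))  # 2-juror tribunal (Claude, Gemini)
```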

Setup

1. Install the Skill

mkdir -p ~/.claude/skills/court && curl -sL \
  https://raw.githubusercontent.com/ceaksan/decision-gate/main/SKILL.md \
  -o ~/.claude/skills/court/SKILL.md

2. (Optional) Install Jurors

Installing Gemini and Kimi MCP servers is optional. If not installed, Claude evaluates solo.

Gemini MCP:

npm install -g @anthropic-ai/gemini-mcp
claude mcp add gemini -s user -- gemini-mcp

Kimi MCP:

You can write your own MCP server or use an existing Kimi MCP implementation.

3. (Optional) Project Configuration

Create .court.yml at your project root:

criteria:
  must_meet: [Security, Conflict]
  scored: [Benefit, Necessity, Burden, Performance, Bottleneck, Currency]
  threshold:
    go: 6
    defer: 4
jurors:
  gemini: true
  kimi: true
context_sources:
  - docs/architecture.md
  - CLAUDE.md
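One plausible reading of the threshold block, sketched in Python: a score at or above `go` is GO, at or above `defer` is DEFER, anything lower is KILL. Whether the boundaries are inclusive is an assumption, as is the function name `verdict_for`; the sketch also assumes the must-meet criteria have already passed.

```python
# Mirrors the threshold values of the .court.yml example as a plain dict.
THRESHOLDS = {"go": 6, "defer": 4}


def verdict_for(score: float, thresholds: dict[str, float] = THRESHOLDS) -> str:
    """Map an overall score to a verdict (inclusive lower bounds are assumed)."""
    if score >= thresholds["go"]:
        return "GO"
    if score >= thresholds["defer"]:
        return "DEFER"
    return "KILL"


print(verdict_for(5.3))  # DEFER
```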

Usage

Spec File Evaluation

/court docs/specs/my-feature.md

Output is appended to the end of the file as ## Court Verdict.

Topic Evaluation

/court "Should we add Redis cache for session storage?"

Output is printed to the terminal, not written to a file.

Sample Output

## Court Verdict

**Date:** 2026-03-07
**Jurors:** Claude (orchestrator), Gemini (security/performance)
**Verdict:** DEFER
**Score:** 5.3/10

| Criterion   | Claude | Gemini | Avg  |
| ----------- | ------ | ------ | ---- |
| Security    | PASS   | PASS   | PASS |
| Conflict    | PASS   | PASS   | PASS |
| Benefit     | 7      | 6      | 6.5  |
| Necessity   | 4      | 3      | 3.5  |
| Burden      | 5      | 4      | 4.5  |
| Performance | 7      | 7      | 7.0  |
| Bottleneck  | 6      | 5      | 5.5  |
| Currency    | 6      | 5      | 5.5  |

### Dissenting Opinions

- **Gemini:** Necessity score 3. Current in-memory cache handles the load. Redis adds operational complexity with no proven need.

### Rationale

Redis caching improves performance but the current load does not require it. Re-evaluate when p99 latency exceeds SLA thresholds.
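The Avg column of the sample can be reproduced with a per-criterion mean over the jurors. A sketch using the scores from the table above; how the skill derives the overall Score is not stated in the article, so the unweighted mean at the end is only an assumption for comparison.

```python
from statistics import mean

# Scores copied from the sample verdict table: (Claude, Gemini).
SCORES = {
    "Benefit": (7, 6),
    "Necessity": (4, 3),
    "Burden": (5, 4),
    "Performance": (7, 7),
    "Bottleneck": (6, 5),
    "Currency": (6, 5),
}

# Per-criterion average across jurors, matching the Avg column.
averages = {name: mean(pair) for name, pair in SCORES.items()}
for name, avg in averages.items():
    print(f"{name:<12}{avg:.1f}")

# Unweighted mean of the six averages (an assumed aggregation, not the skill's).
overall = mean(list(averages.values()))
print(f"Unweighted mean: {overall:.2f}")
```

The unweighted mean comes out near 5.4 rather than the 5.3 shown, so the skill presumably weights or rounds differently. Either value sits between the `defer: 4` and `go: 6` thresholds of the example configuration, which is consistent with the DEFER verdict.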

Why This Approach?

Alternatives and Why They Were Rejected

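| Approach          | Advantage                      | Disadvantage                                 | Decision |
| ----------------- | ------------------------------ | -------------------------------------------- | -------- |
| CLI tool          | Separation of concerns         | New dependency, build step, maintenance cost | Rejected |
| MCP server        | Standard protocol              | Over-engineering, unnecessary complexity     | Rejected |
| Claude Code skill | Zero dependencies, single file | Tied to Claude                               | Selected |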

The skill approach offers the lowest overhead for a solo developer: a single markdown file, no build step, no dependencies. Copy SKILL.md to ~/.claude/skills/court/, type /court, and it works.

MCP: The Inter-AI Communication Layer

Model Context Protocol (MCP) is a standard protocol that enables Claude Code to communicate with external tools. In this context, MCP allows Claude to “invoke” Gemini and Kimi.

Claude Code
    |
    |-- mcp__gemini__gemini-query --> Gemini
    |
    |-- mcp__kimi__kimi_query --> Kimi

Each juror evaluates independently. Claude synthesizes the results and presents a verdict recommendation.

Design Decisions

Why 3 AIs?

A single AI creates confirmation bias. Even with two AIs, the concept of “majority” is weak. With three AIs:

  • If any two agree, that’s a strong signal
  • If all three disagree, the situation requires careful examination
  • If one AI disagrees, it is recorded as a “dissenting opinion”
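The three cases above can be sketched as a small classifier over juror verdicts. The function name `agreement_signal` and the returned labels are hypothetical; the sketch assumes exactly three jurors, each contributing one verdict.

```python
from collections import Counter


def agreement_signal(verdicts: dict[str, str]) -> tuple[str, list[str]]:
    """Classify three juror verdicts; returns (signal, dissenting jurors)."""
    if len(verdicts) != 3:
        raise ValueError("this sketch assumes exactly three jurors")
    top_verdict, top_count = Counter(verdicts.values()).most_common(1)[0]
    if top_count == 1:
        # All three disagree: no majority, needs careful examination.
        return "no agreement: examine carefully", []
    # Two or three agree; anyone outside the majority is a dissenting opinion.
    dissenters = [juror for juror, v in verdicts.items() if v != top_verdict]
    return f"majority {top_verdict}", dissenters


print(agreement_signal({"Claude": "GO", "Gemini": "GO", "Kimi": "DEFER"}))
```

With two jurors the same logic degenerates: a split leaves no majority at all, which is why "majority" is a weak concept below three.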

Why Adversarial?

There is an important difference between saying “evaluate this proposal” and “find the problems in this proposal.” The former creates a positive response bias, the latter triggers critical thinking.

The adversarial protocol explicitly instructs jurors to “find problems.” This structurally reduces the rubber-stamp rate.

Why Veto?

Security and compliance should not be subject to scored evaluations. If an endpoint has an auth bypass, the verdict should be KILL even if the benefit is 10/10. The veto power of must-meet criteria reflects this reality.

Why Not Red Team/Blue Team?

While preparing this article, I ran /court itself through the jury process. Kimi proposed an alternative architecture: Blue Team (Claude) proposes solutions, Red Team (Gemini) only attacks, and a deterministic rules engine serves as Judge. This approach would eliminate synthesis bias entirely.

I evaluated the proposal against Decision Gate criteria:

| Criterion | Assessment                                                                         | Result |
| --------- | ---------------------------------------------------------------------------------- | ------ |
| Benefit   | Eliminates synthesis bias                                                          | 8/10   |
| Necessity | Current veto system is already deterministic (Security/Conflict binary PASS/FAIL) | 4/10   |
| Burden    | Deterministic rules engine = separate tool, new dependency, maintenance overhead  | 3/10   |
| Conflict  | Directly contradicts the single-file, zero-dependency design decision             | FAIL   |

Verdict: KILL. A valid architectural critique, but transforming a single-file skill into a platform architecture creates a bigger problem than it solves for a solo developer.

This evaluation is itself evidence: the adversarial process that /court proposes maintains its consistency by running even its own alternatives through the same framework.

Limitations

  • Claude dependency: The skill only works in Claude Code. Adaptation is required for use in other AI IDEs.
  • MCP setup: Installing Gemini and Kimi MCP servers requires additional steps. However, they are optional.
  • Synthesis layer risk: Claude brings the proposal, relays it to jurors, and synthesizes the results, but the final decision belongs to the human. The risk lies in the synthesis stage: when summarizing juror outputs, Claude may construct a narrative that leans toward its own evaluation. For this reason, reviewing jurors’ raw outputs independently of the synthesis is important.
  • Relative independence: “Independent AIs” is a relative term. All models are trained on similar web corpora and carry similar alignment biases. Adversarial mode is a prompt-engineering level separation, not structural independence. Systematic blind spots (e.g., those originating from shared training data) may correlate across jurors.
  • Subjective scoring: The 1-10 scores given by AIs are not deterministic. Different runs on the same spec produce different scores. This is sufficient for capturing general trends, but should not be treated as exact numbers.
  • Context limits: Very large spec files may push the context limits of Gemini and Kimi in particular.
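Because scores vary between runs, one way to treat them as trends rather than exact numbers is to repeat the evaluation and check whether the spread crosses a verdict threshold. A sketch with hypothetical scores (not measured data); `summarize_runs` is an illustrative name and the default thresholds mirror the `go: 6` and `defer: 4` values from the example configuration.

```python
from statistics import mean


def summarize_runs(scores: list[float], go: float = 6, defer: float = 4) -> dict:
    """Summarize repeated scores and flag when the spread crosses a threshold."""
    low, high = min(scores), max(scores)
    # A threshold is crossed when some run falls below it and another reaches it.
    crosses = any(low < threshold <= high for threshold in (go, defer))
    return {"mean": mean(scores), "min": low, "max": high, "crosses_threshold": crosses}


# Hypothetical scores from three runs on the same spec.
summary = summarize_runs([5.3, 5.9, 6.2])
print(summary["crosses_threshold"])  # True
```

When the flag is set, the runs disagree on the verdict itself, which is a cue to read the jurors' raw reasoning instead of relying on the number.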

Conclusion

The first version of Decision Gate was a mental checklist: 8 criteria, 30-second evaluation. In this version, I turned the framework into a tool:

  • Adversarial evaluation by 3 independent AIs instead of one
  • Veto system with no compromise on security and compliance
  • Graceful degradation so it works even without MCP
  • Open source with single-file installation

GitHub: ceaksan/decision-gate

Setup:

mkdir -p ~/.claude/skills/court && curl -sL \
  https://raw.githubusercontent.com/ceaksan/decision-gate/main/SKILL.md \
  -o ~/.claude/skills/court/SKILL.md


Key Takeaways
  1. Single-AI evaluation carries rubber-stamp risk; multiple independent AIs running in adversarial mode produce higher-quality decisions.
  2. The /court skill installs by copying a single file into Claude Code and requires zero dependencies.
  3. Jurors are invoked via MCP (Model Context Protocol); Gemini and Kimi are optional, and Claude evaluates solo if they are unavailable.
  4. In a real codebase audit, 6 out of 28 tasks were eliminated or deferred through this system.
Frequently Asked Questions (FAQ)
What is /court?

/court is a skill (slash command) for Claude Code. It evaluates spec files or technical decisions against Decision Gate's 8 criteria, and it can optionally add Gemini and Kimi as independent jurors.

Why use multiple AIs?

Having a single AI both propose and evaluate creates confirmation bias. Different models catch different weaknesses: Gemini tends to be stronger on security and performance, Kimi on architecture and necessity analysis.

What is MCP and why is it needed?

Model Context Protocol is a standard protocol that enables Claude Code to communicate with external tools. Gemini and Kimi are invoked as jurors through MCP servers. If MCP is not installed, Claude evaluates solo.

How does the veto system work?

Security and Conflict criteria are defined as must-meet (knockout). If any juror FAILs on either of these two criteria, the verdict is an automatic KILL regardless of other scores.

How do I set it up in my own project?

Run mkdir -p ~/.claude/skills/court && curl -sL https://raw.githubusercontent.com/ceaksan/decision-gate/main/SKILL.md -o ~/.claude/skills/court/SKILL.md to install. Optionally configure with .court.yml at your project root.