Decision Gate v2: Multi-AI Spec Tribunal

TL;DR

Asking a single AI “is this good?” is like grading your own exam: when one AI both proposes and evaluates, confirmation bias follows. The /court skill has multiple independent AIs (Claude, plus optionally Gemini, Kimi, and others) evaluate Decision Gate’s 8 criteria in adversarial mode. If any juror FAILs on Security or Conflict, the verdict is an automatic KILL. In a real codebase audit, 6 out of 28 tasks were eliminated or deferred through this system.

Quick Reference	Value
Setup	Single file, zero dependencies
Number of jurors	1-3 (Claude + optional Gemini, Kimi)
Knockout criteria	Security, Conflict
Verdict outputs	GO / DEFER / KILL
Source	ceaksan/decision-gate

In the Decision Gate article, I defined an 8-criteria evaluation framework. The framework worked, but there was a problem: the evaluator and the proposer were the same AI. This is a structural confirmation bias.

With this article, I turned Decision Gate into an open-source tool. Then, to make it more comprehensive and consistent, I shaped it into /court. As a Claude Code skill, /court runs multiple, preferably independent, AIs in adversarial mode to evaluate specs.

The Problem: Single AI, Single Perspective

When you have an AI write a spec and then ask the same AI “is this spec good?”, you will most likely get a “yes, it’s good” response. This is the rubber-stamp problem.

Rubber stamp, in its figurative sense, describes a person or institution that approves any decision automatically, without deep examination or questioning. It is a term used for “token” institutions or individuals who legitimize the orders of others (superiors, other actors, etc.).

I experienced this in a real codebase audit. When the same errors and problematic approaches keep recurring (typically by the third iteration), I stop and analyze the approach itself. At that stage, I review the decisions myself and also bring in AIs with different training backgrounds in parallel. If I had asked a single AI to evaluate all 28 decisions, most likely all of them would have come back “GO.” Instead, I had 3 independent AIs (Claude, Gemini, and Kimi, the models I found consistent and current at the time) apply the Decision Gate criteria separately:

Result	Count	Meaning
GO	22	Implement
DEFER	4	Not now, move to backlog
KILL	2	Don’t do it

6 tasks were eliminated or deferred, which kept a significant portion of the day from being spent on work that did not need doing.

Why Does Rubber-Stamping Happen?

Typical traps LLMs fall into when evaluating:

Confirmation bias: Tendency to find their own suggestions “good” within the context of their training data
Sunk cost fallacy: “So much has already been written, canceling would be wasteful” thinking
Optimism bias: “Nothing will go wrong” assumption, ignoring edge cases
Consistency pressure: Tendency to stay loyal to previous decisions, despite errors
Task-driven evaluation: “Let me quickly finish the task, reach a result, and move on to the next step” thinking

The solution: separate evaluation from proposal and get multiple independent perspectives. This approach aligns with the decision management principles I discussed in the ADR and OpenSpec article: documenting decisions and opening them to independent evaluation.

/court: Multi-AI Tribunal

/court is a Claude Code skill. It is triggered by typing /court and performs a Decision Gate evaluation on a spec file or topic.

Architecture

/court docs/specs/my-feature.md
        |
        v
  1. Spec Quality Gate
     (Is there sufficient information?)
        |
        v
  2. Claude evaluation
     (8 criteria, orchestrator)
        |
        v
  3. Gemini evaluation (MCP)
     (Adversarial mode, security/performance focus)
        |
        v
  4. Kimi evaluation (MCP)
     (Adversarial mode, architecture/necessity focus)
        |
        v
  5. Synthesis + veto check
        |
        v
  6. Verdict: GO / DEFER / KILL

Juror Roles

Each AI carries a different perspective:

AI	Role	Focus
Claude	Orchestrator + Judge	All 8 criteria, final verdict
Gemini	Security & Performance Specialist	Deep technical review, adversarial mode
Kimi	Architecture & Necessity Analyst	Wide context analysis (128K), structural concerns

Adversarial Protocol

External jurors operate with this system prompt:

You are a hostile senior engineer reviewing a proposal. Your job is to find problems, not to approve. You must identify at least one concrete concern or explicitly state “no concerns found after adversarial review”.

This structurally prevents rubber-stamping. The juror’s job is not to approve but to aggressively question.
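
The protocol above can be sketched in Python as a small helper that pairs the adversarial system prompt with each juror's focus and checks that a reply either raises a concern or uses the explicit opt-out phrase. This is a minimal sketch, not the skill's actual implementation: `build_juror_request`, `follows_protocol`, and the idea that concerns arrive as an already-parsed list are assumptions made for illustration.

```python
# System prompt quoted from the article; everything else here is hypothetical.
ADVERSARIAL_PROMPT = (
    "You are a hostile senior engineer reviewing a proposal. "
    "Your job is to find problems, not to approve. "
    "You must identify at least one concrete concern or explicitly state "
    '"no concerns found after adversarial review".'
)

NO_CONCERNS = "no concerns found after adversarial review"


def build_juror_request(focus: str, spec_text: str) -> dict[str, str]:
    """Pair the adversarial system prompt with a focus-specific user message."""
    user = f"Focus: {focus}\n\nProposal under review:\n{spec_text}"
    return {"system": ADVERSARIAL_PROMPT, "user": user}


def follows_protocol(reply: str, concerns: list[str]) -> bool:
    """Accept a reply only if it lists a concern or contains the explicit opt-out."""
    return bool(concerns) or NO_CONCERNS in reply.lower()
```

A reply such as "Looks fine." with no concerns would be rejected under this check, which is the point of the protocol: silence is not allowed to count as approval.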

Veto System

Security and Conflict criteria are defined as must-meet (knockout). If any juror FAILs on either of these:

Regardless of other scores
Regardless of how high the average is
The verdict is an automatic KILL

This rule is non-negotiable: security failures and conflicts with existing decisions are not subject to compromise.
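
The veto rule is simple enough to express as a check that runs before any scoring. The following Python sketch assumes each juror returns a mapping from criterion name to "PASS" or "FAIL"; the `ballots` shape and the `veto_triggered` name are illustrative, not taken from the skill.

```python
MUST_MEET = ("Security", "Conflict")


def veto_triggered(ballots: dict[str, dict[str, str]]) -> list[tuple[str, str]]:
    """Return (juror, criterion) pairs where a must-meet criterion FAILed.

    Assumption: a criterion missing from a ballot is not treated as a FAIL.
    """
    return [
        (juror, criterion)
        for juror, results in ballots.items()
        for criterion in MUST_MEET
        if results.get(criterion) == "FAIL"
    ]


ballots = {
    "Claude": {"Security": "PASS", "Conflict": "PASS"},
    "Gemini": {"Security": "FAIL", "Conflict": "PASS"},
}
vetoes = veto_triggered(ballots)
verdict = "KILL" if vetoes else "continue to scoring"
print(verdict, vetoes)  # KILL [('Gemini', 'Security')]
```

A single FAIL from any juror is enough; the scored criteria are never consulted in that case.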

Graceful Degradation

MCP tools may not always be available. /court handles this with graceful degradation:

Situation	Operating Mode
All available	Full 3-juror tribunal
Gemini only	2-juror tribunal
Kimi only	2-juror tribunal
None available	Solo evaluation (explicitly noted in output)
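
The table above maps directly onto a small selection function. This Python sketch assumes availability is already known as two booleans (how /court actually detects MCP tools is not described here), and `tribunal_mode` is a hypothetical name.

```python
def tribunal_mode(gemini_available: bool, kimi_available: bool) -> tuple[list[str], str]:
    """Map MCP availability to the juror list and operating mode from the table."""
    jurors = ["Claude"]  # Claude is always present as orchestrator
    if gemini_available:
        jurors.append("Gemini")
    if kimi_available:
        jurors.append("Kimi")

    if len(jurors) == 3:
        mode = "Full 3-juror tribunal"
    elif len(jurors) == 2:
        mode = "2-juror tribunal"
    else:
        mode = "Solo evaluation (explicitly noted in output)"
    return jurors, mode
```

For example, `tribunal_mode(True, False)` yields a two-juror tribunal of Claude and Gemini.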

Setup

1. Install the Skill

mkdir -p ~/.claude/skills/court && curl -sL \
  https://raw.githubusercontent.com/ceaksan/decision-gate/main/SKILL.md \
  -o ~/.claude/skills/court/SKILL.md

2. (Optional) Install Jurors

Installing Gemini and Kimi MCP servers is optional. If not installed, Claude evaluates solo.

Gemini MCP:

npm install -g @anthropic-ai/gemini-mcp
claude mcp add gemini -s user -- gemini-mcp

Kimi MCP:

You can write your own MCP server or use an existing Kimi MCP implementation.

3. (Optional) Project Configuration

Create .court.yml at your project root:

criteria:
  must_meet: [Security, Conflict]
  scored: [Benefit, Necessity, Burden, Performance, Bottleneck, Currency]
  threshold:
    go: 6
    defer: 4
jurors:
  gemini: true
  kimi: true
context_sources:
  - docs/architecture.md
  - CLAUDE.md
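
One plausible reading of the `threshold` block, sketched in Python: an overall score at or above `go` yields GO, and a score at or above `defer` yields DEFER. The article does not say whether the bounds are inclusive or what happens below `defer`; this sketch assumes inclusive bounds and treats anything lower as KILL. `verdict_from_score` is a hypothetical helper, and the veto check on Security and Conflict is assumed to have run beforehand.

```python
def verdict_from_score(score: float, go: float = 6.0, defer: float = 4.0) -> str:
    """Map an overall score to a verdict using the .court.yml thresholds.

    Assumptions: bounds are inclusive, and scores below `defer` mean KILL.
    """
    if score >= go:
        return "GO"
    if score >= defer:
        return "DEFER"
    return "KILL"
```

With the default thresholds, a score of 5.3 falls between `defer` and `go` and maps to DEFER.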

Usage

Spec File Evaluation

/court docs/specs/my-feature.md

Output is appended to the end of the file as ## Court Verdict.

Topic Evaluation

/court "Should we add Redis cache for session storage?"

Output is printed to the terminal, not written to a file.

Sample Output

## Court Verdict

**Date:** 2026-03-07
**Jurors:** Claude (orchestrator), Gemini (security/performance)
**Verdict:** DEFER
**Score:** 5.3/10

| Criterion   | Claude | Gemini | Avg  |
| ----------- | ------ | ------ | ---- |
| Security    | PASS   | PASS   | PASS |
| Conflict    | PASS   | PASS   | PASS |
| Benefit     | 7      | 6      | 6.5  |
| Necessity   | 4      | 3      | 3.5  |
| Burden      | 5      | 4      | 4.5  |
| Performance | 7      | 7      | 7.0  |
| Bottleneck  | 6      | 5      | 5.5  |
| Currency    | 6      | 5      | 5.5  |

### Dissenting Opinions

- **Gemini:** Necessity score 3. Current in-memory cache handles the load. Redis adds operational complexity with no proven need.

### Rationale

Redis caching improves performance but the current load does not require it. Re-evaluate when p99 latency exceeds SLA thresholds.
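
The Avg column in the sample can be reproduced with a few lines of Python. The sketch below assumes a plain unweighted mean per criterion and then an unweighted mean across criteria; under that assumption the overall figure comes out near 5.4, so the 5.3 reported in the sample presumably reflects a weighting or rounding rule that is not spelled out here.

```python
# Scored criteria from the sample verdict (must-meet criteria are PASS/FAIL
# and are excluded from the numeric average).
scores = {
    "Benefit": {"Claude": 7, "Gemini": 6},
    "Necessity": {"Claude": 4, "Gemini": 3},
    "Burden": {"Claude": 5, "Gemini": 4},
    "Performance": {"Claude": 7, "Gemini": 7},
    "Bottleneck": {"Claude": 6, "Gemini": 5},
    "Currency": {"Claude": 6, "Gemini": 5},
}


def criterion_averages(table: dict[str, dict[str, int]]) -> dict[str, float]:
    """Unweighted mean of the juror scores for each criterion."""
    return {name: sum(by_juror.values()) / len(by_juror) for name, by_juror in table.items()}


averages = criterion_averages(scores)
overall = sum(averages.values()) / len(averages)

print(averages["Necessity"])  # 3.5
print(round(overall, 1))      # 5.4
```

Either figure sits between the `defer` and `go` thresholds from .court.yml, so the DEFER verdict holds regardless of the exact aggregation.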

Why This Approach?

Alternatives and Why They Were Rejected

Approach	Advantage	Disadvantage	Decision
CLI tool	Separation of concerns	New dependency, build step, maintenance cost	Rejected
MCP server	Standard protocol	Over-engineering, unnecessary complexity	Rejected
Claude Code skill	Zero dependencies, single file	Tied to Claude	Selected

The skill approach offers the lowest overhead for a solo developer: a single markdown file, no build step, no dependencies. Copy SKILL.md into ~/.claude/skills/court/, type /court, and it works.

MCP: The Inter-AI Communication Layer

Model Context Protocol (MCP) is a standard protocol that enables Claude Code to communicate with external tools. In this context, MCP allows Claude to “invoke” Gemini and Kimi.

Claude Code
    |
    |-- mcp__gemini__gemini-query --> Gemini
    |
    |-- mcp__kimi__kimi_query --> Kimi

Each juror evaluates independently. Claude synthesizes the results and presents a verdict recommendation.

Design Decisions

Why 3 AIs?

A single AI creates confirmation bias. Even with two AIs, the concept of “majority” is weak. With three AIs:

If any two agree, that’s a strong signal
If all three disagree, the situation requires careful examination
If one AI disagrees, it is recorded as a “dissenting opinion”
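
The three cases above can be sketched as a small classification over per-juror verdicts. This Python example assumes each juror contributes one verdict string and that "agreement" means identical verdicts; `consensus` and its return shape are hypothetical.

```python
from collections import Counter


def consensus(verdicts: dict[str, str]) -> dict[str, object]:
    """Classify juror agreement: two or more matching verdicts is a strong signal."""
    counts = Counter(verdicts.values())
    top, top_count = counts.most_common(1)[0]
    if top_count >= 2:
        # The minority juror, if any, is recorded as a dissenting opinion.
        dissenters = [juror for juror, v in verdicts.items() if v != top]
        return {"signal": "strong", "verdict": top, "dissent": dissenters}
    # No two jurors agree: flag for careful human examination.
    return {"signal": "needs careful examination", "verdict": None, "dissent": []}


print(consensus({"Claude": "GO", "Gemini": "DEFER", "Kimi": "GO"}))
# {'signal': 'strong', 'verdict': 'GO', 'dissent': ['Gemini']}
```

With only two jurors a disagreement falls straight into the "needs careful examination" branch, which illustrates why a two-AI majority is weak.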

Why Adversarial?

There is an important difference between saying “evaluate this proposal” and “find the problems in this proposal.” The former invites a positive response bias; the latter triggers critical thinking.

The adversarial protocol explicitly instructs jurors to “find problems.” This structurally reduces the rubber-stamp rate.

Why Veto?

Security and Conflict should not be traded off against scored criteria. If an endpoint has an auth bypass, the verdict should be KILL even if the benefit is 10/10. The veto power of the must-meet criteria reflects this reality.

Why Not Red Team/Blue Team?

While preparing this article, I ran /court itself through the jury process. Kimi proposed an alternative architecture: Blue Team (Claude) proposes solutions, Red Team (Gemini) only attacks, and a deterministic rules engine serves as Judge. This approach would eliminate synthesis bias entirely.

I evaluated the proposal against Decision Gate criteria:

Criterion	Assessment	Result
Benefit	Eliminates synthesis bias	8/10
Necessity	Current veto system is already deterministic (Security/Conflict binary PASS/FAIL)	4/10
Burden	Deterministic rules engine = separate tool, new dependency, maintenance overhead	3/10
Conflict	Directly contradicts the single-file, zero-dependency design decision	FAIL

Verdict: KILL. A valid architectural critique, but transforming a single-file skill into a platform architecture creates a bigger problem than it solves for a solo developer.

This evaluation is itself evidence: the adversarial process that /court proposes maintains its consistency by running even its own alternatives through the same framework.

Limitations

Claude dependency: The skill only works in Claude Code. Adaptation is required for use in other AI IDEs.
MCP setup: Installing Gemini and Kimi MCP servers requires additional steps. However, they are optional.
Synthesis layer risk: Claude brings the proposal, relays it to jurors, and synthesizes the results, but the final decision belongs to the human. The risk lies in the synthesis stage: when summarizing juror outputs, Claude may construct a narrative that leans toward its own evaluation. For this reason, reviewing jurors’ raw outputs independently of the synthesis is important.
Relative independence: “Independent AIs” is a relative term. All models are trained on similar web corpora and carry similar alignment biases. Adversarial mode is a prompt-engineering level separation, not structural independence. Systematic blind spots (e.g., those originating from shared training data) may correlate across jurors.
Subjective scoring: The 1-10 scores given by AIs are not deterministic. Different runs on the same spec produce different scores. This is sufficient for capturing general trends, but should not be treated as exact numbers.
Context limits: Very large spec files may push the context limits of Gemini and Kimi in particular.

Conclusion

The first version of Decision Gate was a mental checklist: 8 criteria, 30-second evaluation. In this version, I turned the framework into a tool:

3 independent AIs with adversarial evaluation instead of a single AI
Veto system with no compromise on Security and Conflict
Graceful degradation so it works even without MCP
Open source with single-file installation

GitHub: ceaksan/decision-gate

Setup:

mkdir -p ~/.claude/skills/court && curl -sL \
  https://raw.githubusercontent.com/ceaksan/decision-gate/main/SKILL.md \
  -o ~/.claude/skills/court/SKILL.md

Other articles in the Decision Gate series:

Decision Gate: The Missing Piece of Vibe Coding (framework introduction)
ADR, OpenSpec and Spec-Driven Development: How Court’s ADR output and spec-first approach relate to the decision process
LLM Behavioral Failure Modes: Positional bias and other failure modes that the multi-AI tribunal aims to address
AI Agent Protocols Guide: How the A2A protocol standardizes multi-agent communication, relation to the tribunal approach

Key Takeaways

01 Single-AI evaluation carries rubber-stamp risk; multiple independent AIs running in adversarial mode produce higher-quality decisions
02 The /court skill installs by copying a single file into Claude Code and requires zero dependencies
03 Jurors are invoked via MCP (Model Context Protocol); Gemini and Kimi are optional, Claude evaluates solo if they are unavailable
04 In a real codebase audit, 6 out of 28 tasks were eliminated or deferred through this system

Frequently Asked Questions (FAQ)

What is /court?

/court is a skill (slash command) for Claude Code. It evaluates spec files or technical decisions against Decision Gate's 8 criteria and can optionally add Gemini and Kimi as independent jurors.

Why use multiple AIs?

Having a single AI both propose and evaluate creates confirmation bias. Different models catch different weaknesses: Gemini tends to be stronger on security and performance, Kimi on architecture and necessity analysis.

What is MCP and why is it needed?

Model Context Protocol is a standard protocol that enables Claude Code to communicate with external tools. Gemini and Kimi are invoked as jurors through MCP servers. If MCP is not installed, Claude evaluates solo.

How does the veto system work?

Security and Conflict criteria are defined as must-meet (knockout). If any juror FAILs on either of these two criteria, the verdict is an automatic KILL regardless of other scores.

How do I set it up in my own project?

Run mkdir -p ~/.claude/skills/court && curl -sL https://raw.githubusercontent.com/ceaksan/decision-gate/main/SKILL.md -o ~/.claude/skills/court/SKILL.md to install. Optionally configure with .court.yml at your project root.
