baro — Docs

baro 0.44 ships three end-to-end LLM paths (Claude Code, OpenAI Codex CLI, and Mozaik-native OpenAI) plus a hybrid preset that mixes them per phase. The --llm flag picks how every agent in the run (Architect, Planner, Critic, Surgeon, Story Agents) talks to its model.

baro --llm claude  "Your goal"     # default; every phase via Claude Code CLI
baro --llm codex   "Your goal"     # every phase via OpenAI Codex CLI
baro --llm openai  "Your goal"     # every phase via Mozaik-native OpenAI Responses
baro --llm hybrid  "Your goal"     # Claude on Architect/Planner/Surgeon, Codex on Story/Critic

The DAG, the event bus, the participant set, the prompts — all identical across providers. The only thing that moves is which transport every call rides on.

What each flag actually does

--llm claude (the default) does not call Anthropic's API directly. It shells out to the Claude Code CLI (claude --print in headless mode) for every phase. Auth comes from your active Claude Code session.

--llm codex shells out to the OpenAI Codex CLI (codex exec --json) for every phase, passing --dangerously-bypass-approvals-and-sandbox so the agent can write outside the project root (necessary for .git/ commits). Auth comes from your active Codex CLI session, billed against your ChatGPT Pro/Plus subscription.

--llm openai calls the OpenAI Responses API directly through Mozaik's native runner — no subprocess, no CLI in between. Auth comes from an OPENAI_API_KEY environment variable. Every token is billed per-call at retail.

--llm hybrid is a preset, equivalent to:

baro --architect-llm claude \
     --planner-llm  claude \
     --story-llm    codex  \
     --critic-llm   codex  \
     --surgeon-llm  claude \
     "Your goal"

The orchestration is unchanged, but the Story phase (the parallel agents that do the actual writing) and the Critic phase (the per-turn evaluator, and the highest-volume call in a run) route through Codex, while the small upstream phases that decide what to build route through Claude. On the BaroEventForwarder refactor, the hybrid run used 3% of the Claude 5-hour cap plus 3% of the Codex 5-hour cap, a 6% total subscription footprint, versus 14% on pure Claude or 5% on pure Codex. The diff was indistinguishable from the pure-Codex run, and plan tightness was indistinguishable from the pure-Claude run.

Picking a backend

Pick this	When
`--llm claude`	Single small focused change, plan-tightness matters more than budget. Bills against Claude Max subscription.
`--llm codex`	Greenfield work, small features, exploratory generation. ~3–11× cheaper per equivalent run than Claude depending on read/write ratio. Bills against ChatGPT Pro/Plus subscription.
`--llm openai`	You want raw wall-clock speed and have an OpenAI API key. Mozaik-native calls return faster per request than either CLI subprocess. Roughly $1–2 per mid-sized run with cache hits.
`--llm hybrid`	Anything serious: refactors, multi-file changes, anything where the upstream plan matters as much as the downstream writes. This is the recommended default for production runs starting with 0.44.

Full side-by-side benchmark across three tasks (feature add / refactor / greenfield landing): see the post "I tested Claude Code vs OpenAI Codex in my parallel agent setup. Then I built a hybrid."

Per-phase routing

Each phase has its own --<phase>-llm flag. Per-phase overrides take precedence over the top-level --llm, which acts as the default for any phase you don't explicitly override.

# Hybrid by hand — equivalent to --llm hybrid
baro --architect-llm claude \
     --planner-llm  claude \
     --story-llm    codex  \
     --critic-llm   codex  \
     --surgeon-llm  claude \
     "Your goal"

# Claude planning, Codex everywhere else
baro --llm codex --architect-llm claude --planner-llm claude "Your goal"

# Codex story work but keep Claude reviewing
baro --llm claude --story-llm codex "Your goal"

# Three-provider hybrid (because you can)
baro --architect-llm claude --planner-llm openai --story-llm codex \
     --critic-llm codex --surgeon-llm claude "Your goal"

The Mozaik bus shape is identical regardless of which CLI produced the event, so participants on different providers can coexist in the same DAG without translation layers — every agent_state, function_call, story_result event looks the same downstream.
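The precedence rule in this section (per-phase flag first, then the top-level --llm, with hybrid expanding to its preset mapping) can be sketched as a small shell function. This is an illustrative model only: resolve_phase_llm and its argument order are hypothetical, not part of baro, and the real resolution lives inside baro itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of baro): models how a phase's provider is picked.
# Usage: resolve_phase_llm <--llm value> <phase> [per-phase override]
resolve_phase_llm() {
  _top="${1:-claude}"    # --llm defaults to claude
  _phase="$2"
  _override="${3:-}"     # value of --<phase>-llm, empty if not given

  # A per-phase flag wins over the top-level --llm.
  if [ -n "$_override" ]; then
    printf '%s\n' "$_override"
    return 0
  fi

  # --llm hybrid expands to the documented preset mapping.
  if [ "$_top" = "hybrid" ]; then
    case "$_phase" in
      architect|planner|surgeon) echo claude ;;
      story|critic)              echo codex ;;
      *) echo "unknown phase: $_phase" >&2; return 1 ;;
    esac
    return
  fi

  # Otherwise every phase inherits the top-level provider.
  printf '%s\n' "$_top"
}

# Demo: baro --llm codex --architect-llm claude --planner-llm claude
for p in architect planner story critic surgeon; do
  case "$p" in
    architect|planner) o=claude ;;
    *)                 o= ;;
  esac
  printf '%-9s -> %s\n' "$p" "$(resolve_phase_llm codex "$p" "$o")"
done
```

The demo loop prints one line per phase, which makes it easy to sanity-check a flag combination before spending subscription budget on it.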

Model defaults per phase

Every phase has a routed default, picked to match the workload of that phase. Reasoning-heavy phases get the flagship model; the per-turn verdict phase gets the cheap one.

Phase	`--llm claude`	`--llm codex`	`--llm openai`
Architect	`opus`	`gpt-5.5` (Codex CLI default)	`gpt-5.5`
Planner	`opus`	`gpt-5.5`	`gpt-5.5`
Story Agent	`opus`	`gpt-5.5`	`gpt-5.5`
Critic	`haiku`	`gpt-5.5` (Codex CLI uses one model per session)	`gpt-5.4-mini`
Surgeon	`opus`	`gpt-5.5`	`gpt-5.5`

Critic stays on the cheap tier on Claude/OpenAI paths because it runs once per agent per turn (highest-volume call in a run) and its verdict is a structured PASS/FAIL — flagship reasoning doesn't move that needle. Codex CLI doesn't let you swap models mid-session, so on the Codex path every phase shares whatever model your Codex CLI is configured for (default gpt-5.5 with reasoning_effort=high).
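For reference in scripts, the defaults table can be written as a lookup. The function below is a sketch that only mirrors the table; default_model is a hypothetical name, it does not model the --model or --*-model overrides, and on the Codex path the real model is whatever your Codex CLI is configured for.

```shell
#!/bin/sh
# Hypothetical lookup that mirrors the defaults table; not baro's own code.
# Usage: default_model <claude|codex|openai> <architect|planner|story|critic|surgeon>
default_model() {
  case "$2" in
    architect|planner|story|critic|surgeon) ;;
    *) echo "unknown phase: $2" >&2; return 1 ;;
  esac

  case "$1:$2" in
    claude:critic) echo haiku ;;          # cheap tier for the per-turn verdict
    claude:*)      echo opus ;;
    openai:critic) echo gpt-5.4-mini ;;   # cheap tier for the per-turn verdict
    openai:*)      echo gpt-5.5 ;;
    # Codex CLI runs one model per session; gpt-5.5 is the stated CLI default.
    codex:*)       echo gpt-5.5 ;;
    *) echo "unknown provider: $1" >&2; return 1 ;;
  esac
}

# Demo: print the routed defaults for the OpenAI path.
for p in architect planner story critic surgeon; do
  printf '%-9s %s\n' "$p" "$(default_model openai "$p")"
done
```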

Overriding the model

Each phase has its own --*-model flag if you want to override the routed default without touching the rest:

baro --llm openai \
  --architect-model gpt-5.5 \
  --planner-model gpt-5.4 \
  --story-model gpt-5.5 \
  --critic-model gpt-5.4-nano \
  "Your goal"

Available OpenAI models (Mozaik 3.10): gpt-5.5, gpt-5.4, gpt-5.4-mini, gpt-5.4-nano. Available Claude models (via Claude Code): opus, sonnet, haiku. Available Codex models: whatever your Codex CLI session is configured for (typically gpt-5.5).

If you want one model pinned across every phase, use --model instead — it overrides every per-phase default:

baro --llm openai --model gpt-5.4 "Your goal"     # whole run on 5.4
baro --llm claude --model sonnet "Your goal"      # whole run on Sonnet

To opt out of routing entirely (no per-phase defaults, no per-phase overrides — every phase falls back to the SDK's own default), pass --no-model-routing.

Setting up each backend

Claude Code path (--llm claude)

# Install + auth Claude Code CLI
npm install -g @anthropic-ai/claude-code
claude   # walks you through login

# Verify baro sees it
baro --doctor

Authenticated sessions bill against your active Claude plan.

Codex CLI path (--llm codex, --llm hybrid)

# Install + auth Codex CLI
npm install -g @openai/codex
codex   # walks you through ChatGPT login

# Verify baro sees it
baro --doctor

Authenticated sessions bill against your ChatGPT Pro/Plus subscription.

Mozaik-native OpenAI (--llm openai)

If your shell already has OPENAI_API_KEY exported, baro picks it up automatically:

export OPENAI_API_KEY=sk-...
baro --llm openai "Your goal"

If you don't have it in your environment, baro detours through an interactive API-key entry screen before planning starts. Whatever you type stays in baro's memory for the duration of the run — it is not persisted to disk and is not logged.

--llm hybrid needs both the Claude Code CLI and the Codex CLI installed and authenticated; baro fails early, at planning, if either is missing.
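The prerequisites per --llm value can be summarized as a quick preflight sketch. The function names here are hypothetical and baro --doctor remains the supported check. The sketch only tests that a CLI is on PATH or that OPENAI_API_KEY is set, which does not prove a session is authenticated.

```shell
#!/bin/sh
# Hypothetical preflight (not part of baro); baro --doctor is the supported check.

# Prints the prerequisites a given --llm value depends on, one per line.
required_backends() {
  case "$1" in
    claude) echo claude-cli ;;
    codex)  echo codex-cli ;;
    openai) echo openai-key ;;
    hybrid) echo claude-cli; echo codex-cli ;;
    *) echo "unknown --llm value: $1" >&2; return 1 ;;
  esac
}

# Returns 0 when the prerequisite looks present.
# Being on PATH does not prove the CLI session is authenticated.
backend_ready() {
  case "$1" in
    claude-cli) command -v claude >/dev/null 2>&1 ;;
    codex-cli)  command -v codex  >/dev/null 2>&1 ;;
    openai-key) [ -n "${OPENAI_API_KEY:-}" ] ;;
    *) return 1 ;;
  esac
}

# Reports each prerequisite; returns 1 if any is missing, 2 on a bad value.
preflight() {
  _needs=$(required_backends "$1") || return 2
  _status=0
  for _n in $_needs; do
    if backend_ready "$_n"; then
      echo "ok       $_n"
    else
      echo "missing  $_n"
      _status=1
    fi
  done
  return "$_status"
}

# Demo: check what --llm hybrid would need on this machine.
preflight hybrid || echo "hybrid prerequisites incomplete"
```

A missing OPENAI_API_KEY is not fatal for --llm openai, since baro falls back to the interactive key-entry screen; the sketch flags it only so unattended runs do not stall on that prompt.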

Codex tool-call profile

One measurable difference between Codex and Claude: across three real benchmark tasks, Codex makes 3–4× as many tool calls as Claude to land equivalent work. Reads, greps, list-files, file-writes, then more reads to verify the writes. This is real model behavior, not a missing-context artifact — verified by mirroring CLAUDE.md to AGENTS.md so neither side has a context advantage (Codex CLI reads AGENTS.md at project root the way Claude Code reads CLAUDE.md). The ratio stays within noise either way.

The tool-call appetite is part of why Codex is so cheap per equivalent diff (each tool call is small, each token bills at a fraction of Claude's rate) and also part of why the hybrid preset routes the architecture through Claude — Codex's tendency to over-decompose and over-abstract upstream is the same instinct that makes it good at the parallel write work downstream.
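To check the tool-call ratio on your own runs, one option is to count function_call events in the JSONL audit logs of a Claude run and a Codex run on the same goal. The sketch below assumes each log line is a JSON object with the event name in a "type" field; that field name, the count_events helper, and the sample records are assumptions for illustration, so adjust the pattern to whatever your logs actually contain.

```shell
#!/bin/sh
# count_events <event-name> <jsonl-file>
# ASSUMPTION: every log line is one JSON object whose event name sits in a
# "type" field. Adjust the pattern if your logs use a different key.
count_events() {
  # grep -c exits non-zero on zero matches but still prints 0, hence "|| true".
  grep -c "\"type\"[[:space:]]*:[[:space:]]*\"$1\"" "$2" || true
}

# Demo on a throwaway, made-up sample instead of a real audit log.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{"type":"agent_state","state":"running"}
{"type":"function_call","name":"read_file"}
{"type":"function_call","name":"grep"}
{"type":"story_result","status":"PASS"}
EOF
echo "function_call events: $(count_events function_call "$sample")"
rm -f "$sample"
```

Running count_events against one log per provider and dividing the two counts gives a rough version of the ratio discussed above.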

What stays the same on all paths

The Architect → Planner → Story → Critic → Surgeon → Finalizer pipeline is identical.
The DAG shape is decided by the Planner; all providers produce DAGs of comparable structural quality on the same goal.
The Critic loop runs on all paths: agent output is graded every turn against the story's acceptance criteria, regardless of provider.
The Finalizer composes the same PR body shape regardless of provider.
--quick, --parallel, --timeout, --resume, --intra-level-delay, --no-* toggles all work identically.
Audit logs are JSONL on the same Mozaik bus shape regardless of which CLI produced the event — kaleidoskop replay works on all provider mixes.

The orchestration is what baro built. The model is replaceable, and as of 0.44, mixable per phase.
