LLM providers
baro runs on three end-to-end LLM paths — Claude Code, OpenAI Codex CLI, and Mozaik-native OpenAI. Same DAG, same orchestration, three completely different transports underneath. Plus a hybrid preset that mixes them per phase.
baro 0.44 ships all three paths and the hybrid preset. The --llm flag picks how every agent in the run (Architect, Planner, Critic, Surgeon, Story Agents) talks to its model.
baro --llm claude "Your goal" # default; every phase via Claude Code CLI
baro --llm codex "Your goal" # every phase via OpenAI Codex CLI
baro --llm openai "Your goal" # every phase via Mozaik-native OpenAI Responses
baro --llm hybrid "Your goal" # Claude on Architect/Planner/Surgeon, Codex on Story/Critic
The DAG, the event bus, the participant set, and the prompts are all identical across providers. The only thing that moves is which transport every call rides on.
What each flag actually does
--llm claude (the default) does not call Anthropic's API directly. It shells out to the Claude Code CLI (claude --print in headless mode) for every phase. Auth comes from your active Claude Code session.
--llm codex shells out to the OpenAI Codex CLI (codex exec --json) for every phase, with the --dangerously-bypass-approvals-and-sandbox flag so the agent can write outside the project root (necessary for .git/ commits). Auth comes from your active Codex CLI session, billed against your ChatGPT Pro/Plus subscription.
--llm openai calls the OpenAI Responses API directly through Mozaik's native runner — no subprocess, no CLI in between. Auth comes from an OPENAI_API_KEY environment variable. Every token is billed per-call at retail.
--llm hybrid is a preset, equivalent to:
baro --architect-llm claude \
--planner-llm claude \
--story-llm codex \
--critic-llm codex \
--surgeon-llm claude \
"Your goal"
Same orchestration, but the Story phase (the parallel agents that do the actual writing) and the Critic phase (the per-turn evaluator on the highest-volume call in a run) route through Codex, while the small upstream phases that decide what to build route through Claude. The result on the BaroEventForwarder refactor: 3% Claude 5h cap + 3% Codex 5h cap = 6% total subscription footprint, vs 14% on pure-Claude or 5% on pure-Codex alone. The diff was indistinguishable from pure-Codex output, and plan tightness was indistinguishable from pure-Claude.
Picking a backend
| Pick this | When |
|---|---|
| --llm claude | Single small focused change, plan-tightness matters more than budget. Bills against Claude Max subscription. |
| --llm codex | Greenfield work, small features, exploratory generation. ~3–11× cheaper per equivalent run than Claude depending on read/write ratio. Bills against ChatGPT Pro/Plus subscription. |
| --llm openai | You want raw wall-clock speed and have an OpenAI API key. Mozaik-native calls return faster per request than either CLI subprocess. Roughly $1–2 per mid-sized run with cache hits. |
| --llm hybrid | Anything serious. Refactors, multi-file changes, anything where the upstream plan matters as much as the downstream writes. This is the recommended default for production runs starting with 0.44. |
Full side-by-side benchmark across three tasks (feature add / refactor / greenfield landing): I tested Claude Code vs OpenAI Codex in my parallel agent setup. Then I built a hybrid.
Per-phase routing
Each phase has its own --<phase>-llm flag. Per-phase overrides take precedence over the top-level --llm, which acts as the default for any phase you don't explicitly override.
# Hybrid by hand — equivalent to --llm hybrid
baro --architect-llm claude \
--planner-llm claude \
--story-llm codex \
--critic-llm codex \
--surgeon-llm claude \
"Your goal"
# Claude planning, Codex everywhere else
baro --llm codex --architect-llm claude --planner-llm claude "Your goal"
# Codex story work but keep Claude reviewing
baro --llm claude --story-llm codex "Your goal"
# Three-provider hybrid (because you can)
baro --architect-llm claude --planner-llm openai --story-llm codex \
--critic-llm codex --surgeon-llm claude "Your goal"
The Mozaik bus shape is identical regardless of which CLI produced the event, so participants on different providers can coexist in the same DAG without translation layers: every agent_state, function_call, and story_result event looks the same downstream.
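The precedence rule from this section (per-phase flag first, then the top-level --llm, then the claude default) can be modeled in a few lines of shell. This is an illustrative sketch only: resolve_llm and hybrid_preset are hypothetical names, not functions from baro, and letting an explicit per-phase flag override the hybrid preset is an assumption that follows from the stated rule rather than something the text confirms.

```shell
#!/usr/bin/env bash
# Illustrative model only: resolve_llm and hybrid_preset are hypothetical
# names, not part of baro.

# Provider the hybrid preset assigns to a phase (from the preset expansion).
hybrid_preset() {
  case "$1" in
    architect|planner|surgeon) echo claude ;;
    story|critic)              echo codex ;;
    *) echo "unknown phase: $1" >&2; return 1 ;;
  esac
}

# resolve_llm PHASE [TOP_LEVEL] [PHASE_OVERRIDE]
resolve_llm() {
  local phase="$1" top_level="${2:-claude}" override="${3:-}"
  if [ -n "$override" ]; then
    echo "$override"          # explicit --<phase>-llm wins
  elif [ "$top_level" = hybrid ]; then
    hybrid_preset "$phase"    # preset expands per phase (assumed to yield to overrides)
  else
    echo "$top_level"         # --llm is the default for every phase
  fi
}

resolve_llm story codex            # prints: codex
resolve_llm planner codex claude   # prints: claude (per-phase override)
resolve_llm critic hybrid          # prints: codex
resolve_llm architect              # prints: claude (no flags given)
```

Read this way, "Claude planning, Codex everywhere else" is a top-level codex default plus two per-phase overrides.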
Model defaults per phase
Every phase has a routed default, picked to match the workload of that phase. Reasoning-heavy phases get the flagship model; the per-turn verdict phase gets the cheap one.
| Phase | --llm claude | --llm codex | --llm openai |
|---|---|---|---|
| Architect | opus | gpt-5.5 (Codex CLI default) | gpt-5.5 |
| Planner | opus | gpt-5.5 | gpt-5.5 |
| Story Agent | opus | gpt-5.5 | gpt-5.5 |
| Critic | haiku | gpt-5.5 (Codex CLI uses one model per session) | gpt-5.4-mini |
| Surgeon | opus | gpt-5.5 | gpt-5.5 |
Critic stays on the cheap tier on Claude/OpenAI paths because it runs once per agent per turn (highest-volume call in a run) and its verdict is a structured PASS/FAIL — flagship reasoning doesn't move that needle. Codex CLI doesn't let you swap models mid-session, so on the Codex path every phase shares whatever model your Codex CLI is configured for (default gpt-5.5 with reasoning_effort=high).
Overriding the model
Each phase has its own --*-model flag if you want to override the routed default without touching the rest:
baro --llm openai \
--architect-model gpt-5.5 \
--planner-model gpt-5.4 \
--story-model gpt-5.5 \
--critic-model gpt-5.4-nano \
"Your goal"
Available OpenAI models (Mozaik 3.10): gpt-5.5, gpt-5.4, gpt-5.4-mini, gpt-5.4-nano.
Available Claude models (via Claude Code): opus, sonnet, haiku.
Available Codex models: whatever your Codex CLI session is configured for (typically gpt-5.5).
If you want one model pinned across every phase, use --model instead — it overrides every per-phase default:
baro --llm openai --model gpt-5.4 "Your goal" # whole run on 5.4
baro --llm claude --model sonnet "Your goal" # whole run on Sonnet
To opt out of routing entirely (no per-phase defaults, no per-phase overrides; every phase falls back to the SDK's own default), pass --no-model-routing.
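The routed defaults from the table, together with the --model pin, can be read as a small lookup. The sketch below is illustrative only: default_model and resolve_model are hypothetical names rather than baro internals. It does not model the per-phase --*-model flags or --no-model-routing, and how a pin behaves on the Codex path is not specified in this section, so the sketch simply applies it uniformly.

```shell
#!/usr/bin/env bash
# Illustrative model only: default_model and resolve_model are hypothetical
# names, not part of baro.

# default_model PROVIDER PHASE: routed default, as listed in the table.
default_model() {
  local provider="$1" phase="$2"
  case "$provider:$phase" in
    claude:critic) echo haiku ;;
    claude:*)      echo opus ;;
    codex:*)       echo gpt-5.5 ;;       # Codex CLI uses one model per session
    openai:critic) echo gpt-5.4-mini ;;
    openai:*)      echo gpt-5.5 ;;
    *) echo "unknown provider: $provider" >&2; return 1 ;;
  esac
}

# resolve_model PROVIDER PHASE [PINNED]: a --model pin replaces the routed default.
# Assumption: the pin is applied the same way on every provider path.
resolve_model() {
  local pinned="${3:-}"
  if [ -n "$pinned" ]; then
    echo "$pinned"
  else
    default_model "$1" "$2"
  fi
}
```

For example, resolve_model openai critic yields gpt-5.4-mini, while resolve_model openai critic gpt-5.4 yields gpt-5.4, matching the "whole run on 5.4" invocation.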
Setting up each backend
Claude Code path (--llm claude)
# Install + auth Claude Code CLI
npm install -g @anthropic-ai/claude-code
claude # walks you through login
# Verify baro sees it
baro --doctor
Authenticated sessions bill against your active Claude plan.
Codex CLI path (--llm codex, --llm hybrid)
# Install + auth Codex CLI
npm install -g @openai/codex
codex # walks you through ChatGPT login
# Verify baro sees it
baro --doctor
Authenticated sessions bill against your ChatGPT Pro/Plus subscription.
Mozaik-native OpenAI (--llm openai)
If your shell already has OPENAI_API_KEY exported, baro picks it up automatically:
export OPENAI_API_KEY=sk-...
baro --llm openai "Your goal"
If you don't have it in your environment, baro detours through an interactive API-key entry screen before planning starts. Whatever you type stays in baro's memory for the duration of the run; it is not persisted to disk and is not logged.
--llm hybrid needs both Claude CLI and Codex CLI authenticated. baro will fail early at planning if either is missing.
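Before a hybrid run, a quick pre-flight in your own shell can catch a missing CLI. The sketch below only checks that the claude and codex binaries are on PATH. It does not check authentication, and missing_clis is a hypothetical helper, not a baro command. baro --doctor remains the documented way to verify that baro sees each CLI.

```shell
#!/usr/bin/env bash
# missing_clis NAME...: print each named command that is not found on PATH.
# Hypothetical helper for illustration; checks installation only, not login state.
missing_clis() {
  local cli
  for cli in "$@"; do
    command -v "$cli" >/dev/null 2>&1 || echo "$cli"
  done
}

missing="$(missing_clis claude codex)"
if [ -n "$missing" ]; then
  echo "not found on PATH (needed for --llm hybrid):" >&2
  echo "$missing" >&2
fi
```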
Codex tool-call profile
One measurable difference between Codex and Claude: across three real benchmark tasks, Codex makes 3–4× as many tool calls as Claude to land equivalent work. Reads, greps, list-files, file-writes, then more reads to verify the writes. This is real model behavior, not a missing-context artifact — verified by mirroring CLAUDE.md to AGENTS.md so neither side has a context advantage (Codex CLI reads AGENTS.md at project root the way Claude Code reads CLAUDE.md). The ratio stays within noise either way.
The tool-call appetite is part of why Codex is so cheap per equivalent diff (each tool call is small, each token bills at a fraction of Claude's rate) and also part of why the hybrid preset routes the architecture through Claude — Codex's tendency to over-decompose and over-abstract upstream is the same instinct that makes it good at the parallel write work downstream.
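To reproduce the equal-context setup described above, the project's CLAUDE.md can be mirrored to AGENTS.md at the project root. This is a minimal sketch: mirror_context is a hypothetical helper, and whether the original benchmark used a copy or a symlink is not stated, so a plain copy is assumed here.

```shell
#!/usr/bin/env bash
# mirror_context [DIR]: copy CLAUDE.md to AGENTS.md in DIR (default: current dir).
# Hypothetical helper; note that it overwrites any existing AGENTS.md.
mirror_context() {
  local dir="${1:-.}"
  if [ ! -f "$dir/CLAUDE.md" ]; then
    echo "no CLAUDE.md in $dir" >&2
    return 1
  fi
  cp "$dir/CLAUDE.md" "$dir/AGENTS.md"
}
```

Run mirror_context from the project root before a Codex or hybrid run. A copy can drift from the original over time, so a symlink (ln -s CLAUDE.md AGENTS.md) is an alternative if you want the two files to stay in sync.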
What stays the same on all paths
- The Architect → Planner → Story → Critic → Surgeon → Finalizer pipeline is identical.
- The DAG shape is decided by the Planner; all providers produce DAGs of comparable structural quality on the same goal.
- The Critic loop runs on all paths — agent output gets per-turn graded against the story's acceptance criteria regardless of provider.
- The Finalizer composes the same PR body shape regardless of provider.
- --quick, --parallel, --timeout, --resume, --intra-level-delay, and the --no-* toggles all work identically.
- Audit logs are JSONL on the same Mozaik bus shape regardless of which CLI produced the event, so kaleidoskop replay works on all provider mixes.
The orchestration is what baro built. The model is replaceable, and as of 0.44, mixable per phase.