Claude Code Pipeline

The AI writes the code. Static analysis, test suites, and adversarial review decide if it's any good.

Solo project · Built from autumn 2025, actively evolving

claude code bash sonarqube git worktrees

A process framework that lets one person handle what normally requires coordination across several roles, from requirements through QA, by structuring how AI agents work and how their output gets verified. For larger work, execution graphs coordinate multiple features through the pipeline with dependency ordering, cross-feature context, and mixed autonomy.

How it started

In autumn 2025 I started building with GitHub Copilot. Prompt, fix, prompt, fix. I tried Cursor, then settled on Claude Code. The pipeline grew from there: each frustration became a feature. Repetitive prompts became reusable commands. Drifting settings became a central config repo. Inconsistent QA became deterministic gates. Manual approvals on work that didn't need me became autopilot. None of it was planned upfront. Each project fed back into the pipeline, and the pipeline fed back into the next project.

What it does

Instead of ad-hoc prompting, the pipeline runs a structured development workflow. 13 core commands and 13 specialist AI agents handle the workflow: requirements, UX design, architecture, implementation, static analysis, manual testing, quality assurance, and deployment, backed by ~20 supporting commands for project planning, utilities, and shortcuts. Every phase produces committed artifacts. Every decision is traceable in git history.

For larger work, execution graphs break a project into dependency-ordered tasks, each running through the same pipeline independently, some unattended, some with human checkpoints.

Guardrails

AI agents with filesystem access need hard boundaries, not guidelines. The pipeline's security model has three layers. At the kernel level, a macOS Seatbelt sandbox restricts all file writes to the project directory and /tmp. Credential paths like ~/.ssh and ~/.aws are blocked from reads entirely. At the application level, permission deny rules prevent the AI from editing shell config, its own settings, or credential files. At the git level, any tracked file change is reversible. The boundaries hold regardless of what the model does.

This is a solo setup, but the layers are the same ones any organisation will need when AI agents operate beyond a sandbox notebook: kernel-level containment (a macOS sandbox here, containers in a team setting), application-level policy, and version control as the reversibility guarantee.

How the workflow runs

In practice, the work splits like this:

LLM handles: Requirements analysis, UX reasoning, architectural decisions, code generation, test writing, documentation
Deterministic tools handle: Type checking (mypy, tsc), linting (ruff, eslint), code quality (SonarQube), test execution (pytest, vitest), formatting
The pipeline enforces: TDD (write the failing test first, a pre-commit hook blocks violations), phase-gated commits, and an evaluator-optimizer loop where no agent marks its own work

The LLM can sound confident about broken code. The type checker doesn't care how confident it sounds. The whole system is built on that gap.

Two ways to run it

The commands are the same. How they're orchestrated differs:

Flow mode. For work that needs human judgement: UI features, ambiguous requirements, anything where I want to review output before it propagates. I run /start flow my-feature in a Claude Code session, it creates an isolated git worktree, and I move through phases conversationally, approving each one.
Autopilot mode. For backend work where the deterministic gates are sufficient. The entire pipeline runs unattended in tmux via autopilot.sh. Same agents, same review cycles, same static analysis gates, but all human checkpoints auto-approved. This is one of the things the Agent Dashboard was built to monitor.

Individual commands also work standalone without the full pipeline, useful for hotfixes, documentation updates, or quick refactors where the ceremony is overkill.

Terminal showing autopilot commit phase: 1383 tests passing, pre-flight checks clear, merge to main, and autopilot auto-approving the checkpoint

Autopilot auto-approving a commit checkpoint. Tests pass, merge to main, cleanup.

Here's what a feature looks like going through the full pipeline:

Create isolated git worktree and feature branch.

01 Requirements LLM

Requirements and acceptance criteria. In flow mode, UX design and user flows are added. Five agents review independently. No hard gates, all creative work.

human approves before advancing · auto-approved in autopilot

02 Architecture LLM

Module design, dependency ordering. Five specialist agents review the plan, fixer resolves findings, reviewers re-verify. Cycles until clean.

human approves before advancing · auto-approved in autopilot

03 Implementation LLM + Deterministic

TDD: write the failing test first, then code, then refactor. Linters and type checkers (ruff, mypy, tsc) run inline during implementation. Static analysis then runs SonarQube and coverage enforcement. LLM fixes what tools flag.

human approves before advancing · auto-approved in autopilot

04 Quality Human + LLM + Deterministic

Human verifies in browser. Five agents review implementation. Deterministic gate: all tests pass or it doesn't ship.

human approves before advancing · auto-approved in autopilot

05 Release

Merge feature branch to main, push, clean up worktree. If the feature is part of an execution graph, update the plan to mark the task as complete.

Git log showing two feature branches merged to main, each with phase-labeled commits: define requirements, architect plan, team review, implement, static analysis, team QA, finalize for merge

Git log showing the audit trail. Every phase produces a traceable commit on the feature branch.

No agent marks its own work

The same evaluator-optimizer pattern runs in plan review (/team-review), quality assurance (/team-qa), and project planning (/plan-project --team). Specialist agents review independently, a separate fixer resolves their findings, and the original reviewers run again to verify. The 13 agents span architect, analyst, lead developer, security auditor, QA engineer, performance engineer, UX designer, code reviewer, DevOps engineer, dependency auditor, test explorer, fixer, and a canary for diagnostics. No agent ever approves its own fixes. If mypy disagrees with the code, it doesn't matter how eloquently the agent argues otherwise.

Every command knows what came before

Each phase runs in a fresh session. No conversation history carries over. Instead, each command loads exactly the committed artifacts it needs. This is context engineering applied to an agent harness: the pipeline controls what each agent sees, rather than letting context accumulate across a long conversation.

/implement reads the requirements and architecture plan. /team-qa reads the full chain: what was required, what was designed, what was built, and what static analysis found. But none of them inherit the reasoning or mistakes from an earlier phase's session. Every phase starts clean, with curated context from git.

Requirements /ba /ux

acceptance criteria, scope, UX design

Architecture /plan /team-review

modules, dependencies, file paths

Code /implement

TDD against acceptance criteria, inline lint + type checking

Verification /static-analysis

SonarQube scan, coverage enforcement, baseline regression

Quality /manualtest /team-qa

verifies code against the full artifact chain

Each command reads what was committed before it. The chain is cumulative: /team-qa sees everything from /ba through /static-analysis. No context is passed between commands manually. It's all in git.

Execution graphs: coordinating what a single pipeline can't

"Plans are worthless, but planning is everything." — Dwight D. Eisenhower, remarks to the National Defense Executive Reserve, 1957

The pipeline above handles one feature at a time. A real module might need a data layer, an API, business logic, and a UI, each depending on the others, each running through its own requirements-to-QA cycle.

Execution graphs are the coordination layer for this. Via /plan-project, work gets broken into structured YAML plans with phases, tasks, dependency chains, and quality gates. Each task is a feature that enters the pipeline independently. Tasks within a phase can run in parallel. Tasks across phases respect dependency ordering. Each task also carries an autopilot eligibility flag. Backend work with clear acceptance criteria runs unattended, while UI work stays in flow mode with human checkpoints. The same team-based review pattern applies here: /plan-project --team brings up to eight specialist agents into the planning process itself, poking at the scope and dependencies before any code is written.

I'm not naive about upfront planning. Nobody can predict everything at the start, and detailed plans drift the moment they meet reality. But execution graphs are useful for a different reason: they force me to think about scope, sequencing, and dependencies before writing code. Even when the plan changes (and it will), having made the plan reveals the right questions early. When reality does drift from the plan, typically after a manual test phase that surfaces unexpected adjustments, /plan-project --update reconciles the graph with what actually happened, preserving completed work while re-sequencing what remains. The cost of replanning is minutes, not days. That changes how willing I am to change direction mid-project.

A simplified example of what an execution graph looks like in practice:

01 Foundation gate: all tests pass

Data layer autopilot

API layer autopilot

Configuration autopilot

All three run in parallel, no dependencies between them.

02 Core logic gate: tests + static analysis

Business logic autopilot · after data layer

Integration tests autopilot · after business logic

Sequential, each task unblocks the next.

03 Interface gate: e2e tests + manual QA

UI components flow · after API layer

End-to-end wiring flow · after business logic

Parallel, both need human review, so both run in flow mode.

Each task enters the full pipeline independently, /start through /commit. Tasks marked autopilot run unattended. Tasks marked flow pause for human checkpoints. The gate between phases must pass before downstream work begins.

Context flows across the graph

The execution graph doesn't just order tasks. It also controls what each task knows about the others. When a task depends on completed work, /ba and /ux automatically load the predecessor's artifacts. If the data layer shipped before the UI layer starts, /ba for the UI knows what the data layer promised and what it actually delivered. /ux picks up the component patterns and data schemas that were established.

This isn't implicit. The execution plan declares dependencies explicitly, and only completed dependencies are loaded. Three documents cross the boundary between features: what was promised, what was delivered, and what patterns were established. Implementation details stay inside the feature that built them.

Data layer completed DONE

REQUIREMENTS.md what was promised

QA_REPORT.md what was delivered

DESIGN.md patterns and schemas

UI layer starting depends on: data layer

/ba reads predecessor output knows what the data layer promised and delivered

/ux reads predecessor design picks up component patterns and data schemas

Only three documents cross the feature boundary. The execution plan declares which dependencies to load. Implementation details stay encapsulated.

The execution graph also serves as the bridge to the Agent Dashboard. The dashboard's plan viewer renders the graph as a visual task tracker, showing which tasks are done, in progress, or blocked. For larger projects, the dashboard becomes less about monitoring individual autopilot runs and more about navigating the complexity of the overall plan: which phases are complete, what's blocked, where the project actually stands.

The Agent Dashboard's plan viewer showing an execution graph rendered as a visual task tracker with phases collapsed and expanded, progress bars, task states, and gate evaluation results

Execution graph rendered in the Agent Dashboard's plan viewer

Reflections

There's a maturity curve to working this way. Early on I wanted the pipeline close, human checkpoints on everything, full permission prompts, reviewing every output. As the setup matured and I saw enough cycles pass the gates cleanly, I started loosening the grip: broader permissions, lighter review modes, tasks flagged for autopilot. I didn't plan that progression. It happened because trust builds from evidence, and the pipeline generates evidence at every phase. Getting autopilot to work was a concrete example: the first version ran with every permission dangerously approved just to prove the flow could complete end-to-end. Then I narrowed it down, reviewing the logs to see what the pipeline actually needed, removing everything it didn't, until the permission surface matched the real requirements. Turns out it's easier to loosen a working system than to guess the right constraints upfront.

This works for a solo builder. Scaling it to a team would be a different problem. The principles carry over (phase gates, deterministic verification, no self-review), but the commands and agents are shaped around how I work, not how a team coordinates. The interesting part isn't the tooling. It's the underlying questions: how do I structure AI work so that output is verifiable? How do I build trust in autonomous processes incrementally? Where does human judgement add value, and where does it just add latency? Those questions don't change when the team gets bigger. The answers just get harder.

What went wrong

Autopilot went through two redesigns in about two days. The first version tried to scrape tmux terminal output to track what the pipeline was doing. I wanted the full history including hooks, permission prompts, everything. It kept breaking on ANSI escape codes and partial output. I went through many iterations trying to make it work before discovering that Claude Code already writes a structured log of every action it takes. Once I switched to reading that data, the core of autopilot came together quickly. Adding the non-Claude parts like phase transitions and dashboard integration was straightforward after that. The lesson was simple: I spent two days fighting scraped terminal text when the structured data was already there.

The command surface area grew organically as new needs appeared, and the supporting commands probably need consolidation. Sometimes the right engineering decision is to delete things, even things that work.

View on GitHub

tomashermansen-lang/claude-code-pipeline

→

Next project

Agent Dashboard

Real-time monitoring for autonomous AI pipelines