Orchemist Launch Series • Part 5 of 5

Published April 25, 2026 · 13 min read

From Factory to Dark Factory: The Orchemist Roadmap (And Why V1 Will Build V2)

By Conny Lazo

Builder of AI orchestras. Project Manager. Shipping things with agents.

#AI #Orchemist #roadmap #Go #dark factory #open source

The factory has learned to diagnose its own fevers. What it hasn't learned — yet — is to run while I sleep. That's the roadmap.

Over four previous articles, I've walked you through how Orchemist went from a first pipeline run on February 15 to a system that ships code with zero manual keystrokes. Two months in, the numbers have moved: 350+ issues shipped, 7,000+ tests passing, 230+ pull requests merged, 300+ pipeline runs across content, coding, research, and compliance workflows, a 0.993 average quality score, and still zero lines of manual code. Today we talk about where this machine is going — and why I trust the trajectory, because most of it is already proven, and the parts that aren't are honest about being unproven.

This is earned ambition, not pitch deck ambition. Every claim below is either shipped, in progress, or explicitly labeled as planned. I'll tell you which is which.

Where We Are: Level 4, Self-Healing

Sprint 3 shipped twelve issues, averaged a 0.993 quality score, and added 443 new tests — but the headline wasn't the numbers. The headline was what happened when things broke.

Three self-healing components went live: a DiagnosisEngine that classifies why a pipeline failed, a RegressionDetector that identifies which commit broke things using file-overlap scoring, and a RegressionFixer that spawns a fix pipeline automatically. A SafetyGuard prevents infinite fix loops — three attempts max, then a human gets a tap on the shoulder. And confidence-gated routing decides what merges automatically, what needs review, and what gets retried.
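To make "file-overlap scoring" concrete, here is a minimal Python sketch. The article doesn't give the actual formula, so the Jaccard overlap used here is an assumption, and every function and variable name is hypothetical. Only the three-attempt limit comes from the text above.

```python
# Illustrative sketch only: not Orchemist's actual RegressionDetector or SafetyGuard.

def overlap_score(commit_files, failing_files):
    """Jaccard overlap between a commit's files and the files implicated in a failure (assumed metric)."""
    commit_files, failing_files = set(commit_files), set(failing_files)
    union = commit_files | failing_files
    if not union:
        return 0.0
    return len(commit_files & failing_files) / len(union)


def rank_suspects(commits, failing_files):
    """Return (sha, score) pairs sorted from most to least suspicious."""
    scored = [(sha, overlap_score(files, failing_files)) for sha, files in commits.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


MAX_FIX_ATTEMPTS = 3  # from the article: three attempts max, then a human is notified


def next_action(attempts_so_far):
    """Decide whether another fix pipeline may be spawned."""
    if attempts_so_far < MAX_FIX_ATTEMPTS:
        return "spawn_fix_pipeline"
    return "escalate_to_human"


# Hypothetical commit history: sha -> files touched
commits = {"a1": ["cli.py", "queue.py"], "b2": ["docs/readme.md"]}
print(rank_suspects(commits, ["cli.py"]))
```

The commit that touched the failing file ranks first. A real detector would presumably weigh more signals than file names alone.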

The routing table is simple and honest: score above 0.95 with passing tests and no security findings? Auto-merge. Between 0.70 and 0.95? Human review. Between 0.40 and 0.70? Auto-retry with an escalation strategy: a different model, a split task, or added context. Below 0.40? Reject outright.
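The routing table fits in a few lines of Python. This is a sketch, not the shipped router: the function name and return labels are hypothetical, the bands are read as non-overlapping (retry from 0.40 up to 0.70, reject below 0.40), and the behavior at exact boundaries is an assumption. A high score with failing tests or security findings falls back to human review here, which is also an assumption.

```python
# Illustrative sketch of confidence-gated routing; names and boundary handling are assumptions.

def route(score, tests_pass, security_findings):
    """Map a quality score plus gate results to one of four routing outcomes."""
    if score < 0.40:
        return "reject"
    if score < 0.70:
        return "auto-retry"  # escalation: different model, split task, add context
    if score > 0.95 and tests_pass and security_findings == 0:
        return "auto-merge"
    return "human-review"


for case in [(0.97, True, 0), (0.97, False, 0), (0.80, True, 0), (0.55, True, 0), (0.20, True, 0)]:
    print(case, "->", route(*case))
```

Checking the reject and retry bands first keeps the auto-merge branch as the only path that can merge without a human.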

That's the theory. Here's what actually happened.

Incident #383: The Scoring Bug. A pipeline scored 0.400 on legitimate builds. Panic-inducing number. Root cause? The rubric expected raw code output, but the pipeline was producing summaries. The grader failed the exam because it brought the wrong answer key. Classic case of judging a fish by its ability to climb a tree — except I built the tree, then forgot what kind of fish I had. Fix: change the rubric, not the pipeline. Lesson: know what you're measuring.

Incident #393: Branch Drift. A builder agent modified main's working tree instead of the feature branch. Broke the CLI for five minutes. The surgeon operated on himself: the agent cut into the very tree it was supposed to leave untouched. Root cause: no git checkout in the implementation phase. Fix: template update, not code change. The factory diagnosed its own process failure.

These aren't embarrassing stories. They're evidence. When a machine makes a mistake, diagnoses the mistake, and fixes the mistake without a human typing a single character, that's exactly the capability the roadmap is scaling.

The self-healing chain — CI failure diagnosed, regression fixer spawned, confidence-gated routing decides auto-merge vs review vs retry vs reject

But let's be clear about the current score. The ROADMAP defines an Autonomy Index — the percentage of work that completes without human intervention. Today, that number is effectively near zero for end-to-end workflows. Every pipeline is manually triggered. Every result is manually routed. The machinery for autonomy is built and tested; the trust calibration that lets it run unsupervised is Sprint 4's job. The car is assembled. We're calibrating the brakes.
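As a rough illustration, the Autonomy Index can be computed from a run log like this. The record shape and the `human_touched` field are hypothetical. The ROADMAP's exact definition of "human intervention" isn't reproduced here.

```python
# Illustrative sketch: the run-record fields are assumptions, not Orchemist's schema.

def autonomy_index(runs):
    """Percentage of runs that completed without any human intervention."""
    if not runs:
        return 0.0
    autonomous = sum(1 for run in runs if not run["human_touched"])
    return 100.0 * autonomous / len(runs)


# Today's situation as described above: every run is manually triggered and routed.
today = [{"human_touched": True} for _ in range(4)]
print(autonomy_index(today))
```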

MCP Server: The Pipeline Becomes Infrastructure

Here's a question: what happens when the pipeline stops being something you use and starts being something that runs?

The answer is MCP — the Model Context Protocol — and it's how Orchemist graduates from CLI tool to invisible substrate.

Orchemist v2 exposes pipeline operations as MCP tools: orchemist/launch_pipeline, orchemist/get_status, orchemist/list_templates. What this means in practice: Claude Code, Cursor, Continue, or any MCP-capable development environment can trigger Orchemist pipelines natively — without orch launch, without a terminal, without knowing Orchemist exists.
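For a sense of what an MCP client sends under the hood, here is a sketch of a tool invocation. MCP uses JSON-RPC 2.0, and tools are invoked with the `tools/call` method. The tool name comes from the list above, but the `template` and `issue` arguments are hypothetical, since the v2 tool schemas aren't published in this article.

```python
# Illustrative sketch: builds the JSON-RPC message only; it does not contact a server.
import json


def tool_call_request(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Argument names below are assumptions for illustration.
request = tool_call_request(1, "orchemist/launch_pipeline", {"template": "coding", "issue": 892})
print(json.dumps(request, indent=2))
```

An IDE or coding agent would normally build this message through its MCP client library rather than by hand.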

Today you type a command. Tomorrow, your coding agent sees you open a GitHub issue and asks if you'd like to run the coding pipeline. The day after, you don't see it at all — the pipeline just runs. The interface that disappears is the interface that works.

Beyond MCP, v2 also exposes an A2A (Agent-to-Agent) endpoint with a published agent card describing capabilities: pipeline execution, code review, quality scoring, git lifecycle management. Other AI agents can invoke Orchemist as a peer — not through a CLI, not through a webhook, but as an equal in a network of agents that negotiate work between themselves.

The foundation is already shipped. The REST API and SSE streaming work today in v1 — FastAPI backend, live progress events, template CRUD. V2 layers MCP, A2A, WebSocket, and GitHub webhook endpoints on top of that existing surface. Backward compatible. Everything you built with the v1 API still works.
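To show what consuming live progress events involves, here is a small parser for the Server-Sent Events wire format. The format itself is a web standard. The event names and JSON payload fields in the sample are hypothetical and not taken from the v1 API.

```python
# Illustrative sketch: parses SSE text; event names and payloads are assumptions.
import json


def _field_value(line, name):
    """Return the text after 'name:', dropping one leading space if present."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


def parse_sse(stream_text):
    """Split an SSE stream into (event_type, decoded_json_payload) tuples."""
    events = []
    event_type, data_lines = "message", []
    for line in stream_text.splitlines():
        if line == "":  # a blank line dispatches the buffered event
            if data_lines:
                events.append((event_type, json.loads("\n".join(data_lines))))
            event_type, data_lines = "message", []
        elif line.startswith(":"):  # comment line, often used as a keep-alive
            continue
        elif line.startswith("event:"):
            event_type = _field_value(line, "event")
        elif line.startswith("data:"):
            data_lines.append(_field_value(line, "data"))
    return events


SAMPLE = (
    ": heartbeat\n"
    "event: phase_started\n"
    'data: {"phase": "spec"}\n'
    "\n"
    "event: phase_finished\n"
    'data: {"phase": "spec", "score": 0.88}\n'
    "\n"
)
print(parse_sse(SAMPLE))
```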

A factory you have to watch isn't done yet. MCP is how Orchemist stops being watched.

Orchemist as MCP server and A2A peer — Claude Code, Cursor, Continue, the Orchemist IDE, GitHub webhooks, and peer agents all invoking pipeline tools

Web Dashboard & Visual Builder

The CLI is fine. orch watch is great. But there are questions a scrolling terminal cannot answer: Is this template's success rate trending up or down? Where do pipelines spend the most time? Which phases fail most often?

The Web Dashboard answers those questions visually. It's not a cosmetic layer — it's the data surface that makes autonomous trust calibration possible. You can't improve what you can't measure, and you can't trust what you can't see.

Here's what's already shipped: a dashboard with template cards, run details with live progress streaming, and template CRUD via the API. Here's what's coming:

Visual Pipeline Builder — drag phases onto a canvas, connect them with transition edges, export valid YAML. No more hand-editing template files. The complexity of pipeline design drops from "read the schema docs" to "connect boxes." You can drag boxes around and call yourself a software architect. The assistant just makes sure the boxes connect correctly.

Run Analytics Dashboard — success rates by template, average phase duration, token usage trends, failure hotspots. This is the data that feeds confidence-based routing and self-healing. The dashboard isn't decorative; it's the nervous system.

Human-in-the-Loop Approval Queue — a reviewer sees phase output, rubric, and score side by side. Click Approve, Revise, or Reject. The HITL pause/resume machinery is already in the sequencer; the UI makes it accessible to someone who isn't watching a terminal.

Template Marketplace — browse, search, install, and rate community templates. Now that Orchemist is open source, templates become the currency of the ecosystem.
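The Run Analytics Dashboard item above boils down to aggregations over run records. A minimal sketch follows. The record fields (`template`, `status`, `failed_phase`) are hypothetical stand-ins for whatever the real store records.

```python
# Illustrative sketch: record fields are assumptions, not Orchemist's actual schema.
from collections import defaultdict


def success_rate_by_template(runs):
    """Fraction of successful runs per template."""
    totals, passed = defaultdict(int), defaultdict(int)
    for run in runs:
        totals[run["template"]] += 1
        if run["status"] == "success":
            passed[run["template"]] += 1
    return {template: passed[template] / totals[template] for template in totals}


def failure_hotspots(runs):
    """Phases ordered by how often they were the point of failure."""
    counts = defaultdict(int)
    for run in runs:
        if run["status"] != "success":
            counts[run["failed_phase"]] += 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


RUNS = [
    {"template": "coding", "status": "success", "failed_phase": None},
    {"template": "coding", "status": "failed", "failed_phase": "review"},
    {"template": "content", "status": "success", "failed_phase": None},
    {"template": "coding", "status": "failed", "failed_phase": "review"},
    {"template": "coding", "status": "failed", "failed_phase": "implement"},
]
print(success_rate_by_template(RUNS))
print(failure_hotspots(RUNS))
```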

The v2 API foundation underneath all of this is a Go rewrite: chi HTTP router replaces FastAPI, Go's native http.Flusher replaces sse-starlette, and the Next.js web UI compiles directly into the binary via go:embed. No separate static file server. No deployment headaches.

The web dashboard, visual pipeline builder, and approval queue — template cards with success rates, drag-and-drop phase canvas exporting valid YAML, HITL review list

The Dark Factory Vision: Level 5 Autonomy

This is the section where I either sound visionary or delusional. I'll let the specifics decide.

The ROADMAP defines five levels of autonomy. The four from today onward:

Level 4 (today): 0% Autonomy Index. Every pipeline is manually triggered and reviewed. The engine works; the human decides everything.

Level 4.5: 20–40% Autonomy Index. Event triggers work. Humans review most results, but routine high-confidence work starts merging on its own.

Level 4.8: 50–70% Autonomy Index. High-confidence work auto-merges. Failures self-heal. Humans focus on the interesting problems.

Level 5: 80–95% Autonomy Index. Issues become deployments with minimal human touchpoints. The dark factory.

Note the ceiling: not 100%. The ROADMAP is explicit: "Some work should always require human judgment." This isn't a limitation. It's a design decision.

Here's the North Star scenario, pulled directly from the roadmap:

9:47 PM: Someone opens issue #892, "CSV export truncates rows > 10,000." Orchemist classifies it (confidence: 0.91), spawns the coding pipeline.

9:52 PM: Spec agent reads the issue, identifies affected files, produces a plan scoring 0.88, above the auto-proceed threshold.

10:14 PM: Implementation complete. Review flags a missing edge case. Fix pipeline runs. Re-review scores APPROVE. Tests pass. PR #347 created with full provenance.

10:16 PM: Confidence 0.93 exceeds the auto-merge threshold of 0.90. CI green. PR merges. Issue closes.

10:18 PM: Post-merge CI passes. Deployment pipeline promotes to staging. Smoke tests pass.

10:22 PM: Change ships to production.

11:30 PM: Sentry reports a new error. Orchemist detects the regression, creates issue #893, spawns a fix pipeline. By midnight, resolved.

Meanwhile — two content pipelines ran for the blog. A documentation pipeline updated the API reference. A research pipeline flagged its output for human review because confidence was 0.74, below the auto-publish threshold.

You review three items that needed human judgment. Everything else handled itself. The factory ran in the dark.

That's the vision. And to be clear about its scope: this has been proven on one codebase — Orchemist's own. Whether it generalizes to messy, legacy, multi-team repos is Sprint 5's question. We're not pretending otherwise.

Now here's the honesty about guardrails: they're as important as the automation. Non-negotiable rules: no auto-merge without CI green, period. No production deploy without a staging soak. Human-reviewable audit trail for every merge. Budget hard caps that cannot be overridden. Trust starts at zero for new repos and templates. And the kill switch: orch factory stop pauses all autonomous triggers immediately.
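One way to read those rules is as a chain of checks that must all pass before anything merges on its own. The sketch below is an illustration under assumptions: the class, field names, and check order are hypothetical, and "trust starts at zero" is simplified to a boolean that is false for new repos and templates.

```python
# Illustrative sketch of an auto-merge gate; not the shipped guardrail code.
from dataclasses import dataclass


@dataclass
class MergeCandidate:
    ci_green: bool
    confidence: float
    has_audit_record: bool
    repo_trusted: bool  # hypothetical simplification: False for new repos and templates


def auto_merge_decision(candidate, threshold, factory_stopped=False):
    """Return (allowed, reason). Every guardrail must pass before an automatic merge."""
    if factory_stopped:  # the kill switch wins over everything else
        return (False, "factory stopped")
    if not candidate.ci_green:
        return (False, "CI not green")
    if not candidate.has_audit_record:
        return (False, "missing audit trail")
    if not candidate.repo_trusted:
        return (False, "no earned trust yet")
    if candidate.confidence < threshold:
        return (False, "confidence below threshold")
    return (True, "ok")


candidate = MergeCandidate(ci_green=True, confidence=0.93, has_audit_record=True, repo_trusted=True)
print(auto_merge_decision(candidate, threshold=0.90))
print(auto_merge_decision(candidate, threshold=0.90, factory_stopped=True))
```

Returning a reason alongside the verdict keeps each refusal explainable in the audit trail.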

The ROADMAP also lists "retry loops that burn $1,000+" as a known risk with high likelihood. That's the kind of honesty you only get from someone who watched a machine spend an uncomfortable afternoon trying to retry its way through a rate limit. The guardrails aren't theoretical. They're scar tissue.
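A hard budget cap on a retry loop can be sketched as follows. This assumes each attempt reports its own cost and that a per-attempt cost estimate is available up front. All names and dollar figures are hypothetical.

```python
# Illustrative sketch of a budget-capped retry loop; not Orchemist's implementation.

def run_with_budget(attempt_fn, budget_cap_usd, est_cost_usd, max_attempts=3):
    """Retry until success, but never start an attempt that could exceed the cap."""
    spent = 0.0
    for attempt in range(1, max_attempts + 1):
        if spent + est_cost_usd > budget_cap_usd:
            return ("budget_exhausted", attempt - 1, spent)
        succeeded, cost = attempt_fn(attempt)
        spent += cost
        if succeeded:
            return ("succeeded", attempt, spent)
    return ("escalated", max_attempts, spent)


def always_rate_limited(attempt):
    """Hypothetical failing attempt that still costs money."""
    return (False, 4.0)


print(run_with_budget(always_rate_limited, budget_cap_usd=10.0, est_cost_usd=4.0))
```

Checking the budget before an attempt starts, rather than after it finishes, is what makes the cap hard rather than advisory.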

The five levels of autonomy — Level 4 today with near-zero Autonomy Index, Level 5 the dark factory at 80–95%, with an explicit ceiling below 100% by design

V2: The Factory Builds the Factory

Here's where the recursive mind-bending happens.

V1 is Python. It works. It has shipped 350+ issues. But Python's Global Interpreter Lock means that the entire concurrency.py — 597 lines of it — is workarounds for the fact that ThreadPoolExecutor provides concurrent I/O but not parallel execution. The queue.py module adds another 623 lines. Together with progress tracking and heartbeat monitoring, that's 1,994 lines of code that exist solely because Python's concurrency model isn't native to the problem.

V2 is Go. Goroutines replace ThreadPoolExecutor. Channels replace the queue. context.Context replaces abort_event = threading.Event(). Those 1,994 lines don't get rewritten — they disappear, because Go's standard library provides them for free.
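For readers who haven't met the V1 pattern, here is a stripped-down sketch of cooperative cancellation with ThreadPoolExecutor and a shared threading.Event. It is not the actual concurrency.py. The phase names and step counts are made up, and only `abort_event = threading.Event()` is quoted from the text above.

```python
# Illustrative sketch of the V1-style abort pattern; not Orchemist's concurrency.py.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

abort_event = threading.Event()


def run_phase(name, steps):
    """Simulated phase that checks the shared abort flag between units of work."""
    done = 0
    for _ in range(steps):
        if abort_event.is_set():
            return (name, "aborted", done)
        time.sleep(0.01)  # stands in for I/O such as an LLM call
        done += 1
    return (name, "finished", done)


def demo():
    """Run a short and a long phase, then abort the long one once the short one ends."""
    abort_event.clear()
    with ThreadPoolExecutor(max_workers=2) as pool:
        spec = pool.submit(run_phase, "spec", 3)
        build = pool.submit(run_phase, "build", 10_000)
        spec_result = spec.result()  # wait for the short phase to finish
        abort_event.set()            # then tell every running phase to stop
        build_result = build.result()
    return spec_result, build_result


print(demo())
```

Every worker has to remember to poll the flag. In Go, a cancelled context.Context plays the same role and is passed explicitly through each call.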

The numbers across the full codebase: roughly 20,000 lines of Python become an estimated 10,500 lines of Go. A 47% reduction. The resulting binary is a single ~20MB file with zero external dependencies. Install it with curl | sh. Run it with ./orch serve. That binary embeds the CLI, daemon, API server, web UI, template engine, git integration, SQLite store, and plugin host. Everything.

LLM calls stay in Python — that's where the ecosystem is. The Go core communicates with Python executors via gRPC over Unix socket or TCP. Fast compiled core for orchestration and scheduling; Python plugins for LLM calls and custom scoring. The Terraform provider model: the core is fast, the plugins bring breadth.

Now here's the part that closes the loop on the entire series.

The ARCHITECTURE.md lists its intended audience: "AI code generation agents building v2 from this spec." I still wrote every word of that architecture document — the machine builds from human intent, not from nothing. But the manual's reader isn't a human developer. It's the Opus agents that V1's coding pipeline will spawn. The plan is for V1 to generate V2: to read the architecture spec, plan the implementation, write the Go code, review it, test it, and ship it — using the same pipeline that built V1.

That pipeline hasn't run yet. When it does, it'll be the most important test Orchemist has ever faced. The student becomes the teacher. The tool builds the better tool. This is the recursive payoff for every piece of infrastructure described in this series.

V1 Python monolith evolving into V2 Go core with Python plugin executors over gRPC — 47% line reduction, single 20MB binary, GIL workarounds disappear

The Closing Ledger

Here are the numbers, because numbers are the only honest language left.

350+ issues shipped. 7,000+ tests passing. 0.993 average quality score. 230+ pull requests merged. 300+ pipeline runs across content, coding, research, and compliance workflows. Zero lines of manual code — only DX fixes typed by a human hand. Two months from first pipeline run to a self-healing orchestration engine with a roadmap to full autonomy. All of this from a single builder in Vienna, four executor backends to pick from (Anthropic, OpenRouter, OpenClaw, dry-run), and a conviction that the factory should build itself.

The closing ledger — 350+ issues, 7,000+ tests, 0.993 average score, 230+ PRs, 300+ pipeline runs, zero manual code, two months, four executor backends

Orchemist is now open source under the MIT License: github.com/ToscanAI/orchemist. Install it with pip install orchemist. The Go rewrite described above is the next chapter, a parallel track that will eventually replace the Python core. If you've read this series and thought "I want that for my projects," the invitation is open.

This is the end of the five-part series. We started with an idea: what if you never typed code again, and the software still got better? We explained why unchecked AI output is dangerous. We built the trust pattern that grades before it ships. We anchored it in the spec discipline that makes AI-written code reviewable. We showed you the doors into the system — CLI today, chatbot today, IDE soon — and the four executor backends behind them. And now we've drawn the map to the dark factory — the place where issues become deployments and the lights stay off because nobody needs to be there.

The factory runs. The tests pass. The future builds itself.

Sources

[S1] LOGBOOK.md: The Orchemist Chronicles. Location: internal project logbook. Authors: Conny Lazo & Toscan (AI). Documented events from February–March 2026. Primary firsthand source for all sprint data, incident reports, and project metrics.

[S2] ROADMAP.md: Orchemist Roadmap, Level 3 → Level 5 (Dark Factory). Location: ToscanAI/orchemist, ROADMAP.md. Maintainer: Conny Lazo. Primary source for autonomy levels, phase definitions, risk table, and the North Star scenario.

[S3] ARCHITECTURE.md: Orchemist Architecture Document. Location: ToscanAI/orchemist, docs/ARCHITECTURE.md. Design specification covering pipeline sequencing, executors, MCP/A2A integration, and module structure. Intended audience: humans and AI code-generation agents alike.

Conny Lazo is a builder and AI-powered IT leader based in Vienna. Portfolio: connylazo.com
