
Building Your Own AI Orchestration Engine: From Tinkerer to Railroad Builder
By Conny Lazo
Builder of AI orchestras. Project Manager. Shipping things with agents.
I am a day-one ChatGPT user who spent three years working my way up the AI stack until I'd built what a reasonable person might describe as the world's most elaborate LinkedIn post generator. My own orchestration engine — because apparently three and a half weeks of obsessive development wasn't a personality trait, it was a research methodology.
If that sentence doesn't make you reconsider what you're optimizing your agents for, the next 2,800 words probably won't help either — but I'm going to try.
The Tinkerer Who Couldn't Stop Taking Things Apart
I got my first console at age four. By twelve, I had my first computer, and instead of playing games on it like a normal child, I was taking it apart. My parents were delighted, presumably because they had always wanted a child who would void warranties as a hobby.
This pattern has never changed. Whenever I encounter new technology, my first instinct is to take it apart, put it back together, and play with it until I understand how it works. I wouldn't say that I'm an engineer. But I would say that I am an engineer at the same time. This is a perfectly coherent statement if you don't think about it too hard.
By professional background, I'm a project manager and system architect. I worked at Allegro, where I managed billing systems, authorization flows, and led a development team. I understand ERDs, data flows, algorithms — I can have a full technical conversation with any developer in the room. I just can't write the code myself. Think of it as being fluent in a language you can only speak, never write. Except the language is Python, and the accent is YAML.
When ChatGPT launched, I didn't just use it — I started climbing the AI stack. First prompting, then chaining, then building multi-agent systems. I built Text Grand Central, a multi-agent translation system, before realizing I needed something more fundamental: not just agents that could do things, but a framework that ensured those things were done well.
On February 5, 2026, I spawned my first AI agent session. In the voice brief I recorded for this article, I called it "the day Toscan was born." Toscan runs on Claude Opus 4.6, and from the first hour of working together, I immediately saw the drawbacks of working without a framework. Here was this extraordinarily capable AI — fast, articulate, knowledgeable — making confident mistakes with the serene self-assurance of a junior developer who just passed their first code review.
I see Toscan as exactly that: a junior developer. He makes mistakes, and sometimes he is very sure of those mistakes. He's very confident about what he's doing. He thinks he's right. If you've ever managed developers, you've met this person. The difference is that Toscan works twenty-four hours a day and doesn't ask for equity.
The solution wasn't to make Toscan less confident. The solution was to build rails — a framework that would catch the confident mistakes before they shipped. I needed a railroad.
The Railroad Metaphor (and Why I'm Shoveling Coal)

The metaphor I keep coming back to is railroads. The orchestration engine isn't the train. It's the tracks, the switches, the signaling system. Without tracks, a train is just a very expensive piece of metal sitting in a field. With tracks, it goes where you need it to go, at speed, without falling off a cliff.
I want to provide the railroad system — the framework — so that AI-generated code and content runs on tracks, not off cliffs.
Now, I could have used LangGraph. I could have used CrewAI or Microsoft's Agent Framework (which merged AutoGen and Semantic Kernel in October 2025, and promptly put both predecessor frameworks into maintenance mode — security patches only) (Microsoft Learn, 2025). I could have used Dify's visual builder and had something running by lunchtime.
But then I'd have a working pipeline and zero understanding of why it worked. This seemed like exactly the wrong tradeoff for a person who dismantled their first computer at age twelve.
I want to understand what I'm doing. I want to understand what I'm building. I want to use what I'm building. The act of building forces understanding. Using someone else's framework gives results without insight. There are three orchestration engines out there that serious people use. I am not trying to build the fourth. I am trying to understand what it feels like to build one, which turns out to be completely different from using one, in the same way that understanding how to cook is different from knowing how to eat.
Vienna coffee shops are full of people with orchestration opinions and no orchestration engines. I decided to be the opposite.
Architecture: YAML, Model Tiers, and 889 Tests
The engine is YAML-based. Pipeline logic lives in YAML files, not scattered across Python functions. Think of it as Docker Compose for AI pipelines: if you can read it, you can run it, version it, and trust it. Every phase's purpose is visible and auditable — no buried logic, no hidden prompts, no "it works but nobody knows why."
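To make the idea concrete, here is a minimal sketch of what such a pipeline definition might look like. The field names (`phases`, `model_tier`, `gate`, `prompt_template`) are my illustration of the shape, not the engine's actual schema:

```yaml
# pipeline.yaml — illustrative sketch; field names are assumptions,
# not the engine's real schema
pipeline: example
phases:
  - name: research
    model_tier: sonnet
    prompt_template: prompts/research.md
  - name: review
    model_tier: opus
    gate: reviewer_approved   # phase-level quality gate, visible and auditable
```

The point of the format is exactly what the Docker Compose comparison implies: the whole pipeline is one readable, versionable file.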
The core architecture runs on three principles:
Phase-based sequencing. Each pipeline moves through defined phases, each with a specific agent, model, and prompt template. The content pipeline I use for articles like this one has eight phases: research, writing, fact-checking, mechanical revision, editorial revision, red team, a mandatory twenty-four-hour cooldown, and human review where I — the human — actually publish. The agent never publishes. That's a feature, not a limitation.
Model tier routing. Not every task needs the most powerful model. Haiku handles mechanical work — fixing typos, reformatting structure, tasks where speed matters more than judgment. Sonnet handles research and writing — synthesis tasks that require reasoning but not the deepest kind. Opus handles judgment — code review, editorial review, quality assessment, anything where getting it wrong is costly. This isn't just cost optimization; it's architectural honesty about what different tasks actually require.
Circuit breaker patterns. The engine tracks failure counts per task-type and model-tier combination. If Haiku fails at something, it escalates to Sonnet. If Sonnet fails, it escalates to Opus. The system learns where its limits are, not by becoming smarter, but by routing around its own weaknesses.
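A minimal sketch of that escalation logic, with the tier names from the article and a failure threshold that is my invention for illustration:

```python
# Illustrative circuit-breaker routing: escalate a task up the model
# tiers when a (task_type, tier) combination keeps failing.
# The threshold value is an assumption, not the engine's real setting.
from collections import defaultdict

TIERS = ["haiku", "sonnet", "opus"]          # cheapest to most capable
FAILURE_THRESHOLD = 3                        # failures before escalating

failure_counts = defaultdict(int)            # keyed by (task_type, tier)

def record_failure(task_type: str, tier: str) -> None:
    failure_counts[(task_type, tier)] += 1

def route(task_type: str, preferred_tier: str) -> str:
    """Return the first tier at or above the preferred one whose
    failure count for this task type is still under the threshold."""
    start = TIERS.index(preferred_tier)
    for tier in TIERS[start:]:
        if failure_counts[(task_type, tier)] < FAILURE_THRESHOLD:
            return tier
    return TIERS[-1]                         # opus is the last resort

# Haiku keeps failing at summarization, so the router escalates.
for _ in range(3):
    record_failure("summarize", "haiku")
print(route("summarize", "haiku"))           # → sonnet
```

The system never gets smarter here; it just stops sending work where work has been failing.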
The whole thing runs on OpenClaw as the execution substrate, with a SQLite database for task queue management — WAL mode, thread-safe, because even a tinkerer's side project shouldn't lose data during a crash.
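A crash-tolerant queue on SQLite is less exotic than it sounds. Here is a minimal sketch, assuming a simple two-column schema that is my invention, not the engine's:

```python
# Minimal sketch of a task queue on SQLite in WAL mode, in the spirit
# of the engine described above. Schema and statuses are assumptions.
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "queue.db")
conn = sqlite3.connect(db_path, check_same_thread=False)
conn.execute("PRAGMA journal_mode=WAL")      # readers don't block the writer
conn.execute("""
    CREATE TABLE IF NOT EXISTS tasks (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'queued'
    )
""")

def enqueue(payload: str) -> int:
    with conn:                               # transaction: commit or roll back
        cur = conn.execute("INSERT INTO tasks (payload) VALUES (?)", (payload,))
        return cur.lastrowid

def claim_next():
    """Claim the oldest queued task inside a single transaction."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM tasks WHERE status='queued' "
            "ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE tasks SET status='running' WHERE id=?", (row[0],))
        return row

enqueue("research: railroad metaphors")
enqueue("write: section draft")
print(claim_next())                          # → (1, 'research: railroad metaphors')
```

WAL mode is the part that matters for a side project that must not lose data: writes go to a separate log that survives a crash mid-transaction.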
It has 889 tests. Most of them pass.
That last sentence is doing more work than all the other tests combined, but the point is real: the engine supports five task types, three executor modes, multi-provider support across Anthropic, OpenClaw sub-agents, and Gemini — and every piece of it is tested. Not because I'm obsessive about testing (I am), but because an untested orchestration engine is just a very organized way to generate unreliable output.
The coding pipeline follows the same philosophy: requirements interpretation, implementation, code review by Opus, automated fixes, test generation, security audit, and human review before merge. Every phase has a gate. Every gate has criteria. Nothing ships by vibes alone.
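The gate idea reduces to a small predicate: a phase passes only if every one of its criteria does. A sketch with criterion names I invented for illustration:

```python
# Every phase has a gate; every gate has criteria. A gate passes only
# when every criterion does — nothing ships by vibes alone.
# Criterion names here are illustrative, not the engine's real ones.

def gate_passes(results: dict, required: list) -> bool:
    """A gate passes only if every required criterion evaluated True.
    Missing criteria count as failures, never as passes."""
    return all(results.get(criterion, False) for criterion in required)

merge_gate = ["tests_green", "opus_review_approved",
              "security_audit_clean", "human_approved"]

results = {"tests_green": True, "opus_review_approved": True,
           "security_audit_clean": True, "human_approved": False}

print(gate_passes(results, merge_gate))      # → False: no human sign-off, no merge
```

The design choice worth noting: absence of evidence is treated as failure, so a phase that forgets to report a criterion cannot sneak through the gate.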
Atomic Scrum: Full Process at Lightning Speed

Here's something I think is extremely important and rarely discussed in the AI-agent conversation: we work in a structured way. For everything we do, there is history. Issues. Tickets. Commits. Pull requests. Code reviews.
We basically have all the elements of scrum, in an atomic size, with lightning speed.
Every task gets a GitHub issue. No work starts without a ticket. The issue defines what, why, acceptance criteria, and which pipeline handles it. From there: issue to branch, branch to commit, commit to PR, PR to code review by Opus, and finally I approve the merge. At one point I was managing eight GitHub repos simultaneously, orchestrating thirteen agents overnight, and maintaining a sprint backlog that looked more like a doctoral thesis. I am, technically, a one-person engineering team. The "team" part is doing the heavy lifting.
The key insight isn't that AI can follow scrum — it's that shrinking scrum to atomic size removes all the overhead that makes scrum painful for humans. There are no standups because there's no miscommunication to resolve. There are no sprint retrospectives because the version history is the retrospective. Each commit tells you what changed, each PR tells you why, each review tells you whether it should have. The AI doesn't need status updates — the commit log is the status update.
When I hit a complex issue that was too large in scope, I did what any good PM would do: I broke it into smaller pieces. The scrum principle works even when your entire development team is an AI agent. Especially then — because AI is better at executing small, well-defined tasks than large, ambiguous ones. Decomposition isn't project management ceremony. It's an architectural requirement.
The Dark Factory Is Coming (and Nobody Has the Quality Gates)

Let me give you two data points that seem contradictory but are both true.
Data point one: A study by METR, a model evaluation and threat research organization, found that AI-assisted developers were 19% slower on average than those coding without AI assistance. These same developers believed they were 24% faster. They were wrong about both the direction and the magnitude (METR, July 2025).
Data point two: StrongDM, a company with a three-person engineering team, is shipping production-quality software at scale using scenario-based testing and AI agents. Their benchmark: one thousand dollars per engineer per day in tokens, or your factory has room for improvement (StrongDM Engineering Blog, 2025).
Both are true. The difference is the framework — though I'll note that StrongDM is one data point, not a proof. The mechanism they're using — scenario-based testing, quality gates, AI replacing human review — is reproducible. Whether my particular implementation of it is the answer is still a hypothesis I'm testing.
Traditional companies are bolting AI onto human-designed processes — standups, manual code review, sprint planning — and discovering that it creates friction, not speed. The AI is fast, but the process around it was designed for humans who think slowly and communicate imperfectly. Adding a fast thinker to a slow process doesn't make the process fast. It makes the fast thinker frustrated and the process confused.
We are heading toward what some call the "dark factory" — a manufacturing term for factories that run without human workers on the floor. In software, this means engineers will stop writing code. AI will write it. This is either terrifying or liberating depending on how much you enjoy debugging at two in the morning.
We rarely see weavers making t-shirts today. The t-shirts are made by machines. There will always be engineers who want to write code, just as there are still carpenters and couture designers. But most commercial software will be AI-generated. Dan Shapiro's five-level framework for AI-assisted programming maps this progression, from Level 2 — AI as junior dev, where roughly 90% of "AI-native" teams are stuck — through to Level 5, the dark factory, where only a handful of teams operate today (Dan Shapiro, 2026).
The gap between Level 2 and Level 5 isn't talent. It isn't budget. It's quality infrastructure. The constraint has moved from implementation speed to specification quality. And almost nobody is building the frameworks to bridge that gap.
Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027 (Gartner, June 2025). MIT and Fortune report that 95% of generative AI pilots are failing (Fortune/MIT, August 2025). Klarna's AI agent handled the equivalent workload of 853 full-time employees — and then Klarna started rehiring humans, because volume without quality creates a different kind of problem (Entrepreneur, May 2025; Nate Jones, Substack, February 2026).
The dark factory is coming. What's missing isn't the factory. It's the quality control.
Backlash as Building Material
I published an article about AI and literary translation. It received fifty-two comments, many from professional translators, and the general sentiment was not what I'd call enthusiastic agreement.
They were right.
Not about everything, but about enough. The fact-checking was shallow. Sources existed, but what those sources actually said wasn't always what I'd claimed they said. The backlash surfaced a real gap in my pipeline.
I wrote about this experience in "Building With The Bricks They Throw." The metaphor is literal: someone threw a brick. I looked at it. I used it.
The red team phase — Phase 6 of my content pipeline — exists specifically because of that translation article. Before it, the pipeline checked facts against sources. After it, the pipeline argues with its own output. A separate agent, given different instructions, tries to find holes. If it can't find them, the article is stronger. If it can, the article gets fixed before publication, not after.
It wasn't designed in advance. It was earned, added only after those well-founded critiques showed me exactly where the pipeline was thin.
The version history tells the whole story. Pipeline v2.1, v2.2, v2.3, v2.4, v2.5, v2.7 — each version number represents something that went wrong. The version history is the intent history.
And then came the best failure of all. Pipeline version 3.0 introduced seven agents, each optimized for risk reduction. The output was technically correct. Professionally polished. Indistinguishable from anyone else's.
The pipeline worked perfectly. It successfully automated the removal of everything that made me sound like me.
Every reviewer flagged the jokes as risks. Every editor smoothed out the rough edges. Nobody was tasked with protecting personality — so nobody did. I'd built a pipeline that could produce flawless content that no human would want to read.
The fix was YAML. I added a voice_style field to the pipeline configuration, updated the red team prompt to check for voice consistency alongside factual accuracy, and added a voice-check phase. The humor guardian — an agent specifically tasked with ensuring the writing doesn't sound like it came from a content mill — is now a first-class pipeline stage.
The scenario-based testing uses LLM judge graders: acceptance criteria defined in YAML with assertion gates and scored evaluations on a zero-to-one scale. Factual accuracy weighted at three, structural quality at two, tone at two, citation integrity at one. The agents never see the acceptance criteria while they work — the holdout principle, borrowed from machine learning evaluation, applied to content production.
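The weighting scheme above reduces to a plain weighted average. A sketch, with dimension names that are my paraphrase of the article's categories:

```python
# Aggregate LLM-judge scores with the weights named in the text:
# each dimension scored 0–1, combined as a weighted average.
WEIGHTS = {
    "factual_accuracy": 3,
    "structural_quality": 2,
    "tone": 2,
    "citation_integrity": 1,
}

def weighted_score(scores: dict) -> float:
    """Weighted average of per-dimension judge scores, each in [0, 1]."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS) / total_weight

scores = {"factual_accuracy": 0.9, "structural_quality": 0.8,
          "tone": 1.0, "citation_integrity": 0.5}
print(round(weighted_score(scores), 3))      # → 0.85
```

The weighting encodes an editorial judgment: a factual error costs three times as much as a sloppy citation, which is roughly how readers punish them too.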
Open Source, Web UI, and the Joy of Never Finishing
I'm going to make the engine entirely free. MIT license. Open source. Completely accessible to everyone.
The learning embedded in the system — 889 tests, eight content phases, the coding pipeline, the humor guardian, the red team structure — is not proprietary insight. It's lessons earned through real failures, encoded in a form that can be read, modified, versioned, and improved by anyone.
The engine is also going to have a web UI — a visual block builder where you can drag and drop pipeline phases without writing YAML. I want to make it accessible to everybody who wants to build software with AI agents but doesn't want to learn yet another configuration language. My wife Evy, who is a strategist and not a developer, used the same AI tools I use to build things for her work. If the tools are good enough for a strategist who's never written code, the framework should be too.
There's also a side agent planned that runs a simulated output before you commit to the full pipeline run, identifying gaps in your intent specification before you burn tokens on a real execution. Think of it as a dry run for your dry run — which is either brilliantly meta or evidence that I've been staring at pipeline architecture too long. Both interpretations are defensible.
Maybe I will never finish implementing the orchestrator. Maybe it's a never-ending loop. Three and a half weeks in, I'm aware this is early — I'm sharing the approach, not claiming victory. But every time I read about a new framework or process that could benefit it, I talk to Toscan and we implement it. The orchestrator keeps updating because the field keeps moving, and a tinkerer who stops tinkering is just someone with a lot of spare parts.
I'm not competing with LangGraph or CrewAI or AutoGen. I'm not building a product. I'm building understanding — and then sharing it with anyone who's curious enough to look.
The gap nobody fills right now is this: YAML-declarative pipeline definition with scenario testing built in, phase-level quality gates as first-class stages, and hardware-minimal deployment that runs on a Raspberry Pi. No existing framework offers all three. Not because it's technically impossible, but because the people building orchestration frameworks are optimizing for capability, not for quality assurance. They want to make agents more powerful. I want to make agents more reliable.
I started taking things apart at age four. Three and a half weeks ago, I started putting this one together. It has 889 tests, a humor guardian, a red team that argues with its own output, and a mandatory twenty-four-hour cooldown period because I learned the hard way that publishing at 2 AM feels brave and reads desperate.
The engine isn't finished. It might never be. But the railroad is being laid, one track at a time, and the trains are running on schedule.
If you're building with AI agents and tired of watching confident mistakes ship to production — or if you're a tinkerer who just wants to understand how these things work from the inside — the tracks will be open for everyone.
I'll be in the workshop. I've been there since I was four.
Sources
- METR Study on AI-Assisted Developer Productivity — METR (Model Evaluation and Threat Research), July 10, 2025. Finding: AI-assisted developers 19% slower on average; self-assessed as 24% faster. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
- StrongDM Software Factory Blog Post — StrongDM Engineering Blog, 2025. https://www.strongdm.com/blog/the-strongdm-software-factory-building-software-with-ai
- Microsoft Agent Framework (AutoGen + Semantic Kernel merger) — Microsoft Learn, October 2025. https://learn.microsoft.com/en-us/agent-framework/overview/agent-framework-overview
- Gartner: 40%+ of Agentic AI Projects Will Be Canceled by 2027 — Gartner, June 25, 2025. https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027
- MIT/Fortune: 95% of Generative AI Pilots Failing — Fortune, August 2025. https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/
- Klarna AI Agent and Rehiring — Entrepreneur, May 2025. https://www.entrepreneur.com/business-news/klarna-ceo-reverses-course-by-hiring-more-humans-not-ai/491396; Nate Jones, Substack, February 24, 2026. https://natesnewsletter.substack.com/p/klarna-saved-60-million-and-broke
- Dan Shapiro, "The Five Levels: from Spicy Autocomplete to the Dark Factory" — Dan Shapiro's Blog, January 2026. https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory/
- "Intent Engineering: The Missing Discipline in AI Agent Development" — Conny Lazo, LinkedIn, February 26, 2026.
- "Building With The Bricks They Throw" — Conny Lazo, LinkedIn, February 20, 2026.
- "The Dark Factory Gap" — Conny Lazo, LinkedIn, 2026.