Published March 12, 2026 · 8 min read

4 AI Labs Built the Same System Without Talking to Each Other

By Conny Lazo

Builder of AI orchestras. Project Manager. Shipping things with agents.

#AI #agents #orchestration #multi-agent #convergence #harness

Four companies walked into a bar. Different models, different cultures, different business incentives. They all ordered the same drink.

That's not a punchline. That's what Nate B Jones argues happened in early 2026: Anthropic, Google DeepMind, OpenAI, and Cursor independently arrived at the same four-step architecture for making AI agents actually useful — decompose the work, parallelize the execution, verify the outputs, iterate toward completion (Nate B Jones, 2026). Nobody coordinated. Nobody copied. Jones makes the case that the problem itself forced the solution. Which is either deeply reassuring or deeply unsettling, depending on how much you enjoy being right by accident.

(Quick caveat: much of what follows is sourced through Jones' video — an independent analyst synthesizing public announcements and blog posts. Where he could be wrong, I'll say so.)
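To make the four steps concrete, here is a toy sketch of my own, not code from any of the four labs. It sorts a list by splitting it into chunks (decompose), sorting chunks in threads (parallelize), checking each result with a deterministic test (verify), and retrying only the chunks that failed (iterate). The `worker` function is a hypothetical stand-in for an agent call, and it is deliberately wrong on its first attempt for odd-numbered chunks so the retry path gets exercised.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import heapq


def decompose(items, size):
    """Split the job into independent chunks."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def worker(index, chunk, attempt):
    """Hypothetical stand-in for an agent: fails on attempt 0 for odd chunks."""
    if attempt == 0 and index % 2 == 1:
        return list(chunk)  # simulated mistake: returns the chunk unsorted
    return sorted(chunk)


def verify(chunk, result):
    """Deterministic check: output is ordered and holds the same elements."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return in_order and Counter(result) == Counter(chunk)


def run(items, size=3, max_rounds=3):
    chunks = decompose(items, size)
    done = {}  # chunk index -> verified result
    for attempt in range(max_rounds):
        pending = [i for i in range(len(chunks)) if i not in done]
        if not pending:
            break
        with ThreadPoolExecutor() as pool:
            results = list(
                pool.map(lambda i: worker(i, chunks[i], attempt), pending)
            )
        for i, result in zip(pending, results):
            if verify(chunks[i], result):
                done[i] = result  # only verified work is kept
    if len(done) != len(chunks):
        raise RuntimeError("some chunks never passed verification")
    return list(heapq.merge(*(done[i] for i in range(len(chunks)))))


sorted_items = run([9, 3, 7, 1, 8, 2, 6, 5, 4])
```

The point of the toy is the shape, not the sorting: checking a result is much cheaper than producing it, and failed pieces are retried without redoing the ones that already passed.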

The "Jagged AI" Was Never Jagged

Figure: The jagged frontier flattens with scaffolding. Capability is uneven without structure but levels up sharply with the right harness.

For the past two years, the prevailing wisdom — popularized by researchers like Ethan Mollick — was that AI had a "jagged frontier": brilliant at some tasks, embarrassingly bad at others. Nate B Jones makes a compelling counter-case: "The jagged frontier was never an inherent property of AI intelligence. It was an artifact of how we were asking the AI to work" (Nate B Jones, 2026).

That's a strong claim. Mollick would push back — the empirical evidence for task-specific variability is real. But Jones' argument isn't that the frontier was smooth. It's that better scaffolding flattens it. That distinction matters.

We gave a capable analyst a napkin and 30 seconds, then concluded analysts were incompetent. That's like asking your teenager if they cleaned their room and accepting "yes" as proof.

Four Labs, One Architecture

Figure: Four labs built the same architecture independently. Anthropic, Google DeepMind, OpenAI Codex, and Cursor all decompose, parallelize, verify, and iterate.

Here's what each lab built, independently — per Jones' analysis:

Anthropic created an initializer agent that sets up environment state and a progress file. The coding agent makes incremental progress, leaving structured artifacts for the next session. Without that structure? The agent tries to one-shot everything, runs out of context, and leaves things worse than it found them (Nate B Jones, 2026).
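A rough sketch of the progress-file idea, under my own assumptions: the file name `progress.json`, its fields, and the one-task-per-session rule are illustrative choices, not Anthropic's actual format. The initializer writes the state once, and each session reads it, does one increment of work, and writes it back for the next session.

```python
import json
import tempfile
from pathlib import Path


def initialize(workdir, tasks):
    """Initializer: write a progress file listing every task as pending."""
    progress = {
        "tasks": [{"name": t, "status": "pending", "notes": ""} for t in tasks]
    }
    path = Path(workdir) / "progress.json"  # hypothetical file name
    path.write_text(json.dumps(progress, indent=2))
    return path


def run_session(path):
    """One session: load state, finish exactly one pending task, save state."""
    progress = json.loads(path.read_text())
    for task in progress["tasks"]:
        if task["status"] == "pending":
            task["status"] = "done"
            task["notes"] = f"completed {task['name']}"  # placeholder for real work
            path.write_text(json.dumps(progress, indent=2))
            return task["name"]
    return None  # nothing left to do


with tempfile.TemporaryDirectory() as workdir:
    progress_path = initialize(
        workdir, ["parse input", "write tests", "implement feature"]
    )
    completed = []
    # Each loop iteration stands in for a fresh session with no memory
    # other than what the progress file holds.
    while (name := run_session(progress_path)) is not None:
        completed.append(name)
    final_state = json.loads(progress_path.read_text())
```

Because every session starts from the file rather than from its own context, a session that dies halfway loses one task at most.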

Google DeepMind separated generation, verification, and revision into distinct roles — the same structure humans use in code review, legal proceedings, and scientific peer review (Nate B Jones, 2026). (Jones references a specific DeepMind model in his video; the exact name is unconfirmed — likely AlphaProof or AlphaGeometry.)
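Here is a minimal sketch of that role separation on a toy problem (integer square roots), assuming nothing about DeepMind's real system. The three functions are hypothetical: `generate` produces a crude first draft, `verify` only judges and never fixes, and `revise` only proposes a new candidate and never judges its own work.

```python
import math
from dataclasses import dataclass


@dataclass
class Verdict:
    ok: bool
    feedback: str


def generate(n):
    """Generator: a deliberately crude first draft."""
    return max(n // 2, 1)


def verify(n, r):
    """Verifier: checks the definition of floor(sqrt(n)) and nothing else."""
    if r * r > n:
        return Verdict(False, "too high")
    if (r + 1) * (r + 1) <= n:
        return Verdict(False, "too low")
    return Verdict(True, "accepted")


def revise(n, r):
    """Reviser: one Newton step toward the integer square root."""
    return (r + n // r) // 2


def solve(n, max_rounds=20):
    candidate = generate(n)
    history = []
    for _ in range(max_rounds):
        verdict = verify(n, candidate)
        history.append((candidate, verdict.feedback))
        if verdict.ok:
            return candidate, history
        candidate = revise(n, candidate)
    raise RuntimeError("no candidate passed verification")


root, history = solve(17)
```

For 17 the draft is 8, the verifier rejects it twice, and the third candidate (4) is accepted. The generator never needed to be good, only the verifier did.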

OpenAI's Codex runs tasks in parallel sandbox environments. Clean isolation. No cross-contamination (Nate B Jones, 2026).
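The isolation idea can be sketched with throwaway directories and a thread pool. This is an illustration of the pattern only: a temporary directory is not a real security sandbox, the task names are made up, and none of this reflects how Codex is actually implemented.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_in_sandbox(task_name):
    """Run one task in its own throwaway directory and report what it produced."""
    with tempfile.TemporaryDirectory(prefix=f"{task_name}-") as sandbox:
        root = Path(sandbox)
        # Placeholder for real work: each task writes only inside its own directory.
        (root / "output.txt").write_text(f"result of {task_name}\n")
        return {
            "task": task_name,
            "files": sorted(p.name for p in root.iterdir()),
            "output": (root / "output.txt").read_text().strip(),
        }


tasks = ["lint", "unit-tests", "docs"]  # hypothetical task names
with ThreadPoolExecutor(max_workers=3) as pool:
    reports = list(pool.map(run_in_sandbox, tasks))
```

Every task writes a file with the same name, and none of them can see or overwrite another's copy, which is the whole point of isolation.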

Cursor published the most detailed evidence. Wilson Lin described a Planner → Sub-Planner → Worker → Judge hierarchy: root planners explore the codebase, recursive sub-planners parallelize the planning itself, workers grind on individual tasks in isolated repo copies, and a judge decides at the end of each cycle whether to continue, restart with fresh context, or stop (Wilson Lin, Cursor Engineering Blog, January 14, 2026). Their first attempt — flat coordination with shared files and locks — failed badly. Agents became risk-averse and avoided hard tasks. Sound familiar? That's every open-plan office I've ever worked in.
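A toy version of that hierarchy, with every function name and threshold invented for illustration (this is not Cursor's code). The root planner splits a goal into areas, sub-planners split areas into tasks, workers attempt tasks, and the judge picks one of the three outcomes described above. The rule that a majority of failures triggers a restart is my assumption.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    RESTART = "restart"
    STOP = "stop"


def root_planner(goal):
    """Root planner: split the goal into broad areas."""
    return [f"{goal}: area {i}" for i in (1, 2)]


def sub_planner(area):
    """Sub-planner: split one area into worker-sized tasks."""
    return [f"{area} / task {j}" for j in ("a", "b")]


def worker(task, cycle):
    """Hypothetical worker: 'b' tasks only succeed from the second cycle on."""
    return not (task.endswith("b") and cycle == 0)


def judge(results):
    """Judge: stop when all pass, restart when most fail, else continue."""
    if all(results.values()):
        return Decision.STOP
    failed = sum(1 for ok in results.values() if not ok)
    if failed > len(results) / 2:
        return Decision.RESTART
    return Decision.CONTINUE


def run(goal, max_cycles=4):
    results, log = {}, []
    for cycle in range(max_cycles):
        if not results:  # fresh context: plan from scratch
            tasks = [t for area in root_planner(goal) for t in sub_planner(area)]
            results = {t: False for t in tasks}
        for task in list(results):
            if not results[task]:  # only retry unfinished tasks
                results[task] = worker(task, cycle)
        decision = judge(results)
        log.append(decision)
        if decision is Decision.STOP:
            break
        if decision is Decision.RESTART:
            results = {}  # throw away accumulated state
    return results, log


results, log = run("build parser")
```

In this run the judge says continue once (two of four tasks failed) and then stop, so the second cycle only re-runs the two failed tasks.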

Now the kicker, and treat it as a high-stakes claim: Cursor's coding harness reportedly solved Problem 6 of the First Proof challenge. First Proof is a set of ten unpublished mathematical research problems contributed by academics at Stanford, MIT, Berkeley, and other institutions, encrypted on 1stproof.org and decrypted only on February 13, 2026, to give AI systems a fair test. The harness ran for four days with zero hints and zero human intervention. According to Cursor CEO Michael Truell, the solution was stronger than the official human-written one (Michael Truell, X, March 2026). The First Proof setup is documented; the specific Cursor result is Truell's claim and still awaits independent verification, so check it yourself before you cite it in a board deck.

The same harness had reportedly built a web browser from scratch in Rust in approximately one week, at around one million lines of code (Wilson Lin, Cursor Engineering Blog, January 14, 2026; Nate B Jones, 2026).

A coding tool that may have solved unpublished, research-level math. Don't try to say "Marcus-Spielman SDP interlacing polynomial method" five times fast, but do try to explain that away as incremental progress.

What This Means If You Build Things

Figure: The new question: can your work be decomposed into verifiable sub-problems? Decomposable work flows through the harness; subjective work stays with humans.

Jones reframes the question perfectly: "The relevant question is shifting very quickly from 'can AI do a specific task in my job family' to 'can my work be decomposed into verifiable sub-problems'" (Nate B Jones, 2026).

Worth noting: all the concrete evidence Jones cites is from coding and mathematics. Whether decomposition-and-verification generalizes cleanly to strategic planning or creative work is still an open question. The architecture works spectacularly well on problems with verifiable outputs. Your mileage may vary if "correct" is a matter of opinion.

The harness is the product now. Not the model. The harness — the state around the agent, the scaffolding, the structured environment it operates within. A markdown file for tasks, a spot for memory, verification at every gate. The fluency curve — how well we use these tools — matters more than the intelligence curve.
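As a small illustration of "a markdown file for tasks" plus "verification at every gate", here is a sketch that parses a checkbox list and only ticks a box when a gate function passes. The checkbox format and the function names are my assumptions, not a standard any of the labs published.

```python
import re

# Matches lines like "- [ ] write tests" or "- [x] set up repo".
TASK_LINE = re.compile(r"^- \[( |x)\] (.+)$")


def parse_tasks(markdown):
    """Read the task list out of a markdown file's text."""
    tasks = []
    for line in markdown.splitlines():
        match = TASK_LINE.match(line.strip())
        if match:
            tasks.append({"title": match.group(2), "done": match.group(1) == "x"})
    return tasks


def mark_done(markdown, title, gate):
    """Tick the checkbox for `title` only if gate() returns True."""
    if not gate():
        return markdown  # gate failed: the file stays unchanged
    return markdown.replace(f"- [ ] {title}", f"- [x] {title}", 1)


TASKS_MD = "# Tasks\n- [x] set up repo\n- [ ] add login form\n- [ ] write tests\n"

tasks = parse_tasks(TASKS_MD)
after_pass = mark_done(TASKS_MD, "add login form", gate=lambda: True)
after_fail = mark_done(TASKS_MD, "write tests", gate=lambda: False)
```

The agent can claim whatever it likes. The task file only changes when the gate agrees.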

Jones identifies the meta-skills that survive this transition: knowing whether architecture is maintainable, recognizing fragile solutions, understanding when tests cover the important cases. "The skill that survives this transition isn't 'I can do the work.' It's 'I can sniff check'" (Nate B Jones, 2026).

Here's the honest caveat: multi-agent harnesses are expensive and operationally messy. Jones says you have to be ready to "enable token burn" (Nate B Jones, 2026). He's right. Orchestration debugging is its own special hell — agents that silently fail, state files that drift, verification steps that pass when they shouldn't. But the alternative — serial cognition on problems that are structurally inaccessible to a single context window — isn't cheaper. It's impossible.

We Solved the Same Problem Without Knowing It

Figure: Orchemist's behavioral trust pipeline mapped onto the four-lab convergence (decompose, parallelize, verify, iterate), with deterministic exit codes as the source of truth.

I've been building Orchemist, an open-source AI orchestration engine, and watching this convergence feels like finding out the highway you built connects to four other highways nobody told you about.

Orchemist uses a behavioral trust pipeline where AI agents are verified by deterministic infrastructure — not by other AI opinions. The pipeline phases map directly to what the labs converged on: decompose → parallelize → verify → iterate. On day one, the spec adversary rejected four successive versions of the spec. A dollar spent on a philosophical argument between two AIs. Money well spent, actually, because the fifth version was airtight.
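To show what "verified by deterministic infrastructure" means in practice, here is a simplified sketch rather than Orchemist's actual code. The `gate` and `accept` helpers are hypothetical names. The only signal trusted is the exit code of a real process: 0 passes, anything else fails, and the agent's own claim is recorded but never decides the outcome.

```python
import subprocess
import sys


def gate(command):
    """Run a check and trust only its exit code: 0 passes, anything else fails."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0


def accept(agent_claim, command):
    """Record what the agent said, but let the exit code decide."""
    return {"claimed": agent_claim, "verified": gate(command)}


# Two stand-in checks: tiny Python programs that exit 0 or non-zero.
passing_check = [sys.executable, "-c", "assert 1 + 1 == 2"]
failing_check = [sys.executable, "-c", "assert 1 + 1 == 3"]

honest = accept("all tests pass", passing_check)
overconfident = accept("all tests pass", failing_check)
```

Both agents made the same claim. Only one of them gets through, and no second model had to offer an opinion about it.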

I rate our system at 3.8 out of 5 on trust maturity. We're working toward Level 5 — what I call "Dark Factory," where the system runs without human intervention. We're not there. I'm honest about that because honesty is load-bearing in this field.

We weren't following the labs. We were solving the same problem they were solving. The convergence isn't validating us specifically — it's showing that the problem itself may force this architecture. I say "may" deliberately. Four labs arriving at the same pattern could mean the problem forced the solution. It could also mean shared research culture, employees who move between labs, or shared benchmark constraints pointing everyone toward the same local optimum. I don't know which. Neither does Jones. Neither do the labs, probably. Decompose, verify, iterate — it's a management insight that generalizes to autonomous agents as naturally as it generalizes to human teams. Whether that generalizes further is the open question (Nate B Jones, 2026).

The Part You'll Think About Later

Figure: Old human institutions (peer review, adversarial proceedings, sprint cycles, division of labor) mapped onto the agent architecture the four labs converged on.

Humanity spent centuries developing organizational structures for collective cognition: peer review, adversarial proceedings, sprint cycles, division of labor. We figured out how to make groups smarter than individuals.

Then we forgot those lessons. Then AI labs rediscovered them independently, without talking to each other or to HR.

The question isn't whether AI can do your job. It's whether your work can be decomposed into verifiable sub-problems. If yes, someone is building the harness right now.

Maybe it should be you.

Sources

Nate B Jones, "4 AI Labs Built the Same System Without Talking to Each Other," YouTube, March 11, 2026 — Primary source for the convergence thesis. Synthesizes the Anthropic, DeepMind, OpenAI Codex and Cursor architectures from public announcements and engineering blog posts.
Wilson Lin, "Scaling long-running autonomous coding," Cursor Engineering Blog, January 14, 2026 — First-party source for the Planner / Sub-Planner / Worker / Judge architecture and the one-week, one-million-LoC Rust browser build.
Michael Truell (CEO, Cursor), X, March 2026 — First-party announcement of Cursor's First Proof Problem 6 result.
First Proof challenge (1stproof.org) — Set of ten unpublished mathematical research problems from Stanford, MIT, Berkeley and other academics, encrypted with the decryption key released February 13, 2026 specifically to test AI systems.
Ethan Mollick, "Jagged frontier" framing — popularized in Co-Intelligence (2024) and his Substack work on AI capability variability. Used here as the counterpoint Jones argues against.
