Engineering
18 min read

Agentic Engineering: From Writing Code to Orchestrating AI

Izzy A
CTO @PromptMetrics

The developer's role has shifted from typist to conductor. Learn how multi-agent orchestration and AI code review are redefining software engineering today.


"I don't think I've typed a line of code since December."

That's Andrej Karpathy, former Tesla AI director and founding member of OpenAI, describing his workflow in early 2026. He's not exaggerating. 84% of developers now use AI tools in their development process, up from 76% in 2024 (Stack Overflow, 2025). But usage stats don't capture what's actually happening.

The verb itself has changed. Developers aren't writing code anymore. They're telling agents what to build, reviewing the output, and stitching it together. The role is shifting from typist to conductor, and it's happening faster than anyone expected.

This isn't about whether AI makes developers more productive. It's about what happens when you stop being the one who writes the code, and start being the one who commands the machines that do.

Key Takeaways

  • 42% of all committed code is now AI-generated, heading toward 65% by 2027 (Sonar, 2026)

  • Developers now spend more time reviewing AI code (11.4 hrs/week) than writing it themselves (9.8 hrs/week)

  • Enterprise interest in multi-agent systems surged 1,445% between Q1 2024 and Q2 2025

  • The new bottleneck isn't your typing speed or your GPU capacity. It's your skill at orchestrating agents.

What Does "Orchestrating Agents" Actually Mean?

90% of professional developers now use at least one AI tool at work, and 51% use them daily (JetBrains, 2026). That tells you the tools are everywhere. It doesn't tell you how the job itself is mutating.

Karpathy describes the ratio shift: he went from writing 80% of his own code to delegating 80%, and now thinks even 20% hands-on is an overestimate. "I don't think I've typed a line of code probably since December," he said on the No Priors podcast. The experience induces a kind of "AI psychosis." It's a constant, slightly manic state of discovering that more is possible than you thought, while simultaneously worrying you're falling behind.

The unit of work has changed. It's not a function or a file anymore. It's a "macro action": a self-contained feature or change that you describe, delegate, and review. Peter Steinberg, an early adopter profiled in the same conversation, runs 10 agents simultaneously across multiple repositories. Each one takes about 20 minutes on a high-effort prompt. He rotates between them, issuing new instructions, reviewing completed work, merging what passes.

Is this productive? GitHub Copilot now has over 20 million users, and a randomized controlled trial across Microsoft, Accenture, and a Fortune 100 firm found a 26% increase in completed pull requests (Cui et al., 2025). But framing this as "26% more productive" misses the point. These developers aren't just going faster. They're doing something qualitatively different.

According to a 2026 Gartner analysis, the share of enterprise engineers using AI coding assistants will reach 90% by 2028, up from under 14% in early 2024. The more striking figure: enterprise inquiries about multi-agent systems surged 1,445% between Q1 2024 and Q2 2025 (Gartner, 2025). Companies aren't just asking if AI can write code. They're asking how to run fleets of agents that write, test, and review code in parallel, with humans in the loop only at decision points.

The orchestrator skill transition: The defining skill for developers in 2026 isn't mastering a new framework or language. It's learning to decompose problems into parallelizable chunks, write precise agent instructions, and review AI-generated code at scale. The Orchestrator-Worker pattern now dominates 70% of production multi-agent deployments, with enterprises reporting 3x faster task completion and 60% better accuracy compared to single-agent setups (AgentMarketCap, 2026).
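To make the pattern concrete, here's a minimal sketch of the Orchestrator-Worker shape in Python: decompose a feature into independent sub-tasks, fan them out to worker agents in parallel, and gate everything behind review before merging. The `Task` fields and the `run_agent` coroutine are illustrative stand-ins for whatever agent runtime you actually use, not any particular vendor's API.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Task:
    description: str   # precise instruction for one worker agent
    acceptance: str    # how a reviewer decides this sub-task is "done"

async def run_agent(task: Task) -> str:
    """Stand-in worker: call your agent runtime here and return its output."""
    await asyncio.sleep(0)  # placeholder for a long-running agent session
    return f"<patch for: {task.description}>"

async def orchestrate(subtasks: list[Task]) -> list[str]:
    # Fan out: every worker runs in parallel on an independent sub-task.
    patches = await asyncio.gather(*(run_agent(t) for t in subtasks))
    # Review gate: in practice a human (or a reviewer agent) checks each patch
    # against its acceptance criteria before anything is merged.
    return list(patches)

subtasks = [
    Task("Add a DB migration for the audit table", "migration applies cleanly"),
    Task("Write unit tests for the audit service", "tests pass, edge cases covered"),
]
patches = asyncio.run(orchestrate(subtasks))
```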

How Do Developers Run Multiple Agents in Parallel?

The single-agent workflow is already table stakes. You prompt an agent, it builds something, you review it. That was late 2024.

What's happening now is different. Developers are running 5, 10, or more agents at once, and it's not a stunt. It's their default workflow. Ryan Lopopolo, an engineering lead at OpenAI, reportedly banned his team from touching code editors directly. Each engineer, he argues, now commands "5, 50, or 5,000 engineers worth of capacity 24/7." Whether or not your team goes that far, the logic is hard to argue with.

Karpathy frames it bluntly: "If you're not maximizing your subscription, you are the bottleneck in the system." His approach, which is spreading rapidly through engineering teams, treats agent runtime like GPU utilization during a PhD. When the GPUs aren't running, you're wasting capacity. When your agent subscriptions have idle tokens, same thing. The instinct becomes: while one agent works, start another.
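Mechanically, that instinct looks something like the loop below: keep a fixed number of agent sessions in flight and start the next backlog item the moment one finishes. This is only a schematic of the scheduling idea, with `start_agent` standing in for a real agent session; the concurrency limit and refill logic are assumptions, not anyone's published workflow.

```python
import asyncio

MAX_ACTIVE = 5  # how many agent sessions you keep in flight at once

async def start_agent(task: str) -> str:
    await asyncio.sleep(1)  # stand-in for a ~20-minute agent run
    return f"result of {task!r}"

async def saturate(backlog: list[str]) -> list[str]:
    pending, done = set(), []
    queue = list(backlog)
    while queue or pending:
        # Refill: launch new agents until the concurrency ceiling is hit.
        while queue and len(pending) < MAX_ACTIVE:
            pending.add(asyncio.create_task(start_agent(queue.pop(0))))
        # Wait for the first agent to finish, review/merge it, then refill.
        finished, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED)
        done.extend(t.result() for t in finished)
    return done

results = asyncio.run(saturate([f"task-{i}" for i in range(12)]))
```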

[Chart] Task Completion: Single Agent vs Multi-Agent (relative speed). Single-agent baseline (1.0x) vs Orchestrator-Worker multi-agent at 3.0x faster task completion. Source: Gartner / AgentMarketCap, 2026.

But here's what the productivity charts don't show: there's a new kind of cognitive load. When you're orchestrating multiple agents, you're context-switching between different parts of a codebase at different levels of abstraction, simultaneously. One agent is deep in a database migration. Another is refactoring a React component. A third is writing tests. You have to hold enough of each in your head to review competently. It's not multitasking in the email-and-Slack sense. It's more like air traffic control.

The developers who thrive in this setup share a common trait: they're obsessive about instruction quality. They treat agent prompts like code. They version-control them, iterate on them, and share them across the team. The "vibe coding" phase where you toss a one-liner at an agent and hope for the best is already giving way to something more disciplined.
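What "treating prompts like code" can look like in practice is nothing fancier than keeping instruction templates in the repository and rendering them with explicit parameters. The sketch below assumes a hypothetical `prompts/` directory and template format; the point is that the instruction text gets reviewed and versioned rather than retyped ad hoc.

```python
from pathlib import Path
from string import Template

PROMPT_DIR = Path("prompts")  # lives in the repo, reviewed like any other file

def render_prompt(name: str, **params: str) -> str:
    """Load a versioned prompt template and fill in task-specific details."""
    template = Template((PROMPT_DIR / f"{name}.txt").read_text())
    return template.substitute(**params)  # fails loudly if a parameter is missing

# prompts/refactor.txt might read:
#   Refactor $module to remove the $pattern pattern.
#   Constraints: touch nothing outside $module; all existing tests must pass.
instruction = render_prompt("refactor",
                            module="billing/invoices.py", pattern="singleton")
```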

Enterprise multi-agent deployments using the Orchestrator-Worker pattern deliver 3x faster task completion and 60% better accuracy than single-agent systems, according to a 2026 analysis of production deployments by Gartner. The topology dominates roughly 70% of enterprise multi-agent implementations today (AgentMarketCap, 2026).

Why Does Token Throughput Matter Now?

For roughly a decade, most developers didn't feel compute-bound. CPUs were fast enough. Cloud instances were cheap enough. The bottleneck was human. How fast could you think, type, and debug?

That's over.

Agentic coding tasks consume roughly 1,000x more tokens than simple chat or code reasoning tasks, and runs on the same task can vary by up to 30x in token consumption depending on how the agent approaches the problem (Microsoft Research, 2026). Meanwhile, OpenAI's API token throughput surged from roughly 6 billion tokens per minute in October 2025 to 15 billion per minute by March 2026. That's a 150% increase driven almost entirely by agentic workloads (WSJ / AI Daily Post, 2026).

The practical takeaway isn't "tokens are expensive." It's that tokens are the new compute, and your token throughput is your new capacity metric. How many tokens can you put to useful work per day?

Karpathy draws a parallel to his PhD days: "You would feel nervous when your GPUs are not running. You have GPU capability and you're not maximizing your available flops. But now it's not flops. It's about tokens." The feeling is the same. Idle capacity is wasted potential. The difference is that token throughput is partly a skill issue. Better instructions, better parallelism, better review pipelines all increase your effective throughput.
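If token throughput really is the new capacity metric, it's worth measuring, even crudely. The sketch below logs tokens per agent run and reports two numbers: how much of a daily budget you actually used, and what fraction of those tokens ended up in merged work. The budget figure and field names are illustrative assumptions, not any provider's real limits.

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    task: str
    tokens: int    # total tokens the run consumed
    merged: bool   # did the output survive review and get merged?

DAILY_TOKEN_BUDGET = 50_000_000  # hypothetical subscription capacity per day

def utilization(runs: list[AgentRun]) -> tuple[float, float]:
    used = sum(r.tokens for r in runs)
    useful = sum(r.tokens for r in runs if r.merged)
    return used / DAILY_TOKEN_BUDGET, useful / max(used, 1)

runs = [AgentRun("db migration", 1_800_000, True),
        AgentRun("flaky refactor", 2_400_000, False)]
capacity, yield_rate = utilization(runs)
print(f"{capacity:.0%} of budget used; {yield_rate:.0%} of tokens reached main")
```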

[Chart] OpenAI API Token Throughput (billion tokens per minute), October 2024 through March 2026, climbing to 15B tokens/minute by March 2026. Source: WSJ / AI Daily Post, 2026.

Agentic coding workflows consume approximately 1,000x more tokens than simple code chat, and identical tasks can vary by up to 30x in token cost depending on how the agent is instructed, according to Microsoft Research's April 2026 analysis of real-world agentic coding task data (Microsoft Research, 2026).

But let's be honest about what "feeling compute-bound" does to you psychologically. Karpathy describes it as a state of productive paranoia: "I'm very antsy that I'm not at the forefront of it. I see lots of people on Twitter doing all kinds of things. And I need to be at the forefront or I feel extremely nervous." This isn't healthy in the long run. It's the same psychology that burned out a generation of startup founders. But it's also not irrational. The gap between what the tools can do and what most developers are getting out of them is enormous, and it's widening.

Is the Review Bottleneck the Real Problem?

Here's a stat that should stop you cold: developers now spend 11.4 hours per week reviewing AI-generated code, compared to 9.8 hours writing code with AI assistance (Digital Applied, 2026). Review time surged 31% year-over-year. Writing time grew only 8%. The crossover already happened.

We built tools that generate code faster than we can read it. That's the fundamental asymmetry of the agent era.

And the trust gap makes it worse. 96% of developers don't fully trust that AI-generated code is functionally correct, but only 48% always verify it before committing (Sonar, 2026). AI-written code contains 1.7x more issues than human-written code, with 1.75x more logic errors and 1.57x more security findings. We're generating more code, reviewing it less thoroughly than it warrants, and shipping bugs we wouldn't have written ourselves.

The review bottleneck is also where the real craft is migrating. Writing a function from scratch was never the hard part of engineering. The hard part was always judgment. Does this solve the right problem? Will it hold up under edge cases? Is it maintainable six months from now? AI can generate the code, but it can't answer those questions. Review becomes where the thinking happens.

Steve Sanderson's Copilot team at GitHub ships roughly 200 PRs per week with 7-10 engineers. Their pattern: AI generates the code, multiple AI models cross-review it (Claude Opus, Haiku, Gemini all get a look), and humans make the final call. The PRs are 10x bigger than what a human-written PR would be. The review discipline has to scale accordingly.
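A stripped-down version of that cross-review stage might look like the sketch below: several reviewer models read the same diff independently, their findings are collected, and a human makes the final call on each item. The `review_with` function and the reviewer names are placeholders, not the actual pipeline the Copilot team runs.

```python
def review_with(model: str, diff: str) -> list[str]:
    """Placeholder: send the diff to one reviewer model, return its findings."""
    return []  # e.g. ["possible N+1 query in get_invoices()", ...]

def cross_review(diff: str, models: list[str]) -> dict[str, list[str]]:
    # Each model reviews independently so their blind spots don't correlate.
    findings = {m: review_with(m, diff) for m in models}
    # A human reviewer still adjudicates every finding before merge.
    return findings

diff = "..."  # the oversized AI-generated PR under review
report = cross_review(diff, ["reviewer-a", "reviewer-b", "reviewer-c"])
```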

What's striking is that the productivity gains from AI aren't evenly distributed. Median gains hit +34% at 60 days and plateau at +37% after 180 days. But drill into the task breakdown and the picture fractures: 78% gains for boilerplate, 64% for test writing, 59% for unfamiliar languages. Meanwhile, architectural decisions get 18% and security-sensitive code gets 16% (Digital Applied, 2026). The easy stuff gets radically faster. The hard stuff stays hard. That's where the orchestrator earns their keep.

AI-generated code contains 1.7x more bugs than human-written code, and 96% of developers say they don't fully trust AI output, yet only 48% consistently verify it before merging. The productivity gains from AI coding tools (median +37% at 180 days) are concentrated in boilerplate and testing tasks, with minimal improvement in architecture and security work (Sonar, 2026; Digital Applied, 2026).

What Happens When Agents Run Without You?

The logical endpoint of orchestration is removing yourself from the loop entirely. That's what Karpathy calls "auto-research" and what the broader agent ecosystem calls "claws" or "loops." These are agents that run for hours or days without human intervention, working toward an objective you set once.

Karpathy built a system that autonomously optimizes machine learning model training. He gave it an objective (lower validation loss), boundaries (what it could and couldn't change), and let it run overnight. It found hyperparameter improvements he'd missed despite two decades of experience. Weight decay on vocabulary embeddings. Suboptimal Adam betas. Things that only became visible through systematic exploration at a scale no human researcher can sustain.
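The structure of that kind of run is simpler than it sounds: a verifiable objective, a hard boundary on what the agent may touch, and a loop that proposes, evaluates, and keeps the best result. The sketch below is a schematic of that loop only; `propose_change` and `train_and_eval` are stand-ins for an agent call and a real training pipeline, not Karpathy's actual system.

```python
ALLOWED_KNOBS = {"lr", "weight_decay", "adam_beta1", "adam_beta2"}  # boundaries

def propose_change(history: list) -> dict:
    """Stand-in for the agent suggesting the next configuration to try."""
    return {"weight_decay": 0.1}

def train_and_eval(config: dict) -> float:
    """Stand-in for a real training run; returns the verifiable objective."""
    return 2.31  # validation loss

def auto_research(max_trials: int = 50) -> tuple[dict, float]:
    best_cfg, best_loss, history = {}, float("inf"), []
    for _ in range(max_trials):
        cfg = propose_change(history)
        if not set(cfg) <= ALLOWED_KNOBS:   # enforce the boundaries
            continue
        loss = train_and_eval(cfg)
        history.append((cfg, loss))
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

best_config, best_val_loss = auto_research()
```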

The home automation version is even more visceral. Karpathy built "Dobby," a claw that controls his entire house: Sonos, lights, HVAC, shades, pool, security cameras. The agent found each smart-home subsystem by scanning his local network, reverse-engineered their APIs, built a unified dashboard, and now communicates entirely through WhatsApp. When a FedEx truck pulls up, a vision model detects it and Dobby texts him. "I can't believe I just typed in 'can you find my Sonos?' and music started playing," he said. Six apps, replaced by one agent in an afternoon.

This is where the "psychosis" comes from. The gap between what's technically possible and what you've actually implemented is so large it becomes existentially uncomfortable. Every time you manually do something an agent could handle, you're leaving capacity on the table. And the things agents can handle keep expanding.

The frontier labs are running this playbook at scale. As Karpathy puts it: "These researchers are basically glorified auto-researchers. They're actively automating themselves away. And this is the thing they're all trying to do." He visited OpenAI recently and told researchers the quiet part out loud: "You guys realize if this is successful, we're all out of a job? We're just building automation for the board."

He's half-joking. But the half that isn't is worth sitting with.

Watch: Andrej Karpathy — Software Is Changing (Again) on YouTube (39 min). Karpathy introduces the Software 3.0 framework and the shift to programming through natural language with AI agents.

The defining breakthrough of agentic engineering is that the developer no longer needs to be in the loop at every step. Autonomous coding agents can run for hours or days, optimizing toward verifiable objectives like lower validation loss in ML training or passing unit test suites, and frequently find improvements that experienced human practitioners miss (Karpathy, AutoResearch project, 2026).

Where Is This Heading in 2-3 Years?

The trajectory is clear enough to be uncomfortable. Developers project 65% of committed code will be AI-generated by 2027 (Sonar, 2026). Gartner predicts 55% of engineering teams will be building LLM-based features by that same year. But extrapolation is easy. The harder question is what doesn't change.

Karpathy describes AI systems as having "jagged intelligence": superhuman at generating unit tests, subhuman at understanding what you actually meant. This jaggedness is structural, not transitional. It comes from how these models are trained: reinforcement learning on verifiable rewards makes them extraordinary at anything with a clear right answer and mediocre at everything else. Ask an agent to optimize a CUDA kernel and it'll find speedups you'd never consider. Ask it to tell you a joke and you'll get the same tired atom pun it's been recycling since 2023.

The implication for developers is that the value you provide increasingly lives in the non-verifiable space. Can you define what "good" means for a product before anyone's built it? Can you sense when a solution is technically correct but strategically wrong? Can you make a call when two reasonable approaches conflict and there's no metric to settle it? These are the things agents can't do. They're also the things that were always the actual job.

What's genuinely new is how education and knowledge transfer work in this world. Karpathy released "microGPT": a 200-line Python implementation of a GPT training loop, boiled down to its absolute essence. In the old world, he'd make a video walking through it. In this one, he just expects agents to explain it. "If agents get it, then they can just explain all the different parts," he says. "I'm not explaining to people anymore. I'm explaining to agents." If an agent understands your code, it can teach it to anyone at their exact level, in their language, with infinite patience. The bottleneck becomes whether the agent comprehends the material, not whether you've written good docs.

The most durable skill in the agent era won't be coding proficiency. It'll be the ability to define what "good" looks like in ambiguous situations, decompose complex problems into verifiable sub-tasks that agents can execute, and exercise judgment at decision points where there's no clean metric to optimize against. AI systems remain structurally weak in domains without clear verifiable rewards, creating a lasting role for human judgment even as code generation becomes fully automated (Karpathy, "From Vibe Coding to Agentic Engineering", 2026).

Watch: Andrej Karpathy — From Vibe Coding to Agentic Engineering on YouTube (30 min). Recorded at Sequoia AI Ascent 2026, Karpathy discusses jagged intelligence, the skills developers need, and why understanding still matters even when AI handles the typing.

The pace difference between digital and physical is also worth watching. Karpathy predicts the digital space will change "at the speed of light" while the physical world lags behind. Atoms are just harder to move than bits. But the interface between the two is where interesting companies will form. Agents that can run experiments in the physical world and feed results back into the digital loop. Agents that can pay humans for data. The books "Daemon" and "Autonomous" get referenced in these conversations for a reason. They imagined a world where AI doesn't need a robot body to affect the physical world, just enough leverage over the humans and systems already in it.

Frequently Asked Questions

Will AI agents replace software developers?

Not in any simple sense. Demand for software is elastic. When it gets cheaper to build, people build more of it. That's Jevons paradox, and it's the same reason ATMs didn't eliminate bank tellers (more bank branches opened). 84% of developers already use AI tools and the industry is still hiring aggressively. The role is changing, not disappearing.

What skills matter most for developers in the agent era?

System design, problem decomposition, and code review. The median productivity gain from AI is +37% at 180 days, but almost none of that comes from architecture or security decisions (Digital Applied, 2026). AI is best at the parts you can verify automatically. The ambiguous stuff is still yours.

Is "vibe coding" real engineering?

It depends on what you're building. For prototypes, internal tools, and personal projects, vibing with an agent is genuinely sufficient. For production systems with security, compliance, or reliability requirements, you need structured review and verification pipelines. The arc from vibe coding to agentic engineering is real. It's the difference between playing with the tool and building with it.

How do I get started with multi-agent orchestration?

Start by running two agents in parallel on independent tasks. The skill to build is giving each one clear boundaries: tasks that don't step on each other's files, with distinct acceptance criteria. Most developers find the bottleneck shifts from "can I write this?" to "can I review both of these competently?" within the first week.
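One lightweight way to enforce those boundaries is to write each task down with an explicit file scope and acceptance criterion, and refuse to launch two agents whose scopes overlap. The sketch below shows the idea; the field names are illustrative, not any tool's schema.

```python
from dataclasses import dataclass

@dataclass
class AgentTask:
    name: str
    file_scope: set[str]  # paths this agent is allowed to modify
    acceptance: str       # how you'll judge whether to merge its output

def safe_to_parallelize(a: AgentTask, b: AgentTask) -> bool:
    return not (a.file_scope & b.file_scope)  # no shared files, no collisions

t1 = AgentTask("add rate limiting",
               {"api/middleware.py", "tests/test_middleware.py"},
               "429s after N requests; existing tests still pass")
t2 = AgentTask("tighten CI lint rules",
               {".github/workflows/ci.yml"},
               "CI fails on new lint violations, passes on current main")
assert safe_to_parallelize(t1, t2)  # independent scopes -> run both at once
```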

Conclusion

The "psychosis" Karpathy describes is real, and it's not going away. Every developer who's spent a serious week with coding agents has hit the same wall: the gap between what the tools can do and what you're getting out of them is enormous, and closing that gap is entirely on you. The models won't get worse. Your ability to wield them has to get better.

That's actually the optimistic read. If the limiting factor were model capability, you'd be stuck waiting for a lab to ship something better. But the limiting factor, at least right now, is skill. How well you prompt. How fast you review. How cleverly you parallelize. How clearly you define what "done" means before you hand it off.

The developers who internalize this are the ones running 10 agents at once, treating prompts like source code, and measuring their output in tokens per day instead of lines per day. The ones who don't are still typing.

  • 42% of committed code is AI-generated, heading to 65% by 2027

  • Review is the new bottleneck: 11.4 hrs/week spent reviewing AI code vs 9.8 hrs writing

  • Multi-agent systems are in production now, delivering 3x faster completion with 60% better accuracy

  • Token throughput is your new capacity metric. Idle subscriptions are wasted flops

  • The skills that last are judgment, decomposition, and taste. The things that were always the real job

The verb changed. You're not a typist anymore. You're a conductor. Whether that's terrifying or exhilarating depends on how much you trust yourself to lead the orchestra.

