In 2017, fresh out of university, I wrote one line in my GitHub bio: one day I'd build my own AI personal assistant. I never changed it. Nine years and a few thousand commits later, the line is finally true. What I built just turned out to be bigger than an assistant. It's a company.
TL;DR
I set out to build Jarvis, a single AI that does what I say. I ended up with Alfred: a 20-agent company with a C-suite, an org chart, and a playbook that I run every day. The lesson that got me there: you don't scale AI agents by making one of them smarter; you scale them by giving them management. Alfred exists to make me a faster engineer, not to replace me.
The promise I made in 2017
The dream was always Jarvis: one voice that knows you and does the work. Building toward it taught me that a single all-knowing agent is the wrong shape for real work, so Jarvis became Alfred, a butler who runs a staff. The timing wasn't an accident. Gartner expects up to 40% of enterprise apps to embed task-specific AI agents by 2026, up from less than 5% in 2025. The tools to actually build this only arrived now.
For most of those nine years the idea sat in that bio line because the technology couldn't carry it. A model good enough to plan, write code, and check its own work didn't exist. By 2026 it did, and the question stopped being "can a model do this?" and became "how do I organize a dozen of them so the output is trustworthy?"
That organizing question turned out to be the whole game.
From one agent to a company
Reliable agents come from structure more than from raw model intelligence. A single agent holding an entire task in one context window gets vague, forgets its own plan, and drifts; splitting the work across focused agents, each with its own context, is what fixes it. In Anthropic's internal research evaluations, a multi-agent system outperformed a single agent by 90.2%.
It started small: one agent that wrote code, another that reviewed it. Two specialists beat one generalist immediately, so I kept splitting roles. The wall everyone hits is the same one: a single agent carrying the whole task gets vague and forgets its plan. The fix was structure. Separate agents, separate context windows, one coordinator.
That instinct has hard numbers behind it. Anthropic reported that a multi-agent system, one lead agent delegating to specialists, outperformed a single agent by 90.2% on its internal research evaluation, and that token usage alone explained 80% of the performance variance on the BrowseComp benchmark. Separate agents with separate context windows are how a system spends that many tokens usefully, which is why spreading the work across them is most of the win.
I'd handed AI a whole project before this. I built this entire portfolio with it in a single session. Alfred is that idea industrialized: roles, reporting lines, gates between steps. The moment Alfred had an org chart instead of a prompt, the output got reliable enough to depend on daily.
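The shape described here (separate agents, separate context windows, one coordinator) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Alfred's actual code: `Specialist`, `Coordinator`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-in for a model call: takes a role and that agent's
# context, returns a conclusion. A real system would call an LLM here.
ModelFn = Callable[[str, list[str]], str]


def fake_model(role: str, context: list[str]) -> str:
    return f"{role}: handled {len(context)} message(s)"


@dataclass
class Specialist:
    role: str
    model: ModelFn = fake_model
    context: list[str] = field(default_factory=list)  # private to this agent

    def run(self, task: str) -> str:
        self.context.append(task)
        return self.model(self.role, self.context)


@dataclass
class Coordinator:
    specialists: dict[str, Specialist]
    conclusions: list[str] = field(default_factory=list)

    def delegate(self, role: str, task: str) -> str:
        # Pass only the task, never the coordinator's own history,
        # and keep only the conclusion the specialist returns.
        result = self.specialists[role].run(task)
        self.conclusions.append(result)
        return result


lead = Coordinator({r: Specialist(r) for r in ("engineer", "reviewer")})
lead.delegate("engineer", "implement the login form")
lead.delegate("reviewer", "review the login form diff")
```

The point of the design is that the reviewer never sees the engineer's working context, only the task it was handed, and the coordinator accumulates conclusions rather than transcripts.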
The staff and the C-suite
Alfred is twenty agents: one overseer, a four-officer C-suite plus an outside advisor, and fourteen specialists who do the actual work. Each one is narrow on purpose. A planner only plans. A reviewer only reviews, and never reviews its own work. The overseer reads every request, decides what kind of work it is, dispatches the right people, and keeps the conclusions, not the noise.

The C-suite governs the system rather than any single task:

| Officer | What it owns |
|---|---|
| Overseer (Alfred) | Runs every session: classifies the work, dispatches specialists, holds the thread |
| CTO | The system's technical direction and how it evolves |
| COO | Throughput and how work is prioritized across everything in flight |
| CQO | Independent quality, the regression gate nothing ships past |
| CFO | Token cost, budget, and whether a given run is worth the spend |
| Advisor | The outside challenger, whose entire job is to disagree |

The specialists are the ones doing engineer-grade work, faster:

| Agent | What it does |
|---|---|
| PM | Turns a raw idea into scoped product intent |
| Architect | Judges feasibility and blast radius before anyone builds |
| Skeptic | Cold-reads the plan for the risk everyone else missed |
| Planner | Decomposes work into test-first steps with exact files and commands |
| Engineer | Implements, test-first, one task at a time |
| Reviewer | Judges the diff before it advances |
| QA | Re-runs every acceptance check with fresh evidence; the final decider |
| Debugger | Proves the root cause before a single line of fix is written |
| Researcher | Owns the outside world: fetches sources and verifies them |
| Librarian | The memory: recalls prior decisions, captures new ones |
| Designer | Owns the visual system end to end |
| Writer | Drafts in my voice |
| Reader | Judges the draft before it ever reaches me |
| Slacker | The single, controlled door to team chat |

The memory piece alone became its own project; I wrote separately about how I made it stick across sessions. None of this is theatre. AI now writes a real share of production code (GitHub reports Copilot generates around 46% of the code its 20 million users ship), and Google's 2026 DORA report found 90% of software professionals use AI daily. The difference with Alfred is that the work runs through a pipeline with gates, the way a real engineering org does: nothing reaches "done" without a separate agent verifying it with fresh evidence.
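To make the gate rule concrete, here is a small Python sketch of "no agent verifies its own work, and nothing is done without evidence". The names (`WorkItem`, `verify`, `mark_done`, `GateError`) are hypothetical and chosen for illustration; the real pipeline has more steps than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkItem:
    title: str
    author: str                        # agent that produced the work
    verified_by: Optional[str] = None  # agent that checked it
    evidence: Optional[str] = None     # e.g. test output from this run


class GateError(Exception):
    """Raised when a work item tries to pass a gate it has not cleared."""


def verify(item: WorkItem, verifier: str, evidence: str) -> WorkItem:
    # Rule 1: the verifier must be a different agent from the author.
    if verifier == item.author:
        raise GateError(f"{verifier} cannot verify its own work")
    # Rule 2: verification without evidence does not count.
    if not evidence.strip():
        raise GateError("verification needs fresh evidence")
    return WorkItem(item.title, item.author, verifier, evidence)


def mark_done(item: WorkItem) -> str:
    if item.verified_by is None:
        raise GateError(f"'{item.title}' has not been verified")
    return f"done: {item.title} (verified by {item.verified_by})"


item = WorkItem("add login form", author="engineer")
checked = verify(item, verifier="qa", evidence="12 tests passed")
status = mark_done(checked)
```

Calling `mark_done(item)` on the unverified item, or `verify(item, verifier="engineer", ...)`, raises `GateError` instead of quietly passing.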
A holding company and its workspaces
Alfred runs on two tiers, and they orchestrate differently. A holding layer governs the system itself (the agents, the rules, the playbook), and workspaces are the subsidiaries where actual projects get built. The holding deliberates; the workspaces deliver. Keeping them separate is what stops "improve the system" work from contaminating "ship the feature" work.

| Tier | Governs | How it orchestrates |
|---|---|---|
| Holding | The system itself: agents, rules, the playbook | Deliberation: officers debate a change, a decision gets recorded |
| Workspace | The real project work | Delivery pipeline: scope, plan, build test-first, review, QA |

The thing that makes any of it coherent is the playbook. Before a session does anything, the overseer reads it: a field manual that says how to classify the request, which tier it belongs to, who to dispatch, and what gate each step has to clear. It's rules as data, not vibes. The overseer doesn't improvise the process; it executes a state machine that refuses to advance when a step is skipped. That refusal is the feature. It's why a tired human at midnight still gets work that passed every check.
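"Rules as data" plus "a state machine that refuses to advance" might look something like the Python sketch below. The step names follow the two tiers in the table; the `PLAYBOOK` dictionary, `Session` class, and `PlaybookError` are assumptions made for illustration, not the playbook's real format.

```python
# Hypothetical playbook: each tier maps to its ordered steps.
PLAYBOOK = {
    "holding": ["debate", "record"],
    "workspace": ["scope", "plan", "build", "review", "qa"],
}


class PlaybookError(Exception):
    """Raised when a step is skipped, repeated, or its gate is not cleared."""


class Session:
    def __init__(self, tier: str) -> None:
        self.steps = PLAYBOOK[tier]
        self.position = 0  # index of the next step that must be cleared

    @property
    def done(self) -> bool:
        return self.position == len(self.steps)

    def advance(self, step: str, gate_passed: bool) -> None:
        if self.done:
            raise PlaybookError("session already complete")
        expected = self.steps[self.position]
        if step != expected:
            raise PlaybookError(f"expected '{expected}', got '{step}'")
        if not gate_passed:
            raise PlaybookError(f"gate for '{step}' not cleared")
        self.position += 1


session = Session("workspace")
session.advance("scope", gate_passed=True)
session.advance("plan", gate_passed=True)
try:
    session.advance("review", gate_passed=True)  # "build" was skipped
except PlaybookError as err:
    refusal = str(err)
```

Because the process lives in a plain data structure, changing it is a reviewable diff to `PLAYBOOK` rather than a rewrite of a prompt.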
The CFO who watches the bill
The CFO is the agent that approves the spend. Before any large run it reports the budget I have left, estimates what the work will cost, and tells me plainly whether it's worth it. It exists because multi-agent systems burn about 15x the tokens of a chat, Anthropic's own figure, and in 2026 that number stopped being abstract.
The bills got ugly fast. TechCrunch reported in June 2026 that Uber drained its entire annual AI-coding budget by April, that one company received a $500M model bill after setting no limits, and that per-developer token consumption rose 18.6x in nine months. Gartner now predicts over 40% of agentic AI projects will be cancelled by 2027, the top reasons being runaway cost and weak controls.
My CFO is the reason I never open that invoice. It treats my usage the way a finance function treats burn: something you forecast and defend, not something you discover at the end of the month. I built that officer early, almost by accident. It turned out to be the one the whole industry now wishes it had.
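The forecast-then-approve check can be sketched in Python as follows. Only the 15x multiplier comes from the figure cited above; the function name `forecast_run`, the `Forecast` record, and the `max_share_of_budget` policy knob are hypothetical, and a real CFO agent would estimate cost from far more than one number.

```python
from dataclasses import dataclass

# Anthropic's reported figure: multi-agent systems use about 15x the
# tokens of a chat. Used here as a rough planning multiplier.
MULTI_AGENT_TOKEN_MULTIPLIER = 15


@dataclass
class Forecast:
    estimated_tokens: int
    remaining_tokens: int
    approved: bool
    reason: str


def forecast_run(chat_equivalent_tokens: int, remaining_tokens: int,
                 max_share_of_budget: float = 0.25) -> Forecast:
    """Estimate a multi-agent run's token cost and decide whether to approve.

    max_share_of_budget is a hypothetical policy: the largest fraction of
    the remaining budget that a single run may consume.
    """
    estimated = chat_equivalent_tokens * MULTI_AGENT_TOKEN_MULTIPLIER
    per_run_limit = int(remaining_tokens * max_share_of_budget)
    if estimated > remaining_tokens:
        return Forecast(estimated, remaining_tokens, False,
                        "exceeds remaining budget")
    if estimated > per_run_limit:
        return Forecast(estimated, remaining_tokens, False,
                        "exceeds per-run share")
    return Forecast(estimated, remaining_tokens, True, "within budget")


small = forecast_run(chat_equivalent_tokens=10_000, remaining_tokens=1_000_000)
large = forecast_run(chat_equivalent_tokens=20_000, remaining_tokens=1_000_000)
```

Here `small` is approved at an estimated 150,000 tokens, while `large` is refused because 300,000 tokens is more than a quarter of what is left.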
Not a replacement but an amplifier, and where it goes next
Alfred was never built to replace me, and the data backs the bet. Stack Overflow's 2025 survey found 84% of developers use AI tools but only 29% trust their accuracy, and that gap is the entire design: the agents do the work, I make the calls. McKinsey frames AI's reach as task transformation rather than job elimination, and notes that AI fluency is the fastest-growing skill in the US job market, up 7x in two years. The job didn't disappear. It moved up a level, from writing every line to running the company that writes them.
The last piece is voice. Today I drive Alfred by typing. The version I'm building now answers out loud: a wake word, a spoken instruction, agents dispatched, a report back in my ear. Jarvis, finally, except there's a whole staff behind the voice. The day Alfred runs from across the room is the day that 2017 bio line is fully, literally true.
I set out to build Jarvis. I ended up with a company. The voice is just how I'll talk to it.
Key Takeaways
- The unlock is organization, not just intelligence. Anthropic's multi-agent system beat a single agent by 90.2% on its internal research evaluation. You scale agents with roles, gates, and a coordinator, not with bigger prompts.
- Specialists with narrow jobs beat one generalist. Twenty agents, each doing one thing, with no agent reviewing its own work, produce output you can actually depend on.
- A playbook makes it repeatable. A state machine that refuses to skip steps is why the work holds up even when I'm not watching.
- A CFO for your agents is no longer optional. Multi-agent systems burn around 15x the tokens of a chat; the 2026 cost reckoning ($500M bills, budgets gone by April) is what happens without one.
- Build to amplify, not replace. In Stack Overflow's 2025 survey, 84% of developers use AI tools but only 29% trust their accuracy. The human stays in the loop, makes the calls, and moves up to running the system.