
Navigating the Storm: Driving AI Agents #

I've been spending a lot of time lately orchestrating multiple AI coding agents, and I've noticed something strange: it's surprisingly stressful. Not in the way that debugging a production outage is stressful, but in this odd, sustained, attention-demanding way that I'm still trying to articulate.

The best analogy I've found is that it's like driving in heavy rain.

How I got here #

Like most developers, I started with a single coding agent. I'd describe what I wanted in natural language, review what it built, iterate on the prompt. It was a new muscle to develop—learning to be specific about requirements, understanding how to break down problems in a way that an LLM could execute on—but it felt manageable. Almost relaxing, even.

Then I tried running multiple agents in parallel. One refactoring the backend authentication system, another updating the frontend components, a third writing tests. The productivity was remarkable. What would have taken me days was happening in hours. But I found myself in this weird state of hypervigilance that I wasn't expecting.

I mentioned this to a few other developers who'd been experimenting with multi-agent workflows, and they all immediately recognized what I was describing. We kept using similar words: "intense," "draining," "you can't look away." One friend called it "productive anxiety," which feels about right.

The driving in rain thing #

Here's what I realized: when you're orchestrating multiple agents, you're mostly not coding. You're watching. Terminal outputs scrolling. File changes appearing in your IDE. Agents reporting progress, hitting errors, asking for clarification.

It's like those long stretches of highway driving in the rain where nothing is really happening—you're just maintaining your lane, keeping your speed steady, staying present. Most of the time, the agents are doing fine. The code is being written, tests are passing, the work is progressing.

But you can't mentally check out. Because every so often, you need to make a small but critical correction. An agent is about to overwrite the wrong file. Two agents are operating on conflicting assumptions about the API contract. A dependency change in one agent's work is about to break another agent's context. These moments come suddenly, and if you miss them, recovery gets expensive.

What makes it particularly draining is that you're responsible for outcomes you're not directly controlling. The agents are doing the actual work—you're barely touching the keyboard—but you're the one who has to sense when something's drifting. By the time an agent has fully gone off-road, you're looking at a much bigger cleanup job.

What I'm learning about this #

I've been keeping notes on what seems to work. Not sure if these generalize, but here's what I'm seeing:

The struggle of "not coding" was real for me at first. I kept feeling like I should be doing more. But orchestration is the work. Making judgment calls about when to intervene, maintaining the mental model of what each agent knows and is doing, catching the moment before things drift—this is cognitively demanding in a different way than writing code directly.

I've found my personal limit is around three concurrent agents on complex tasks. More than that and the context-switching cost starts to outweigh the parallelization benefit. Your mileage may vary.

Checkpoints matter more than I expected. Every 15-20 minutes, I stop everything and review what each agent has done. Commit the good work, adjust course where needed. Without these deliberate pause points, I find myself losing track of the overall state.

The fatigue is different from normal coding fatigue. After a few hours of agent orchestration, I'm more tired than after a full day of writing code myself. I think it's the sustained attention without the natural breaks that coding normally provides.

The tooling problem #

Here's what's bothering me: we're doing all of this with tools that weren't designed for it.

Current IDEs have basically added a chat panel to the sidebar and called it AI integration. But the rest of the interface is still organized around the same paradigm—file trees, code editors, maybe some tabs. The code is the primary artifact, and the AI is a feature.

When you're running multiple agents, though, you're not primarily looking at code. You're monitoring state, tracking progress, managing context, catching divergence. The file tree view is almost beside the point.

I keep thinking we need something more like a dashboard. What is each agent currently doing? What's in their context? Where are they in their task graphs? What's queued? A unified view that shows agent state as the primary interface, not as a sidebar add-on.

Some specific things I wish I had:

Real-time diff streams, not file-at-a-time changes. Show me what's being modified across all agents, let me approve or redirect in the moment. I'm not editing files anymore; I'm steering edits.

Visual context management. Let me see what each agent knows, when context is getting stale, when two agents have incompatible understandings of the same system. Right now I'm tracking this in my head.

Built-in checkpointing. The tool should prompt me: "Agent 2 has made 47 changes in 12 minutes. Review?" One-click to see the aggregate, one-click to commit or rollback per-agent.

Task graphs instead of file trees. Show me the work breakdown—"Refactor authentication" splits into "Update user model," "Migrate sessions," "Add tests." Assign agents to tasks. The files are implementation details.

Communication intercepts. When Agent A's output feeds into Agent B's input, show me that dependency. Better yet, let me intercept and adjust that handoff before Agent B runs with wrong assumptions.
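To make the wish list concrete, here's a minimal sketch, in Python, of the data model such a dashboard might sit on top of. Everything here is hypothetical (the `AgentState` fields, the thresholds, the task-graph shape are my invention, not any existing tool's API); the point is that agent state and task structure, not files, become the primary objects.

```python
from dataclasses import dataclass, field
import time

# Hypothetical model of what an orchestration dashboard might track per agent.
@dataclass
class AgentState:
    name: str
    task: str
    changes: int = 0  # edits since the last human checkpoint
    last_checkpoint: float = field(default_factory=time.time)

def needs_review(agent: AgentState, max_changes: int = 40,
                 max_minutes: float = 15.0) -> bool:
    """Prompt the human when an agent accumulates too many unreviewed
    changes or goes too long without a checkpoint."""
    elapsed_min = (time.time() - agent.last_checkpoint) / 60
    return agent.changes >= max_changes or elapsed_min >= max_minutes

# A task graph instead of a file tree: work items are the nodes,
# and agents are assigned to tasks, not to files.
task_graph = {
    "Refactor authentication": ["Update user model", "Migrate sessions", "Add tests"],
}
assignments = {"Update user model": "agent-1", "Migrate sessions": "agent-2"}

agent = AgentState(name="agent-2", task="Migrate sessions", changes=47)
print(needs_review(agent))  # 47 changes exceeds the threshold -> True
```

A real tool would hang diff streams and context snapshots off the same structure; the threshold check is the "Agent 2 has made 47 changes. Review?" prompt from above.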

The meta-problem #

And that's just for one developer with multiple agents. We haven't really started thinking about what happens when multiple developers are each orchestrating multiple agents on the same codebase. Current Git workflows weren't designed for this. What does code review even mean when 80% of the diff was machine-generated? You're not reviewing code quality—you're reviewing orchestration decisions. Did the human steer the agents toward good architecture? Are the task boundaries sensible? Is the system coherent?

I don't have answers here. I'm not even sure what the right questions are yet.

Where this might be going #

Right now we're making do. We're driving in the storm with tools designed for sunny weather. It works through sheer concentration and adaptation, but it doesn't feel sustainable.

I suspect someone is building the right tooling for this. Not the "AI agent" as a feature bolted onto existing IDEs, but tools designed from the ground up for orchestration. Where agent state is the primary view. Where humans are conductors, not composers.

We're still early. Most of us are squinting through that little chat panel, hoping we catch the critical moment before an agent drifts into the median.

But the driving metaphor keeps resonating with people I talk to. We're all learning to navigate the same storm.

If you're working on this problem—either as someone orchestrating agents or building tools for it—I'd be curious to hear what you're seeing. I'm @lolsborn on Substack, or you can email me at osborn.steven@gmail.com.

Pushing AI Coding to Its Limits: Writing a Programming Language in a week


The Goal Isn't the Language #

The real motivation behind Quest? To push the boundaries of what's possible when you treat AI agents as full development partners, not just autocomplete tools.

The question I wanted to answer: Can we develop novel workflows that produce genuinely high-quality code at scale using AI agents? Not toy demos. Not simple CRUD apps. Real software with edge cases handled, proper test coverage, and maintainable architecture.

I needed something genuinely challenging to test these limits. A programming language is perfect for this because it can be arbitrarily complex—you start with a parser, add a runtime, build a standard library, create tooling, and suddenly you're dealing with edge cases, performance optimization, error handling, and all the messy reality of real software development.

Need proof that Quest actually works? The blog you're reading right now runs on Quest. The entire codebase—routing, templating, static file serving, everything—is written in Quest and lives on GitHub.

The Problem We're Actually Solving #

Here's what happens when developers use AI agents in an ad-hoc way:

  • Wildly inconsistent quality between sessions
  • Context evaporates when you hit token limits
  • No specs means the generated code becomes your documentation (yikes)
  • Happy path syndrome where everything works perfectly until it doesn't

We've all been there. You get Claude or another AI to generate some code, it looks great, you're feeling productive... and then you realize it doesn't handle errors, has no tests, and you can't remember why you made certain design decisions.

A New Development Workflow #

Quest uses a structured, multi-step process that treats AI agents as development partners. Every feature follows this flow:

Request → Spec → Review → Implement → Test → Document → Approve
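The important property of this flow is that stages are ordered and none can be skipped. A tiny sketch (stage names from the flow above, the enforcement logic is my own illustration):

```python
# The pipeline stages, in order; a feature may not skip ahead.
STAGES = ["Request", "Spec", "Review", "Implement", "Test", "Document", "Approve"]

def advance(current: str) -> str:
    """Return the next stage, refusing to move past Approve."""
    i = STAGES.index(current)
    if i == len(STAGES) - 1:
        raise ValueError("Already approved")
    return STAGES[i + 1]

assert advance("Request") == "Spec"
assert advance("Test") == "Document"
```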

1. Specifications First #

Before any code exists, we create a QEP (Quest Enhancement Proposal) that documents:

  • What problem we're solving and why
  • Complete API design with examples
  • Implementation strategy
  • Test plan
  • Success criteria

These live in Git alongside the code. No external tools. No context loss. Just markdown files that both humans and AI agents can read.

2. Multi-Agent Review #

Different AI agents review different aspects:

  • Spec Review catches design issues before coding starts
  • Code Review ensures implementation matches the spec
  • Test Review validates edge cases are covered

These diverse perspectives prevent the blind spots that single-agent development creates.

3. Human-in-the-Loop #

AI agents propose. Humans decide. We maintain checkpoints where:

  • Owners review specs for completeness
  • Approvers validate implementation quality
  • Testers verify all tests pass

4. Automated Quality Gates #

Nothing merges without:

  • Full test suite passing
  • Documentation updated
  • All checklist items complete
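Mechanically, the gate is just a conjunction: every check must pass before a merge is allowed. A sketch in Python (the function and checklist keys are illustrative, not part of any actual CI setup):

```python
# Hypothetical merge gate: nothing merges unless every check passes.
def can_merge(tests_passed: bool, docs_updated: bool,
              checklist: dict[str, bool]) -> bool:
    return tests_passed and docs_updated and all(checklist.values())

checklist = {
    "spec approved": True,
    "code reviewed": True,
    "edge cases tested": False,
}
print(can_merge(True, True, checklist))  # one unchecked item blocks the merge -> False
```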

What This Enables: A Language in a Week #

With this workflow, Quest went from concept to a working language with:

  • An interactive REPL
  • Pattern matching
  • A module system
  • A production-ready web server built on Rust's Tokio runtime
  • Comprehensive standard library

Here's what Quest code looks like:

# Everything is an object (thanks Ruby!)
let x = 42
x.plus(8)    # => 50
x.times(2)   # => 84
x.str()      # => "42"

# F-strings (thanks Python!)
let name = "Alice"
let age = 30
puts(f"Hello, {name}! You're {age} years old.")

# Pattern matching (thanks Elixir!)
fun describe_age(age)
  match age
    in 0 to 12
      "child"
    in 13 to 19
      "teenager"
    in 20 to 64
      "adult"
    else
      "senior"
  end
end

And building a web server is delightfully simple:


route.get("/page/{slug}", fun (req)
    let slug = req["params"]["slug"]
    let page = db.page.find_by_slug(db.get_db(), slug, true)
    if page == nil
        return not_found_handler(req)
    end
    
    try
        page["content_html"] = markdown.to_html(page["content"])
    catch e
        logger.error("Error converting markdown: " .. e.message())
        page["content_html"] = "<p>Error rendering content</p>"
    end
    
    return render_template("page.html", {
        page: page,
        pages: get_published_pages()
    })
end)

Why This Matters #

The structured approach solves real problems:

  • Consistency - Same request, same quality result
  • Persistence - Knowledge survives session boundaries
  • Confidence - Multiple review stages catch issues
  • Completeness - Checklists ensure nothing is missed
  • Traceability - Git history preserves full context

More importantly, it demonstrates that we can build complex software systems with AI agents when we give them the right workflows and tools.

The Real Innovation #

Quest the language is fun. It's a celebration of features I love from other languages. But Quest the process is what matters. It's a proof of concept that we can:

  • Build genuinely complex software with AI assistance
  • Maintain quality through structured workflows
  • Keep humans in control while leveraging AI capabilities
  • Scale development without sacrificing maintainability

Should You Use Quest? #

For production work? Absolutely not. It's experimental and quirky.

But should you steal the development process? Yes. The workflows, the multi-agent review approach, the structured specs—these patterns work with any language and any AI agent.

The future of software development isn't about AI replacing developers. It's about developing novel workflows where AI agents and humans collaborate effectively. Quest is one experiment in that direction.

And honestly? It's been a blast.


Want to explore more? Check out the full documentation and the GitHub repo. The complete development process is documented in QEP-000: Development Flow.

AI's Missing "Assembly Line Moment"

We keep waiting for AI to transform how we work, but I think we're looking in the wrong place. The bottleneck isn't the technology. It's us.

I've been telling people that what AI is lacking right now is its "Henry Ford moment." Think about the early industrial era: factories had all these powerful new machines that made individual workers more productive, but they were still organized like craftsmen's workshops. It wasn't until Ford introduced the assembly line that we truly unlocked the potential of industrial machinery. The breakthrough wasn't better machines. It was a fundamentally new way of organizing work around those machines.

We're in a similar position with AI today. The tools are remarkably capable, but we're still using pre-AI processes.

Here's what I mean: An intern with AI can build an absolutely amazing prototype by themselves in one day. Give them Claude or ChatGPT, and they'll accomplish what used to take a small team a week. The individual productivity gains are stunning.

But the moment you introduce one other person into the equation (in "meat space"), the overall productivity of that project goes down. Even if you eliminate meetings and code reviews entirely, you still need to have the same conversation with your colleague that you just had with AI. You're constantly repeating yourself, rebuilding context that already exists somewhere else.

Human coordination is the new bottleneck.

The assembly line worked because Ford redesigned the entire production process around the capabilities of machines. He didn't just add machines to existing workflows. He rethought everything. We need the same radical rethinking for AI-augmented work.

What might this look like? I'm not entirely sure yet, but I suspect it involves:

  • Async-first, AI-mediated collaboration where AI helps maintain context across team members without constant meetings
  • New roles and workflows designed specifically for AI-human collaboration, not just "AI as assistant"
  • Shared AI workspaces where context persists and builds rather than being recreated in every conversation
  • Automated handoffs where AI manages the translation between different people's working styles and contexts

The companies that figure this out (that have their "assembly line moment" with AI) won't just be incrementally more productive. They'll be operating in a completely different paradigm.

The technology is ready. The question is: who will be our Henry Ford?

Producing quality code with Agents

Background #

Like most developers working with coding agents, I've found that some processes produce better results than others. After discussing workflows with other developers, I discovered a fascinating variety of approaches.

I recently came across a post on Simon Willison's blog referencing Peter Steinberger, where Peter discusses the evolution of his workflow: he went from handwriting a fairly rigid spec to riffing with Claude to come up with a plan.

My workflow is similar to Peter's, but I'm trying to return to formal specs that live alongside the code. I modeled this on Python Enhancement Proposals (PEPs).

Process #

I maintain a specs/ folder where ALL features live as markdown files (git-ignored for shared projects). I use separate specialized agents: spec writer, spec reviewer, coding agent, code reviewer, bug filer, etc.

Key practice: I run each agent in separate sessions, closing immediately after each task before handing off to the next agent. This produces better results—I believe because previous claims about completeness are gone, allowing the new agent to focus purely on finding improvements.

Novel aspect: Each specification lives in the codebase as a deliverable alongside the code. This keeps all feature context easily available, makes work easy to grade against original intentions, and preserves design decisions and process.

Slash commands #

/spec #

Create a new QEP (Quest Enhancement Proposal) in specs/ directory.

Goals: 

$1

Steps:
 1. [ ] Find the next available QEP number by checking specs/qep-*.md files
 2. [ ] Create specs/qep-NNN-slug.md with proper template
 3. [ ] Fill in Title, Number, Status (Draft), Author, Created date
 4. [ ] Add Motivation, Proposal, and Rationale sections based on description
 5. [ ] Include Examples section with Quest code samples
 6. [ ] Add Implementation Notes section
 7. [ ] Add References section if applicable
 8. [ ] Open the file for user to review and edit

Template structure:
- Title: QEP-NNN: Feature Name
- Metadata: Number, Status, Author, Created
- Sections: Motivation, Proposal, Rationale, Examples, Implementation Notes, References
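The first two steps of /spec (find the next QEP number, create the file from the template) are easy to automate. A sketch in Python, under the assumption of the `specs/qep-NNN-slug.md` naming from above (the helper names are mine):

```python
import re
from pathlib import Path

def next_qep_number(specs_dir: Path) -> int:
    """Scan specs/qep-*.md and return the next free QEP number (step 1)."""
    numbers = []
    for p in specs_dir.glob("qep-*.md"):
        m = re.match(r"qep-(\d+)", p.name)
        if m:
            numbers.append(int(m.group(1)))
    return max(numbers, default=-1) + 1

def create_qep(specs_dir: Path, slug: str, title: str) -> Path:
    """Create specs/qep-NNN-slug.md pre-filled with the template (steps 2-4)."""
    n = next_qep_number(specs_dir)
    path = specs_dir / f"qep-{n:03d}-{slug}.md"
    path.write_text(
        f"# QEP-{n:03d}: {title}\n\n"
        "- Status: Draft\n\n"
        "## Motivation\n\n## Proposal\n\n## Rationale\n\n"
        "## Examples\n\n## Implementation Notes\n\n## References\n"
    )
    return path
```

The agent (or a human) then fills in the sections and opens the file for review.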

/spec-review #

Review QEP specification $1 in specs/ for completeness and quality.

Review checklist:
 1. [ ] Read the QEP-$1 document from specs/
 2. [ ] Verify proper formatting and metadata (Number, Status, Author, Created)
 3. [ ] Check Motivation section clearly explains the problem
 4. [ ] Verify Proposal section has clear, concrete design
 5. [ ] Review Rationale section explains design decisions
 6. [ ] Ensure Examples section has working Quest code samples
 7. [ ] Check Implementation Notes for technical feasibility
 8. [ ] Verify consistency with existing Quest language design
 9. [ ] Check for conflicts with other QEPs or existing features
 10. [ ] Suggest improvements or clarifications needed
 11. [ ] Recommend status change if appropriate (Draft → Accepted → Implemented)

Provide detailed feedback on:
- Clarity and completeness
- Technical soundness
- Backward compatibility concerns
- Edge cases or gotchas
- Documentation needs

/code #

Begin coding $1

 1. [ ] Identify scope and nature of the problem
 2. [ ] Run the full test suite. If there are failures, halt and ask for them to be resolved before continuing.
 3. [ ] Write initial implementation
 4. [ ] Write comprehensive tests in tests/ and run the individual test file to confirm implementation
 5. [ ] Run full test suite to ensure new bugs were not introduced.
 6. [ ] Update docs/ as necessary
 7. [ ] Update README.md / CLAUDE.md if making a fundamental language change (not necessary for bug fixes)
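The test-gate discipline in steps 2 and 5 can be sketched as a small driver: refuse to start on a red baseline, and refuse to finish if the change turned the suite red. This is an illustration only; the function and the stand-in callables are hypothetical, not part of any agent framework.

```python
# Hypothetical driver for the /code checklist's test gates (steps 2 and 5).
def run_with_test_gate(run_tests, implement):
    """Run the suite before and after implementing; halt on any failure."""
    if not run_tests():
        raise RuntimeError("Baseline tests failing; resolve before coding.")
    implement()
    if not run_tests():
        raise RuntimeError("New failures introduced; fix before continuing.")

# Usage with stand-in callables in place of a real test runner and agent:
state = {"implemented": False}
run_with_test_gate(lambda: True, lambda: state.update(implemented=True))
print(state["implemented"])  # True
```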