Realizing the Dream of Natural Language Programming
In Jan 2023, Andrej Karpathy tweeted:
"The hottest new programming language is English."
The sentiment resonated across the AI community, but despite so much progress on the core technology, natural language programming remained unsolved - until Playbooks came along. Why?
The Playbooks story begins earlier - in June 2022, before the tweet, before ChatGPT changed everything. We were experimenting with GPT-3's text-davinci-002 (and, a bit later, text-davinci-003) to build customer support AI agents that needed both prescribed behavior and human-like flexibility.
The predominant approaches at that time were -
- writing imperative programs that define a rigid, step-by-step conversational flow
- visual workflow builders like Google's DialogFlow CX
- "story"-based systems like Rasa
But try specifying nuanced agent behaviors - handling corner cases, adjusting behavior based on preconditions, locale-specific variations, graceful degradation, intent switching and return, side quests. It was impossible to do with any of these approaches.
The available approaches fundamentally constrained or sidestepped the capabilities of the LLM.
So, over the 2022 winter break, Amol started experimenting with natural language programming. Andrej's January 2023 tweet was validation that this is a problem worth solving. Now, after two and a half years, the initial idea of using Markdown as a programming language has evolved into what Playbooks is today: a complete Software 3.0 stack where natural language is not just an interface, but a first-class executable programming language.
The question isn't whether LLMs can understand natural language - they obviously can (let's keep aside the philosophical question of what "understanding" means, especially for AI systems). The puzzle is why, given their remarkable linguistic capabilities, it took so long for us or anyone else to build a true natural language programming system.
It Is a Hard Problem
Building Playbooks has required solving what amounts to a complex, N-dimensional optimization problem, where every dimension influences every other. Getting each dimension 80% right still leaves you with an unusable system. You need simultaneous breakthroughs across multiple fronts.
The Runtime vs. LLM Balance
The first dimension: what should the LLM handle, and what should the outer "runtime" loop around the LLM handle? This seems like a simple engineering decision, but it's anything but. Too much reliance on the LLM, and you get unreliable execution. Too little, and you lose the fluid expressiveness that makes natural language powerful.
Take something as fundamental as execution stack management when one playbook calls another. Initially, we experimented with having LLMs manage this - after all, they can reason about control flow. But after multiple iterations, it became clear: the runtime should own deterministic concerns. The LLM is your CPU for soft logic - the semantic, adaptive parts - while the runtime handles the hard guarantees.
Every facet of Playbooks has to resolve this question. The answers aren't obvious from first principles. The final decisions took countless experiments - building, breaking, and rebuilding.
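To make the division concrete, here is a minimal sketch of such an outer loop in Python. Every name and shape here is a hypothetical illustration, not the actual Playbooks internals; the point is that the runtime owns the call stack and other deterministic bookkeeping, while the LLM is only asked for the next soft-logic step.
from dataclasses import dataclass, field

@dataclass
class Frame:
    # one frame per playbook invocation (hypothetical shape)
    playbook: str
    context: list = field(default_factory=list)

def run(entry_playbook, llm, get_user_input):
    # the runtime owns the call stack - a deterministic concern
    stack = [Frame(entry_playbook)]
    while stack:
        frame = stack[-1]
        # the LLM owns the soft logic: decide the next action for this frame
        action = llm.next_action(frame.playbook, frame.context)
        if action["kind"] == "call":       # push a frame deterministically
            stack.append(Frame(action["target"]))
        elif action["kind"] == "return":   # pop a frame deterministically
            stack.pop()
            if stack:
                stack[-1].context.append(action.get("result"))
        elif action["kind"] == "yield_for_user":
            frame.context.append(get_user_input())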
Language Design: From Natural English to Executable Programs
How do you design a language that feels like natural English but has the expressiveness of a real programming language? This is perhaps the most paradoxical challenge.
The goal of building natural language programs is not new. In fact, traditional programming languages have evolved toward the readability of natural language over time - from machine code to assembly to C to Ruby/Python. But because they are bound by the limitations of purely syntactic processing, they never attained the semantic expressiveness of natural languages.
One argument against using natural language for programming is that it is too imprecise. That's true. But that opens up the opportunity to (un)define language syntax to enable a full range of precision. Playbooks threads this needle by supporting a spectrum of equally supported coding styles. You can write close to natural language:
- Ask user for their order id
- Get order details
- Tell user the status
all the way to Python-like code (with some prompt magic mixed in):
- $order_id:str = Say("user", "What's the order id?")
- $details:str = GetOrderDetails($order_id)
- Say("user", status based on $details)
Both work, as do various points along this precision continuum. A core Playbooks programming guideline emerges: start natural, add explicitness as needed. Somewhat similar to type hints in Python - optional, but helpful when complexity warrants it.
Two-Stage Semantic Processing
Perhaps the most crucial innovation is two-stage semantic processing - first by the compiler, then by the runtime. First, Playbooks programs are compiled to Playbooks Assembly Language (PBAsm). Then the runtime invokes the LLM on the compiled PBAsm.
When you write "Ask user for their name", it compiles to:
01:QUE Say(user, Ask user for their $name:str); YLD for user
The compiler adds type annotations ($name:str), opcodes (QUE for enqueuing calls, YLD for yields, etc.), and explicit yield instructions showing when the LLM should pause.
The assembly code executes more reliably on LLMs than the original instructions. The LLM is given more precise instructions as a strict guardrail, yet they still include natural language where decisions and fluidity are needed. We get auditing, debugging, and verifiable behavior without sacrificing natural language expressiveness.
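Extrapolating from that single compiled line, the earlier three-step order playbook might compile to something along these lines. This is an illustrative sketch, not actual compiler output - in particular, the yield annotations on lines 02 and 03 are our guesses at the format:
01:QUE Say(user, Ask user for their $order_id:str); YLD for user
02:QUE GetOrderDetails($order_id); YLD for call
03:QUE Say(user, Tell user the status based on $details); YLD for exit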
Not Natural Language vs. Python, but Natural Language + Python
Traditional computer code like Python is still very important. There are many cases where natural language is not a viable option at all: interfacing with external systems, large-scale data processing, security, scientific computing, to name a few.
So, one important question was how to best combine natural language and Python. We can't compile natural language down to Python code, because we would lose semantic fluidity. The "tools" that current agent frameworks offer are stateless and feel like a band-aid. So how do we best interface these two worlds? Can we do it on the same call stack, so one can freely intermix natural language and Python logic with shared state?
After much experimentation, we arrived at the notion of using functions as the unit of specification for both paradigms.
The functions that participate in the NL-Python shared call stack are "playbooks".
NL and Python playbooks run on a shared stack with access to the same state variables (global for now). Python playbooks can leverage everything the Python ecosystem has to offer - libraries, type safety, and so on. NL playbooks offer full semantic expressiveness. Best of both worlds! (But there is no free lunch, so expect some compromises - caveat emptor.)
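As a sketch of what this looks like in practice - the import path and the @playbook decorator name are assumptions, and the endpoint URL is a placeholder - a Python playbook is just a registered function:
from playbooks import playbook   # import path is an assumption
import httpx                     # any ordinary Python library works here

@playbook
async def GetOrderDetails(order_id: str) -> dict:
    # ordinary Python: typed, testable, free to use the whole ecosystem
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"https://api.example.com/orders/{order_id}")
        return resp.json()
An NL step like "Get order details" can then resolve to a call to this function on the same stack, with the result landing in shared state (e.g. $details) for later natural language steps to use.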
Context Engineering: The Invisible Architecture
One of the most subtle but critical dimensions when interfacing with LLMs is context engineering. It is a non-trivial exercise that AI agent builders have to contend with. As programs grow longer, agent interactions become more complex, and multiple agents enter the mix, managing LLM context becomes a challenge. Include too much, and you waste tokens and money. Remove or compress something from context, and your agent loses critical information, your prefix cache is blown, and so on.
Playbooks automates this through two key innovations:
Stack-Based Context Management
The framework uses a stack-based approach that automatically compacts context as playbooks complete. Consider this execution:
During execution:
Main → GetOrderStatus → SummarizeOrderStatus (active)
The context includes full execution traces from all three playbooks - instructions, inputs, outputs, intermediate steps.
After SummarizeOrderStatus returns: Its detailed trace is replaced with a concise summary. When GetOrderStatus returns to Main, both traces become a unified summary of GetOrderStatus execution. Main continues with compact context containing only what's essential - dramatically reducing token usage while preserving semantic information.
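A minimal sketch of the idea, with hypothetical names rather than the actual Playbooks internals: each frame accumulates a detailed trace, and popping a frame replaces that trace with a summary appended to the caller's trace.
class Frame:
    def __init__(self, name):
        self.name = name
        self.trace = []   # full instructions, inputs, outputs, steps

class ContextStack:
    def __init__(self):
        self.frames = []

    def push(self, name):
        self.frames.append(Frame(name))

    def pop(self, summarize):
        done = self.frames.pop()
        # replace the detailed trace with a concise summary in the caller
        if self.frames:
            self.frames[-1].trace.append(
                f"{done.name} completed: {summarize(done.trace)}")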
Prompt Caching Optimization
Modern LLM providers cache prompt prefixes - reducing cost and latency by up to 10x for cached tokens. But naive caching strategies break down with dynamic context.
- Compacting the oldest context invalidates the entire cache prefix
- Compacting the newest context preserves cache but may lose relevant information
So, Playbooks compacts the middle, balancing cache efficiency with current execution needs.
As the call stack unwinds, what was "middle" becomes "current," so the framework progressively adjusts context on every LLM call, preserving the full uncompacted version and selectively presenting the optimal slice.
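The selection logic amounts to something like the following sketch (thresholds and names are illustrative): keep the oldest prefix byte-identical so the provider's cache still hits, keep the newest tail intact for current execution, and summarize only the middle.
def build_context(messages, summarize, keep_prefix=20, keep_tail=10):
    # short transcripts need no compaction
    if len(messages) <= keep_prefix + keep_tail:
        return messages
    prefix = messages[:keep_prefix]      # unchanged, so the cache prefix holds
    middle = messages[keep_prefix:-keep_tail]
    tail = messages[-keep_tail:]         # recent steps the LLM still needs
    return prefix + [summarize(middle)] + tail
Because the full uncompacted history is kept aside, this slice can be recomputed on every LLM call as the stack unwinds.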
This happens automatically. You write natural language agents, and the framework takes care of this complexity. At the same time, being a developer-first system, Playbooks gives developers several ways to influence context management: playbook description placeholders, organization of playbooks, raw prompt playbooks, separate agents, artifacts re-loaded as needed, and so on.
The Agent Architecture
Multi-agent systems introduce another layer of complexity. Playbooks treats Agents as classes and playbooks as methods. Some are public (callable by other agents), some aren't. Agents can send messages, call each other's playbooks, even hold multi-party meetings.
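In the markdown form, that structure looks roughly like this (how a playbook is marked public is elided here):
# TriageAgent
Routes customer issues to the right specialist.
## Main
### Steps
- Ask user to describe their issue
- If it is a billing issue, call SupportAgent's HandleBilling playbook
- Otherwise, resolve it directly

# SupportAgent
Resolves billing issues.
## HandleBilling
### Steps
- Review the issue details
- Resolve the issue or escalate to a human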
Many non-obvious decisions went into making behavior clean, programmable, manageable, and controllable. For instance, meeting lifecycles are tied to playbook execution. When an agent returns from its meeting playbook, it exits the meeting. No explicit lifecycle management - the control flow is the protocol. This significantly reduces the complexity of specifying agent behavior.
This feels obvious in retrospect, but getting here required thinking through countless interaction patterns. What happens when Agent A calls Agent B during a meeting? How do you prevent deadlocks? How do you make behavior intuitive without writing a 200-page specification? How do agents handle side conversations through DMs, potentially even ones unrelated to the current meeting?
The Control Flow Paradox
Building AI agents is fundamentally different from building traditional software, introducing yet another dimension to this complex optimization challenge.
In traditional programming, control flow is deterministic. You specify exactly which code path executes under which conditions. In AI agents, control flow is emergent. It arises from the interaction between guidelines, context, and the LLM's reasoning.
This creates unique questions: How do you specify behavior that's both flexible and consistent? How do you enforce guidelines at multiple levels of specificity simultaneously - macro-level personality ("resolve issues professionally"), mid-level procedures ("check order status before refunds"), and micro-level adaptations ("acknowledge frustration before proceeding")? How do you prevent the agent from going off the rails while allowing natural conversation flow? How do you make behavioral constraints composable without creating a tangled web of rules? How do you inject special cases and tips and tricks, keep a constant check on interaction health to recommend human escalation when needed, and satisfy global constraints like "you must ask the user XYZ at least once during the conversation"?
Traditional if-then-else logic can't express this. You're not specifying execution paths - you're specifying behavioral guardrails that the agent must navigate fluidly.
Playbooks handles this through a layered approach: agent descriptions set personality and high-level behavior, playbook steps provide procedural guidance, triggers create reactive patterns, and the Notes section adds business rules - all expressed in natural language that the LLM interprets contextually rather than executes mechanically. Compilation to PBAsm keeps the procedural flow verifiable, while the natural language preserves the semantic flexibility needed for human-like interaction.
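Concretely, one playbook can layer several of these levels in a few lines (the placement of the Notes section here is illustrative):
# SupportAgent
Resolve issues professionally; acknowledge frustration before proceeding.
## ProcessRefund
### Triggers
- When the user asks for a refund
### Steps
- Check order status before offering a refund
- If eligible, confirm the refund amount with the user
### Notes
- Recommend human escalation if the user remains upset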
The LLM Selection Problem
Which LLM should Playbooks support? This isn't just a model selection question - it fundamentally shapes how you build everything else. Different models have different context windows, different instruction-following capabilities, different failure modes. The prompt engineering must adapt to your target LLM.
After extensive testing, we've standardized on Claude Sonnet and PlaybooksLM for now, each requiring specialized prompts and runtime adaptations.
It Really Is a Hard Problem
We have puzzled over this -
Given the remarkable linguistic capabilities offered by LLMs, why hasn't anyone else demonstrated a true natural language programming system like Playbooks?
The honest answer: it's really hard. Even after making the conceptual breakthroughs, building a well-engineered system requires solving the N-dimensional optimization problem described above.
Along the way, we needed four complete rewrites of the system to get here. But we think the effort has been well worth it.
Consider the alternatives:
- Pure prompt chains: No verifiable execution. Works until it doesn't.
- Python-based frameworks (like LangGraph, CrewAI, Pydantic AI): Explicit and type-safe, but they lose natural language fluidity. You're back to imperative programming.
- Visual agent builders (like n8n): Not flexible enough. The agents may start out simple, but once the graph has more than a couple dozen nodes, it becomes very hard to understand what's going on. Real-world enterprise agents require 10x-100x that complexity, which is impossible to manage in a visual tool.
Each solves part of the problem while creating new ones. Playbooks' key insight is the compilation model: natural language source that compiles to structured assembly that LLMs can reliably execute, while retaining behavioral fluidity.
What This Enables
Natural language programming isn't just syntactic sugar. It unlocks fundamentally new capabilities:
Business users can read and approve agent behavior. Your compliance team doesn't need to parse Python. They can read the natural language playbook and understand exactly what the agent will do.
Rapid iteration. Changing agent behavior is editing text, not refactoring code. The semantic compiler ensures correctness.
Natural multi-agent systems. Agents communicate in natural language. Meetings use natural language protocols. The code reads like how humans would coordinate.
Verifiable execution. The PBAsm compilation means you can audit, debug, and verify what the agent did. No more black boxes.
10x less code. A 29-line Playbooks program is equivalent to 300+ lines of LangGraph code.
The Path Forward
We don't need to wait for AGI to build useful AI agents. Natural language programming works today because we've built the right abstractions around current LLM capabilities.
This is Software 3.0: where Python runs on CPU and natural language runs on LLM, on the same call stack. Where business logic is readable by humans and executable by AI. Where the vision of English as a programming language becomes production reality.
The puzzle of "why hasn't someone done this yet" has an answer: because it required solving all these problems simultaneously - runtime design, language design, context engineering, agent architecture, and LLM selection. Get 80% on each dimension, and you have an interesting demo. Get 95%+ on all dimensions simultaneously, and you have a production system.
That's what Playbooks is.
Try It Yourself
pip install playbooks
# Create a simple agent
echo '# GreetingAgent
This agent welcomes users.
## Main
### Triggers
- At the beginning
### Steps
- Introduce yourself and ask user for their name
- Welcome them warmly
- End program' > hello.pb
# Run it
ANTHROPIC_API_KEY=your-key playbooks run hello.pb
The future of programming isn't writing code. It's describing what you want in the language of thought - and having it reliably execute.
Welcome to Software 3.0.
Playbooks AI is open source. Documentation at playbooks-ai.github.io