Five Frameworks, Five Philosophies, Part II

Mic
Mar 19

By the time you have spent a while building LLM applications in Python, you usually reach a familiar stage of mild confusion.

At first, everything seems simple enough. You wire up a model, add a prompt, maybe attach retrieval, maybe call a tool, and for a moment it feels like the ecosystem makes sense.

Then the framework recommendations begin.

One person tells you to use LlamaIndex because retrieval quality is everything. Another insists Haystack is the grown-up choice because production systems need structure. Then someone else appears and says prompt engineering is already obsolete and you should be using DSPy. And just when you think you have regained your footing, LangGraph shows up and informs you that your “agent” is actually an underspecified state machine with commitment issues.

In previous posts, I already covered LlamaIndex and Haystack. Those two frameworks occupy very important parts of the ecosystem. LlamaIndex is strongly focused on retrieval and knowledge access. Haystack is built around explicit, production-friendly pipelines. Both are useful, both are serious, and both make a lot of sense once you understand the problem they were designed to solve.

But they do not cover everything. There are still two big questions left hanging in the air.

First: what if the real problem is not retrieval or orchestration, but the fact that manually tuning prompts is messy, inconsistent, and suspiciously close to educated superstition?

Second: what if the real problem is not generating one response, but managing a multi-step workflow that has to loop, branch, pause, retry, and occasionally wait for a human being to stop it from doing something embarrassing?

That is where DSPy and LangGraph come in.

They do not feel like small variations on the same design. They feel like frameworks built by people who looked at LLM systems and became annoyed by completely different things. DSPy asks how to optimise LLM programs more systematically. LangGraph asks how to make agent workflows explicit, durable, and controllable.

[Image: apparently, this is how ChatGPT sees the difference between the packages.]

So while the earlier posts were about getting information into the model and building solid pipelines around it, this post is about two harder topics: making LLM systems better, and making agent systems behave.

DSPy: The Framework That Treats Prompts Like Parameters

DSPy is the odd one out in this comparison, which is part of what makes it interesting.

Most LLM frameworks are built around making it easier to call models, chain components, retrieve documents, or orchestrate tools. DSPy is not especially interested in any of that by itself. Its main idea is that the way we usually build prompt-based systems is clumsy. We write prompts manually, tweak wording by hand, run a few examples, change three adjectives, feel briefly hopeful, and then repeat the whole ritual the next day.

DSPy looks at that process and asks a rather impolite but fair question: why are we still doing this manually?

Its answer is to replace raw prompt-writing with signatures. Instead of writing a long instruction string, you describe the task in terms of inputs and outputs. DSPy then constructs and optimises the actual prompt behaviour from that higher-level specification.

That shift is small in syntax, but large in philosophy.

Signatures and Predict

The most basic DSPy pattern starts with a signature. A signature describes what goes in, what comes out, and what kind of task this is.

This starts out pretty straightforward. We import the package, set up the model, and declare it as our default. Then DSPy's distinctive feature shows up: the Signature.

This captures the whole DSPy worldview in a few lines. The docstring, field names, and field descriptions are enough for the framework to construct the actual prompt sent to the model.

In other words, you are not really hand-writing prompts anymore. You are specifying behaviour.

The docstring declares the behaviour; in this case, it says the module should answer questions. We then declare a single input field called question, followed by an output field called answer, whose description tells the LLM what we want: a short answer of between 1 and 5 words.

Then we initialise a prediction module based on this signature. Behind the scenes, this essentially does the following:

- take the signature
- turn it into a prompt
- call the configured LLM
- map the response to the declared output

And finally, we run the module, with our input being the question about the capital of France.
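To make the four steps listed above concrete without needing an API key, here is a dependency-free sketch. It is not DSPy code and not how DSPy is implemented internally: `ToySignature`, `build_prompt`, `parse`, `predict`, and the canned `fake_lm` are all hypothetical names invented for illustration, and the prompt layout is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToySignature:
    """Hypothetical stand-in for a signature: task, inputs, outputs."""
    instructions: str
    inputs: dict   # field name -> description
    outputs: dict  # field name -> description


def build_prompt(sig, **values):
    """Step 2: turn the signature plus concrete inputs into a prompt string."""
    lines = [sig.instructions, ""]
    for name, desc in sig.outputs.items():
        lines.append(f"Return a field '{name}': {desc}")
    lines.append("")
    for name in sig.inputs:
        lines.append(f"{name}: {values[name]}")
    for name in sig.outputs:
        lines.append(f"{name}:")
    return "\n".join(lines)


def fake_lm(prompt):
    """Step 3 stand-in: a canned reply so the sketch runs offline."""
    return "answer: Paris" if "France" in prompt else "answer: unknown"


def parse(sig, reply):
    """Step 4: map 'name: value' lines back onto the declared outputs."""
    result = {}
    for line in reply.splitlines():
        name, _, value = line.partition(":")
        if name.strip() in sig.outputs:
            result[name.strip()] = value.strip()
    return result


def predict(sig, lm, **values):
    """Steps 1 to 4 in one call: signature -> prompt -> model -> fields."""
    return parse(sig, lm(build_prompt(sig, **values)))


qa = ToySignature(
    instructions="Answer the question.",
    inputs={"question": "the question to answer"},
    outputs={"answer": "short, between 1 and 5 words"},
)
result = predict(qa, fake_lm, question="What is the capital of France?")
print(result["answer"])
```

The point of the sketch is the division of labour: the caller only ever touches the signature and the input values, while prompt construction and response parsing live in one place where they can be changed without touching the calling code.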

Chain of Thought

Once you have the signature idea in place, you can swap Predict for ChainOfThought and ask the model to reason before it answers.

The interesting part here is not just that the model now produces reasoning and answer separately. Plenty of systems can do that. The interesting part is that this reasoning structure becomes part of the program DSPy can later optimise.

So instead of treating chain-of-thought prompting as a special trick you sprinkle into certain tasks, DSPy treats it as a first-class modelling choice. That is very much its style: less prompt hacking, more program design.
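As a rough, library-free illustration of why this counts as a modelling choice rather than a prompt trick, the sketch below treats chain of thought as nothing more than an extra output field placed before the answer. The helper names (`with_reasoning`, `parse_fields`), the field name `reasoning`, and the canned `fake_lm` reply are assumptions made for this example, not DSPy's API.

```python
def with_reasoning(outputs):
    """Return a new output spec with a 'reasoning' field placed first."""
    return {"reasoning": "think step by step before answering", **outputs}


def parse_fields(reply, outputs):
    """Split a 'name: value' reply into the declared output fields."""
    fields = {}
    for line in reply.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() in outputs:
            fields[name.strip()] = value.strip()
    return fields


def fake_lm(prompt):
    # Canned reply standing in for a real model call.
    return "reasoning: 17 + 25 = 42, and 42 is even.\nanswer: yes"


plain_outputs = {"answer": "yes or no"}
cot_outputs = with_reasoning(plain_outputs)

result = parse_fields(fake_lm("Is 17 + 25 even?"), cot_outputs)
print(list(cot_outputs))   # the order in which the fields are requested
print(result["answer"])
```

Because the reasoning step is part of the declared structure, swapping it in or out is a one-line change to the program rather than a rewrite of a prompt string.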

Building a Module

This gets more useful once you stop thinking in terms of one call and start thinking in terms of reusable units.

DSPy modules allow you to combine multiple signatures into a larger component that can still be optimised as one program.

This is the point where DSPy starts to feel less like a prompting library and more like a framework for building trainable LLM pipelines. The summary step and the sentiment step are not just chained together. They form a unit, called a module, that can later be compiled and improved.

In the initialisation, we see that the module contains two predictors. The forward method then tells us what happens when we call the module with an input: the result of the first predictor is plugged into the second, and the method returns a prediction made up of our three results.
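The shape of such a module can be sketched in plain Python. Everything here is a stand-in: `summarise` and `classify_sentiment` are rule-based fakes playing the role of the two predictors, `SummariseThenClassify` and `Prediction` are hypothetical names, and which three fields the prediction carries (input text, summary, sentiment) is my assumption.

```python
from dataclasses import dataclass


def summarise(text):
    """Stand-in for the first predictor: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


def classify_sentiment(summary):
    """Stand-in for the second predictor: a crude keyword check."""
    positive = {"great", "good", "love", "excellent"}
    words = {w.strip(".,!").lower() for w in summary.split()}
    return "positive" if words & positive else "negative"


@dataclass
class Prediction:
    text: str
    summary: str
    sentiment: str


class SummariseThenClassify:
    """Two sub-steps held by one object, so they can be treated as a unit."""

    def __init__(self, summariser, classifier):
        self.summariser = summariser
        self.classifier = classifier

    def forward(self, text):
        summary = self.summariser(text)       # first predictor
        sentiment = self.classifier(summary)  # second predictor uses its output
        return Prediction(text=text, summary=summary, sentiment=sentiment)

    def __call__(self, text):
        return self.forward(text)


module = SummariseThenClassify(summarise, classify_sentiment)
pred = module("The new release is great. Setup took a while, though.")
print(pred.summary, pred.sentiment)
```

Holding both steps inside one object is what makes it possible to evaluate, and later optimise, the pipeline as a whole rather than each call in isolation.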

That is really the promise DSPy makes: not that you will never touch prompts again, but that prompt quality becomes something you can approach more systematically than “let me reword this sentence for the ninth time and hope.”

Optimising with BootstrapFewShot

This is where DSPy earns the label compiler.

Given a set of examples and a metric, DSPy can search for better prompting strategies automatically. Instead of manually crafting few-shot prompts, you let the framework discover examples that improve performance on the task you care about.

We have a simplified version of our signature from before, plus a training set: a list of questions, each paired with its gold-standard answer. Then we set up a comparison metric, in our case a simple exact match. The final predictor is then compiled from the signature and the training set.

The core idea is straightforward: if you can define what better performance looks like, then at least some part of prompt construction should be automatable. DSPy is built around that assumption.

And to be fair, it is a compelling one. Hand-tuning prompts can work, but it rarely scales gracefully. Once you have labelled data and a metric, systematic optimisation starts to look much more attractive than prompt poetry.
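To show the underlying idea in miniature, here is a heavily simplified, dependency-free sketch of bootstrapping few-shot demonstrations: run a "teacher" over the training set, keep only the examples where the metric passes, and prepend those to future prompts. This is not the BootstrapFewShot implementation. The function names, the `max_demos` parameter, and the canned `fake_teacher` are hypothetical, and the real optimiser does considerably more.

```python
def exact_match(example, prediction):
    """Metric: the prediction counts only if it equals the gold answer."""
    return prediction.strip().lower() == example["answer"].strip().lower()


def fake_teacher(question):
    """Stand-in for an uncompiled predictor; right on some questions only."""
    canned = {
        "What is the capital of France?": "Paris",
        "What is 2 + 2?": "4",
        "Who wrote Hamlet?": "Marlowe",  # deliberately wrong
    }
    return canned.get(question, "unknown")


def bootstrap_demos(teacher, trainset, metric, max_demos=2):
    """Run the teacher over the training set; keep passing runs as demos."""
    demos = []
    for example in trainset:
        prediction = teacher(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"], "answer": prediction})
        if len(demos) == max_demos:
            break
    return demos


def compile_prompt(demos, question):
    """The 'compiled' prompt: selected demos first, then the new question."""
    shots = [f"question: {d['question']}\nanswer: {d['answer']}" for d in demos]
    return "\n\n".join(shots + [f"question: {question}\nanswer:"])


trainset = [
    {"question": "Who wrote Hamlet?", "answer": "Shakespeare"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is 2 + 2?", "answer": "4"},
]
demos = bootstrap_demos(fake_teacher, trainset, exact_match)
print(compile_prompt(demos, "What is the capital of Italy?"))
```

Note that the Hamlet example never makes it into the prompt: the metric filters it out because the teacher got it wrong. That filtering step is the part that hand-written few-shot prompts usually skip.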

RAG as a DSPy Module

DSPy can also fold retrieval into the same module-oriented framework.

Here we set up a mock retriever; in a real application, you would use a vector store here. It simply takes a query and returns the strings whose dictionary key appears in the query. We of course have a signature, which in this case has two input fields, since we pass in context as well as a question. The retriever and a ChainOfThought predictor are then combined into a module whose forward method first fetches the context and then calls the predictor with that context and the original question.

This is a useful example because it shows that DSPy is not limited to pure prompting tasks. You can combine retrieval and generation inside the same optimisable program, then tune the system against actual examples, especially if you combine this RAG setup with the bootstrapping from before.
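A plain-Python sketch of that structure follows. The document store, the `mock_retrieve` and `fake_generate` functions, and the `ToyRAG` class are hypothetical names for illustration; `fake_generate` is a rule-based stand-in for the ChainOfThought predictor, so no model is called.

```python
DOCS = {
    "france": "Paris is the capital of France.",
    "italy": "Rome is the capital of Italy.",
    "dspy": "DSPy describes tasks with signatures.",
}


def mock_retrieve(query, k=2):
    """Return up to k passages whose dictionary key appears in the query."""
    hits = [text for key, text in DOCS.items() if key in query.lower()]
    return hits[:k]


def fake_generate(context, question):
    """Stand-in for the reasoning predictor: answers only from the context."""
    if not context:
        return {"reasoning": "No passage was retrieved.", "answer": "unknown"}
    passage = context[0]
    return {
        "reasoning": f"The passage says: {passage}",
        "answer": passage.split(" is ")[0],
    }


class ToyRAG:
    """Retriever and generator bundled into one callable unit."""

    def __init__(self, retrieve, generate):
        self.retrieve = retrieve
        self.generate = generate

    def forward(self, question):
        context = self.retrieve(question)  # step 1: fetch context
        return self.generate(context=context, question=question)  # step 2


rag = ToyRAG(mock_retrieve, fake_generate)
print(rag.forward("What is the capital of France?")["answer"])
```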

That becomes especially attractive once you are dealing with tasks where prompt quality, reasoning structure, and grounding all matter at the same time. Instead of treating retrieval and generation as two unrelated headaches, DSPy encourages you to think of them as one program with measurable behaviour.

When to choose DSPy:

DSPy makes the most sense when you have labelled examples, care a great deal about output quality, and want something more systematic than manual prompt iteration. It is especially attractive for research workflows, reasoning-heavy systems, and tasks where “good enough after some fiddling” is not a satisfying engineering strategy.

From Optimisation to Orchestration

If DSPy is concerned with improving what the model does, LangGraph is concerned with managing what the system does.

That distinction is worth keeping in mind, because LangGraph solves a very different problem. It is not mainly about prompt quality. It is about control flow, persistence, retries, branching logic, and state.

This is where the conversation shifts from “how do I get better answers?” to “how do I build an agent that does not wander off into the bushes the first time a workflow becomes complicated?”

LangGraph: The Framework for Agents That Need Structure

LangGraph comes from the LangChain ecosystem, but it is not just LangChain with a different outfit.

It exists because simple chains are not enough for many agent workflows. Once your system needs loops, tool retries, multi-step planning, persistent memory, approval checkpoints, or resumable execution, a linear pipeline starts to feel very inadequate.

LangGraph’s answer is to model the whole thing as a graph.

Nodes represent steps. Edges represent transitions. State is passed between nodes and can be persisted at each stage. That sounds more formal than a lot of agent demos on the internet, which is precisely the point.

A Minimal State Graph

A LangGraph application begins by defining a state schema and some nodes that read from and write to it.

This example is small, but it shows the central idea. Let's quickly go through what happens here.

We first define the AgentState, which is the shape of the shared state for our StateGraph later on. Think of it as the memory of the graph. Whenever a node is activated, it receives the current state, does some work, and then potentially updates the state. In our case, the state holds a question, an answer, and the number of attempts.

Then we define a node. Our example is very simple, as our graph will contain only this one node, called answer_node. It takes the state as a parameter and returns the state. Inside the node we define the model and invoke it with the question, as is standard in LangChain, and then update the answer and the attempt count in our state.

Then we define a routing, or control, function. It checks whether our answer is too short and whether we still have retries left. If both hold, it returns "retry"; otherwise it returns "done". We kept it simple here with plain strings; you should probably use Enums instead.

Then the real graph part starts: we build the graph. We add our single node, named answer, and plug in the answer_node defined before. Then we tell the graph where to enter, which in this case is the answer node. When you later invoke the whole graph, this is where it starts.

Then comes the main feature: we add an edge, but with a condition. If the should_retry function returns "retry", we go back to the answer node; if it returns "done", we end, and the current state holds our final answer.

Then we compile the graph, invoke it with our question, and print the final answer.

Note that we do not tell the answer node that its previous answer was too short. We could do this by updating the question attribute of the state.
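The control flow of that walkthrough can be reproduced with nothing but the standard library. The sketch below is not LangGraph: `run_graph` is a hypothetical ten-line runner, `END` is a made-up sentinel, `fake_model` replaces the chat model, and the "too short" threshold of three words and the cap of three attempts are assumptions.

```python
from typing import TypedDict

END = "__end__"  # hypothetical sentinel marking the terminal transition


class AgentState(TypedDict):
    question: str
    answer: str
    attempts: int


def fake_model(question, attempt):
    """Stand-in for the chat model: terse first, fuller on the second try."""
    return "Paris." if attempt == 0 else "The capital of France is Paris."


def answer_node(state: AgentState) -> AgentState:
    """Read the state, do some work, and return the updated state."""
    answer = fake_model(state["question"], state["attempts"])
    return {**state, "answer": answer, "attempts": state["attempts"] + 1}


def should_retry(state: AgentState) -> str:
    """Routing function: retry while the answer is short and tries remain."""
    too_short = len(state["answer"].split()) < 3
    return "retry" if too_short and state["attempts"] < 3 else "done"


def run_graph(nodes, entry, routers, routes, state):
    """Tiny runner: execute a node, ask its router for a label, follow the edge."""
    current = entry
    while current != END:
        state = nodes[current](state)
        current = routes[current][routers[current](state)]
    return state


final = run_graph(
    nodes={"answer": answer_node},
    entry="answer",
    routers={"answer": should_retry},
    routes={"answer": {"retry": "answer", "done": END}},
    state={"question": "What is the capital of France?", "answer": "", "attempts": 0},
)
print(final["answer"], final["attempts"])
```

The first attempt produces a one-word answer, the router sends the flow back to the same node, and the second attempt passes the length check, so the run ends after two attempts.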

A ReAct Agent from Scratch

One of the best ways to understand LangGraph is to build a ReAct-style agent explicitly.

We will not go through every single piece of this script, but the key point is that the model and its ability to call tools now live in two different nodes. The model node can declare that it wants to use a tool by including a tool_calls attribute in the last message of the graph state; the graph then switches over to the tools node. In that node, the tool is explicitly called, and a ToolMessage containing the tool's result is appended to the current state. Note that from the tools node we always move back to the model node.
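Stripped of the framework, the loop looks roughly like this. The sketch uses plain dictionaries instead of LangChain message classes, a scripted `model_node` instead of a real model, and a single hypothetical `add` tool; all of these are assumptions made so the example runs offline.

```python
def add(a, b):
    """The single tool this toy agent is allowed to call."""
    return a + b


TOOLS = {"add": add}


def model_node(state):
    """Fake model: request the tool once, then answer from the tool result."""
    last = state["messages"][-1]
    if last["role"] == "tool":
        reply = {"role": "ai", "content": f"The result is {last['content']}.", "tool_calls": []}
    else:
        call = {"name": "add", "args": {"a": 2, "b": 3}}
        reply = {"role": "ai", "content": "", "tool_calls": [call]}
    return {"messages": state["messages"] + [reply]}


def tools_node(state):
    """Execute every requested tool call and append one tool message per call."""
    results = []
    for call in state["messages"][-1]["tool_calls"]:
        output = TOOLS[call["name"]](**call["args"])
        results.append({"role": "tool", "name": call["name"], "content": str(output)})
    return {"messages": state["messages"] + results}


def route_after_model(state):
    """Go to the tools node only if the last message asks for a tool."""
    return "tools" if state["messages"][-1]["tool_calls"] else "end"


state = {"messages": [{"role": "human", "content": "What is 2 + 3?"}]}
node = "model"
while node != "end":
    if node == "model":
        state = model_node(state)
        node = route_after_model(state)
    else:
        state = tools_node(state)
        node = "model"  # the tools node always hands control back to the model

print([m["role"] for m in state["messages"]])
print(state["messages"][-1]["content"])
```

The message history ends up as human, ai (tool request), tool (result), ai (final answer), which is the think-act-observe cycle written out as data.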

What this buys you is visibility. The think-act-observe loop is not buried inside a convenience wrapper. You can see it, inspect it, and modify it.

That may sound less glamorous than “one-line autonomous agent,” but in practice it is often far more useful. Once something fails, or loops badly, or calls the wrong tool, transparency becomes much more valuable than magical elegance. It also means that, in a more involved graph, you can restrict which tools the system is allowed to call at which point.

Checkpointing and State Persistence

One of LangGraph’s strongest features is that state can be persisted between invocations.

In this example, our graph is even simpler than the one at the start: a single node, which is where we start and where we end. The node simply takes the input, invokes the model, and appends the answer to the state.

But the important part is the checkpointer, here a MemorySaver(). Together with a thread_id, it gives the graph a memory. When the graph is invoked a second time, it remembers the state linked to that thread_id.
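The mechanism is easy to imitate in a few lines, which helps show what the thread_id is actually doing. In this sketch, `InMemoryCheckpointer`, `chat_node`, and `invoke` are hypothetical names; it stores one state per thread in a dictionary and is only an analogy for the real checkpointer, not a description of how MemorySaver works internally.

```python
import copy


class InMemoryCheckpointer:
    """Hypothetical stand-in for a checkpointer: one saved state per thread."""

    def __init__(self):
        self._store = {}

    def load(self, thread_id):
        return copy.deepcopy(self._store.get(thread_id, {"messages": []}))

    def save(self, thread_id, state):
        self._store[thread_id] = copy.deepcopy(state)


def chat_node(state):
    """Fake model: reports how many user turns it has seen so far."""
    turns = sum(1 for m in state["messages"] if m["role"] == "human")
    reply = {"role": "ai", "content": f"I have seen {turns} message(s) from you."}
    return {"messages": state["messages"] + [reply]}


def invoke(checkpointer, thread_id, user_text):
    """Load the thread's state, run the single node, and save the result."""
    state = checkpointer.load(thread_id)
    state["messages"].append({"role": "human", "content": user_text})
    state = chat_node(state)
    checkpointer.save(thread_id, state)
    return state["messages"][-1]["content"]


memory = InMemoryCheckpointer()
print(invoke(memory, "thread-1", "Hello."))
print(invoke(memory, "thread-1", "What did I just say?"))
print(invoke(memory, "thread-2", "Hello from a new thread."))
```

The second call on thread-1 sees two user messages because the first turn was restored from the store, while thread-2 starts from an empty history.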

This is one of the clearest dividing lines between a toy agent and a serious workflow system. Once you need durable threads, resumable execution, or long-lived multi-turn behaviour, explicit checkpointing stops being an advanced feature and starts being basic infrastructure.

LangGraph is designed with that reality in mind.

Multi-Agent Supervisor Pattern

LangGraph also makes multi-agent designs much easier to express cleanly.

In this setup, our first node decides, via a conditional edge, whether the writer node or the researcher node is the appropriate next step. This is a very simple supervisor pattern, but the architecture scales well. Add more specialist nodes, more nuanced routing, tool-enabled sub-agents, or approval steps, and you have the outline of a real multi-agent system.
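Reduced to its skeleton, the supervisor pattern is a routing decision followed by a dispatch. The sketch below replaces the supervising model with keyword matching, which is purely an assumption for the sake of a runnable example; the node names follow the text, while `WORKERS`, `run`, and the `handled_by` field are hypothetical.

```python
def supervisor(state):
    """Fake supervisor: keyword routing in place of an LLM decision."""
    task = state["task"].lower()
    needs_research = any(w in task for w in ("find", "research", "look up"))
    return "researcher" if needs_research else "writer"


def researcher(state):
    """Specialist node for information gathering (stubbed output)."""
    return {**state, "result": f"[research notes on: {state['task']}]", "handled_by": "researcher"}


def writer(state):
    """Specialist node for producing text (stubbed output)."""
    return {**state, "result": f"[draft text for: {state['task']}]", "handled_by": "writer"}


WORKERS = {"researcher": researcher, "writer": writer}


def run(task):
    """Ask the supervisor where to go, then hand the state to that worker."""
    state = {"task": task}
    next_node = supervisor(state)
    return WORKERS[next_node](state)


print(run("Research recent LangGraph releases")["handled_by"])
print(run("Write an intro paragraph")["handled_by"])
```

Adding a specialist then means adding one function and one routing rule, which is the property that lets the pattern grow without the control logic becoming tangled.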

LangGraph is particularly good at this kind of explicit orchestration. Instead of pretending one model call can elegantly do everything, it encourages you to separate responsibilities and model the handoffs properly.

Human-in-the-Loop Interrupts

This is where LangGraph becomes especially practical.

A lot of agent workflows should not be fully autonomous. They should pause before sending, publishing, purchasing, approving, or changing something important. LangGraph has built-in support for that pattern.

This is a slightly longer example, but it shows a very simple concept. The first node, the draft node, creates a short text about the topic. Then we reach the review node and the process interrupts, showing us the draft and waiting for our input. After the input, the process resumes at the exact point where it was interrupted, with the return value of the review node now being our decision. The flow is then routed either back to drafting or on to the publish node, where the draft becomes the final output.
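Python generators offer a compact analogy for this pause-and-resume behaviour: `yield` plays the role of the interrupt, and `send` plays the role of resuming with the human's decision. This is only an analogy under my own assumptions. A generator keeps its state in memory, whereas LangGraph relies on persisted state, and the names `workflow` and `resume` and the decision strings are hypothetical.

```python
def workflow(topic):
    """Generator-based sketch: `yield` is the interrupt, `send` is the resume."""
    attempt = 0
    while True:
        attempt += 1
        draft = f"Draft {attempt} about {topic}"  # draft node (fake model)
        decision = yield draft                     # review node: pause for a human
        if decision == "approve":
            return f"PUBLISHED: {draft}"           # publish node


def resume(run, decision):
    """Send the human decision in; return ('paused', draft) or ('done', output)."""
    try:
        return "paused", run.send(decision)
    except StopIteration as finished:
        return "done", finished.value


run = workflow("state machines")
first_draft = next(run)                   # runs until the first interrupt
status, payload = resume(run, "revise")   # rejected: routed back to drafting
print(status, payload)
status, payload = resume(run, "approve")  # approved: routed to publish
print(status, payload)
```

The first decision sends the flow back to drafting and the run pauses again with a second draft; the second decision lets it through to publishing.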

That single interrupt mechanism captures one of LangGraph’s core strengths. It treats human review as part of the workflow, not as an awkward patch added later. For real systems, that is often exactly what you want.

When to choose LangGraph:

LangGraph is the right fit when your system needs loops, retries, branching logic, durable state, multi-step coordination, or human approval checkpoints. It is especially useful once your agent is no longer just answering questions and has started behaving more like a workflow engine with opinions.

Two Frameworks, Two Different Kinds of Maturity

What makes DSPy and LangGraph especially interesting is that both of them signal a more mature phase of LLM development, but in different ways.

DSPy reflects the idea that prompting should become more systematic, measurable, and optimisable.

LangGraph reflects the idea that agents should become more explicit, stateful, and controllable.

Neither framework is mainly trying to impress you with how little code you can write. They are both more concerned with what happens after the demo works and the real engineering problems begin.

That is why they do not feel like direct substitutes for LlamaIndex or Haystack. They sit at different layers of the problem.

LlamaIndex helps when retrieval is the bottleneck.

Haystack helps when production pipelines are the bottleneck.

DSPy helps when prompt and reasoning quality are the bottleneck.

LangGraph helps when workflow complexity and state are the bottleneck.

Once you look at the ecosystem through that lens, the apparent chaos starts to look a bit more reasonable.

Closing Thoughts

Taken together, these frameworks show how quickly the ecosystem is moving from simple demos toward more deliberate system design.

In the earlier posts, LlamaIndex and Haystack covered retrieval and pipelines. Here, DSPy and LangGraph shift the focus to optimisation and workflow control. That makes them more demanding, but also more relevant once an application grows beyond a neat prototype.

At that point, the challenge is usually not getting a model to produce an answer at all. It is getting the larger system to produce reliable behaviour without constant manual intervention.

That is where these newer ideas start to matter.