Five Frameworks, Five Philosophies, Part I

Mic
Mar 16

If you have spent any time building LLM-powered applications, you have probably run into this problem:

LangChain works.

But then someone tells you LlamaIndex is better for RAG. Someone else says DSPy is the future. And then a third person shows up with Haystack.

Suddenly you are not building anything. You are reading framework documentation.

This is the state of the Python LLM ecosystem right now. Several well-funded, actively maintained frameworks competing for roughly the same territory, each with a different philosophy about how LLM applications should be built.

In this first post, we focus on the five frameworks that have established themselves as the serious, full-featured contenders: LangChain, LlamaIndex, Haystack, DSPy, and LangGraph. Each one has a distinct identity, a distinct philosophy, and a distinct sweet spot.

We will look at what each framework actually is, what problem it was designed to solve, and — most importantly — when you should reach for it instead of the others. In a follow-up post, we cover the specialist libraries that fill in the gaps the main frameworks leave behind.

We have already covered LangChain in detail in a previous post. Here we treat it as the baseline and measure everything against it. This part looks at the first two alternatives: LlamaIndex and Haystack.

The Baseline: LangChain

Before we compare, let us quickly restate what LangChain is. LangChain is a general-purpose framework for building LLM-powered applications. Its core idea is composition: you build pipelines from small, reusable pieces using the `|` operator.

Chains written this way are clean and readable, and they integrate with almost everything.
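To see why the `|` style reads so well, here is a plain-Python sketch of the idea. It does not use LangChain at all: the `Step` class and the three toy steps are hypothetical stand-ins for a prompt template, a model call, and an output parser.

```python
from typing import Any, Callable


class Step:
    """Wraps a function so that steps can be chained with the | operator."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # a | b yields a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.fn(self.fn(value)))

    def invoke(self, value: Any) -> Any:
        return self.fn(value)


# Hypothetical stand-ins for a prompt template, a model, and an output parser.
prompt = Step(lambda topic: f"Write one sentence about {topic}.")
fake_model = Step(lambda text: f"MODEL OUTPUT: {text}")
parser = Step(lambda text: text.removeprefix("MODEL OUTPUT: ").strip())

chain = prompt | fake_model | parser
result = chain.invoke("vector databases")
print(result)
```

The point is only the shape: each piece is small, and the chain reads left to right in the order the data flows.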

LangChain's strength is breadth. It has integrations with hundreds of LLM providers, vector databases, tools, and APIs. It also supports agents, memory, retrieval, and structured outputs.

Its weakness? That same breadth. For teams who need deep specialisation — particularly in retrieval or prompt optimisation — LangChain can feel like a general-purpose hammer trying to do neurosurgery.

That is where the alternatives step in.

LlamaIndex: The Retrieval Specialist

LlamaIndex was originally called "GPT Index", which tells you exactly what it was built to do: help LLMs work with your own data.

Where LangChain treats retrieval as one feature among many, LlamaIndex treats it as the entire point.

The core abstraction in LlamaIndex is the index. You load documents, build an index, and query it. The library handles chunking, embedding, retrieval, and re-ranking with far more control than most frameworks offer.

The Basic RAG Pipeline

The simplest possible LlamaIndex setup. Load documents, build a vector index, and start querying.

A few lines. You have a working RAG pipeline. What LlamaIndex does underneath — chunk splitting, embedding calls, cosine similarity retrieval, context assembly — would take quite a few lines to build yourself.
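To get a feel for what is being hidden, here is a rough plain-Python sketch of those four steps. It is not LlamaIndex code. The function names are made up for this illustration, and the "embedding" is just a word count, where a real system would call an embedding model.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 12) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Chunk every document and store each chunk next to its embedding."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]


def retrieve(index: list[tuple[str, Counter]], question: str, top_k: int = 2) -> list[str]:
    """Rank chunks by cosine similarity to the question and keep the best."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def assemble_prompt(question: str, contexts: list[str]) -> str:
    """Context assembly: put the retrieved chunks in front of the question."""
    context = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


documents = [
    "Haystack pipelines connect components explicitly.",
    "LlamaIndex builds an index over your documents and queries it.",
]
question = "How does LlamaIndex query documents?"
index = build_index(documents)
top = retrieve(index, question, top_k=1)
prompt = assemble_prompt(question, top)
print(prompt)
```

Even this toy version needs several helper functions, and it skips everything that makes retrieval hard in practice: real embeddings, overlap between chunks, and re-ranking.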

Controlling the Retriever

The query engine above uses sensible defaults. But LlamaIndex lets you control every stage of retrieval individually. Here we increase how many chunks are fetched and apply a similarity threshold.

This pattern — build a retriever, wrap it in a query engine — is the one you will use most once you move beyond toy examples. The defaults get you started; the retriever parameters get you production quality. We will not expand on the exact options here, but in addition to the number of chunks retrieved, you can also set things like filters if your data has metadata, restrict to specific files, or set similarity cutoffs.
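The knobs mentioned above can be sketched without any framework. The snippet below assumes the similarity scores have already been computed. The names `ScoredChunk`, `select_chunks`, `similarity_cutoff`, and `required_metadata` are chosen for this illustration and are not LlamaIndex's API.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredChunk:
    text: str
    score: float  # similarity to the query, higher is better
    metadata: dict = field(default_factory=dict)


def select_chunks(candidates, top_k=5, similarity_cutoff=0.0, required_metadata=None):
    """Filter by metadata, drop weak matches, then keep the best top_k."""
    required = required_metadata or {}
    kept = [
        c for c in candidates
        if c.score >= similarity_cutoff
        and all(c.metadata.get(key) == value for key, value in required.items())
    ]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:top_k]


# Toy candidates with made-up scores and file names.
candidates = [
    ScoredChunk("Refund policy for 2024", 0.91, {"file": "policy.md"}),
    ScoredChunk("Shipping times", 0.42, {"file": "faq.md"}),
    ScoredChunk("Refund exceptions", 0.78, {"file": "policy.md"}),
    ScoredChunk("Old refund policy", 0.74, {"file": "archive.md"}),
]

selected = select_chunks(
    candidates,
    top_k=3,
    similarity_cutoff=0.7,
    required_metadata={"file": "policy.md"},
)
texts = [c.text for c in selected]
print(texts)
```

Raising `top_k` gives the model more context to work with, while the cutoff and the metadata filter keep weak or irrelevant chunks out of the prompt.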

Retrieval Strategies

Once you move beyond the default setup, different retrievers behave quite differently—and that difference shows up immediately in answer quality.

VectorIndexRetriever is the standard baseline. It uses embedding similarity to return the top-k chunks. Fast and simple, but it often returns fragmented context.

AutoMergingRetriever improves on this by stitching related chunks back together. Instead of isolated snippets, you get more coherent sections—usually leading to better, more reliable answers.

BM25Retriever takes a completely different approach: keyword search. No embeddings, just classic information retrieval. Surprisingly strong when exact terminology matters, especially in technical or structured documents.
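To show why keyword search behaves so differently from embeddings, here is a small sketch of the BM25 scoring formula in plain Python. It is an illustration of the idea rather than the code behind BM25Retriever. The tokenizer, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions made for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Keep underscores so identifiers such as max_retries stay in one piece.
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_scores(query: str, documents: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalised by document length.
            denominator = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denominator
        scores.append(score)
    return scores


documents = [
    "Set max_retries in the client config to control retry behaviour.",
    "The client reconnects automatically after a network failure.",
    "Billing is calculated monthly.",
]
scores = bm25_scores("max_retries config", documents)
best = documents[scores.index(max(scores))]
print(best)
```

A document that does not contain the exact query terms scores zero. That is the strength and the weakness of keyword search: it is precise on exact terminology and blind to paraphrases.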

Sub-Question Decomposition

Complex questions often span multiple documents or topics. LlamaIndex can automatically break a hard question into smaller sub-questions, answer each separately, and synthesise a final response.

The framework queries each index independently, then synthesises the results. You wrote none of that orchestration logic yourself. The whole pipeline then becomes:

User query
  ↓
LLM splits into sub-questions
  ↓
Each sub-question routed to correct index
  ↓
Answers combined
  ↓
Final response
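The flow above can be mimicked in a few lines of plain Python. Everything here is a stand-in. `split_question` replaces the LLM with two keyword rules, the "engines" return canned toy answers, and the final synthesis step is a simple join where a real system would call the LLM again.

```python
def split_question(question: str) -> list[tuple[str, str]]:
    """Hypothetical stand-in for the LLM: returns (index_name, sub_question) pairs."""
    plan = []
    lowered = question.lower()
    if "revenue" in lowered:
        plan.append(("finance", "What was the revenue?"))
    if "headcount" in lowered:
        plan.append(("hr", "What was the headcount?"))
    return plan


# Hypothetical per-topic query engines, reduced to functions with toy answers.
engines = {
    "finance": lambda sub_question: "Revenue: <answer from finance index>.",
    "hr": lambda sub_question: "Headcount: <answer from HR index>.",
}


def answer(question: str) -> str:
    sub_questions = split_question(question)
    # Route each sub-question to the index that was chosen for it.
    partial_answers = [engines[name](sub_q) for name, sub_q in sub_questions]
    # A real system would ask the LLM to synthesise these into one response.
    return " ".join(partial_answers)


final = answer("How did revenue and headcount develop?")
print(final)
```

The value of the framework lies in the parts this sketch fakes: generating good sub-questions and picking the right index for each one.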

Persisting and Reloading an Index

Building an index from scratch on every run is wasteful. LlamaIndex makes it easy to save the index to disk and reload it later.

In practice this is almost always what you want. Build once, query many times. The cold start problem disappears.
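The build-once pattern itself is simple, and a framework-free sketch makes it explicit. The helper names and the JSON file layout below are assumptions for this illustration; LlamaIndex uses its own storage format.

```python
import json
import tempfile
from pathlib import Path


def build_index(documents: list[str]) -> dict:
    """Stand-in for the expensive step (chunking plus embedding calls)."""
    return {"chunks": [{"id": i, "text": d} for i, d in enumerate(documents)]}


def load_or_build(persist_dir: Path, documents: list[str]) -> tuple[dict, str]:
    """Reload the index from disk if it exists; otherwise build and save it."""
    index_file = persist_dir / "index.json"
    if index_file.exists():
        return json.loads(index_file.read_text(encoding="utf-8")), "loaded"
    index = build_index(documents)
    persist_dir.mkdir(parents=True, exist_ok=True)
    index_file.write_text(json.dumps(index), encoding="utf-8")
    return index, "built"


with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp) / "storage"
    first, first_source = load_or_build(storage, ["doc one", "doc two"])
    second, second_source = load_or_build(storage, ["doc one", "doc two"])

print(first_source, second_source)
```

The second call never touches the build step. With real embeddings, that difference is paid in API calls and seconds of startup time.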

A LlamaIndex Agent with Tools

LlamaIndex agents can use query engines as tools, meaning they can decide which index to query, when to search the web, or when to call a function — all driven by the LLM.

The agent decides when to query the documents, when to check the date, and how to combine both into a coherent answer.

When to choose LlamaIndex

You are building a RAG system, a document Q&A tool, or any application where the quality of retrieval is the primary variable to tune.

Haystack: The Production Pipeline Framework

Haystack comes from deepset, a company that builds NLP infrastructure. That origin matters. Haystack was designed from the start for production use cases, not for rapid experimentation.

The core abstraction in Haystack is the pipeline. You build a directed graph of components — readers, retrievers, rankers, generators — and Haystack routes data through them.

A Basic RAG Pipeline

The same task as LlamaIndex above, expressed in Haystack’s explicit component graph style.

Notice the explicit connect() calls. Haystack makes the data flow visible and traceable, which is exactly what you want when something breaks in production.
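To make the idea of explicit wiring concrete, here is a toy pipeline in plain Python. It only loosely mirrors the style described above and is not Haystack's implementation. `MiniPipeline`, `ToyRetriever`, and `ToyPromptBuilder` are invented for this sketch, and components simply run in the order they were added.

```python
class MiniPipeline:
    """Toy pipeline: named components wired together by explicit connections."""

    def __init__(self):
        self.components = {}   # name -> object with run(**inputs) -> dict
        self.connections = []  # (sender, output_name, receiver, input_name)

    def add_component(self, name, instance):
        self.components[name] = instance

    def connect(self, sender, receiver):
        # Both arguments use "component.socket" notation.
        sender_name, output_name = sender.split(".")
        receiver_name, input_name = receiver.split(".")
        self.connections.append((sender_name, output_name, receiver_name, input_name))

    def run(self, inputs):
        # Simplification: no graph sorting, components run in insertion order.
        pending = {name: dict(inputs.get(name, {})) for name in self.components}
        results = {}
        for name, instance in self.components.items():
            results[name] = instance.run(**pending[name])
            for sender_name, output_name, receiver_name, input_name in self.connections:
                if sender_name == name:
                    pending[receiver_name][input_name] = results[name][output_name]
        return results


class ToyRetriever:
    def __init__(self, docs):
        self.docs = docs

    def run(self, query):
        # Match on longer words only, so filler words like "is" are ignored.
        terms = {w.strip("?.,!").lower() for w in query.split() if len(w) > 3}
        return {"documents": [d for d in self.docs if any(t in d.lower() for t in terms)]}


class ToyPromptBuilder:
    def run(self, documents, question):
        context = " | ".join(documents)
        return {"prompt": f"Context: {context}\nQuestion: {question}"}


pipeline = MiniPipeline()
pipeline.add_component("retriever", ToyRetriever(["Paris is in France.", "Berlin is in Germany."]))
pipeline.add_component("prompt_builder", ToyPromptBuilder())
pipeline.connect("retriever.documents", "prompt_builder.documents")

question = "Where is Paris?"
results = pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}})
print(results["prompt_builder"]["prompt"])
```

Because every component's output is kept under its own name, you can inspect `results["retriever"]` on its own when the final prompt looks wrong.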

When the “Simple RAG Example” Isn’t So Simple

Haystack’s pipeline API looks clean on paper: define components, connect them, run a query. In practice, even a minimal example exposes a few sharp edges — not fatal, but very revealing.

The first trap appears before writing any code. Installing

pip install farm-haystack

works — but gives you Haystack v1. The modern API (with Pipeline) only exists in v2, which is installed via:

pip install haystack-ai

Same library name, different package, incompatible APIs. Easy to miss.

Even with the correct package, things can still fail. Using a very new Python version (e.g. 3.14) may lead to:

ImportError: cannot import name 'Pipeline'

Not because the code was wrong, but because the ecosystem hadn’t caught up yet. Dropping back to Python 3.10–3.12 (3.11 worked best) fixed it immediately.
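Both traps can be checked up front with the standard library alone. The sketch below reports which of the two distributions named above is installed and whether the interpreter falls in the 3.10–3.12 range mentioned here. The helper names are made up for this illustration.

```python
import sys
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def python_in_range(version=None, low=(3, 10), high=(3, 12)) -> bool:
    """Check that the (major, minor) version lies between low and high, inclusive."""
    major_minor = tuple(version or sys.version_info)[:2]
    return low <= major_minor <= high


report = {
    "farm-haystack (v1)": installed_version("farm-haystack"),
    "haystack-ai (v2)": installed_version("haystack-ai"),
    "python_ok": python_in_range(),
}
print(report)
```

If both distributions show a version, or `python_ok` is False, that is worth fixing before debugging any pipeline code.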

None of this is dramatic — but it highlights a broader truth:

LLM frameworks are still evolving, and even “hello world” examples often depend as much on environment details as on code correctness.

Writing a Custom Component

One of Haystack’s most useful features is how easy it is to slot your own logic into a pipeline. Any class decorated with @component becomes a first-class pipeline citizen.

This composability is the main reason teams choose Haystack. Every processing step is a named, testable unit. When something goes wrong, you know exactly where to look. Although a custom component takes quite a few lines for a simple example, the code is very easy to parse. You can clearly see how the different components are built, and the connect() calls show the syntax for incoming and outgoing arguments.
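The "testable unit" claim is easy to illustrate. Below is a hypothetical custom step, written as a plain class so that it runs without Haystack installed. In a real pipeline the class would also carry the @component decorator mentioned above. The name `QueryCleaner` and its behaviour are invented for this sketch.

```python
class QueryCleaner:
    """Hypothetical custom step: normalise a user query before retrieval."""

    def __init__(self, max_words: int = 20) -> None:
        self.max_words = max_words

    def run(self, query: str) -> dict:
        # Lowercase, collapse whitespace, and cap the number of words.
        words = query.lower().split()[: self.max_words]
        # Returning a dict of named outputs keeps the step easy to wire up.
        return {"query": " ".join(words)}


# The step can be exercised on its own, with no pipeline and no LLM involved.
cleaner = QueryCleaner(max_words=4)
output = cleaner.run("  What   IS Haystack   used for in production? ")
print(output)
```

Because the step takes named inputs and returns named outputs, a unit test for it is a single function call followed by a comparison.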

Serialising a Pipeline to YAML

Haystack pipelines can be saved as YAML and reloaded without touching any Python code. This matters a great deal in team environments or if you want to use the same pipeline in different instances.

This opens the door to configuration-driven deployment. Swap a GoogleAIGeminiGenerator for a HuggingFaceLocalGenerator by editing a YAML file, not a codebase.
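The mechanism behind configuration-driven deployment is a lookup from a type name in a config file to a class in code. The sketch below uses JSON instead of YAML so that it runs on the standard library alone. The two toy generators, the registry, and the config keys are assumptions for this illustration, not Haystack's actual serialisation format.

```python
import json


class EchoGenerator:
    """Toy generator that returns the prompt with a prefix."""

    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix

    def run(self, prompt: str) -> dict:
        return {"reply": f"{self.prefix}{prompt}"}


class ShoutGenerator:
    """Toy generator that returns the prompt in upper case."""

    def run(self, prompt: str) -> dict:
        return {"reply": prompt.upper()}


# Maps the type name found in the config to the class to instantiate.
REGISTRY = {"EchoGenerator": EchoGenerator, "ShoutGenerator": ShoutGenerator}


def build_from_config(config_text: str):
    config = json.loads(config_text)
    component_class = REGISTRY[config["type"]]
    return component_class(**config.get("init_parameters", {}))


config_a = '{"type": "EchoGenerator", "init_parameters": {"prefix": "> "}}'
config_b = '{"type": "ShoutGenerator"}'

reply_a = build_from_config(config_a).run("hello")["reply"]
reply_b = build_from_config(config_b).run("hello")["reply"]
print(reply_a, reply_b)
```

Swapping the generator means changing one string in the config. The Python code that builds and runs the component stays untouched.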

Evaluating RAG Quality

Haystack ships with an evaluation framework that measures standard RAG metrics out of the box. This is the feature that most clearly sets it apart from the other frameworks.

Running this as part of a CI pipeline means you catch retrieval regressions before they reach users. For a mini example with two questions, you might get the following:

Faithfulness scores: [1.0, 0.0]
Context relevance scores: [1, 1]

This means the context was relevant in both cases, but only the first answer was judged faithful to that context; the second was judged entirely unsupported by it.
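Turning such score lists into a CI check takes very little code. The sketch below averages each metric and compares it with a minimum. The threshold of 0.8 and the helper name `check_scores` are arbitrary choices for this illustration.

```python
from statistics import mean


def check_scores(name: str, scores: list, minimum_mean: float) -> tuple[bool, str]:
    """Return (passed, message) for one metric."""
    average = mean(scores)
    passed = average >= minimum_mean
    status = "ok" if passed else "REGRESSION"
    return passed, f"{name}: mean={average:.2f} (min {minimum_mean:.2f}) {status}"


# The score lists from the example output above.
faithfulness = [1.0, 0.0]
context_relevance = [1, 1]

results = [
    check_scores("faithfulness", faithfulness, minimum_mean=0.8),
    check_scores("context_relevance", context_relevance, minimum_mean=0.8),
]
for _, message in results:
    print(message)

all_passed = all(passed for passed, _ in results)
```

In a CI job you would exit with a non-zero status when `all_passed` is False, so that a drop in faithfulness blocks the merge.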

Evaluation Example: Practical Friction Points

The evaluation pipeline above works correctly, but several version- and API-related issues can occur:

Import paths are version-dependent. Evaluators may be located in different modules depending on the installed Haystack version (e.g. no evaluators.rag in some releases).
Constructor arguments changed. Depending on the version, evaluators expect either a chat_generator parameter or an llm parameter.
Strict input naming. Method signatures must match exactly (e.g. responses vs predicted_answers).
Implicit internal components. Evaluators internally use PromptBuilder, which triggers warnings about missing required_variables. These warnings originate from the framework, not user code.
Hidden LLM dependency. Even though the code looks like pure evaluation logic, an LLM (e.g. Gemini) is required. The evaluators are not simple metrics; they internally call a model to judge faithfulness and relevance.

Branching Pipelines

Haystack pipelines are not limited to linear chains. You can branch on conditions, route queries to different components, and merge streams back together.

Branching unlocks patterns that are very hard to express cleanly in LangChain: intent routing, fallback chains, A/B testing different prompts in the same pipeline.
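Intent routing with a fallback, the first of those patterns, boils down to the following shape. This is a plain-Python sketch rather than Haystack's router components. The `route` helper, the keyword conditions, and the branch names are all invented for this illustration.

```python
def route(query, routes, fallback):
    """Send the query down the first branch whose condition matches."""
    for name, condition, handler in routes:
        if condition(query):
            return name, handler(query)
    # No branch matched, so the fallback branch handles the query.
    return "fallback", fallback(query)


# Each route is (branch name, condition, handler); the handlers are toy stand-ins.
routes = [
    ("billing", lambda q: "invoice" in q.lower(), lambda q: "Routed to billing docs."),
    ("technical", lambda q: "error" in q.lower(), lambda q: "Routed to technical docs."),
]


def general_search(query):
    return "Routed to general search."


branch, reply = route("I got an error on startup", routes, general_search)
other_branch, other_reply = route("What are your opening hours?", routes, general_search)
print(branch, other_branch)
```

In a pipeline framework, each handler would be its own sub-graph of components, and the branches would merge again before the final answer is generated.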

When to choose Haystack

You are building something that will run in production, needs to be maintained by a team, requires evaluation and monitoring, or needs predictable, debuggable behaviour.

Final Thoughts

By now, the uncomfortable truth should be clear: all of these frameworks work, and none of them is universally “the right one.” The differences are not about capability so much as emphasis. LlamaIndex optimises for retrieval quality and iteration speed; Haystack optimises for structure, traceability, and production reliability. They are solving adjacent problems, not competing for a single crown.

What matters is not which framework you pick, but what you are trying to optimise. If you are exploring data and trying to improve answer quality, LlamaIndex will feel natural. If you are building something that needs to be stable, testable, and maintainable by a team, Haystack’s explicit pipelines start to pay off.

In practice, most non-trivial systems drift toward a mix of both ideas anyway. The ecosystem is not converging on a single winner; it is specialising. And that shifts the real skill from “learning a framework” to recognising when a particular abstraction helps—and when it gets in your way.