Building Tools for LLMs vs Building LLM-Powered Tools

Mic
Jun 2

This post is a simple explanation of how LLMs can show up in applications and how those ways differ. Since everyone today says that their app “uses AI”, we want to go through the two main structures by which this is usually orchestrated.

One is a regular piece of software that happens to call an LLM at some point, whether to summarise a paragraph, classify a ticket, or generate a friendlier error message. That does not mean the call is always needed: quite a few apps use LLM calls as a lazy workaround instead of putting in a more extensive and controllable codebase for checking. But the important bit is that the control flow is yours. The LLM is a component, and in many ways no different in kind from a database call or a regex. The latter is an especially good example: you can either use a regex to extract an email address from a text passage, or you can send the passage to an LLM and ask it to do it. The result is pretty much the same, if done correctly.
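To make the regex side of that comparison concrete, here is a minimal sketch. The pattern is deliberately simplistic and is an assumption for illustration, not a complete email validator; the function name is hypothetical.

```python
import re
from typing import Optional

# Deliberately simple pattern: good enough for illustration, not RFC-complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_email(text: str) -> Optional[str]:
    """Return the first email-like substring in text, or None if there is none."""
    match = EMAIL_RE.search(text)
    return match.group(0) if match else None
```

The LLM variant would replace the body with a single model call and a prompt such as “return the email address in this passage”. The surrounding code, and who decides when it runs, stays the same.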

The other is software where the LLM is the control flow: it decides what happens next, which functions get called, in what order, and when to stop. Your code becomes a toolbox the model reaches into, on its own schedule and for its own reasons.

These get talked about as if they’re the same structure, often under the same buzzword, but they’re not. And the confusion causes real problems in both directions: people build agentic loops for tasks that needed a fifteen-line script, and people hardcode rigid pipelines for tasks that actually needed a model making judgment calls about what to do next.

This post is about telling the two apart and about a habit that pays off either way, regardless of which one you’re building: writing functions that are describable, not just callable.

We’re not going to touch MCP yet. That’s the next post, and it builds directly on this one. First, we look at the fundamentals and the differences, because the protocol only matters once the thing it’s transporting is well-designed and understood.

ChatGPT can definitely understand the difference between its own model's uses

Two Shapes, One Buzzword

Let’s make the distinction concrete before getting into design details, because the rest of this post only makes sense once the shapes are clear.

Shape A — LLM as component. Your application has a fixed shape, for example requests come in, flow through a known sequence of steps, and produce a result. Somewhere in that sequence, one or more steps happen to be “ask a model to do something.” The model’s output varies, but the structure doesn’t.

Shape B — LLM as orchestrator. Your application hands the model a goal and a set of capabilities. The model decides which capabilities to use, in what order, how many times, and when it’s done. The structure itself is determined at runtime, by the model, based on what it encounters.

A useful gut-check: if you wrote out your application as a flowchart, would the boxes and arrows be the same for every input, with only the content inside the boxes changing? Then this is Shape A. If, on the other hand, the arrows themselves can change depending on the input and on what happens at runtime, that’s Shape B. By “change” we really mean change completely: an arrow that merely exists or not depending on a certain check is most likely just a conditional somewhere. But if the order in which the arrows are followed changes, that is a good indicator for Shape B.

Many production systems that “use AI” today are Shape A, often disguised. A support ticket summarizer, a code review commenter, a translation layer, a “rewrite this email to be more polite” feature, many of these are Shape A. One or two LLM calls, a mostly fixed order, deterministic glue code around them.

Shape B is less common in production than the hype suggests, but it’s where most of the interesting engineering challenges live and it’s the shape that makes “tools for LLMs” a first-class design problem rather than an afterthought.

The Function Is the Contract

Here’s a function prototype that checks whether a username is available:
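A minimal sketch of what such a prototype might look like. The in-memory set is a hypothetical stand-in for a real user store, and the exact wording of the docstring is an assumption.

```python
# Hypothetical stand-in for a real user database.
TAKEN_USERNAMES = {"alice", "bob"}


def check_username_available(username: str) -> bool:
    """Check whether a username is free to register.

    Returns True if no existing account uses this username, False otherwise.
    """
    return username.lower() not in TAKEN_USERNAMES
```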

If you’re calling this yourself, the docstring is decoration, albeit a very useful one. You wrote the function and you know what it does; the docstring is for future-you, a teammate, or your IDE’s autocomplete.

If an LLM is calling this, the docstring is no longer decoration. It’s the only information the model has about what this function does, when to use it, what arguments it expects, and what it returns. The model can’t read your implementation; by design, it usually never sees the function body at all. It only reads the description and then asks for the function to be called with arguments it has chosen, and your Python code is what actually runs it.

This changes how you write functions, and the change is bigger than it first appears. Compare three versions of the same function:
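The three versions could be sketched as follows. Only the docstrings differ; the function bodies are identical. The numbered names and the in-memory set are hypothetical, chosen so the three variants can sit side by side.

```python
TAKEN = {"alice", "bob"}  # hypothetical stand-in for a user database


def check_username_available_v1(username: str) -> bool:
    """Checks username."""
    return username.lower() not in TAKEN


def check_username_available_v2(username: str) -> bool:
    """Check whether a username is available for registration.

    Args:
        username: The username to check.

    Returns:
        True if the username is available, False otherwise.
    """
    return username.lower() not in TAKEN


def check_username_available_v3(username: str) -> bool:
    """Check whether a username is currently free to register.

    The comparison is case-insensitive: "Alice" and "alice" count as the
    same name. This is a point-in-time check and does not reserve the name,
    so another user may still claim it before registration completes.
    Calling it again right away gives no extra certainty.

    Args:
        username: The desired username.

    Returns:
        True if no account currently uses the name, False otherwise.
    """
    return username.lower() not in TAKEN
```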

Version 1 tells the model nothing it couldn’t guess from the function name — and a model that can’t tell what a function does from its description will either avoid calling it, or call it at the wrong moment.

Version 2 is the level most people stop at. It’s technically descriptive but leaves out the two things that matter most for an agent’s decision-making: edge-case behaviour (what about case sensitivity?) and temporal scope (is this a one-time check or does it need re-checking?).

Version 3 answers questions the model would otherwise have to guess at — or worse, get subtly wrong in a way that’s hard to spot in testing. That “does not reserve the name” caveat is the difference between a model that calls this once, gets True, and proceeds confidently versus a model that, three steps later, calls it again “just to be safe,” burning an extra round trip and possibly confusing your rate limits.

The function signature, the docstring, the parameter names, the return type: together they essentially form a contract. Not a contract between two pieces of code, which is what we’re used to from type systems and interfaces, but a contract between your code and a model’s probabilistic understanding of your code. That’s a stranger kind of interface than the more classical one, and it’s worth treating as its own skill. In addition, the enforcement of the contract is looser. Two pieces of code usually have a very restrictive interface between them: a function querying a database, for example, will not invent a new data structure to return. An LLM has more freedom in what it puts in, so the better the description, the lower the chance that it puts in something unexpected.

A Second Example: Parameters Are Important Too

This is not exclusive to docstrings. Parameter design matters just as much, and the failure modes are sneakier because the function still works; it just gets called wrong.
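A sketch of the kind of signature at issue. The function name, the booking data, and the US-style format chosen inside the body are all assumptions for illustration.

```python
from datetime import date, datetime

# Hypothetical data: booking id -> booking date.
BOOKINGS = {"b1": date(2026, 1, 3), "b2": date(2026, 3, 1)}


def list_bookings(start: str, end: str) -> list[str]:
    """List bookings between start and end."""
    # The expected format is a silent assumption: "03/01/2026" is read as
    # 1 March here, even if the caller meant 3 January.
    first = datetime.strptime(start, "%m/%d/%Y").date()
    last = datetime.strptime(end, "%m/%d/%Y").date()
    return [bid for bid, day in BOOKINGS.items() if first <= day <= last]
```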

A model calling this has to guess the date format. Maybe it sends "2026-03-01" or "March 1, 2026", or in the worst case "03/01/2026", which is ambiguous and useless even to most humans. Your function might handle some of these and silently misparse others.
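A typed variant might look like this. It is a sketch under the same assumptions: the function name and the booking data are hypothetical.

```python
from datetime import date

# Hypothetical data: booking id -> booking date.
BOOKINGS = {"b1": date(2026, 1, 3), "b2": date(2026, 3, 1)}


def list_bookings(start: date, end: date) -> list[str]:
    """List booking IDs whose date falls in the inclusive range [start, end].

    Args:
        start: First day of the range (inclusive).
        end: Last day of the range (inclusive). Must not be before start.

    Returns:
        Booking IDs in the range; an empty list if there are none.

    Raises:
        ValueError: If end is before start. Swap the two dates and retry.
    """
    if end < start:
        raise ValueError("end must not be before start")
    return [bid for bid, day in BOOKINGS.items() if start <= day <= end]
```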

Using a proper date type (most tool-calling frameworks translate this into a clear schema, typically an ISO 8601 string format in the underlying JSON) removes an entire category of guesswork. And documenting the ValueError tells the model what to expect if it gets the order backwards. This matters quite a bit, because a model that doesn’t know an error is possible won’t have a plan for handling it.

The general principle: every ambiguity you leave in the contract is a guess the model has to make, and guesses compound. One ambiguous date format might cause an occasional retry. Five ambiguous parameters across three tools, called in sequence, can produce failure combinations you’ll never anticipate in testing. The moral for tools called by an LLM is: the more detailed a description, the better. Don't be afraid to make the description longer than the function itself. A good description does not just cover the behaviour of the function itself, but also when and how to call it, with what parameters, what could go wrong, and what the errors mean.

Who’s Driving?

We outline two systems that both “summarise customer feedback and flag urgent issues”. The outline is fairly brief and omits the additional helper functions.

System A (Shape A, the fixed pipeline):
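A minimal sketch of such a pipeline. Here llm_call is a canned stand-in for a real model client and ALERTS is a hypothetical alert sink, both assumptions so that the example runs on its own.

```python
ALERTS: list[str] = []  # hypothetical alert sink


def llm_call(prompt: str) -> str:
    """Stand-in for a real model client (an assumption): returns canned text."""
    if prompt.startswith("Rate"):
        return "4" if "outage" in prompt.lower() else "2"
    return prompt.splitlines()[-1][:60]


def send_alert(message: str) -> None:
    """Record an alert; a real system would notify someone here."""
    ALERTS.append(message)


def process_feedback(feedback: str) -> dict:
    # Step 1: first LLM call, always runs.
    summary = llm_call(f"Summarise this feedback in one sentence:\n{feedback}")
    # Step 2: second LLM call, always runs, always after step 1.
    urgency_raw = llm_call(f"Rate the urgency from 1 to 5, digit only:\n{feedback}")
    urgency = int(urgency_raw.strip())
    # Step 3: plain Python decides what happens next, not the model.
    if urgency >= 4:
        send_alert(summary)
    return {"summary": summary, "urgency": urgency}
```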

System B (Shape B, the agent loop):
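A minimal sketch of such a loop. The tool bodies are stubs, and scripted_model is a hypothetical stand-in for the model's decision step; in a real system that function would be a model call that receives the tool descriptions and the history so far.

```python
ALERTS = []  # hypothetical alert sink


def summarise_text(text: str) -> str:
    """Return a short summary of the text (stubbed here)."""
    return text[:60]


def rate_urgency(text: str) -> int:
    """Rate urgency from 1 (low) to 5 (critical) (stubbed here)."""
    return 5 if "outage" in text.lower() else 2


def send_alert(message: str) -> str:
    """Notify the on-call team. Side effect: the alert goes out at once."""
    ALERTS.append(message)
    return "alert sent"


TOOLS = {f.__name__: f for f in (summarise_text, rate_urgency, send_alert)}


def run_agent(feedback, choose_next_step, max_steps=8):
    """Ask the model for the next step, run it, repeat until it says stop."""
    history = []
    for _ in range(max_steps):
        step = choose_next_step(feedback, history)
        if step is None:  # the model decided it is done
            break
        name, args = step
        history.append((name, TOOLS[name](**args)))
    return history


def scripted_model(feedback, history):
    """Stand-in for the model: picks the next tool based on what it has seen."""
    results = dict(history)
    if "rate_urgency" not in results:
        return "rate_urgency", {"text": feedback}
    if results["rate_urgency"] >= 4 and "send_alert" not in results:
        return "send_alert", {"message": feedback[:60]}
    if "summarise_text" not in results:
        return "summarise_text", {"text": feedback}
    return None
```

The loop itself never mentions a specific tool. Which tools run, and in what order, is decided entirely inside choose_next_step.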

System A calls an LLM twice, in a fixed order, and uses plain Python, for example if urgency >= 4, to decide what happens with the result. The LLM is a function. A fuzzy, occasionally surprising function whose output varies, but structurally a function: input goes in, output comes out, your code decides what to do with it. Whether the output is good depends on how you set up the LLM and how much control you have over it.

System B hands the model a goal and a toolbox, and the model decides which tools to call, in what order, and whether to call any of them more than once. The LLM is the function caller. Your Python is the toolbox, sitting there waiting to be picked up.

Neither is “better” in the abstract. But they fail differently, they cost differently, and they deserve different amounts of trust and the right choice depends heavily on what you’re building.

How They Fail

System A is predictable. If urgency_raw comes back as "4", the alert fires, every time, with no exceptions. You can write a unit test: feed in feedback, mock the two llm_calls to return fixed values, assert that send_alert was called. That test will never flake because of the control flow; only the LLM outputs themselves can vary, and those are isolated to two narrow, well-defined slots.

If something goes wrong in System A, for example urgent feedback that didn’t trigger an alert, you have exactly two places to look: did rate_urgency return a number below 4 when it shouldn’t have (an LLM output problem, narrow and inspectable), or did the if statement have a bug (a code problem)?

System B is flexible in a way System A structurally cannot be. If feedback arrives in a format nobody anticipated, for example, a voice-to-text transcript that includes timestamps and “umm”s, the model can notice that, decide the summarise_text tool needs slightly different framing, or even skip straight to send_alert if the content is clearly an emergency regardless of normal triage steps.

But that same flexibility means that if something goes wrong, like the urgent feedback that didn’t get flagged, your debugging space is now: why did the model, given this exact prompt and this exact toolbox, decide not to call send_alert this time?

That’s not a question you answer by reading code. It’s a question you answer by re-running similar inputs, reading the model’s reasoning trace if you have one, and possibly concluding “it was a reasonable call given an ambiguous case, and a different model, or the same model on a different day, might do it differently.”

The Real Question

“Who’s driving?” is really asking: when something goes wrong, do you want to debug your code, or do you want to debug a decision?

Debugging code is a deterministic skill: with enough care and time you will succeed. Debugging a decision made by a probabilistic system, where “rerun it and see if it happens again” is a legitimate diagnostic step, is much less deterministic. It also needs plenty of experience with the model in question.

Small Tools, Composable Tools

The main idea that does the heavy lifting here: small, single-responsibility units compose better than large multi-purpose ones.

This is similar to decorators. A decorator that does one thing, whether it logs, retries, or caches, can be stacked with other decorators in whatever combination a given function needs. A decorator that logs and retries and caches is a decorator you can only use exactly the way its author imagined, and exactly that way every time. Otherwise you have just hidden several decorators inside a single one by adding lots of flags.
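As a small illustration of the stacking idea, here is a sketch with two single-purpose decorators. The names log_calls, retry, and fetch_profile are hypothetical, and the flaky behaviour is simulated with a counter.

```python
import functools


def log_calls(func):
    """Log every call to func; does nothing else."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def retry(times):
    """Retry func on ConnectionError up to `times` attempts; does nothing else."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(times):
                try:
                    return func(*args, **kwargs)
                except ConnectionError:
                    if attempt == times - 1:
                        raise
        return wrapper
    return decorator


ATTEMPTS = {"count": 0}  # simulates a flaky backend


@log_calls
@retry(times=3)
def fetch_profile(user_id):
    ATTEMPTS["count"] += 1
    if ATTEMPTS["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"id": user_id}
```

A function that only needs logging takes @log_calls alone. Neither decorator has to know the other exists.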

Tools for LLMs work the same way, for a related but slightly different reason. It’s not only about reusability, rather it’s about the model’s ability to reason about what each tool does, in isolation, without having to first figure out which “mode” it’s in.

Compare the following prototypes:
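A sketch of what a multi-mode prototype of this kind could look like. The five action strings, the keyword arguments, and the in-memory USERS store are assumptions for illustration.

```python
# Hypothetical in-memory user store.
USERS = {"u1": {"email": "a@example.com", "suspended": False}}


def manage_user(action: str, user_id: str, **kwargs) -> dict:
    """Manage a user account.

    action must be one of "get", "update", "suspend", "delete", "email".
    "update" needs fields=dict, "suspend" needs reason=str,
    "email" needs subject=str and body=str.
    """
    user = USERS[user_id]
    if action == "get":
        return dict(user)
    if action == "update":
        user.update(kwargs["fields"])
        return dict(user)
    if action == "suspend":
        user["suspend_reason"] = kwargs["reason"]
        user["suspended"] = True
        return dict(user)
    if action == "delete":
        return USERS.pop(user_id)
    if action == "email":
        return {"to": user["email"], "subject": kwargs["subject"], "body": kwargs["body"]}
    raise ValueError(f"unknown action: {action!r}")
```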

vs (with truncated docstrings to not inflate this too much)
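A sketch of the split version, again with short docstrings. suspend_user and send_user_email are names used in this post; get_user and update_user appear later in it. The parameter lists and the in-memory USERS store are assumptions.

```python
# Hypothetical in-memory user store.
USERS = {"u1": {"email": "a@example.com", "suspended": False}}


def get_user(user_id: str) -> dict:
    """Fetch the full profile for one user. Read-only."""
    return dict(USERS[user_id])


def update_user(user_id: str, fields: dict) -> dict:
    """Overwrite the given profile fields and return the updated profile."""
    USERS[user_id].update(fields)
    return dict(USERS[user_id])


def suspend_user(user_id: str, reason: str) -> dict:
    """Suspend the account and record the reason."""
    USERS[user_id].update({"suspended": True, "suspend_reason": reason})
    return dict(USERS[user_id])


def send_user_email(user_id: str, subject: str, body: str) -> dict:
    """Send one email to the user's address on file. Sends immediately."""
    return {"to": USERS[user_id]["email"], "subject": subject, "body": body}
```

Forgetting an argument now fails at the call boundary with a TypeError, instead of somewhere inside a branch.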

The first version is one tool with five hidden modes, distinguished by a string the model has to get exactly right and the required arguments change depending on that string. The final bit is a big problem, since most tool-schema formats can’t even express this cleanly. A model calling this has to: decide it wants to “manage a user,” recall which of five magic action-strings applies, and then recall which kwargs that particular string demands. The latter is probably done by an extensive docstring, at best. Three points of failure for what should be one decision.

The second version is four tools, each with an obvious name and a narrow job, each with its own clean, fixed parameter list. When the model is deciding what to call next, the names themselves carry most of the information, for example “I need to suspend this account” maps almost directly onto suspend_user, with no intermediate translation step. Any extra docstrings will only make this more robust.

In practice, the single mega-function version tends to produce two characteristic failure modes: the model picks the right action string but forgets a required kwarg (because the requirement was buried in prose or docstring, not in the schema), or the model picks a plausible-sounding but wrong action string, for example "ban" instead of "suspend", because nothing in the schema constrained it to the exact five valid values as cleanly as five separate function names would have.

How Far to Split?

This raises an obvious question: if smaller is better, why not have a hundred tiny tools, like get_user_email, get_user_name, get_user_signup_date, each returning one field?

Because composability cuts both ways. A toolbox with a hundred near-identical tools makes the model’s first job, namely deciding which tool is relevant at all, much harder, for the same reason a function with a hundred optional kwargs is hard to call correctly. There’s a sweet spot, and it tends to track the natural verbs of your domain, not the natural fields.

get_user returning a full profile dict is usually right-sized: one verb (“fetch”), one natural object (the user), and the model, or your downstream code, can pick whichever fields it needs from the result. get_user_email as a separate tool only earns its place if fetching just the email is meaningfully cheaper, more restricted (different permissions), or more common than fetching the whole profile.

A rough heuristic that’s served well across both decorators and tools: split along axes that vary independently. For example, get_user and update_user vary independently: you might need one without the other. Bundling them into something like manage_user forces the model to over-specify even for the simple case. But two separate tools along the lines of get_user vs get_user_including_deleted usually don’t vary independently enough to justify the split. Whether to include deleted users is a parameter that can easily be described in both schema and docstring, not a verb.

The Prompt Is Part of the API

An important difference to "ordinary" coding is: if you change a tool’s docstring, you’ve shipped a change to your application’s behaviour. Even though no Python logic changed, no tests failed, and your CI is green, the model might behave very differently.
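As a starting point, consider a sketch like the following. search_orders is the tool name used in this post; the parameters, the order data, and the terse docstring are assumptions for illustration.

```python
from typing import Optional

# Hypothetical order data.
ORDERS = [
    {"order_id": "o1", "customer_id": "c1", "status": "shipped"},
    {"order_id": "o2", "customer_id": "c1", "status": "delivered"},
]


def search_orders(customer_id: str, status: Optional[str] = None) -> list[dict]:
    """Search orders for a customer."""
    return [
        order for order in ORDERS
        if order["customer_id"] == customer_id
        and (status is None or order["status"] == status)
    ]
```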

Rename that to:
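A sketch of what the reworded version might look like. The body is the same as before; only the docstring changes. The list of status values and the order data are hypothetical.

```python
from typing import Optional

# Hypothetical order data.
ORDERS = [
    {"order_id": "o1", "customer_id": "c1", "status": "shipped"},
    {"order_id": "o2", "customer_id": "c1", "status": "delivered"},
]


def search_orders(customer_id: str, status: Optional[str] = None) -> list[dict]:
    """Search a customer's orders, optionally filtered by status.

    Call this before refund_order to confirm that the order exists and to
    look up its order_id.

    Args:
        customer_id: ID of the customer whose orders to search.
        status: One of "pending", "shipped", "delivered", "cancelled".

    Returns:
        A list of order dicts with "order_id", "customer_id", and "status".
    """
    return [
        order for order in ORDERS
        if order["customer_id"] == customer_id
        and (status is None or order["status"] == status)
    ]
```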

The function logic did not change, no code was changed, and no return type was modified. But now the model is meaningfully more likely to call this before calling a refund_order tool, just because you told it to, in the one place it actually reads. It now knows the exact valid status strings instead of guessing whether it’s "shipped" or "in_transit" or "in transit".

That second change, listing the valid values explicitly, is small, but it eliminates an entire class of “the model called the tool correctly but with a value that doesn’t mean anything to your backend” errors. This docstring could be further improved by also telling the model that a value of None searches all orders. Then it does not have to guess whether it can leave the argument out.

System Prompts Are the Other Half

The same logic applies, at a larger scale, to the system prompt in an agent loop. Consider two versions of the prompt for our feedback-processing example:
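The two versions might look something like this, written as Python string constants. The wording is a sketch; log_feedback is a hypothetical tool name, and the other tool names are the ones used in the feedback example.

```python
SYSTEM_PROMPT_V1 = (
    "You process customer feedback. Summarise it and alert on urgent issues."
)

SYSTEM_PROMPT_V2 = """\
You process customer feedback using the tools provided.

For each piece of feedback:
1. Call summarise_text to get a one-sentence summary.
2. Call rate_urgency to get a score from 1 (low) to 5 (critical).
3. If the score is 4 or higher, call send_alert with the summary.
4. Call log_feedback last, after any alert has been sent.

If the feedback is not in English, translate it to English before step 1.
"""
```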

Version 1 is technically sufficient: a capable model will probably figure out roughly this sequence on its own most of the time. But “most of the time” is the problem. For a workflow this simple, it would be quite embarrassing if the system got the sequence wrong even once.

Version 2 removes ambiguity about ordering, for example whether logging happens before or after the alert. It also handles some edge cases that might occur regularly, like feedback in different languages, and it clarifies what “alert on urgent issues” really means numerically, i.e. 4 or more.

This means tool descriptions and system prompts deserve the same engineering care as your function signatures: version them (a prompt change is a code change for an LLM tool and should be handled the same way), review changes to them (a teammate reading “removed the refund-order ordering instruction” should ask why, same as they’d ask about a removed if check), and test them. This last part is the one that is a bit problematic.

What “Testing the Prompt” Actually Looks Like

This is not “does the function return the right value when called directly”, that’s a normal unit test and should absolutely still be written. The new kind of test is: given this exact toolbox and this exact system prompt, does the model choose to call the right thing, with the right arguments, for a representative set of realistic inputs?

This kind of test is slower (it makes real model calls, or calls to a cheaper model standing in for one), noisier (a borderline-urgency message might reasonably get rated 3 or 4 depending on phrasing), and harder to make deterministic. It is much more like testing a machine learning algorithm: you set up your testing metrics and decide how much deviation you accept, especially for the urgency rating and the way the summary is worded.
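One way to frame such a test is as a pass rate over labelled cases rather than a single assertion. The sketch below assumes a choose_tool callable that wraps a model call and returns the name of the tool it picked; fake_model is a deterministic stand-in, and log_feedback is a hypothetical tool name.

```python
def tool_choice_pass_rate(choose_tool, cases, runs_per_case=5):
    """Fraction of runs in which the model picked the expected tool.

    Each case is run several times because real model output can vary.
    """
    hits = 0
    total = 0
    for text, expected in cases:
        for _ in range(runs_per_case):
            hits += choose_tool(text) == expected
            total += 1
    return hits / total


def fake_model(text):
    """Stand-in for a real model call (an assumption for this sketch)."""
    return "send_alert" if "outage" in text.lower() else "log_feedback"


CASES = [
    ("Site outage since 9am", "send_alert"),
    ("Love the new colours", "log_feedback"),
    ("Checkout is a bit slow today", "log_feedback"),
]
```

In a test suite you would then assert that the rate stays above a threshold you have chosen, for example 0.9, instead of demanding a perfect score on every run.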

But it’s the only test that exercises the part of your system that’s actually new, in this case the English-language half of your API surface. Skipping it means your “tests” cover the 10% of your system that was already easy to get right (plain Python control flow) and leave untested the 90% that’s actually doing the interesting work.

Putting It Together: A Worked Example

Let’s walk through designing tools for a small, realistic scenario: an internal agent that helps triage GitHub issues. We want it to label them, find related issues, and draft a first response.

Step 1 — identify the verbs. What actions does this domain naturally support? Searching issues, reading an issue’s full content, applying labels, posting a comment, and finding related issues by similarity. Five verbs, five candidate tools.

Step 2 — write each contract as if a stranger had to use it correctly with no other context:
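A sketch of what these five contracts could look like. The tool names come from this post; the parameter lists, return shapes, and the in-memory ISSUES store are assumptions, and the bodies are toy implementations rather than calls to a real tracker API.

```python
# Hypothetical in-memory issue store standing in for a real tracker API.
ISSUES = {
    1: {"title": "Crash on login", "body": "App exits after I press Sign in.",
        "state": "open", "labels": [], "comments": []},
    2: {"title": "Login crash on Safari", "body": "Same crash, Safari only.",
        "state": "closed", "labels": ["bug"], "comments": []},
}


def search_issues(query: str, state: str = "open") -> list[dict]:
    """Search issues whose title contains the query (case-insensitive).

    Args:
        query: Text to look for in issue titles.
        state: "open", "closed", or "all". Defaults to "open".

    Returns:
        A list of {"number", "title", "state"} dicts, possibly empty. The
        issue body is NOT included; call get_issue(number) to read it.
    """
    return [
        {"number": n, "title": i["title"], "state": i["state"]}
        for n, i in ISSUES.items()
        if query.lower() in i["title"].lower() and state in ("all", i["state"])
    ]


def get_issue(number: int) -> dict:
    """Fetch one issue in full, including body, labels, and comments.

    Raises:
        KeyError: If no issue with this number exists.
    """
    return dict(ISSUES[number])


def apply_labels(number: int, labels: list[str]) -> list[str]:
    """Add labels to an issue and return its complete label list.

    Additive: existing labels are kept and duplicates are ignored, so
    repeating a call is safe. This tool cannot remove labels.
    """
    current = ISSUES[number]["labels"]
    for label in labels:
        if label not in current:
            current.append(label)
    return list(current)


def post_comment(number: int, body: str) -> int:
    """Post a public comment on an issue and return the new comment count.

    There is no draft step: the comment is visible immediately and cannot
    be unsent. Only call this once the text is final.
    """
    ISSUES[number]["comments"].append(body)
    return len(ISSUES[number]["comments"])


def find_similar_issues(number: int, limit: int = 5) -> list[dict]:
    """Find issues similar to the given one, most similar first.

    Similarity here is a toy measure (shared title words). Returns
    {"number", "title"} dicts in any state, excluding the issue itself.
    """
    words = set(ISSUES[number]["title"].lower().split())
    scored = [
        (len(words & set(i["title"].lower().split())), n, i["title"])
        for n, i in ISSUES.items() if n != number
    ]
    scored.sort(reverse=True)
    return [{"number": n, "title": t} for score, n, t in scored[:limit] if score > 0]
```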

Notice a few things about these contracts, each tying back to a point made above:

search_issues explicitly says it doesn’t return the body, and points to the tool that does. This prevents a failure mode where the model tries to read result["body"] from search results and gets a KeyError.
apply_labels clarifies additive-vs-replace behaviour. This one-word difference in implementation completely changes how safe it is to call repeatedly.
post_comment includes a warning that there’s no draft step. This is the kind of caveat that exists purely for the model’s benefit. A human using this function in code would typically know this from context, but a model deciding “should I post this now or refine it first?” benefits from being told explicitly that “now” is irreversible.

Step 3 — decide the shape. Is this Shape A or Shape B? “Triage” suggests genuinely variable workflows: a duplicate gets different treatment than a novel bug report, which gets different treatment than a feature request. The number and order of steps depends on what’s found, which all suggests Shape B. An agent loop with these five tools and a system prompt describing the overall triage policy (when to label as duplicate, when to draft a response vs. just label, etc.) fits naturally.

Contrast with a simpler version of “issue triage” — say, just “translate non-English issue titles to English.” That’s clearly a Shape A: one LLM call, fixed structure, no tools needed at all.

The five tools above, with their carefully written contracts, are useful regardless of which shape wins, because they’re well-designed functions on their own merits. That’s the point: good contract design isn’t an “agent thing” or a “Shape B thing.” It’s just good function design, with an unusually demanding reader. It also means docstrings deserve extra care whenever you are not sure whether a function might be used as a tool later.

Not All Tools Are Created Equal

Looking back at those five issue-tracker tools, we notice something: they don’t carry the same amount of risk.

The tools search_issues, get_issue, and find_similar_issues are read-only. If the model calls them a hundred times with garbage arguments, the worst outcome is a hundred empty result lists. apply_labels and post_comment, on the other hand, change something, and post_comment in particular is irreversible the moment it runs.

This matters because Shape B hands the decision of when to call something to the model, and we have to remember that the model can be wrong about that decision in ways that are easy to miss in testing. A good mental model here is a new analyst at a company working with a locked-down dashboard: they can run a report, filter a table, or draft a message, but they can’t SSH into production and start improvising. The dashboard’s shape is the safety mechanism, not the analyst’s judgment.

Applied to tool design, this suggests sorting your candidate tools into rough tiers before you build anything:

Safe-by-default tools: These are read-only, side-effect-free, and cheap to call repeatedly. This includes searching, fetching, listing, calculating, and summarising a known record. They are good first tools precisely because a model getting confused and calling one twice, or calling one when it wasn’t needed, costs you nothing more than a wasted round trip... and perhaps some tokens.

Tools that need a second look: These are the ones that write, send, delete, or change permissions. Tools like apply_labels and post_comment, or send_user_email and suspend_user from earlier in the post, all fall into this group. These don’t need to be excluded from an agent’s toolbox, but they’re the tools where the contract’s caveats stop being nice-to-haves and become the actual safety mechanism.

A rough rule that travels well across domains: if a new hire would need a second pair of eyes, or explicit sign-off, before doing something, the model probably needs the equivalent before calling the tool that does it. That might mean the tool itself returns a draft instead of taking the final action, or it might mean your application layer inserts a confirmation step between “model decided to call this” and “this actually runs.”
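The application-layer variant can be sketched as a thin wrapper around tool execution. The names execute_tool_call, NEEDS_CONFIRMATION, and the confirm callback are hypothetical; the set lists the two write tools from the worked example.

```python
# Hypothetical set of tools that need sign-off before they run.
NEEDS_CONFIRMATION = {"apply_labels", "post_comment"}


def execute_tool_call(name, args, tools, confirm):
    """Run a tool the model asked for, pausing for sign-off on risky ones.

    tools maps tool names to callables; confirm(name, args) returns True
    if a human (or a policy check) approves the call.
    """
    if name in NEEDS_CONFIRMATION and not confirm(name, args):
        return {"status": "rejected", "tool": name}
    return {"status": "ok", "result": tools[name](**args)}
```

Read-only tools pass straight through, so the confirmation cost is only paid where a mistake would be visible or irreversible.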

The point isn’t to be paranoid about every tool. It’s that tool design and tool risk are the same design problem, just viewed from two angles, and sorting your tools into these tiers early tends to surface, for free, which ones need the extra-careful Version-3-style docstring from earlier, and which ones are genuinely fine with a one-liner.

When NOT to Use This

Not every “uses AI” feature needs an agent loop, and not every function needs a five-paragraph docstring.

If your task is genuinely Shape A, meaning a workflow like “take this input, run it through one or two well-defined transformations, produce this output”, build it as Shape A. A function that calls an LLM once (or twice) and uses normal code for everything else is cheaper, faster, easier to test, and dramatically easier to debug. Enthusiasm is not your friend here: the fact that an agentic loop could handle your ticket-classification pipeline doesn’t mean it should, any more than the fact that a sledgehammer could hang a picture frame.

The detailed-contract advice in this post matters most when a function will be called by a model making its own decisions about whether and when to call it, whether this is Shape B, or any tool exposed via something like MCP (see the next post). If a function is called from exactly one place, in exactly one fixed spot in your code, by an LLM that has no choice in the matter, a one-line docstring is genuinely fine. The model isn’t choosing whether to call summarise_text; your code is calling it. Save the careful contract-writing for the tools that actually get chosen.

Reach for an agent loop, i.e. Shape B, when the number and order of steps genuinely depend on what’s in the input, that is, when you can’t write the if/else tree because the tree’s shape changes case to case, not just its contents. If you find yourself writing a flowchart with one box per possible LLM decision rather than per processing step, that’s a sign the flowchart should be a system prompt and a toolbox instead.

One More Distinction Worth Having: RAG Isn’t a Synonym

Before we move on, one piece of terminology clarification that’ll make the next post easier to follow, and that caused me enough trouble when I first looked the terms up.

It’s common to hear “RAG,” “tool-calling,” and “MCP” used as if they’re roughly interchangeable, essentially they are used as three names for “the AI looks stuff up.” They’re not, and conflating them makes architecture conversations needlessly foggy.

RAG (retrieval-augmented generation) is a pattern: search a knowledge base, retrieve relevant chunks, stuff them into the prompt, ask the model to answer using that context. It’s fundamentally about improving what the model knows before it answers. Depending on your style of RAG, this might involve a vector store, it might include a knowledge graph, or it might even use an LLM of its own, but in the end it produces knowledge to include in the prompt.

Tool-calling is broader than RAG. A retrieval function can absolutely be a tool, for example search_issues from our worked example is basically a RAG-flavoured lookup, since we need to look up similar issues by similarity in context. But tools can also do things that have nothing to do with retrieval: apply_labels, post_comment, send_user_email are actions, not lookups. Tool-calling is the mechanism that lets a model request any structured operation, retrieval being just one flavour.

MCP, which we’re looking at in the next post, is not exactly broader than tool-calling, rather it's a protocol for exposing tools (of any flavour, retrieval or action), resources, and prompts to AI applications in a standardised, reusable way.

So: you can build RAG without tool-calling, for example a fixed Shape A pipeline that always retrieves the same way. You can build tool-calling without MCP, nothing in this post used MCP at all. And you can build an MCP server whose tools happen to be RAG-style retrieval but it also includes resources and prompts.

Why a Protocol, At All?

One last bit of motivation before we move to MCP itself, because it’s worth seeing the problem MCP exists to solve in its plainest form.

Every tool we’ve designed in this post, whether it is check_username_available, the five issue-tracker tools, or search_orders, currently lives inside one application’s codebase, called by one agent loop, using one model provider’s tool-calling format.

Now imagine a second application needs search_issues and get_issue too, but it's a separate internal dashboard, built by a different team, using a different model provider, with its own slightly different tool-schema conventions. Without a shared standard, that team re-implements the same two functions, with their own interpretation of what “the body” means, their own decision about whether state="all" includes draft issues, and their own docstrings that will inevitably drift from yours over time.

Multiply this by three teams and five tools, and you get what’s sometimes called the N×M integration problem: N applications, each needing custom integration code for M tools or data sources, producing N×M bespoke connections, and each one is a slightly different reimplementation of the same underlying contract.

Now that we know what a tool is to an LLM, essentially a contract made of names, types, descriptions, and the occasional crucial caveat, the next post looks at the protocol that’s emerged to collapse that N×M problem into something closer to an N+M problem: MCP.

Spoiler: it’s a much simpler idea than the hype suggests, and it’s way easier to over-build than you’d expect. The basic idea is to take the tools we just discussed and collect them in a single system that can be reached from more places, and most of the code that is needed is a few decorators.