
Eager Tool Calling: How We Cut Agent Latency by 50% on Long Tool Chains

An 8-tool agent task took 24 seconds. The model was fast. The tools were fast. The wall clock was slow. We rewrote the stream handler to fire each tool the moment its block finishes streaming — not at message_stop — and cut median end-to-end agent latency by 50% across production traffic, with longer tool chains pulling further ahead.

Henry Bui
·
agentic ai · performance · streaming · tool calling · production lessons


1. Introduction — The Latency Nobody Was Measuring

An eight-tool agent task in our staging environment clocked in at twenty-four seconds. The model was fast. The individual tools were fast. The wall clock was slow. Where did the time go?

We dropped the trace into a flame graph and the answer was embarrassing. For the first four seconds the model streamed tokens while every tool sat idle. Then the model finished, fired all eight tools in parallel, and spent another twenty seconds waiting for them. Two phases, stacked end to end, with no overlap between them.

We rewrote that codepath. The same request now finishes in roughly five seconds — the four-second stream and the tool execution happen in the same four seconds instead of consecutively. Across a representative sample of production traces, median end-to-end task completion is 50% faster, and longer tool chains see proportionally larger wins.

The technique is called eager tool calling (internally we call it tool-call pipelining). The idea is small. The engineering was not. Here is what it is, how it works in our stream handler, the production bugs that taught us what not to do, and the numbers we measured.


2. The Problem: Where Agent Latency Actually Hides

The common intuition is that parallel tool calling already solved this. It didn't. Parallel tool calling is necessary but not sufficient.

2.1 The classic agent loop

A single agent turn looks like this:

  1. Model reasons over the context and begins streaming a response.
  2. The response contains one or more tool_use blocks.
  3. Runtime waits for message_stop.
  4. Runtime executes the tools.
  5. Runtime collects the tool_result blocks and starts the next turn.

Total wall clock for one turn = stream time + tool time, added serially. The tool layer is a passive consumer that refuses to start work until the producer has completely finished.
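The serial turn above can be sketched in a few lines, with asyncio sleeps standing in for stream and tool latency; all names and durations here are illustrative:

```python
# Toy model of the classic serial agent turn: the tool phase cannot start
# until the stream phase has fully completed. Durations are stand-ins.
import asyncio
import time

async def stream_response() -> list[str]:
    await asyncio.sleep(0.4)            # model streams the assistant message
    return ["tool_a", "tool_b"]         # tool_use blocks known only at message_stop

async def run_tool(name: str) -> str:
    await asyncio.sleep(0.2)            # each tool's execution time
    return f"{name}: ok"

async def classic_turn() -> float:
    start = time.monotonic()
    tools = await stream_response()     # step 3: wait for message_stop
    for name in tools:                  # step 4: run tools, one after another
        await run_tool(name)
    return time.monotonic() - start

elapsed = asyncio.run(classic_turn())   # ≈ 0.4 + 0.2 + 0.2 = 0.8s of wall clock
```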

2.2 The parallel tool calling half-fix

Modern APIs — Anthropic, OpenAI, Bedrock — let the model emit multiple tool_use blocks in a single assistant message, and mature runtimes run those blocks concurrently.

That change moves the tool phase from sum of tool durations to max of tool durations. Genuinely valuable. On an eight-tool turn with tools averaging 2.5 seconds, parallel tool calling takes ~2.5 seconds for the tool phase instead of 20. Big win.

But the stream phase still happens first. The tools still wait for message_stop. If the model takes four seconds to emit its response, those four seconds are pure dead time on the tool side.
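Under the same toy model, the parallel half-fix is one gather call: tools overlap with each other, but the stream time is still paid first.

```python
# Toy model of parallel tool calling: the tool phase shrinks from sum() to
# max(), but still begins only after message_stop. Names are illustrative.
import asyncio
import time

async def stream_response() -> list[str]:
    await asyncio.sleep(0.4)                 # pure dead time on the tool side
    return ["tool_a", "tool_b", "tool_c"]

async def run_tool(name: str) -> str:
    await asyncio.sleep(0.2)
    return f"{name}: ok"

async def parallel_turn() -> float:
    start = time.monotonic()
    tools = await stream_response()
    await asyncio.gather(*(run_tool(n) for n in tools))  # max(), not sum()
    return time.monotonic() - start

elapsed = asyncio.run(parallel_turn())       # ≈ 0.4 + 0.2 = 0.6s of wall clock
```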

2.3 What we wanted

Tools that start running the instant the model finishes emitting them, while the model continues streaming the rest of the message. Not tools-parallel-with-tools. Tools-parallel-with-generation.

Parallel tool calling overlaps tools with each other. Eager tool calling overlaps tools with the model itself.


3. Three Eras of Tool Calling

| Era               | Concurrency                          | When tools start                               | Wall clock             |
|-------------------|--------------------------------------|------------------------------------------------|------------------------|
| Sequential        | None                                 | After each prior tool finishes                 | Σ(stream + all tools)  |
| Parallel          | Tools with tools                     | After message_stop                             | stream + max(tool)     |
| Eager (pipelined) | Tools with tools and with generation | The moment each tool block finishes streaming  | max(stream, max(tool)) |

The move from sequential to parallel collapses one dimension of latency. The move from parallel to eager collapses the other. Time goes from sum to max in both directions.
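The formulas can be checked with concrete numbers: a 4.0s stream and three 2.0s tools, each sealing partway through the stream. The seal times below are illustrative, chosen to match the figure that follows.

```python
# Wall-clock math for the three eras. Each tool is (seal_time, duration);
# the seal times are assumed values for illustration.
stream = 4.0
tools = {"A": (0.5, 2.0), "B": (1.5, 2.0), "C": (2.5, 2.0)}

sequential = stream + sum(d for _, d in tools.values())      # 4.0 + 6.0 = 10.0s
parallel = stream + max(d for _, d in tools.values())        # 4.0 + 2.0 = 6.0s
eager = max(stream, max(s + d for s, d in tools.values()))   # max(4.0, 4.5) = 4.5s
```

When every seal lands before message_stop, eager collapses to max(stream, latest seal + duration), which is why the table writes it as max(stream, max(tool)).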

Wall Clock Comparison

Parallel runs tools with each other. Eager runs tools with generation.

[Figure: wall-clock comparison on a 0s to 6s timeline. Parallel (tools after message_stop): a 4.0s stream, then tools A, B, C (2.0s each) running together, 6.0s total. Eager (tools mid-stream): the same stream and tools overlap, finishing at 4.5s. 1.5s saved, 1.33× faster. Same stream, same tools; only the start times differ.]

4. How Eager Tool Calling Works

4.1 The key insight — the seal

Each tool_use block the model emits has a stable tool_call_id. During streaming, the runtime sees chunks arrive one at a time. The first chunk of each tool call carries its tool_call_id; the partial argument chunks that follow carry only an index.

Here is the observation that makes eager tool calling possible: the moment a chunk arrives with a tool_call_id different from the previous one, the previous tool call is definitionally complete. The model has moved on. Its arguments are fully accumulated. It is safe to execute right now, even though the assistant message is still streaming.

We call that transition the seal event. A tool is sealed the instant the model starts streaming a different tool — or when the message itself ends.

4.2 Chunk-by-chunk trace

chunk(id = A, args = "{path:")    → accumulate into tool_stream[A]
chunk(id = A, args = "/etc}")     → accumulate
chunk(id = B, args = "{url:")     → NEW id → SEAL A → dispatch tool A
chunk(id = B, args = "...")       → accumulate B     (A executing in background)
chunk(id = C, args = "{query:")   → SEAL B → dispatch tool B
                                → (A, B executing while C still streaming)
message_stop                      → SEAL C → dispatch tool C
                                → (A, B, C all overlap on separate threads)
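The trace above can be turned into a runnable toy seal detector. The chunk shapes are simplified; a real stream also carries indices and text blocks.

```python
# Toy seal detector over the chunk trace above. A change of tool_call_id
# seals the previous call; the end of the message seals the last one.
def detect_seals(chunks: list[tuple[str, str]]) -> list[str]:
    """chunks are (tool_call_id, partial_args) pairs; returns seal order."""
    sealed: list[str] = []
    buffers: dict[str, str] = {}
    last_id: str | None = None
    for call_id, args in chunks:
        if last_id is not None and call_id != last_id:
            sealed.append(last_id)      # new id: the previous call is complete
        buffers[call_id] = buffers.get(call_id, "") + args
        last_id = call_id
    if last_id is not None:
        sealed.append(last_id)          # message_stop seals the tail call
    return sealed

trace = [("A", "{path:"), ("A", "/etc}"), ("B", "{url:"),
         ("B", "..."), ("C", "{query:")]
print(detect_seals(trace))  # → ['A', 'B', 'C']
```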

4.3 Visualizing the overlap

In a correct eager execution, the tool lanes are not end-to-end. They overlap:

stream : [==================================]
tool A :   [=========]            ← fires when B's id arrives
tool B :       [=========]        ← fires when C's id arrives; runs concurrent with A
tool C :           [=========]    ← fires at message_stop; runs concurrent with A + B

Drop a vertical line at the 50% mark and it cuts through all three bars. That is the signature of eager tool calling. If your "eager" diagram shows tools end-to-end, you have implemented sequential tool calling under a different name.

Seal Mechanism

A new tool_call_id is the seal trigger

[Figure: seal mechanism. Stream chunks arrive over time: chunk(id=A), chunk(id=A), then chunk(id=B) with a new id. The seal detector checks prev_id != new_id, updates _last_id from A to B, marks tool_streams[A] complete, and fires on_tool_call_sealed(tool_streams[A]) into the executor pool while chunk B keeps buffering. Tool A starts running, concurrent with the stream, the moment chunk(id=B) arrives, not at message_stop; tools B and C queue behind it. Stream and tool execution overlap on the wall clock.]

5. Inside CloudThinker's Stream Handler

The full implementation lives in backend/app/services/agent/stream_handler.py. A few small data structures power it:

from collections.abc import Callable
from typing import Any
from uuid import UUID


class StreamHandler:
    def __init__(
        self,
        on_tool_call_sealed: Callable[[dict[str, Any]], None] | None = None,
    ):
        # Callback fired once per completed tool call during LLM streaming.
        # Receives the sealed tool_stream dict; MUST NOT raise.
        self._on_tool_call_sealed = on_tool_call_sealed

        # Accumulation buffers: conversation_id -> tool_call_id -> tool_stream.
        self.tool_streams: dict[UUID, dict[str, dict[str, Any]]] = {}

        # Most-recent tool_call_id seen for each conversation.
        # A new id on the next chunk detects a seal event.
        self._last_tool_call_id_per_conv: dict[UUID, str] = {}

        # Maps (conversation_id, chunk.index) -> tool_call_id so
        # partial-args chunks (which carry no id) can be routed back
        # to the correct tool_stream for accumulation.
        self._tool_call_id_by_index: dict[tuple[UUID, int], str] = {}

5.1 The seal trigger

The whole pipeline hinges on this one method:

def _seal_previous_tool_call_if_needed(
    self, conversation_id: UUID, new_tool_call_id: str
) -> None:
    """Fire the seal callback on the prior in-flight tool call.

    Called when the stream handler sees a chunk with a new
    tool_call_id that differs from the previous one — proof that
    the LLM has finished streaming the prior tool call and moved
    on. Enables tool-call pipelining.
    """
    if self._on_tool_call_sealed is None:
        return
    prev_id = self._last_tool_call_id_per_conv.get(conversation_id)
    if not prev_id or prev_id == new_tool_call_id:
        return
    prior_stream = self.tool_streams.get(conversation_id, {}).get(prev_id)
    if prior_stream is None:
        return
    try:
        self._on_tool_call_sealed(prior_stream)
    except Exception:
        logger.exception(
            "on_tool_call_sealed callback raised for "
            "tool_call_id=%s conversation=%s", prev_id, conversation_id
        )

Twenty lines. That is the entire dispatch trigger. The callback — wired up by the agent executor — hands the sealed tool stream to an async worker pool which runs it immediately.

5.2 Routing partial argument chunks

There is one subtlety that bit us on day one. Streaming tool arguments arrive as a series of chunks. The first chunk carries the full tool_call_id. Subsequent chunks carry only the index. Without a mapping from index back to tool_call_id, partial-argument updates can't be routed to the right buffer, and parallel streaming turns the arguments into garbage.

The _tool_call_id_by_index map fixes this. Every first-chunk registers (conversation_id, index) → tool_call_id, and every subsequent chunk looks up its buffer through that map.

5.3 Fire-and-forget callback, isolated executor

The seal callback is fire-and-forget on purpose. If it raises, the exception is caught and logged. The streaming path never breaks because an executor had a bad day. Each sealed tool runs in its own task with its own exception boundary; one tool failing surfaces as a tool_result with is_error: true on the next turn, and the other tools keep running.
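One way to sketch that boundary with plain asyncio; the tool_stream fields ("id", "fn") are assumptions for illustration, not CloudThinker's actual schema:

```python
# Each sealed tool runs in its own asyncio task, and a failure becomes an
# error tool_result instead of propagating into the stream reader.
import asyncio
from typing import Any

results: list[dict[str, Any]] = []

async def run_sealed_tool(tool_stream: dict[str, Any]) -> None:
    try:
        out = await tool_stream["fn"]()
        results.append({"tool_call_id": tool_stream["id"], "content": out})
    except Exception as exc:
        # A failed tool surfaces as is_error on the next turn; stream unaffected.
        results.append({"tool_call_id": tool_stream["id"],
                        "is_error": True, "content": str(exc)})

def on_tool_call_sealed(tool_stream: dict[str, Any]) -> None:
    # Fire-and-forget: schedule the task and return immediately; never raise.
    asyncio.get_running_loop().create_task(run_sealed_tool(tool_stream))

async def demo() -> None:
    async def ok() -> str:
        return "done"
    async def bad() -> str:
        raise RuntimeError("boom")
    on_tool_call_sealed({"id": "A", "fn": ok})
    on_tool_call_sealed({"id": "B", "fn": bad})
    await asyncio.sleep(0.05)   # let both tasks finish

asyncio.run(demo())
```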

Architecture

Five stages, two boundaries

[Figure: architecture. Five stages: (1) the provider stream (Anthropic / OpenAI) emitting SSE chunks; (2) the StreamHandler with its seal detector in stream_handler.py, routing chunks by index to id, detecting id transitions, and emitting seal events; (3) the fire-and-forget on_tool_call_sealed callback, isolated and wired by the agent executor; (4) an executor pool of asyncio tasks for tools A, B, C, isolated and cancellable inside a cancellation scope; (5) a tool-result buffer feeding the next turn. One stream reader, N tool tasks, one result buffer. Both boundaries are exception-isolated; a failing tool never kills the stream.]

6. Production Lessons — The Hard Parts

Five things that bit us. If you implement this pattern yourself, expect these.

6.1 Idempotent tools or bust

The model occasionally retracts mid-stream — emits a tool call, thinks better of it, and replaces it. Rare, but real. A tool that has already been fired eagerly cannot be un-fired.

We added a per-tool idempotent: bool flag. Non-idempotent tools (payments, destructive CLI commands, emails) fall back to the classic non-eager path and only fire after message_stop. The eager fast path is reserved for reads, analytics queries, and other safe operations.
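The gate itself is a few lines. The registry shape and tool names below are illustrative:

```python
# Sketch of the idempotency gate: tools flagged idempotent dispatch eagerly
# at seal time; everything else defers to message_stop.
from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str
    idempotent: bool

REGISTRY = {
    "read_file": ToolSpec("read_file", idempotent=True),     # safe to fire early
    "send_email": ToolSpec("send_email", idempotent=False),  # cannot be un-fired
}

dispatched_eagerly: list[str] = []
deferred_to_message_stop: list[str] = []

def on_seal(tool_name: str) -> None:
    if REGISTRY[tool_name].idempotent:
        dispatched_eagerly.append(tool_name)        # eager fast path
    else:
        deferred_to_message_stop.append(tool_name)  # classic path

on_seal("read_file")
on_seal("send_email")
```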

6.2 Cancellation propagation

When the user interrupts or the model emits a stop sequence mid-stream, in-flight eager tools must be cancelled cleanly. Leaking a tool execution into a dead conversation is both a correctness bug (results never consumed) and a resource leak.

Each eager dispatch is wrapped in a cancellation scope tied to the stream reader's lifetime. Stream dies → scope cancels → tools abort and release their resources.
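A minimal sketch of that lifetime coupling with plain asyncio tasks; a production version would likely use a structured cancellation scope such as an anyio task group:

```python
# When the stream reader dies mid-turn, every in-flight eager tool task is
# cancelled with it and gets a chance to release its resources.
import asyncio

async def read_stream_with_eager_tools(interrupt_after: float) -> list[str]:
    tool_tasks: list[asyncio.Task] = []
    cancelled: list[str] = []

    async def slow_tool(name: str) -> None:
        try:
            await asyncio.sleep(10)        # stand-in for a long tool call
        except asyncio.CancelledError:
            cancelled.append(name)         # release resources here
            raise

    try:
        tool_tasks.append(asyncio.create_task(slow_tool("A")))  # eager dispatch
        await asyncio.sleep(interrupt_after)
        raise asyncio.CancelledError       # stream dies (interrupt / stop seq)
    except asyncio.CancelledError:
        for t in tool_tasks:
            t.cancel()                     # stream death cancels the scope
        await asyncio.gather(*tool_tasks, return_exceptions=True)
    return cancelled

print(asyncio.run(read_stream_with_eager_tools(0.05)))  # → ['A']
```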

6.3 Error isolation per tool

One tool failing must not kill the stream reader or the other tools. We discovered this the hard way when a flaky S3 list operation crashed the entire turn for a minute before we added exception boundaries around each sealed dispatch. Now a failed tool produces an error tool_result and nothing else.

6.4 Observability — don't lose the seal

Eager execution hides behavior inside the stream. "Why is tool X slow?" becomes impossible to answer if you can't see when it was sealed versus when it finished. Every seal event now emits a trace span with seal_latency_ms, tool_call_id, and conversation_id. When something misbehaves, the timeline tells you exactly which stage was slow.


7. The Numbers

We benchmarked three workloads representative of real agent traffic, comparing sequential baseline, parallel tool calling, and eager tool calling. P50 wall clock from request-received to final assistant message. The numbers below come from the open-source benchmark in eager-tools — synthetic, deterministic, reproducible by anyone via make bench.

| Workload                  | Sequential p50 | Parallel p50 | Eager p50 | Speedup vs parallel | Speedup vs sequential |
|---------------------------|----------------|--------------|-----------|---------------------|-----------------------|
| 3-tool analytics query    | 4.9s           | 3.5s         | 2.9s      | 1.21×               | 1.69×                 |
| 9-tool incident triage    | 17.6s          | 9.5s         | 6.5s      | 1.46×               | 2.71×                 |
| 15-tool ad-campaign sweep | 30.4s          | 11.5s        | 8.8s      | 1.31×               | 3.46×                 |

The synthetic harness removes network jitter so the comparison isolates the dispatch strategy. The full bench file ships 16 workloads in this range — speedup vs parallel from 1.20× to 1.50×, median ~1.28×. In production, where tail latency and provider variance compound, end-to-end task completion is ~50% faster on the median trace, and longer tool chains pull further ahead. Treat the OSS bench as the lower-bound version anyone can reproduce on a laptop; production wins are larger.

7.1 Cost impact

  • Output tokens per task: effectively unchanged (same reasoning, same answer)
  • Net cost reduction: ~35% on tool-heavy workloads — fewer retries from timeouts, more conversations completing inside the cache TTL window, and faster end-to-end task completion freeing up agent slots.

7.2 Where the speedup comes from

Two stacked effects:

  1. Generation / execution overlap. The biggest chunk. Every millisecond of stream time that used to sit idle on the tool side now has a tool running in parallel.
  2. Fewer turns. A task that used to take three model turns often completes in one. Each saved turn saves prefill, network round-trip, and reasoning overhead.
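Effect 2 is easy to put rough numbers on. The per-turn overhead figures below are assumptions for illustration only:

```python
# Rough model of effect 2: collapsing a three-turn task into one removes two
# rounds of per-turn overhead. The cost figures are assumed, not measured.
prefill_s, network_s = 0.8, 0.15                   # assumed per-turn costs
turns_before, turns_after = 3, 1
overhead_saved = (turns_before - turns_after) * (prefill_s + network_s)
print(f"{overhead_saved:.2f}s saved")  # → 1.90s saved
```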
Bench Results

Wall clock — sequential vs parallel vs eager

[Figure: bench results, wall clock per workload on a 0s to 32s axis. 3-tool analytics: 4.9s sequential, 3.5s parallel, 2.9s eager (1.21× vs parallel). 9-tool incident: 17.6s, 9.5s, 6.5s (1.46×). 15-tool ad-campaign: 30.4s, 11.5s, 8.8s (1.31×). Speedup labels are vs parallel, the realistic baseline today.]

8. When NOT to Use It

Being honest about the limits.

  • Fast tools (sub-50ms). If your tools finish in milliseconds, there is nothing to hide behind the stream. Seal/dispatch overhead exceeds the latency saved. Don't bother.
  • Sequentially dependent tools. If tool B needs tool A's result to even be formulated, the model won't emit B until A returns. No pipeline opportunity; eager and classic are identical.
  • Non-streaming backends. You can't seal per block without incremental parsing. If your provider or gateway buffers the full response before forwarding, eager tool calling is impossible until that changes.
  • Non-idempotent tools. Already covered in Section 6. Destructive operations, payments, outbound messages — these stay on the classic path.

9. A Mental Model That Helps

Eager tool calling is CPU instruction pipelining for agents. A modern CPU doesn't wait for one instruction to retire before decoding the next — it fills the pipeline, hides latency, and lets multiple stages run concurrently. The agent runtime is the CPU, the model stream is the instruction fetch, the tools are the execution units, and the seal event is the register-ready signal that lets execution begin before the rest of the batch is decoded.

Once that analogy clicks, the implementation falls out of it. You need a streaming parser, a seal detector, an async executor pool, per-block cancellation, and idempotent operations. Everything else is bookkeeping.


10. Try It Yourself

10.1 Implementer's checklist

  1. Streaming SSE parser with per-block completion events.
  2. Per-conversation last_tool_call_id state.
  3. (conversation_id, index) → tool_call_id routing map for partial chunks.
  4. Async executor pool whose lifetime is tied to the stream reader, for clean cancellation.
  5. Per-tool idempotency flag; non-idempotent tools fall back to the classic path.
  6. Observability spans on every seal event, with seal_latency_ms and correlation IDs.
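Items 1 through 4 of the checklist fit in a single toy loop. Everything here is illustrative and single-conversation, with idempotency gating and observability omitted for brevity:

```python
# Toy end-to-end eager loop: parse chunks, seal on id change, dispatch
# immediately, collect results. Chunk and tool shapes are simplified.
import asyncio
from collections.abc import AsyncIterator, Awaitable, Callable

async def eager_turn(
    chunks: AsyncIterator[tuple[str, str]],
    tools: dict[str, Callable[[str], Awaitable[str]]],
) -> dict[str, str]:
    buffers: dict[str, str] = {}
    tasks: dict[str, asyncio.Task] = {}
    last_id: str | None = None

    async def dispatch(call_id: str) -> tuple[str, str]:
        return call_id, await tools[call_id](buffers[call_id])

    async for call_id, args in chunks:
        if last_id is not None and call_id != last_id:
            tasks[last_id] = asyncio.create_task(dispatch(last_id))   # seal
        buffers[call_id] = buffers.get(call_id, "") + args
        last_id = call_id
    if last_id is not None:
        tasks[last_id] = asyncio.create_task(dispatch(last_id))       # message_stop
    return dict(await asyncio.gather(*tasks.values()))
```

Feeding it the chunk trace from Section 4.2 dispatches A while B is still buffering, which is the whole trick.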

10.2 Closing

Eager tool calling is not a novel idea in the abstract. CPUs have been pipelining instructions since the 1980s. What's novel is that the agent ecosystem spent two years treating streaming as a UX-only feature — something you do to make tokens appear live in a chat window — rather than as an opportunity to parallelize execution against generation.

For any production agent running multiple tools per turn, this pattern is not optional. It is the difference between an impressive demo and a system fast enough to replace a human operator.

Try CloudThinker free and cut your agents' latency in half on long tool chains.