2026-06-19

What is tool error handling and retries?

Tool error handling and retries describe how an AI agent responds when a tool call fails: it classifies the failure, decides whether it is safe to try again, and then either retries (possibly with a corrected request) or stops and surfaces the error.

Why it matters

Tools fail for ordinary reasons: network timeouts, rate limits, malformed inputs, missing permissions, empty results, or temporary backend outages. If an agent treats every failure the same, it either gives up too early or blindly repeats a bad request.

Good error handling helps an agent:

- recover from transient failures,
- avoid wasting time on unrecoverable ones,
- ask for missing information when needed,
- and return a clearer answer to the user.

A common starting point is simple retries for flaky infrastructure, with smarter handling added once repeated tool failures show where it is needed.

How it works

1. Detect the failure.
   The tool returns an error, timeout, exception, or an invalid response. The agent runtime or orchestration layer captures that outcome.
2. Classify the error.
   A useful distinction is:
   - Transient: likely to succeed if tried again, such as a timeout or temporary rate limit.
   - Permanent: unlikely to succeed without changing the request, such as a bad parameter, unauthorized access, or a missing record.
3. Choose a response.
   Common responses are:
   - retry the same call,
   - retry with backoff or jitter,
   - repair the input and retry,
   - switch to a fallback tool,
   - ask the user for clarification,
   - or stop and report the failure.
4. Limit retries.
   Retries should be bounded. Without limits, an agent can loop forever, amplify load, or repeat an invalid action. Many systems also avoid retrying non-idempotent actions unless they can guarantee safety.
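
The classify-then-choose steps above can be sketched as a small decision function. This is a minimal illustration, not a standard API: the error codes, the `ToolError` fields, and the `Action` names are all assumptions made up for the example, and a real tool would define its own.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    TRANSIENT = "transient"   # may succeed if the same call is repeated
    PERMANENT = "permanent"   # keeps failing until the request changes


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    REPAIR_INPUT = "repair_input"
    ASK_USER = "ask_user"
    STOP_AND_REPORT = "stop_and_report"


@dataclass
class ToolError:
    code: str         # hypothetical error code reported by the tool
    idempotent: bool  # True if repeating the call cannot duplicate side effects


# Hypothetical error codes, chosen only for this sketch.
TRANSIENT_CODES = {"timeout", "rate_limited", "backend_unavailable"}
REPAIRABLE_CODES = {"invalid_argument"}
NEEDS_USER_CODES = {"missing_parameter", "ambiguous_input"}


def classify(error: ToolError) -> ErrorKind:
    """Treat known flaky conditions as transient and everything else as permanent."""
    if error.code in TRANSIENT_CODES:
        return ErrorKind.TRANSIENT
    return ErrorKind.PERMANENT


def choose_action(error: ToolError, attempts_so_far: int, max_attempts: int = 3) -> Action:
    """Pick a response from the error type, the retry budget, and the risk of repeating."""
    if classify(error) is ErrorKind.TRANSIENT:
        # Retry only while budget remains and repeating the call is safe.
        if error.idempotent and attempts_so_far < max_attempts:
            return Action.RETRY_WITH_BACKOFF
        return Action.STOP_AND_REPORT
    if error.code in REPAIRABLE_CODES:
        return Action.REPAIR_INPUT
    if error.code in NEEDS_USER_CODES:
        return Action.ASK_USER
    return Action.STOP_AND_REPORT
```

Unknown codes fall through to `STOP_AND_REPORT`, so an unrecognized failure is surfaced rather than retried by default.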

The key idea is that retries are not just “try again.” They are a controlled decision based on the error type and the risk of repeating the action.
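
One way to make that control explicit is a bounded retry loop with exponential backoff and jitter. The sketch below assumes the tool signals retryable failures with a hypothetical `TransientToolError` exception. The function name, the default delays, and the injectable `sleep` parameter are illustrative choices, not part of any particular framework.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientToolError(Exception):
    """Hypothetical exception for failures worth retrying (timeouts, rate limits)."""


def call_with_retries(
    tool: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `tool`, retrying only TransientToolError, at most `max_attempts` times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except TransientToolError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the error
            # Exponential backoff capped at max_delay, with "full jitter":
            # wait a random duration between 0 and the current cap.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Any other exception propagates immediately, which matches the idea that permanent errors should not be retried unchanged. Passing `sleep` as a parameter keeps the loop easy to test without real delays.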

Tiny concrete example

Suppose an agent calls a weather API:

Tool call: get_weather(city="Pariss")
Tool response: error = "invalid city name"

A good agent should not keep retrying the same call. It should infer that the input may be misspelled and either:

retry with corrected input:
```
Tool call: get_weather(city="Paris")
```
or ask the user:

“Did you mean Paris, France?”
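
The choice between correcting the input and asking the user could be driven by how close the bad input is to a known value. The sketch below uses Python's standard `difflib` module for fuzzy matching. The `KNOWN_CITIES` list, the `plan_repair` function, and both thresholds are assumptions for illustration; a real weather tool may expose its own lookup or suggestion mechanism.

```python
import difflib
from typing import Optional, Tuple

# Hypothetical list of city names the weather tool accepts.
KNOWN_CITIES = ["Paris", "Parma", "Prague", "Perth"]


def plan_repair(
    name: str, auto_cutoff: float = 0.85, ask_cutoff: float = 0.6
) -> Tuple[str, Optional[str]]:
    """Decide how to respond to an 'invalid city name' error.

    Returns ("retry", city) for a near-certain typo, ("confirm", city) when a
    plausible match should be checked with the user, or ("ask_user", None).
    """
    matches = difflib.get_close_matches(name, KNOWN_CITIES, n=1, cutoff=ask_cutoff)
    if not matches:
        return ("ask_user", None)
    best = matches[0]
    score = difflib.SequenceMatcher(None, name, best).ratio()
    if score >= auto_cutoff:
        return ("retry", best)
    return ("confirm", best)


print(plan_repair("Pariss"))
print(plan_repair("Zzzz"))
```

Automatic correction is only reasonable for low-risk reads like this one. For actions with side effects, confirming with the user is the safer default.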

If the tool instead returned a timeout, a retry might be appropriate:

Tool call: get_weather(city="Paris")
Tool response: timeout
Retry after short delay...
Tool call: get_weather(city="Paris")

Common pitfalls / when NOT to use it

- Retrying permanent errors. A bad argument will usually fail again until the input changes.
- No retry limit. Infinite loops are a common agent bug.
- Retrying unsafe actions blindly. For actions like charging a card or sending a message, repeated calls can create duplicate side effects unless the tool is designed for idempotency or deduplication.
- Hiding useful failures. If the agent always “fixes” errors silently, users may not realize something important went wrong.
- Treating all tools the same. A search query, a database write, and an external API call may need different retry rules.

A practical rule: retry transient failures on read operations readily, within a bounded budget; be much more cautious with writes and other side effects.
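
For writes, one common way to make a retry safe is an idempotency key: the caller generates a key once per logical action and sends the same key on every retry, so the tool can recognize a repeat. The sketch below is an in-memory illustration only. `PaymentTool`, its `charge` method, and the receipt format are hypothetical and do not describe any real payment API.

```python
import uuid
from typing import Dict


class PaymentTool:
    """Hypothetical in-memory stand-in for a payment tool that deduplicates by key."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.charges_made = 0

    def charge(self, amount_cents: int, idempotency_key: str) -> str:
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = f"receipt-{self.charges_made}-{amount_cents}"
        self._results[idempotency_key] = receipt
        return receipt


tool = PaymentTool()
key = str(uuid.uuid4())  # generated once per logical action, reused on every retry
first = tool.charge(1999, idempotency_key=key)
retry = tool.charge(1999, idempotency_key=key)  # e.g. the first response timed out
print(first == retry, tool.charges_made)
```

If a tool offers no such mechanism, the cautious choice is to stop and report the failure rather than repeat the call.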