An inference endpoint, or serving API, is a network service you send inputs to in order to get model predictions back in real time.
It solves the problem of turning a trained model into something applications can actually use.
You reach for an inference endpoint when you want low-latency, always-on access to a model and do not want every client to manage model files, hardware, or inference code. In practice, that is why most teams adopt a serving API.
1. You deploy a model behind an API. The model weights, runtime, and preprocessing/postprocessing logic live on a server or managed platform.
2. A client sends an input request. The request usually travels over HTTP or gRPC and carries JSON or another structured payload. It may include text, images, features, or other model inputs.
3. The server runs inference. The service parses the input, executes the model forward pass, and often applies extra logic such as tokenization, batching, safety checks, or output formatting.
4. The endpoint returns a prediction. The response might be a class label, a score, embeddings, generated text, or tool-call metadata.
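The server side of this flow can be sketched with Python's standard library alone. This is a minimal illustration, not a production server: `predict` is a hypothetical keyword-based stand-in for a real model forward pass, the `/v1/infer` path mirrors the example request in this section, and the host and port defaults are assumptions. The toy returns only a label; a real model would usually return a score as well.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(text: str) -> dict:
    """Hypothetical stand-in for a real model forward pass."""
    positive_words = {"love", "great", "excellent"}
    tokens = text.lower().split()  # preprocessing: toy tokenization
    hits = sum(token in positive_words for token in tokens)
    label = "positive" if hits > 0 else "negative"  # postprocessing
    return {"label": label}


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only one route is served in this sketch.
        if self.path != "/v1/infer":
            self.send_error(404, "Unknown path")
            return

        # Parse and validate the JSON request body.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            text = payload["text"]
            if not isinstance(text, str):
                raise TypeError("'text' must be a string")
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "Expected a JSON body with a string 'text' field")
            return

        # Run inference and return the prediction as JSON.
        body = json.dumps(predict(text)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def make_server(host: str = "127.0.0.1", port: int = 8000) -> HTTPServer:
    """Build the server; call .serve_forever() on the result to start it."""
    return HTTPServer((host, port), InferenceHandler)
```

Calling `make_server().serve_forever()` would start listening on the assumed address. A real deployment would typically load the model once at startup and sit behind a production-grade server or managed platform rather than `http.server`.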
A serving API is about online inference: getting predictions from a model through a request/response interface. That is different from training, and different from batch jobs that score large datasets offline.
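The contrast between online and batch scoring can be made concrete with a small sketch. Everything here is illustrative: `predict` is a hypothetical toy classifier, `handle_request` stands for the work done per request by an endpoint, and `score_batch` stands for an offline job that scores a whole dataset and writes the results for later use.

```python
import csv
import io


def predict(text: str) -> str:
    """Hypothetical toy model: a keyword check instead of a real forward pass."""
    return "positive" if "love" in text.lower() else "negative"


def handle_request(payload: dict) -> dict:
    """Online inference: one input in, one prediction out, while the caller waits."""
    return {"label": predict(payload["text"])}


def score_batch(texts, out) -> None:
    """Batch inference: score many rows offline and write them to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["text", "label"])
    for text in texts:
        writer.writerow([text, predict(text)])


# Online: a single request/response exchange.
single = handle_request({"text": "I love this product"})

# Batch: a whole dataset scored in one pass, with nobody waiting on each row.
buffer = io.StringIO()
score_batch(["I love this product", "It broke after a week"], buffer)
```

The model call is the same in both cases. What differs is the interface around it: the online path is optimized for latency per request, while the batch path is optimized for throughput over a dataset.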
Request:

    POST /v1/infer
    Content-Type: application/json

    {
      "text": "I love this product"
    }

Response:

    {
      "label": "positive",
      "confidence": 0.98
    }
A product team might call this endpoint from an app after a user submits feedback, then show the result immediately.
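From the application side, such a call might look like the following sketch, which uses only Python's standard library. The function names `build_request` and `classify_feedback`, the two-second timeout, and any base URL passed in (for example `http://localhost:8000/v1/infer`) are assumptions for illustration; only the `/v1/infer` path and the `text` field come from the example above.

```python
import json
import urllib.request


def build_request(text: str, url: str) -> urllib.request.Request:
    """Build the POST request with a JSON body matching the example payload."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify_feedback(text: str, url: str, timeout: float = 2.0) -> dict:
    """Send the feedback text to the endpoint and return the parsed JSON response."""
    request = build_request(text, url)
    # A short timeout keeps the app responsive if the endpoint is slow or down.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

An app could call `classify_feedback(user_text, url)` right after the user submits feedback and read `result["label"]` from the returned dictionary. Real client code would also want to handle `urllib.error.URLError` and non-200 responses.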