#inference-serving

6 articles

What is an inference endpoint / serving API?

An inference endpoint, or serving API, is a network service you send inputs to in order to get model predictions back in real time. It solves the problem of turning a trained model into something applications can actually use. You reach for an inference endpoint when you want a web or mobile app to call a model on demand; a backend to classify, rank, summarize, or generate text without running the model itself; or a standardized way to deploy, scale, monitor, and secure model requests. In practice, m

What is batching / continuous batching?

Batching is the practice of grouping multiple requests together so a model can process them more efficiently, and continuous batching is the version where new requests are added to that group while others are still running. Large models are expensive to run, and a GPU is often underused if you serve one request at a time. Batching improves throughput by sharing compute across many requests. Continuous batching goes a step further: it keeps the GPU busy by letting the server dynamically
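
The difference between the two scheduling styles can be sketched with a toy step counter. This is a minimal simulation, not how any particular serving engine is implemented: it assumes every running request produces exactly one token per decode step, and the function names and the `lengths` workload are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled from the queue immediately."""
    queue = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One decode step yields one token per running request;
        # requests that reach zero remaining tokens leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical workload: output lengths (in tokens) of six queued requests.
lengths = [2, 8, 2, 8, 2, 2]
static = static_batching_steps(lengths, batch_size=2)          # 18
continuous = continuous_batching_steps(lengths, batch_size=2)  # 12
print(static, continuous)
```

In this toy workload the short requests no longer wait for the long ones in their batch, so the same 24 tokens finish in fewer steps.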

What is streaming (token streaming)?

Streaming, or token streaming, is when an AI model sends you its output incrementally as it is generated instead of waiting until the whole answer is finished. This solves the “dead air” problem: users see text quickly, which makes apps feel faster and more interactive. It is especially useful for chatbots, copilots, and any UI where a short time-to-first-token matters more than waiting for the full response. In practice, teams use streaming when they want: faster perceived latency, a typing-lik

What is time to first token (TTFT)?

Time to first token (TTFT) is the time between sending a generative AI request and receiving the model’s first output token. TTFT is the part of latency users feel first. If it is high, the system can seem “stuck” even if the full answer eventually streams quickly. You usually care about TTFT when you want a responsive chat or copilot experience; good perceived performance in streaming UIs; to compare serving setups, models, or prompts; or to separate “slow to start” from “slow to finish”. In practice
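
One way to measure TTFT is to time any token iterator from the moment the request is issued. The sketch below assumes the response arrives as a Python iterator; `fake_token_stream` is a hypothetical stand-in for a real streaming client, and its delays are made-up values.

```python
import time


def fake_token_stream(tokens, startup_delay, per_token_delay):
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(startup_delay)  # queueing and prompt processing before token 1
    for token in tokens:
        yield token
        time.sleep(per_token_delay)


def measure_stream(stream):
    """Return (ttft_seconds, total_seconds, tokens) for any token iterator."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in stream:
        if ttft is None:
            # First token received: this is the "slow to start" part.
            ttft = time.perf_counter() - start
        tokens.append(token)
    total = time.perf_counter() - start  # includes the "slow to finish" part
    return ttft, total, tokens


ttft, total, tokens = measure_stream(
    fake_token_stream(["Hi", ",", " there"], startup_delay=0.05, per_token_delay=0.01)
)
print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s, tokens: {len(tokens)}")
```

Because the generator body only starts running on the first iteration, the startup delay is counted inside the measured window.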

What is tokens per second (throughput)?

Tokens per second, or throughput, is the rate at which a language model generates or processes tokens over time. Throughput is a practical way to describe model speed. If you care about user experience, cost efficiency, or server capacity, tokens per second tells you how much text the system can handle in a given time. You’d use it when comparing models, sizing infrastructure, or setting expectations for streaming responses. In practice, teams often care about both throughput and latency
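
As a small worked example, throughput is a token count divided by elapsed time. The helper names below are hypothetical, and the second function reflects one common convention (excluding the wait for the first token), not a universal definition.

```python
def tokens_per_second(num_tokens, elapsed_seconds):
    """Overall rate: tokens handled divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


def decode_tokens_per_second(num_output_tokens, total_seconds, ttft_seconds):
    """Generation rate after the first token arrives (assumed convention)."""
    decode_seconds = total_seconds - ttft_seconds
    if num_output_tokens < 2 or decode_seconds <= 0:
        raise ValueError("need at least two tokens and a positive decode time")
    # The first token is attributed to TTFT, so it is not counted here.
    return (num_output_tokens - 1) / decode_seconds


overall = tokens_per_second(200, 4.0)                # 50.0
decode = decode_tokens_per_second(201, 4.5, 0.5)     # 50.0
print(overall, decode)
```

Separating the two numbers keeps a slow start from being mistaken for slow generation when comparing setups.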

What is speculative decoding?

Speculative decoding is a text-generation trick that lets a fast “draft” model propose several next tokens and then uses a larger “target” model to verify them, often reducing end-to-end latency without changing the final output distribution. Autoregressive LLMs normally generate one token at a time, and each token requires a full forward pass through the big model. That is accurate, but it can be slow. Speculative decoding matters when you want lower latency or higher throughput without switchi
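
The draft-then-verify loop can be sketched in a simplified greedy form. This is an illustration under stated assumptions, not the sampling-based algorithm: it accepts a draft token only when it equals the target's greedy choice, whereas the distribution-preserving version uses a probabilistic accept/reject rule not shown here. The toy `target_next` and `draft_next` functions are hypothetical, and a real system would verify all proposed positions in one batched forward pass instead of calling the target per position.

```python
def greedy_decode(next_token, prompt, num_tokens):
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(num_tokens):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def speculative_greedy_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, verification_rounds)."""
    seq = list(prompt)
    rounds = 0
    produced = 0
    while produced < num_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft_seq = list(seq)
        proposal = []
        for _ in range(k):
            tok = draft_next(draft_seq)
            proposal.append(tok)
            draft_seq.append(tok)
        # 2. The target checks the proposal (counted as one round here).
        rounds += 1
        accepted = []
        for tok in proposal:
            expected = target_next(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction ends the round
                break
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target_next(seq + accepted))
        accepted = accepted[:num_tokens - produced]
        seq.extend(accepted)
        produced += len(accepted)
    return seq[len(prompt):], rounds


# Hypothetical toy models over digits: the target counts upward mod 10,
# and the draft agrees except after the digit 2.
def target_next(seq):
    return (seq[-1] + 1) % 10


def draft_next(seq):
    return 0 if seq[-1] == 2 else (seq[-1] + 1) % 10


baseline = greedy_decode(target_next, [0], 12)
tokens, rounds = speculative_greedy_decode(target_next, draft_next, [0], 12, k=4)
print(tokens == baseline, rounds)
```

In this toy setup the output matches target-only greedy decoding while using fewer target verification rounds than generated tokens; any real speedup depends on how often the draft agrees with the target and on the cost of the draft itself.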