Contextual chunking, also called late chunking, is a way to split a document into retrieval chunks after a model has already encoded the whole document, so each chunk keeps some awareness of the surrounding context.
Conventional chunking cuts the text first and then embeds each piece on its own. That is simple, but it can leave chunks semantically thin: a paragraph may be ambiguous without the surrounding sections, headings, or earlier definitions.
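For contrast, here is a minimal sketch of the conventional "cut first, embed alone" approach. The helper names `chunk_words` and `embed_chunks_in_isolation` are hypothetical, the word-window splitting is only one of many possible strategies, and `embed` stands in for whatever embedding function you actually use.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def chunk_words(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed_chunks_in_isolation(
    text: str,
    embed: Callable[[str], Vector],  # assumption: any text -> vector function
    size: int,
    overlap: int = 0,
) -> List[Tuple[str, Vector]]:
    """Embed each chunk separately; no chunk sees the rest of the document."""
    return [(chunk, embed(chunk)) for chunk in chunk_words(text, size, overlap)]
```

Because `embed` only ever receives one chunk's text, nothing outside that window can influence the resulting vector, which is the limitation described above.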
Late chunking helps when you want better retrieval quality for long documents such as policies, manuals, research papers, or knowledge bases. It works in four steps:
1. Encode the whole document first. A transformer model processes the full text, so each token’s representation can reflect the broader document context.
2. Split into chunks afterward. Instead of embedding each chunk in isolation, you derive chunk representations from the contextualized token states of the already-encoded document.
3. Pool token states into chunk vectors. A chunk embedding is typically built from the token embeddings belonging to that chunk, but those token embeddings already “know” about the rest of the document.
4. Use those chunk vectors for retrieval or indexing.
The result is often better semantic search over long documents, because chunks are less context-starved.
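The pooling step can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes mean pooling (other pooling choices are possible), the names `mean_pool` and `late_chunk` are hypothetical, and the `token_states` values are made-up stand-ins for what a long-context transformer encoder would produce for the whole document.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Span = Tuple[int, int]  # token range [start, end), end exclusive


def mean_pool(token_states: Sequence[Sequence[float]], span: Span) -> Vector:
    """Average the contextualized token states that fall inside one chunk."""
    start, end = span
    if not 0 <= start < end <= len(token_states):
        raise ValueError(f"invalid span {span}")
    rows = token_states[start:end]
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def late_chunk(
    token_states: Sequence[Sequence[float]], spans: Sequence[Span]
) -> List[Vector]:
    """Build one vector per chunk from states computed over the full document."""
    return [mean_pool(token_states, span) for span in spans]


# Stand-in values: four tokens with 2-dimensional states. In a real pipeline
# these would come from a single forward pass over the entire document.
token_states = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
chunk_vectors = late_chunk(token_states, [(0, 2), (2, 4)])
```

The key design point is that the encoder runs once over the full document and chunking only selects which token states to average, so the chunk boundaries never limit what the encoder could attend to.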
“Late chunking” is a descriptive name rather than a universally standardized term. “Contextual chunking” is used more loosely and sometimes covers different implementation choices, but the core idea described here is the same: chunk after contextual encoding, not before.
Suppose a document says:
“The policy applies only to external contractors. They must rotate keys every 90 days.”
If you chunk before encoding, the sentence “They must rotate keys every 90 days” is ambiguous on its own, because nothing in it says who “They” are.
With late chunking, the chunk embedding for that sentence can still reflect the earlier line that “they” refers to external contractors, making it more likely to retrieve the right policy section when someone searches for contractor key rotation.
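To make the example concrete, the sketch below works out which token positions belong to each sentence, which is the information a pooling step over contextualized token states would need. It is illustrative only: whitespace-separated words stand in for model tokens (a real pipeline would use the tokenizer's own offsets), the sentence splitter is deliberately naive, and `sentence_token_spans` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # token range [start, end), end exclusive


def sentence_token_spans(text: str) -> Tuple[List[str], List[Span]]:
    """Return whitespace tokens plus one token span per sentence."""
    tokens: List[str] = []
    spans: List[Span] = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = sentence.split()
        if not words:
            continue
        spans.append((len(tokens), len(tokens) + len(words)))
        tokens.extend(words)
    return tokens, spans


policy = (
    "The policy applies only to external contractors. "
    "They must rotate keys every 90 days."
)
tokens, spans = sentence_token_spans(policy)
# spans == [(0, 7), (7, 14)]: the second sentence covers tokens 7 through 13.
```

With late chunking, the states at positions 7 through 13 are computed while the encoder can still see “external contractors” at positions 5 and 6, so averaging them yields a vector for the second sentence that is not limited to that sentence's own words.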
A good rule of thumb: start with conventional chunking and strong metadata; reach for contextual/late chunking when retrieval quality is clearly limited by chunks that lose meaning when isolated.