KV Cache Reuse (a.k.a. prefix caching)
Enabled by setting the environment variable NIM_ENABLE_KV_CACHE_REUSE to 1.
See the configuration documentation for more information.
When more than 90% of the initial prompt is identical across multiple requests, differing only in the final tokens, reusing the key-value (KV) cache can substantially improve inference speed. The KV cache entries already computed for the shared prefix are reused, so only the differing tokens at the end of the prompt need to be processed.
For example, when a user asks questions about a large document, the document is repeated across requests while the question at the end of the prompt differs. When this feature is enabled, there is typically about a 2x speedup in time-to-first-token (TTFT).
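To judge whether a workload fits the "more than 90% identical" pattern described above, it can help to measure how much of a new prompt is a shared prefix of the previous one. The following is a minimal sketch, not part of NIM: the function names are hypothetical, and plain lists of strings stand in for the token IDs a real tokenizer would produce.

```python
from itertools import takewhile


def shared_prefix_len(prev_tokens, new_tokens):
    """Count leading positions where both token sequences agree."""
    pairs = zip(prev_tokens, new_tokens)
    return sum(1 for _ in takewhile(lambda p: p[0] == p[1], pairs))


def shared_prefix_fraction(prev_tokens, new_tokens):
    """Fraction of the new prompt covered by the shared prefix."""
    if not new_tokens:
        return 0.0
    return shared_prefix_len(prev_tokens, new_tokens) / len(new_tokens)


# Stand-in "document" of 95 tokens, followed by two different 5-token questions.
document = [f"d{i}" for i in range(95)]
request_1 = document + ["what", "is", "the", "total", "?"]
request_2 = document + ["which", "row", "is", "largest", "?"]

fraction = shared_prefix_fraction(request_1, request_2)
print(f"shared prefix: {fraction:.0%}")
```

Here 95 of the 100 tokens in the second request repeat the first, so this pair falls in the range where KV cache reuse is expected to help. In practice the measurement would have to use the model's own tokenizer, since cache reuse operates on tokens rather than on words.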
Example:
Large table input followed by a question about the table
Same large table input followed by a different question about the table
Same large table input followed by a different question about the table
and so forth…
KV cache reuse speeds up TTFT from the second request onward.
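The request sequence above can be illustrated with a toy simulation that counts how many prompt tokens need fresh prefill work on each request. This is a simplified sketch under stated assumptions, not a description of how NIM or its inference backends implement the feature: the class name, the block size of 4 tokens, and the block-hashing scheme are all hypothetical, and no real KV tensors are stored.

```python
import hashlib


class ToyPrefixCache:
    """Tracks which prompt prefixes have been seen, in fixed-size token blocks."""

    def __init__(self, block_size=4):  # block size is an arbitrary assumption
        self.block_size = block_size
        self.cached_blocks = set()

    def _block_keys(self, tokens):
        # Each key hashes the whole prefix up to and including its block,
        # so a block only matches when everything before it matches too.
        keys = []
        running = hashlib.sha256()
        full = len(tokens) - len(tokens) % self.block_size
        for start in range(0, full, self.block_size):
            block = tokens[start:start + self.block_size]
            running.update("\x00".join(block).encode() + b"\x01")
            keys.append(running.hexdigest())
        return keys

    def prefill(self, tokens):
        """Return how many tokens must be computed, then cache this prompt."""
        keys = self._block_keys(tokens)
        reused = 0
        for key in keys:
            if key not in self.cached_blocks:
                break
            reused += 1
        self.cached_blocks.update(keys)
        return len(tokens) - reused * self.block_size


table = [f"cell{i}" for i in range(64)]  # stand-in for the large table input
questions = [
    ["what", "is", "the", "sum"],
    ["which", "row", "is", "largest"],
    ["how", "many", "rows", "exist"],
]

cache = ToyPrefixCache()
costs = [cache.prefill(table + q) for q in questions]
print(costs)  # [68, 4, 4]
```

The first request pays for all 68 tokens, while the second and third only pay for the 4 question tokens, which mirrors why the TTFT improvement appears only from the second request onward.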
Depending on the model and hardware configuration, enabling this feature may require using a just-in-time engine.