
KV Cache Reuse (a.k.a. prefix caching)

Enable this feature by setting the environment variable NIM_ENABLE_KV_CACHE_REUSE to 1. See the configuration documentation for more information.

In scenarios where more than 90% of the initial prompt is identical across multiple requests, differing only in the final tokens, reusing the key-value (KV) cache can substantially improve inference speed. The high degree of overlap lets the server reuse the attention states already computed for the shared prefix, so only the differing tokens at the end of each prompt need to be processed.

For example, when a user asks multiple questions about a large document, the document is repeated across requests while only the question at the end of the prompt changes. With this feature enabled, time-to-first-token (TTFT) typically improves by about 2x.

Example:

  • Large table input followed by a question about the table

  • Same large table input followed by a different question about the table

  • Same large table input followed by a different question about the table

  • and so forth…

KV cache reuse speeds up TTFT starting with the second request. The sketch below illustrates this request pattern.
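
The following is a minimal sketch of this request pattern, assuming a NIM server started with NIM_ENABLE_KV_CACHE_REUSE=1 and reachable through its OpenAI-compatible API at http://localhost:8000/v1; the model name, document text, and questions are placeholders. With reuse enabled, the TTFT printed for the second and later requests should be noticeably lower than for the first.

```python
# Minimal sketch: repeated requests sharing a large document prefix.
# Assumes a NIM server started with NIM_ENABLE_KV_CACHE_REUSE=1 and
# reachable at the placeholder URL below via its OpenAI-compatible API.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

document = "...large table or document text..."  # shared prefix, identical across requests
questions = [
    "What is the total of column A?",
    "Which row has the largest value?",
    "Summarize the table in one sentence.",
]

for question in questions:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",  # placeholder model name
        messages=[{"role": "user", "content": f"{document}\n\n{question}"}],
        stream=True,
    )
    next(iter(stream))  # wait for the first streamed chunk
    print(f"TTFT: {time.perf_counter() - start:.3f}s")
    for _ in stream:  # drain the remaining chunks
        pass
```

Note that placing the shared document at the start of the prompt, before the question, is what makes the prefix cacheable; if the varying question came first, the cached prefix could not be reused.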

Depending on the model and hardware configuration, enabling this feature may require using a just-in-time (JIT) built engine.
