
KV Cache Reuse (a.k.a. prefix caching)

Enable this feature by setting the environment variable NIM_ENABLE_KV_CACHE_REUSE to 1. See the configuration documentation for more information.

In scenarios where more than 90% of the initial prompt is identical across multiple requests, differing only in the final tokens, reusing the key-value (KV) cache can substantially improve inference speed. The high degree of overlap lets the server reuse the attention states already computed for the shared prefix, so only the differing tokens at the end of each prompt need to be processed.

For example, when a user asks multiple questions about a large document, the document is repeated across requests while only the question at the end of the prompt changes. With this feature enabled, time-to-first-token (TTFT) typically improves by about 2x.

Example:

  • Large table input followed by a question about the table

  • Same large table input followed by a different question about the table

  • Same large table input followed by a different question about the table

  • and so forth…

KV cache reuse speeds up TTFT starting with the second request. The sketch below illustrates this request pattern.
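
The following is a minimal sketch of this request pattern, assuming a NIM server started with NIM_ENABLE_KV_CACHE_REUSE=1 and reachable through its OpenAI-compatible API at http://localhost:8000/v1; the model name, document text, and questions are placeholders. With reuse enabled, the TTFT printed for the second and later requests should be noticeably lower than for the first.

```python
# Minimal sketch: repeated requests sharing a large document prefix.
# Assumes a NIM server started with NIM_ENABLE_KV_CACHE_REUSE=1 and
# reachable at the placeholder URL below via its OpenAI-compatible API.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

document = "...large table or document text..."  # shared prefix, identical across requests
questions = [
    "What is the total of column A?",
    "Which row has the largest value?",
    "Summarize the table in one sentence.",
]

for question in questions:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",  # placeholder model name
        messages=[{"role": "user", "content": f"{document}\n\n{question}"}],
        stream=True,
    )
    next(iter(stream))  # wait for the first streamed chunk
    print(f"TTFT: {time.perf_counter() - start:.3f}s")
    for _ in stream:  # drain the remaining chunks
        pass
```

Note that placing the shared document at the start of the prompt, before the question, is what makes the prefix cacheable; if the varying question came first, the cached prefix could not be reused.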

Depending on the model and hardware configuration, enabling this feature may require using a just-in-time (JIT) built engine.
