Configuring a NIM
NVIDIA NIM for LLMs (NIM for LLMs) uses Docker containers under the hood. Each NIM is its own Docker container, and there are several ways to configure it. Below is a full reference of all the ways to configure a NIM container.
Passing `--gpus all` to `docker run` is acceptable in homogeneous environments with one or more of the same GPU. `--gpus all` works only if your configuration has the same number of GPUs as specified for the model in the Support Matrix. Running inference on a configuration with fewer or more GPUs can result in a runtime error.
In heterogeneous environments with a combination of GPUs (for example: A6000 + a GeForce display GPU), workloads should only run on compute-capable GPUs. Expose specific GPUs inside the container using either:

- the `--gpus` flag (ex: `--gpus='"device=1"'`)
- the environment variable `NVIDIA_VISIBLE_DEVICES` (ex: `-e NVIDIA_VISIBLE_DEVICES=1`)
The device ID(s) to use as input(s) are listed in the output of `nvidia-smi -L`:

```
GPU 0: Tesla H100 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
GPU 1: NVIDIA GeForce RTX 3080 (UUID: GPU-b404a1a1-d532-5b5c-20bc-b34e37f3ac46)
```
Refer to the NVIDIA Container Toolkit documentation for more instructions.
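For example, here is a minimal sketch that applies each of the two options above to run a NIM on GPU 1 only. `$IMG_NAME` is a placeholder for your NIM container image (not a variable defined by NIM), `NGC_API_KEY` is assumed to be exported on the host, and the port mapping assumes the default `NIM_SERVER_PORT` of 8000.

```bash
# Sketch only: run the NIM on GPU 1. $IMG_NAME is a placeholder for your
# NIM container image, and NGC_API_KEY is assumed to be exported on the host.

# Option 1: the --gpus flag
docker run --rm --gpus='"device=1"' \
  -e NGC_API_KEY \
  -p 8000:8000 \
  $IMG_NAME

# Option 2: the NVIDIA_VISIBLE_DEVICES environment variable
# (assumes the NVIDIA Container Toolkit runtime, selected here with --runtime=nvidia)
docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=1 \
  -e NGC_API_KEY \
  -p 8000:8000 \
  $IMG_NAME
```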
Passing `--shm-size=16GB` to `docker run` is required when not using NVLink for multi-GPU setups. It is not required on SXM systems or when using profiles that run on only one GPU (e.g., `NIM_TENSOR_PARALLEL_SIZE=1`).
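As an illustration, the sketch below shows a two-GPU deployment on a system without NVLink, which therefore needs the flag. `$IMG_NAME` is again a placeholder image name and the GPU indices are only an example.

```bash
# Sketch only: two GPUs without NVLink, so --shm-size=16GB is required.
docker run --rm --gpus='"device=0,1"' \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -p 8000:8000 \
  $IMG_NAME
```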
Below is a reference of the required and optional environment variables that can be passed into a NIM (with `-e` added to `docker run`):
| ENV | Required? | Default | Notes |
|---|---|---|---|
| `NGC_API_KEY` | Yes | None | You must set this variable to the value of your personal NGC API key. |
| `NIM_CACHE_PATH` | No | `/opt/nim/.cache` | Location (in container) where the container caches model artifacts. |
| `NIM_DISABLE_LOG_REQUESTS` | No | `1` | Set to `0` to view request logs. By default, logging of request details to `v1/completions` and `v1/chat/completions` is disabled. These logs contain sensitive attributes of the request, including `prompt`, `sampling_params`, and `prompt_token_ids`. Be aware that these attributes are exposed in the container logs when request logging is enabled by setting this variable to `0`. |
| `NIM_JSONL_LOGGING` | No | `0` | Set to `1` to enable JSON-formatted logs. Readable text logs are enabled by default. |
| `NIM_LOG_LEVEL` | No | `DEFAULT` | Log level of the NIM for LLMs service. Possible values are `DEFAULT`, `TRACE`, `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL`. The behavior of `DEBUG`, `INFO`, `WARNING`, `ERROR`, and `CRITICAL` mostly follows the Python 3 logging docs. The `TRACE` level enables printing of diagnostic information for debugging purposes in TRT-LLM and in uvicorn. When `NIM_LOG_LEVEL` is `DEFAULT`, all log levels are set to `INFO`, except for the TRT-LLM log level, which is set to `ERROR`. When `NIM_LOG_LEVEL` is `CRITICAL`, the TRT-LLM log level is `ERROR`. |
| `NIM_SERVER_PORT` | No | `8000` | Publish the NIM service to the prescribed port inside the container. Make sure to adjust the port passed to the `-p/--publish` flag of `docker run` to reflect that (ex: `-p $NIM_SERVER_PORT:$NIM_SERVER_PORT`). The left-hand side of this `:` is your host address:port, and does NOT have to match with `$NIM_SERVER_PORT`. The right-hand side of the `:` is the port inside the container, which MUST match `NIM_SERVER_PORT` (or `8000` if not set). |
| `NIM_MODEL_PROFILE` | No | None | Override the NIM optimization profile that is automatically selected by specifying a profile ID from the manifest located at `/etc/nim/config/model_manifest.yaml`. If not specified, NIM will attempt to select an optimal profile compatible with available GPUs. A list of the compatible profiles can be obtained by appending `list-model-profiles` at the end of the `docker run` command. Using the profile name `default` will select a profile that is maximally compatible and may not be optimal for your hardware. |
| `NIM_MANIFEST_ALLOW_UNSAFE` | No | `0` | If set to `1`, enable selection of a model profile not included in the original `model_manifest.yaml` or a profile that is not detected to be compatible with the deployed hardware. |
| `NIM_PEFT_SOURCE` | No | None | If you want to enable PEFT inference with local PEFT modules, set the `NIM_PEFT_SOURCE` environment variable and pass it into the `docker run` command. If your PEFT source is a local directory at `LOCAL_PEFT_DIRECTORY`, mount your local PEFT directory to the container's PEFT source set by `NIM_PEFT_SOURCE`. Make sure that your directory only contains PEFT modules for the base NIM, and that the PEFT directory and all the contents inside it are readable by NIM. |
| `NIM_MAX_LORA_RANK` | No | `32` | Set the maximum LoRA rank. |
| `NIM_MAX_GPU_LORAS` | No | `8` | Set the number of LoRAs that can fit in the GPU PEFT cache. This is the maximum number of LoRAs that can be used in a single batch. |
| `NIM_MAX_CPU_LORAS` | No | `16` | Set the number of LoRAs that can fit in the CPU PEFT cache. This should be set >= max concurrency or the number of LoRAs you are serving, whichever is less. If you have more concurrent LoRA requests than `NIM_MAX_CPU_LORAS`, you may see "cache is full" errors. This value must be >= `NIM_MAX_GPU_LORAS`. |
| `NIM_PEFT_REFRESH_INTERVAL` | No | None | How often, in seconds, to check `NIM_PEFT_SOURCE` for new models. If not set, the PEFT cache does not refresh. If you choose to enable PEFT refreshing by setting this variable, we recommend a value greater than 30. |
| `NIM_SERVED_MODEL_NAME` | No | None | The model name(s) used in the API. If multiple names are provided (comma-separated), the server responds to any of them. The model name in the `model` field of a response is the first name in this list. If not specified, the model name is inferred from the manifest located at `/etc/nim/config/model_manifest.yaml`. Note that these names are also used in the `model_name` tag of Prometheus metrics; if multiple names are provided, the metrics tag uses the first one. |
| `NIM_ENABLE_OTEL` | No | `0` | Set this flag to `1` to enable OpenTelemetry instrumentation in NIMs. |
| `OTEL_TRACES_EXPORTER` | No | `console` | Specifies the OpenTelemetry exporter to use for tracing. Set this flag to `otlp` to export the traces using the OpenTelemetry Protocol. Set it to `console` to print the traces to standard output. |
| `OTEL_METRICS_EXPORTER` | No | `console` | Similar to `OTEL_TRACES_EXPORTER`, but for metrics. |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | No | None | The endpoint where the OpenTelemetry Collector is listening for OTLP data. Adjust the URL to match your OpenTelemetry Collector's configuration. |
| `OTEL_SERVICE_NAME` | No | None | Sets the name of your service to help with identifying and categorizing data. |
| `NIM_TOKENIZER_MODE` | No | `auto` | The tokenizer mode. `auto` will use the fast tokenizer if available. `slow` will always use the slow tokenizer. |
| `NIM_ENABLE_KV_CACHE_REUSE` | No | `0` | Set to `1` to enable automatic prefix caching / KV cache reuse. This is useful when large prompts frequently reappear and reusing KV caches across requests would speed up inference. |
| `NIM_MAX_MODEL_LEN` | No | None | Model context length. If unspecified, it is automatically derived from the model configuration. Note that this setting only affects models running on the vLLM backend. |
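As a combined illustration of the table above, the sketch below passes a handful of these variables with `-e`. The values shown are examples only, and `$IMG_NAME` remains a placeholder for your NIM container image.

```bash
# Sketch only: configure a NIM through environment variables passed with -e.
export NGC_API_KEY="<your NGC API key>"   # required

docker run --rm --gpus all \
  -e NGC_API_KEY \
  -e NIM_SERVER_PORT=8000 \
  -e NIM_JSONL_LOGGING=1 \
  -e NIM_LOG_LEVEL=INFO \
  -e NIM_ENABLE_KV_CACHE_REUSE=1 \
  -p 8000:8000 \
  $IMG_NAME
```

To see which optimization profiles are compatible with your GPUs before setting `NIM_MODEL_PROFILE`, append `list-model-profiles` to the end of the `docker run` command, as noted in the table:

```bash
docker run --rm --gpus all -e NGC_API_KEY $IMG_NAME list-model-profiles
```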
Local paths can be mounted to the following container paths.
| Container path | Required? | Notes | Docker argument example |
|---|---|---|---|
| `/opt/nim/.cache` (or `NIM_CACHE_PATH` if present) | No; however, if this volume is not mounted, the container does a fresh download of the model every time the container starts. | This is the directory to which models are downloaded inside the container. You can access this directory from within the container by adding the `-u $(id -u)` option to the `docker run` command. For example, to use `~/.cache/nim` as the host machine directory for caching models, first run `mkdir -p ~/.cache/nim` before running the `docker run ...` command. | `-v ~/.cache/nim:/opt/nim/.cache -u $(id -u)` |
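Putting the cache mount together with the flags above, a full invocation that persists model downloads across restarts might look like this sketch. Paths, the published port, and `$IMG_NAME` are placeholders.

```bash
# Sketch only: persist downloaded models on the host so restarts skip the download.
mkdir -p ~/.cache/nim

docker run --rm --gpus all \
  -u $(id -u) \
  -v ~/.cache/nim:/opt/nim/.cache \
  -e NGC_API_KEY \
  -p 8000:8000 \
  $IMG_NAME
```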