Prerequisites

Check the support matrix to make sure that you have the supported hardware and software stack.

NGC Authentication

Generate an API key

An NGC API key is required to access NGC resources. You can generate a key at https://org.ngc.nvidia.com/setup/personal-keys.

When creating an NGC API Personal key, ensure that at least “NGC Catalog” is selected from the “Services Included” dropdown. You can include additional services if you plan to reuse this key for other purposes.

Note

Personal keys allow you to configure an expiration date, revoke or delete the key using an action button, and rotate the key as needed. For more information about key types, refer to the NGC User Guide.

Export the API key

Pass the value of the API key to the docker run command in the next section as the NGC_API_KEY environment variable to download the appropriate models and resources when starting the NIM.

If you’re not familiar with how to create the NGC_API_KEY environment variable, the simplest way is to export it in your terminal:

    export NGC_API_KEY=<value>

Run one of the following commands to make the key available at startup:

    # If using bash
    echo "export NGC_API_KEY=<value>" >> ~/.bashrc

    # If using zsh
    echo "export NGC_API_KEY=<value>" >> ~/.zshrc

Note

Other, more secure options include saving the value in a file, so that you can retrieve it with cat $NGC_API_KEY_FILE, or using a password manager.
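As a minimal sketch of the file-based option, the helper functions below store the key in a file that only your user can read and load it back into the environment. The function names and the ~/.ngc/api_key path are illustrative assumptions, not part of any NGC or NIM tooling.

```shell
# Hypothetical helpers; names and paths are illustrative, not NGC tooling.

# save_ngc_key FILE VALUE: write the key to FILE with owner-only permissions.
save_ngc_key() {
  mkdir -p "$(dirname "$1")"
  # The subshell keeps the restrictive umask from leaking into your session.
  ( umask 077 && printf '%s\n' "$2" > "$1" )
  chmod 600 "$1"
}

# load_ngc_key FILE: export NGC_API_KEY from the contents of FILE.
load_ngc_key() {
  NGC_API_KEY="$(cat "$1")" && export NGC_API_KEY
}

# Example usage (replace <value> with your key):
#   export NGC_API_KEY_FILE="$HOME/.ngc/api_key"
#   save_ngc_key "$NGC_API_KEY_FILE" "<value>"
#   load_ngc_key "$NGC_API_KEY_FILE"
```

Calling load_ngc_key from your shell startup file keeps the key itself out of ~/.bashrc or ~/.zshrc.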

To pull the NIM container image from NGC, first authenticate with the NVIDIA Container Registry with the following command:

    echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin

Use $oauthtoken as the username and the value of NGC_API_KEY as the password. The $oauthtoken username is a special name that indicates that you will authenticate with an API key rather than a user name and password.

Launching the NIM

The following command launches a Docker container for the nv-embedqa-e5-v5 model.

    # Choose a container name for bookkeeping
    export NIM_MODEL_NAME=nvidia/nv-embedqa-e5-v5
    export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

    # Choose a NIM Image from NGC
    export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.1.0"

    # Choose a path on your system to cache the downloaded models
    export LOCAL_NIM_CACHE=~/.cache/nim
    mkdir -p "$LOCAL_NIM_CACHE"

    # Start the NIM
    docker run -it --rm --name=$CONTAINER_NAME \
      --runtime=nvidia \
      --gpus all \
      --shm-size=16GB \
      -e NGC_API_KEY \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
      -u $(id -u) \
      -p 8000:8000 \
      $IMG_NAME

| Flags | Description |
| --- | --- |
| `-it` | `--interactive` + `--tty` (see Docker docs) |
| `--rm` | Delete the container after it stops (see Docker docs) |
| `--name=nv-embedqa-e5-v5` | Give a name to the NIM container for bookkeeping (here `nv-embedqa-e5-v5`). Use any preferred value. |
| `--runtime=nvidia` | Ensure NVIDIA drivers are accessible in the container. |
| `--gpus all` | Expose all NVIDIA GPUs inside the container. See the configuration page for mounting specific GPUs. |
| `--shm-size=16GB` | Allocate host memory for multi-GPU communication. Not required for single-GPU models or GPUs with NVLink enabled. |
| `-e NGC_API_KEY` | Provide the container with the token necessary to download the appropriate models and resources from NGC. See above. |
| `-v "$LOCAL_NIM_CACHE:/opt/nim/.cache"` | Mount a cache directory from your system (`~/.cache/nim` here) inside the NIM (defaults to `/opt/nim/.cache`), allowing downloaded models and artifacts to be reused by follow-up runs. |
| `-u $(id -u)` | Use the same user as your system user inside the NIM container to avoid permission mismatches when downloading models to your local cache directory. |
| `-p 8000:8000` | Forward the port where the NIM server is published inside the container so that it can be accessed from the host system. The left-hand side of `:` is the host system ip:port (`8000` here), while the right-hand side is the container port where the NIM server is published (defaults to `8000`). |
| `$IMG_NAME` | Name and version of the NIM container from NGC. The NIM server automatically starts if no argument is provided after this. |

If you have an issue with permission mismatches when downloading models in your local cache directory, add the -u $(id -u) option to the docker run call to run under your current identity.

If you are running on a host with different types of GPUs, you should specify GPUs of the same type using the --gpus argument to docker run. For example, --gpus '"device=0,2"'. The device IDs of 0 and 2 are examples only; replace them with the appropriate values for your system. Device IDs can be found by running nvidia-smi. More information can be found in GPU Enumeration.
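One way to build that device list is to filter the nvidia-smi listing by GPU name. The sketch below is an assumption-laden helper rather than part of NIM: gpus_of_type is a hypothetical function name, and it assumes the "index, name" layout that nvidia-smi prints for the --query-gpu=index,name --format=csv,noheader query.

```shell
# gpus_of_type NAME: read "index, name" CSV lines on stdin and print the
# comma-separated indices of the GPUs whose name equals NAME.
# Hypothetical helper; not part of NIM or the NVIDIA tooling.
gpus_of_type() {
  awk -F', ' -v want="$1" '
    $2 == want { ids = (ids == "" ? "" : ids ",") $1 }
    END { print ids }
  '
}

# Example usage (the GPU name is a placeholder; copy yours from nvidia-smi):
#   devices="$(nvidia-smi --query-gpu=index,name --format=csv,noheader \
#     | gpus_of_type 'NVIDIA A100-SXM4-80GB')"
#   docker run --gpus "\"device=$devices\"" ...
```

The function reads from standard input, so you can check its behavior with a saved or hand-written listing before using it on a live host.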

GPU clusters with GPUs in Multi-Instance GPU (MIG) mode are currently not supported.

Running Inference

Note

After the Docker container starts, it may take a few seconds before it is ready to accept requests.

Confirm the service is ready to handle inference requests:

    curl -X 'GET' 'http://localhost:8000/v1/health/ready'

If the service is ready, you will get a response like this:

    {"object":"health-response","message":"Service is ready."}
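Because the container needs a moment to start, a script may want to poll the readiness endpoint instead of checking it once. The following is a minimal sketch: wait_until_ready is a hypothetical helper name, and it assumes the endpoint only answers with a successful HTTP status once the service is ready.

```shell
# wait_until_ready URL [ATTEMPTS] [DELAY_SECONDS]
# Poll URL until it returns a successful HTTP status or attempts run out.
# Hypothetical helper; the defaults of 60 attempts and 2 seconds are arbitrary.
wait_until_ready() {
  wur_url="$1"
  wur_attempts="${2:-60}"
  wur_delay="${3:-2}"
  wur_try=0
  while [ "$wur_try" -lt "$wur_attempts" ]; do
    # -s silences progress output; -f makes curl fail on HTTP error codes.
    if curl -sf "$wur_url" > /dev/null 2>&1; then
      return 0
    fi
    wur_try=$((wur_try + 1))
    sleep "$wur_delay"
  done
  return 1
}

# Example usage:
#   wait_until_ready 'http://localhost:8000/v1/health/ready' && echo "NIM is ready"
```

The function returns a non-zero status when the attempts are exhausted, so a calling script can stop rather than send requests to a server that never came up.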

Send an embedding request:

    curl -X "POST" \
      "http://localhost:8000/v1/embeddings" \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
    "input": ["Hello world"],
    "model": "nvidia/nv-embedqa-e5-v5",
    "input_type": "query"
    }'
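When the input text comes from a shell variable rather than a literal, the request body has to be assembled with some care for quoting. The sketch below shows one possible approach; json_escape and embed_payload are hypothetical helper names, and the escaping covers only backslashes and double quotes.

```shell
# json_escape TEXT: escape backslashes and double quotes for a JSON string.
# Limitation: control characters such as newlines are not handled.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# embed_payload TEXT MODEL INPUT_TYPE: print the JSON request body.
embed_payload() {
  printf '{"input": ["%s"], "model": "%s", "input_type": "%s"}\n' \
    "$(json_escape "$1")" "$2" "$3"
}

# Example usage:
#   curl -X POST 'http://localhost:8000/v1/embeddings' \
#     -H 'accept: application/json' \
#     -H 'Content-Type: application/json' \
#     -d "$(embed_payload 'Hello world' 'nvidia/nv-embedqa-e5-v5' 'query')"
```

For arbitrary text, a dedicated JSON tool is a safer choice than hand-rolled escaping.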

For further information, see the Reference.

Deploying on Multiple GPUs

Text Embedding NIM deploys a single TensorRT model across all of the GPUs that you specify and that are visible inside the Docker container. If you do not specify the number of GPUs, Text Embedding NIM defaults to one GPU. When using multiple GPUs, Triton distributes inference requests across the GPUs to keep them equally utilized.

Use the docker run --gpus command-line argument to specify the number of GPUs that are available for deployment.

Example using all GPUs:

    docker run --gpus all ...

Example using two GPUs:

    docker run --gpus 2 ...

Example using specific GPUs:

    docker run --gpus '"device=1,2"' ...

Downloading NIM Models to Cache

If model assets must be pre-fetched (for example, for an air-gapped system), the NIM container supports downloading these assets to the NIM cache without starting the server.

    # Choose a container name for bookkeeping
    export NIM_MODEL_NAME=nvidia/nv-embedqa-e5-v5
    export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)

    # Choose a NIM Image from NGC
    export IMG_NAME="nvcr.io/nim/$NIM_MODEL_NAME:1.1.0"

    # Choose a path on your system to cache the downloaded models
    export LOCAL_NIM_CACHE=~/.cache/nim
    mkdir -p "$LOCAL_NIM_CACHE"

    # Start the NIM container with a command to download the model to the cache
    docker run -it --rm --name=$CONTAINER_NAME \
      --runtime=nvidia \
      --gpus all \
      --shm-size=16GB \
      -e NGC_API_KEY \
      -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
      -u $(id -u) \
      -p 8000:8000 \
      $IMG_NAME download-to-cache

    # Start the NIM container in an air-gapped environment and serve the model
    docker run -it --rm --name=$CONTAINER_NAME \
      --runtime=nvidia \
      --gpus=all \
      --shm-size=16GB \
      --network=none \
      -v "$LOCAL_NIM_CACHE:/mnt/nim-cache:ro" \
      -u $(id -u) \
      -e NIM_CACHE_PATH=/mnt/nim-cache \
      -e NGC_API_KEY \
      -p 8000:8000 \
      $IMG_NAME

By default, the download-to-cache command downloads the most appropriate model assets for the detected GPU. To override this behavior and download a specific model, set the NIM_MODEL_PROFILE environment variable when launching the container. Use the list-model-profiles command available within the NIM container to list all profiles. See Optimization for more details.
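The two steps described above, listing the profiles and then downloading a specific one, could be wrapped as follows. This is a sketch under assumptions: list_profiles and download_profile are hypothetical function names, the docker flags simply mirror the commands shown earlier, and IMG_NAME and LOCAL_NIM_CACHE are expected to be set as in those commands.

```shell
# list_profiles: print the model profiles known to the NIM image.
# Hypothetical wrapper around the list-model-profiles container command.
list_profiles() {
  docker run --rm --runtime=nvidia --gpus all \
    -e NGC_API_KEY \
    "$IMG_NAME" list-model-profiles
}

# download_profile PROFILE: pre-fetch one specific profile into the cache
# by setting NIM_MODEL_PROFILE for the download-to-cache command.
download_profile() {
  docker run --rm --runtime=nvidia --gpus all \
    -e NGC_API_KEY \
    -e NIM_MODEL_PROFILE="$1" \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u "$(id -u)" \
    "$IMG_NAME" download-to-cache
}

# Example usage (replace <profile> with an entry printed by list_profiles):
#   list_profiles
#   download_profile "<profile>"
```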

Stopping the Container

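The following commands stop and remove the running Docker container. If the container was started with --rm, as in the examples above, Docker removes it automatically when it stops, and docker rm reports that the container no longer exists.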

    docker stop $CONTAINER_NAME
    docker rm $CONTAINER_NAME
