Customizing NVIDIA NIM for Domain-Specific Needs with NVIDIA NeMo

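Large language models (LLMs) adopted for specific enterprise applications often benefit from model customization. Enterprises need to tailor LLMs to their specific needs and deploy these models quickly for low-latency, high-throughput inference. This post will help you get started with that process.

Specifically, we show how to customize the Llama 3 8B NIM using the PubMedQA dataset to answer questions in the biomedical domain. Question answering is critical for organizations, which need to quickly extract key information from large volumes of content and deliver relevant information to customers.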

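NVIDIA software used in this tutorial

NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of easy-to-use inference microservices designed to accelerate the deployment of performance-optimized generative AI models in the enterprise. NIM inference microservices can be deployed anywhere, from workstations and on-premises to the cloud, giving enterprises control over their deployment choices and ensuring data security. NIM also offers industry-leading latency and throughput, enabling cost-effective scaling and a seamless experience for end users.

Users can now access NIM inference microservices for the Llama 3 8B Instruct and Llama 3 70B Instruct models for self-hosted deployment on any NVIDIA-accelerated infrastructure. If you are just getting started with prototyping, check out the Llama 3 APIs in the NVIDIA API catalog.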

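NVIDIA NeMo is an end-to-end platform for developing custom generative AI. NeMo includes tools for training, customization, retrieval-augmented generation (RAG), guardrails, toolkits, data curation, and model pretraining. NeMo offers a simple, cost-effective, and fast way to adopt generative AI.

Using the NeMo framework, enterprises can build models that align with their brand voice and understand domain-specific knowledge. Whether creating a customer service chatbot or an IT help bot, NeMo helps developers build custom generative AI that excels at its tasks while incorporating industry terminology, domain knowledge and skills, and unique organizational requirements.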

Figure 1 shows the general steps involved in customizing an LLM NIM with NeMo and LoRA and deploying it with NIM. First, convert the model to .nemo format. Then create LoRA adapters for the NeMo model and use those adapters with NIM for inference on the customized model. NIM supports dynamic loading of LoRA adapters, which makes it possible to train multiple LoRA models for different use cases.

Figure 1. Steps involved in customizing an LLM NIM with LoRA using the NeMo framework and deploying it with NIM

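Prerequisites

Before you begin, make sure you have the following:

Access to NVIDIA A100, NVIDIA H100, or NVIDIA L40S GPUs. One or more GPUs with a cumulative memory of at least 80 GB is recommended.
A Docker-enabled environment with NVIDIA Container Runtime installed, which makes the container GPU-aware.
An NGC CLI API key, which is provided when you authenticate with NVIDIA NGC and download the NGC CLI tool.
An NVIDIA AI Enterprise license. To request a free 90-day trial license, visit Llama 3 8B Instruct in the API catalog and click the Run Anywhere with NIM button.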

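Step 1: Download the Llama 3 8B Instruct model

You can use the CLI to download the Llama 3 8B Instruct model from the NVIDIA NGC catalog. The model has already been converted to .nemo format and is compatible with the NeMo framework.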

ngc registry model download-version "nvidia/nemo/llama-3-8b-instruct-nemo:1.0"

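This creates a folder named llama-3-8b-instruct-nemo_v1.0 that contains the .nemo file.

Step 2: Get the NeMo framework container

The NeMo framework is available as a Docker container in the NGC catalog. The container includes the environment and all the scripts needed for LoRA fine-tuning.

The following command assumes that the Llama 3 8B Instruct model folder is inside the current working directory, so that it is mounted (at /workspace) and accessible to the fine-tuning script.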

# Run the docker container in interactive mode
docker run \
     --gpus all \
     --shm-size=2g \
     --net=host \
     --ulimit memlock=-1 \
     --rm -it \
     -v ${PWD}:/workspace \
     -w /workspace \
     -v ${PWD}/results:/results \
     nvcr.io/nvidia/nemo:24.05 bash

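Once inside the container, you can perform the remaining steps in a Jupyter Notebook environment.

Step 3: Download and preprocess the customization dataset

PubMedQA is a dataset for question answering in the medical domain. To download it, clone the pubmedqa GitHub repository, which includes the steps for splitting the dataset into train/val/test sets.

A raw example is shown below: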

"18251357": { 
"QUESTION": "Does histologic chorioamnionitis correspond to clinical chorioamnionitis?", 
"CONTEXTS": [ "To evaluate the degree to which histologic chorioamnionitis, a frequent finding in placentas submitted for histopathologic evaluation, correlates with clinical indicators of infection in the mother.", "A retrospective review was performed on 52 cases with a histologic diagnosis of acute chorioamnionitis from 2,051 deliveries at University Hospital, Newark, from January 2003 to July 2003. Third-trimester placentas without histologic chorioamnionitis (n = 52) served as controls. Cases and controls were selected sequentially. Maternal medical records were reviewed for indicators of maternal infection.", "Histologic chorioamnionitis was significantly associated with the usage of antibiotics (p = 0.0095) and a higher mean white blood cell count (p = 0.018). The presence of 1 or more clinical indicators was significantly associated with the presence of histologic chorioamnionitis (p = 0.019)." ], 
"reasoning_required_pred": "yes", 
"reasoning_free_pred": "yes", 
"final_decision": "yes", 
"LONG_ANSWER": "Histologic chorioamnionitis is a reliable indicator of infection whether or not it is clinically apparent." },

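Given the question and the context, the goal of this tutorial is to fine-tune Llama 3 8B to respond with "yes" or "no".

For fine-tuning, convert the data to .jsonl format, where each line is a separate example in the form of a JSON dict containing the input: and output: keys used for supervised learning. After preprocessing, an example looks like the following: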

{
"input": "OBJECTIVE: To evaluate the degree to which histologic chorioamnionitis, a frequent finding in placentas submitted for histopathologic evaluation, correlates with clinical indicators of infection in the mother ... \nQUESTION: Does histologic chorioamnionitis correspond to clinical chorioamnionitis?\n ### ANSWER (yes|no|maybe): ", 
"output": "<<< yes >>>"}

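Note that the input consists of the context followed by the question.

In the output, the "<<<" and ">>>" markers make it possible to verify that a response comes from the LoRA-tuned model, because the base model can also generate "Yes"/"No" responses from a zero-shot template.

For end-to-end instructions on preprocessing, see the Jupyter Notebook tutorial.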
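The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's actual preprocessing code: the helper names to_sft_example and write_jsonl are hypothetical, the context strings are shortened, and the section labels seen in the preprocessed example (such as "OBJECTIVE:") are left out because the raw record shown in this post does not include them.

```python
import json


def to_sft_example(record: dict) -> dict:
    """Turn one raw PubMedQA record into an input/output pair."""
    # Join the abstract sections in order, then append the question
    context = "\n".join(record["CONTEXTS"])
    prompt = (
        f"{context}\n"
        f"QUESTION: {record['QUESTION']}\n"
        " ### ANSWER (yes|no|maybe): "
    )
    # Wrap the label in the <<< >>> markers used throughout this tutorial
    return {"input": prompt, "output": f"<<< {record['final_decision']} >>>"}


def write_jsonl(records: dict, path: str) -> None:
    """Write one JSON object per line (the .jsonl layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records.values():
            f.write(json.dumps(to_sft_example(record)) + "\n")


# Shortened copy of the raw example above (context text truncated here)
raw = {
    "18251357": {
        "QUESTION": "Does histologic chorioamnionitis correspond to clinical chorioamnionitis?",
        "CONTEXTS": ["To evaluate the degree to which ...", "A retrospective review ..."],
        "final_decision": "yes",
    }
}

example = to_sft_example(raw["18251357"])
print(example["output"])
```

Calling write_jsonl(raw, "pubmedqa_train.jsonl") would then produce a file with one such example per line.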

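Step 4: Fine-tune the model with the NeMo framework

The NeMo framework includes a high-level Python script for fine-tuning, megatron_gpt_finetuning.py, which abstracts away some of the lower-level API calls.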

MODEL="/workspace/llama-3-8b-instruct-nemo_v1.0/8b_instruct_nemo_bf16.nemo"
TRAIN_DS="[./pubmedqa/data/pubmedqa_train.jsonl]"
VALID_DS="[./pubmedqa/data/pubmedqa_val.jsonl]"
TEST_DS="[./pubmedqa/data/pubmedqa_test.jsonl]"
TEST_NAMES="[pubmedqa]"

# Tensor and Pipeline model parallelism
TP_SIZE=1
PP_SIZE=1

# Save results and checkpoints in this directory
OUTPUT_DIR="./results/Meta-Llama-3-8B-Instruct"

torchrun --nproc_per_node=1 \
/opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    exp_manager.exp_dir=${OUTPUT_DIR} \
    exp_manager.explicit_log_dir=${OUTPUT_DIR} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    trainer.precision=bf16-mixed \
    trainer.val_check_interval=20 \
    trainer.max_epochs=10 \
    model.megatron_amp_O2=False \
    ++model.mcore_gpt=True \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.micro_batch_size=1 \
    model.global_batch_size=8 \
    model.restore_from_path=${MODEL} \
    model.data.train_ds.num_workers=0 \
    model.data.validation_ds.num_workers=0 \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.peft.peft_scheme="lora"

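This creates a LoRA adapter in .nemo format in $OUTPUT_DIR/checkpoints.

The model.peft.peft_scheme parameter determines the technique used. This tutorial uses LoRA, but the NeMo framework also supports other techniques, such as p-tuning, adapters, and IA3.

Training a Llama 3 70B model follows the same process. The only differences are the higher memory and compute requirements and the need to shard the model across multiple GPUs. The recommended configuration is eight NVIDIA A100 or NVIDIA H100 80 GB GPUs with eight-way tensor parallelism (TP=8, PP=1).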
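To see how the batch and parallelism settings fit together, the sketch below works through the arithmetic. It assumes the relation commonly used in Megatron-style training, where the global batch size equals the micro batch size times the data-parallel size times the number of gradient-accumulation steps; treat this as an illustration, not a description of the script's internals. The function name parallel_layout is hypothetical, and the 70B line assumes the same batch sizes as the 8B command, which the post does not state.

```python
def parallel_layout(num_gpus, tp_size, pp_size, micro_batch_size, global_batch_size):
    """Derive the data-parallel size and gradient-accumulation steps."""
    model_parallel = tp_size * pp_size  # GPUs needed to hold one model replica
    if num_gpus % model_parallel:
        raise ValueError("num_gpus must be divisible by TP * PP")
    dp_size = num_gpus // model_parallel  # number of model replicas
    samples_per_pass = micro_batch_size * dp_size
    if global_batch_size % samples_per_pass:
        raise ValueError("global batch must be divisible by micro batch * DP")
    return dp_size, global_batch_size // samples_per_pass


# 8B settings from the command above: 1 GPU, TP=1, PP=1, micro=1, global=8
print(parallel_layout(1, 1, 1, 1, 8))
# 70B layout from the text (8 GPUs, TP=8, PP=1); batch sizes are assumed
print(parallel_layout(8, 8, 1, 1, 8))
```

In both cases a single model replica processes one sample at a time and accumulates gradients over eight micro batches before each optimizer step.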

You can override many of these configurations when running the script. For the full set of possible configurations, see the config yaml.

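Step 5: Prepare the LoRA model store

Now that you have a .nemo LoRA model, it's time to deploy it. NIM can deploy multiple LoRA adapters on the same base model, and it expects a specific directory structure in order to find them.

The following example shows how to prepare this model store. Each LoRA adapter should be placed in its own folder, and the folder name is the name used to address requests to that adapter at inference time.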

</path/to/LoRA-model-store>
├── llama3-8b-pubmed-qa
│   └── megatron_gpt_peft_lora_tuning.nemo
├── llama3-8b-lora_model_2_nemo
│   └── llama3-8b-instruct-lora_model_2.nemo
└── llama3-8b-lora_model_3_hf
    ├── adapter_config.json
    └── adapter_model.safetensors

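In this tutorial, one LoRA adapter was trained on PubMedQA, so go ahead and place it in its own directory inside the model store folder. If you have other adapters, you can repeat this process with them for a multi-LoRA deployment. Note that NVIDIA NIM supports adapters trained with the NeMo framework as well as with Hugging Face PEFT.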
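Placing an adapter in the model store amounts to creating a folder named after the adapter and copying the checkpoint into it. The following sketch shows that with the Python standard library. The helper add_adapter is hypothetical, and the demo runs in a temporary directory with an empty placeholder file standing in for the real checkpoint.

```python
import shutil
import tempfile
from pathlib import Path


def add_adapter(store: Path, name: str, adapter_file: Path) -> Path:
    """Copy an adapter file into <store>/<name>/ and return the new path."""
    target_dir = store / name  # the folder name becomes the model name
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(adapter_file, target_dir))


# Demo in a throwaway directory; the .nemo file here is an empty placeholder
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "megatron_gpt_peft_lora_tuning.nemo"
    checkpoint.touch()
    store = Path(tmp) / "loras"
    placed = add_adapter(store, "llama3-8b-pubmed-qa", checkpoint)
    relative = placed.relative_to(store).as_posix()
    print(relative)
```

For a real deployment, store would be the model store path and adapter_file the checkpoint written to $OUTPUT_DIR/checkpoints in Step 4.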

Step 6: Deploy with NIM

With the model store in place, deployment takes a single Docker command.

export NGC_API_KEY=<YOUR_NGC_API_KEY>
export LOCAL_PEFT_DIRECTORY=</path/to/LoRA-model-store>
chmod -R 777 $LOCAL_PEFT_DIRECTORY

export NIM_PEFT_SOURCE=/home/nvs/loras # Path to LoRA models internal to the container
export NIM_PEFT_REFRESH_INTERVAL=3600  # (in seconds) how often NIM_PEFT_SOURCE is checked for newly added models; 3600 = every hour
export CONTAINER_NAME=meta-llama3-8b-instruct

export NIM_CACHE_PATH=</path/to/NIM-model-store-cache>  # Model artifacts (in container) are cached in this directory

mkdir -p $NIM_CACHE_PATH
chmod -R 777 $NIM_CACHE_PATH


docker run -it --rm --name=$CONTAINER_NAME \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_PEFT_SOURCE \
    -e NIM_PEFT_REFRESH_INTERVAL \
    -v $NIM_CACHE_PATH:/opt/nim/.cache \
    -v $LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama3-8b-instruct:1.0.0

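The first time you run the command, it downloads the NVIDIA TensorRT-LLM-optimized Llama 3 engine and caches it in $NIM_CACHE_PATH, which speeds up subsequent deployments. Several other options are available to further configure NIM; the full list is in the NIM configuration documentation.

Running this command should start a server on port 8000, and you are now ready to send inference requests.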
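Before sending requests, it can be useful to confirm that the adapter was picked up by the server. The sketch below assumes the server exposes an OpenAI-style model listing at /v1/models that returns a "data" array of objects with an "id" field; that endpoint and response shape are assumptions to verify against the NIM documentation. The function names and the sample payload are made up for illustration, and only the offline check is run here.

```python
import json
import urllib.request


def adapter_is_listed(models_payload: dict, name: str) -> bool:
    """Return True if `name` appears among the served model IDs."""
    return any(m.get("id") == name for m in models_payload.get("data", []))


def fetch_models(base_url: str = "http://0.0.0.0:8000") -> dict:
    """GET <base_url>/v1/models (assumed endpoint) and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return json.load(resp)


# Offline illustration with a made-up payload in the assumed shape
sample = {
    "object": "list",
    "data": [{"id": "meta/llama3-8b-instruct"}, {"id": "llama3-8b-pubmed-qa"}],
}
print(adapter_is_listed(sample, "llama3-8b-pubmed-qa"))
```

Against a running server, adapter_is_listed(fetch_models(), "llama3-8b-pubmed-qa") would perform the same check on the live listing.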

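Step 7: Send an inference request

To create a completion, send a POST request to the /completions endpoint. To follow along, create a Python script or launch a Jupyter Notebook in a separate terminal. The following code uses the Python requests library.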

import requests
import json

url = 'http://0.0.0.0:8000/v1/completions'
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}

# Example from the PubMedQA test set
prompt="BACKGROUND: Sublingual varices have earlier been related to ageing, smoking and cardiovascular disease. The aim of this study was to investigate whether sublingual varices are related to presence of hypertension.\nMETHODS: In an observational clinical study among 431 dental patients tongue status and blood pressure were documented. Digital photographs of the lateral borders of the tongue for grading of sublingual varices were taken, and blood pressure was measured. Those patients without previous diagnosis of hypertension and with a noted blood pressure \u2265 140 mmHg and/or \u2265 90 mmHg at the dental clinic performed complementary home blood pressure during one week. Those with an average home blood pressure \u2265 135 mmHg and/or \u2265 85 mmHg were referred to the primary health care centre, where three office blood pressure measurements were taken with one week intervals. Two independent blinded observers studied the photographs of the tongues. Each photograph was graded as none/few (grade 0) or medium/severe (grade 1) presence of sublingual varices. Pearson's Chi-square test, Student's t-test, and multiple regression analysis were applied. Power calculation stipulated a study population of 323 patients.\nRESULTS: An association between sublingual varices and hypertension was found (OR = 2.25, p<0.002). Mean systolic blood pressure was 123 and 132 mmHg in patients with grade 0 and grade 1 sublingual varices, respectively (p<0.0001, CI 95 %). Mean diastolic blood pressure was 80 and 83 mmHg in patients with grade 0 and grade 1 sublingual varices, respectively (p<0.005, CI 95 %). Sublingual varices indicate hypertension with a positive predictive value of 0.5 and a negative predictive value of 0.80.\nQUESTION: Is there a connection between sublingual varices and hypertension?\n ### ANSWER (yes|no|maybe): "

data = {
    "model": "llama3-8b-pubmed-qa",
    "prompt": prompt,
    "max_tokens": 128
}

response = requests.post(url, headers=headers, json=data)
response.raise_for_status()  # stop here if the server returned an HTTP error
response_data = response.json()

print(json.dumps(response_data, indent=4))

The output looks like the following:

{
    "id": "cmpl-403d22baa7c3470eb468ee8a38033e1f",
    "object": "text_completion",
    "created": 1717493046,
    "model": "llama3-8b-pubmed-qa",
    "choices": [
        {
            "index": 0,
            "text": " <<< yes >>>",
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 412,
        "total_tokens": 415,
        "completion_tokens": 3
    }
}

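This example returns the text output "<<< yes >>>" along with other metadata. As you may recall from a few steps earlier, this is the format the adapter was trained on. Running inference over the entire PubMedQA test set and computing accuracy yields the following metrics: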

Accuracy 0.786000
Macro-F1 0.584112
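For readers who want to see how such metrics are derived from the raw completions, here is a minimal sketch. It is not the tutorial's evaluation code: the helper names are hypothetical, the four completions and gold labels are toy data invented for illustration, and macro-F1 is computed as the unweighted mean of per-label F1 over the three labels, with a label that never occurs scored as 0 (an assumption about the convention).

```python
import re

LABELS = ("yes", "no", "maybe")


def extract_label(text: str):
    """Pull the label out of a '<<< label >>>' completion, or return None."""
    match = re.search(r"<<<\s*(yes|no|maybe)\s*>>>", text.lower())
    return match.group(1) if match else None


def accuracy_and_macro_f1(golds, preds):
    """Accuracy plus the unweighted mean of per-label F1 scores."""
    accuracy = sum(g == p for g, p in zip(golds, preds)) / len(golds)
    f1_scores = []
    for label in LABELS:
        tp = sum(g == label and p == label for g, p in zip(golds, preds))
        fp = sum(g != label and p == label for g, p in zip(golds, preds))
        fn = sum(g == label and p != label for g, p in zip(golds, preds))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Toy data, made up for illustration only
completions = [" <<< yes >>>", " <<< no >>>", " <<< yes >>>", " <<< maybe >>>"]
golds = ["yes", "no", "no", "maybe"]
preds = [extract_label(c) for c in completions]
acc, macro_f1 = accuracy_and_macro_f1(golds, preds)
print(f"Accuracy {acc:.6f}")
print(f"Macro-F1 {macro_f1:.6f}")
```

A completion without the markers yields None from extract_label and is therefore counted as incorrect, which is one way the markers help separate adapter output from base-model output.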

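Conclusion

Well done! You have successfully customized the Llama 3 8B Instruct model and deployed it with NVIDIA NIM. Compared with the PubMedQA leaderboard, just a few training steps and a short training time give you a reasonably accurate model. The full tutorial also includes instructions for computing these metrics. The hyperparameters can be tuned further for higher accuracy, and advanced training optimizations in the NeMo framework enable faster iteration.

To further simplify generative AI customization, the NeMo team has announced an early access program for the NVIDIA NeMo Customizer microservice. This high-performance, scalable service simplifies fine-tuning and alignment of LLMs for domain-specific use cases. Built on a familiar microservice and API architecture, it helps enterprises bring solutions to market faster. Apply for early access.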
