A Simple Guide to Deploying Generative AI with NVIDIA NIM

Whether you’re working on-premises or in the cloud, NVIDIA NIM inference microservices provide enterprise developers with easy-to-deploy optimized AI models from the community, partners, and NVIDIA. Part of NVIDIA AI Enterprise, NIM offers a secure, streamlined path forward to iterate quickly and build innovations for world-class generative AI solutions.

Using a single optimized container, you can easily deploy a NIM in under 5 minutes on accelerated NVIDIA GPU systems in the cloud or data center, or on workstations and PCs. Alternatively, if you want to avoid deploying a container, you can begin prototyping your applications with NIM APIs from the NVIDIA API catalog. 

With NIM, you can:

- Use prebuilt containers that deploy with a single command on NVIDIA accelerated infrastructure anywhere.
- Maintain security and control of your data, your most valuable enterprise resource.
- Achieve the best accuracy with support for models that have been fine-tuned using techniques like LoRA.
- Integrate accelerated AI inference endpoints through consistent, industry-standard APIs.
- Work with the most popular generative AI application frameworks like LangChain, LlamaIndex, and Haystack.

This post walks through a few options for deploying NVIDIA NIM. You’ll be able to use NIM microservice APIs across the most popular generative AI application frameworks like Hugging Face, Haystack, LangChain, and LlamaIndex. For a full guide to deploying NIM, see the NIM documentation.

How to deploy NIM in 5 minutes 

Before you get started, make sure you have all the prerequisites. Follow the requirements in the NIM documentation. Note that an NVIDIA AI Enterprise License is required to download and use NIM.

When you have everything set up, run the following script:

# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/meta/${CONTAINER_NAME}:1.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
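
On first launch, the container downloads the model into LOCAL_NIM_CACHE, which can take a few minutes. Before sending requests, you can wait for the server to come up; here is a minimal Python sketch, assuming the NIM is reachable at http://0.0.0.0:8000 and serves the standard OpenAI-style /v1/models route once it is ready:

import time
import requests  # assumes the requests package is installed

NIM_URL = "http://0.0.0.0:8000"

# Poll the OpenAI-style models route until the server answers,
# which indicates the NIM has finished loading the model.
for _ in range(60):
    try:
        response = requests.get(f"{NIM_URL}/v1/models", timeout=5)
        if response.status_code == 200:
            print("NIM is ready:", [m["id"] for m in response.json()["data"]])
            break
    except requests.RequestException:
        pass  # server not reachable yet
    time.sleep(10)
else:
    print("NIM did not become ready in time")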

Next, test an inference request:

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama3-8b-instruct",
    "prompt": "Once upon a time",
    "max_tokens": 64
  }'

Now you have a controlled, optimized production deployment to securely build generative AI applications. 

Sample NVIDIA-hosted deployments of NIM are also available on the NVIDIA API catalog. 

Note that as new versions of NIM are released, the most up-to-date documentation will always be available at https://docs.nvidia.com/nim.

How to integrate NIM with your applications 

While the previous setup is required to deploy NIM in your own environment, if you’d like to test NIM without deploying it yourself, you can do so using the NVIDIA-hosted API endpoints in the NVIDIA API catalog. Follow the steps below.

Integrate NIM endpoints

You can start with a completions curl request that follows the OpenAI spec. Note that to stream outputs, you should set stream to True. 
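
If you are calling the NVIDIA-hosted endpoints rather than a local NIM, the same OpenAI-style request works against the API catalog. Below is a minimal Python sketch instead of curl; the base URL https://integrate.api.nvidia.com/v1 and the NVIDIA_API_KEY environment variable name are assumptions to check against the API catalog page for your model:

import os
from openai import OpenAI

# NVIDIA-hosted endpoint from the API catalog
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed API catalog base URL
    api_key=os.environ["NVIDIA_API_KEY"],            # assumed env var holding your catalog API key
)

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "What is a GPU?"}],
    max_tokens=256,
    stream=True,  # stream tokens back as they are generated
)

for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")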

To use NIMs in Python code with the OpenAI library: 

You don’t need to provide an API key if you’re using a NIM.

Make sure to update the base_url to wherever your NIM is running.

from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:8000/v1",
    api_key="no-key-required"
)

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "What is a GPU?"}],
    temperature=0.5,
    top_p=1,
    max_tokens=1024,
    stream=True
)

for chunk in completion:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

NIM is also integrated into application frameworks like Haystack, LangChain, and LlamaIndex, bringing secure, reliable, accelerated model inferencing to developers already building amazing generative AI applications with these popular tools. 

To use NIMs in Python code with LangChain:

from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(base_url="http://0.0.0.0:8000/v1", model="meta/llama3-8b-instruct", temperature=0.5, max_tokens=1024, top_p=1)

result = llm.invoke("What is a GPU?")
print(result.content)
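
From there, the NIM-backed model drops into ordinary LangChain pipelines. Below is a minimal sketch, assuming the same locally running NIM, that composes the model with a prompt template and an output parser:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_nvidia_ai_endpoints import ChatNVIDIA

llm = ChatNVIDIA(base_url="http://0.0.0.0:8000/v1", model="meta/llama3-8b-instruct")

# Compose a simple prompt -> model -> string pipeline
prompt = ChatPromptTemplate.from_template("Explain {topic} in one paragraph for a new developer.")
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"topic": "GPU memory bandwidth"}))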

Check out the notebooks from each of these frameworks to learn how to use NIM.

Integrate NIM Hugging Face endpoints

You can also integrate a dedicated NIM endpoint directly on Hugging Face. Hugging Face spins up instances on your preferred cloud, deploys the NVIDIA optimized model, and enables you to start inference with just a few clicks. Simply navigate to the model page on Hugging Face and create a dedicated endpoint directly using your preferred CSP. See this blog post for a step-by-step guide.

Figure 1. The Llama 3 model page on Hugging Face with the NIM endpoints deployment option
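
Once the dedicated endpoint is running, you can call it much like any other NIM. The sketch below assumes the endpoint exposes the same OpenAI-compatible /v1 API as a self-hosted NIM; the endpoint URL and the HF_TOKEN environment variable are placeholders for the values Hugging Face gives you:

import os
from openai import OpenAI

# Placeholder URL and token from your Hugging Face dedicated endpoint
client = OpenAI(
    base_url="https://<your-endpoint>.endpoints.huggingface.cloud/v1",
    api_key=os.environ["HF_TOKEN"],
)

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",
    messages=[{"role": "user", "content": "What is a GPU?"}],
    max_tokens=256,
)
print(completion.choices[0].message.content)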

Get more from NIM

With fast, reliable, and simple model deployment using NVIDIA NIM, you can focus on building performant and innovative generative AI workflows and applications.

Customize NIM with LoRA

To get even more from NIM, learn how to use the microservices with LLMs customized with LoRA adapters. NIM supports LoRA adapters trained with either Hugging Face PEFT or NVIDIA NeMo. Simply store the LoRA adapters in LOCAL_PEFT_DIRECTORY and serve them using a script similar to the one used for the base container.

# Choose a container name for bookkeeping
export CONTAINER_NAME=llama3-8b-instruct
export IMG_NAME="nvcr.io/nim/meta/${CONTAINER_NAME}:1.0.0"

# Choose a path on your system to store the LoRA adapters
export LOCAL_PEFT_DIRECTORY=~/loras
mkdir -p "$LOCAL_PEFT_DIRECTORY"

# Download a NeMo-format LoRA into that directory. You can also download Hugging Face PEFT LoRAs
(cd "$LOCAL_PEFT_DIRECTORY" && ngc registry model download-version "nim/meta/llama3-70b-instruct-lora:nemo-math-v1")

# Path inside the container where the adapters are mounted (example value; adjust as needed)
export NIM_PEFT_SOURCE=/home/nvs/loras

# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  -e NGC_API_KEY \
  -e NIM_PEFT_SOURCE \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -v "$LOCAL_PEFT_DIRECTORY:$NIM_PEFT_SOURCE" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
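
Once the container is up, you can check which base model and LoRA adapters are being served by listing the models the server exposes. A minimal Python sketch, assuming the standard OpenAI-style /v1/models route and that adapters are listed under their own names alongside the base model:

import requests  # assumes the requests package is installed

# List the models the NIM exposes
response = requests.get("http://0.0.0.0:8000/v1/models", timeout=10)
response.raise_for_status()
for model in response.json()["data"]:
    print(model["id"])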

You can then send inference requests that target one of the LoRA adapters in LOCAL_PEFT_DIRECTORY by passing the adapter name as the model.

curl -X 'POST' \
  'http://0.0.0.0:8000/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3-8b-instruct-lora_vhf-math-v1",
    "prompt": "John buys 10 packs of magic cards. Each pack has 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?",
    "max_tokens": 128
  }'
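
The same request can be sent from Python with the OpenAI client; the only change from the base-model examples above is the adapter name passed as the model (a sketch assuming the adapter name from the curl request):

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="no-key-required")

# Select the LoRA adapter by passing its name as the model
completion = client.completions.create(
    model="llama3-8b-instruct-lora_vhf-math-v1",
    prompt="John buys 10 packs of magic cards. Each pack has 20 cards and 1/4 of those cards are uncommon. How many uncommon cards did he get?",
    max_tokens=128,
)
print(completion.choices[0].text)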

For additional details on LoRA, see this technical blog post.

NIMs are regularly released and improved. Visit the API catalog often to see the latest NVIDIA NIM microservices for vision, retrieval, 3D, digital biology, and more. 
