Transforming Telco Network Operations Centers with NVIDIA NeMo Retriever and NVIDIA NIM

Telecom companies are challenged with consistently meeting the service level agreements (SLAs) that guarantee network quality of service for end customers. This includes quickly troubleshooting network devices with complex issues, identifying root causes, and resolving issues efficiently at their network operations centers (NOCs).

Current network troubleshooting and repair processes are often time-consuming, error-prone, and lead to prolonged network downtime, which negatively impacts operational efficiency and customer experience.

To address these issues, Infosys built a generative AI solution using NVIDIA NIM inference microservices and retrieval augmented generation (RAG) for automated network troubleshooting. The solution streamlines NOC processes, minimizes network downtime, and optimizes network performance.

Building smart network operations centers with generative AI

Infosys is a global leader in next-generation digital services and consulting with over 300K employees around the world. The Infosys team built a smart NOC, a generative AI customer engagement platform designed for NOC operators, chief network officers (CNOs), network administrators, and IT support staff. 

The RAG-based solution uses an intelligent chatbot to support NOC staff with digitized product information for network equipment and assists with troubleshooting network issues by quickly providing essential, vendor-agnostic router commands for diagnostics and monitoring. This reduces mean time to resolution and enhances customer service.

Challenges with vector embeddings and document retrieval

Infosys faced several challenges when building the chatbot for a smart NOC. Chief among them was balancing high accuracy against low latency in the underlying generative AI pipeline: achieving the highest accuracy requires an additional reranking pass over retrieved vector embeddings during each user query, which adds latency.

In addition, network-specific taxonomy, changing network device types and endpoints, and complex device documentation made it difficult to create a reliable, user-friendly solution.

Running vector embedding processes on CPUs is also time-consuming, and extended job runs can significantly delay data ingestion, degrading the user experience.

Finally, serving LLM inference through an API revealed a notable uptick in latency, increasing overall processing time and making inference optimization a priority.

Data collection and preparation

To solve these challenges, Infosys built a vector database of network device-specific manuals and knowledge artifacts, such as training documents and troubleshooting guides, to ground contextual responses to user queries. Their initial focus included Cisco and Juniper Networks devices. Embeddings were created using embedding models, customized chunk sizes, and other fine-tuned parameters to populate the vector database.

Figure 1. Data preprocessing pipeline for a basic retrieval augmented generation workflow
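A minimal sketch of this ingestion step, assuming a self-hosted NeMo Retriever embedding NIM behind an OpenAI-compatible endpoint and FAISS as the vector store. The endpoint URL, model identifier, input_type parameter, and file names are illustrative assumptions; the 512/100 chunking values come from the configurations discussed later in this post.

```python
import faiss
import numpy as np
from openai import OpenAI

# Assumed self-hosted NeMo Retriever embedding NIM endpoint and model name.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="not-used")
EMBED_MODEL = "NV-Embed-QA"

def chunk(text: str, size: int = 512, overlap: int = 100) -> list[str]:
    """Split a document into fixed-size chunks with overlap (512/100 per the article)."""
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed_passages(passages: list[str]) -> np.ndarray:
    """Embed chunks as 'passages'; the input_type parameter is an assumed NIM option."""
    resp = client.embeddings.create(
        model=EMBED_MODEL,
        input=passages,
        extra_body={"input_type": "passage"},
    )
    return np.array([d.embedding for d in resp.data], dtype="float32")

# Hypothetical device manuals and troubleshooting guides.
docs = [open(p).read() for p in ("cisco_router_guide.txt", "juniper_mx_manual.txt")]
chunks = [c for d in docs for c in chunk(d)]
vectors = embed_passages(chunks)
faiss.normalize_L2(vectors)                 # cosine similarity via inner product
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)
```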

Solution architecture

Infosys balanced the following considerations and goals for their solution architecture:

User interface and chatbot: Develop an intuitive React interface for creating customized chatbots tailored to NOC workflows, with advanced query scripting options, and display responses generated by NVIDIA NIM running a Llama 3 70B model.

Data configuration management: Provide flexible settings for chunking and embedding using NVIDIA NeMo Retriever embedding NIM (NV-Embed-QA-Mistral-7B). This enables users to define parameters like chunk size and overlap and to select from various embedding models, giving control over data ingestion and performance.

Vector database options: Support a choice of vector databases, such as FAISS for high-speed retrieval, ensuring flexibility, efficiency, and consistent responsiveness.

Backend services and integration: Create robust backend services for chatbot management and configuration, including a RESTful API for integration with external systems, and ensure secure authentication and authorization.

Integration with NIM: Integrate NIM microservices to improve the accuracy, performance, and cost of inference.

Configuration:

10 NVIDIA A100 80-GB GPUs: eight running NIM and two running NeMo Retriever microservices

128 CPU cores

1 TB storage

Guardrails: Use NVIDIA NeMo Guardrails, an open-source toolkit for easily adding programmable guardrails to LLM-based conversational applications, to protect against vulnerabilities; a minimal sketch follows this list.
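The sketch below uses the NeMo Guardrails Python API; the rail definitions and the model entry pointing at NIM's OpenAI-compatible endpoint are illustrative assumptions, not the production configuration.

```python
from nemoguardrails import LLMRails, RailsConfig

# Illustrative model entry; routing NeMo Guardrails through NIM's
# OpenAI-compatible endpoint is an assumption for this sketch.
yaml_content = """
models:
  - type: main
    engine: openai
    model: meta/llama3-70b-instruct
"""

# Illustrative rail: keep the chatbot on network-operations topics.
colang_content = """
define user ask off topic
  "What do you think about politics?"

define bot refuse off topic
  "I can only help with network operations and troubleshooting."

define flow
  user ask off topic
  bot refuse off topic
"""

config = RailsConfig.from_content(colang_content=colang_content, yaml_content=yaml_content)
rails = LLMRails(config)
reply = rails.generate(messages=[{"role": "user", "content": "BGP session is down on an edge router"}])
print(reply["content"])
```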

Figure 2. Workflow for a user prompting a generative AI chatbot and the backend RAG pipeline to provide a fast and accurate response

AI workflow with NVIDIA NIM and NeMo Guardrails

To build the smart NOC, Infosys used a self-hosted instance of NVIDIA NIM and NVIDIA NeMo to fine-tune and deploy foundation LLMs. The team used NIM to expose OpenAI-compatible API endpoints, giving their client application a uniform interface.
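In practice, that uniformity means the client application can talk to the self-hosted NIM with a standard OpenAI client; the base URL and model identifier below are assumptions for illustration.

```python
from openai import OpenAI

# Self-hosted NIM serves an OpenAI-compatible API; URL and model ID are illustrative.
client = OpenAI(base_url="http://nim-llm:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta/llama3-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a NOC troubleshooting assistant."},
        {"role": "user", "content": "Which commands show interface errors on a Juniper MX router?"},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```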

Infosys used NeMo Retriever to power their vector database retrieval and reranking workflows. NeMo Retriever is a collection of microservices that present a single API for indexing and querying user data, enabling enterprises to seamlessly connect custom models to diverse business data and deliver highly accurate responses. For more information, see Translate Your Enterprise Data into Actionable Insights with NVIDIA NeMo Retriever.

Using NeMo Retriever, powered by the NV-Embed-QA-Mistral-7B NIM, Infosys achieved over 90% accuracy with their text embedding pipeline.
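Continuing the ingestion sketch above (reusing its client, EMBED_MODEL, index, and chunks), query-time retrieval might look like the following; the query-versus-passage input_type distinction is an assumed NIM parameter.

```python
def search(query: str, k: int = 5) -> list[str]:
    """Embed the user query and return the top-k chunks from the FAISS index."""
    resp = client.embeddings.create(
        model=EMBED_MODEL,
        input=[query],
        extra_body={"input_type": "query"},  # assumed NIM-specific parameter
    )
    q = np.array([resp.data[0].embedding], dtype="float32")
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return [chunks[i] for i in ids[0]]
```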

NV-Embed-QA-Mistral-7B ranks first on the Massive Text Embedding Benchmark (MTEB), excelling across 56 tasks, including retrieval and classification. The model's design lets it attend to latent vectors for better pooled embedding outputs, and it employs a two-stage instruction-tuning method to enhance accuracy.

Figure 3. NV-Embed-QA-Mistral-7B embedding model performance

Infosys used NeMo Retriever reranking NIM (Rerank-QA-Mistral-4B), which refines the retrieved context from the vector database with respect to the query. This step is crucial when retrieved contexts come from various datastores with differing similarity scores. The reranker is based on a Mistral model fine-tuned and pruned to 4B parameters, enhancing efficiency without sacrificing performance.
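A hedged sketch of how this reranking step could slot into the pipeline; the endpoint path, request and response shapes, and model identifier are assumptions about the reranking NIM's API rather than confirmed details.

```python
import requests

RERANK_URL = "http://nim-rerank:8000/v1/ranking"  # assumed self-hosted endpoint

def rerank(query: str, passages: list[str], top_n: int = 3) -> list[str]:
    """Re-order retrieved chunks by relevance to the query before prompting the LLM.

    The request/response shape below is an assumption about the reranking NIM API.
    """
    payload = {
        "model": "nv-rerank-qa_v1",
        "query": {"text": query},
        "passages": [{"text": p} for p in passages],
    }
    rankings = requests.post(RERANK_URL, json=payload, timeout=30).json()["rankings"]
    ordered = sorted(rankings, key=lambda r: r["logit"], reverse=True)
    return [passages[r["index"]] for r in ordered[:top_n]]
```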

Figure 4. The nv-rerank-qa_v1 reranker model improves accuracy

Using the NV-Embed-QA-Mistral-7B model boosted accuracy by 19 percentage points over the baseline model (from 70% to 89%), leading to an overall improvement in response generation. Adding the nv-rerank-qa_v1 reranking model improved accuracy by a further two percentage points. Together, the NeMo Retriever embedding and reranking models improved the accuracy and relevance of LLM responses in the RAG pipeline.

Results

Latency and accuracy are two key factors in evaluating LLM performance. Infosys measured both, comparing baseline models against models deployed using NVIDIA NIM.

LLM latency evaluation

Infosys measured LLM latency to compare results with and without using NVIDIA NIM (Table 1).

Without NIM, the LLM latency for Combo 1 was measured at 2.3 seconds. Using NIM to deploy a Llama 3 70B model with NeMo Retriever embedding and reranking microservices, Combo 5 achieved a latency of 0.9 seconds, a reduction of nearly 61% compared to the baseline model.
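A simple harness for reproducing this kind of end-to-end latency measurement against the OpenAI-compatible endpoint from the earlier sketch; the prompts, run counts, and token limit are illustrative.

```python
import statistics
import time

def measure_latency(prompts: list[str], n_runs: int = 5) -> float:
    """Return the median end-to-end chat-completion latency in seconds."""
    samples = []
    for prompt in prompts:
        for _ in range(n_runs):
            start = time.perf_counter()
            client.chat.completions.create(
                model="meta/llama3-70b-instruct",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=256,
            )
            samples.append(time.perf_counter() - start)
    return statistics.median(samples)

print(f"Median latency: {measure_latency(['Show BGP neighbor status commands']):.2f} s")
```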

               Without NIM          With NIM
               Combo 1    Combo 2   Combo 3    Combo 4    Combo 5
Latency (sec)  2.3        1.9       1.1        1.3        0.9

Table 1. Latency comparison for LLMs

Figure 5. Latency comparison for five different LLMs

LLM accuracy evaluation

Infosys measured LLM accuracy for the smart NOC to compare results with and without NIM (Table 2).

Infosys achieved LLM accuracy of up to 85% without NIM and up to 92% with NeMo Retriever embedding and reranking NIMs, an absolute improvement of 22 percentage points over the 70% baseline. This demonstrates the effectiveness of NVIDIA NIM in optimizing the accuracy of RAG systems, making it a valuable enhancement for achieving more accurate and reliable model outputs.
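The post does not state how accuracy was scored; one common approach is an LLM-as-judge harness over a labeled evaluation set, sketched below purely as an illustration (the judge prompt, judge model, and eval-set fields are assumptions).

```python
JUDGE_PROMPT = (
    "Question: {q}\nReference answer: {ref}\nModel answer: {ans}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)

def accuracy(eval_set: list[dict], answer_fn) -> float:
    """Fraction of answers an LLM judge marks correct against reference answers."""
    correct = 0
    for item in eval_set:
        ans = answer_fn(item["question"])
        verdict = client.chat.completions.create(
            model="meta/llama3-70b-instruct",  # judge model (assumption)
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                q=item["question"], ref=item["reference"], ans=ans)}],
            temperature=0.0,
        ).choices[0].message.content.strip().upper()
        correct += verdict == "CORRECT"
    return correct / len(eval_set)
```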

                     NIM OFF                                    NIM ON
                     Combo 1              Combo 2                     Combo 3                 Combo 4                 Combo 5
Framework            LangChain            LlamaIndex                  LangChain               LangChain               LangChain
Chunk size, overlap  512, 100             512, 100                    512, 100                512, 100                512, 100
Embedding model      all-mpnet-base-v2    all-MiniLM-L6-v2            NV-Embed-QA-Mistral-7B  NV-Embed-QA-Mistral-7B  NV-Embed-QA-Mistral-7B
Rerank model         None                 None                        None                    nv-rerank-qa_v1         nv-rerank-qa_v1
TRT-LLM              No                   No                          Yes                     Yes                     Yes
Triton               No                   No                          Yes                     Yes                     Yes
Vector DB            FAISS (CPU)          Milvus                      FAISS (GPU)             FAISS (GPU)             FAISS (GPU)
LLM                  Ollama (Mistral 7B)  Vertex AI (Cohere Command)  NIM LLM (Mistral 7B)    NIM LLM (Mistral 7B)    NIM LLM (Llama 3 70B)
Accuracy             70%                  85%                         89%                     91%                     92%

Table 2. Accuracy comparison for generative AI models

Figure 6. Accuracy comparison for five different LLMs

Conclusion

By using NVIDIA NIM and NVIDIA NeMo Retriever microservices to deploy its smart NOC, Infosys lowered LLM latency by 61% and improved accuracy by an absolute 22 percentage points. NeMo Retriever embedding and reranking microservices, deployed with NIM, enabled these gains through optimized model inference.

Integrating NeMo Retriever microservices for embedding and reranking significantly improved RAG relevance, accuracy, and performance: reranking sharpens contextual understanding, while optimized embeddings ensure accurate retrieval. The result is a better user experience and greater operational efficiency in network operations centers.

Learn how Infosys eliminates network downtime through automated workflows, powered by NVIDIA.

Get started deploying generative AI applications with NVIDIA NIM and NeMo Retriever NIM microservices. Explore more AI solutions for telecom operations.
