IT Brief Asia - Technology news for CIOs & IT decision-makers

NVIDIA introduces compact Mistral-NeMo-Minitron-8B-Base

Thu, 22nd Aug 2024

NVIDIA has introduced a new small language model named Mistral-NeMo-Minitron-8B-Base. Designed to offer state-of-the-art accuracy in a compact form factor, this model is a smaller version of the previously released open Mistral NeMo 12B model. It is compact enough to run on NVIDIA RTX-powered workstations, making it suitable for applications such as AI-powered chatbots, virtual assistants, content generators, and educational tools.

The newly released model is available as an NVIDIA NIM microservice and can be downloaded from the Hugging Face platform. It was developed through pruning and distillation techniques, enabling it to maintain high accuracy with reduced computational demands. NVIDIA also gives developers the option to further customise the model using its AI Foundry and NeMo software.

“We combined two different AI optimisation methods—pruning to shrink Mistral NeMo's 12 billion parameters into 8 billion, and distillation to improve accuracy,” commented Bryan Catanzaro, vice president of applied deep learning research at NVIDIA. “By doing so, Mistral-NeMo-Minitron 8B delivers comparable accuracy to the original model at lower computational cost.”

Small language models like Mistral-NeMo-Minitron-8B-Base can operate in real time on workstations and laptops. This is particularly beneficial for organisations with limited resources, as it allows them to deploy generative AI functionality within their own infrastructure while keeping costs and energy use in check and maintaining operational efficiency. Additionally, running these language models locally on edge devices enhances security, because data does not need to be transmitted to a server.

The Mistral-NeMo-Minitron-8B-Base model is optimised for low latency, ensuring faster responses for users and higher computational efficiency in production. For developers seeking an even smaller model, the 8-billion-parameter version can be further pruned and distilled using NVIDIA AI Foundry to create a neural network tailored for specific enterprise applications. NVIDIA's AI Foundry platform and services offer developers a full-stack solution for creating customised foundation models, packaged as NIM microservices. It includes access to foundation models, the NVIDIA NeMo platform, and dedicated capacity on NVIDIA DGX Cloud.

Because the Mistral-NeMo-Minitron-8B starts from a baseline of state-of-the-art accuracy, versions downsized with AI Foundry can retain a high level of accuracy. This method requires only a fraction of the training data and computational infrastructure, providing significant cost savings. Pruning shrinks the neural network by removing the model weights that contribute least to accuracy, while distillation retrains the pruned model on a smaller dataset to recover the accuracy lost during pruning.
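To make the two techniques concrete, here is a minimal, illustrative sketch in NumPy: a simplified magnitude-based pruning step (NVIDIA's actual approach prunes structured elements such as widths and layers, not individual weights) and the standard knowledge-distillation objective, a KL divergence between temperature-softened teacher and student output distributions. Function names and the exact procedure are illustrative assumptions, not NVIDIA's implementation.

```python
import numpy as np

def magnitude_prune(weights, keep_ratio):
    """Zero out the smallest-magnitude weights, keeping a fraction keep_ratio.

    Simplified unstructured pruning for illustration only; NVIDIA's method
    prunes structured components (e.g. hidden widths) rather than single weights.
    """
    flat = np.abs(weights).ravel()
    k = max(1, int(len(flat) * keep_ratio))
    threshold = np.sort(flat)[-k]  # magnitude of the k-th largest weight
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions --
    the classic knowledge-distillation training signal."""
    p = softmax(teacher_logits, temperature)  # teacher (target) distribution
    q = softmax(student_logits, temperature)  # student distribution
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

In a distillation loop, the pruned (student) model would be retrained to minimise this loss against the original 12B (teacher) model's outputs, which is why far less data is needed than for training from scratch.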

The new model joins NVIDIA’s portfolio of small language models. This includes Nemotron-Mini-4B-Instruct, another model optimised for low memory usage and faster response times on NVIDIA GeForce RTX AI PCs and laptops. Nemotron-Mini-4B-Instruct is available as an NVIDIA NIM microservice for cloud and on-device deployment and is part of NVIDIA ACE, a suite of digital human technologies delivering speech, intelligence, and animation via generative AI.

Mistral-NeMo-Minitron-8B-Base, packaged as an NVIDIA NIM microservice, promises to lead in various benchmarks for language models, spanning tasks such as language understanding, common sense reasoning, mathematical reasoning, summarisation, coding, and generating truthful answers. The pruning and distillation techniques used by NVIDIA make this model a cost-effective and efficient solution for deploying high-accuracy AI on more constrained computational resources.
