Introduction to DBRX

Databricks, a leading data and AI company, has just unveiled DBRX, an open, general-purpose large language model (LLM) that sets a new state of the art among open LLMs. With 132B total parameters (36B active), DBRX was pre-trained on an enormous 12T token dataset spanning both text and code.


What sets DBRX apart is its fine-grained mixture-of-experts (MoE) architecture. By employing 16 experts and activating 4 of them per token, it offers 65x more possible expert combinations than peers like Mixtral and Grok-1, which use 8 experts with 2 active. This granular routing, combined with rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA), underpins DBRX's quality and efficiency.
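
To make the routing concrete, the sketch below shows a top-4-of-16 MoE layer in the spirit of DBRX's design: a learned router scores all 16 experts for each token and only the 4 highest-scoring experts run. The layer sizes are placeholders and the experts are plain MLPs rather than DBRX's gated-linear-unit experts, so treat it as an illustration of the routing idea, not the actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative fine-grained MoE layer: 16 experts, top 4 active per token."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=16, k=4):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        # Placeholder experts: simple MLPs (DBRX's experts use gated linear units).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.router(x)                    # (tokens, n_experts)
        weights, chosen = scores.topk(self.k, -1)  # top-4 experts per token
        weights = F.softmax(weights, dim=-1)       # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in chosen[:, slot].unique().tolist():
                mask = chosen[:, slot] == e        # tokens routed to expert e in this slot
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

layer = TopKMoELayer()
print(layer(torch.randn(8, 512)).shape)            # torch.Size([8, 512])
```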

DBRX's State-of-the-Art Performance

Across a wide range of benchmarks, DBRX Instruct (the instruction-tuned variant) outperforms leading open models like Mixtral, LLaMA2-70B, and Grok-1. On composite benchmarks like the HuggingFace Open LLM Leaderboard and Databricks Model Gauntlet, it achieves top scores of 74.5% and 66.8% respectively.

DBRX particularly shines in programming and math, surpassing even specialized models like CodeLLaMA-70B on HumanEval (70.1%) and besting peers substantially on GSM8K (66.9%). It also leads in language understanding as measured by MMLU (73.7%).

Impressively, DBRX Instruct matches or exceeds GPT-3.5 on nearly all benchmarks and proves competitive with powerful closed models like Gemini 1.0 Pro and Mistral Medium. It even handles long-context and retrieval-augmented generation tasks with aplomb.

Model Architecture and Training Data

  • 132B total parameters, 36B active parameters
  • Fine-grained mixture-of-experts (MoE) architecture with 16 experts, 4 selected per input
  • 65x more expert combinations than peer MoE models like Mixtral and Grok-1
  • Pre-trained on 12T tokens of text and code data, estimated to be 2x more effective than data used for previous Databricks models

The scale of DBRX is substantial, at 132 billion parameters in total. The MoE approach, however, keeps computation efficient by activating only 36B of those parameters (4 of the 16 experts) for each token. The fine-grained 16-expert setup yields 65 times as many possible expert combinations as the 8-expert, top-2 designs of other MoE models, as the quick calculation below shows, likely contributing to DBRX's strong performance.
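
The 65x figure is simple combinatorics: with 16 experts and 4 active per token there are C(16, 4) = 1,820 possible expert subsets, versus C(8, 2) = 28 for an 8-expert, top-2 design such as Mixtral or Grok-1.

```python
from math import comb

# Possible expert subsets per token for each routing scheme.
dbrx = comb(16, 4)     # 1820 ways to pick 4 of 16 experts
peers = comb(8, 2)     # 28 ways to pick 2 of 8 experts (Mixtral, Grok-1)

print(dbrx, peers, dbrx / peers)   # 1820 28 65.0
```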

Just as important as the model architecture is the quality of training data. Databricks curated a vast 12T token dataset for DBRX that they believe is twice as effective as the data used for their previous MPT models. This underscores the critical role of data in building high-performing LLMs.

Benchmark Performance

  • 74.5% on the HuggingFace Open LLM Leaderboard and 66.8% on the Databricks Model Gauntlet (top scores among leading open models)
  • 70.1% on HumanEval, 66.9% on GSM8K (programming and math)
  • 73.7% on MMLU (language understanding)
  • Matches or exceeds GPT-3.5 on most benchmarks, competitive with Gemini 1.0 Pro and Mistral Medium

DBRX Instruct consistently outperforms top open models across a diverse set of benchmarks, showcasing its versatility. The model's particular strength in programming, math, and language understanding suggests potential for high-impact applications in fields like software development, research, and knowledge work.

Perhaps most exciting is DBRX's ability to go toe-to-toe with leading closed models. Matching GPT-3.5 and nearing the performance of Gemini 1.0 Pro and Mistral Medium, DBRX narrows the gap between open and closed LLMs and could accelerate open innovation.

Training and Inference Efficiency

  • 2x FLOP-efficiency in training vs. dense models
  • 4x reduction in pretraining compute requirements due to pipeline advancements
  • 2x faster inference than LLaMA2-70B, up to 150 tokens/sec/user on Databricks' serving platform

DBRX's MoE architecture and Databricks' optimized pipeline deliver substantial efficiency gains. Halving the pretraining FLOP requirements relative to dense models is a significant achievement that could make large-scale LLM development more accessible.

Similarly, the 2x inference speedup over LLaMA2-70B demonstrates how MoE can provide a better trade-off between performance and efficiency. The ability to serve DBRX at up to 150 tokens/sec/user with quantization showcases its readiness for real-world, interactive applications.
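
As a rough sanity check on that speedup, a transformer forward pass costs on the order of 2 FLOPs per active parameter per token, so comparing active parameter counts gives a ballpark figure. This is a back-of-envelope sketch, not Databricks' measurement; real throughput also depends on memory bandwidth, batching, and quantization.

```python
# Back-of-envelope per-token inference cost using the common
# ~2 FLOPs-per-active-parameter approximation for a transformer forward pass.
DBRX_ACTIVE_PARAMS = 36e9     # 36B of DBRX's 132B parameters are active per token
LLAMA2_70B_PARAMS = 70e9      # dense model: every parameter is active

ratio = (2 * LLAMA2_70B_PARAMS) / (2 * DBRX_ACTIVE_PARAMS)
print(f"Approximate per-token FLOP ratio: {ratio:.1f}x")   # ~1.9x
```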

DBRX outperforms established open-source models on language understanding (MMLU), programming (HumanEval), and math (GSM8K). (Courtesy Databricks)

Databricks attributes these gains to both the architecture and the training pipeline: the MoE design itself provides the 2x FLOP-efficiency over dense models, while broader pipeline advancements, including improved data and tokenization, account for the nearly 4x reduction in overall pretraining compute.

For inference, DBRX leverages NVIDIA TensorRT-LLM to optimize serving. Although the full model holds 132B parameters, only 36B are active per token, roughly half of LLaMA2-70B's dense parameter count, which translates into up to 2x faster throughput. The figure of up to 150 tokens/sec/user on Databricks' model serving platform is achieved with 8-bit quantization.

How Databricks Built DBRX

Databricks harnessed its suite of powerful tools to develop DBRX over a focused 3-month period, building upon years of prior LLM research. Apache Spark and Databricks notebooks enabled data processing, while Unity Catalog provided governance. Mosaic AI Training orchestrated distributed training across 3072 NVIDIA H100 GPUs.

Central to DBRX's success was the carefully curated 12T token dataset described earlier. Lilac AI assisted in data exploration, MLflow tracked experiments, and Inference Tables collected human feedback for refinement.

Getting Started with DBRX on Databricks

Databricks has made it easy for users to start working with DBRX right away through their Mosaic AI Foundation Model APIs. The AI Playground offers a pay-as-you-go chat interface for quick experimentation.
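
Because the Foundation Model APIs expose an OpenAI-compatible interface, querying DBRX Instruct can look roughly like the sketch below. The endpoint name databricks-dbrx-instruct and the serving URL are assumptions for illustration; check your workspace's documentation for the exact values.

```python
# Sketch of querying DBRX Instruct via Databricks' OpenAI-compatible serving
# endpoints. Endpoint name and URL are assumptions; consult your workspace docs.
from openai import OpenAI

client = OpenAI(
    api_key="<DATABRICKS_PERSONAL_ACCESS_TOKEN>",
    base_url="https://<your-workspace-host>/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-dbrx-instruct",   # assumed pay-per-token endpoint name
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```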

For production use cases, Databricks provides performance guarantees, finetuning support, and enhanced security through provisioned throughput. DBRX is also available for private hosting by downloading from the Databricks Marketplace and deploying via Model Serving.
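
The open weights are also published on Hugging Face as databricks/dbrx-instruct, so self-hosting outside Databricks is possible too. Below is a minimal sketch using the transformers library, assuming you have accepted the model license on the Hub and have enough GPU memory for the 132B-parameter checkpoint (several hundred GB in bfloat16).

```python
# Minimal sketch: running DBRX Instruct locally with Hugging Face transformers.
# Assumes access to the gated repo and enough GPU memory to shard the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "databricks/dbrx-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",            # shard across all available GPUs
    trust_remote_code=True,       # may be required depending on transformers version
)

messages = [{"role": "user", "content": "Write a one-line docstring for binary search."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```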

Future Outlook

DBRX represents a major milestone in open-source LLMs, combining state-of-the-art performance with remarkable efficiency. However, Databricks sees this as just the beginning. They aim to empower enterprises to control their own data and destiny in the emerging GenAI landscape.

Databricks' success in using their own tools to develop DBRX in just three months highlights the maturing LLM development ecosystem. As these tools continue to advance, we may see more organizations creating custom LLMs tailored to their specific needs and domains.

By open-sourcing DBRX and the tools used to build it, Databricks hopes to drive further innovation in the community. They have already trained thousands of custom LLMs with customers and anticipate an exciting journey ahead as DBRX is leveraged for a wide range of ambitious applications.

With DBRX, Databricks stands on the shoulders of giants in the open-source and academic communities. As they continue to refine and expand DBRX's capabilities, they invite enterprises and researchers to build upon this foundation. The future of large language models looks brighter than ever.
