Hymba Hybrid-Head Architecture Boosts Small Language Model Performance

Transformers, with their attention-based architecture, have become the dominant choice for language models (LMs) due to their strong performance, parallelization capabilities, and long-term recall through key-value (KV) caches. However, their quadratic computational cost and high memory demands pose efficiency challenges. In contrast, state space models (SSMs) like Mamba and Mamba-2 offer constant complexity and efficient hardware optimization but struggle with memory recall tasks, affecting their performance on general benchmarks.

NVIDIA researchers recently proposed Hymba, a family of small language models (SLMs) featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with SSMs to achieve both enhanced efficiency and improved performance. In Hymba, attention heads provide high-resolution recall, while SSM heads enable efficient context summarization.

The novel architecture of Hymba reveals several insights:

  1. Overhead in attention: Over 50% of attention computation can be replaced by cheaper SSM computation.
  2. Local attention dominance: Most global attention can be replaced by local attention without sacrificing performance on general and recall-intensive tasks, thanks to the global information summarized by SSM heads.
  3. KV cache redundancy: Key-value cache is highly correlated across heads and layers, so it can be shared across heads (group query attention) and layers (cross-layer KV cache sharing).
  4. Softmax attention limitation: Attention mechanisms are constrained to sum to one, limiting sparsity, and flexibility. We introduce learnable meta-tokens that are prepended to prompts, storing critical information and alleviating the “forced-to-attend” burden associated with attention mechanisms.

This post shows that Hymba 1.5B performs favorably against state-of-the-art open-source models of similar size, including Llama 3.2 1B, OpenELM 1B, Phi 1.5, SmolLM2 1.7B, Danube2 1.8B, and Qwen2.5 1.5B. Compared to Transformer models of similar size, Hymba also achieves higher throughput and requires 10x less memory to store cache.

Hymba 1.5B is released to the Hugging Face collection and GitHub.

Hymba 1.5B performance

Figure 1 compares Hymba 1.5B against sub-2B models (Llama 3.2 1B, OpenELM 1B, Phi 1.5, SmolLM2 1.7B, Danube2 1.8B, Qwen2.5 1.5B) in terms of average task accuracy, cache size (MB) relative to sequence length, and throughput (tok/sec).

A figure showing three performance metrics comparing seven different AI language models in terms of average accuracy, cache size (MB) relative to sequence length, and throughput (tok/sec).
Figure 1. Performance comparison of Hymba 1.5B Base against sub-2B models 

In this set of experiments, the tasks include MMLU, ARC-C, ARC-E, PIQA, Hellaswag, Winogrande, and SQuAD-C. The throughput is measured on an NVIDIA A100 GPU with a sequence length of 8K and a batch size of 128 using PyTorch. For models encountering out of memory (OOM) issues during throughput measurement, the batch size was halved until the OOM is resolved to measure the maximal achievable throughput without OOM.

Hymba model design 

SSMs such as Mamba were introduced to address the quadratic complexity and large inference-time KV cache issues of transformers. However, due to their low-resolution memory, SSMs struggle with memory recall and performance. To overcome these limitations, we propose a road map for developing efficient and high-performing small LMs in Table 1.

Configuration Commonsense reasoning (%) ↑ Recall (%) ↑ Throughput (token/sec) ↑ Cache size (MB) ↓ Design reason
Ablations on 300M model size and 100B training tokens
Transformer (Llama) 44.08 39.98 721.1 414.7 Accurate recall while inefficient 
State-space models (Mamba) 42.98 19.23 4720.8 1.9 Efficient while inaccurate recall
A. + Attention heads (sequential) 44.07 45.16 776.3 156.3 Enhance recall capabilities
B. + Multi-head heads (parallel) 45.19 49.90 876.7 148.2 Better balance of two modules
C. + Local / global attention 44.56 48.79 2399.7 41.2 Boost compute/cache efficiency
D. + KV cache sharing 45.16 48.04 2756.5 39.4 Cache efficiency
E. + Meta-tokens  45.59 51.79 2695.8 40.0 Learned memory initialization
Scaling to 1.5B model size and 1.5T training tokens
F. + Size / data 60.56 64.15 664.1 78.6 Further boost task performance
G. + Extended context length (2K→8K) 60.64 68.79 664.1 78.6 Improve multishot and recall tasks
Table 1. Design road map of the Hymba model

Fused hybrid modules 

Fusing attention and SSM heads in parallel within a hybrid-head module outperforms sequential stacking, according to the ablation study. Hymba fuses attention and SSM heads in parallel within a hybrid head module, enabling both heads to process the same information simultaneously. This architecture improves reasoning and recall accuracy.

A diagram showing the architecture of a dual-path attention mechanism. The flow starts with an Input Projection, leading to Latent Feature extraction which splits into two parallel paths. The upper path (in blue) contains SSM Feature processing through SSM Heads and Gate Normalization. The lower path (in red) processes Attention Features through Attention Heads and Gate Normalization. Both paths converge at a Mean operation before final Output Projection. Arrows indicate the flow of data through the system.
Figure 2. The hybrid-head module in Hymba

Efficiency and KV cache optimization

While attention heads improve task performance, they increase KV cache requirements and reduce throughput. To mitigate this, Hymba optimizes the hybrid-head module by combining local and global attention and employing cross-layer KV cache sharing. This improves throughput by 3x and reduces cache by almost 4x without sacrificing performance. 

A diagram showing the architecture of a neural network model with Hymba Blocks. The model flows from left to right, starting with an Embedding layer, followed by alternating Hymba Blocks with Full Attention (in red) and SWA (in blue). The blocks are connected with KV sharing every 2 layers, shown in dotted green boxes labeled 'Repeat (N-3)/2'. Below the main flow, there's a detailed view of a module containing Layer norm, Hybrid-head module, another Layer norm, and FFN components. The diagram ends with an LM Head layer on the right.
Figure 3. Hymba model architecture

Meta-tokens

A set of 128 pretrained embeddings prepended to inputs, functioning as learned cache initialization to enhance focus on relevant information. These tokens serve a dual purpose: 

  • Mitigating attention drain by acting as backstop tokens, redistributing attention effectively
  • Encapsulating compressed world knowledge
A diagram illustrating the Fading Memory architecture from SSM (State Space Model). The image shows three layers: At the top is a blue rectangular box labeled 'Fading Memory (From SSM)'. Below it are seven gray input tokens arranged horizontally. At the bottom are two sets of memory blocks: on the left are two green blocks labeled 'Meta Memory (Meta Tokens)', and on the right are three red blocks labeled 'Snapshot Memory (From Attn)'. Green arrows connect the Meta Memory to the input tokens, while red arrows connect the Snapshot Memory to the rightmost input tokens. A blue arrow loops back from the Fading Memory box to itself.
Figure 4. Interpretation of Hymba from the memory aspect

Model analysis

This section presents an apples-to-apples comparison across different architectures under the same training settings. We then visualize the attention maps of SSM and Attention in different pretrained models. Finally, we perform head importance analysis for Hymba through pruning. All the analyses in this section help to illustrate how and why the design choices for Hymba are effective. 

Apples-to-apples comparison 

We performed an apples-to-apples comparison of Hymba, pure Mamba2, Mamba2 with FFN, Llama3 style, and Samba style (Mamba-FFN-Attn-FFN) architectures. All models have 1 billion parameters and are trained from scratch for 100 billion tokens from SmolLM-Corpus with exactly the same training recipe. All results are obtained through lm-evaluation-harness using a zero-shot setting on Hugging Face models. Hymba performs the best on commonsense reasoning as well as question answering and recall-intensive tasks. 

Table 2 compares various model architectures on language modeling and recall-intensive and commonsense reasoning tasks, with Hymba achieving strong performance across metrics. Hymba demonstrates the lowest perplexity in language tasks (18.62 for Wiki and 10.38 for LMB) and solid results in recall-intensive tasks, particularly in SWDE (54.29) and SQuAD-C (44.71), leading to the highest average score in this category (49.50). 

Model Language (PPL) ↓ Recall intensive (%) ↑ Commonsense reasoning (%) ↑
Mamba2 15.88 43.34 52.52
Mamba2 w/ FFN 17.43 28.92 51.14
Llama3 16.19 47.33 52.82
Samba 16.28 36.17 52.83
Hymba 14.5 49.5 54.57
Table 2. Comparison of architectures trained on 100 billion tokens under the same settings

In commonsense reasoning and question answering, Hymba outperforms other models in most tasks, such as SIQA (31.76) and TruthfulQA (31.64), with an average score of 54.57, slightly above Llama3 and Mamba2. Overall, Hymba stands out as a balanced model, excelling in both efficiency and task performance across diverse categories.

Attention map visualization

We further categorized elements in the attention map into four types: 

  1. Meta: Attention scores from all real tokens to meta-tokens. This category reflects the model’s preference for attending to meta-tokens. In attention maps, they are usually located in the first few columns (for example, 128 for Hymba) if a model has meta-tokens. 
  2. BOS: Attention scores from all real tokens to the beginning-of-sequence token. In the attention map, they are usually located in the first column right after the meta-tokens. 
  3. Self: Attention scores from all real tokens to themselves. In the attention map, they are usually located in the diagonal line. 
  4. Cross: Attention scores from all real tokens to other real tokens. In the attention map, they are usually located in the off-diagonal area. 

The attention pattern of Hymba is significantly different from that of vanilla Transformers. In vanilla Transformers, attention scores are more concentrated on BOS, which is consistent with the findings in Attention Sink. In addition, vanilla Transformers also have a higher proportion of Self attention scores. In Hymba, meta-tokens, attention heads, and SSM heads work complementary to each other, leading to a more balanced distribution of attention scores across different types of tokens. 

Specifically, meta-tokens offload the attention scores from BOS, enabling the model to focus more on the real tokens. SSM heads summarize the global context, which focuses more on current tokens (Self attention scores). Attention heads, on the other hand, pay less attention to Self and BOS tokens, and more attention to other tokens (that is, Cross attention scores). This suggests that the hybrid-head design of Hymba can effectively balance the attention distribution across different types of tokens, potentially leading to better performance.

A diagram showing the composition of the Hymba attention mechanism. It consists of three components that are added together: Meta Tokens (shown as a vertical green stripe on the left), Sliding Window Attention (displayed as a diagonal green band), and SSM (Mamba) (represented as a triangular green gradient). These three patterns combine to form the final Hymba pattern on the right, which shows a triangular area filled with green squares of varying intensity. Each component is displayed in a square grid format, and the combination is shown using plus signs between the components and an equals sign before the final result.
Figure 5. Schematics of the attention map of Hymba as a combination of meta-tokens, sliding window attention, and Mamba contributions
A comparative visualization showing attention patterns across different language models. The image consists of three main parts: 1) Three attention heatmaps for Llama 3.2 3B and Hymba 1.5B models, showing diagonal patterns in purple, yellow, and blue colors. 2) A grid diagram showing BOS (Beginning of Sequence) token connections with Meta and Cross sections marked. 3) Three horizontal stacked bar charts comparing percentage distributions of Meta, BOS, Cross, and Self attention patterns across Llama 3.2 3B and two variants of Hymba models, with percentages clearly labeled in different colors.
Figure 6. Sum of the attention score from different categories in Llama 3.2 3B and Hymba 1.5B

Heads importance analysis 

We analyzed the relative importance of attention and SSM heads in each layer by removing them and recording the final accuracy. Our analysis reveals the following: 

  • The relative importance of attention/SSM heads in the same layer is input-adaptive and varies across tasks, suggesting that they can serve different roles when handling various inputs.
  • The SSM head in the first layer is critical for language modeling, and removing it causes a substantial accuracy drop to random guess levels.
  • Generally, removing one attention/SSM head results in an average accuracy drop of 0.24%/1.1% on Hellaswag, respectively.
A line graph comparing the Hellswag Accuracy (y-axis ranging from 0.45 to 0.50) across 32 different layers (x-axis). The graph shows three elements: a horizontal dashed line labeled Orig Model at approximately 0.493, and two sets of bars in blue and orange representing Remove Attn and Remove SSM, respectively. The bars fluctuate slightly above and below the original model line, with most values falling between 0.47 and 0.495. The graph compares the impact of removing attention mechanisms versus SSM components at different layers of the model.
Figure 7. The achieved accuracy, measured using 1K samples from Hellaswag, after removing the Attention or SSM heads in each layer

Model architecture and training best practices

This section outlines key architectural decisions and training methodologies for Hymba 1.5B Base and Hymba 1.5B Instruct.

Model architecture

  • Hybrid architecture: Mamba is great at summarization and usually closer focuses on the current token, while attention is more precise and acts as snapshot memory. Combining them in parallel merges these benefits, but standard sequential fusion does not. We chose a 5:1 parameter ratio between SSM and attention heads.
  • Sliding window attention: Full attention heads are preserved in three layers (first, last, and middle), with sliding window attention heads used in the remaining 90% layers.
  • Cross-layer KV cache sharing: Implemented between every two consecutive attention layers. It is done in addition to GQA KV cache sharing between heads.
  • Meta-tokens: These 128 tokens are learnable with no supervision, helping to avoid entropy collapse problems in large language models (LLMs) and mitigate the attention sink phenomenon. Additionally, the model stores general knowledge in these tokens. 

Training best practices 

  • Pretraining: We opted for two-stage base model training. Stage 1 maintained a constant large learning rate and used less filtered large corpus data. Continuous learning rate decay was then performed to 1e-5 using high-quality data. This approach enables continuous training and resuming of Stage 1.
  • Instruction fine-tuning: Instruct model tuning is performed in three stages. First, SFT-1 provides the model with strong reasoning abilities by training on code, math, function calling, role play, and other task-specific data. Second, SFT-2 teaches the model to follow human instructions. Finally, DPO is leveraged to align the model with human preferences and improve the model’s safety.
Training pipeline for the Hymba model family divided into five sections that read (left to right) General pretraining, LR annealing, SFT-1, SFT-2, and DPO.
Figure 8. Training pipeline adapted for the Hymba model family

Performance and efficiency evaluation 

With only 1.5T pretraining tokens, the Hymba 1.5B model performs the best among all small LMs and achieves better throughput and cache efficiency than all transformer-based LMs. 

For example, when benchmarking against the strongest baseline, Qwen2.5, which is pretrained on 13x more tokens, Hymba 1.5B achieves a 1.55% average accuracy improvement, 1.41x throughput, and 2.90x cache efficiency. Compared to the strongest small LM trained on fewer than 2T tokens, namely h2o-danube2, our method achieves a 5.41% average accuracy improvement, 2.45x throughput, and 6.23x cache efficiency.

Model # Params Train tokens Token/s Cache (MB) MMLU 5-shot ARC-E 0-shot ARC-C 0-shot PIQA 0-shot Wino. 0-shot Hella. 0-shot SQuAD-C 1-shot Avg.
Open
ELM-1
1.1B 1.5T 246 346 27.06 62.37 19.54 74.76 61.8 48.37 45.38 48.57
Rene
v0.1
1.3B 1.5T 800 113 32.94 67.05 31.06 76.49 62.75 51.16 48.36 52.83
Phi
1.5
1.3B 0.15T 241 1573 42.56 76.18 44.71 76.56 72.85 48 30.09 55.85
Smol
LM
1.7B 1T 238 1573 27.06 76.47 43.43 75.79 60.93 49.58 45.81 54.15
Cosmo 1.8B .2T 244 1573 26.1 62.42 32.94 71.76 55.8 42.9 38.51 47.2
h20
danube2
1.8B 2T 271 492 40.05 70.66 33.19 76.01 66.93 53.7 49.03 55.65
Llama3.2 1B 1.2B 9T 535 262 32.12 65.53 31.39 74.43 60.69 47.72 40.18 50.29
Qwen
2.5
1.5B 18T 469 229 60.92 75.51 41.21 75.79 63.38 50.2 49.53 59.51
AMD
OLMo
1.2B 1.3T 387 1049 26.93 65.91 31.57 74.92 61.64 47.3 33.71 48.85
Smol
LM2
1.7B 11T 238 1573 50.29 77.78 44.71 77.09 66.38 53.55 50.5 60.04
Llama
3.2 3B
3.0B 9T 191 918 56.03 74.54 42.32 76.66 69.85 55.29 43.46 59.74
Hymba 1.5B 1.5T 664 79 51.19 76.94 45.9 77.31 66.61 53.55 55.93 61.06
Table 2. Hymba 1.5B Base model results

Instructed models 

The Hymba 1.5B Instruct model achieves the highest performance on an average of all tasks, outperforming the previous state-of-the-art model, Qwen 2.5 Instruct, by around 2%. Specifically, Hymba 1.5B surpasses all other models in GSM8K/GPQA/BFCLv2 with a score of 58.76/31.03/46.40, respectively. These results indicate the superiority of Hymba 1.5B, particularly in areas requiring complex reasoning capabilities.

Model # Params MMLU ↑ IFEval ↑ GSM8K ↑ GPQA ↑ BFCLv2 ↑ Avg. ↑
SmolLM 1.7B 27.80 25.16 1.36 25.67 -* 20.00
OpenELM 1.1B 25.65 6.25 56.03 21.62 -* 27.39
Llama 3.2 1.2B 44.41 58.92 42.99 24.11 20.27 38.14
Qwen2.5 1.5B 59.73 46.78 56.03 30.13 43.85 47.30
SmolLM2 1.7B 49.11 55.06 47.68 29.24 22.83 40.78
Hymba 1.5B 1.5B 52.79 57.14 58.76 31.03 46.40 49.22
Table 3. Hymba 1.5B Instruct model results

Conclusion

The new Hymba family of small LMs features a hybrid-head architecture that combines the high-resolution recall capabilities of attention heads with the efficient context summarization of SSM heads. To further optimize the performance of Hymba, learnable meta-tokens are introduced to act as a learned cache for both attention and SSM heads, enhancing the model’s focus on salient information. Through the road map of Hymba, comprehensive evaluations, and ablation studies, Hymba sets new state-of-the-art performance across a wide range of tasks, achieving superior results in both accuracy and efficiency. Additionally, this work provides valuable insights into the advantages of hybrid-head architectures, offering a promising direction for future research in efficient LMs.

Learn more about Hybma 1.5B Base and Hymba 1.5B Instruct.

Acknowledgments

This work would not have been possible without contributions from many people at NVIDIA, including Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Nikolaus Binder, Hanah Zhang, Maksim Khadkevich, Yingyan Celine Lin, Jan Kautz, Pavlo Molchanov, and Nathan Horrocks.

Latest articles

Related articles