Boost Large-Scale Recommendation System Training Embedding Using EMBark

Recommendation systems are core to the Internet industry, and efficiently training them is a key issue for various companies. Most recommendation systems are deep learning recommendation models (DLRMs), containing billions or even tens of billions of ID features. Figure 1 shows a typical structure.

A diagram of a typical DLRM model which consists of bottom dense network, embedding, interaction layer, and top dense network. Dense features and categorical features are fed into DLRM and generate click prediction result.
Figure 1. A typical DLRM model structure diagram

In recent years, GPU solutions such as NVIDIA Merlin HugeCTR and TorchRec have significantly accelerated the training of DLRMs by storing large-scale ID feature embeddings on GPUs and processing them in parallel. Using GPU memory bandwidth results in significant improvements compared to CPU solutions.

At the same time, as the number of GPUs used in training clusters increases (from eight GPUs to 128 GPUs), we have found that the communication overhead of the embedding accounts for a larger proportion of the total training overhead. In some large-scale training scenarios (such as on 16 nodes), it even exceeds half (51%). 

This is mainly due to two reasons:

1. As the number of GPUs in the cluster increases, the number of embedding tables on each node gradually decreases, leading to unbalanced loads between different nodes and reducing training efficiency.

2. Compared to intranode bandwidth, internode bandwidth is much smaller, resulting in longer communication times for embedding model parallelism.

The figure shows the proportion of training time for different parts of DLRM under 8 GPUs, 64 GPUs and 128 GPUs. As the cluster size grows, the communication time grows from 25% (on 8 GPUs) to 51% (on 128 GPUs), which becomes the bottleneck for the training performance.
Figure 2. The proportion of training time for different parts of DLRM under various cluster configurations

To help industry users better understand and solve these problems, the NVIDIA HugeCTR team presented EMBark at RecSys 2024. EMBark supports 3D flexible sharding strategies and combines communication compression strategies to finetune the load imbalance problem in large-scale cluster deep recommendation model training and reduce the communication time required for embeddings. The related code and paper are open source.

EMBark introduction

EMBark improves the performance of embeddings in DLRM training under different cluster configurations and accelerates training throughput. It’s implemented using the NVIDIA Merlin HugeCTR open-source recommendation system framework, but the techniques can also be applied to other machine learning frameworks.

EMBark has three key components: embedding clusters, flexible 3D sharding schemes, and a sharding planner. Figure 3 shows the overall architecture of EMBark.

A diagram of the EMBark architecture. In EMBark, each table can be sharded across arbitrary numbers of GPUs using the 3D sharding scheme. The sharding planner can automatically generate the best sharding strategy on given hardware specs and embedding table configurations during the compile stage. During the training stage, highly optimized data distributor, pipeline scheduler and embedding operators facilitate effective training.
Figure 3. EMBark architecture diagram

Embedding clusters

Embedding clusters efficiently train embeddings by grouping them with similar features and applying customized compression strategies to each cluster. These clusters include a data distributor, embedding storage, and embedding operators, working together to convert feature IDs into embedding vectors.

There are three types of embedding clusters: data parallel (DP), reduction-based (RB), and unique-based (UB). Each uses different communication methods during training and is suitable for different embeddings.

  • DP clusters: Don’t compress communication, making them simple and efficient, but they replicate the embedding table on each GPU, so they are only suitable for small tables.
  • RB clusters: Use reduction operations and are significantly effective for compressing multi-hot input tables with pooling operations.
  • UB clusters: Only send unique vectors, which is beneficial for handling embedding tables with obvious access hotspots.

Flexible 3D sharding scheme

The flexible 3D sharding scheme can solve the workload imbalance problem in RB clusters. Unlike fixed sharding strategies such as row-wise, table-wise, or column-wise, EMBark uses a 3D tuple (i, j, k) to represent each shard, where i represents the table index, j represents the row shard index, and k represents the column shard index. This approach enables each embedding to be sharded across an arbitrary number of GPUs, providing flexibility and precise control over workload balance.

Sharding planner

To find the optimal sharding strategy, EMBark provides a sharding planner—a cost-driven greedy search algorithm that identifies the best sharding strategy based on hardware specifications and embedding configurations.

Evaluation

Leveraging embedding clusters, the flexible 3D sharding scheme, and an advanced sharding planner, EMBark significantly enhances training performance. To evaluate its effectiveness, experiments were conducted on a cluster of NVIDIA DGX H100 nodes, each equipped with eight NVIDIA H100 GPUs (total 640 gb HBM, bandwidth of 24TB/s per node)

Within each node, all GPUs are interconnected through NVLink (900 GB/s bidirectional) and the internode communication uses InfiniBand (8x400Gbps).

DLRM-DCNv2 T180 T200 T510 T7
# parameters 48B 70B 100B 110B 470B
FLOPS per sample 96M 308M 387M 1,015M 25M
# embedding tables 26 180 200 510 7
Table dimensions (d) 128 {32, 128} 128 128 {32, 128}
Average hotness (h) ~10 ~80 ~20 ~5 ~20
Achieved QPS ~75M ~11M ~15M ~8M ~74M
Table 1. DLRM evaluation specifications

To demonstrate that EMBark can efficiently train DLRM models of any scale, we tested using the MLPerf DLRM-DCNv2 model and generated several synthetic models with larger embedding tables and different properties (see Table 1). Our training dataset exhibits a power-law skew of a = 1.2.

The evaluation result on EMBark which demonstrates the effectiveness of overlapping operations, flexible sharding, and enhanced clustering techniques for the DLRM-DCNv2, T180, T200, and T510 models. The results indicate an average 1.5x acceleration in end-to-end training throughput, with peak improvements reaching up to 1.77x compared to the baseline.
Figure 4. EMBark evaluation results

The baseline uses a serial kernel execution order, a fixed table-row-wise sharding strategy, and all RB clusters. The experiments successively used three optimizations: overlap, more flexible sharding strategies, and better cluster configurations.  Across four representative DLRM variants (DLRM-DCNv2, T180, T200, and T510), EMBark achieved an average of 1.5x end-to-end training throughput acceleration, up to 1.77x faster than the baseline. 

You can refer to the paper for more detailed experimental results and related analysis.

Faster DLRM training with EMBark

EMBark tackles the challenge of lengthy embedding processes in large-scale recommendation system model training. By supporting 3D flexible sharding strategies and combining different communication compression strategies, it can finetune the load imbalance problem in deep recommendation model training on large-scale clusters and reduce the communication time required for embeddings. 

This improves the training efficiency of large-scale recommendation system models. Across four representative DLRM variants (DLRM-DCNv2, T180, T200, and T510), EMBark achieved an average of 1.5x end-to-end training throughput acceleration, up to 1.77x faster than the baseline. 

We are also actively exploring embedding offloading-related technologies and optimizing torch-related work and look forward to sharing our progress in the future. If you are interested in this area, please contact us for more information.  

Latest articles

Related articles