Open-source large language models (LLMs) excel in English but struggle with other languages, especially the languages of Southeast Asia. This is primarily due to a lack of training data in these languages, limited understanding of local cultures, and insufficient tokens to capture unique linguistic structures and expressions.
To fully meet customer needs, enterprises in non-English-speaking countries must go beyond generic models and customize them to capture the nuances of their local languages, ensuring a seamless and impactful customer experience.
In this blog post, we explore how Viettel Solutions, a fast-growing subsidiary of Viettel Corporation, used NVIDIA NeMo Curator to process high-quality Vietnamese data for training Llama 3 ViettelSolution 8B, a state-of-the-art LLM that now ranks near the top of the VMLU leaderboard. NeMo Curator is a GPU-accelerated data-curation tool for preparing large-scale, high-quality datasets for pretraining LLMs.
A crucial first step in this journey was curating large-scale, high-quality datasets. This post will guide you through the data curation pipeline used, including sample code for each stage and a detailed exploratory data analysis (EDA) to illustrate the impact of each step. By the end of the post, you’ll have a clear road map and reference to easily get started with NeMo Curator, whether for Vietnamese or other languages.
Viettel Solutions, a pioneer in providing digital transformation solutions for the Vietnamese government and enterprises, is focused on addressing the growing demand for AI adoption across various industries. With a vision to lead in generative AI and develop AI-powered products for customers, Viettel collaborated with the NVIDIA NeMo Curator team.
“NeMo Curator’s GPU-accelerated features, including exact and fuzzy deduplication as well as heuristic and classifier filtering, helped increase accuracy by 10%, accelerate training time by three times, and reduce the dataset size by 60%,” according to Tuan Nguyen, head of Data Analytics at Viettel Solutions.
Prerequisites and environment setup
To follow along with the steps presented in this post, make sure you have the following set up:
Installation
To begin, install NeMo Curator by following the instructions to install the CPU and CUDA-accelerated modules in the README file of the NeMo Curator repository.
Next, install the datasets and jsonlines packages, which will be needed later.
pip install datasets
pip install jsonlines
Data processing requires a Dask environment. Dask is a flexible, open-source library for parallel and distributed computing in Python that lets you scale computations across multiple cores or even clusters. By distributing tasks, Dask makes data handling significantly faster and more efficient.
We ran this experiment on an NVIDIA DGX A100 with a 128-core CPU and 2TB of RAM to handle the dataset size. Depending on your dataset and computing resources, you may need to adjust the Dask worker configuration accordingly. You can start a Dask cluster using the following commands:
import nemo_curator
from dask.distributed import Client, LocalCluster
# Start a Dask cluster with 12 workers, each limited to 64GB of memory.
# Adjust these numbers to match your computing resources.
cluster = LocalCluster(n_workers=12, processes=True, memory_limit="64GB")
client = Client(cluster)
Data processing pipeline overview
The data curation pipeline includes the following key steps:
- Download and sharding: The datasets are downloaded from various sources, then combined and sharded for efficient distributed processing.
- Unicode reformatting: Texts are standardized into a consistent Unicode format.
- Exact deduplication: Removes exact duplicates to reduce redundancy.
- Quality filtering:
  - Heuristic filtering: Applies rules-based filters to remove low-quality content.
  - Classifier-based filtering: Uses machine learning to classify and filter documents based on quality.
Data collecting
We sourced content from multiple datasets to enrich the diversity and volume of our training data for LLMs. These datasets include:
- The Vietnamese subset of the C4 dataset, a large and diverse collection of web-crawled text data.
- The Vietnamese subset of the OSCAR dataset, version 23.01, an aggregation of web-crawled data.
- Wikipedia’s Vietnamese articles, providing well-structured and informative content.
- A Vietnamese news corpus, offering locally relevant news articles.
Each dataset is downloaded from the Hugging Face Hub. OSCAR has access restrictions: you must first accept the conditions on its dataset page and then authenticate with a Hugging Face access token to download it.
Download and convert datasets to Parquet
Parquet is optimized for distributed systems like Dask, enabling easy partitioning and parallel processing, which boosts performance when handling large-scale data. All dataset phases will be saved in Parquet format for the purposes of this post.
The following code snippet downloads the datasets from Hugging Face and saves them as Parquet files:
import os
from datasets import load_dataset as load_hf_dataset
from datasets import DownloadConfig
data_dir = "./datasets/"
download_config = DownloadConfig(num_proc=4)
# Load and save Vietnamese Wikipedia dataset
ds = load_hf_dataset("wikimedia/wikipedia", "20231101.vi")
ds["train"].to_parquet(os.path.join(data_dir, "wiki_vi_231101.parquet"))
# Load and save Vietnamese news corpus
ds = load_hf_dataset("jetaudio/binhvq_news")
ds["train"].to_parquet(os.path.join(data_dir, "binhvq_news_train.parquet"))
# Load and save OSCAR dataset
ds = load_hf_dataset("oscar-corpus/OSCAR-2301", language="vi", token=True, download_config=download_config, trust_remote_code=True)
ds['train'].to_parquet(os.path.join(data_dir, 'oscar_vi.parquet'))
# Load and save C4 dataset
ds = load_hf_dataset("allenai/c4", data_files='multilingual/c4-vi.*.json.gz', download_config=download_config, trust_remote_code=True)
ds['train'].to_parquet(os.path.join(data_dir, "c4_vi.parquet"))
We leveraged the NeMo Curator domain classifier model to classify the documents into one of the 26 domains supported. As shown in Figure 3, the distribution is relatively even, with many domains occupying between 3% and 6% of the total data. This suggests that the dataset is quite diverse, covering a broad spectrum of topics, which is beneficial for pretraining general-purpose language models.
Combine and standardize format
Once the datasets are downloaded, the next step is to standardize and format the data consistently across all sources. These are combined into a single dataset, keeping only the ‘text’ field because all the textual data used for training the model is in this field. Non-textual data and other information generally don’t help with this task.
from datasets import concatenate_datasets
# Combine datasets and standardize format
datasets = [os.path.join(data_dir, file) for file in ["wiki_vi_231101.parquet", "c4_vi.parquet", "oscar_vi.parquet", "binhvq_news_train.parquet"]]
data_files = {"train": datasets[0]}
ds = load_hf_dataset("parquet", data_files=data_files)
ds = ds["train"].remove_columns([col for col in ds["train"].column_names if col != "text"])
for d in datasets[1:]:
    ds_ = load_hf_dataset("parquet", data_files={"train": d})
    ds_ = ds_["train"].remove_columns([col for col in ds_["train"].column_names if col != "text"])
    ds = concatenate_datasets([ds, ds_])
Shard the combined dataset
The combined dataset is then sharded into smaller chunks. Sharding is performed to distribute the data evenly across multiple workers in the Dask cluster, facilitating efficient parallel processing during the data curation stages.
# Define paths for raw data
raw_data_directory = os.path.join(data_dir, "raw")
# Shard the dataset
num_shards = 256
for shard_idx in range(num_shards):
    shard = ds.shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(os.path.join(raw_data_directory, f"{shard_idx}.parquet"))
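To make the sharding step concrete, here is a minimal sketch of splitting row indices into near-equal contiguous shards using only the standard library. The helper name `contiguous_shard` is hypothetical, and the way the `datasets` library assigns rows to shards may differ (it can also stride rows across shards, for example).

```python
def contiguous_shard(num_rows: int, num_shards: int, index: int) -> range:
    """Return the row indices for shard `index` (hypothetical helper)."""
    base, extra = divmod(num_rows, num_shards)
    # The first `extra` shards each take one additional row.
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return range(start, stop)


# Example: 10 rows split into 4 shards
shards = [list(contiguous_shard(10, 4, i)) for i in range(4)]
print(shards)
```

Shard sizes differ by at most one row, which keeps the work balanced when each Dask worker processes one file at a time.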
High-quality data processing with NeMo Curator
This section covers the different techniques we leveraged from NeMo Curator. Unicode reformatting, exact deduplication, heuristic filtering, and classifier-based filtering are used to process and refine this dataset into a high-quality final version.
Unicode reformatting
Unicode reformatting is an essential preprocessing step to ensure that text data is standardized and free of encoding errors, which are common in web-crawled datasets. The following code demonstrates how to perform Unicode reformatting using NeMo Curator:
from nemo_curator import Modify
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.utils.distributed_utils import read_data, write_to_disk
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.datasets import DocumentDataset
# Define paths for Unicode formatted data
unicode_formatted_output_path = os.path.join(data_dir, "formatted")
def load_dataset(input_data_dir, file_type="parquet"):
    files = list(get_all_files_paths_under(input_data_dir))
    raw_data = read_data(files, file_type=file_type, backend="pandas", add_filename=True)
    dataset = DocumentDataset(raw_data)
    return dataset
# Load the raw data
raw_data = load_dataset(raw_data_directory, file_type="parquet")
# Initialize the Unicode reformatter
cleaner = Modify(UnicodeReformatter())
# Apply Unicode reformatting
cleaned_data = cleaner(raw_data)
# Save the cleaned data to disk
write_to_disk(cleaned_data.df, unicode_formatted_output_path, write_to_filename=True, output_type="parquet")
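As a rough illustration of one inconsistency that matters for Vietnamese text, the sketch below uses Python's standard `unicodedata` module to show that the same visible string can be stored as precomposed or decomposed code points. This is only an illustration: it is not how `UnicodeReformatter` is implemented, and it does not repair other encoding errors such as mojibake. The helper name `to_nfc` is hypothetical.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Normalize text to the composed (NFC) Unicode form."""
    return unicodedata.normalize("NFC", text)


# "Tiếng Việt" written with precomposed letters
composed = "Ti\u1ebfng Vi\u1ec7t"
# The same text as base letters followed by combining marks
decomposed = unicodedata.normalize("NFD", composed)

# The two strings render identically but compare unequal until normalized.
print(composed == decomposed)
print(to_nfc(decomposed) == composed)
print(len(composed), len(decomposed))
```

Without a consistent form, identical documents can look different to later stages such as exact deduplication.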
Adding custom IDs to documents
Before proceeding with further curation steps, it is advisable to preprocess the dataset by adding a unique ID to each document. These IDs serve as trackers that help in identifying duplicate or low-quality documents throughout the curation process, ensuring that each document remains uniquely identifiable throughout processing.
NeMo Curator offers an AddId class, which enables users to insert custom IDs into documents using a specified prefix format, such as <prefix>_<id>. The following code snippet demonstrates this step:
from nemo_curator import AddId
# Define paths for input data and output with added IDs
add_id_input_data_dir = unicode_formatted_output_path
added_id_output_path = os.path.join(data_dir, "add_id")
add_ID_id_prefix = "VI_"
# Load the formatted dataset
dataset = DocumentDataset.read_parquet(add_id_input_data_dir)
# Initialize the AddId class with a specified prefix and start index
add_id = AddId(id_field='id', id_prefix=add_ID_id_prefix, start_index=0)
# Apply the ID addition to the dataset
id_dataset = add_id(dataset)
# Save the dataset with added IDs to disk
write_to_disk(id_dataset.df, output_file_dir=added_id_output_path, write_to_filename=True, output_type="parquet")
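To show what prefix-based IDs look like, here is a small standard-library sketch that builds sequential, zero-padded IDs. The function name, the `-` separator, and the padding width are assumptions chosen for illustration; the exact string that `AddId` produces may differ.

```python
def make_doc_ids(prefix: str, count: int, start_index: int = 0, width: int = 10) -> list:
    """Build sequential, zero-padded document IDs (hypothetical format)."""
    return [
        f"{prefix}-{i:0{width}d}"
        for i in range(start_index, start_index + count)
    ]


ids = make_doc_ids("VI", count=3)
print(ids)
```

Zero-padding keeps the IDs the same length, so they sort in the order they were assigned.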
Exact deduplication
Exact deduplication removes identical duplicates from the dataset. By eliminating exact duplicates, we ensure that each data point contributes uniquely to the training process, enhancing the diversity and overall quality of the dataset.
This stage leverages GPU acceleration using a GPU Dask cluster. The current cluster is CPU-based, so it must be shut down and a new one started with GPU support.
To close the existing cluster, use the following code:
client.cluster.close()
client.shutdown()
Then initialize the GPU Dask cluster:
os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"
from nemo_curator.utils.distributed_utils import get_client
def pre_imports():
    import cudf
client = get_client(cluster_type='gpu', set_torch_to_use_rmm=False)
client.run(pre_imports)
The implementation of exact deduplication is shown below:
from nemo_curator.modules import ExactDuplicates
# Define input and output paths
exact_dedup_input_dataset_dir = added_id_output_path
exact_dedup_base_output_path = os.path.join(data_dir, "exact_dedup")
exact_dedup_log_dir = os.path.join(exact_dedup_base_output_path, "log")
exact_dedup_output_dir = os.path.join(exact_dedup_base_output_path, "data")
deduped_output_dir = os.path.join(data_dir,"remove_duplicate")
# Create directories for logs and output
!mkdir -p {exact_dedup_log_dir}
!mkdir -p {exact_dedup_output_dir}
!mkdir -p {deduped_output_dir}
# Parameters for ExactDuplicates
exact_dedup_dataset_id_field = "id"
exact_dedup_dataset_text_field = "text"
# Load the input dataset
input_dataset = DocumentDataset.read_parquet(exact_dedup_input_dataset_dir, backend="cudf")
# Initialize and run exact deduplication
exact_dup = ExactDuplicates(
    logger=exact_dedup_log_dir,
    id_field=exact_dedup_dataset_id_field,
    text_field=exact_dedup_dataset_text_field,
    hash_method="md5",
    cache_dir=exact_dedup_output_dir
)
duplicates = exact_dup(dataset=input_dataset)
print(f"Number of exact duplicate documents: {len(duplicates)}")
# Load the dataset and the exact duplicates to identify and remove duplicate IDs
input_dataset = DocumentDataset.read_parquet(added_id_output_path, backend="cudf")
exact_duplicates = DocumentDataset.read_parquet(
    os.path.join(exact_dedup_output_dir, "_exact_duplicates.parquet"), backend="cudf")
# Extract the duplicate documents, keeping the first document of each hash group
exact_docs_to_remove = exact_duplicates.df.map_partitions(
    lambda x: x[x._hashes.duplicated(keep="first")]
)
# Remove duplicated documents from the input dataset
result = input_dataset.df[
    ~input_dataset.df[exact_dedup_dataset_id_field].isin(exact_docs_to_remove[exact_dedup_dataset_id_field].compute())
]
# Save the final deduplicated dataset
write_to_disk(result, output_file_dir=deduped_output_dir, write_to_filename=True, output_type="parquet")
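The idea behind exact deduplication can be sketched on a handful of in-memory documents: hash each text and keep only the first document seen for each hash. This CPU-only sketch uses `hashlib` and a hypothetical `exact_dedup` helper; it illustrates the technique and is not the GPU implementation used above.

```python
import hashlib


def exact_dedup(docs: list) -> list:
    """Keep the first document for each distinct MD5 hash of its text."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


docs = [
    {"id": "VI-0", "text": "Xin ch\u00e0o"},
    {"id": "VI-1", "text": "Ha Noi la thu do cua Viet Nam."},
    {"id": "VI-2", "text": "Xin ch\u00e0o"},  # same text as VI-0
]
unique_docs = exact_dedup(docs)
print([d["id"] for d in unique_docs])
```

Because only hashes are compared, two documents that differ by a single character are both kept; catching near-duplicates requires fuzzy deduplication.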
Heuristic quality filtering
Heuristic quality filtering is designed to enhance the quality of the dataset by removing low-quality content based on predefined heuristics. This approach involves applying a series of filters to the dataset to eliminate undesirable data characteristics such as excessive special characters, overly short or long texts, or other criteria that could negatively impact model performance.
We used a YAML configuration file to define the heuristic filters. This file lists the filtering criteria and settings used to build a filter pipeline. You can customize the filters or change thresholds based on your needs. The build_filter_pipeline helper reads the YAML settings and returns a filter_pipeline object that applies each filter to the dataset step by step.
from nemo_curator.utils.config_utils import build_filter_pipeline
import warnings
# Define paths for input data and output data after heuristic filtering
HF_input_data_dir = deduped_output_dir
HF_output_path = os.path.join(data_dir, "heuristic_filtering")
# Create a directory for the configuration file if it doesn't exist
os.makedirs("config", exist_ok=True)
# Download the YAML configuration file for heuristic filtering
!wget https://raw.githubusercontent.com/NVIDIA/NeMo-Curator/main/config/heuristic_filter_non-en.yaml -O ./config/heuristic_filter_non-en.yaml
# Specify the path to the configuration file
filter_config_file = "./config/heuristic_filter_non-en.yaml"
os.makedirs(HF_output_path, exist_ok=True)
# Load the filters from the YAML configuration file
filter_pipeline = build_filter_pipeline(filter_config_file)
# Load the dataset
dataset = DocumentDataset.read_parquet(HF_input_data_dir, backend="pandas")
# Suppress specific warnings during filtering
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    # Apply the heuristic filters to the dataset
    result_data = filter_pipeline(dataset)
# Save the filtered dataset to disk
result_data.to_parquet(HF_output_path, write_to_filename=True)
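To give a feel for what a heuristic filter does, the sketch below chains two simple rules in plain Python: a word-count range and a cap on the share of symbol characters. The function names and thresholds are hypothetical and chosen only for illustration; they are not the filters or values defined in the YAML file above.

```python
def word_count_ok(text: str, min_words: int = 5, max_words: int = 1000) -> bool:
    """Reject documents that are too short or too long (illustrative thresholds)."""
    n_words = len(text.split())
    return min_words <= n_words <= max_words


def symbol_ratio_ok(text: str, max_ratio: float = 0.3) -> bool:
    """Reject documents dominated by non-alphanumeric, non-space characters."""
    if not text:
        return False
    n_symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return n_symbols / len(text) <= max_ratio


FILTERS = [word_count_ok, symbol_ratio_ok]


def passes_all(text: str) -> bool:
    """A document is kept only if every filter accepts it."""
    return all(f(text) for f in FILTERS)


docs = [
    "Ha Noi la thu do cua nuoc Viet Nam.",  # ordinary sentence
    "!!! $$$ ### @@@ %%% ^^^",              # mostly symbols
    "Qua ngan",                             # too short
]
kept = [d for d in docs if passes_all(d)]
print(kept)
```

Applying the filters in sequence mirrors the step-by-step behavior described above: a document is dropped as soon as any single rule rejects it.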
Distribution of token counts
Now examine how heuristic filtering changes the dataset. Before filtering, the dataset contained a wide range of text lengths, with some documents being as short as a few tokens and others extending to more than 16K tokens. Post-filtering, the dataset exhibited a more uniform distribution of text lengths and token counts. The filtering process effectively removed extremely short documents (those under 64 tokens, for example) and trimmed down overly long documents that might have included redundant or irrelevant content.
Character-based metrics
Figure 5 shows a comparative analysis of the dataset before and after heuristic curation for each metric.
The box plots highlight a significant reduction in outliers after heuristic filtering. For symbols, the 95th percentile decreases from 8.84% to 5.47%, and for numbers, it drops from 11.19% to 6.14%. For whitespace, the maximum value drops from 76.06% to 25.88%, while the 95th percentile remains steady. These reductions indicate that heuristic filtering effectively targets and removes noisy data with high proportions of symbols, numbers, or spaces, improving overall dataset quality.
The filtering removed extremely long documents, but the general word count distribution across documents remains similar. This indicates the removal of unusually long documents with malformed tokens.
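Character-based metrics like these can be computed per document with a few lines of Python. The sketch below is one plausible definition (each category as a percentage of all characters); the exact definitions behind the figures in this post are not stated, so treat the helper `char_percentages` and its category rules as assumptions.

```python
def char_percentages(text: str) -> dict:
    """Percentage of symbol, number, and whitespace characters in a text."""
    total = len(text)
    if total == 0:
        return {"symbols": 0.0, "numbers": 0.0, "whitespace": 0.0}
    numbers = sum(ch.isdigit() for ch in text)
    whitespace = sum(ch.isspace() for ch in text)
    # Anything that is neither alphanumeric nor whitespace counts as a symbol.
    symbols = sum(not (ch.isalnum() or ch.isspace()) for ch in text)
    return {
        "symbols": 100.0 * symbols / total,
        "numbers": 100.0 * numbers / total,
        "whitespace": 100.0 * whitespace / total,
    }


# 13 characters: 2 symbols (":" and "!"), 3 digits, 2 spaces
metrics = char_percentages("Gi\u00e1: 100 USD!")
print(metrics)
```

Computing these values for every document before and after filtering yields the distributions that the box plots compare.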
Classifier-based quality filtering
Heuristic filtering removes low-quality content using simple rules, but it can’t catch more complex patterns of quality. Classifier-based filtering uses a trained classifier model to sort content as high or low quality, offering a smarter and more flexible way to handle diverse datasets that simple rules might miss.
Prepare data for training classifier
Training a quality classifier requires representative samples of both high-quality and low-quality content. For high-quality data, we used articles from Wikipedia’s Vietnamese edition, which are generally well-structured and reliable. The low-quality samples come from an unfiltered, crawled Vietnamese news corpus.
Here’s how the data is prepared:
# Paths for high-quality and low-quality sample data
hq_samples_path = os.path.join(data_dir, "classifier_filtering/train_samples/hq")
lq_samples_path = os.path.join(data_dir, "classifier_filtering/train_samples/lq")
# Load and shard the high-quality dataset
ds = load_hf_dataset("wikimedia/wikipedia", "20231101.vi")
num_shards = 8
for shard_idx in range(num_shards):
    shard = ds["train"].shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(os.path.join(hq_samples_path, f"{shard_idx}.parquet"))
# Load and shard the low-quality dataset
ds = load_hf_dataset("vietgpt/binhvq_news_vi", split="train[:100000]")
num_shards = 32
for shard_idx in range(num_shards):
    shard = ds.shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(os.path.join(lq_samples_path, f"{shard_idx}.parquet"))
Training classifier
The classifier is trained using FastText, which offers an efficient and effective method for text classification. Here’s how the classifier is trained using samples labeled as high-quality and low-quality:
from nemo_curator.modifiers import FastTextLabelModifier
import fasttext
import random
# Function to create labeled samples
def create_samples(data_path, label, num_samples):
    raw_dataset = DocumentDataset.read_parquet(data_path, backend='pandas')
    label_quality = Modify(FastTextLabelModifier(label))
    labeled_dataset = label_quality(raw_dataset)
    labeled_samples = labeled_dataset.df.sample(frac=num_samples / len(labeled_dataset.df))
    return labeled_samples["text"].compute().values.tolist()
# Prepare training data
low_quality_samples = create_samples(lq_samples_path, "__label__lq", 100000)
high_quality_samples = create_samples(hq_samples_path, "__label__hq", 100000)
train_samples = low_quality_samples + high_quality_samples
random.shuffle(train_samples)
# Save training data to a file
train_file = "./cf_model_fasttext.train"
with open(train_file, "w", encoding="utf-8") as f:
    for sample in train_samples:
        f.write(sample + "\n")
# Train the FastText classifier
model = fasttext.train_supervised(input=train_file, lr=0.01, dim=100, epoch=5, wordNgrams=2)
model_path = "./cf_model_fasttext_model.bin"
model.save_model(model_path)
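fastText's supervised mode reads one training example per line, with each label marked by the `__label__` prefix. The sketch below shows that line format with a hypothetical helper, `to_fasttext_line`, that also collapses internal newlines so a multi-line document stays on a single line. Whether your own data needs this extra flattening is an assumption to verify against your corpus.

```python
def to_fasttext_line(text: str, label: str) -> str:
    """Format one example as '<label> <text>' on a single line."""
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    single_line = " ".join(text.split())
    return f"{label} {single_line}"


line = to_fasttext_line("Dong mot\nDong hai", "__label__hq")
print(line)
```

If a document keeps its internal newlines, every line after the first is read as a separate example with no label.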
Classify and filter the dataset
Once trained, the classifier is used to filter the dataset, categorizing documents into high and low quality based on the learned distinctions:
from nemo_curator.filters import FastTextQualityFilter
from nemo_curator import ScoreFilter
# Define paths and load the dataset
CF_input_data_dir = HF_output_path
CF_output_path = os.path.join(data_dir, "classifier_filtering/output")
target_dataset = DocumentDataset.read_parquet(CF_input_data_dir, backend="pandas")
# Set up the filtering pipeline
filter_pipeline = ScoreFilter(FastTextQualityFilter(model_path), score_field="quality_score", score_type=float)
filtered_dataset = filter_pipeline(target_dataset)
# Save the filtered dataset
write_to_disk(filtered_dataset.df, output_file_dir=CF_output_path, write_to_filename=True, output_type="parquet")
Remove sensitive and sentimental data
Both the Adult and Sensitive Subjects domains, as well as Positive and Negative sentiments, were notably reduced. This makes the model safer, more neutral, and better at handling diverse contexts with appropriate responses.
Preserve content diversity
Run the domain classifier model again and check the diversity of the content of the dataset. The dataset now shows a balanced distribution across domains, with most holding 3% to 8% of the data. This variety, from News and Law to specialized areas like Games and Autos, ensures the model can handle a wide range of topics. Even after filtering to improve quality and remove harmful content, the essential diversity is preserved, which is crucial for building a versatile, general-purpose language model.
Reduce dataset size after each stage
Approximately 90% of the dataset is removed across the pipeline; the removed documents are low-quality, noisy, or malformed samples. This selective filtering keeps only the higher-quality portion of the data for training. The largest reduction comes from classifier-based filtering (45.43%), which indicates that a substantial amount of content is flagged as lower quality or harmful and is removed in this phase. Heuristic filtering accounts for 35.74% of data removal, targeting issues like sample length, repeated n-grams, and noise. Exact deduplication removes a smaller portion of the data (8.31%).
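The per-stage figures can be checked with simple arithmetic. The sketch below assumes each percentage is expressed relative to the original dataset, which is consistent with the roughly 90% total stated above.

```python
# Share of the original dataset removed at each stage (figures from this post)
removed_pct = {
    "exact_deduplication": 8.31,
    "heuristic_filtering": 35.74,
    "classifier_based_filtering": 45.43,
}

total_removed = sum(removed_pct.values())
remaining = 100.0 - total_removed

print(f"Removed: {total_removed:.2f}%  Remaining: {remaining:.2f}%")
```

Under that assumption, the three stages together remove 89.48% of the data, leaving about 10.52% in the final dataset.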
Embedding visualization
The final dataset demonstrates a similar distribution to the original one. The diversity of topics is still preserved, with most domains remaining well represented. Some smaller clusters appear slightly more defined after this step, which could be due to the removal of low-quality or harmful content.
Through both heuristic filtering and classifier-based filtering, the dataset maintains its broad range of domain diversity. The distinct clusters that represent specific domains remain well-defined, while the more general and overlapping domains continue to show interconnections, ensuring that the dataset remains balanced and comprehensive for pretraining purposes.
Conclusion
This blog post showcases the data curation pipeline Viettel Solutions used for Vietnamese text data, along with an analysis to explore how each stage of the curation process impacts the dataset. The pipeline uses NVIDIA NeMo Curator, a valuable tool for preparing large datasets for pretraining language models, focusing on quality, efficiency, and scalability. It offers a range of significant advantages in the data curation process, including:
- Improving dataset quality by removing noise and harmful content using heuristic and classifier-based filters.
- Preserving the essential structure of the dataset, ensuring that the core characteristics remain intact post-curation.
- Adapting to different datasets, providing a tailored approach that meets the specific needs of each corpus.
To see the full code used for this post, reference the Jupyter Notebook. Check out the NeMo Curator example scripts for other techniques such as Fuzzy Deduplication and PII redaction.