Processing High-Quality Vietnamese Language Data with NVIDIA NeMo Curator

Open-source large language models (LLMs) excel in English but struggle with other languages, especially the languages of Southeast Asia. This is primarily due to a lack of training data in these languages, limited understanding of local cultures, and insufficient tokens to capture unique linguistic structures and expressions.

To fully meet customer needs, enterprises in non-English-speaking countries must go beyond generic models and customize them to capture the nuances of their local languages, ensuring a seamless and impactful customer experience.

In this blog post, we explore how Viettel Solutions, a fast-growing subsidiary of Viettel Corporation, has leveraged NVIDIA NeMo Curator to process high-quality Vietnamese data for training Llama 3 ViettelSolution 8B, a state-of-the-art LLM that now ranks among the top models on the VMLU leaderboard. NeMo Curator is a GPU-accelerated data-curation tool for preparing large-scale, high-quality datasets for pretraining LLMs.

A crucial first step in this journey was curating large-scale, high-quality datasets. This post will guide you through the data curation pipeline used, including sample code for each stage and a detailed exploratory data analysis (EDA) to illustrate the impact of each step. By the end of the post, you’ll have a clear road map and reference to easily get started with NeMo Curator, whether for Vietnamese or other languages.

Viettel Solutions, a pioneer in providing digital transformation solutions for the Vietnamese government and enterprises, is focused on addressing the growing demand for AI adoption across various industries. With a vision to lead in generative AI and develop AI-powered products for customers, Viettel collaborated with the NVIDIA NeMo Curator team.

“NeMo Curator’s GPU-accelerated features, including exact and fuzzy deduplication as well as heuristic and classifier filtering, helped increase accuracy by 10%, accelerate training time by three times, and reduce the dataset size by 60%,” according to Tuan Nguyen, head of Data Analytics at Viettel Solutions.

Prerequisites and environment setup

To follow along with the steps presented in this post, set up your environment as described in this section.

Installation

To begin, install NeMo Curator by following the instructions to install the CPU and CUDA-accelerated modules in the README file of the NeMo Curator repository.

Next, install the datasets and jsonlines packages, which will be needed later.

pip install datasets
pip install jsonlines

Data processing requires a Dask environment. Dask is a flexible, open-source library for parallel and distributed computing in Python that lets you scale computations across multiple cores or even clusters. By distributing tasks, Dask makes data handling significantly faster and more efficient.

We ran this experiment on an NVIDIA DGX A100 with a 128-core CPU and 2TB of RAM to handle the dataset size. Depending on your dataset and computing resources, you may need to adjust the Dask worker configuration accordingly. You can start a Dask cluster using the following commands:

import nemo_curator
from dask.distributed import Client, LocalCluster
# Start a Dask cluster with 12 workers, each limited to 64 GB of memory.
# Adjust these numbers according to your computing resources.
cluster = LocalCluster(n_workers=12, processes=True, memory_limit="64GB")
client = Client(cluster)

Data processing pipeline overview

The data curation pipeline includes the following key steps:

  • Download and sharding: The datasets are downloaded from various sources, then combined and sharded for efficient distributed processing.
  • Unicode reformatting: Texts are standardized into a consistent Unicode format.
  • Exact deduplication: Removes exact duplicates to reduce redundancy.
  • Quality filtering
    • Heuristic filtering: Applies rules-based filters to remove low-quality content.
    • Classifier-based filtering: Uses machine learning to classify and filter documents based on quality.
Data processing pipeline with NeMo Curator. The pipeline includes these key steps, Download and Sharding, Unicode Reformatting, Exact Deduplication and Quality Filtering.
Figure 1. The data processing pipeline built with NeMo Curator

Data collecting

We sourced content from multiple datasets to enrich the diversity and volume of our training data for LLMs. These datasets include:

  • The Vietnamese subset of the C4 dataset, a large and diverse collection of web-crawled text data.
  • The Vietnamese subset of the OSCAR dataset, version 23.01, an aggregation of web-crawled data.
  • Wikipedia’s Vietnamese articles, providing well-structured and informative content.
  • A Vietnamese news corpus, offering locally relevant news articles.

Each dataset is downloaded from the Hugging Face Hub. OSCAR has access restrictions: you must first accept the conditions on its dataset page, then authenticate with a Hugging Face access token when downloading.

Download and convert datasets to Parquet

Parquet is optimized for distributed systems like Dask, enabling easy partitioning and parallel processing, which boosts performance when handling large-scale data. For this post, the output of every stage is saved in Parquet format.

The following code snippet downloads the datasets from Hugging Face and saves them as Parquet files:

import os
from datasets import load_dataset as load_hf_dataset
from datasets import DownloadConfig 

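data_dir = "./datasets/"
# Make sure the output directory exists before writing Parquet files
os.makedirs(data_dir, exist_ok=True)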
download_config = DownloadConfig(num_proc=4)

# Load and save Vietnamese Wikipedia dataset
ds = load_hf_dataset("wikimedia/wikipedia", "20231101.vi")
ds["train"].to_parquet(os.path.join(data_dir, "wiki_vi_231101.parquet"))

# Load and save Vietnamese news corpus
ds = load_hf_dataset("jetaudio/binhvq_news")
ds["train"].to_parquet(os.path.join(data_dir, "binhvq_news_train.parquet"))

# Load and save OSCAR dataset
ds = load_hf_dataset("oscar-corpus/OSCAR-2301", language="vi", token=True, download_config=download_config, trust_remote_code=True)
ds['train'].to_parquet(os.path.join(data_dir, 'oscar_vi.parquet'))

# Load and save C4 dataset
ds = load_hf_dataset("allenai/c4", data_files='multilingual/c4-vi.*.json.gz', download_config=download_config, trust_remote_code=True)
ds['train'].to_parquet(os.path.join(data_dir, "c4_vi.parquet"))
Pie chart showing proportion of the raw dataset by sources. About 67% of the data comes from C4, 17% from News corpus, 15% from Oscar dataset, and the remaining 1% are Wikipedia articles.
Figure 2. Proportion of the raw dataset by sources

We leveraged the NeMo Curator domain classifier model to classify the documents into one of the 26 domains supported. As shown in Figure 3, the distribution is relatively even, with many domains occupying between 3% and 6% of the total data. This suggests that the dataset is quite diverse, covering a broad spectrum of topics, which is beneficial for pretraining general-purpose language models.

This pie chart illustrates the domain distribution within the raw dataset as identified by a domain classifier model. The largest domain is Business and Industrial at 7.1%, followed closely by Arts and Entertainment at 6.8% and News at 6.4%. Other significant categories include Health at 6.2% and Sensitive Subjects at 5.7%. Smaller domains represented are Shopping (2.3%) and Games (2.5%), highlighting the diverse content within the dataset.
Figure 3. Domain proportion in the raw dataset identified by the Domain classifier model

Combine and standardize format

Once the datasets are downloaded, the next step is to standardize and format the data consistently across all sources. These are combined into a single dataset, keeping only the ‘text’ field because all the textual data used for training the model is in this field. Non-textual data and other information generally don’t help with this task.

from datasets import concatenate_datasets
# Combine datasets and standardize format
datasets = [os.path.join(data_dir, file) for file in ["wiki_vi_231101.parquet", "c4_vi.parquet", "oscar_vi.parquet", "binhvq_news_train.parquet"]]

data_files = {"train": datasets[0]}
ds = load_hf_dataset("parquet", data_files=data_files)
ds = ds["train"].remove_columns([col for col in ds["train"].column_names if col != "text"])

for d in datasets[1:]:
    ds_ = load_hf_dataset("parquet", data_files={"train": d})
    ds_ = ds_["train"].remove_columns([col for col in ds_["train"].column_names if col != "text"])
    ds = concatenate_datasets([ds, ds_])

Shard the combined dataset

The combined dataset is then sharded into smaller chunks. Sharding is performed to distribute the data evenly across multiple workers in the Dask cluster, facilitating efficient parallel processing during the data curation stages.

# Define paths for raw data
raw_data_directory = os.path.join(data_dir, "raw")
os.makedirs(raw_data_directory, exist_ok=True)

# Shard the dataset
num_shards = 256
for shard_idx in range(num_shards):
    shard = ds.shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(os.path.join(raw_data_directory, f"{shard_idx}.parquet"))
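
To make the sharding step more concrete, here is a minimal pure-Python sketch of splitting a collection into a fixed number of shards so that every row lands in exactly one shard. The shard_rows helper is hypothetical, and strided assignment (row i goes to shard i % num_shards) is used only for illustration; the Hugging Face shard method may lay rows out contiguously instead, depending on the library version and arguments.

```python
def shard_rows(rows, num_shards):
    """Split rows into num_shards lists using strided assignment."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    shards = [[] for _ in range(num_shards)]
    for i, row in enumerate(rows):
        shards[i % num_shards].append(row)
    return shards


docs = [f"doc-{i}" for i in range(10)]
shards = shard_rows(docs, num_shards=4)

# Shard sizes differ by at most one, so each worker gets a similar load.
print([len(s) for s in shards])
```

Whichever layout is used, the property that matters for the later stages is that the shards are disjoint and together cover the whole dataset.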

High-quality data processing with NeMo Curator

This section covers the different techniques we leveraged from NeMo Curator. Unicode reformatting, exact deduplication, heuristic filtering, and classifier-based filtering are used to process and refine this dataset into a high-quality final version.

Unicode reformatting

Unicode reformatting is an essential preprocessing step to ensure that text data is standardized and free of encoding errors, which are common in web-crawled datasets. The following code demonstrates how to perform Unicode reformatting using NeMo Curator:

from nemo_curator import Modify
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.utils.distributed_utils import read_data, write_to_disk
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.datasets import DocumentDataset

# Define paths for Unicode formatted data
unicode_formatted_output_path = os.path.join(data_dir, "formatted")

def load_dataset(input_data_dir, file_type="parquet"):
    files = list(get_all_files_paths_under(input_data_dir))
    raw_data = read_data(files, file_type=file_type, backend="pandas", add_filename=True)
    dataset = DocumentDataset(raw_data)

    return dataset

# Load the raw data
raw_data = load_dataset(raw_data_directory, file_type="parquet")

# Initialize the Unicode reformatter
cleaner = Modify(UnicodeReformatter())

# Apply Unicode reformatting
cleaned_data = cleaner(raw_data)

# Save the cleaned data to disk
write_to_disk(cleaned_data.df, unicode_formatted_output_path, write_to_filename=True, output_type="parquet")
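
To illustrate the kind of problem this step addresses, the following standard-library sketch repairs one common encoding error (UTF-8 text mistakenly decoded as Latin-1) and then composes characters into NFC form, which matters for Vietnamese because the same accented letter can be stored as one code point or as a base letter plus combining marks. The repair_mojibake and normalize_text helpers are hypothetical and are not how UnicodeReformatter is implemented; the NeMo Curator modifier covers many more cases.

```python
import unicodedata


def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1, if possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Not mojibake of this particular kind; keep the text unchanged.
        return text


def normalize_text(text):
    """Repair one common encoding error, then compose characters (NFC)."""
    return unicodedata.normalize("NFC", repair_mojibake(text))


clean = unicodedata.normalize("NFC", "Tiếng Việt")
decomposed = unicodedata.normalize("NFD", clean)   # base letters + combining marks
garbled = clean.encode("utf-8").decode("latin-1")  # simulated mojibake

print(normalize_text(decomposed) == clean)
print(normalize_text(garbled) == clean)
```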

Adding custom IDs to documents

Before proceeding with further curation steps, it is advisable to add a unique ID to each document. These IDs act as trackers that help identify duplicate or low-quality documents and keep each document uniquely identifiable throughout the curation process.

NeMo Curator offers an AddId class, which enables users to insert custom IDs into documents using a specified prefix format, such as <prefix>_<id>. The following code snippet demonstrates this step:

from nemo_curator import AddId

# Define paths for input data and output with added IDs
add_id_input_data_dir = unicode_formatted_output_path
added_id_output_path = os.path.join(data_dir, "add_id")
add_ID_id_prefix = "VI_"

# Load the formatted dataset
dataset = DocumentDataset.read_parquet(add_id_input_data_dir)

# Initialize the AddId class with a specified prefix and start index
add_id = AddId(id_field='id', id_prefix=add_ID_id_prefix, start_index=0)

# Apply the ID addition to the dataset
id_dataset = add_id(dataset)

# Save the dataset with added IDs to disk
write_to_disk(id_dataset.df, output_file_dir=added_id_output_path, write_to_filename=True, output_type="parquet")
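
As a rough picture of what ID assignment produces, the sketch below attaches sequential, zero-padded IDs to a small list of documents. The add_ids helper, the underscore separator, and the 10-digit padding are assumptions made for illustration; the exact string that AddId generates may differ.

```python
def add_ids(documents, prefix, start_index=0, width=10):
    """Attach a unique, zero-padded ID of the form <prefix>_<number> to each document."""
    return [
        {"id": f"{prefix}_{start_index + offset:0{width}d}", "text": text}
        for offset, text in enumerate(documents)
    ]


records = add_ids(["Xin chào", "Hà Nội", "Xin chào"], prefix="VI")
for record in records:
    print(record["id"], record["text"])
```

Note that the first and third documents have identical text but different IDs, which is what lets a later stage name exactly which copy to drop.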

Exact deduplication

Exact deduplication removes identical duplicates from the dataset. By eliminating exact duplicates, we ensure that each data point contributes uniquely to the training process, enhancing the diversity and overall quality of the dataset.

This stage leverages GPU acceleration using a GPU Dask cluster. The current cluster is CPU-based, so it must be shut down and a new one started with GPU support.

To close the existing cluster, use the following code:

client.cluster.close()
client.shutdown()

Then initialize the GPU Dask cluster:

os.environ["DASK_DATAFRAME__QUERY_PLANNING"] = "False"

from nemo_curator.utils.distributed_utils import get_client

def pre_imports():
    import cudf 

client = get_client(cluster_type='gpu', set_torch_to_use_rmm=False)
client.run(pre_imports)

The implementation of exact deduplication is shown below:

from nemo_curator.modules import ExactDuplicates

# Define input and output paths
exact_dedup_input_dataset_dir = added_id_output_path
exact_dedup_base_output_path = os.path.join(data_dir, "exact_dedup")
exact_dedup_log_dir = os.path.join(exact_dedup_base_output_path, "log")
exact_dedup_output_dir = os.path.join(exact_dedup_base_output_path, "data")
deduped_output_dir = os.path.join(data_dir,"remove_duplicate")

# Create directories for logs and output
os.makedirs(exact_dedup_log_dir, exist_ok=True)
os.makedirs(exact_dedup_output_dir, exist_ok=True)
os.makedirs(deduped_output_dir, exist_ok=True)

# Parameters for ExactDuplicates
exact_dedup_dataset_id_field = "id"
exact_dedup_dataset_text_field = "text"

# Load the input dataset
input_dataset = DocumentDataset.read_parquet(exact_dedup_input_dataset_dir, backend="cudf")

# Initialize and run exact deduplication
exact_dup = ExactDuplicates(
    logger=exact_dedup_log_dir,
    id_field=exact_dedup_dataset_id_field,
    text_field=exact_dedup_dataset_text_field,
    hash_method="md5",
    cache_dir=exact_dedup_output_dir
)
duplicates = exact_dup(dataset=input_dataset)

print(f"Number of exact duplicate documents: {len(duplicates)}")

# Load the dataset and the exact-duplicate list to identify and remove duplicate IDs
input_dataset = DocumentDataset.read_parquet(added_id_output_path, backend="cudf")
exact_duplicates = DocumentDataset.read_parquet(
    os.path.join(exact_dedup_output_dir, "_exact_duplicates.parquet"), backend="cudf")

# Extract list of duplicate document IDs
exact_docs_to_remove = exact_duplicates.df.map_partitions(
    lambda x: x[x._hashes.duplicated(keep="first")]
)

# Remove duplicated documents from the input dataset
result = input_dataset.df[
    ~input_dataset.df[exact_dedup_dataset_id_field].isin(
        exact_docs_to_remove[exact_dedup_dataset_id_field].compute()
    )
]

# Save the final deduplicated dataset
write_to_disk(result, output_file_dir=deduped_output_dir, write_to_filename=True, output_type="parquet")
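
The logic of the code above (hash each document, keep the first occurrence of every hash, drop the rest by ID) can be sketched on the CPU with the standard library alone. The find_exact_duplicates helper and the sample records are hypothetical, and this is a small-scale illustration rather than the GPU implementation in ExactDuplicates.

```python
import hashlib


def find_exact_duplicates(records):
    """Return the IDs of documents whose text repeats an earlier document."""
    seen_hashes = set()
    duplicate_ids = []
    for record in records:
        digest = hashlib.md5(record["text"].encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            duplicate_ids.append(record["id"])  # later copy: mark for removal
        else:
            seen_hashes.add(digest)  # first occurrence: keep
    return duplicate_ids


records = [
    {"id": "VI_0", "text": "Hà Nội là thủ đô của Việt Nam."},
    {"id": "VI_1", "text": "Phở là một món ăn nổi tiếng."},
    {"id": "VI_2", "text": "Hà Nội là thủ đô của Việt Nam."},
]
to_remove = set(find_exact_duplicates(records))
deduplicated = [r for r in records if r["id"] not in to_remove]
print([r["id"] for r in deduplicated])
```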

Heuristic quality filtering

Heuristic quality filtering is designed to enhance the quality of the dataset by removing low-quality content based on predefined heuristics. This approach involves applying a series of filters to the dataset to eliminate undesirable data characteristics such as excessive special characters, overly short or long texts, or other criteria that could negatively impact model performance.

We used a YAML configuration file to define the heuristic filters. This file lists the filtering criteria and settings used to build a filter pipeline. You can customize the filters or change thresholds based on your needs. The build_filter_pipeline helper reads the YAML settings and builds a pipeline that applies each filter to the dataset step by step.

from nemo_curator.utils.config_utils import build_filter_pipeline
import warnings

# Define paths for input data and output data after heuristic filtering
HF_input_data_dir = deduped_output_dir
HF_output_path = os.path.join(data_dir, "heuristic_filtering")

# Create a directory for the configuration file if it doesn't exist
os.makedirs("config", exist_ok=True)
# Download the YAML configuration file for heuristic filtering
!wget https://raw.githubusercontent.com/NVIDIA/NeMo-Curator/main/config/heuristic_filter_non-en.yaml -O ./config/heuristic_filter_non-en.yaml

# Specify the path to the configuration file
filter_config_file = "./config/heuristic_filter_non-en.yaml"
os.makedirs(HF_output_path, exist_ok=True)

# Load the filters from the YAML configuration file
filter_pipeline = build_filter_pipeline(filter_config_file)

# Load the dataset
dataset = DocumentDataset.read_parquet(HF_input_data_dir, backend="pandas")

# Suppress specific warnings during filtering
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    # Apply the heuristic filters to the dataset
    result_data = filter_pipeline(dataset)
 
    # Save the filtered dataset to disk
    result_data.to_parquet(HF_output_path, write_to_filename=True)
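
To give a feel for what rule-based filters look like, here is a minimal sketch of two heuristics, a word-count range and a maximum symbol ratio, applied in sequence. The function names and every threshold are hypothetical; they are not the filters or values defined in heuristic_filter_non-en.yaml.

```python
import unicodedata


def symbol_ratio(text):
    """Fraction of characters that are neither alphanumeric nor whitespace."""
    text = unicodedata.normalize("NFC", text)  # count accented letters as letters
    if not text:
        return 1.0
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / len(text)


def passes_heuristics(text, min_words=5, max_words=100_000, max_symbol_ratio=0.3):
    """Apply each rule in turn; a document must pass all of them to be kept."""
    word_count = len(text.split())
    if word_count < min_words or word_count > max_words:
        return False
    return symbol_ratio(text) <= max_symbol_ratio


docs = [
    "Hà Nội là thủ đô của nước Việt Nam.",  # ordinary sentence
    "Xem thêm >>",                           # too short
    "### $$$ %%% @@@ !!! ??? *** ((( )))",   # mostly symbols
]
kept = [doc for doc in docs if passes_heuristics(doc)]
print(len(kept))
```

Keeping each rule as a small function with an explicit threshold mirrors the YAML-driven approach: thresholds can be tuned per language without touching the filtering logic.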

Distribution of token counts

Now examine how heuristic filtering changes the dataset. Before filtering, the dataset contained a wide range of text lengths, with some documents being as short as a few tokens and others extending to more than 16K tokens. Post-filtering, the dataset exhibited a more uniform distribution of text lengths and token counts. The filtering process effectively removed extremely short documents (those under 64 tokens, for example) and reduced the number of overly long documents that might have included redundant or irrelevant content.

This histogram illustrates the frequency of Log2 token counts per sample in the deduplicated and heuristic-filtered datasets, highlighting the removal of extremely short and overly long documents. In the deduplicated dataset, the highest frequency occurs at Log2 = 10 with 29.6 million samples, while heuristic filtering reduces this to 22.3 million. For Log2 = 8, the frequency drops from 19.7 million in the deduplicated dataset to 12.8 million after filtering. At the lower end, the frequency for Log2 = 4 is 0.6 million in the deduplicated dataset but eliminated in the heuristic-filtered dataset.
Figure 4. Comparison of sample length distribution, measured by the log2 of the number of tokens per sample in the deduplicated and heuristic-filtered datasets
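
A histogram like the one in Figure 4 can be approximated by bucketing documents on the floor of log2 of their token count. The sketch below is an assumption about the bucketing (the figure may round differently), and it uses whitespace-separated words as a stand-in for a real tokenizer; log2_length_histogram is a hypothetical helper.

```python
from collections import Counter


def log2_length_histogram(documents):
    """Count documents per floor(log2(token count)) bucket."""
    histogram = Counter()
    for text in documents:
        num_tokens = len(text.split())  # whitespace tokens stand in for a real tokenizer
        if num_tokens > 0:
            # bit_length() - 1 is an exact integer floor(log2(n)) for n >= 1
            histogram[num_tokens.bit_length() - 1] += 1
    return dict(sorted(histogram.items()))


docs = ["a b c", "a b c d", "a b c d e f g h i", " ".join(["w"] * 1024)]
print(log2_length_histogram(docs))
```

Running the same function on the dataset before and after filtering gives the two series needed for a side-by-side comparison.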

Character-based metrics

Figure 5 shows a comparative analysis of the dataset before and after heuristic curation for each metric.

This set of box plots shows the effect of heuristic filtering on symbol, number, and whitespace percentages in the dataset, demonstrating significant noise reduction. In the raw data, the maximum symbol percentage reaches 99.44%, while heuristic filtering reduces it to 81.61%. The number percentage also drops from a maximum of 91.53% in the raw data to 15.40% after filtering. For whitespace percentage, the maximum value decreases from 76.06% to 25.88%, indicating more consistent text formatting post-filtering.
Figure 5. Box plots of symbol, number, and whitespace percentages before and after heuristic curation 

The box plots highlight a significant reduction in outliers after heuristic filtering. For symbols, the 95th percentile decreases from 8.84% to 5.47%, and for numbers, it drops from 11.19% to 6.14%. For whitespace, the maximum drops from 76.06% to 25.88%, while the 95th percentile remains steady. These reductions indicate that heuristic filtering effectively targets and removes noisy data with high proportions of symbols, numbers, or spaces, improving overall dataset quality.
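
The per-document metrics behind box plots like these can be computed with a few character tests. The definitions below (a symbol is any character that is neither alphanumeric nor whitespace) are our assumption for illustration, since the post does not spell out the exact definitions used for Figure 5; character_percentages is a hypothetical helper.

```python
def character_percentages(text):
    """Return the percentage of symbol, digit, and whitespace characters in text."""
    total = len(text)
    if total == 0:
        return {"symbol": 0.0, "number": 0.0, "whitespace": 0.0}
    digits = sum(ch.isdigit() for ch in text)
    spaces = sum(ch.isspace() for ch in text)
    symbols = sum(not ch.isalnum() and not ch.isspace() for ch in text)
    return {
        "symbol": 100 * symbols / total,
        "number": 100 * digits / total,
        "whitespace": 100 * spaces / total,
    }


print(character_percentages("Gia: 100 $"))
```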

This chart presents box plots of word counts (left) and mean word length (right) before and after heuristic filtering, showing how extreme outliers are reduced. In the raw dataset, the maximum word count is 163,822, while the heuristic-filtered dataset has a reduced maximum of 100,948. The median word count rises from 459 in the raw dataset to 574 post-filtering, indicating a focus on more substantial text samples. For mean word length, the maximum decreases from 9.24 to 5.29, reflecting a more consistent text quality after filtering.
Figure 6. Box plot of Word counts and Mean word length before and after heuristic filtering

The filtering removed extremely long documents, but the general word count distribution across documents remains similar. This indicates the removal of unusually long documents with malformed tokens.

Classifier-based quality filtering

Heuristic filtering removes low-quality content using simple rules, but it can’t catch more complex patterns of quality. Classifier-based filtering uses a trained classifier model to sort content as high or low quality, offering a smarter and more flexible way to handle diverse datasets that simple rules might miss.

Prepare data for training classifier

Training a quality classifier requires representative samples of both high-quality and low-quality content. For high-quality data, we used articles from Wikipedia’s Vietnamese edition, which are generally well-structured and reliable. The low-quality samples come from an unfiltered, crawled Vietnamese news corpus.

Here’s how the data is prepared:

# Paths for high-quality and low-quality sample data
hq_samples_path = os.path.join(data_dir, "classifier_filtering/train_samples/hq")
lq_samples_path = os.path.join(data_dir, "classifier_filtering/train_samples/lq")
os.makedirs(hq_samples_path, exist_ok=True)
os.makedirs(lq_samples_path, exist_ok=True)

# Load and shard the high-quality dataset
ds = load_hf_dataset("wikimedia/wikipedia", "20231101.vi")
num_shards = 8
for shard_idx in range(num_shards):
    shard = ds["train"].shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(os.path.join(hq_samples_path, f"{shard_idx}.parquet"))

# Load and shard the low-quality dataset
ds = load_hf_dataset("vietgpt/binhvq_news_vi",split="train[:100000]")
num_shards = 32
for shard_idx in range(num_shards):
    shard = ds.shard(index=shard_idx, num_shards=num_shards)
    shard.to_parquet(os.path.join(lq_samples_path, f"{shard_idx}.parquet"))

Training classifier

The classifier is trained using FastText, which offers an efficient and effective method for text classification. Here’s how the classifier is trained using samples labeled as high-quality and low-quality:

from nemo_curator.modifiers import FastTextLabelModifier
import fasttext
import random

# Function to create labeled samples
def create_samples(data_path, label, num_samples):
    raw_dataset = DocumentDataset.read_parquet(data_path, backend='pandas')
    label_quality = Modify(FastTextLabelModifier(label))
    labeled_dataset = label_quality(raw_dataset)
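    # Cap the fraction at 1.0 in case fewer than num_samples documents are available
    frac = min(1.0, num_samples / len(labeled_dataset.df))
    labeled_samples = labeled_dataset.df.sample(frac=frac)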
    return labeled_samples["text"].compute().values.tolist()

# Prepare training data
low_quality_samples = create_samples(lq_samples_path, "__label__lq", 100000)
high_quality_samples = create_samples(hq_samples_path, "__label__hq", 100000)
train_samples = low_quality_samples + high_quality_samples
random.shuffle(train_samples)

# Save training data to a file
train_file = "./cf_model_fasttext.train"
with open(train_file, "w", encoding="utf-8") as f:
    for sample in train_samples:
        f.write(sample + "\n")

# Train the FastText classifier
model = fasttext.train_supervised(input=train_file, lr=0.01, dim=100, epoch=5, wordNgrams=2)
model_path = "./cf_model_fasttext_model.bin"
model.save_model(model_path)
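
FastText's supervised mode expects one example per line, with the label as a __label__ prefix. In the pipeline above, FastTextLabelModifier prepares the samples; the hypothetical to_fasttext_line helper below only illustrates the file format that ends up in the training file.

```python
def to_fasttext_line(text, label):
    """Format one document as a single FastText supervised-training line."""
    # FastText reads one example per line, so embedded newlines must be removed.
    flattened = " ".join(text.split())
    return f"{label} {flattened}"


samples = [
    to_fasttext_line("Bài viết\nđược biên tập kỹ.", "__label__hq"),
    to_fasttext_line("click   here >>>\n\nxem thêm", "__label__lq"),
]
for line in samples:
    print(line)
```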

Classify and filter the dataset

Once trained, the classifier is used to filter the dataset, categorizing documents into high and low quality based on the learned distinctions:

from nemo_curator.filters import FastTextQualityFilter
from nemo_curator import ScoreFilter

# Define paths and load the dataset
CF_input_data_dir = HF_output_path
CF_output_path = os.path.join(data_dir, "classifier_filtering/output")
target_dataset = DocumentDataset.read_parquet(CF_input_data_dir, backend="pandas")

# Set up the filtering pipeline
filter_pipeline = ScoreFilter(FastTextQualityFilter(model_path), score_field="quality_score", score_type=float)
filtered_dataset = filter_pipeline(target_dataset)

# Save the filtered dataset
write_to_disk(filtered_dataset.df, output_file_dir=CF_output_path, write_to_filename=True, output_type="parquet")
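
Quality classifiers of this kind are often applied stochastically rather than with a hard cutoff: a document is kept when a Pareto-distributed random draw exceeds 1 minus its quality score, so high-scoring documents are almost always kept while some lower-scoring ones survive. We believe FastTextQualityFilter follows this approach with a default alpha of 3, but treat that as an assumption; the keep_document function below is a standalone standard-library sketch, not the library's code.

```python
import random


def keep_document(quality_score, alpha=3.0, rng=random):
    """Keep a document if a Pareto-distributed draw exceeds 1 - quality_score."""
    # paretovariate() returns values >= 1; subtracting 1 shifts the support to >= 0.
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - quality_score


rng = random.Random(42)
trials = 10_000
for score in (0.1, 0.5, 0.9):
    kept = sum(keep_document(score, rng=rng) for _ in range(trials))
    print(f"score={score}: kept {kept / trials:.1%} of draws")
```

The keep rate grows with the score, which preserves a little low-scoring content for diversity instead of discarding everything below a fixed threshold.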

Remove sensitive and sentimental data

After classifier-based filtering, samples in the Adult and Sensitive Subjects domains, as well as samples with Positive and Negative sentiment, were notably reduced. This makes the model safer, more neutral, and better at handling diverse contexts with appropriate responses.

This visualization shows the impact of classifier-based filtering in removing sensitive and sentimental data. In the heuristic-filtered dataset, 4542.31K samples (6.97%) belong to the "Sensitive Subjects" domain, which falls to 273K samples (2.02%) after classifier-based filtering. For "Adult" content, the count drops from 481.03K (0.74%) to 31K (0.23%). On the right, "Positive" samples fall from 4640K (7.12%) to 744K (5.63%), and "Negative" samples fall from 524K (0.80%) to 77K (0.59%).
Figure 7. Count of sensitive domain (left) and sentimental samples (right) before and after applying classifier-based filtering

Preserve content diversity

Run the domain classifier model again and check the diversity of the content of the dataset. The dataset now shows a balanced distribution across domains, with most holding 3% to 8% of the data. This variety, from News and Law to specialized areas like Games and Autos, ensures the model can handle a wide range of topics. Even after filtering to improve quality and remove harmful content, the essential diversity is preserved, which is crucial for building a versatile, general-purpose language model.

This pie chart illustrates the domain distribution in the final dataset as identified by a domain classifier model. The largest domain is Arts and Entertainment at 7.86%, followed by Health at 7.42%, and People and Society at 7.07%. Other notable categories include Sports (6.66%), News (6.46%), and Food and Drink (5.7%). Smaller domains include Pets and Animals at 1.73% and Finance at 2.67%.
Figure 8. Domain proportions in the final dataset identified by the Domain classifier model

Reduce dataset size after each stage

Approximately 90% of the dataset is removed across the pipeline, consisting of low-quality, noisy, or malformed documents. This selective filtering ensures that the training data is of high quality. The largest reduction comes from classifier-based filtering (45.43%), which indicates that a substantial amount of content is flagged as lower quality or harmful and removed in this phase. Heuristic filtering accounts for 35.74% of data removal, targeting issues like sample length, repeated n-grams, and noise. Exact deduplication removes a smaller portion of the data (8.31%).

This bar chart displays the proportions of data filtered at each phase across different datasets, showing that 90% of data was removed to ensure high-quality training data. In the "All" dataset, 45.43% of data was removed through classifier-based filtering, 35.74% through heuristic filtering, and 8.31% due to duplication. For the "Binhvq_News" dataset, classifier-based filtering accounted for the largest reduction at 48.04%, followed by heuristic filtering at 36.6%. The "Wiki_Vietnamese" dataset saw 49.59% of data removed via classifier-based filtering and 37.31% via heuristic filtering, with 11.31% due to duplication.
Figure 9. A breakdown of how much data is filtered out at each curation phase across four different datasets
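
As a quick arithmetic check on the figures for the combined dataset, the per-stage removal percentages can be summed to recover the overall reduction. This assumes each percentage is expressed relative to the original dataset size, which is consistent with the stages adding up to roughly 90%.

```python
# Share of the original dataset removed at each stage (the "All" figures quoted above).
removed_pct = {
    "exact_deduplication": 8.31,
    "heuristic_filtering": 35.74,
    "classifier_filtering": 45.43,
}

total_removed = sum(removed_pct.values())
retained = 100.0 - total_removed

print(f"Total removed: {total_removed:.2f}%")
print(f"Retained:      {retained:.2f}%")
```

The three stages together remove 89.48% of the data, leaving about 10.5% of the original documents in the final dataset.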

Embedding visualization

The final dataset demonstrates a similar distribution to the original one. The diversity of topics is still preserved, with most domains remaining well represented. Some smaller clusters appear slightly more defined after this step, which could be due to the removal of low-quality or harmful content.

Through both heuristic filtering and classifier-based filtering, the dataset maintains its broad range of domain diversity. The distinct clusters that represent specific domains remain well-defined, while the more general and overlapping domains continue to show interconnections, ensuring that the dataset remains balanced and comprehensive for pretraining purposes.

This image showcases UMAP visualizations comparing a 5% sample of the raw dataset (left) and a 5% sample of the classifier-based filtered dataset (right). The visualizations demonstrate that the final dataset maintains domain diversity even after filtering, with well-distributed clusters for each domain. Both visualizations depict a range of 26 domains, such as Arts and Entertainment, Health, and News, represented by distinct colors. The classifier-based filtering retains the main structure and diversity of the raw dataset, ensuring well-represented domains post-filtering.
Figure 10. UMAP visualization of 5% of the raw dataset (left) and  5% of the Classifier-based filtered dataset (right)

Conclusion

This blog post showcases the data curation pipeline Viettel Solutions used for Vietnamese text data, along with an analysis to explore how each stage of the curation process impacts the dataset. The pipeline uses NVIDIA NeMo Curator, a valuable tool for preparing large datasets for pretraining language models, focusing on quality, efficiency, and scalability. It offers a range of significant advantages in the data curation process, including:

  • Improving dataset quality by removing noise and harmful content using heuristic and classifier-based filters.
  • Preserving the essential structure of the dataset, ensuring that the core characteristics remain intact post-curation.
  • Adapting to different datasets, providing a tailored approach that meets the specific needs of each corpus.

To see the full code used for this post, reference the Jupyter Notebook. Check out the NeMo Curator example scripts for other techniques such as Fuzzy Deduplication and PII redaction.
