Faster Causal Inference on Large Datasets with NVIDIA RAPIDS

As consumer applications generate more data than ever before, enterprises are turning to causal inference methods for observational data to help shed light on how changes to individual components of their app impact key business metrics.

Over the last decade, econometricians have developed a technique called double machine learning that brings the power of machine learning models to causal inference problems. It involves training two predictive models, one for the outcome and one for the treatment variable, on independent splits of the dataset and combining their predictions to build a de-biased estimate of the causal effect of interest.
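To make the idea concrete, here is a minimal NumPy sketch of the partialling-out form of double machine learning with cross-fitting. It is an illustration under stated assumptions, not the DoubleML implementation: the simulated data, the helper names `cross_fit_plr` and `poly_fit_predict`, and the choice of a cubic polynomial as the predictive model are all hypothetical and were picked only to keep the example short.

```python
import numpy as np


def poly_fit_predict(x_train, t_train, x_test, degree=3):
    """Hypothetical nuisance learner: fit a polynomial in x, predict on x_test."""
    coefs = np.polyfit(x_train, t_train, degree)
    return np.polyval(coefs, x_test)


def cross_fit_plr(y, d, x, fit_predict, n_folds=2, seed=0):
    """Estimate theta in y = theta * d + g(x) + noise by partialling out x."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_res = np.empty(len(y))
    d_res = np.empty(len(d))
    for k, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Each model is trained on the other fold(s) and only predicts held-out rows.
        y_res[held_out] = y[held_out] - fit_predict(x[train], y[train], x[held_out])
        d_res[held_out] = d[held_out] - fit_predict(x[train], d[train], x[held_out])
    # Regress outcome residuals on treatment residuals (no intercept).
    return float(d_res @ y_res / (d_res @ d_res))


# Simulated data (an assumption for illustration): x confounds both d and y.
rng = np.random.default_rng(42)
n = 5000
x = rng.uniform(-2.0, 2.0, n)
d = np.sin(x) + rng.normal(0.0, 1.0, n)             # treatment depends on x
y = 0.5 * d + x + x**2 + rng.normal(0.0, 1.0, n)    # true effect of d is 0.5

naive_slope = float(np.polyfit(d, y, 1)[0])          # ignores the confounder x
theta_hat = cross_fit_plr(y, d, x, poly_fit_predict)
print(f"naive slope: {naive_slope:.2f}, cross-fitted estimate: {theta_hat:.2f}")
```

In this simulation the naive regression of y on d absorbs part of the confounder's influence, while the residual-on-residual estimate should land close to the true value of 0.5. A library such as DoubleML wraps the same pattern around arbitrary learners and also reports standard errors.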

Open-source Python libraries like DoubleML make it easy for data scientists to tap into this new technique, but on CPUs they struggle with the volume of data that enterprises need to process.

RAPIDS is a collection of open-source GPU-accelerated data science and AI libraries. cuML is a GPU-accelerated machine learning library for Python with a scikit-learn compatible API.

In this blog post, we illustrate how you can use RAPIDS cuML with the DoubleML library for faster causal inference, enabling you to more effectively work with large datasets.

Why causal inference?

Many data science and machine learning use cases are more focused on the quality of predictions than the exact effect sizes of individual features on the outcome variable. As a result, non-parametric models like random forest (available in scikit-learn) and XGBoost have become a go-to choice for many data scientists.

For some problems, we need to measure the causal effect of one variable (feature) on our target outcome variable. The gold standard for doing this effectively is to run a randomized controlled trial or A/B test and measure average treatment effects across groups. 

Unfortunately, this isn’t always practical for enterprises due to the impact that changes can have on their business. Ideally, we’d be able to find out how much a component of the in-app experience contributes to user churn without running an experiment that risks increasing churn. Causal inference techniques enable estimating that relationship from real-world datasets of user behavior, providing critical guidance about where we should invest resources to improve a product.

Historically, it was challenging to use flexible, non-parametric models like random forest and XGBoost for causal inference. Double machine learning allows us to easily tap into these advancements.

Bringing accelerated computing to double machine learning

Using state-of-the-art machine learning algorithms for causal inference increases the computational requirements for the workflow. With small datasets, this isn’t an issue. But as datasets continue to grow, using DoubleML on CPUs in practice can be a challenge.

In the benchmark below, we lightly adapt an example from the DoubleML documentation and run it on a range of dataset sizes, once with scikit-learn and once with cuML as the underlying learner, to see how performance changes.

import doubleml as dml
from doubleml.datasets import make_plr_CCDDHNR2018
from sklearn.ensemble import RandomForestRegressor
from sklearn.base import clone
import cuml

NROWS = [10000, 100000, 1000000, 10000000]
USE_GPU = True

for N in NROWS:
    # Simulate a partially linear regression dataset with N rows and 100 covariates
    data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=N, dim_x=100, return_type="DataFrame").astype("float32")
    obj_dml_data = dml.DoubleMLData(data, "y", "d")

    if USE_GPU:  # GPU-accelerated RAPIDS cuML
        learner = cuml.ensemble.RandomForestRegressor(n_estimators=200, max_features=100, max_depth=10, min_samples_leaf=2)
    else:  # standard scikit-learn
        learner = RandomForestRegressor(n_estimators=200, max_features=100, max_depth=10, min_samples_leaf=2, n_jobs=-1)

    # Use separate copies of the learner for the two predictive models
    ml_l = clone(learner)
    ml_m = clone(learner)
    dml_plr_obj = dml.DoubleMLPLR(obj_dml_data, ml_l, ml_m).fit()

With hundreds of thousands or millions of records, CPU-based DoubleML pipelines quickly slow down, as the underlying machine learning model becomes the bottleneck. On the 10 million row x 100 column dataset, fitting the DoubleMLPLR pipeline with scikit-learn on the CPU takes more than 6.5 hours. Switching to GPU-accelerated RAPIDS cuML for the underlying model enables it to finish in just 51 minutes, which is a 7.7x speedup.
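The listing above does not show how the run times were collected. One simple way to measure them, assuming that wall-clock time around the `.fit()` call is the quantity of interest, is sketched below. The helper names `wall_clock` and `speedup` are hypothetical, and a cheap stand-in computation replaces the actual model fit so the snippet runs anywhere.

```python
import time
from contextlib import contextmanager


@contextmanager
def wall_clock(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def speedup(baseline_seconds, accelerated_seconds):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds


timings = {}
with wall_clock("demo", timings):
    sum(i * i for i in range(100_000))  # stand-in for dml.DoubleMLPLR(...).fit()

# Figures quoted in the text: more than 6.5 hours on CPU vs. 51 minutes on GPU.
lower_bound = speedup(6.5 * 3600, 51 * 60)
print(f"demo block took {timings['demo']:.4f} s")
print(f"speedup from the quoted times: at least {lower_bound:.1f}x")
```

Because the CPU time is quoted only as "more than 6.5 hours", the ratio computed here is a lower bound, which is consistent with the reported 7.7x.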

Figure 1. Time to fit the DoubleML PLR using random forest regression from scikit-learn (CPU) vs. RAPIDS cuML (GPU) as the underlying learner.

Based on the results, accelerated machine learning libraries like cuML can provide up to 12x speedups compared to using scikit-learn’s CPU-based RandomForestRegressor as the backend model, with minimal code changes required. 
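One way to keep the code change small is to choose the backend at import time. The sketch below is one possible pattern, not part of the original benchmark: the helpers `first_importable` and `make_forest` are hypothetical names, and the approach relies on both `cuml.ensemble` and `sklearn.ensemble` exposing a `RandomForestRegressor` class, as the benchmark code shows.

```python
import importlib


def first_importable(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not installed; try the next candidate
    raise ImportError(f"none of these modules could be imported: {list(module_names)}")


def make_forest(backends=("cuml.ensemble", "sklearn.ensemble"), **params):
    """Build a RandomForestRegressor from the first available backend."""
    ensemble = first_importable(backends)
    return ensemble.RandomForestRegressor(**params)


# Example call (requires cuML or scikit-learn to be installed):
# learner = make_forest(n_estimators=200, max_features=100, max_depth=10, min_samples_leaf=2)
```

The scikit-learn-specific `n_jobs` argument from the benchmark is left out so that one parameter set can be passed to either backend. Note that this sketch only handles a missing package; it does not check whether a usable GPU is present.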

Conclusion

Causal inference can help enterprises better understand key components of their products, but traditionally it’s been challenging to take advantage of innovations in machine learning focused on prediction.

New techniques like double machine learning are bridging this gap, enabling enterprises to use computationally intensive machine learning algorithms for causal inference problems. As datasets grow, CPU-based infrastructure struggles to keep up with productivity demands.

Using accelerated computing libraries like RAPIDS cuML with DoubleML makes it possible to turn hours of waiting into minutes, with minimal code changes.

To learn more about accelerated machine learning, visit the cuML documentation.
