In 2014, Tianqi Chen took the world by storm with the release of the first efficient implementation of gradient boosted decision trees, XGBoost. Within a few months, developers using the new library had already broken several performance records, including winning several Kaggle competitions. The implementation could easily outperform the SOTA methods of the time (mainly random forests and neural networks) when dealing with tabular data.

What made XGBoost so fantastically good? In short, two things: first, the typical CART algorithm for tree growth was enhanced with heavy regularization; second, it was *fast*. The base code (written in C++) has algorithm-level optimizations (such as the histogram method for quickly binning the data), as well as hardware-level optimizations (the way the data is stored in memory and the parallelization strategy).

In 2016, in their typical spirit of stealing somebody else’s idea and throwing money at it until it becomes massive, Microsoft released LightGBM. This library was originally a mostly drop-in replacement for XGBoost in that it used the same method of tree learning, as well as the histogram method of data sorting. However, this new implementation introduced a couple of interesting things, most of which are designed to make it even faster (and it ended up being so fast that XGBoost can even seem slow beside it), and others that modify the way in which trees grow:

- Like XGBoost, it implements the histogram method of data binning. I haven’t personally looked at the code, but have reason to suspect they did something different to make it even more efficient (don’t quote me on this though).
- Gradient-based one-side sampling (GOSS) is a very smart way of reducing the number of samples processed at each iteration. When calculating the pseudo-residuals, the algorithm heavily subsamples the elements with small gradients, forcing the model to focus on the hard-to-learn elements (i.e. those with high gradients) à la AdaBoost; the surviving small-gradient samples are up-weighted so the gradient statistics stay unbiased. This in turn makes things faster because there are fewer elements to consider when evaluating splits.
- Optimal categorical splits are based on a 1958 paper (seriously) by Fisher. In essence, the optimal way of splitting a categorical variable is to build two groups (or partitions) of categories and send each group down its own branch. However, since (nominal) categorical variables lack order by definition, strategies like one-hot encoding tend to make this process exceedingly complex and confusing for a tree, usually requiring several consecutive splits to get right. To find the best partition, LightGBM sorts the categories by the ratio of their accumulated gradient to their accumulated hessian, turning the problem into an ordered split. The closely related Exclusive Feature Bundling (EFB) trick merges highly sparse, mutually exclusive features (one-hot columns being the prime example) into a single dense feature. These strategies allow both an increase in model performance and a reduction in training time.
- Leaf-wise tree growth is a modification of the CART algorithm wherein all available leaves are evaluated before performing a split, and the split is only performed on the one with the greatest loss reduction. This tends to create very deep, asymmetric trees which can easily overfit the data.
- Intrinsic handling of missing values makes imputing (theoretically) unnecessary. The general idea here is that values are rarely missing at random, so the absence of a value carries some information of its own, and imputing it smartly could actually lead to a performance penalty. The implementation learns, for each split, which direction the missing values should go (by default they go right for categorical features). This is a particularly amazing feature, because it allows the model to extract every last unit of information from the dataset. For more details on the implementation take a look at this.
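The GOSS idea can be sketched in a few lines of numpy. This is an illustrative simplification, not LightGBM's actual implementation; `top_rate` and `other_rate` mirror the sampling ratios from the paper:

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Toy GOSS: keep all large-gradient rows, subsample the rest.

    Returns the selected row indices and a weight for each selected row;
    the sampled small-gradient rows are up-weighted by
    (1 - top_rate) / other_rate so the gradient sums stay unbiased.
    """
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(np.abs(gradients))[::-1]  # largest |gradient| first
    top_idx = order[:n_top]
    rest_idx = rng.choice(order[n_top:], size=n_other, replace=False)

    idx = np.concatenate([top_idx, rest_idx])
    weights = np.ones(len(idx))
    weights[n_top:] = (1 - top_rate) / other_rate
    return idx, weights

gradients = np.random.default_rng(42).normal(size=1000)
idx, w = goss_sample(gradients)
```

With the default rates, only 30% of the rows survive each iteration, yet the weighted gradient statistics still approximate those of the full dataset.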

Like XGBoost and most other machine learning tools, LightGBM is primarily written in C++ with some C sprinkled here and there. On top of the core sit the wrappers: a C API, an R package and a Python package. There’s also the CUDA section for GPU training… but that’s a story for another time.

I’m a very Python-oriented data scientist, so we’ll look at the three main Python APIs: the scikit-learn API, the data API and the training API.

As the name implies, the sklearn API follows that library’s fit/predict/predict_proba way of encapsulating functionality in a very convenient, object-oriented way. The data and training APIs, on the other hand, work in concert with each other, taking a very functional approach which I think is a bit cleaner, though it doesn’t play as nicely with scikit-learn objects.

With this concise (I hope) and information-dense introduction, we are ready to move on to some code. In this article we’ll look at the main classes available through the scikit-learn API and how to use the most important features of lightgbm.

Just like all scikit-learn estimators, the LGBMClassifier and LGBMRegressor inherit from `sklearn.base.BaseEstimator`, plus `sklearn.base.ClassifierMixin` and `sklearn.base.RegressorMixin` respectively. This implies that they must implement the following methods:

- `set_params`, which takes keyword parameters and sets them to their respective values.
- `get_params`, which returns a dictionary containing all the constructor parameters.
- `score`, which internally uses the `predict` method to calculate a gain/loss score for a trained model.

Internally, the scikit-learn API uses lightgbm’s native (functional) API, so these classes are basically wrappers that behave in a scikit-learn friendly way.

There’s a third sklearn-compliant class, the `LGBMRanker`, which is designed to work for ranking problems (such as recommendations). As sklearn doesn’t have any ranking classes explicitly implemented, this one will also be left for another time.

## On numpy data

Now, let’s get our hands dirty. Like all sklearn objects, LGBMModels can be trained on numpy arrays, scipy sparse matrices and pandas dataframes/series. Let’s start by creating some numpy arrays:

```python
import numpy as np
from sklearn.datasets import make_classification

categorical_data = np.random.choice(
    a=['a', 'b', 'c'],
    size=(100_000, 1),
)
numerical_data, y = make_classification(
    n_samples=100_000,
    n_features=20,
    n_informative=10,
    n_classes=2,
    random_state=42,
)
```

So now we have some data stored in two numpy arrays. In this particular case, the categorical data has no information about the target, but in practice you’ll probably have some important categorical variable. Methodologically, however, it’s about the same.

As for the modeling, lightgbm only understands categorical features encoded as non-negative integers, so we must first convert them, like so:

```python
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
encoded_categories = encoder.fit_transform(categorical_data)
```

With that out of the way, we can now join the data and train the model:

```python
from lightgbm import LGBMClassifier

X = np.hstack((numerical_data, encoded_categories))
model = LGBMClassifier(random_state=42)
model.fit(X, y, categorical_feature=[20])
```

As of version 4, a pretty annoying warning will appear, saying something along the lines of “categorical_feature keyword has been found in `params` and will be ignored”. This *does not* mean that lightgbm will take that feature as a normal numerical feature, just that the model is internally overriding a variable set to `None`, so don’t panic like I did the first time I saw it. If you want to make sure about what’s going on, here is the code block that triggers the warning.

Getting back on track, the `categorical_feature` parameter takes an iterable (such as a list or a numpy array) with the indices of the categorical columns (column index 20 in this case); if we had several categorical columns, the list would be longer. This way, even though the data is now plain integers, the algorithm can apply its optimal categorical split handling as explained above, ignoring the absolute value of the encoding (which is exactly what we need).

Obviously, we could have put everything inside a `Pipeline` + `ColumnTransformer` combination:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('encoder', OrdinalEncoder(), [20])
], remainder='passthrough')

model = Pipeline([
    ('encoder', preprocessor),
    ('model', LGBMClassifier(random_state=42)),
])

# Note: the ColumnTransformer places the transformed column first and the
# passthrough columns after it, so the categorical column is now at index 0
model.fit(X, y, model__categorical_feature=[0])
```

It’s very important to remember that sklearn composite estimators use the double-underscore notation to refer to their internal objects (`model__categorical_feature` refers to the `categorical_feature` argument of the `model` step).

That was extremely comfortable, wasn’t it? Let me tell you, it’s about to get even better once we move on to pandas.

## On pandas data

The first step is to create a dataframe (this will be unnecessary if your data is already in pandas form):

```python
import pandas as pd

# Build the dataframe from the raw arrays so the categorical column
# keeps its string (object) dtype
X_pd = pd.DataFrame(numerical_data, columns=[f'col_{n}' for n in range(20)])
X_pd['col_20'] = categorical_data
```

So far so good, nothing strange here, just your garden-variety pandas dataframe construction.

Now on to the first interesting part. First we identify the categorical and numerical columns leveraging pandas’ `select_dtypes` method:

```python
numerical_columns = X_pd.select_dtypes(exclude=['object']).columns.to_list()
categorical_columns = X_pd.select_dtypes(include=['object']).columns.to_list()
```

Second, we cast the categorical columns into a more efficient datatype:

```python
X_pd[categorical_columns] = X_pd[categorical_columns].astype('category')
```

In case you didn’t know about it, pandas has an in-built categorical type that allows us to conveniently use strings backed by a very efficient in-memory representation. Better yet, lightgbm understands pandas categories perfectly:

```python
model = LGBMClassifier(random_state=42)
model.fit(X_pd, y)
```

The `LGBMClassifier` will detect the categorical data type and automatically configure its native categorical handling for those columns. That’s it: no encoder, no pipeline, no column transformer needed!

This is designed for lazy people like me:

- Being based on decision trees, the algorithm is robust to outliers, skewed distributions and just about anything.
- The in-built missing-value learning strategy makes imputing mostly unnecessary.
- The in-built categorical value handling makes encoding optional (though I still recommend trying two or three strategies).

All in all, building a composite estimator can not only be unnecessary but could even negatively impact performance.

We now have a trained model. What now? Just like all other sklearn classifiers, the `LGBMClassifier` has both `predict` and `predict_proba` methods, to be used in testing and in production on a new dataset `X_`:

```python
predictions = model.predict(X_)
scores = model.predict_proba(X_)
```

The first method will return a numpy array of 0s and 1s (or, in the multi-class case with k classes, integers from 0 to k-1). The second method will return a 2D numpy array with one column for each class (two for binary classification, etc.) and a row for every observation; these numbers are scores obtained by passing the raw model output through a sigmoid (or softmax), so while they live in [0, 1] and sum to one, treat them as calibrated probabilities at your own risk.

It is important to note that these methods don’t need the categorical features to be pointed out, as that information is already stored inside the trained model.

With that, you can make some plots, calculate some metrics, decide what customers to contact, or whatever your heart’s desire is. That’s it, right? Not quite.

Depending on the industry you work in, interpreting the predictions may be more or less important, and shap is an amazing library that does just that. It is, however, quite slow. To solve this issue, lightgbm developers have added an in-built Shapley-values feature importance method. Luckily for us, the usage is extremely simple:

```python
shap_values = model.predict_proba(X_, pred_contrib=True)
```

Considering our original data had 21 columns, this array will have 22: one for every column in the dataset (in the same order), plus an extra column with the expected value for the dataset (related to the base rate of the positive class); this last column is constant across all rows and can usually be ignored.

You can now analyze the importances in whatever form you like, though I recommend storing them in a dataframe first:

```python
importances = pd.DataFrame(
    shap_values,
    columns=X_pd.columns.to_list() + ['expectation'],
)
```

The other nice and rather unusual thing you can do is build something called a tree embedding, an idea borrowed from sklearn’s `RandomTreesEmbedding`. This generates an integer vector representation of each sample, with one element per tree in the model (so the result has shape `(n_samples, n_trees)`), where each element is the index of the leaf that sample landed in for that particular tree. For example, the row `[3, 5, 18]` means that the sample fell in leaf 3 of the first tree, leaf 5 of the second tree and leaf 18 of the third tree.

```python
embeddings = model.predict_proba(X_, pred_leaf=True)
```

These embeddings can be used to study the dataset or generate an alternate representation of the data or just about anything that can be done with embeddings.

Like with any other regressor or classifier, lightgbm classes can be used to make cross-validated predictions and scoring:

```python
from sklearn.model_selection import cross_val_predict, cross_val_score

fit_params = {'categorical_feature': [20]}

cv_predictions = cross_val_predict(
    model,
    X,
    y,
    cv=5,
    method='predict_proba',
    fit_params=fit_params,
)
cv_scores = cross_val_score(
    model,
    X,
    y,
    cv=5,
    fit_params=fit_params,
)
```

This should be mostly painless (plus or minus the annoying warnings for categorical variables).

Also, lightgbm has many, *many* hyperparameters to tune (don’t panic, we’ll look at lots of them in an upcoming article), so we can use scikit-learn’s tuning classes:

```python
from scipy import stats
from sklearn.model_selection import RandomizedSearchCV

distributions = {
    'n_estimators': stats.randint(low=50, high=1000),
    'max_depth': stats.randint(low=1, high=20),
    'random_state': [42],
}
search = RandomizedSearchCV(
    model,
    param_distributions=distributions,
    n_iter=100,
    cv=5,
    random_state=42,
)
search.fit(X_pd, y)
```

LightGBM is a blazing fast implementation of gradient boosted decision trees, even faster than XGBoost, that can efficiently learn from both missing values and categorical features.

One very comfortable way of using it is through its `LGBMClassifier` and `LGBMRegressor` classes, which follow scikit-learn’s `fit`/`predict`/`predict_proba` interface, as well as `get_params` and `set_params`. This allows the lightgbm classes to be used interchangeably with any sklearn classifier or regressor, including cross-validated prediction/scoring and hyperparameter tuning.