Road to ML Engineer #11 - XGBoost

Last Edited: 8/15/2024

This blog post introduces how XGBoost works in machine learning.

ML

XGBoost, which stands for eXtreme Gradient Boosting, is arguably the most popular and powerful machine learning algorithm today. It is fast, versatile, and easy to implement, all desirable qualities in a machine learning algorithm. In this article, I would like to describe the mathematics behind XGBoost so that you can have a clear understanding of how it works and why it works so well.

XGBoost

The algorithm for XGBoost can be defined as follows:

Input: Data $\{(x_i, y_i)\}_{i=1}^{n}$ and a differentiable loss function $L(y, F(x))$.

Step 1: Initialize the model with a constant value $F_0(x)$.

Step 2: for $m = 1$ to $M$:

  • Compute $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$ for $i = 1, \dots, n$.
  • Fit a tree to the $r_{im}$ values and create terminal regions $R_{jm}$ for $j = 1, \dots, J_m$ by maximizing the gain.
  • For $j = 1, \dots, J_m$, compute the output value $O_{jm} = \arg\min_{O} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + O) + \frac{1}{2}\lambda O^2$.
  • Update $F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} O_{jm}\, I(x \in R_{jm})$.

These steps are quite similar to the definition of normal gradient boosting, but there are a few differences in the definition and implementation that are not explicitly shown here.
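
To make the loop above concrete, below is a minimal sketch in Python, not the actual XGBoost implementation: it assumes squared-error loss, grows only depth-1 trees (stumps), skips pruning, and uses the regularized leaf outputs and similarity-based gain derived later in this article. All function and variable names are my own.

```python
import numpy as np

def similarity(res, lam):
    # similarity score: (sum of residuals)^2 / (number of residuals + lambda)
    return res.sum() ** 2 / (len(res) + lam)

def fit_xgboost_like(X, y, M=50, nu=0.3, lam=1.0):
    """Boosting loop sketch for squared-error regression with depth-1 trees."""
    F0 = y.mean()                              # Step 1: initialize with a constant value
    F = np.full(len(y), F0)
    stumps = []
    for m in range(M):                         # Step 2: for m = 1 to M
        r = y - F                              # pseudo-residuals (negative gradient of 1/2 (y - F)^2)
        best = (None, None, -np.inf)           # (feature, threshold, gain)
        for j in range(X.shape[1]):            # exact greedy search over split candidates
            for t in np.unique(X[:, j]):
                left = X[:, j] <= t
                if left.all() or not left.any():
                    continue
                gain = (similarity(r[left], lam) + similarity(r[~left], lam)
                        - similarity(r, lam))
                if gain > best[2]:
                    best = (j, t, gain)
        if best[0] is None:                    # no valid split found
            break
        j, t, _ = best
        left = X[:, j] <= t
        O_left = r[left].sum() / (left.sum() + lam)      # regularized leaf outputs
        O_right = r[~left].sum() / ((~left).sum() + lam)
        F = F + nu * np.where(left, O_left, O_right)     # F_m = F_{m-1} + nu * tree output
        stumps.append((j, t, O_left, O_right))
    return F0, stumps
```

The real library grows deeper trees, prunes them with $\gamma$, uses first- and second-order statistics for arbitrary losses, and applies the system-level optimizations discussed at the end of this article; the sketch only captures the overall structure.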

Regularization

The most notable difference between normal gradient boosting and XGBoost is the regularization. In addition to the pruning that happens after building the tree in both boosting methods, XGBoost applies L2 regularization by adding a regularization term to its loss function.

$$
L(y, F(x) + O_{value})_{reg} = \left[\frac{1}{2} (y-(F(x)+O_{value}))^2\right] + \frac{1}{2} \lambda O_{value}^2 \\
L(y, p_{F(x)} + O_{value})_{clf} = -[y \log(p_{F(x)}+O_{value}) + (1-y) \log(1-(p_{F(x)}+O_{value}))] + \frac{1}{2} \lambda O_{value}^2
$$

(In XGBoost, the output value is denoted as $O_{value}$ instead of $\gamma$, and $\gamma$ is instead used for the penalty rate for pruning.) The regularization term is divided by two to simplify the calculation when taking the derivative of the loss function. This regularization term impacts both how the tree is built and how the output value is calculated. Let's compute the output value for XGBoost regression to observe the difference.

$$
\frac{\partial}{\partial O_{value}} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + O_{value}) = -\sum_{x_i \in R_{jm}} (y_i-(F_{m-1}(x_i)+O_{value})) + \lambda O_{value} \\
-\sum_{x_i \in R_{jm}} (y_i- F_{m-1}(x_i)) + R\, O_{value} + \lambda O_{value} = 0 \\
(R+\lambda)\, O_{value} = \sum_{x_i \in R_{jm}} (y_i- F_{m-1}(x_i)) \\
O_{value} = \frac{\sum_{x_i \in R_{jm}} (y_i- F_{m-1}(x_i))}{R+\lambda}
$$

, where $R$ is the number of residuals in the leaf. Instead of simply taking the average of the residuals to compute each output value, as in gradient boosting, XGBoost adds the regularization rate $\lambda$ to the denominator so that the output value shrinks as the regularization rate increases. We can tune the regularization rate with cross-validation to achieve the best results.
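
As a quick sanity check with made-up residuals, the snippet below shows how the leaf output shrinks as $\lambda$ grows:

```python
import numpy as np

residuals = np.array([2.0, -1.0, 3.0, 0.5])   # y_i - F_{m-1}(x_i) for one leaf (made-up values)
R = len(residuals)

for lam in [0.0, 1.0, 10.0]:
    O_value = residuals.sum() / (R + lam)      # regularized leaf output
    print(lam, round(O_value, 3))              # 1.125 -> 0.9 -> 0.321 as lambda increases
```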

You can see that the effect is similar for classification as well, but the mathematics is a bit more complex due to the Taylor approximation performed on the loss function.

$$
\sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + O_{value}) \approx \sum_{x_i \in R_{jm}} \left[ L(y_i, F_{m-1}(x_i)) + \frac{d}{dF} L(y_i, F_{m-1}(x_i))\, O_{value} + \frac{1}{2} \frac{d^2}{dF^2} L(y_i, F_{m-1}(x_i))\, O_{value}^2 \right] + \frac{1}{2} \lambda O_{value}^2 \\
\frac{\partial}{\partial O_{value}} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + O_{value}) \approx \sum_{x_i \in R_{jm}} \left[ \frac{d}{dF} L(y_i, F_{m-1}(x_i)) + \frac{d^2}{dF^2} L(y_i, F_{m-1}(x_i))\, O_{value} \right] + \lambda O_{value} \\
\sum_{x_i \in R_{jm}} \left[ \frac{d}{dF} L(y_i, F_{m-1}(x_i)) + \frac{d^2}{dF^2} L(y_i, F_{m-1}(x_i))\, O_{value} \right] + \lambda O_{value} = 0 \\
O_{value} = - \frac{\sum_{x_i \in R_{jm}} \frac{d}{dF} L(y_i, F_{m-1}(x_i))}{\sum_{x_i \in R_{jm}} \frac{d^2}{dF^2} L(y_i, F_{m-1}(x_i)) + \lambda}
$$

From the previous article, we know that the first-order and second-order derivatives of the loss function with respect to $F$ are $-(y-p)$ and $p(1-p)$ respectively, so the pseudo-residual (the negative gradient) is $y-p$. Hence, we can rewrite the above as follows:

$$
O_{value} = \frac{\sum_{x_i \in R_{jm}} (y_i - p_{F_{m-1}(x_i)})}{\sum_{x_i \in R_{jm}} p_{F_{m-1}(x_i)}(1-p_{F_{m-1}(x_i)})+\lambda}
$$

We can confirm that XGBoost computes the output values of the terminal regions by adding the regularization rate $\lambda$ to the denominator in both regression and classification.
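
Again with made-up numbers, here is a short sketch of the classification output value, using the residuals $y_i - p$ in the numerator and $\sum p(1-p) + \lambda$ in the denominator:

```python
import numpy as np

y = np.array([1, 0, 1, 1])               # labels in the leaf (made-up values)
p = np.array([0.6, 0.4, 0.7, 0.3])       # previously predicted probabilities p_{F_{m-1}(x_i)}
lam = 1.0

residuals = y - p                        # first-order information
hessians = p * (1 - p)                   # second-order information
O_value = residuals.sum() / (hessians.sum() + lam)
print(round(O_value, 3))                 # 1.0 / (0.9 + 1.0) ≈ 0.526
```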

Tree Construction

The addition of the regularization term in the loss function impacts how trees are built as well. When evaluating the quality of the split, XGBoost uses a measure called the similarity score, which corresponds to the loss of each leaf when the optimal output value is chosen for that leaf. For regression, the loss function and output value are as follows:

$$
L(y, F(x) + O_{value})_{reg} = \left[\frac{1}{2} (y-(F(x)+O_{value}))^2\right] + \frac{1}{2} \lambda O_{value}^2 \\
O_{value} = \frac{\sum_{x_i \in R_{jm}} (y_i - F_{m-1}(x_i))}{R+\lambda}
$$

Substituting $O_{value}$ into the loss function and rearranging the equation (the specifics are not shown here, but I recommend trying it out yourself), we get:

$$
L(y, F(x) + O_{value})_{reg} = \frac{1}{2}\sum_{x_i \in R_{jm}}(y_i - F_{m-1}(x_i))^2 - \frac{1}{2}\frac{\left(\sum_{x_i \in R_{jm}} (y_i- F_{m-1}(x_i))\right)^2}{R+\lambda}
$$

The first term does not depend on the output value or on how we split the data, so minimizing the loss amounts to maximizing the negated second term. We therefore use the second term (dropping the constant factor of $\frac{1}{2}$, since only relative values matter) as the similarity score.

$$
Similarity_{score} = \frac{\left(\sum_{x_i \in R_{jm}} (y_i - F_{m-1}(x_i))\right)^2}{R+\lambda}
$$

The similarity score for regression turns out to be just the square of the sum of the residuals, divided by the number of residuals plus the regularization rate. As we negated the loss term, we choose the split that leads to the maximum similarity score instead of the minimum. We stop building the tree when we reach the maximum number of allowed leaves or when we no longer observe improvements in the similarity score.
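
Here is a small sketch of scoring one candidate split with the similarity score; the residuals and the split are made up:

```python
import numpy as np

def similarity(res, lam):
    # (sum of residuals)^2 / (number of residuals + lambda)
    return res.sum() ** 2 / (len(res) + lam)

residuals = np.array([-10.0, 7.0, 8.0])      # residuals reaching the node (made-up values)
lam = 0.0

parent = similarity(residuals, lam)          # all residuals kept in one leaf: 25 / 3 ≈ 8.33
left, right = residuals[:1], residuals[1:]   # candidate split: {-10} vs {7, 8}
gain = similarity(left, lam) + similarity(right, lam) - parent
print(parent, gain)                          # gain = 100 + 112.5 - 8.33 ≈ 204.17
```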

We can use a similar method to compute the similarity score for classification by substituting the corresponding $O_{value}$ into the approximated loss function.

$$
\sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + O_{value}) \approx \sum_{x_i \in R_{jm}} \left[ L(y_i, F_{m-1}(x_i)) + \frac{d}{dF} L(y_i, F_{m-1}(x_i))\, O_{value} + \frac{1}{2} \frac{d^2}{dF^2} L(y_i, F_{m-1}(x_i))\, O_{value}^2 \right] + \frac{1}{2} \lambda O_{value}^2 \\
O_{value} = \frac{\sum_{x_i \in R_{jm}} (y_i - p_{F_{m-1}(x_i)})}{\sum_{x_i \in R_{jm}} p_{F_{m-1}(x_i)}(1-p_{F_{m-1}(x_i)})+\lambda} \\
L(y, F(x) + O_{value})_{clf} \approx \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i)) - \frac{1}{2} \frac{\left(\sum_{x_i \in R_{jm}} (y_i - p_{F_{m-1}(x_i)})\right)^2}{\sum_{x_i \in R_{jm}} p_{F_{m-1}(x_i)}(1-p_{F_{m-1}(x_i)})+\lambda}
$$

Similar to regression, the first term does not depend on the output value, so we negate the second term to get the similarity score for classification. (We also multiply the second term by 2 because we only care about relative losses.)

$$
Similarity_{score} = \frac{\left(\sum_{x_i \in R_{jm}} (y_i - p_{F_{m-1}(x_i)})\right)^2}{\sum_{x_i \in R_{jm}} p_{F_{m-1}(x_i)}(1-p_{F_{m-1}(x_i)})+\lambda}
$$

The similarity score for classification turns out to be the square of the sum of the "residuals" $y_i - p_{F_{m-1}(x_i)}$, divided by the sum of the previously predicted probability times one minus the previously predicted probability, plus the regularization rate.

When we prune the tree, we compute the gain, the difference between the sum of the similarity scores of the child nodes and the similarity score of the parent node, compare it with the penalty $\gamma$, and prune the branch if the gain is smaller than $\gamma$. As a higher $\lambda$ lowers the similarity scores and therefore the gain, $\lambda$ regularizes not only by shrinking the output values but also by encouraging pruning through $\gamma$.
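
As a tiny illustration, the pruning check reuses the gain from the split sketch above and compares it with a made-up $\gamma$:

```python
gain = 204.17     # gain of the split from the earlier sketch
gamma = 130.0     # pruning penalty (made-up value)

if gain - gamma < 0:
    print("prune: remove this branch")   # the split does not pay for its complexity
else:
    print("keep this branch")            # 204.17 - 130.0 > 0, so the split survives
```

For reference, $\lambda$, $\gamma$, and the learning rate $\nu$ correspond to the `reg_lambda`, `gamma`, and `learning_rate` hyperparameters of the xgboost Python package.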

Compatibility with Large Datasets

Although regularization and its unique tree construction contribute to the performance of the model, they do not fully explain the popularity of XGBoost. XGBoost is popular largely due to a number of optimizations that make it compatible with large and sparse training datasets. When the dataset is large, XGBoost divides it into blocks containing subsets of features and rows and searches for the best tree splits in parallel. (Parallel Learning)

Instead of testing all possible splits, it only tests weighted quantiles. (Approximate Greedy Algorithm, Weighted Quantile Sketch) The weights for classification are computed as $p_{F_{m-1}(x)}(1-p_{F_{m-1}(x)})$, which allows finer quantiles to be drawn around data whose previously predicted probability is close to 0.5. (The weights for regression are uniform.)
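
The actual implementation maintains an approximate sketch data structure over the data blocks; the snippet below is only meant to illustrate the idea of choosing split candidates so that each bucket carries roughly the same total weight $p(1-p)$ (all names are my own):

```python
import numpy as np

def weighted_quantile_candidates(x, w, n_buckets):
    """Pick split candidates so each bucket holds about the same total weight."""
    order = np.argsort(x)
    x, w = x[order], w[order]
    cum = np.cumsum(w) / w.sum()                 # cumulative weight, normalized to [0, 1]
    targets = np.arange(1, n_buckets) / n_buckets
    return x[np.searchsorted(cum, targets)]      # feature values at the weighted quantiles

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                        # one feature
p = rng.uniform(0.01, 0.99, size=1000)           # previously predicted probabilities
w = p * (1 - p)                                  # classification weights (largest near p = 0.5)
print(weighted_quantile_candidates(x, w, n_buckets=10))
```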

Sparse datasets often contain many missing values, and many other machine learning algorithms cannot handle such data without preprocessing, which is costly when the dataset is large. XGBoost handles missing values without preprocessing by calculating the gain when the missing values are assigned to the left node and to the right node, and deciding which branch those missing values should be assigned to by default. (Sparsity-aware Split Finding)
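
Here is a rough sketch of the idea, not XGBoost's actual code: the rows with missing feature values are tentatively sent to the left and then to the right, and the direction with the higher gain becomes the default branch for that split:

```python
import numpy as np

def similarity(res, lam=1.0):
    return res.sum() ** 2 / (len(res) + lam)

def best_default_direction(residuals, is_missing, goes_left):
    """Try sending missing rows left, then right; keep the direction with the higher gain."""
    parent = similarity(residuals)
    gains = {}
    for direction in ("left", "right"):
        left_mask = np.where(is_missing, direction == "left", goes_left)
        gains[direction] = (similarity(residuals[left_mask])
                            + similarity(residuals[~left_mask]) - parent)
    return max(gains, key=gains.get), gains

residuals = np.array([1.5, -2.0, 0.5, 3.0, -1.0])          # made-up residuals
is_missing = np.array([False, False, True, False, True])   # rows missing this feature
goes_left = np.array([True, False, False, True, False])    # split decision for the non-missing rows
print(best_default_direction(residuals, is_missing, goes_left))
```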

It also optimizes memory usage to remain usable on large datasets. For example, it stores important, frequently used data, such as output values and gains, in the CPU cache, where memory access is fastest. (Cache-Aware Access) Moderately used data is stored in RAM, the second-fastest memory, and the least used data is stored on disk, the slowest memory to access. As accessing data on disk is quite slow, XGBoost uses compression and sharding (assigning compressed blocks to different disks when multiple drives are available) to make disk access as fast as possible. (Out-of-Core Computation)

Conclusion

XGBoost aggressively optimizes various aspects of gradient boosting to be as fast and performant as possible, making use of regularization, unique tree construction, parallel learning, the weighted quantile sketch, sparsity-aware split finding, cache-aware access, and out-of-core computation. It requires minimal data preprocessing, as it can handle the large, complex, and sparse data that are often encountered in the real world.

It also requires less background knowledge about the data, since you do not have to choose basis functions or kernels based on domain knowledge as you do with linear regression and SVMs. This reduces the effort of communicating with domain experts. This is why XGBoost is not only performant but also extremely popular among data scientists across industries. Now that we understand XGBoost mathematically and conceptually, I would like to introduce how XGBoost can be implemented in Python in the next article.

Resources