Algorithm Description
Extreme Gradient Boosting (XGBoost) is a decision-tree-based machine learning algorithm used for classification and regression problems. Gradient boosting builds decision trees sequentially (rather than in parallel and independently, as Random Forest does) so that each subsequent tree aims to reduce the errors of the previous trees. Each tree learns from its predecessors and updates the residual errors; the tree that grows next in the sequence therefore learns from the mistakes of the trees before it.
The base learners in gradient boosting are 'weak learners': models with high bias whose predictive power is only slightly better than random guessing. Each weak learner (tree) contributes useful information for prediction, and combining them produces a strong learner that brings down both bias and variance. Additionally, in contrast to Random Forest, where trees are grown to their maximum extent, boosting uses shallow trees with few splits.
The intuition behind gradient boosting is to repeatedly exploit the patterns in the residuals in order to strengthen a model with weak predictions. Once a stage is reached where the residuals no longer contain any pattern that can be modeled, the process stops. Algorithmically, we are minimizing the loss function so that the test loss approaches its minimum.
Looking at the algorithm mathematically, these are the steps that are followed (a short code sketch of this loop appears after the list):
1. A simple decision tree is fit on the data.
2. Error residuals are calculated from the original tree's predictions (e1 = y - y_predicted1).
3. A new tree is built with the error residuals as the target variable and the same input predictors.
4. The predicted residuals are added to the previous predictions (y_predicted2 = y_predicted1 + e1_predicted).
5. Another tree is fit on the new residuals (e2 = y - y_predicted2). Steps 2-5 are repeated until the model starts to overfit or the sum of residuals becomes constant. Overfitting can be controlled by checking accuracy on validation data.
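The sketch below illustrates this residual-fitting loop in Python, using scikit-learn's DecisionTreeRegressor as the weak learner. The dataset, learning rate, and number of rounds are placeholders chosen only for illustration, not Lityx IQ settings.

```python
# Minimal sketch of the residual-fitting loop described above.
# The weak learner, data, and hyperparameter values are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

learning_rate = 0.1                                   # 'eta' / shrinkage
n_rounds = 100                                        # boosting rounds
prediction = np.full_like(y, y.mean(), dtype=float)   # start from a constant prediction
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                        # step 2: error residuals
    tree = DecisionTreeRegressor(max_depth=3)         # weak learner with few splits
    tree.fit(X, residuals)                            # step 3: fit a tree to the residuals
    prediction += learning_rate * tree.predict(X)     # step 4: add predicted residuals
    trees.append(tree)                                # step 5: repeat on the new residuals

print("final training MSE:", np.mean((y - prediction) ** 2))
```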
Below are some features of the XGBoost implementation that make it a powerful algorithm.
- Parallelization: Although trees are built sequentially, XGBoost parallelizes the construction of each individual tree (the search over features and candidate split points), which speeds up training significantly.
- Tree Pruning: The stopping criterion for tree splitting within the GBM framework is greedy in nature and depends on the negative loss criterion at the point of a split. XGBoost instead grows trees to the specified 'max_depth' and then prunes them backward, removing splits that do not yield a positive gain. This 'depth-first' approach improves computational performance significantly.
- Hardware Optimization: The algorithm is designed to make efficient use of hardware resources. This is accomplished through cache awareness, allocating internal buffers in each thread to store gradient statistics.
- Regularization: XGBoost penalizes more complex models through both LASSO (L1) and Ridge (L2) regularization to prevent overfitting.
- Sparsity Awareness: XGBoost naturally admits sparse input features by automatically 'learning' the best direction for missing values based on training loss, and handles different types of sparsity patterns in the data efficiently.
- Weighted Quantile Sketch: Employs the 'weighted quantile sketch' algorithm to effectively find optimal split points on weighted datasets.
- Cross Validation: Comes with a built-in cross-validation method at each iteration, removing the need to program this search explicitly or to specify the exact number of boosting iterations in a single run (see the example below).
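As an illustration of the built-in cross-validation, here is a minimal sketch using the Python xgboost package; the dataset and parameter values are placeholders, not recommendations.

```python
# Sketch of XGBoost's built-in cross-validation (xgb.cv).
# Dataset and settings are illustrative only.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
dtrain = xgb.DMatrix(X, label=y)   # DMatrix also accepts sparse inputs natively

params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.05}

# xgb.cv evaluates every boosting iteration across folds, so the number of
# rounds can be chosen from the CV results rather than fixed in advance.
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=250,
    nfold=5,
    metrics="logloss",
    early_stopping_rounds=20,
    seed=0,
)
print(cv_results.tail())
```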
Lityx IQ Parameters
Booster Type [default = GB Tree] - This parameter determines which booster is used for boosting, commonly a tree or linear model. Research has shown that tree and linear base learners yield comparable results for classification problems, while tree learners are superior for regression problems. It was also found that tree-based XGBoost models suffer from higher estimation variance than their linear counterparts.
- Options include 'gbtree' or 'gblinear'. gbtree uses tree-based models, while gblinear uses linear functions.
Number of Rounds - The number of rounds for boosting.
- Generally recommended to be at least 250. This parameter should be tuned together with 'learning rate', as a balance between the two prevents overfitting while maximizing performance.
- The more rounds, the longer the training time.
Learning Rate [default = 0.05, alias: 'eta' or 'shrinkage']
- Range [0,1]
- Step size shrinkage used in updates to prevent overfitting. After each boosting step, the weights of new features can be obtained directly, and 'eta' shrinks these feature weights to make the boosting process more conservative.
- Ideally, 'eta' should be set as low as possible. However, as the learning rate gets lower, the number of training rounds needs to increase.
- Increasing the learning rate makes computation faster but tends to decrease model performance; decreasing 'eta' makes computation slower, but model performance is often better.
- There is therefore a tradeoff between the number of trees and the learning rate, as sketched in the example below.
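The following sketch illustrates this tradeoff with the Python xgboost package: a low 'eta' combined with a generous round budget, letting early stopping on a validation set decide when to stop. The data and parameter values are placeholders.

```python
# Illustrative sketch of the eta / number-of-rounds tradeoff:
# a small learning rate, many allowed rounds, and early stopping on a
# validation set. Values here are placeholders, not recommendations.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=5.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {"objective": "reg:squarederror", "eta": 0.05, "max_depth": 6}

booster = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,          # allow many rounds when eta is small
    evals=[(dvalid, "validation")],
    early_stopping_rounds=50,      # stop once validation loss stops improving
    verbose_eval=False,
)
print("best iteration:", booster.best_iteration)
```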
Maximum Tree Depth [default=6]
- Range [0, ∞]
- This parameter controls the maximum depth of a tree. Increasing the value makes the model more complex, and more likely to overfit. We would generally suggest going no higher than 10, although using 6 or lower is probably best.
Minimum Node Sample [default=1]
- Range [0, ∞]
- Minimum sum of instance weights needed in a node. If the tree partition step results in a leaf node with a sum of instance weights less than the minimum node sample, the building process will give up further partitioning.
- The larger the value, the more conservative the algorithm will be.
Ratio of Training Dataset [default=1]
- Range (0,1]
- % of rows used to build each tree. Subsampling happens each time a tree is built, and generally helps prevent overfitting.
Column Subsampling Percent [default=1]
- Range (0,1]
- % of features (randomly selected) that will be used to train each tree.
Minimum Loss Reduction [default=0, alias: ‘gamma’]
- Range [0, ∞]
- Gamma is a regularization parameter that controls the complexity of a given tree.
- Minimum loss reduction required to make a further partition on a leaf node of the tree.
- When gamma is specified to be greater than 0, XGBoost will grow the tree to the specified max depth, but then prune the tree to find and remove splits that do not meet the specified gamma.
- The larger gamma is, the more conservative the algorithm will be.
L2 Regularization Term on Weights [default=1, alias: ‘lambda’]
- Range [0, ∞]
- L2 regularization term on weights. Increasing the value will make the model more conservative.
L1 Regularization Term on Weights [default=0, alias: ‘alpha’]
- Range [0, ∞]
- L1 regularization term on weights. Increasing the value will make the model more conservative.
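To tie the parameters above together, here is one possible mapping onto the underlying xgboost Python API. The correspondence between Lityx IQ parameter names and xgboost parameter names is inferred from the descriptions in this section and may not match Lityx IQ's internal implementation exactly; the data and objective are placeholders.

```python
# Assumed mapping of the Lityx IQ parameters above onto the xgboost Python API.
# The parameter-name correspondence is an inference from this document's
# descriptions, not a confirmed Lityx IQ implementation detail.
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "booster": "gbtree",          # Booster Type
    "eta": 0.05,                  # Learning Rate
    "max_depth": 6,               # Maximum Tree Depth
    "min_child_weight": 1,        # Minimum Node Sample
    "subsample": 1.0,             # Ratio of Training Dataset
    "colsample_bytree": 1.0,      # Column Subsampling Percent
    "gamma": 0,                   # Minimum Loss Reduction
    "lambda": 1,                  # L2 Regularization Term on Weights
    "alpha": 0,                   # L1 Regularization Term on Weights
    "objective": "binary:logistic",
}

model = xgb.train(params, dtrain, num_boost_round=250)  # Number of Rounds
```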