XGBoost Overfitting?

Apr 19, 2022
Gary Robinson agent wrote
XGBoost is great, but there may be a concern that it produces models that fit the training data too tightly, i.e., that overfit. Are there ways to reduce the "tightness" of the model, for example in the estimation of the predictor coefficients? Relatedly, is there any generally accepted measure of overfitting?
1 Answer
Apr 19, 2022
Gary Robinson agent wrote
From: Lance Culnane

Thanks Christina - I was just about to say similar things! I agree that it is a good idea to try different XGBoost (or any ML) parameters in a step-wise fashion, while keeping an eye on overfitting vs. underfitting.

In addition, I sometimes alter these parameters:

• Decrease the number of rounds (the default of 2000 may be too high for some applications; I sometimes use 100)
• Increase L1 and L2 regularization; these are designed to help prevent overfitting.

There is no one method that always overfits or underfits. It is up to the user to keep track of whether the model is overfitting or underfitting, and then adjust the incoming data and ML parameters appropriately. The default parameters probably won't overfit too badly on datasets with more than 100k rows.

From: Christina Back

This can be a difficult question to answer. There are things you can do to limit overfitting of the model, but there generally isn't just one parameter that will do it. You also won't really be able to tell whether a parameter change has worked unless you score the out-of-sample dataset and calculate performance metrics, which can sometimes be a laborious task in Lityx IQ. In short, an XGBoost model is prone to overfitting when it is "too complex". There are a few different ways to control overfitting within XGBoost.

One way is to add randomness to make training more robust to noise, which can be done by tuning the Ratio of Training Dataset and Column Subsampling Percent parameters. Overall, these are probably the easiest to understand.

1. Column Subsampling Percent is the fraction of columns used for each tree. Lower values help avoid overfitting. When I run models, I generally lower this if there are "a lot" of predictors; in that case, try values from 0.3 to 0.8. If there aren't too many, I would stay between 0.8 and 1 (the default).
2. Ratio of Training Dataset is the fraction of rows used for each tree. Again, lower values help avoid overfitting, but I wouldn't recommend dropping too many rows, as performance would suffer. Use values between 0.8 and 1 (the default).

Another way is by controlling model complexity, which can be done by tuning Maximum Tree Depth, Minimum Node Sample, and/or Minimum Loss Reduction (aka Gamma).

1. Maximum Tree Depth – Lower values help avoid overfitting. The default is currently 6, but I might try decreasing this to as low as 3.
2. Minimum Node Sample – Larger values help avoid overfitting. The default is 2, but you could make this as large as 5.
3. Minimum Loss Reduction – This acts like a regularization parameter. Larger values help avoid overfitting. I would adjust this to some value between 1 and 5 if you are seeing overfitting.

The other parameters can also be tuned to help with overfitting, but these are probably the most user-friendly. I will generally tune one or two parameters at a time, although this can be somewhat time consuming; I would not try to adjust all of these values at once. You can start by nudging Column Subsampling Percent or Ratio of Training Dataset down a little (to 0.8, maybe?), then start playing with the maximum tree depth. You can also try setting Minimum Loss Reduction to 1. How the parameters need to be tuned is very data dependent, so there is no exact strategy. These parameters are also quite technical, so they can be hard to explain or understand if you're not familiar with XGBoost, or at least with decision trees in general.