Consequences of using XGBoost regressor for a small dataset (< 500 rows)

Data Science Asked by Anubhav Nehru on August 5, 2021

I am using an XGBoost regressor to train a model on 322 rows of data. The train/test split shapes (X_train, y_train, X_test, y_test) are: (257, 9), (257,), (65, 9), (65,).

I am using the following parameters for hyper-parameter tuning:

{'max_depth': 3,
 'min_child_weight': 6,
 'eta': 0.3,
 'subsample': 0.9,
 'colsample_bytree': 0.7,
 'objective': 'reg:linear',
 'eval_metric': 'rmse',
 'reg_lambda': 0,
 'reg_alpha': 0.5,
 'gamma': 0}
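
For context, a minimal sketch of how these parameters could be wired into training via the native xgb.train API; the variable names X_train, y_train, X_test and y_test are assumptions (only the shapes are shown above), and the objective is written as 'reg:squarederror', the replacement for the deprecated 'reg:linear' alias in recent XGBoost releases:

import xgboost as xgb

# Assumed to exist from the train/test split described above.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    'max_depth': 3, 'min_child_weight': 6, 'eta': 0.3,
    'subsample': 0.9, 'colsample_bytree': 0.7,
    'objective': 'reg:squarederror',  # replaces the deprecated 'reg:linear'
    'eval_metric': 'rmse',
    'reg_lambda': 0, 'reg_alpha': 0.5, 'gamma': 0,
}

# Early stopping on the held-out set caps the number of boosting rounds,
# which matters on a dataset this small.
model = xgb.train(params, dtrain, num_boost_round=500,
                  evals=[(dtest, 'test')], early_stopping_rounds=20)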

I am getting the following results:

Train results:

MAE =  43.95317769328908
RMSE =  69.32233101307436
R2 score =  0.7500463354991436

--------------------------------------------    

Test results : 

MAE =  51.21307032658503
RMSE =  79.65759750390318
R2 score =  0.6569142423871053

What are the drawbacks of training an XGBoost model on such a small dataset? I know about overfitting, but I can control it to some extent with regularization.

One Answer

The other problem with so little data is that you are stuck with this single solution, and it may not be the best one. Yes, you can use regularization, but a better solution may still exist. Since the aim of boosted trees is to keep reducing the residual error, each boosting iteration tries to fit the examples the previous iteration got wrong, which makes overfitting on a small dataset hard to avoid.

But there is a way to check whether this is a reasonable solution: use cross-validation. Also try LightGBM and CatBoost, since they are good libraries for the same purpose. Finally, you can combine all three into an ensemble that gives you the best of the three worlds; a rough sketch follows.
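
As an illustrative sketch (not part of the original answer), this is one way to compare the three libraries on the same cross-validation splits and then average them; X, y and the placeholder hyper-parameters are assumptions:

from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# X, y: the full feature matrix and target (322 rows), assumed to be defined.
models = {
    'xgboost': XGBRegressor(max_depth=3, learning_rate=0.3, subsample=0.9,
                            colsample_bytree=0.7, reg_alpha=0.5, n_estimators=200),
    'lightgbm': LGBMRegressor(max_depth=3, learning_rate=0.1, n_estimators=200),
    'catboost': CatBoostRegressor(depth=3, learning_rate=0.1, iterations=200, verbose=0),
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Score each library on the same folds with RMSE.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring='neg_root_mean_squared_error')
    print(f'{name}: RMSE = {-scores.mean():.2f} (+/- {scores.std():.2f})')

# Simple averaging ensemble of the three boosters.
ensemble = VotingRegressor(list(models.items()))
scores = cross_val_score(ensemble, X, y, cv=cv,
                         scoring='neg_root_mean_squared_error')
print(f'ensemble: RMSE = {-scores.mean():.2f}')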

Answered by Abhishek Verma on August 5, 2021
