TransWikia.com

Dealing with unseen data/categories in machine learning models for stream data

Data Science Asked by Maeaex1 on April 28, 2021

I want to build machine learning models (XGBoost and LightGBM) that handle data arriving as a weekly stream. The models are retrained on a bi-weekly basis. The data contains order information, and I want to predict the likelihood that an order will be delivered. Orders are entered into the system, and one week later it is known whether they were actually delivered.

For nominal features such as supplier, I use pd.get_dummies() for the transformation. However, let's say I receive the order data for the orders that arrive next week, and it contains a new supplier that the trained model doesn't know yet: the column supplier_new_unkown_supplier does not exist among the features the saved model was trained on.
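
A minimal reproduction of the mismatch might look like the sketch below. The supplier names and the two tiny frames are made up for illustration; only pd.get_dummies() comes from the setup described above.

```python
import pandas as pd

# Hypothetical training batch and next week's batch (illustrative values only)
train = pd.DataFrame({"supplier": ["supplier_a", "supplier_b"]})
new = pd.DataFrame({"supplier": ["supplier_a", "new_unkown_supplier"]})

# One-hot encode each batch independently, as in the current pipeline
train_cols = pd.get_dummies(train, columns=["supplier"]).columns
new_cols = pd.get_dummies(new, columns=["supplier"]).columns

# Columns the trained model has never seen
unseen = sorted(set(new_cols) - set(train_cols))
# Columns the trained model expects but the new batch does not produce
missing = sorted(set(train_cols) - set(new_cols))

print("unseen: ", unseen)
print("missing:", missing)
```

Because each batch is encoded on its own, the two column sets differ in both directions, so the new frame cannot be passed to the saved model as it is.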

Does anyone know how to deal with such cases?

One Answer

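If you want to run inference on an order that contains a category value not seen in the training data, you could train the model on a hash bucket representation of that variable. Every value, seen or unseen, is hashed into one of a fixed number of buckets, so the set of input features stays the same when a new supplier appears.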

If using tensorflow, you can leverage: https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_hash_bucket

Or implement it yourself.
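
A hand-rolled version could look roughly like the sketch below, which uses pandas rather than TensorFlow. The names hash_bucket and encode_supplier, the bucket count of 32, and the example suppliers are all assumptions made for illustration. It uses hashlib instead of Python's built-in hash(), because the built-in string hash is randomized per process by default and would not give the same buckets at training and inference time.

```python
import hashlib

import pandas as pd

N_BUCKETS = 32  # assumed value; tune it to the number of distinct suppliers


def hash_bucket(value, n_buckets=N_BUCKETS):
    """Map any value to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def encode_supplier(df, n_buckets=N_BUCKETS):
    """One-hot encode the hashed supplier with a fixed set of columns."""
    buckets = df["supplier"].map(lambda s: hash_bucket(s, n_buckets))
    # Declaring all bucket ids as categories makes get_dummies emit
    # every column, including buckets that do not occur in this batch.
    buckets = pd.Series(
        pd.Categorical(buckets, categories=list(range(n_buckets))),
        index=df.index,
    )
    return pd.get_dummies(buckets, prefix="supplier_bucket")


# Hypothetical batches: the second one contains a supplier unseen in training
train = pd.DataFrame({"supplier": ["acme", "globex"]})
new = pd.DataFrame({"supplier": ["acme", "new_unknown_supplier"]})

train_encoded = encode_supplier(train)
new_encoded = encode_supplier(new)

# Both batches now share exactly the same feature columns
print(list(train_encoded.columns) == list(new_encoded.columns))
```

The trade-off is that different suppliers can collide in the same bucket, so the model cannot always tell them apart. More buckets reduce collisions at the cost of more, sparser columns.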

Answered by zfact0rial on April 28, 2021
