
How to perform data scaling/standardization on dataset containing grouped values?

Data Science Asked by Pieterism on September 2, 2021

So I have a dataset containing the results of running problem instances with different solver strategies. Simplified example:

| Problem_instance | Problem_Size | Used_Solver | Cost |
|------------------|-------------:|-------------|-----:|
| P1               |           50 | A           |   75 |
| P1               |           50 | B           |  125 |
| P1               |           50 | C           |  225 |
| P1               |           50 | D           |  100 |
| P2               |          150 | A           |  165 |
| P2               |          150 | B           |  360 |
| P2               |          150 | C           |  275 |
| P2               |          150 | D           |   45 |
| P3               |           25 | A           |   35 |
| P3               |           25 | B           |   65 |
| ...              |          ... | ...         |  ... |

I’m trying to use machine learning to predict the best-performing solver for a given problem instance. In the data preprocessing stage I need to standardize or scale my data, but I’m not sure how to do this best.

Firstly, I’m not sure which of sklearn’s scalers to use (StandardScaler, MinMaxScaler, ...).

Secondly, I’m confused about how to handle the different records for each instance. If I first group the data by Problem_instance and then apply a MinMaxScaler, the record with Cost = 0 is the best solution for that problem and the one with Cost = 1 the worst. But if I use the same strategy to scale Problem_Size, it becomes 0 everywhere within an instance. On the other hand, if I scale globally, the information about which solver is best for each instance is lost.
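To make the two options concrete, here is roughly what I mean (a minimal sketch with a hand-rolled min_max helper; the DataFrame is built from the example rows above, not my real data):

```python
import pandas as pd

df = pd.DataFrame({
    "Problem_instance": ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2"],
    "Problem_Size":     [50, 50, 50, 50, 150, 150, 150, 150],
    "Used_Solver":      ["A", "B", "C", "D", "A", "B", "C", "D"],
    "Cost":             [75, 125, 225, 100, 165, 360, 275, 45],
})

def min_max(s):
    """Scale a Series to [0, 1]; a constant Series maps to all zeros."""
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng else s * 0.0

# Option 1: scale within each problem instance
# (0 = cheapest solver for that instance, 1 = most expensive).
df["Cost_per_instance"] = df.groupby("Problem_instance")["Cost"].transform(min_max)
df["Size_per_instance"] = df.groupby("Problem_instance")["Problem_Size"].transform(min_max)  # all zeros

# Option 2: scale globally (the per-instance ranking is diluted).
df["Cost_global"] = min_max(df["Cost"])
df["Size_global"] = min_max(df["Problem_Size"])

print(df)
```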

Can someone help me figure out how to handle the data preprocessing for this problem?

One Answer

There's no single right answer to this question, because which scaler works best really depends on the data and the algorithm you use to make the prediction. You should try different scalers combined with different algorithms and decide which preprocessing is best by comparing the cross-validation results of each pipeline, for example as sketched below.
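A minimal sketch of that comparison; X and y here are synthetic stand-ins for your real features (e.g. Problem_Size) and the label of the best solver per instance:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(10, 200, size=(200, 1)).astype(float)   # e.g. Problem_Size
y = rng.choice(["A", "B", "C", "D"], size=200)            # best solver per instance

scalers = {"standard": StandardScaler(), "minmax": MinMaxScaler()}
models = {"logreg": LogisticRegression(max_iter=1000),
          "forest": RandomForestClassifier(n_estimators=200, random_state=0)}

# Score every scaler/model combination with 5-fold cross-validation.
for s_name, scaler in scalers.items():
    for m_name, model in models.items():
        pipe = Pipeline([("scale", scaler), ("model", model)])
        scores = cross_val_score(pipe, X, y, cv=5)
        print(f"{s_name} + {m_name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```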

Of course, you don't have unlimited time. I would:

  1. First look at the distribution of Cost globally and see if you think you need to transform the target to make it more normal (see the sketch after this list)
  2. Consider a tree-based algorithm for a first pass, which won't require you to scale Problem_Size
  3. Scale globally first, and then, if you have time, experiment with scaling based on problem type
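To make the first two steps concrete, here is a rough sketch; the data is synthetic and only mimics the shape of your table, and the log transform and random forest are illustrations, not the only reasonable choices:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Problem_Size": rng.integers(10, 200, size=n),
    "Used_Solver": rng.choice(list("ABCD"), size=n),
    "Cost": rng.lognormal(mean=4.0, sigma=0.6, size=n),  # skewed, like many cost targets
})

# Step 1: check the target's skew; a log transform often helps if it's heavy-tailed.
print(df["Cost"].describe())
print("skew before/after log1p:", df["Cost"].skew(), np.log1p(df["Cost"]).skew())

# Step 2: tree-based first pass -- predict (log) Cost from size and solver,
# with no scaling of Problem_Size required.
X = pd.get_dummies(df[["Problem_Size", "Used_Solver"]], columns=["Used_Solver"])
y = np.log1p(df["Cost"])
model = RandomForestRegressor(n_estimators=200, random_state=0)
print("CV R^2:", cross_val_score(model, X, y, cv=5).mean().round(3))
```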

You'll eventually run out of time or patience, but I think going in this order will make the most efficient use of your time.

Remember, there's no single best way to do it for every problem.

Answered by Josh on September 2, 2021
