TransWikia.com

Determining which categorical data is beneficial in predictive modelling

Data Science Asked on November 10, 2020

I am working on a model to predict how long a "job" will take to complete, based on historical data. Each job has a handful of categorical characteristics (all independent), and some historical data might look like:

JobID   Manager     City        Design          ClientType      TaskDuration
a1      George      Brisbane    BigKahuna       Personal        10
a2      George      Brisbane    SmallKahuna     Business        15
a3      George      Perth       BigKahuna       Investor        7

Thus far, my model has been relatively basic, following these steps:

  1. Aggregate the historical data by each category value, calculating the mean duration and counting how many times the value occurs. From the previous example, the result would be:
Category        Value           Mean    Count
Manager         George          10.66   3
City            Brisbane        12.5    2
City            Perth           7       1
Design          BigKahuna       8.5     2
Design          SmallKahuna     15      1
ClientType      Personal        10      1
ClientType      Business        15      1
ClientType      Investor        7       1
  2. For each job in the system, calculate the job duration based on the above. For example:
JobID   Manager     City        Design          ClientType
b5      George      Brisbane    SmallKahuna     Investor

Category        Value           CalculatedMean      CalculatedCount     Factor (Mean * Count)
Manager         George          10.66               3                   31.98
City            Brisbane        12.5                2                   25
Design          SmallKahuna     15                  1                   15
ClientType      Investor        7                   1                   7       

TaskDuration    = SUM(Factor) / SUM(CalculatedCount)
                = 78.98 / 7
                = 11.283
                ~= 11 days
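
A minimal Python sketch of the two steps above, using the three example jobs. The function names `aggregate` and `predict` are illustrative and do not come from the original model, and skipping category values that never appear in the history is an assumption. Because the code keeps the exact mean for George (10.666... rather than 10.66), its estimate differs from 11.283 only in the third decimal place.

```python
from collections import defaultdict

CATEGORIES = ["Manager", "City", "Design", "ClientType"]

# Historical jobs from the example table.
history = [
    {"Manager": "George", "City": "Brisbane", "Design": "BigKahuna",
     "ClientType": "Personal", "TaskDuration": 10},
    {"Manager": "George", "City": "Brisbane", "Design": "SmallKahuna",
     "ClientType": "Business", "TaskDuration": 15},
    {"Manager": "George", "City": "Perth", "Design": "BigKahuna",
     "ClientType": "Investor", "TaskDuration": 7},
]


def aggregate(jobs):
    """Step 1: mean duration and count for every (category, value) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # (category, value) -> [sum, count]
    for job in jobs:
        for cat in CATEGORIES:
            entry = totals[(cat, job[cat])]
            entry[0] += job["TaskDuration"]
            entry[1] += 1
    return {key: (total / count, count) for key, (total, count) in totals.items()}


def predict(job, stats):
    """Step 2: SUM(mean * count) / SUM(count) over the job's category values."""
    factor_sum = 0.0
    count_sum = 0
    for cat in CATEGORIES:
        key = (cat, job[cat])
        if key not in stats:
            continue  # assumption: ignore values with no history
        mean, count = stats[key]
        factor_sum += mean * count
        count_sum += count
    if count_sum == 0:
        raise ValueError("no historical data for any category value of this job")
    return factor_sum / count_sum


stats = aggregate(history)
new_job = {"Manager": "George", "City": "Brisbane",
           "Design": "SmallKahuna", "ClientType": "Investor"}
estimate = predict(new_job, stats)
print(round(estimate, 2))
```

Since mean * count is just the sum of durations for that value, the estimate is a pooled average in which frequently seen values (here, the manager) carry the most weight, whether or not they actually influence the duration.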

After testing my model on a few hundred finished jobs from the last four months, I calculated average discrepancies ranging from -15% to +25%.

I think one of my issues is that I may be taking into account categories that actually have no effect on the build time and are skewing my results. In reality, I'm taking 15 categories into account from ~400 completed jobs, and some of these categories have values that only appear once or twice (for example, we might only have a single job in Perth).

How can I determine which categories are actually beneficial to the model, and which should be ignored?

Related question here.

One Answer

You can try two things:

  1. Measure how strongly each category is associated with the target.
    Since this is between categorical features and a continuous target, fit a regression of the target on each category and compare the R-squared or adjusted R-squared scores. Then drop the few lowest-scoring categories and re-test the model.
    Read more - Kaggle

  2. Calculate feature importance using a random forest.
    Read here - MachineLearningMastery
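
A rough sketch of the first suggestion in plain Python. It relies on the fact that a least-squares regression of the target on a single one-hot encoded category predicts each group's mean, so R-squared can be computed from within-group deviations without a modelling library. The toy data and the function name `r2_scores` are hypothetical and are not taken from the question.

```python
from collections import defaultdict

# Hypothetical toy data: Design drives the duration, City does not.
jobs = [
    {"Design": "BigKahuna", "City": "Brisbane", "TaskDuration": 8},
    {"Design": "BigKahuna", "City": "Perth", "TaskDuration": 7},
    {"Design": "BigKahuna", "City": "Brisbane", "TaskDuration": 9},
    {"Design": "BigKahuna", "City": "Perth", "TaskDuration": 8},
    {"Design": "SmallKahuna", "City": "Brisbane", "TaskDuration": 15},
    {"Design": "SmallKahuna", "City": "Perth", "TaskDuration": 14},
    {"Design": "SmallKahuna", "City": "Brisbane", "TaskDuration": 16},
    {"Design": "SmallKahuna", "City": "Perth", "TaskDuration": 15},
]


def r2_scores(rows, feature, target="TaskDuration"):
    """Return (R^2, adjusted R^2) for regressing target on one categorical feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])

    n = len(rows)
    k = len(groups)  # distinct levels: k - 1 dummy variables plus an intercept
    grand_mean = sum(row[target] for row in rows) / n
    ss_total = sum((row[target] - grand_mean) ** 2 for row in rows)
    if ss_total == 0 or n <= k:
        raise ValueError("need a non-constant target and more rows than levels")

    # The fitted value for each row is its group mean.
    ss_resid = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_resid += sum((v - group_mean) ** 2 for v in values)

    r2 = 1 - ss_resid / ss_total
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    return r2, adj_r2


scores = {feature: r2_scores(jobs, feature) for feature in ("Design", "City")}
ranked = sorted(scores.items(), key=lambda item: item[1][1], reverse=True)
for feature, (r2, adj_r2) in ranked:
    print(f"{feature:8s} R2={r2:.3f}  adjusted R2={adj_r2:.3f}")
```

A level that appears only once contributes zero residual, so plain R-squared flatters categories with many rare values. Adjusted R-squared penalises the number of levels, which makes it the safer score to rank by when there are 15 categories and only ~400 jobs.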

Answered by 10xAI on November 10, 2020
