TransWikia.com

Imputing features with NA values in classification task

Data Science Asked on December 15, 2020

I currently have a dataset where each observation is a person’s traffic ticket history over districts.

For each column, which represents a district:

  • 1 represents that a person has received 1+ traffic violations in a district in 2018
  • 0 if they have been to that district but have no traffic violations.
  • NA otherwise

GOAL: to (rank districts to) see which district should have more police presence due to increased traffic violation and also use the features to predict whether or not that person has 1+ traffic accidents in 2019.

PROBLEM: not all people have been to every district. I currently just encode the value to 0 if the person has never been to that district. But this should be a valid NA value.
For example, it seems illogical to rank a district if only one person (in the dataset) has been to that district.

QUESTION(S): How exactly should I handle this? I don’t think imputing as 0 is the right call here.

Original Data:

PersonId DistA DistB DistC DistD DistE Accident19
1        0     1     1     0      NA     1
2        NA    0     0     0      1      0
3        0     1     1     0      NA     1
4        1     0     0     0      NA     0

Imputed Data:

PersonId DistA DistB DistC DistD DistE Accident19
1        0     1     1     0      0     1
2        0     0     0     0      1     0
3        0     1     1     0      0     1
4        1     0     0     0      0     0

Many thanks in advance!

One Answer

Thank you for clarifying the question @Eisen. So the question looks at two main things:

  1. To show which districts need more police presence.
  2. To classify people as to whether they will commit 1+ traffic violations, regardless of the district, given their previous number of violations, by district.

For the first point, I think what would be a good idea is to yes show the break down of people visiting each of the districts and committing 1+ traffic violations. But, I think adding (95%) confidence intervals would be particularly helpful to see what is a reliable estimate of people who commit 1+ traffic violations in a given district.

In terms of the second point, maybe you can use a feedforward neural network which will take as input the traffic violation categories for each district and output whether the person has committed an accident in 2019 (Accident19). The architecture is pretty much up to you, but the fail layer would need to be a 2-node softmax layer, which will create the probability distribution over the two classes (has had an accident in 2019 or not).

To represent this categorical data, I would suggest making the NA it's own category. Furthermore, best to represent the district traffic violation categories as one-hot encoded vectors. Then for a particular person, you concatenate these one-hot encoded vectors from all districts in your dataset.

Answered by shepan6 on December 15, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP