How to leverage description data in multi-class classification (dimensionality reduction)

Data Science Asked on July 29, 2021

I'm currently working with a dataset of 55k records and seven columns (one of which is the target variable). Three of the features are nominal categorical, and the other three are 'description' fields with high cardinality, as would be expected of description data:

in>>
df[['size Description', 'weight Description', 'height Description']].nunique()

out>>
 size Description       4066
 weight Description      736
 height Description     3173
 dtype: int64

Some examples of these values could be:

   Product                  Product Description
  ---------               ------------------------
   Ball                 Round bouncy toy for kids
   Bat                  Stick that kids use to hit a ball
   Go-Kart red/black    Small motorized vehicle for kids
   Go-Kart blue/green   Small motorized vehicle for kids
   Wrench               Tool for tightening or loosening bolts
   Ratchet              Tool for tightening or loosening bolts
   Reclining arm-chair  Cushioned seat for lounging

I think that the descriptions are standardized if they fall within a particular category, but at this time I cannot confirm whether the number of unique descriptions is finite. For now, my assumption is to treat these as nominal categorical, as they are literally descriptive and not quantitative.

To that end, my question is what are some best practices for handling categorical features such as these?

Things I have considered:

  1. Label encoding is obviously not viable in this situation, as the descriptions have no inherent order or hierarchy.

  2. One-hot encoding seems an unlikely solution, as it balloons the shape of the dataset from (55300, 6) to (55300, 65223) due to the high cardinality of the description variables. However, I tried it anyway and got 98% accuracy on my test set but very poor results on an out-of-sample validation set (5k records, 0-5% accuracy). It seems pretty clear that it's over-fitting and is thus not viable.

  3. Hashing, for whatever reason, will not apply to one of the columns, but I suppose it could still be viable; I just need to figure out why it isn't hashing all of my features (probably best suited for a separate question). A rough sketch of what I have in mind follows this list.

  4. PCA – could be viable, but if I’m understanding correctly the cardinality after one-hot encoding is too great and PCA will throw an error. In fairness, I have not tried this yet.

  5. Binning doesn't seem feasible, since I could have a value of "3.5" or "three and 1/2"; each would be treated as a separate bin, so it wouldn't solve my problem.
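
For reference, here is a rough sketch of what I mean by hashing in point 3, using scikit-learn's HashingVectorizer on each description column (the column names are the ones above; n_features=256 is an arbitrary choice on my part):

# Sketch of point 3: hash each description column into a fixed-width sparse
# block and stack the blocks side by side. Column names are from the question;
# n_features is an arbitrary placeholder, not a tuned value.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import HashingVectorizer

desc_cols = ['size Description', 'weight Description', 'height Description']

def hash_descriptions(df, columns, n_features=256):
    blocks = []
    for col in columns:
        vec = HashingVectorizer(n_features=n_features, alternate_sign=False)
        # astype(str) guards against NaN values breaking the tokenizer
        blocks.append(vec.transform(df[col].astype(str)))
    return hstack(blocks)

# X_desc = hash_descriptions(df, desc_cols)  # shape (55300, 3 * 256), sparse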

Thanks to all that can share their insight/opinion.

2 Answers

You have textual descriptions, i.e. unstructured data, so you should probably use one of the standard representation methods for text. There are many options, including sentence embeddings and other advanced methods, but I'm going to describe the simple, traditional option:

  • Each description value can be represented as a vector of features, one for each word in the full vocabulary (i.e. over all the values for this field).
  • The value of each feature can be boolean (i.e. whether this word appears in the description) or, better, a TF-IDF weight for the word.
  • Obviously this would lead to too many features, so one needs to select only the most relevant ones (a scikit-learn sketch follows this list). This part is very experimental; you might have to try various options to find the right one:
    • Get rid of the stop words since they provide no semantic information.
    • Discard all the words which appear only once, and probably also all the words which appear less than some minimum frequency $N$ (try with $N=2,3,4...). The rationale is that rare words are more likely to cause overfitting than to really help any kind of classification.
    • Beyond that, you could use general feature selection (e.g. information gain) or feature clustering.
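
As a rough illustration, here is a minimal scikit-learn sketch of the above. The column name, the minimum frequency N and k are placeholder choices, and df and y stand for the question's dataframe and target column:

# TF-IDF representation of one description field with the pruning steps above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

N = 3  # discard words appearing in fewer than N descriptions (try 2, 3, 4...)

vec = TfidfVectorizer(stop_words='english', min_df=N)
X_size = vec.fit_transform(df['size Description'].astype(str))

# Optional further reduction: keep the k terms with the highest mutual
# information (information gain) with respect to the target.
selector = SelectKBest(mutual_info_classif, k=200)
X_size_selected = selector.fit_transform(X_size, y)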

[ obsolete answer to the first version of the question ]

I would definitely try to normalize these values, because semantically they are numerical and in their original form they are almost useless, no matter how they are categorized. Making them categorical loses a lot of information, especially for the values which actually provide a number. Since some of the strings are very vague, I would probably use intervals, i.e. two numeric values for every original input value:

  • three and a half inches, three and 1/2 inches -> min 3.4 - max 3.6
  • 27.6234 inches -> min 27.6234 - max 27.6234
  • tall -> large range high values
  • short, kinda short -> large range low values

Normally there are not that many ways to express numbers as words, so a few patterns should be enough to capture all the variants. For the other non-standard cases such as "kinda short", I would start by looking at their distribution in the data: if a value is frequent enough (e.g. probably "short", "tall"), then manually predefine a range for it. Values which are not frequent can be ignored, e.g. replaced with NA (since they are not frequent, that shouldn't affect the data too much).
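
A very rough sketch of such a parser (the word-to-number map, the ±0.1 margin and the ranges for "tall"/"short" are illustrative guesses, not values derived from the actual data):

import re

WORD_NUMS = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5,
             'half': 0.5, 'quarter': 0.25}
VAGUE_RANGES = {'tall': (60.0, 90.0), 'short': (10.0, 40.0)}  # assumed units and bounds

def to_interval(text):
    """Map a free-text size description to a (min, max) interval, or None."""
    text = str(text).lower().replace('1/2', 'half').replace('1/4', 'quarter')
    # explicit numbers such as "27.6234 inches" -> exact interval
    m = re.search(r'\d+(?:\.\d+)?', text)
    if m:
        value = float(m.group())
        return (value, value)
    # numbers written as words, e.g. "three and a half" -> 3.5 with a small margin
    total = sum(WORD_NUMS[w] for w in re.findall(r'[a-z]+', text) if w in WORD_NUMS)
    if total:
        return (total - 0.1, total + 0.1)
    # vague but frequent terms get a manually predefined wide range
    for word, rng in VAGUE_RANGES.items():
        if word in text:
            return rng
    return None  # rare, unparseable values -> NA

# to_interval("three and 1/2 inches")  -> (3.4, 3.6)
# to_interval("27.6234 inches")        -> (27.6234, 27.6234)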

Answered by Erwan on July 29, 2021

Your considerations regarding label encoding, one-hot encoding, and the like are fairly accurate. For dealing with descriptive data, converting each description into a meaningful vector surely helps.

Meaningful vectors are vectors that capture the essence of what is being said in the descriptions.

Let,

football = "round bouncy toy for kids to kick"
basketball = "round bouncy toy for kids to dribble"
spark_plug = "device for delivering electric current from an ignition system to the combustion chamber of a spark-ignition engine"

Meaningful vectors will have properties such that,

dist(vector(football), vector(basketball)) < dist(vector(football), vector(spark_plug))

For vector representations learned on common English corpora, properties like the following emerge:

vector('king') - vector('male') + vector('female') = vector('queen')

Word2Vec is an efficient way to calculate word vectors - vectors that capture the meaning of individual words. To use Word2Vec you can either -

  1. Use vectors obtained by pre-training Word2Vec on a large public corpus (like Wikipedia). Since your descriptions contain words that do not have domain-specific meanings, this might seem preferable. However, this method has a disadvantage: in the spirit of accurately representing the entire English vocabulary, these pre-trained vectors tend to have high dimensionality (300 and 1,000 are common), which might make them unusable for your task.
  2. Train Word2Vec on your own vocabulary. With this method, you can set the learned-vector dimensionality large enough to capture word meanings but small enough not to blow up your feature space.

For your descriptions, you can average the vectors of their constituent words, thereby treating each description as a bag of words, or you can use more sophisticated techniques like Doc2Vec, which is based on Word2Vec but also tries to capture the relative order of words.

Gensim's implementations of Word2Vec and Doc2Vec expose a fairly simple API which you can quickly learn to use for your particular task.
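
For instance, a minimal sketch with gensim's Word2Vec (4.x API; the column name, vector_size and the other hyperparameters are placeholders, not tuned values):

import numpy as np
from gensim.models import Word2Vec

# tokenize each description into a list of lowercase words
sentences = [str(d).lower().split() for d in df['size Description']]

# a small vector_size keeps the feature space from blowing up (option 2 above)
model = Word2Vec(sentences, vector_size=50, window=5, min_count=2, workers=4)

def description_vector(tokens):
    """Average the vectors of the tokens that made it into the vocabulary."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

X_size = np.vstack([description_vector(tokens) for tokens in sentences])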

Answered by Yash Jakhotiya on July 29, 2021
