TransWikia.com

Merging sparse and dense data in machine learning to improve the performance

Data Science Asked on December 11, 2021

I have sparse features that are predictive, and I also have dense features that are predictive. I need to combine them to improve the overall performance of the classifier.

The problem is that when I combine them, the dense features tend to dominate the sparse ones, so I get only a 1% improvement in AUC compared to a model with only dense features.

Has anybody come across a similar problem? I would really appreciate any input, as I am kind of stuck. I have already tried a lot of different classifiers, combinations of classifiers, feature transformations, and processing with different algorithms.

Thanks in advance for the help.

Edit:

I have already tried the suggestions given in the comments. What I have observed is that for almost 45% of the data the sparse features perform really well (I get an AUC of around 0.9 with only sparse features), but for the rest the dense features perform well, with an AUC of around 0.75. I tried separating out these datasets, but I get an AUC of 0.6, so I can't simply train a model and decide which features to use.

Regarding the code snippet, I have tried so many things that I am not sure what exactly to share 🙁

5 Answers

Try PCA on the sparse features only, and combine the PCA output with the dense features.

That way you get a dense set of (original) features plus a dense set of features that were originally sparse.
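A minimal sketch of that idea, under some assumptions: the data below is synthetic, the shapes and the 20 components are arbitrary, and TruncatedSVD is swapped in for PCA because it accepts scipy sparse input directly without densifying it first. Scaling both blocks before concatenating is also my addition, so that neither block dominates purely because of its scale.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real feature blocks (shapes are assumptions).
values = rng.random((200, 500))
mask = rng.random((200, 500)) < 0.02          # keep roughly 2% of the entries
X_sparse = sparse.csr_matrix(values * mask)   # sparse block: 200 x 500
X_dense = rng.normal(size=(200, 10))          # dense block: 200 x 10

# Reduce the sparse block to a small dense block.
svd = TruncatedSVD(n_components=20, random_state=0)
X_sparse_reduced = svd.fit_transform(X_sparse)

# Put both blocks on a comparable scale, then concatenate column-wise.
X_combined = np.hstack([
    StandardScaler().fit_transform(X_dense),
    StandardScaler().fit_transform(X_sparse_reduced),
])
print(X_combined.shape)  # (200, 30)
```

On real data the reducer and scalers should be fitted on the training split only and then applied to the test split.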

+1 for the question. Please update us with the results.

Answered by Tagar on December 11, 2021

In addition to some of the suggestions above, I would recommend using a two-step modeling approach.

  1. Use the sparse features first and develop the best model.
  2. Calculate the predicted probability from that model.
  3. Feed that probability estimate into the second model (as an input feature), which would incorporate the dense features. In other words, use all dense features and the probability estimate for building the second model.
  4. The final classification will then be based on the second model.
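The steps above can be sketched roughly as follows. The data is synthetic with random labels, the two model choices are arbitrary, and using out-of-fold probabilities for the training rows is my assumption (the answer only says to calculate the predicted probability); it is there so the second model does not see probabilities the first model produced for rows it was fitted on.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-ins for the real data (assumed shapes, random labels).
X_sparse = sparse.csr_matrix(rng.random((n, 300)) * (rng.random((n, 300)) < 0.05))
X_dense = rng.normal(size=(n, 8))
y = rng.integers(0, 2, size=n)

idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.25, random_state=0, stratify=y
)

# Steps 1-2: model on sparse features only, then its predicted probability.
sparse_model = LogisticRegression(max_iter=1000)
p_train = cross_val_predict(
    sparse_model, X_sparse[idx_train], y[idx_train], cv=5, method="predict_proba"
)[:, 1]                                   # out-of-fold probabilities
sparse_model.fit(X_sparse[idx_train], y[idx_train])
p_test = sparse_model.predict_proba(X_sparse[idx_test])[:, 1]

# Step 3: second model on all dense features plus the probability estimate.
X2_train = np.column_stack([X_dense[idx_train], p_train])
X2_test = np.column_stack([X_dense[idx_test], p_test])
second_model = RandomForestClassifier(n_estimators=100, random_state=0)
second_model.fit(X2_train, y[idx_train])

# Step 4: the final classification comes from the second model.
final_proba = second_model.predict_proba(X2_test)[:, 1]
```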

Answered by Vishal on December 11, 2021

This seems like a job for Principal Component Analysis. PCA is well implemented in scikit-learn, and it has helped me many times.

PCA, in a way, combines your features. By limiting the number of components, you feed your model less noisy data (in the best case), because your model is only as good as your data.

Consider the simple example below.

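from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Reduce to 80 principal components, then fit a random forest on them.
pipe_rf = Pipeline([('pca', PCA(n_components=80)),
                    ('clf', RandomForestClassifier(n_estimators=100))])
pipe_rf.fit(X_train_s, y_train_s)

# X_test must be preprocessed the same way as X_train_s.
pred = pipe_rf.predict(X_test)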

Why did I pick 80? When I plotted the cumulative variance, I got the plot below, which tells me that with ~80 components I reach almost all of the variance.

[Figure: cumulative variance]
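For anyone who wants to pick the number of components the same way, here is a rough sketch of reading it off the cumulative explained variance. The data is synthetic and the 95% threshold is an arbitrary assumption, not a value from this answer.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data with about 15 informative directions plus a little noise.
latent = rng.normal(size=(300, 15))
X = latent @ rng.normal(size=(15, 100)) + 0.1 * rng.normal(size=(300, 100))

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the threshold.
threshold = 0.95  # assumed cut-off
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(n_components)
```

Plotting `cumulative` against the component index gives the kind of curve described above.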

So I would say give it a try, use it in your models. It should help.

Answered by HonzaB on December 11, 2021

The best way to combine features is through ensemble methods. Basically there are three different methods: bagging, boosting, and stacking. You can use either AdaBoost augmented with feature selection (considering both sparse and dense features) or a stacking-based approach (random feature, i.e. random subspace).

I prefer the second option. You can train a set of base learners (decision trees) using random subsets of the data and random subsets of the features, and keep training base learners until you cover the whole set of features. The next step is to run the base learners on the training set to generate the meta data. Use this meta data to train a meta-classifier. The meta-classifier will figure out which features are more important and what kind of relationship should be utilized.
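A rough sketch of the stacking variant, with several assumptions: the data is synthetic, the features are split into 8 disjoint random subsets so that every feature is covered, and the meta data is built from out-of-fold predictions rather than plain predictions on the training set (my substitution, to limit leakage into the meta-classifier). For brevity it only subsamples features, not rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: 40 features, the label depends on two of them.
X = rng.normal(size=(300, 40))
y = (X[:, 0] + X[:, 25] > 0).astype(int)

# Random disjoint feature subsets that together cover every feature.
feature_subsets = np.array_split(rng.permutation(X.shape[1]), 8)

meta_columns = []
base_learners = []
for cols in feature_subsets:
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    # Out-of-fold probability from this base learner becomes one meta feature.
    oof = cross_val_predict(tree, X[:, cols], y, cv=5, method="predict_proba")[:, 1]
    meta_columns.append(oof)
    # Refit on all training rows so the learner can be applied to new data.
    base_learners.append((cols, tree.fit(X[:, cols], y)))

# Meta-classifier trained on the base learners' outputs.
meta_X = np.column_stack(meta_columns)
meta_clf = LogisticRegression().fit(meta_X, y)
```

The size of each coefficient in `meta_clf.coef_` gives a rough idea of which feature subsets the meta-classifier relies on.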

Answered by Bashar Haddad on December 11, 2021

The variable groups may be multicollinear, or the conversion between sparse and dense might be going wrong. Have you thought about using a voting classifier / ensemble classification? http://scikit-learn.org/stable/modules/ensemble.html That way you could deal with both of the above problems.
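As a sketch of what a voting setup could look like here: one model sees only the sparse columns, another sees only the dense columns, and their predicted probabilities are averaged. The data is synthetic with random labels, and the helper functions `take_sparse` and `take_dense`, the column layout, and both model choices are assumptions for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(0)
n, n_sparse, n_dense = 300, 200, 6

# Synthetic stand-ins: sparse columns first, dense columns after them.
X_sparse = sparse.csr_matrix(
    rng.random((n, n_sparse)) * (rng.random((n, n_sparse)) < 0.05)
)
X_dense = rng.normal(size=(n, n_dense))
y = rng.integers(0, 2, size=n)
X_all = sparse.hstack([X_sparse, sparse.csr_matrix(X_dense)], format="csr")

def take_sparse(X):
    """Hypothetical helper: keep only the sparse block."""
    return X[:, :n_sparse]

def take_dense(X):
    """Hypothetical helper: keep only the dense block, as a NumPy array."""
    return X[:, n_sparse:].toarray()

# Soft voting averages the class probabilities of the two pipelines.
voter = VotingClassifier(
    estimators=[
        ("sparse_lr", make_pipeline(FunctionTransformer(take_sparse),
                                    LogisticRegression(max_iter=1000))),
        ("dense_rf", make_pipeline(FunctionTransformer(take_dense),
                                   RandomForestClassifier(n_estimators=100,
                                                          random_state=0))),
    ],
    voting="soft",
)
voter.fit(X_all, y)
proba = voter.predict_proba(X_all)
```

Because each model is fitted on its own block, neither block can crowd out the other inside a single model; the `weights` parameter of `VotingClassifier` can then be tuned to balance them.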

Answered by Diego on December 11, 2021
