Knowing Feature Importance from Sparse Matrix

Data Science Asked on December 7, 2021

I was working with a dataset that had a textual column as well as numerical columns. I used TF-IDF to turn the textual column into a sparse matrix, converted the numerical features into a sparse matrix with scipy.sparse.csr_matrix, and combined the two with the text sparse features.

Then I feed the combined matrix to a gradient boosting model and do the rest of the training and prediction.
However, I want to know: is there any way to plot the feature importance for this sparse matrix and recover the names of the important feature columns?
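For concreteness, a minimal sketch of the setup described above (the data and the numeric column names are illustrative, not from the original post):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["cheap flights to paris", "hotel deals in rome", "paris hotel offers"]
numeric = np.array([[3.0, 1], [5.0, 0], [2.0, 1]])  # e.g. price tier, is_weekend

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)   # sparse TF-IDF matrix for the text column
X_num = csr_matrix(numeric)                # numeric features as a sparse matrix
X = hstack([X_text, X_num]).tocsr()        # combined feature matrix fed to the model

print(X.shape)  # rows = 3, columns = text features + 2 numeric features
```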

One Answer

You already have a map of your text features from the TF-IDF vectorizer's vocabulary. Note that vocabulary_ maps each term to its column index, so you need to invert it and read the names out in index order:

rev_dictionary = {v: k for k, v in vectorizer.vocabulary_.items()}
column_names_from_text_features = [rev_dictionary[i] for i in range(len(rev_dictionary))]

Since you know the column names of your other features, the full list of feature names you pass to XGBoost (after the scipy.sparse.hstack) could be

all_columns = column_names_from_text_features + other_columns

(or the reverse, depending on the order in which you horizontally stacked the matrices).

Now, once you have trained the XGBoost model, you can use the plot_importance function for feature importance. Your code would look something like this:

from xgboost import XGBClassifier, plot_importance
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(15, 8))
plot_importance(<xgb-classifier>, max_num_features=15, xlabel='F-score', ylabel='Features', ax=ax)
plt.show()

These features will be labeled f0, f1, f2, etc., where the number is the index of the feature in the matrix passed to XGBoost.

Using the all_columns list constructed in the first part, you can map the f<index> labels in the plot back to the actual feature names.
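A small sketch of that final mapping step. With a trained model, the raw scores could come from something like model.get_booster().get_score(); here they are hard-coded for illustration, and the name list is a hypothetical example:

```python
# Example inputs: a name list in hstack order and importance scores keyed by
# XGBoost's default "f<index>" labels.
all_columns = ["apple", "grape", "price", "quantity"]
scores = {"f0": 12, "f3": 7, "f2": 3}

# Strip the leading "f", use the remaining digits as an index into all_columns.
named_scores = {all_columns[int(k[1:])]: v for k, v in scores.items()}
print(named_scores)  # {'apple': 12, 'quantity': 7, 'price': 3}
```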

Answered by srjit on December 7, 2021