
Procedure for selecting optimal number of features with Python's Scikit-Learn

Data Science Asked on March 16, 2021

I have a dataset with 130 features (1000 rows). I want to select the best features for my classifier. I started with RFE, but it was taking too long, so I did this:

from sklearn.feature_selection import RFE

number_of_columns = 130

# Try every possible number of selected features and score each subset
for i in range(1, number_of_columns):
    rfe = RFE(model, n_features_to_select=i)
    rfe.fit(x_train, y_train)
    acc = rfe.score(x_test, y_test)

Because this took too long, I changed my approach, and I want to know what you think of it: is it a good/correct approach?

First I did PCA, and I found out that each column contributes roughly between 0.4% and 1%, except the last 9 columns, which contribute less than 0.00001%, so I removed them. Now I have 121 features.

from sklearn.decomposition import PCA

pca = PCA()
pca.fit(x)
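
For reference, a minimal sketch of reading those per-component contributions off the fit above (the 1e-7 cut-off is an assumption matching the "less than 0.00001%" figure):

import numpy as np

# Fraction of the total variance explained by each component
ratios = pca.explained_variance_ratio_
print(ratios)

# Number of components below the (assumed) 0.00001% = 1e-7 cut-off
print(np.sum(ratios < 1e-7))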

Then I split my data into train and test (with 121 features).

Then I used SelectFromModel and tested it with 4 different classifiers. Each classifier in SelectFromModel reduced the number of columns; I chose the number of columns determined by the classifier that gave me the best accuracy:

from sklearn.feature_selection import SelectFromModel

# clf must already be fitted, since prefit=True is passed
model = SelectFromModel(clf, prefit=True)
#train_score = clf.score(x_train, y_train)
test_score = clf.score(x_test, y_test)
column_res = model.transform(x_train).shape
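
A sketch of that comparison loop under the stated approach (the four classifiers here are placeholders; each is fitted before being passed in with prefit=True):

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

classifiers = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    GradientBoostingClassifier(),
]

results = []
for clf in classifiers:
    clf.fit(x_train, y_train)
    test_score = clf.score(x_test, y_test)
    selector = SelectFromModel(clf, prefit=True)
    n_selected = selector.transform(x_train).shape[1]
    results.append((test_score, n_selected, clf))

# Take the column count from the classifier with the best test accuracy
best_score, number_of_columns, best_clf = max(results, key=lambda r: r[0])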

And finally I used RFE, with the number of columns I got from SelectFromModel:

# Use the classifier itself here, not the SelectFromModel instance
rfe = RFE(clf, n_features_to_select=number_of_columns)
rfe.fit(x_train, y_train)
acc = rfe.score(x_test, y_test)

Is this a good approach, or did I do something wrong?

Also, if I got the best accuracy in SelectFromModel with one classifier, do I need to use the same classifier in RFE?

2 Answers

You could try the Lasso (L1 penalty), which does automatic feature selection by "shrinking" coefficients to zero. This is one of the standard approaches to data with many columns and "not so many" rows.

# note: the l1 penalty requires solver='liblinear' or 'saga'
sklearn.linear_model.LogisticRegression(penalty='l1', solver='liblinear', ...)

See also this post.
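
As a minimal sketch of that idea (variable names follow the question; with an L1 penalty, the features whose coefficients are shrunk exactly to zero are the ones dropped):

import numpy as np
from sklearn.linear_model import LogisticRegression

# L1-penalised logistic regression; liblinear supports the l1 penalty
clf = LogisticRegression(penalty='l1', solver='liblinear')
clf.fit(x_train, y_train)

# Features whose coefficients survived the shrinkage
selected = np.where(np.any(clf.coef_ != 0, axis=0))[0]
print(len(selected), "features kept out of", x_train.shape[1])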

Edit:

The book "An Introduction to Statistical Learning" gives a really good overview. Here are the Python code examples from the book; Section 6.6.2 covers the Lasso.

Answered by Peter on March 16, 2021

For that number of features I use SelectKBest (sklearn.feature_selection.SelectKBest).

To do this, I take 1/4, 1/3, 1/2, 2/3, and 3/4 of all the features and analyse how the error score varies.
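
A sketch of that sweep, assuming a classification setting (the classifier is a placeholder; f_classif is SelectKBest's default score function):

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

n = x_train.shape[1]
for frac in (1/4, 1/3, 1/2, 2/3, 3/4):
    k = max(1, int(n * frac))
    selector = SelectKBest(f_classif, k=k).fit(x_train, y_train)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(selector.transform(x_train), y_train)
    print(f"k={k}: accuracy {clf.score(selector.transform(x_test), y_test):.3f}")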

ANOTHER OPTION: I use LassoCV (sklearn.linear_model.LassoCV), as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.model_selection import StratifiedKFold

SEED = 42  # any fixed seed

# shuffle must be True when random_state is set;
# otherwise scikit-learn raises a ValueError
kfold_on_rf = StratifiedKFold(
    n_splits=10,
    shuffle=True,
    random_state=SEED
)

lasso_cv = LassoCV(cv=kfold_on_rf, random_state=SEED, verbose=0)
sfm = SelectFromModel(lasso_cv)
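
Fitting and applying the selector is then one more step (a sketch; SelectFromModel fits the inner LassoCV itself here, because prefit is not set):

sfm.fit(x_train, y_train)
x_train_selected = sfm.transform(x_train)
print(x_train_selected.shape[1], "features selected")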

Answered by Victor Villacorta on March 16, 2021
