Why do I get two different values in heatmap and feature_importances?

Data Science, asked on January 3, 2021

I'm running one feature selection using sns.heatmap and another using sklearn's feature_importances_.

When using the same data, I get two different sets of values.
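As far as I understand, the heatmap shows pairwise Pearson correlation (a linear, one-feature-at-a-time measure), while feature_importances_ reflects how useful each feature is inside the fitted trees, so the two measure different things. Here is a toy example on synthetic data (unrelated to my CSV) where the two rankings disagree:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
x1 = rng.normal(size=1000)
x2 = rng.normal(size=1000)
# y depends on x1 only through x1**2, so the linear correlation with x1 is near zero
y = (x1 ** 2 + 0.1 * x2 > 1).astype(int)

df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})
print(df.corr()['y'])           # x1 looks unimportant to Pearson correlation
rf = RandomForestClassifier(random_state=1).fit(df[['x1', 'x2']], df['y'])
print(rf.feature_importances_)  # x1 dominates the impurity-based importances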

Here is the heatmap:

[heatmap image]

and the heatmap code:

from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

training_data = pd.read_csv(
    "/Users/aus10/NFL/Data/Betting_Data/CBB/Training_Data_Betting_CBB.csv")

df_model = training_data.copy()
df_model = df_model.dropna()
df_model = df_model.drop(['Money_Line', 'Money_Line_Percentage', 'Money_Line_Money', 'Money_Line_Move', 'Money_Line_Direction', 'Spread', 'Spread_Percentage', 'Spread_Money', 'Spread_Move', 'Spread_Direction',
                          'Win', 'Cover'], axis=1)

X = df_model.loc[:, ['Total', 'Total_Move', 'Over_Percentage', 'Over_Money',
                     'Under_Percentage', 'Under_Money']]  # independent columns (not used by the heatmap below)
y = df_model['Over_Under']  # target column (not used by the heatmap below)

# correlation matrix over every remaining column of df_model (features and target)
corrmat = df_model.corr()
plt.figure(figsize=(20, 20))
# plot heat map of the pairwise correlations
g = sns.heatmap(corrmat, annot=True, cmap='hot')

plt.xticks(rotation=90)
plt.yticks(rotation=45)

plt.show()
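Since the heatmap is built from every remaining column of df_model (not just X), the feature-vs-target correlations can also be read straight off the 'Over_Under' column (assuming 'Over_Under' is numeric, which the corr() call above already requires):

# per-feature correlation with the target, sorted by absolute strength
target_corr = corrmat['Over_Under'].drop('Over_Under')
print(target_corr.abs().sort_values(ascending=False))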

Here is the feature_importances_ bar graph:

[bar graph image]

and the code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.inspection import permutation_importance

training_data = pd.read_csv(
    "/Users/aus10/NFL/Data/Betting_Data/CBB/Training_Data_Betting_CBB.csv", index_col=False)

df_model = training_data.copy()
df_model = df_model.dropna()

X = df_model.loc[:, ['Total', 'Total_Move', 'Over_Percentage', 'Over_Money',
                     'Under_Percentage', 'Under_Money']]  # independent columns
y = df_model['Over_Under']  # target column

model = RandomForestClassifier(
    random_state=1, n_estimators=100, min_samples_split=100, max_depth=5, min_samples_leaf=2)

skf = StratifiedKFold(n_splits=2)

# note: this loop overwrites the split on every iteration, so only the LAST
# fold's train/test split is kept for the fit below
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

model.fit(X_train, y_train)
# use the inbuilt impurity-based feature_importances_ of tree-based classifiers
print(model.feature_importances_)
# plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(5).plot(kind='barh')

# permutation importance on the held-out fold, for comparison
perm_importance = permutation_importance(model, X_test, y_test)
print(perm_importance.importances_mean)

plt.show()
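Since the two importance measures live on different scales, it can also help to put them side by side in one table (a minimal sketch, reusing model, perm_importance, and X from the code above):

comparison = pd.DataFrame({
    'impurity': model.feature_importances_,
    'permutation': perm_importance.importances_mean,
}, index=X.columns).sort_values('permutation', ascending=False)
print(comparison)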

I'm not sure which one is more accurate, or whether I'm using them correctly. Should I be using the heatmap to eliminate collinearity and the feature importances to actually select my group of features?
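If it helps, here is roughly how I imagine that two-step workflow would look (just a sketch; the 0.9 threshold is an arbitrary choice of mine, and X, y, model are the objects from the code above):

import numpy as np

# step 1: drop one feature out of each highly correlated pair
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]  # 0.9 is arbitrary
X_pruned = X.drop(columns=to_drop)

# step 2: rank the surviving features with the model's importances
model.fit(X_pruned, y)
print(pd.Series(model.feature_importances_, index=X_pruned.columns)
      .sort_values(ascending=False))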
