
How to do feature selection for clustering and implement it in python?

Data Science Asked by Akash Dubey on January 12, 2021

I am trying to implement k-means clustering on 60-70 features, and I came across a post on feature selection for clustering on Quora by Julian Ramos, but I fail to understand a few of the steps mentioned. I am also wondering if it is the right method to select the best features for clustering?

These are the steps mentioned in the post:

Sf = ∅  # set of selected features, initially empty

  1. Perform k-means on each of the features individually, for some k.
  2. For each feature, evaluate the resulting clustering with a performance metric such as the Dunn index or the silhouette score.
  3. Take the feature that gives you the best performance and add it to Sf.
  4. Perform k-means on Sf combined with each of the remaining features individually.
  5. Take the feature that gives you the best performance and add it to Sf.
  6. If you have reached the desired number of features, stop; otherwise go back to step 4.

Also, how do we implement this in Python? I wish to write a function that selects the best k and implements all the other steps.
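
This is my rough attempt at the selection loop so far, using scikit-learn's KMeans and silhouette_score with a fixed k; the function name and the stopping rule (a target number of features) are just my guesses, and I am not sure it is correct:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def greedy_select(X, k, n_features):
    """Greedy forward selection: grow Sf one feature at a time,
    keeping the feature whose addition gives the best silhouette.
    X is a 2D numpy array (samples x features)."""
    remaining = list(range(X.shape[1]))
    selected = []  # Sf, initially empty
    while len(selected) < n_features:
        best_score, best_f = -np.inf, None
        for f in remaining:
            cols = selected + [f]
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[:, cols])
            score = silhouette_score(X[:, cols], labels)
            if score > best_score:
                best_score, best_f = score, f
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

With 60-70 features this runs k-means on the order of n² times, and picking the best k would add another loop over candidate k values, so I am also worried about the cost.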

2 Answers

You might be interested in this paper, and in a Python implementation of it and of various other feature selection methods for clustering:

http://www.public.asu.edu/~huanliu/papers/pakdd00clu.pdf

https://github.com/danilkolikov/fsfc

An excerpt summarizing the approach:

We address the problem of selecting a subset of important features for clustering, for the whole data and not just for individual clusters, unlike in [1,2]. This helps in knowing the important features before doing clustering, and the clustering task becomes more efficient and focused, as only the important features can be used. Finding the important original features for the whole data helps in understanding the data better, unlike principal components. Data storage, collection, and processing tasks become more efficient, and noise is reduced as the data is pruned.

Our approach is a 2-step method: we first rank and then select a subset of important features. Ranking of features is done according to their importance for clustering; an entropy-based ranking measure is introduced. We then select a subset of features using a criterion function for clustering that is invariant with respect to different numbers of features. A novel scalable method based on random sampling is introduced for large data, commonly found in data mining applications.
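
As a rough sketch of the entropy measure the excerpt refers to (my assumptions: the similarity is S_ij = exp(-alpha * d_ij), with alpha chosen so that the mean pairwise distance maps to a similarity of 0.5, and a feature is ranked as more important if the data becomes more disordered, i.e. has higher entropy, once it is removed; the function names are illustrative and not taken from the linked implementation):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def entropy(X):
    """Entropy of a dataset based on pairwise similarities.
    S_ij = exp(-alpha * d_ij), with alpha chosen so that the mean
    pairwise distance maps to a similarity of 0.5."""
    d = euclidean_distances(X)
    mean_d = d[np.triu_indices_from(d, k=1)].mean()
    alpha = -np.log(0.5) / mean_d
    s = np.exp(-alpha * d)
    eps = 1e-12  # avoid log(0) on the diagonal, where s == 1
    return -np.sum(s * np.log(s + eps) + (1 - s) * np.log(1 - s + eps))

def rank_features_by_entropy(X):
    """Rank features by the entropy of the data with each feature removed:
    removing an important feature should leave the data more disordered."""
    scores = [entropy(np.delete(X, f, axis=1)) for f in range(X.shape[1])]
    return np.argsort(scores)[::-1]  # most important first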

Additionally, this paper gives an overview of different methods available:

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.295.8115&rep=rep1&type=pdf

Answered by Jed on January 12, 2021

People often confuse unsupervised feature selection (UFS) and dimensionality reduction (DR) algorithms, treating them as the same thing. For instance, a famous DR algorithm, Principal Component Analysis (PCA), is often mistaken for a UFS method. Researchers have argued that PCA is a feature extraction algorithm, not feature selection, because it transforms the original feature set into a set of interrelated, transformed features that are difficult to interpret in terms of the original variables (Abdi & Williams, 2010).

A UFS approach present in the literature is Principal Feature Analysis (PFA). It works as follows:

  • Compute the sample covariance matrix or correlation matrix.
  • Compute the principal components and eigenvalues of the covariance or correlation matrix A.
  • Choose the subspace dimension q; this gives the matrix A_q, whose rows are the vectors V_i (one per original feature).
  • Cluster the vectors V_i using k-means.
  • For each cluster, find the vector V_i closest to the cluster mean; the corresponding feature is retained.

A possible Python implementation of PFA is given below:

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from collections import defaultdict
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler
import pandas as pd
# create some dummy data
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
print(df)
class PFA(object):
    def __init__(self, n_features, q=None):
        self.q = q                    # number of principal components to keep
        self.n_features = n_features  # number of features to select

    def fit(self, X):
        if not self.q:
            self.q = X.shape[1]

        sc = StandardScaler()
        X = sc.fit_transform(X)

        pca = PCA(n_components=self.q).fit(X)  # computing the covariance matrix is embedded in PCA
        A_q = pca.components_.T  # rows of A_q are the vectors V_i, one per original feature

        # Cluster the feature vectors V_i
        kmeans = KMeans(n_clusters=self.n_features, n_init=10).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_

        # For each cluster, keep the feature whose vector is closest to the cluster center
        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
            dists[c].append((i, dist))

        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]  # standardized values of the selected features
        
# Usage
pfa = PFA(n_features=3)
pfa.fit(df)
# To get the selected features (standardized values)
x = pfa.features_
print(x)
# To get the column indices of the kept features
column_indices = pfa.indices_
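
Since indices_ holds positional indices, the names of the kept columns can be recovered from the original DataFrame, for example:

# Map the kept positional indices back to the original column names
selected_columns = df.columns[column_indices].tolist()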

Results

        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8
[[-0.50709255  1.73205081  1.23942334]
 [ 0.16903085 -0.57735027 -0.84802649]
 [ 1.52127766 -0.57735027 -1.10895772]
 [-1.18321596 -0.57735027  0.71756088]]

Reference

Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4), 433-459.

Answered by mnm on January 12, 2021
