
How to do feature selection for clustering and implement it in python?

Data Science Asked by Akash Dubey on January 12, 2021

I am trying to implement k-means clustering on 60-70 features, and I came across a post on feature selection for clustering on Quora by Julian Ramos, but I fail to understand a few of the steps mentioned. I am also wondering if it is the right method to select the best features for clustering?

These are the steps mentioned in the post:

Sf = ∅  # set of selected features, initially empty

  1. Perform k-means on each of the features individually, for some k.
  2. For each feature, evaluate the resulting clustering with a performance metric such as the Dunn index or the silhouette score.
  3. Take the feature that gives you the best performance and add it to Sf.
  4. Perform k-means on Sf combined with each of the remaining features individually.
  5. Take the feature that gives you the best performance and add it to Sf.
  6. If you have reached the desired number of features, stop; otherwise go back to step 4.

Also, how do we implement this in Python? I wish to write a function that selects the best k and implements all the other steps.
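
This is my rough attempt at the selection loop so far, using scikit-learn's KMeans and silhouette_score with a fixed k; the function name and the stopping rule (a target number of features) are just my guesses, and I am not sure it is correct:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def greedy_select(X, k, n_features):
    """Greedy forward selection: grow Sf one feature at a time,
    keeping the feature whose addition gives the best silhouette.
    X is a 2D numpy array (samples x features)."""
    remaining = list(range(X.shape[1]))
    selected = []  # Sf, initially empty
    while len(selected) < n_features:
        best_score, best_f = -np.inf, None
        for f in remaining:
            cols = selected + [f]
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[:, cols])
            score = silhouette_score(X[:, cols], labels)
            if score > best_score:
                best_score, best_f = score, f
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

With 60-70 features this runs k-means on the order of n² times, and picking the best k would add another loop over candidate k values, so I am also worried about the cost.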

2 Answers

You might be interested in this paper, and in a Python implementation of it and of various other feature selection methods for clustering:

http://www.public.asu.edu/~huanliu/papers/pakdd00clu.pdf

https://github.com/danilkolikov/fsfc

An excerpt summarizing the approach:

We address the problem of selecting a subset of important features for clustering, for the whole data and not just for individual clusters, unlike in [1,2]. This helps in knowing the important features before doing clustering, and the clustering task becomes more efficient and focused, as only the important features can be used. Finding the important original features for the whole data helps in understanding the data better, unlike principal components. Data storage, collection, and processing tasks become more efficient, and noise is reduced as the data is pruned.

Our approach is a 2-step method: we first rank and then select a subset of important features. Ranking of features is done according to their importance for clustering; an entropy-based ranking measure is introduced. We then select a subset of features using a criterion function for clustering that is invariant with respect to different numbers of features. A novel scalable method based on random sampling is introduced for large data, commonly found in data mining applications.
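
As a rough sketch of the entropy measure the excerpt refers to (my assumptions: the similarity is S_ij = exp(-alpha * d_ij), with alpha chosen so that the mean pairwise distance maps to a similarity of 0.5, and a feature is ranked as more important if the data becomes more disordered, i.e. has higher entropy, once it is removed; the function names are illustrative and not taken from the linked implementation):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def entropy(X):
    """Entropy of a dataset based on pairwise similarities.
    S_ij = exp(-alpha * d_ij), with alpha chosen so that the mean
    pairwise distance maps to a similarity of 0.5."""
    d = euclidean_distances(X)
    mean_d = d[np.triu_indices_from(d, k=1)].mean()
    alpha = -np.log(0.5) / mean_d
    s = np.exp(-alpha * d)
    eps = 1e-12  # avoid log(0) on the diagonal, where s == 1
    return -np.sum(s * np.log(s + eps) + (1 - s) * np.log(1 - s + eps))

def rank_features_by_entropy(X):
    """Rank features by the entropy of the data with each feature removed:
    removing an important feature should leave the data more disordered."""
    scores = [entropy(np.delete(X, f, axis=1)) for f in range(X.shape[1])]
    return np.argsort(scores)[::-1]  # most important first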

Additionally, this paper gives an overview of different methods available:

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.295.8115&rep=rep1&type=pdf

Answered by Jed on January 12, 2021

People often confuse unsupervised feature selection (UFS) and dimensionality reduction (DR) algorithms, treating them as the same thing. For instance, a famous DR algorithm, Principal Component Analysis (PCA), is often mistaken for a UFS method. Researchers have argued that PCA is a feature extraction algorithm, not feature selection, because it transforms the original feature set into a set of interrelated, transformed features that are difficult to interpret in terms of the original variables (Abdi & Williams, 2010).

A UFS approach present in the literature is Principal Feature Analysis (PFA). It works as follows:

  • Compute the sample covariance matrix or correlation matrix.
  • Compute the principal components and eigenvalues of the covariance or correlation matrix A.
  • Choose the subspace dimension q; this gives the matrix A_q, whose rows are the vectors V_i (one per original feature).
  • Cluster the vectors V_i using k-means.
  • For each cluster, find the vector V_i closest to the cluster mean; the corresponding feature is retained.

A possible Python implementation of PFA is given below:

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from collections import defaultdict
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler
import pandas as pd
# create some dummy data
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
print(df)
class PFA(object):
    def __init__(self, n_features, q=None):
        self.q = q                    # number of principal components to keep
        self.n_features = n_features  # number of features to select

    def fit(self, X):
        if not self.q:
            self.q = X.shape[1]

        sc = StandardScaler()
        X = sc.fit_transform(X)

        pca = PCA(n_components=self.q).fit(X)  # computing the covariance matrix is embedded in PCA
        A_q = pca.components_.T  # rows of A_q are the vectors V_i, one per original feature

        # Cluster the feature vectors V_i
        kmeans = KMeans(n_clusters=self.n_features, n_init=10).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_

        # For each cluster, keep the feature whose vector is closest to the cluster center
        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
            dists[c].append((i, dist))

        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]  # standardized values of the selected features
        
# Usage
pfa = PFA(n_features=3)
pfa.fit(df)
# To get the selected features (standardized values)
x = pfa.features_
print(x)
# To get the column indices of the kept features
column_indices = pfa.indices_
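
Since indices_ holds positional indices, the names of the kept columns can be recovered from the original DataFrame, for example:

# Map the kept positional indices back to the original column names
selected_columns = df.columns[column_indices].tolist()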

Results

        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8
[[-0.50709255  1.73205081  1.23942334]
 [ 0.16903085 -0.57735027 -0.84802649]
 [ 1.52127766 -0.57735027 -1.10895772]
 [-1.18321596 -0.57735027  0.71756088]]

Reference

Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4), 433-459.

Answered by mnm on January 12, 2021
