How to apply distance-based clustering or dimensionality reduction to a very large number of samples

I have a dataset with 200K samples (cases) and 30 variables. Every distance-based clustering or dimensionality reduction method that I try, such as DBSCAN, hierarchical clustering, LLE, and Isomap, fails to run on my machine (I usually get an "R Session Terminated" error) because of the large distance matrix being generated. (Computing all pairwise distances requires O(n^2) time and space.)
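
For scale, here is a rough back-of-the-envelope estimate of the storage involved. It assumes each distance is stored as an 8-byte double and that only the $n(n-1)/2$ unique pairs are kept, which is how R's dist() stores its result:

$$
\frac{n(n-1)}{2} \times 8\ \text{bytes} = \frac{200{,}000 \times 199{,}999}{2} \times 8\ \text{bytes} \approx 1.6 \times 10^{11}\ \text{bytes} \approx 160\ \text{GB}
$$

A full square $n \times n$ matrix would need about twice that, roughly 320 GB.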

Is there any way to handle this problem? Is there a good package in R or Matlab for the clustering or dimensionality reduction methods mentioned above that is suitable for data of this size?

Cross Validated Asked by Matin Kh on December 27, 2020

Maybe you could try Mini-Batch K-Means. I have Matlab code for it:

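function [c,counts,idx] = mbkmeans(x,k,c,counts)
% Mini-batch k-means: one pass over the rows of x, updating the k centroids in c.
% c and counts are optional; pass the outputs of a previous call to continue.
[N,D] = size(x);
if ~exist('c','var') || isempty(c)
    % Initialize centroids from the first rows of x plus a little noise.
    c = x(1:min([k N]),:) + bsxfun(@times,randn(min([k N]),D)*0.001,std(x));
    if N < k
        % Fewer samples than clusters: draw the remaining centroids around the mean.
        c(N+1:k,:) = bsxfun(@plus,mean(x),bsxfun(@times,randn(k-N,D),std(x)));
    end
end
if ~exist('counts','var') || isempty(counts)
    counts = zeros(k,1);
end
idx = knnsearch(c,x,'k',1); % nearest centroid for each sample
for i = 1:N
    j = idx(i);
    counts(j) = counts(j) + 1; % samples assigned to centroid j so far
    lr = 1 / counts(j);        % per-centroid learning rate
    c(j,:) = (1 - lr) * c(j,:) + lr * x(i,:);
end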

Usage:

clusters = mbkmeans(yourdata,numberofclusters);

You may feed it your entire dataset at once and you're done. Or you may feed it smaller subsets. In this case, use it like this:

[c1, counts1] = mbkmeans(subset1,numberofclusters);
[c2, counts2] = mbkmeans(subset2,numberofclusters, c1, counts1); %start clustering using previously created clusters
[c3, counts3] = mbkmeans(subset3,numberofclusters, c2, counts2);
...
[cn, countsn, indices] = mbkmeans(subsetn,numberofclusters, cn_1, countsn_1); %cn_1, countsn_1: outputs of the previous call
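
The chained calls above can also be written as a loop over fixed-size chunks. This is only a minimal sketch: it assumes the mbkmeans function above is on the Matlab path, and x, k and batchSize are illustrative placeholders rather than values from the question.

```matlab
% Sketch: stream a large matrix through mbkmeans in fixed-size chunks.
% Assumes mbkmeans (defined above) is on the path; x, k and batchSize
% are hypothetical placeholders.
x = randn(20000, 30);      % stand-in for the real N-by-D data
k = 5;                     % hypothetical number of clusters
batchSize = 2000;          % hypothetical chunk size

x = x(randperm(size(x,1)), :);   % shuffle so each chunk is a random sample
c = []; counts = [];             % empty inputs make mbkmeans initialize them
for s = 1:batchSize:size(x,1)
    e = min(s + batchSize - 1, size(x,1));
    [c, counts] = mbkmeans(x(s:e,:), k, c, counts);
end

% Final labels: assign every sample to its nearest learned centroid.
idx = knnsearch(c, x, 'k', 1);
```

Only one chunk plus the k centroids is processed per call, so no n-by-n distance matrix is ever built. If the data do not fit in memory, each chunk could instead be read from disk inside the loop.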

For R, there is the stream package (explanation here). You may also take a look at this, this and this.

Answered by rcpinto on December 27, 2020
