
How to get a k-means/k-modes model (using quantiles to scale variables due to non-normally distributed data) to be better suited for non-extreme values in the data?

Data Science Asked by sh_student on April 1, 2021

I want to cluster my data via k-means/k-modes. As the variables in my data are not normally distributed, I am not using the z-transformation to scale them. Instead, I scale my data by categorizing each column by its quantiles (the 0, 0.2, 0.4, 0.6, 0.8 and 1 quantiles) – e.g. if a value lies between the 0 and the 0.2 quantile, it is labelled 1. Here is an example data frame – each column represents percentages (sorry for the long code, but I need a certain number of data points to get a distribution similar to the original data):

mydf <- structure(list(perc1 = c(0.639, 0, 0, 0, 0, 100, 0, 0, 0, 0, 
0, 0, 0, 0, 5.5556, 0, 0, 0, 11.1111, 0, 0, 3.3058, 0, 0, 0, 
0, 0, 0, 0.9901, 0, 0, 2.5641, 0, 16.6667, 0, 0, 0, 0, 0, 0, 
33.3333, 0, 0, 0, 0, 100, 0, 0, 6.25, 8.6957, 11.1111, 0, 0, 
0, 19.0476, 0, 3.8462, 0, 0, 100, 0, 0, 14.2857, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0.2041, 16.6667, 0, 4.878, 15.3846, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 37.5, 0, 0, 0, 0, 0, 0, 100, 0, 0), 
    perc2 = c(1.278, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 88.8889, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.9901, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 62.5, 
    0, 0, 0, 0, 0, 0, 0, 7.6923, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    13.3333, 0, 0, 0, 0, 0, 0, 0.8163, 16.6667, 0, 0, 0, 0, 0, 
    0, 0, 28.5714, 0, 0, 0, 100, 0, 0, 50, 0, 0, 0, 0, 0, 0, 
    0, 0, 0), perc3 = c(97.4441, 0, 0, 0, 0, 0, 68.5185, 0, 0, 
    0, 0, 76.4706, 0, 25, 33.3333, 30.7692, 0, 71.4286, 0, 0, 
    0, 76.0331, 0, 0, 0, 0, 0, 0, 95.5446, 0, 0, 64.1026, 0, 
    0, 92.3077, 88.8889, 0, 66.6667, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 31.5789, 0, 0, 47.619, 97.6077, 46.1538, 0, 
    0, 0, 0, 0, 0, 0, 55.5556, 0, 0, 0, 0, 20, 0, 35.7143, 50, 
    0, 98.6735, 0, 38.4615, 78.0488, 0, 100, 0, 0, 100, 0, 0, 
    100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), perc4 = c(0, 
    30, 50, 0, 0, 0, 5.5556, 40, 35.1351, 100, 0, 0, 16.6667, 
    0, 55.5556, 38.4615, 75, 7.1429, 0, 80, 100, 2.4793, 57.1429, 
    0, 0, 0, 0, 0, 0.495, 0, 0, 17.9487, 100, 25, 7.6923, 0, 
    100, 16.6667, 0, 100, 33.3333, 0, 50, 16.6667, 20, 0, 42.8571, 
    0, 0, 86.9565, 22.2222, 21.0526, 50, 33.3333, 4.7619, 0, 
    19.2308, 0, 71.4286, 0, 50, 25, 42.8571, 40, 11.1111, 100, 
    14.2857, 20, 0, 20, 0, 0, 50, 40, 0, 33.3333, 38.4615, 7.3171, 
    30.7692, 0, 0, 0, 0, 28.5714, 22.2222, 0, 88.8889, 0, 42.1053, 
    0, 12.5, 75, 0, 0, 0, 100, 50, 0, 18.75, 0), perc5 = c(0.639, 
    70, 50, 100, 100, 0, 25.9259, 60, 64.8649, 0, 100, 23.5294, 
    83.3333, 75, 5.5556, 30.7692, 25, 21.4286, 0, 20, 0, 18.1818, 
    42.8571, 100, 100, 100, 100, 100, 1.9802, 100, 100, 15.3846, 
    0, 58.3333, 0, 11.1111, 0, 16.6667, 100, 0, 33.3333, 100, 
    50, 83.3333, 80, 0, 57.1429, 100, 31.25, 4.3478, 66.6667, 
    47.3684, 50, 66.6667, 28.5714, 2.3923, 23.0769, 100, 28.5714, 
    0, 50, 75, 42.8571, 60, 33.3333, 0, 85.7143, 66.6667, 100, 
    60, 100, 64.2857, 0, 60, 0.3061, 33.3333, 23.0769, 9.7561, 
    53.8462, 0, 100, 100, 0, 42.8571, 77.7778, 0, 11.1111, 0, 
    57.8947, 100, 0, 25, 100, 100, 100, 0, 50, 0, 81.25, 100)), class = "data.frame", row.names = c(NA, -100L))

When checking the distributions of the 5 variables, we can see that variables 1-3 consist mostly of zeros and the last variable contains many 100% values:

> quantile(mydf[,1], probs = 0:5/5)
      0%      20%      40%      60%      80%     100% 
  0.0000   0.0000   0.0000   0.0000   1.3049 100.0000 
> quantile(mydf[,2], probs = 0:5/5)
  0%  20%  40%  60%  80% 100% 
   0    0    0    0    0  100 
> quantile(mydf[,3], probs = 0:5/5)
       0%       20%       40%       60%       80%      100% 
  0.00000   0.00000   0.00000   0.00000  39.99996 100.00000 
> quantile(mydf[,4], probs = 0:5/5)
       0%       20%       40%       60%       80%      100% 
  0.00000   0.00000   0.29700  21.52044  50.00000 100.00000 
> quantile(mydf[,5], probs = 0:5/5)
       0%       20%       40%       60%       80%      100% 
  0.00000   0.57242  28.57140  60.00000 100.00000 100.00000 
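The effect of these tied quantiles can be seen directly in `.bincode`: duplicated breaks produce zero-width bins that no value can fall into. A minimal sketch with breaks shaped like those of perc2 and perc5 above (the numbers are illustrative, taken from the quantile output):

```r
# Breaks like perc2's quantiles: five of the six cut points are 0.
# Zeros land in the lowest bin; every positive value lands in bin 5.
b_zero <- c(0, 0, 0, 0, 0, 100)
.bincode(c(0, 50, 100), b_zero, include.lowest = TRUE)  # 1 5 5

# Breaks tied at the top, like perc5 (80% quantile already 100):
# bin 5 is the zero-width interval (100, 100], so even 100 falls in bin 4.
b_top <- c(0, 0.57, 28.57, 60, 100, 100)
.bincode(100, b_top, include.lowest = TRUE)  # 4
```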

Now I scale my variables and use k-modes (with 10 clusters):

require(klaR)
mydf_scaled <- do.call(cbind, lapply(mydf, function(x) {
  # bin each column by its own quantiles; .bincode returns the bin index
  as.character(.bincode(x, quantile(x, probs = 0:5/5), include.lowest = TRUE))
}))
mymodel <- klaR::kmodes(mydf_scaled, modes = 10)

Then I get the following 10 clusters:

> mymodel$modes
   perc1 perc2 perc3 perc4 perc5
1      1     1     1     3     4
2      1     1     5     1     1
3      1     1     1     5     1
4      5     1     1     5     2
5      1     1     1     1     4
6      1     1     1     4     3
7      1     1     4     3     3
8      1     1     5     3     2
9      5     1     5     3     2
10     5     1     1     1     1

The problem is that for perc1 I only get the values 1 or 5, because most of its quantiles are zero, and for perc2 I only get ones, since most values of that variable are zero. For perc5 I never get category 5, because the 80% quantile is already 100.
As a result, I do not get a good differentiation for certain variables. For perc2 there is no differentiation at all, although the non-zero values are exactly what interests me. Similarly for perc1, I would like a more detailed differentiation between the positive values instead of only the two labels 1 and 5 (which only tell me whether a value is zero or positive, rather than how the positive values differ across clusters).
How can I refine my clusters so that they tell me more about how the positive values differ between clusters, without getting a distorted picture of my data? I do not want to delete any data.

One idea I had was to take the quantiles of only the positive values to scale my variables, adding a zero at the start of the breaks to account for the zero values – i.e. the breaks would be 0 followed by the 0.2, 0.4, 0.6, 0.8 and 1 quantiles of the positive values:

mydf_scaled2 <- do.call(cbind, lapply(mydf, function(x) {
  # breaks: 0 for the zero values, then quantiles of the positives only
  as.character(.bincode(x, c(0, quantile(x[x > 0], probs = 1:5/5)), include.lowest = TRUE))
}))
mymodel2 <- klaR::kmodes(mydf_scaled2, modes = 10)

Which returns the following clusters:

> mymodel2$modes
   perc1 perc2 perc3 perc4 perc5
1      1     1     1     5     1
2      2     2     2     2     1
3      1     1     1     1     4
4      1     1     1     4     2
5      1     1     1     3     3
6      1     3     1     3     2
7      2     4     1     1     1
8      1     1     1     3     2
9      1     1     5     1     1
10     3     1     1     2     3

This would give me more detailed information about the non-zero values within my variables. However, I am not sure whether this approach makes sense: does the outcome still represent my data, or does it over-represent the non-zero values because of the different way the quantiles are calculated?
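One way to probe the over-representation worry is to compare how many bins each scheme actually populates on a small synthetic variable (the 70% zeros / 30% uniform positives below are an illustrative assumption, not my real data):

```r
set.seed(42)
# hypothetical zero-inflated variable: 70 zeros, 30 positive percentages
x <- c(rep(0, 70), runif(30, min = 1, max = 100))

# scheme 1: quantiles of the full vector (ties at 0 collapse the low bins)
codes_full <- .bincode(x, quantile(x, probs = 0:5/5), include.lowest = TRUE)

# scheme 2: 0 plus quantiles of the positive values only
codes_pos <- .bincode(x, c(0, quantile(x[x > 0], probs = 1:5/5)),
                      include.lowest = TRUE)

table(codes_full)  # only bins 1, 4 and 5 are populated
table(codes_pos)   # all five bins are used; the zeros share bin 1
```

All five categories being populated under the second scheme is expected by construction; whether that counts as over-representation depends on whether the differences among positive values matter more than the zero/non-zero split.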

Does anyone have an idea how I could tackle this problem (not being able to differentiate between the positive values within my clusters) while still getting clusters that represent my data well? Should I use a different approach to scale my variables? Thanks!
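A related variant I could imagine (a sketch under my own assumptions, not something I have validated): keep the full-vector quantiles but use a finer grid and deduplicate the tied breaks with `unique()`, so each variable gets as many bins as its distribution supports:

```r
# sketch: finer quantile grid with duplicated (tied) breaks removed,
# so zero-inflated variables keep every bin their distribution supports
quantile_bins <- function(x, n = 10) {
  breaks <- unique(quantile(x, probs = 0:n/n))
  .bincode(x, breaks, include.lowest = TRUE)
}

# illustrative zero-inflated vector (not my real data)
x_demo <- c(rep(0, 7), 5, 20, 90)
quantile_bins(x_demo)  # 1 1 1 1 1 1 1 2 3 4
```

The drawback is that the number of categories then differs between variables, which may weight the variables unevenly in the simple-matching distance that k-modes uses.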
