Clustering without information about identifier

Data Science Asked by bazinga on January 18, 2021

I have a dataset of products, with a binary value per store indicating whether the product was sold in that store or not. It looks like:

   product_id  store_1  store_2  store_3  store_4  store_5  store_6
0  A           1        0        0        1        0        1
1  B           1        1        0        0        1        0

Is there any way to cluster these products without any information about the products themselves?

One thought I had was to compute a distance between each pair of products and then cluster the resulting product x product matrix.
Is this problem sort of similar to market basket analysis?

Any guidance will help.

One Answer

1) About clustering the products with no product information

Clustering is always possible, but the result is not always what you expect

You can cluster the products based on your dataset using the Hamming distance (valid if values can only be 0 or 1), which counts the shops in which exactly one of the two products is sold. Two products separated by a small distance are products that tend to be sold in the same shops. However, there is no assurance that these products will be similar in terms of their characteristics.
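As a minimal sketch of the distance itself, the snippet below computes the Hamming distance between the two rows shown in the question, using only the Python standard library. The function name `hamming` and the `products` dictionary are illustrative assumptions, and the distance is returned as a raw count of differing stores rather than a fraction.

```python
def hamming(u, v):
    """Count the positions (stores) where two binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))

# Rows from the question: 1 = sold in that store, 0 = not sold.
# The dictionary layout is an assumption made for this sketch.
products = {
    "A": [1, 0, 0, 1, 0, 1],
    "B": [1, 1, 0, 0, 1, 0],
}

d_ab = hamming(products["A"], products["B"])
print(d_ab)  # number of stores where exactly one of A, B is sold
```

A small value means the two products share most of their stores; a value equal to the number of stores means they never appear in the same one.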

Example

  • expensive ham (A) & cheap ham (B)
  • expensive cheese (C) & cheap cheese (D)

It is likely that shops located in wealthy neighborhoods (Shop1) will sell expensive ham and expensive cheese. It is also likely that shops in modest neighborhoods (Shop2) will sell more cheap cheese and cheap ham.

     Shop1  Shop2
A    1      0      (Exp. Ham)
B    0      1      (Cheap Ham)
C    1      0      (Exp. Cheese)
D    0      1      (Cheap Cheese)

Based on this:

  • A and C are very similar (distance = 0)
  • B and D are very similar (distance = 0)

So the clustering algorithm will (in my example) cluster products not by their descriptions (ham vs. cheese) but by their prices, because of the shops' locations in this case.
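The ham and cheese example can be reproduced with a short sketch. The grouping rule used here (merge any two products whose Hamming distance is at most `max_dist`, which amounts to single-linkage clustering with a fixed cut) is an assumption chosen for simplicity, not a method named in this answer; `cluster_by_threshold` and `max_dist` are hypothetical names.

```python
from itertools import combinations

def hamming(u, v):
    """Count the positions (shops) where two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def cluster_by_threshold(vectors, max_dist=0):
    """Group products linked by a chain of pairs with distance <= max_dist."""
    parent = {name: name for name in vectors}

    def find(x):
        # Follow parent links to the representative of x's group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for p, q in combinations(vectors, 2):
        if hamming(vectors[p], vectors[q]) <= max_dist:
            parent[find(p)] = find(q)  # merge the two groups

    groups = {}
    for name in vectors:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())

# Columns are Shop1, Shop2, as in the example table.
shops = {
    "A": [1, 0],  # expensive ham
    "B": [0, 1],  # cheap ham
    "C": [1, 0],  # expensive cheese
    "D": [0, 1],  # cheap cheese
}

clusters = cluster_by_threshold(shops, max_dist=0)
print(clusters)  # [['A', 'C'], ['B', 'D']]
```

The two groups line up with price (expensive vs. cheap), not with product type, which is the point of the example.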

I have highlighted the price-shop relation, but other factors can have the same effect: for example, some shops may be clothing-only stores, so the products sold in them will likely be clothing.

Conclusion: you will discover clusters of products that are sold in the same stores but it will be difficult to understand why without further information on the shops.

2) About the (Product x Product) distance matrix clustering

Each entry of the resulting matrix corresponds to a pair of products (AB, BC, DA, etc.), and its value is the distance between the two products of that pair. Clustering these entries would therefore result in clusters of pairs that have similar distances. I don't think this makes sense.
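To make the object concrete, here is a minimal sketch that builds the (Product x Product) matrix for the ham and cheese example from part 1. The variable names are illustrative assumptions; the only point is that each cell describes a pair of products, not a single product.

```python
def hamming(u, v):
    """Count the positions (shops) where two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))

# Columns are Shop1, Shop2, as in the example table from part 1.
shops = {
    "A": [1, 0],
    "B": [0, 1],
    "C": [1, 0],
    "D": [0, 1],
}

names = sorted(shops)
# matrix[i][j] is the distance between products names[i] and names[j].
matrix = [[hamming(shops[p], shops[q]) for q in names] for p in names]

# Flatten the upper triangle into (pair, distance) entries.
pairs = {
    names[i] + names[j]: matrix[i][j]
    for i in range(len(names))
    for j in range(i + 1, len(names))
}

for name, row in zip(names, matrix):
    print(name, row)
print(pairs)
```

The matrix is symmetric with zeros on the diagonal, so only the entries for distinct pairs (AB, AC, and so on) carry information.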

Hope this will help.

Answered by Rusoiba on January 18, 2021
