TransWikia.com

What is the best to identify the proper hierarchy of this data?

Data Science Asked by Steven Cunden on August 25, 2021

So I worked on a hierarchical clustering algorithm to be able to determine which items are most similar, and what attributes are most important. I have two tables:

Table 1: contains a bunch of item codes, and it’s attribute (brand, flavor, sales, and so on). It looks something like:

Item_code | Brand | Flavor   | Caloric_content | ... | sales
006891313 | Coke  | Original | 0               | ....
002349823 | Fanta | Orange   | 200             | ...

The other table that I have is what i run my clustering algorithm on. It’s an NxN matrix, (where N is the number of distinct item_codes in the previous table) An entry [i,j] in the matrix, corresponds to the number of times j was bought after i was purchased on a previous trip. So more clearly, in the matrix below, what the number 1223 means is that, 1223 times after item 003428734 was purchase on an initial trip to the store, item 003428734 was purchased on the next trip.

          |003428734 |009849328 | 09840202 |....
003428734 |  1223    |    13    |    0     |
009849328 |    12    |   945    |    34    | ....
   .
   .
   .

I apply a hierchical clustering on that matrix, using Ward 2, squared euclidean distance and standard Z scores. The final output is a dendrogram, with all the item_codes on the branches of the dendrogram.

This is where the tedious process is. The ultimate goal, is to have a hierarchy for our products, and see which attribute (brand, flavor, size…) falls where on the dendrogram. The only way i can think of doing it, is organizing the item codes in the first in the same order that they are in the dendrogram, put a picture of the dendrogram side by side, and eyeball which attribute clusters the most. You’ll see in the link below, it looks like brand A and B are clustering together, so I would say that products cluster by brand first. I would then take a closer look at the smaller cluster within the dendrogram and try to identify where the other attributes fit in the hierarchy (the picture is just a snippet of the dendrogram, it is in reality much, much bigger)

dendrogram

This eye balling process is very tedious and annoying. Is there a way to run a similar cluster, which shows a hierarchy within the products, as well as indicate which attribute fit where in that hierarchy? Or is this even the best way to approach this?

I should mention that I’m very new to these clustering algorithm, so apologies if this is a dumb question

One Answer

I don't think your clustering makes sense in the first place. That you actually get some output is more of a coincidence than a result. Just invoking some functions in a way they don't fail is not enough to get a trustworthy result...

Your matrix is vaguely resembling a similarity matrix, except that it is not symmetric, and the diagonald are not necessarily the maximum values.

Now Ward usually needs a squared Euclidean distance matrix, something very different. The reason why your program doesn't fail supposedly is because it now computes such a distance matrix based on the rows from your similarity matrix. That will usually give somewhat okay looking results (as similar items should have similar rows) but causes subtly problematic bias based on the number of product types.

Later on you can evaluate each hierarchy level based on the agreement with some known label (e.g., manufacturer or product class) and you then can argue which agreement comes earlier. But on a result that isn't valid this doesn't make much sense.

Answered by Has QUIT--Anony-Mousse on August 25, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP