TransWikia.com

Segmentation using cluster analysis in SPSS

Cross Validated Asked by desperate-about-statistics on February 11, 2021

I am doing a segmentation project and am struggling with cluster analysis in SPSS right now. Could you please help me with the following question:

How do I determine the quality of the clustering in SPSS?

Many articles and tutorials I've read advise running a hierarchical clustering first, to determine the number of clusters from the agglomeration schedule and a dendrogram, and then running k-means clustering. Suppose the results of the first step are not clear and I am torn between a 4-cluster and a 5-cluster solution. I can try both with k-means, but how do I tell which one is best?
The same goes for any rerun of the k-means procedure, since the output is slightly different every time.

Thanks a lot for any info!
I would also be grateful for links to any good tutorials on cluster analysis in SPSS. What I've found so far is very random and limited… Articles on cluster analysis are not enough for me, because they don't tell me how to run the tests they mention in SPSS :-[

2 Answers

Some users report getting insight from silhouette plots. You can read about them here http://en.wikipedia.org/wiki/Silhouette_(clustering).

Silhouette plots are available in SPSS Statistics via the STATS CLUS SIL extension command, which can be obtained from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral). It requires the Python Essentials, available through that site or with your Statistics installation materials (installed automatically with Statistics V22 or later).
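To make the idea concrete, here is a minimal sketch of how a silhouette width is computed, written in plain Python rather than with STATS CLUS SIL. The function names and the toy data are made up for illustration, and plain Euclidean distance is an assumption. For each case, a is the mean distance to the other cases in its own cluster, b is the lowest mean distance to the cases of any other cluster, and the width is (b - a) / max(a, b). Values near 1 indicate a well-placed case; values below 0 suggest it sits closer to another cluster.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_widths(points, labels):
    """Return one silhouette width per case, in the original case order."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouettes need at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # usual convention for a one-case cluster
            continue
        # a: mean distance to the other cases in the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: lowest mean distance to the cases of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def mean_silhouette(points, labels):
    """Average silhouette width over all cases."""
    widths = silhouette_widths(points, labels)
    return sum(widths) / len(widths)


# Toy data: two obvious groups, labelled sensibly and then badly.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
sensible = [0, 0, 0, 1, 1, 1]
scrambled = [0, 1, 0, 1, 0, 1]

print("sensible: ", round(mean_silhouette(points, sensible), 3))
print("scrambled:", round(mean_silhouette(points, scrambled), 3))
```

For the 4 versus 5 cluster question, you could save the cluster membership from each k-means run and compare the average widths of the two solutions, keeping in mind that this is one indication among others.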

Answered by JKP on February 11, 2021

In my opinion, the quality of your cluster solution is inherently subjective. There's no right or wrong but rather "makes sense and is actionable for us" or "doesn't make sense, is not actionable".

First, you should know that all three clustering algorithms in SPSS (hierarchical, k-means and TwoStep) are affected by the order of your cases. This really raises questions about the soundness of the procedure altogether. But given that there's randomness in the result anyway, you may just as well capitalize on it, as long as you're transparent about what you did and your reasoning behind it.

You can run the same clustering syntax repeatedly (say 10 times), randomly reordering the cases between runs. You can automate this with Python; an example that comes close is "Regression over Many Dependent Variables". You'll then have 10 different solutions whose quality you can compare, and you can choose the one you find best.
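The reorder-and-rerun idea can be sketched in plain Python, outside SPSS. This is only an illustration under stated assumptions: `kmeans` is a bare-bones stand-in that takes the first k cases as initial centers (which makes it order-dependent, but it is not the SPSS algorithm), `best_of_reordered_runs` is a hypothetical helper name, and ranking runs by within-cluster sum of squares is my assumption about what "quality" could mean here, not a rule.

```python
import random


def kmeans(points, k, max_iter=100):
    """Bare-bones k-means seeded with the first k cases (order-dependent)."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every case to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Within-cluster sum of squares: lower means tighter clusters.
    wss = sum(
        sum((x - m) ** 2 for x, m in zip(p, centers[lab]))
        for p, lab in zip(points, labels)
    )
    return labels, wss


def best_of_reordered_runs(points, k, runs=10, seed=1):
    """Shuffle the case order before each run; return runs sorted by WSS."""
    rng = random.Random(seed)
    results = []
    for run in range(runs):
        order = list(range(len(points)))
        rng.shuffle(order)
        labels, wss = kmeans([points[i] for i in order], k)
        # Map the labels back to the original case order.
        original = [None] * len(points)
        for pos, i in enumerate(order):
            original[i] = labels[pos]
        results.append({"run": run, "wss": wss, "labels": original})
    return sorted(results, key=lambda r: r["wss"])


# Toy data: three blobs of 20 cases each.
rng = random.Random(0)
data = [
    (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
    for cx, cy in [(0, 0), (4, 0), (2, 4)]
    for _ in range(20)
]
results = best_of_reordered_runs(data, k=3)
for r in results[:3]:
    print("run", r["run"], "WSS", round(r["wss"], 2))
```

Fixing the seed keeps the reordering reproducible, which helps with being transparent about what you did.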

My personal opinion is that two pieces of information matter:

  1. How are the cases distributed over the clusters? You usually don't want very small or very large clusters.
  2. Can you describe what the clusters mean? For this you may run means plots of the input variables by the cluster variable.

If these results make sense (they're explicable) and they're actionable, I'd say you have a good solution.
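Both checks amount to a small summary table: cluster sizes, plus the mean of each input variable per cluster. A minimal sketch in plain Python, assuming the cluster membership has been saved as a variable and exported alongside the inputs; `describe_clusters` and the variable names `price_sensitivity` and `brand_loyalty` are invented for the example.

```python
from collections import defaultdict


def describe_clusters(rows, labels, variables):
    """Return size, share of cases, and per-variable means for each cluster."""
    groups = defaultdict(list)
    for row, lab in zip(rows, labels):
        groups[lab].append(row)

    total = len(rows)
    summary = {}
    for lab in sorted(groups):
        members = groups[lab]
        summary[lab] = {
            "n": len(members),
            "share": len(members) / total,
            "means": {v: sum(r[v] for r in members) / len(members) for v in variables},
        }
    return summary


# Toy data: five cases, two input variables, saved cluster membership.
rows = [
    {"price_sensitivity": 1, "brand_loyalty": 5},
    {"price_sensitivity": 2, "brand_loyalty": 4},
    {"price_sensitivity": 3, "brand_loyalty": 3},
    {"price_sensitivity": 4, "brand_loyalty": 1},
    {"price_sensitivity": 6, "brand_loyalty": 2},
]
labels = [1, 1, 1, 2, 2]

summary = describe_clusters(rows, labels, ["price_sensitivity", "brand_loyalty"])
for lab, info in summary.items():
    print(f"cluster {lab}: n={info['n']} ({info['share']:.0%}) means={info['means']}")
```

A cluster holding only a few percent of the cases, or two clusters with nearly identical means, would be a reason to look at the other solution.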

Answered by RubenGeert on February 11, 2021
