TransWikia.com

Similarity matching between two distinct datasets (marketing case study)

Data Science Asked on January 2, 2021

I am working for a company that sells different products to customers. My objective is to find customers that are likely to purchase product X based on the profiles of customers that already purchased product X.

My first idea was:

  • to collect relevant variables for customers that already purchased product X (dataset A)
  • perform a cluster analysis of this dataset to generate customer personas for dataset A
  • collect the same variables for customers that have not purchased product X (dataset B)
  • and finally measure the distance from each customer in dataset B to the centroids or medoids of the clusters generated from dataset A

Unfortunately, this is less straightforward than I thought:

  • For one, I would have to cluster mixed categorical and numerical data. I would therefore compute the Gower distance to get a dissimilarity matrix between the data points of dataset A, and then cluster that matrix with PAM (partitioning around medoids). I do not know how to use the data points of dataset B to infer a distance to the PAM medoids, because those medoids relate to the dissimilarity matrix of dataset A rather than to the actual data points.
  • Secondly, the generated clusters of dataset A are less descriptive than expected.
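One point that may help with the first issue: the medoids PAM returns are actual observations of dataset A, so the Gower distance from a row of dataset B to each medoid row can be computed directly. The following is a minimal pure-Python sketch under stated assumptions, not a definitive implementation. The column names, the toy records, and the medoid indices are all hypothetical. Numeric ranges are taken from dataset A so that the scaling matches the one used for clustering, and clipping each numeric term at 1 for out-of-range values of B is a choice made for this sketch, not part of the standard definition.

```python
def gower_distance(x, y, numeric_ranges):
    """Gower dissimilarity between two records given as dicts.

    Columns listed in numeric_ranges are numeric and are scaled by their
    range; every other column is treated as categorical (0 if equal, else 1).
    """
    total = 0.0
    for col in x:
        if col in numeric_ranges:
            rng = numeric_ranges[col]
            term = abs(x[col] - y[col]) / rng if rng > 0 else 0.0
            # Assumption of this sketch: clip values outside dataset A's range.
            total += min(term, 1.0)
        else:
            total += 0.0 if x[col] == y[col] else 1.0
    return total / len(x)


def nearest_medoid(record, medoids, numeric_ranges):
    """Return (index, distance) of the medoid closest to the record."""
    distances = [gower_distance(record, m, numeric_ranges) for m in medoids]
    best = min(range(len(medoids)), key=lambda i: distances[i])
    return best, distances[best]


# Hypothetical dataset A (customers who purchased product X).
dataset_a = [
    {"age": 20, "income": 20000, "segment": "student"},
    {"age": 25, "income": 30000, "segment": "student"},
    {"age": 55, "income": 90000, "segment": "professional"},
    {"age": 60, "income": 100000, "segment": "professional"},
]

# Assume PAM selected rows 1 and 2 of dataset A as medoids (hypothetical).
medoids = [dataset_a[1], dataset_a[2]]

# Ranges come from dataset A, the data the clustering was fitted on.
numeric_cols = ["age", "income"]
numeric_ranges = {
    c: max(r[c] for r in dataset_a) - min(r[c] for r in dataset_a)
    for c in numeric_cols
}

# One hypothetical customer from dataset B (has not purchased product X).
customer_b = {"age": 30, "income": 40000, "segment": "student"}
cluster_index, distance = nearest_medoid(customer_b, medoids, numeric_ranges)
print(cluster_index, round(distance, 4))
```

Customers in dataset B could then be ranked by their distance to the nearest medoid, with the smallest distances indicating the closest resemblance to an existing buyer persona.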

In conclusion, I would like to have a second opinion. Is the way I described really a good way to tackle the problem at hand? Or do you have other ideas?

Would be happy for your input – best wishes.

One Answer

If I understand your question correctly, you have two groups of people: Group A, who have each purchased the product (say, yogurt), and Group B, who have not. Your problem is to find the people in Group B who are likely to purchase your yogurt because their profiles are similar to those of people in Group A.

This resembles a very common problem in causal inference: matching treated people with a control group. Because no one can be both treated and untreated, we look for people on each side who are similar in their characteristics (variables), so that the two groups are comparable and a causal inference can be drawn.

Returning to your problem, I don't think clustering is necessary for the matching. Instead, you could consider the matching approach commonly used in causal inference. One R package that comes to mind is MatchIt. In essence, you would treat Group A as the treatment group and Group B as the control group. I believe it provides several matching algorithms, so you can try them out and see which one works best for you.
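To make the matching idea concrete, here is a minimal sketch in plain Python of greedy 1:1 nearest-neighbor matching without replacement on a single score. It is not MatchIt's API and makes no claim to reproduce its behavior. The function name, the customer ids, and the scores are hypothetical. The scores are assumed to come from a model fitted elsewhere, for example an estimated probability of belonging to Group A.

```python
def greedy_nearest_neighbor_match(treated_scores, control_scores, caliper=None):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Both arguments map a customer id to a numeric score. Treated units are
    processed in dict order (an assumption of this sketch); each one takes
    the closest control unit that is still unmatched. If caliper is given,
    pairs whose scores differ by more than the caliper are left unmatched.
    Returns a dict mapping treated id to matched control id.
    """
    available = dict(control_scores)  # copy so the caller's dict is untouched
    matches = {}
    for t_id, t_score in treated_scores.items():
        if not available:
            break  # no control units left to match
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if caliper is not None and abs(available[c_id] - t_score) > caliper:
            continue  # closest remaining control is too far away
        matches[t_id] = c_id
        del available[c_id]  # without replacement: use each control once
    return matches


# Hypothetical scores: Group A bought the product, Group B did not.
group_a = {"a1": 0.80, "a2": 0.55}
group_b = {"b1": 0.78, "b2": 0.20, "b3": 0.60}

matches = greedy_nearest_neighbor_match(group_a, group_b, caliper=0.1)
print(matches)  # {'a1': 'b1', 'a2': 'b3'}
```

In this toy example the matched members of Group B (b1 and b3) are the non-buyers whose scores sit closest to those of existing buyers, which makes them the natural candidates for targeting. The unmatched b2 is the least similar.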

Answered by Mark.X on January 2, 2021
