TransWikia.com

Evaluate document similarity / content-based recommender system

Data Science Asked on January 7, 2021

I'm planning to build a basic content-based recommender system using word2vec and cosine similarity.
The data consists of 300k documents of varying length.

How do I evaluate my model if I have no labels / categories whatsoever?

2 Answers

If you're trying to create a content-based document recommender system, you want to measure success with a ranking metric such as precision@k, i.e. the fraction of the top k recommended documents that are actually relevant.

But since you don't have user-document interaction histories, you'll either have to create relevance judgments yourself, or just run a bunch of document queries and check whether the results make sense.

If you're going to create the judgments yourself, I would run 10-20 queries, go through the first 5 documents returned for each, and label whether or not each one matches. Calculate precision@k (here k = 5) for those results and you have a rough idea of how you're doing.
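As a minimal sketch of that calculation: the function names, document IDs, and judgments below are all made up for illustration, and the code assumes you have stored each query's ranked results alongside the set of results you hand-labelled as a match.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=5):
    """Fraction of the top-k returned documents that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    # Dividing by k (not len(top_k)) penalises queries that return fewer than k results.
    return hits / k


def mean_precision_at_k(results, judgments, k=5):
    """Average precision@k over all labelled queries."""
    scores = [precision_at_k(results[q], judgments[q], k) for q in results]
    return sum(scores) / len(scores)


# Hypothetical hand-labelled data: query document -> ranked results,
# and the subset of those results a human judged to be a good match.
results = {
    "doc_a": ["d1", "d2", "d3", "d4", "d5"],
    "doc_b": ["d6", "d7", "d8", "d9", "d10"],
}
judgments = {
    "doc_a": {"d1", "d3", "d4"},
    "doc_b": {"d6"},
}

print(mean_precision_at_k(results, judgments, k=5))
```

With only 10-20 labelled queries the number is noisy, so it is probably best read as a rough comparison between models rather than an absolute score.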

Not sure if you're familiar with ranking metrics, but they're most meaningful when you compare against some baseline model. In your case, I would calculate precision@k for BoW, TF-IDF, LSA, and LDA representations with cosine similarity as baselines to compare your word2vec model against.
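To make one of those baselines concrete, here is a rough sketch of TF-IDF with cosine similarity. It is hand-rolled with the standard library only so that it is self-contained; in practice you would more likely use a library vectorizer. The toy corpus, the smoothed IDF formula, and the helper names are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Build sparse TF-IDF vectors (dicts of term -> weight), one per document."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed IDF, so terms that occur in every document keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append({t: c * idf[t] for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(query_index, vectors, k=5):
    """Indices of the k documents most similar to the query document."""
    scored = [
        (cosine(vectors[query_index], vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy corpus (made up for illustration), already tokenized.
docs = [
    "cats and dogs are popular pets".split(),
    "dogs and cats make good pets".split(),
    "stock markets fell sharply today".split(),
    "markets rallied as stock prices rose".split(),
]
vectors = tfidf_vectors(docs)
print(top_k_similar(0, vectors, k=2))
```

The ranked indices returned by `top_k_similar` are what you would label by hand and score with precision@k, once per model, so that every model is judged on the same queries.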

Unfortunately, there aren't a ton of other options for evaluating content recommendation without interaction data to test on. But I would also add that just eyeballing the results will often tell you how the model is doing.

Correct answer by mkerrig on January 7, 2021

When you don't have labels or categories, the problem is one of unsupervised learning. You can approach it with a Latent Dirichlet Allocation (LDA) model and then evaluate the model by splitting each text in half and comparing the topic assignments of the two halves using cosine similarity. The more similar the topic assignments, the better.
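A minimal sketch of that split-half check might look like the following. It is written to be model-agnostic: `infer_topics` is a hypothetical callable standing in for whatever your fitted LDA model uses to turn a list of tokens into a topic distribution. The keyword-counting stand-in below is not LDA; it is only there so the example runs on its own.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_half_consistency(tokenized_docs, infer_topics):
    """Mean cosine similarity between the topic vectors of each document's two halves."""
    scores = []
    for tokens in tokenized_docs:
        mid = len(tokens) // 2
        first, second = tokens[:mid], tokens[mid:]
        if not first or not second:
            continue  # document too short to split
        scores.append(cosine(infer_topics(first), infer_topics(second)))
    return sum(scores) / len(scores) if scores else float("nan")


# Placeholder "topic model": counts hits against two hand-picked keyword sets.
# This is NOT LDA; swap in your fitted model's inference call here.
TOPIC_KEYWORDS = [{"cat", "dog", "pet"}, {"stock", "market", "price"}]


def toy_infer_topics(tokens):
    """Return a normalised 2-topic distribution from keyword counts."""
    counts = [sum(tok in kw for tok in tokens) + 1e-9 for kw in TOPIC_KEYWORDS]
    total = sum(counts)
    return [c / total for c in counts]


# Toy documents (made up for illustration), already tokenized.
docs = [
    "cat dog pet cat pet dog".split(),
    "stock market price market stock price".split(),
]
print(split_half_consistency(docs, toy_infer_topics))
```

A score near 1 means both halves of a document tend to land on the same topics. Note that this measures how stable the topic assignments are, not directly whether the recommendations are good.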


Answered by prashant0598 on January 7, 2021
