
Clustering of documents using the topics derived from Latent Dirichlet Allocation

Data Science Asked by Swan87 on March 4, 2021

I want to use Latent Dirichlet Allocation for a project, and I am using Python with the gensim library. After finding the topics I would like to cluster the documents using an algorithm such as k-means (ideally I would like to use a good one for overlapping clusters, so any recommendation is welcome). I managed to get the topics, but they are in the form of:

0.041*Minister + 0.041*Key + 0.041*moments + 0.041*controversial + 0.041*Prime

In order to apply a clustering algorithm (and correct me if I’m wrong), I believe I should find a way to represent each word as a number, using either tf-idf or word2vec.

Do you have any ideas on how I could “strip” the textual information from, e.g., a list in order to do so, and then put it back so as to make the appropriate multiplication?

For instance, the way I see it, if the word Minister has a tf-idf weight of 0.042, and so on for every other word within the same topic, I should be able to compute something like:

0.041*0.042 + … + 0.041*tfidf(Prime), and get a result that can later be used to cluster the documents.

Thank you for your time.

3 Answers

Assuming that LDA produced a list of topics and put a score against each topic for each document, you could represent the document and its scores as a vector:

Document | Prime | Minister | Controversial | TopicN | ...
   1     | 0.041 | 0.042    | 0.041         | ...    |
   2     | 0.052 | 0.011    | 0.042         | ...    |

To get the scores for each document, you can run the document, as a bag of words, through a trained LDA model. From the gensim documentation:

>>> from gensim.models import LdaModel
>>> lda = LdaModel(corpus, num_topics=100)  # train the model on a bag-of-words corpus
>>> print(lda[doc_bow])  # get the topic probability distribution for a document
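
To run a clustering algorithm on these scores, it helps to collect them into a dense matrix with one row per document and one column per topic. A minimal sketch of one way to build it, assuming corpus is the same bag-of-words corpus used to train lda above (the variable names here are illustrative):

import numpy as np

num_topics = lda.num_topics
doc_topic = np.zeros((len(corpus), num_topics))
for i, doc_bow in enumerate(corpus):
    # minimum_probability=0.0 keeps every topic, so each row is a full distribution
    for topic_id, prob in lda.get_document_topics(doc_bow, minimum_probability=0.0):
        doc_topic[i, topic_id] = prob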

Then you could run k-means on this matrix, and it should group similar documents together. K-means is by default a hard clustering algorithm, meaning it assigns each document to exactly one cluster. You could instead use a soft clustering mechanism that gives a probability score that a document fits within each cluster; this is called fuzzy k-means. https://gist.github.com/mblondel/1451300 is a Python gist showing how you can do it with scikit-learn.
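
For example, with scikit-learn (the cluster counts below are placeholders you would tune, and the Gaussian mixture is just one soft alternative to the fuzzy k-means gist above, not the same algorithm):

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Hard clustering: each document is assigned to exactly one cluster.
km = KMeans(n_clusters=10, random_state=0)
hard_labels = km.fit_predict(doc_topic)

# Soft clustering: membership probabilities for each document in every cluster.
gmm = GaussianMixture(n_components=10, random_state=0)
gmm.fit(doc_topic)
soft_memberships = gmm.predict_proba(doc_topic)  # shape (num_docs, 10)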


Answered by Ash on March 4, 2021

Complementary to the previous answer: you should not just run k-means directly on the compositional data derived from the LDA topic-document distribution. Instead, first apply a compositional data transformation, such as the isometric log-ratio (ILR) or centered log-ratio (CLR) transform, to project the data into Euclidean space.

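As a rough sketch of what the CLR transform could look like on the document-topic matrix from the previous answer (the epsilon pseudo-count is an assumption to avoid log(0) on zero-probability topics; ILR would additionally require a choice of orthonormal basis and is omitted here):

import numpy as np

def clr(doc_topic, eps=1e-12):
    x = doc_topic + eps
    x = x / x.sum(axis=1, keepdims=True)               # re-close each row to a composition
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)   # centre by the row's log geometric mean

doc_topic_clr = clr(doc_topic)
# k-means (or any other Euclidean-distance method) can then be run on doc_topic_clr.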

Answered by Anestis Fachantidis on March 4, 2021

Another approach would be to use the document-topic matrix obtained by training the LDA model, extract for each document the topic with the maximum probability, and let that topic be the document's label.

This will give results that are interpretable to the degree that your topics are.
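
A minimal sketch, reusing the dense doc_topic matrix built in the first answer:

import numpy as np

# labels[i] is the index of the most probable topic for document i
labels = np.argmax(doc_topic, axis=1)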

Answered by josescuderoh on March 4, 2021
