Cosine Similarity Intuition

Cross Validated. Asked by ccb on January 6, 2021

I understand what cosine similarity is and how to calculate it, specifically in the context of text mining (i.e. comparing tf-idf document vectors to find similar documents). What I’m looking for is some better intuition for interpreting the results/similarity scores I come up with.

My question: If I have a cosine similarity of less than 0.707 (i.e. an angle greater than 45 degrees), is it fair to say that those respective documents/vectors are more “different” (less “similar”), since the angle between them is closer to orthogonal? My initial thought was ‘yes,’ but in practice so far it doesn’t seem like that’s the right way to read the numbers.
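For concreteness, here is a minimal sketch of the computation being asked about, using scikit-learn's TfidfVectorizer on toy documents (both the library and the data are my choices; the question doesn't name either):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the cat sat on the mat",
        "the cat lay on the rug",
        "stock prices fell sharply today",
    ]

    # tf-idf vector for each document (rows of a sparse matrix)
    X = TfidfVectorizer().fit_transform(docs)

    # pairwise cosine similarities; the diagonal is 1.0
    sims = cosine_similarity(X)
    print(np.round(sims, 3))

    # the angle implied by the similarity of doc 0 and doc 2
    angle = np.degrees(np.arccos(np.clip(sims[0, 2], -1.0, 1.0)))
    print(f"angle between doc 0 and doc 2: {angle:.1f} degrees")

Note that tf-idf weights are nonnegative, so the similarity of two tf-idf vectors can never drop below 0 and the angle is confined to [0, 90] degrees.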

3 Answers

The answer mentioned here is correct. Cosine distance as a measure of similarity only makes sense under some specific assumptions:

  1. That it is possible to represent multiple complex objects as commensurable entities.
  2. That we can use quantitative methods to find qualitative answers, e.g. that we can measure similarity with a number.

One can argue that these assumptions could serve as guiding principles for a wide range of measures, and indeed they do. But cosine similarity is more popular because it matches well with our spatial intuition and common sense. If you asked someone completely uninvolved to state the similarity of two documents on a scale from -1 to 1, their answer would probably be close to what cosine similarity gives as well.

There are other factors that also make cosine similarity a good choice. When thinking about the similarity of two documents, we do not care about the order of the words or other specific grammatical constructs. Cosine similarity on word-count or tf-idf vectors, you will notice, does not take these factors into account, and so it captures that essence.
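A quick way to see this order-blindness (a toy sketch; CountVectorizer is my choice here, not something from the original answer):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # same words, opposite meaning: a bag-of-words cosine cannot tell them apart
    docs = ["dog bites man", "man bites dog"]
    X = CountVectorizer().fit_transform(docs)
    print(cosine_similarity(X)[0, 1])  # 1.0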

Lastly, another advantage of cosine similarity is that high-dimensional spaces, such as those of text embeddings, are essentially non-intuitive to humans; we cannot grasp them directly. To understand meaning in such a space we need some way of mapping the high-dimensional comparison down to something simple, and cosine similarity reduces each pair of vectors to a single number, which helps the researcher make sense of the vector space.

So to answer your question: the absolute value of a cosine similarity does not mean much by itself. It only makes sense when you are comparing multiple candidates. If you started with "king", then chances are the text is talking about a "man" rather than a "woman" or a "bird", based on the relative cosine similarities. The similarity between "man" and "king" alone has no value by itself.
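To illustrate the relative nature of the scores, here is a toy sketch; the vectors are invented purely for illustration, where real ones would come from a trained embedding model:

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # made-up 3-d "embeddings", purely illustrative
    vecs = {
        "king":  np.array([0.9, 0.8, 0.1]),
        "man":   np.array([0.8, 0.3, 0.2]),
        "woman": np.array([0.7, 0.1, 0.6]),
        "bird":  np.array([0.1, 0.2, 0.9]),
    }

    # the absolute numbers mean little; the ranking is the useful signal
    for word in ("man", "woman", "bird"):
        print(word, round(cosine(vecs["king"], vecs[word]), 3))

With these toy vectors the order comes out man > woman > bird, even though none of the individual scores means anything on its own.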

Answered by joydeep bhattacharjee on January 6, 2021

I believe another difference from the plain TF-IDF approach is that cosine similarity can also be computed in an embedding space, such as one created by doc2vec.

Such an embedding puts words that are used in similar contexts near to each other, so you could use clustering to find similar documents. But cosine distance probably makes more sense for a couple of reasons:

  1. An embedding like doc2vec encodes information in direction and distance. Look at the classic example of king - man + woman yielding queen. I'd guess that direction dominates this comparison.

  2. In high-dimensional spaces, "nearby" (distance) can begin to lose its meaning (see the sketch after this list), so directional measures -- which are also by definition bounded and determined a priori -- might make more sense if the "inner product space" supports it. (I threw the last part in there not totally understanding what an "inner product space" is, but it sounds cool and it is related... I just couldn't explain how.)
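Point 2 can be seen numerically (a quick sketch, not from the original answer): as the dimension grows, the distances from a random point to all the others bunch together, so "nearest" and "farthest" become nearly the same.

    import numpy as np

    rng = np.random.default_rng(0)

    # the ratio of farthest to nearest neighbour distance shrinks toward 1
    # as the dimension d grows, i.e. raw distance becomes less informative
    for d in (2, 10, 1000):
        pts = rng.normal(size=(500, d))
        dists = np.linalg.norm(pts - pts[0], axis=1)[1:]
        print(d, round(float(dists.max() / dists.min()), 2))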

So, given that, I'd say that the idea of "orthogonality" isn't meaningful here. Two documents are either together in a smaller wedge of the space or a larger wedge of the space and that's that: 100 degrees apart is farther apart than 90 degrees, and 80 degrees apart is closer.
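As a concrete version of the doc2vec setup mentioned above, here is a minimal sketch, assuming gensim (the answer names doc2vec but no library, and the corpus is a toy one):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = [
        "the cat sat on the mat",
        "the cat lay on the rug",
        "stock prices fell sharply today",
    ]
    tagged = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(corpus)]

    # a tiny model purely for illustration; real use needs far more data
    model = Doc2Vec(tagged, vector_size=16, min_count=1, epochs=100)

    # infer one vector per document, then compare directions with cosine
    vecs = [model.infer_vector(doc.split()) for doc in corpus]
    print(cosine_similarity(vecs).round(3))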

Answered by Wayne on January 6, 2021

Cosine similarity is computed over all the text documents after preprocessing, like removal of stop words, stemming, and applying term frequency weighting. Say A, B, C, D are four documents and I need to find out their similarity: I can apply cosine similarity after going through all the preprocessing and calculating the weights, and finally determine which documents are similar based on the angle. Then I would like to see the key tokens that contributed (like the man of the match in cricket) and what semantics they carry, whether object or thing or noun or verb, and predict the tokens that are relevant. For instance: "I live in Java and the place is good." Here the context is Java the island, not the bike nor OOP. This is the way cosine similarity is applied and a predictive approach is determined.
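A minimal sketch of that pipeline, assuming scikit-learn (the answer names no tool; stemming is omitted for brevity, and the four documents are invented):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # four toy documents standing in for A, B, C, D
    docs = [
        "I live in Java and the place is good",     # A
        "Java is a beautiful island in Indonesia",  # B
        "Java is an object oriented language",      # C
        "my motorbike broke down yesterday",        # D
    ]

    # tokenisation, stop-word removal and tf-idf weighting in one step
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    sims = cosine_similarity(X)

    # rank B, C, D by similarity to A (index 0), most similar first
    ranking = sims[0, 1:].argsort()[::-1] + 1
    print(ranking)

With these toy documents, B and C both share "Java" with A and so rank above D, which shares nothing with A.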

Answered by Agar on January 6, 2021
