Euclidean distance of all pandas rows to single row

Stack Overflow Asked by aquamad96 on December 30, 2021

I have a dataset that gives feature values for some songs, i.e. something that looks like:

   acousticness  danceability  energy  instrumentalness  key  liveness  loudness
0         0.223         0.780    0.72             0.111    1     0.422     0.231
1           0.4         0.644    0.88             0.555  0.5      0.66     0.555
2           0.5         0.223   0.145              0.76    0     0.144     0.567
.
.
.

I want to find the songs (rows) that are numerically closest to another song, such as song 0, using the Euclidean distance. So I'd like to obtain something like:

   acousticness  danceability  energy  instrumentalness  key  liveness  loudness  Euclidean to song 0
0         0.223         0.780    0.72             0.111    1     0.422     0.231                    0
1           0.4         0.644    0.88             0.555  0.5      0.66     0.555                1.334
2           0.5         0.223   0.145              0.76    0     0.144     0.567                1.442
.
.
.

One Answer

The usual procedure for what you're trying to do is to use one of sklearn's pairwise metrics, such as cosine_similarity, and build a similarity matrix with it:

from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

cosine_similarity(df)
array([[1.        , 0.86597679, 0.38431913],
       [0.86597679, 1.        , 0.71838491],
       [0.38431913, 0.71838491, 1.        ]])

This gives you a square matrix whose row and column indices correspond to the dataframe's song index, so entry (i, j) is the similarity between song i and song j.


Similarity with a single item

If you're only interested in the similarities with a specific song, say song 0, you can pass that song as a second array, so that every item in the input matrix is compared against that single item.
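The second array has to be two-dimensional (one row, all features), which is what indexing with `[0, None]` produces. A minimal sketch of the shapes involved, assuming a NumPy array named `features` (a hypothetical stand-in for `df.to_numpy()` built from the three rows in the question):

```python
import numpy as np

# Hypothetical stand-in for df.to_numpy(): 3 songs x 7 features from the question.
features = np.array([
    [0.223, 0.780, 0.720, 0.111, 1.0, 0.422, 0.231],
    [0.400, 0.644, 0.880, 0.555, 0.5, 0.660, 0.555],
    [0.500, 0.223, 0.145, 0.760, 0.0, 0.144, 0.567],
])

row_1d = features[0]        # plain indexing drops the row axis -> shape (7,)
row_2d = features[0, None]  # None re-inserts an axis -> shape (1, 7)
same_2d = features[[0]]     # list indexing also keeps the row axis -> shape (1, 7)

print(row_1d.shape, row_2d.shape, same_2d.shape)  # (7,) (1, 7) (1, 7)
```

Either of the two-dimensional forms can be passed as the second argument of a pairwise metric.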

Since you mentioned the Euclidean distance, here's the same idea using sklearn's euclidean_distances. Because these are distances rather than similarities, subtracting them from 1 gives a similarity-style score in which larger means closer; the score is not bounded below and turns negative once the distance exceeds 1. For the three rows shown in the question, this gives approximately:

1 - euclidean_distances(df, df.to_numpy()[0, None])
array([[ 1.   ],
       [ 0.173],
       [-0.526]])

For the distance itself, just keep the resulting array:

euclidean_distances(df, df.to_numpy()[0, None])
array([[0.   ],
       [0.827],
       [1.526]])
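The distance to a single row can also be computed without scikit-learn. The following is a minimal sketch, assuming the dataframe holds only the seven numeric feature columns and using just the three rows from the question; the names `reference` and `distances` are illustrative.

```python
import numpy as np
import pandas as pd

# The three example rows from the question.
df = pd.DataFrame({
    "acousticness":     [0.223, 0.400, 0.500],
    "danceability":     [0.780, 0.644, 0.223],
    "energy":           [0.720, 0.880, 0.145],
    "instrumentalness": [0.111, 0.555, 0.760],
    "key":              [1.0, 0.5, 0.0],
    "liveness":         [0.422, 0.660, 0.144],
    "loudness":         [0.231, 0.555, 0.567],
})

values = df.to_numpy()
reference = values[0]  # song 0, shape (7,)

# Broadcasting subtracts song 0 from every row; the norm along axis=1
# is the square root of the summed squared differences per row.
distances = np.linalg.norm(values - reference, axis=1)

print(distances.round(3).tolist())  # [0.0, 0.827, 1.526]
```

This returns a flat array with one distance per song, so no `squeeze()` is needed before assigning it to a column.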

To update as a new column:

df['Similarity with song 0'] = 1 - euclidean_distances(df, df.to_numpy()[0, None]).squeeze()

print(df)

   acousticness  danceability  energy  instrumentalness  key  liveness  
0         0.223         0.780   0.720             0.111  1.0     0.422   
1         0.400         0.644   0.880             0.555  0.5     0.660   
2         0.500         0.223   0.145             0.760  0.0     0.144   

   loudness  Similarity with song 0  
0     0.231                1.000000  
1     0.555                0.172848  
2     0.567               -0.526101  
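To actually pick out the closest songs, the distances can be stored in a column and the frame sorted by it. This is a hedged sketch rather than part of the original answer: `songs`, `feature_cols` and `target` are hypothetical names, and the data is limited to the three rows from the question.

```python
import numpy as np
import pandas as pd

# The three example rows from the question.
songs = pd.DataFrame({
    "acousticness":     [0.223, 0.400, 0.500],
    "danceability":     [0.780, 0.644, 0.223],
    "energy":           [0.720, 0.880, 0.145],
    "instrumentalness": [0.111, 0.555, 0.760],
    "key":              [1.0, 0.5, 0.0],
    "liveness":         [0.422, 0.660, 0.144],
    "loudness":         [0.231, 0.555, 0.567],
})

# Remember the feature columns before adding any result column, so the
# new column never leaks into a later distance computation.
feature_cols = list(songs.columns)
target = 0  # index label of the reference song

diffs = songs[feature_cols].to_numpy() - songs.loc[target, feature_cols].to_numpy()
songs["Euclidean to song 0"] = np.linalg.norm(diffs, axis=1)

# Drop the reference song itself and list the others, nearest first.
nearest = songs.drop(index=target).sort_values("Euclidean to song 0")
print(nearest["Euclidean to song 0"].round(3).to_dict())  # {1: 0.827, 2: 1.526}
```

Restricting the computation to `feature_cols` matters once result columns exist: passing the whole dataframe to a distance function after adding a column would include that column in the distance.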

Answered by yatu on December 30, 2021
