TransWikia.com

How do I read the cord_19_embeddings_2020-07-16.csv from the COVID-19 Open Research Dataset Challenge (CORD-19) on Kaggle?

Data Science Asked by Tobias Kolb on September 5, 2021

table view of the file in TAD csv viewer

What I’m trying to do

I wanted to use the CORD19 word embeddings csv to map them to certain findings from the rest of the dataset, but as we can see there are no stings in the first column.

The way I know word or sentence embeddings, is what they map a word or a sentence to multiple vectors.
The values in the first column look somewhat like hashes, and they are the main problem I can’t use the dataset.

Can somebody give me a pointer on what I’m looking at and how to use them?

I have not found documentation, usage examples or submissions on kaggle that have explained or outlined how this file is supposed to be used.

One Answer

So, after a lot of digging, I found something in the comment section.

They are document embeddings.

enter image description here

Relevant Comments from the Kaggle Comment section on the Data Update Log for the CORD19 Dataset:

Examples how to visualize the embeddings in a Jupyter Notebook:

import pandas as pd
from whatlies import Embedding, EmbeddingSet
#Docs: https://rasahq.github.io/whatlies/api/embeddingset/


#transponse dataframe
sample_df = pd.read_csv('data/cord_embeddings_sample.csv', header=None, delimiter=',', index_col=0).T

def to_ems(df):
    ems_dict = {}
    for columnName, columnData in df.iteritems():
        ems_dict.update({str(columnName): Embedding(columnName, columnData)})
    return EmbeddingSet(ems_dict)

ems = to_ems(sample_df.head(10))
ems.plot_correlation()

enter image description here

ems.plot_interactive("ug7v899j","02tnwd4m")

enter image description here

You can even do NLP with the jsons from the dataset and link them to the embeddings via the UUID and SHA from metadata.csv.

Example:

Find words that relate to smoking and color the respective papers:

I created 2 EmbeddingSets where I filtered the embeddings for papers that have smoking-related words in their text body and subtracted their UUIDs from the list. Both EmbeddingSets can be displayed in the plot.

from whatlies.transformers import Umap

# add 2 embedding sets

emb1 = non_smoking_ems.add_property('set', lambda d: 'non smoking papers')
emb2 = smoking_ems.add_property('set', lambda d: 'smoking papers')

both = emb1.merge(emb2)

#add a clustering transformer that reduces dimensionality (like umap) and visualise them
both.transform(Umap(2)).plot_interactive('umap_0', 'umap_1',color='set', annot=False)

enter image description here

Correct answer by Tobias Kolb on September 5, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP