TransWikia.com

europepmc and tidypmc R libraries for extracting metadata from publications

Bioinformatics Asked on August 30, 2021

I'm trying to use the europepmc R library with a list of PMIDs that I want to look up. I tried PubTator, but it is a bit complicated. With Europe PMC I can get all the annotated terms, etc.

library("europepmc")

Example list of PMIDs:
30024784
30555165
30510081
31688884
31516032
28588019
29286103

Right now I look up each ID separately with the epmc_details function, which is not practical if I have to look up hundreds of them.

epmc_details(ext_id = '30510081')

My question: how can I run epmc_details in a loop, or in some other way, so that the PMIDs are looked up one by one and the results are saved in a data frame?

epmc_details returns a list. The names of its elements are:

[1] "basic"           "author_details"  "journal_info"    "ftx"             "chemical"        "mesh_topic"      "mesh_qualifiers"
[8] "comments"        "grants"    

I would only like to save basic, chemical, mesh_topic and mesh_qualifiers in one data frame.

For example, if my first ID is 30510081, the data frame should have my ID (which is basically basic[1]) as the first column, with the rest of the information appended in the following columns.

Such as:

 ID chemical mesh_qualifiers mesh_topic gene 

Any suggestion or help would be highly appreciated.
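A minimal sketch of the loop described above, assuming only that epmc_details accepts ext_id as in the call shown earlier. The helper name collect_details and its fetch and elements arguments are hypothetical, not part of europepmc. Each requested list element is stored as a list-column cell, so no assumption is made about the columns inside chemical, mesh_topic or mesh_qualifiers.

```r
library(tibble)
library(dplyr)

# Hypothetical helper (not part of europepmc): query each PMID in turn and
# keep only the requested elements of the list returned for that PMID.
# `fetch` defaults to europepmc::epmc_details(); any function that takes an
# `ext_id` argument and returns a named list can be passed instead.
collect_details <- function(pmids,
                            fetch = europepmc::epmc_details,
                            elements = c("basic", "chemical",
                                         "mesh_topic", "mesh_qualifiers")) {
  rows <- lapply(as.character(pmids), function(pmid) {
    # A failed lookup yields NULL instead of stopping the whole loop
    res <- tryCatch(fetch(ext_id = pmid), error = function(e) NULL)
    # Each requested element becomes one list-column cell;
    # missing elements are stored as empty tibbles
    cells <- lapply(elements, function(el) {
      if (is.null(res) || is.null(res[[el]])) list(tibble()) else list(res[[el]])
    })
    names(cells) <- elements
    as_tibble(c(list(ID = pmid), cells))
  })
  # One row per PMID
  bind_rows(rows)
}
```

Calling details <- collect_details(c("30024784", "30555165", "30510081")) would then give one row per PMID, and a single cell can be inspected with, for example, details$chemical[[1]]. Empty cells correspond to the empty tibbles that epmc_details itself returns for some records.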

I was also looking at the Europe PMC site through the browser.

This was one of my queries.

When I highlight the key terms there, I can see them annotated in the abstract itself, but when I run the same query through R I get empty results like the ones below. Why is there a difference?

$chemical
# A tibble: 0 x 0

$mesh_topic
# A tibble: 0 x 0

$mesh_qualifiers
# A tibble: 0 x 0

I found a better way of getting data from PubMed Central using the tidypmc library.

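library(tidypmc)
library(dplyr)    # count(), filter(), anti_join(), %>%
library(stringr)  # str_detect()
library(purrr)    # map_int()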
doc <- pmc_xml("PMC6365492")
doc
txt <- pmc_text(doc)
txt

count(txt, section)

cap1 <- pmc_caption(doc)
filter(cap1, sentence == 1)

tab1 <- pmc_table(doc)

sapply(tab1, nrow)

tab1[[1]]
attributes(tab1[[2]])

collapse_rows(tab1, na.string="-")

library(tibble)
library(xml2)  # xml_name(), xml_find_all()
x <- xml_name(xml_find_all(doc, "//*"))
tibble(tag = x) %>% count(tag)

library(tidytext)
x1 <- unnest_tokens(txt, word, text) %>%
  anti_join(stop_words) %>%
  filter(!word %in% 1:100)
#  Joining, by = "word"
# filter(x1, str_detect(section, "Case description"))

# Keep the words from the Results section and count them
a <- filter(x1, str_detect(section, "Results"))
count(a, word, sort = TRUE)

tbls <- pmc_table(doc)
map_int(tbls, nrow)

tbls[[1]]

collapse_rows(tbls, na.string="-")

But if I understand correctly, it can only handle one PMC ID at a time. So my original question remains: how can I put this in a loop so that, say, 100 PMCIDs are queried and the results are stored in a data frame?
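A rough sketch of such a loop for the text, assuming pmc_xml and pmc_text behave as in the calls above (one PMC ID in, one table of sentences out). The helper name collect_text and its read_doc and get_text arguments are hypothetical and only exist so that the two tidypmc calls can be swapped out.

```r
library(dplyr)

# Hypothetical helper (not part of tidypmc): parse each PMC ID and stack the
# text tables into one data frame with a `pmcid` column in front.
# `read_doc` and `get_text` default to tidypmc::pmc_xml() and
# tidypmc::pmc_text().
collect_text <- function(pmcids,
                         read_doc = tidypmc::pmc_xml,
                         get_text = tidypmc::pmc_text) {
  per_paper <- lapply(pmcids, function(id) {
    # A paper that cannot be downloaded or parsed yields NULL
    tryCatch(get_text(read_doc(id)), error = function(e) NULL)
  })
  names(per_paper) <- pmcids
  # Drop the failures, then bind the rest; the list names become `pmcid`
  per_paper <- Filter(Negate(is.null), per_paper)
  bind_rows(per_paper, .id = "pmcid")
}
```

For example, all_txt <- collect_text(c("PMC6365492")) would return the same rows as pmc_text(doc) above with a pmcid column added, so count(all_txt, pmcid, section) works across papers. IDs that fail are silently missing from the result, so comparing unique(all_txt$pmcid) against the input shows which ones were skipped.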

After using tidypmc I found that I can parse a whole publication by its attributes or tags, such as title, abstract, results, etc.

Let's say I'm interested in the information under the table tags, where papers keep patient metadata among other things. If a paper contains multiple tables, I would like to store each of them in a data frame under the respective publication. Since I have multiple IDs to search, I want to do this for all of them. How can I put this in a loop, or can it be done without a loop?

Any suggestion or help would be really appreciated, as always.

One Answer

The idea for this code is to first convert the PMIDs to PMCIDs, then run tidypmc in a loop over the PMCIDs. The only problem is that tidypmc failed to retrieve tables from most of the IDs in your example list.

library(tidyverse)
library(tidypmc)
library(httr)
library(jsonlite)
example_pids <- c(30024784, 30555165, 30510081, 31688884, 31516032, 28588019, 29286103) %>% as.character()

#-- Convert to PMC ids
convertPIDtoPMCID <- function(pids) {
    #-------- Make API request
    pids4query <- paste(pids, collapse = "%0D%0A")
    idconv_req <- paste0("https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids=", pids4query, "&idtype=pmid&format=json&versions=no&showaiid=no&tool=&email=&.submit=Submit")
    pids_json <- GET(idconv_req) %>% content("text") %>% fromJSON()
    #-------- Get info from JSON
    pid2pmc <- pids_json$records %>% 
        select(pmcid, pmid) %>% 
        as.data.frame()
    rownames(pid2pmc) <- pid2pmc$pmid
    pmcids <- pid2pmc[pids, "pmcid"]
    return(pmcids)
}
example_pmcids <- convertPIDtoPMCID(example_pids)

#-- Try to get data with tidypmc
pub_tables <- lapply(example_pmcids, function(pmc_id) {
    message("-- Trying ", pmc_id, "...")
    doc <- tryCatch(pmc_xml(pmc_id), 
                    error = function(e) {
                        message("------ Failed to recover PMCID")
                        return(NULL)
                        })
    if(!is.null(doc)) { 
        #-- If succeed, try to get table
        tables <- pmc_table(doc)
        if(!is.null(tables)) {
            #-- If succeed, try to get table name
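            #-- Keep one caption row per table (captions are split into sentences)
            table_caps <- pmc_caption(doc) %>%
                filter(tag == "table", sentence == 1)
            #-- Only rename when captions and tables match one-to-one
            if(nrow(table_caps) == length(tables)) {
                names(tables) <- paste(table_caps$label, table_caps$text, sep = " - ")
            }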
        }
        return(tables) 
    } else {
        #-- If fail, return NA
        return(NA)
    }
})
names(pub_tables) <- example_pids

#-- Inspect results
pub_tables$`30555165`$`Table 1 - Patient demographic and baseline characteristics`
pub_tables$`29286103`$`Table I - Sample summary.`

Tables will require quite a bit of tidying after this. Good luck!
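As a starting point for that tidying, here is a small sketch that turns the nested pub_tables list into an index with one row per table. It assumes the structure produced by the code above: a list named by PMID whose elements are either a list of data frames, NA, or NULL. The function name index_tables is hypothetical.

```r
library(tibble)
library(dplyr)

# Hypothetical helper: flatten a list of per-publication table lists into one
# tibble with a row per table and the table itself kept in a list-column.
index_tables <- function(pub_tables) {
  # Drop publications with no tables (NA or NULL entries)
  ok <- Filter(function(x) is.list(x) && length(x) > 0, pub_tables)
  rows <- lapply(names(ok), function(pmid) {
    tbls <- ok[[pmid]]
    nm <- names(tbls)
    # Fall back to generic labels when the tables are unnamed
    if (is.null(nm)) nm <- paste("Table", seq_along(tbls))
    tibble(
      pmid = pmid,
      table_name = nm,
      n_rows = unname(vapply(tbls, nrow, integer(1))),
      data = unname(tbls)
    )
  })
  bind_rows(rows)
}
```

With this, tbl_index <- index_tables(pub_tables) lists every retrieved table next to its PMID, and a single table can be pulled out with tbl_index$data[[1]]. Keeping the tables in a list-column avoids forcing tables with different columns into one wide data frame.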

Correct answer by csgroen on August 30, 2021
