TransWikia.com

Break string into new rows after grepl

Stack Overflow Asked by ip2018 on February 21, 2021

I have a data frame:

example_df = data.frame(Genes=c("A", "B"),
                        Sequence = c("MAMAMAM", "ABABABAB"),
                        Domains = c("DOMAIN 90..122;  /note=ABC transporter 1; DOMAIN 129..231;  /note=ABC transporter 1",
                                    "DOMAIN 10..12;  /note=C2H2 ZNF"))

Ideally within a dplyr pipe, I want to split the ‘Domains’ column at every occurrence of the word ‘DOMAIN’ and put each piece on its own row, while keeping the values of all other columns. The desired output is:

output_df = data.frame(Genes=c("A", "A", "B"),
                       Sequence = c("MAMAMAM", "MAMAMAM", "ABABABAB"),
                       Domains = c("DOMAIN 90..122;  /note=ABC transporter 1;",
                                   "DOMAIN 129..231;  /note=ABC transporter 1",
                                   "DOMAIN 10..12;  /note=C2H2 ZNF"))

I have no idea how to tackle this problem. Any help will be much appreciated. Thanks

2 Answers

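You can use separate_rows from tidyr to split the Domains column into rows at each 'DOMAIN'. The lookahead (?=DOMAIN) matches the position just before the keyword without consuming it, so every piece keeps its 'DOMAIN' prefix; filter then drops any empty pieces the split produces.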

library(dplyr)
library(tidyr)

example_df %>%
  separate_rows(Domains, sep = '(?=DOMAIN)') %>%
  filter(Domains != '')

#  Genes Sequence Domains                                     
#  <chr> <chr>    <chr>                                       
#1 A     MAMAMAM  "DOMAIN 90..122;  /note=ABC transporter 1; "
#2 A     MAMAMAM  "DOMAIN 129..231;  /note=ABC transporter 1" 
#3 B     ABABABAB "DOMAIN 10..12;  /note=C2H2 ZNF"          
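
The first row of the result above ends with a trailing space, which the desired output in the question does not have. A minimal sketch of one way to remove it, assuming base R's trimws is acceptable here; the vector `pieces` is a hypothetical stand-in for the Domains column after splitting:

```r
# Hypothetical stand-in for the Domains column after splitting
pieces <- c("DOMAIN 90..122;  /note=ABC transporter 1; ",
            "DOMAIN 129..231;  /note=ABC transporter 1",
            "DOMAIN 10..12;  /note=C2H2 ZNF")

# trimws() strips leading and trailing whitespace; inner spaces are kept
cleaned <- trimws(pieces)
print(cleaned)
```

Inside the pipe, the same call could presumably be appended as a final step, mutate(Domains = trimws(Domains)).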

Correct answer by Ronak Shah on February 21, 2021

We can do this in base R with strsplit, repeating each row once for every piece it is split into:

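# Split each Domains string at the whitespace that precedes 'DOMAIN'
lst1 <- strsplit(example_df$Domains, "\\s+(?=DOMAIN)", perl = TRUE)
# Repeat each row once per piece, then fill in the split strings
out <- transform(example_df[rep(seq_len(nrow(example_df)),
       lengths(lst1)),], Domains = unlist(lst1))
# Reset the row names, which otherwise reflect the repeated indices
row.names(out) <- NULL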
out
#  Genes Sequence                                   Domains
#1     A  MAMAMAM DOMAIN 90..122;  /note=ABC transporter 1;
#2     A  MAMAMAM DOMAIN 129..231;  /note=ABC transporter 1
#3     B ABABABAB            DOMAIN 10..12;  /note=C2H2 ZNF

Or with separate_rows, specifying sep as one or more whitespace characters (\s+) that precede the 'DOMAIN' keyword. Because the split consumes that whitespace, the pieces come out without a trailing space:

library(dplyr)
library(tidyr)
example_df %>%
     separate_rows(Domains, sep = "\\s+(?=DOMAIN)")
# A tibble: 3 x 3
#  Genes Sequence Domains                                  
#  <chr> <chr>    <chr>                                    
#1 A     MAMAMAM  DOMAIN 90..122;  /note=ABC transporter 1;
#2 A     MAMAMAM  DOMAIN 129..231;  /note=ABC transporter 1
#3 B     ABABABAB DOMAIN 10..12;  /note=C2H2 ZNF         
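
As a quick sanity check, the base R result can be compared with the desired output from the question. This is only a sketch: it sets stringsAsFactors = FALSE explicitly so that it does not depend on the R version, and `split_domains` is a hypothetical helper name, not part of either answer.

```r
example_df <- data.frame(
  Genes = c("A", "B"),
  Sequence = c("MAMAMAM", "ABABABAB"),
  Domains = c("DOMAIN 90..122;  /note=ABC transporter 1; DOMAIN 129..231;  /note=ABC transporter 1",
              "DOMAIN 10..12;  /note=C2H2 ZNF"),
  stringsAsFactors = FALSE
)

output_df <- data.frame(
  Genes = c("A", "A", "B"),
  Sequence = c("MAMAMAM", "MAMAMAM", "ABABABAB"),
  Domains = c("DOMAIN 90..122;  /note=ABC transporter 1;",
              "DOMAIN 129..231;  /note=ABC transporter 1",
              "DOMAIN 10..12;  /note=C2H2 ZNF"),
  stringsAsFactors = FALSE
)

# Hypothetical helper wrapping the strsplit approach
split_domains <- function(df) {
  # Split at the whitespace that precedes each 'DOMAIN'
  pieces <- strsplit(df$Domains, "\\s+(?=DOMAIN)", perl = TRUE)
  # Repeat each row once per piece and fill in the split strings
  out <- df[rep(seq_len(nrow(df)), lengths(pieces)), ]
  out$Domains <- unlist(pieces)
  row.names(out) <- NULL
  out
}

result <- split_domains(example_df)

# Compare the split column with the Domains column the question asks for
print(identical(result$Domains, output_df$Domains))
```

Wrapping the steps in a function is a design choice for reuse only; the logic is the same as in the strsplit code above.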

Answered by akrun on February 21, 2021
