TransWikia.com

Looking up values in one table and outputting it into another using join/awk

Unix & Linux Asked by tacrolimus on December 2, 2020

Table 1 (tab separated):

NC_000001.11 1243 A T 0.14 
NC_000005.11 1432 G C 0.0006 
NC_000012.12 1284 A T 0.93428 
NC_000021.9 9824 T C 0.9

Lookup table (tab seperated) – this is in fact huge around 6G gzipped :

 NC_000001.11 1243 rs73647921 A   T  
 NC_000005.11 1432 rs75444    G   C
 NC_000012.12 1284 rs754723 A T  
 NC_000021.9 9824 rs865545  T C 

Desired output – I would like to match on the first 4 columns of table 1 which correspond to columns 1/2/4/5 of the lookup table; taking table 1 and transforming it into:

MarkerName P-Value
rs73647921 0.14
rs75444    0.0006
rs754723 0.93428 
rs865545  0.9

I think I need to use join along the lines of:

join -t, -a 1 -a 2 -o0,1.5,2.3 -e ' -' file1 file2

But this doesn’t seem to work. How could I also use a gzipped file?

2 Answers

With awk (and bash) you can write

awk '
    BEGIN {FS = OFS = "t"}
    NR == FNR {pvalue[$1,$2,$3,$4] = $5; next}
    FNR == 1 {print "MarkerName", "P-Value"}
    { key = $1 SUBSEP $2 SUBSEP $4 SUBSEP $5
      sub(/r$/, "", key)
    }
    key in pvalue {print $3, pvalue[key]}
' table1.tsv <(zcat lookup.tsv.gz)

awk uses the SUBSEP variable to join comma-separated indices for arrays.

That last bit of syntax around the zcat is a bash Process Substitution

For multi-field join conditions, join can be painful to work with. It also complains if the files are not sorted.

Correct answer by glenn jackman on December 2, 2020

have you tried .. ?

join -o 2.3,1.5 f1 <(zcat f2.gz )

by default join use first field. (and -1 x -2 x is equivalent to -j x , and -j 1 is equivalent to " " )

I assume

  • file are sorted,
  • key is first field only.

to sort on all fields

join -o 2.2,1.2 
    <(awk '{printf "%s %sn",$1 $2 $3 $4,$5}' f1) 
    <(zcat f2.gz| awk '{printf "%s %sn",$1 $2 $4 $5,$3}' )

where

  • using two awk to pipe to join might not be a good idea (altough this might save memory compared to an awk only solution),
  • awk command build an index from revelant fields,
  • zcat used to ungzip,
  • this leave 0's problem open.

Answered by Archemar on December 2, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP