AnswerBun.com

Looking up values in one table and outputting it into another using join/awk

Unix & Linux Asked by tacrolimus on December 2, 2020

Table 1 (tab separated):

NC_000001.11 1243 A T 0.14 
NC_000005.11 1432 G C 0.0006 
NC_000012.12 1284 A T 0.93428 
NC_000021.9 9824 T C 0.9

Lookup table (tab seperated) – this is in fact huge around 6G gzipped :

 NC_000001.11 1243 rs73647921 A   T  
 NC_000005.11 1432 rs75444    G   C
 NC_000012.12 1284 rs754723 A T  
 NC_000021.9 9824 rs865545  T C 

Desired output – I would like to match on the first 4 columns of table 1 which correspond to columns 1/2/4/5 of the lookup table; taking table 1 and transforming it into:

MarkerName P-Value
rs73647921 0.14
rs75444    0.0006
rs754723 0.93428 
rs865545  0.9

I think I need to use join along the lines of:

join -t, -a 1 -a 2 -o0,1.5,2.3 -e ' -' file1 file2

But this doesn’t seem to work. How could I also use a gzipped file?

2 Answers

With awk (and bash) you can write

awk '
    BEGIN {FS = OFS = "t"}
    NR == FNR {pvalue[$1,$2,$3,$4] = $5; next}
    FNR == 1 {print "MarkerName", "P-Value"}
    { key = $1 SUBSEP $2 SUBSEP $4 SUBSEP $5
      sub(/r$/, "", key)
    }
    key in pvalue {print $3, pvalue[key]}
' table1.tsv <(zcat lookup.tsv.gz)

awk uses the SUBSEP variable to join comma-separated indices for arrays.

That last bit of syntax around the zcat is a bash Process Substitution

For multi-field join conditions, join can be painful to work with. It also complains if the files are not sorted.

Correct answer by glenn jackman on December 2, 2020

have you tried .. ?

join -o 2.3,1.5 f1 <(zcat f2.gz )

by default join use first field. (and -1 x -2 x is equivalent to -j x , and -j 1 is equivalent to " " )

I assume

  • file are sorted,
  • key is first field only.

to sort on all fields

join -o 2.2,1.2 
    <(awk '{printf "%s %sn",$1 $2 $3 $4,$5}' f1) 
    <(zcat f2.gz| awk '{printf "%s %sn",$1 $2 $4 $5,$3}' )

where

  • using two awk to pipe to join might not be a good idea (altough this might save memory compared to an awk only solution),
  • awk command build an index from revelant fields,
  • zcat used to ungzip,
  • this leave 0's problem open.

Answered by Archemar on December 2, 2020

Add your own answers!

Related Questions

Recently downloaded files using wget

1  Asked on December 17, 2020 by mohammad-qodrati

   

watch tail -20 ls -ltr?

1  Asked on December 17, 2020 by dong-gi-kang

   

Route all traffic via OpenVPN

1  Asked on December 17, 2020 by ihor_dvoretskyi

   

Evince ‘save as’ permissions

1  Asked on December 17, 2020 by stan-shank

   

change LS_COLORS behaviour

1  Asked on December 16, 2020 by barud

       

How to adjust NVIDIA GPU fan speed on a headless node?

2  Asked on December 16, 2020 by aleksandr-dubinsky

       

Linux VM unresponsive

0  Asked on December 16, 2020 by ykaly

 

Upgrading Berkeley DB rpm in a Centos OS

1  Asked on December 16, 2020 by gr68

         

How to redirect all network traffic through Tor?

0  Asked on December 15, 2020 by ananthakrishna-k

     

Feed script variable to command that prompts for user input

1  Asked on December 15, 2020 by vishnu-sajeevan

   

Configuring SNMP Traps

1  Asked on December 15, 2020 by kubomd

   

Wrong notification system in KDE Plasma

1  Asked on December 14, 2020 by arthur-m

       

start chromium on startup using systemd

1  Asked on December 14, 2020

     

DRBD Failure: (127) Device minor not allocated

1  Asked on December 14, 2020 by tanatos-daniel

       

Using bash shell function inside AWK

7  Asked on December 13, 2020 by tiny

   

Chrome on Gnome randomly fails to go full screen

1  Asked on December 13, 2020 by mikev

       

Ask a Question

Get help from others!

© 2022 AnswerBun.com. All rights reserved. Sites we Love: PCI Database, MenuIva, UKBizDB, Menu Kuliner, Sharing RPP, SolveDir