TransWikia.com

How to add different string to duplicates in a single column of a tab delimited file in order to make them unique

Unix & Linux Asked by zerotimer on October 31, 2021

I have a 4-column tab-delimited file whose last column sometimes has duplicates. This is an excerpt from that file:

chr7    116038644       116039744       GeneA
chr7    116030947       116032047       GeneA
chr7    115846040       115847140       GeneA
chr7    115824610       115825710       GeneA
chr7    115801509       115802609       GeneA
chr7    115994986       115996086       GeneA
chrX    143933024       143934124       GeneB
chrX    143933119       143934219       GeneB
chrY    143933129       143933229       GeneC

For every set of duplicates in that column, I want to convert them into something like this (without touching the non-duplicate values in that column):

chr7    116038644       116039744       GeneA-1
chr7    116030947       116032047       GeneA-2
chr7    115846040       115847140       GeneA-3
chr7    115824610       115825710       GeneA-4
chr7    115801509       115802609       GeneA-5
chr7    115994986       115996086       GeneA-6
chrX    143933024       143934124       GeneB-1
chrX    143933119       143934219       GeneB-2
chrY    143933129       143933229       GeneC

How can I do this with awk or sed or Bash’s for loop?

5 Answers

A simple two-pass awk command:

$ awk -F '\t' '
    BEGIN { OFS=FS }
    pass == 1 { count[$4]++; next }
    count[$4] > 1 { $4 = $4 "-" ++number[$4] }; 1' pass=1 file pass=2 file
chr7    116038644       116039744       GeneA-1
chr7    116030947       116032047       GeneA-2
chr7    115846040       115847140       GeneA-3
chr7    115824610       115825710       GeneA-4
chr7    115801509       115802609       GeneA-5
chr7    115994986       115996086       GeneA-6
chrX    143933024       143934124       GeneB-1
chrX    143933119       143934219       GeneB-2
chrY    143933129       143933229       GeneC

This assumes that the input file is tab-delimited. Modify or remove -F '\t' if that is not the case.

On the first pass through the file, an associative array, count, is populated with the number of times each gene name is seen in the fourth column.

On the second pass through the file, a dash and a number are added to the end of the gene name if that name was seen more than once during the first pass.

The number appended comes from a second counter, number, keyed by the gene name just like the count array.
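The same two-pass idea can be sketched in Python (an illustrative translation, not part of the original answer; the logic mirrors the awk program above):

```python
from collections import Counter

def disambiguate(lines):
    """Two passes over tab-delimited rows: count the 4th column first,
    then append '-N' only where a name occurs more than once."""
    rows = [line.rstrip("\n").split("\t") for line in lines]
    total = Counter(row[3] for row in rows)   # pass 1: occurrences per name
    seen = Counter()
    out = []
    for row in rows:                          # pass 2: suffix duplicated names
        name = row[3]
        if total[name] > 1:
            seen[name] += 1
            row[3] = f"{name}-{seen[name]}"
        out.append("\t".join(row))
    return out
```

Because the counting pass happens in memory, this variant also handles input that is not grouped by the fourth column, while preserving the original line order.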

Answered by Kusalananda on October 31, 2021

step1:

awk '{a[$NF]++}END{for (x in a) {print x,a[x]}}' filename| awk '{if($NF >1){for(i=1;i<=$2;i++){print $1"-"i}}else{print $1}}' >first.txt

step2:

awk '{$NF="";print $0}' filename >second.txt

step3:

paste second.txt first.txt | sed -r "s/\s+/ /g"

(Note: `for (x in a)` iterates in an unspecified order, so step 1 only lines up with the file when awk happens to visit the names in first-appearance order.)

output

chr7 116038644 116039744 GeneA-1
chr7 116030947 116032047 GeneA-2
chr7 115846040 115847140 GeneA-3
chr7 115824610 115825710 GeneA-4
chr7 115801509 115802609 GeneA-5
chr7 115994986 115996086 GeneA-6
chrX 143933024 143934124 GeneB-1
chrX 143933119 143934219 GeneB-2
chrY 143933129 143933229 GeneC

python

#!/usr/bin/env python3
# Collect the distinct gene names (last column) in order of first appearance
j = []
ww = []
k = open('rrr.txt', 'r')
for i in k:
    if i.split()[-1] not in j:
        j.append(i.split()[-1])

# For each gene name, gather its lines; number them only when duplicated
for g in j:
    h = open('rrr.txt', 'r')
    d = h.readlines()
    for e in d:
        if e.split()[-1] == g:
            ww.append(e.strip())
    if len(ww) > 1:
        for z in range(len(ww)):
            print("{0}-{1}".format(ww[z], z + 1))
    else:
        print("".join(ww))
    ww = []

output

chr7    116038644       116039744       GeneA-1
chr7    116030947       116032047       GeneA-2
chr7    115846040       115847140       GeneA-3
chr7    115824610       115825710       GeneA-4
chr7    115801509       115802609       GeneA-5
chr7    115994986       115996086       GeneA-6
chrX    143933024       143934124       GeneB-1
chrX    143933119       143934219       GeneB-2
chrY    143933129       143933229       GeneC

Answered by Praveen Kumar BS on October 31, 2021

Assuming your input file is grouped by the 4th column as shown in your example:

$ cat tst.awk
$NF != prev {
    prt()
    cnt = 0
    prev = $NF
}
{ rec[++cnt] = $0 }
END { prt() }

function prt() {
    for (i=1; i<=cnt; i++) {
        print rec[i] (cnt > 1 ? "-"i : "")
    }
}


$ awk -f tst.awk file
chr7    116038644       116039744       GeneA-1
chr7    116030947       116032047       GeneA-2
chr7    115846040       115847140       GeneA-3
chr7    115824610       115825710       GeneA-4
chr7    115801509       115802609       GeneA-5
chr7    115994986       115996086       GeneA-6
chrX    143933024       143934124       GeneB-1
chrX    143933119       143934219       GeneB-2
chrY    143933129       143933229       GeneC
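The buffer-a-group-then-flush pattern above translates almost line for line into Python (an illustrative sketch, not from the original answer; it makes the same assumption that input is grouped by the last column):

```python
def suffix_groups(lines):
    """Collect consecutive rows sharing the last tab-delimited field;
    on a key change, flush the group, numbering it only when it holds
    more than one row."""
    out, group, prev = [], [], None

    def flush():
        for i, rec in enumerate(group, 1):
            out.append(rec + (f"-{i}" if len(group) > 1 else ""))

    for line in lines:
        key = line.rstrip("\n").split("\t")[-1]
        if key != prev:
            flush()
            group, prev = [], key
        group.append(line.rstrip("\n"))
    flush()                                   # flush the final group
    return out
```

Like the awk original, this keeps at most one group in memory at a time.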

Answered by Ed Morton on October 31, 2021

This only appends "-number" to the specified (target) field (the fourth one in your sample) if its value is not unique. It also handles the case of input not sorted by the target column and works for an arbitrary number of input columns.

Since the following AWK script needs the input to be sorted by the target field, we use a pipeline to number the original lines, sort them by the (now) fifth field (the first field being the prepended line number), append the suffix to the non-unique values of the fifth field, restore the original line order, and remove the prepended numbers ('<TAB>' below stands for a literal tab character):

nl file | sort -b -t '<TAB>' -k5,5 -k1n,1n | awk -F '\t' -v OFS='\t' -v kf=5 '
  function prn () {
    for (i = 1; i <= nfl; i++) {
      if (i == kf)
        printf("%s", prc[i] ( sw || cnt[prc[i]] ? "-"++cnt[prc[i]] : ""))
      else
        printf("%s", prc[i])
      printf("%s", (i == nfl ? ORS : OFS))
    }
  }
  NR > 1 {
    sw = ($kf == prc[kf])
    prn()
  }
  {
    nfl = split($0, prc)
  }
  END {
    if (NR > 0)
      prn()
  } ' | sort -k1n,1n | cut -f 2-

The gist of this AWK script is to print the previous line after checking whether its kf-th field equals that of the current line, or whether that value has already appeared at least once. In either case, the kf-th field is printed with the number of times it has been seen appended to it.

Make sure to adjust -v kf=5 (and the -k5,5 sort key) to reflect the actual position of the column you want to disambiguate.

Given this sample (yours, with shuffled rows and an added column) as file:

chr7    116038644   116039744   GeneA   foo
chrX    143933024   143934124   GeneB   foo
chr7    116030947   116032047   GeneA   foo
chr7    115824610   115825710   GeneA   foo
chrY    143933129   143933229   GeneC   foo
chr7    115994986   115996086   GeneA   foo
chrX    143933119   143934219   GeneB   foo
chr7    115801509   115802609   GeneA   foo
chr7    115846040   115847140   GeneA   foo

the output would be:

chr7    116038644   116039744   GeneA-1 foo
chrX    143933024   143934124   GeneB-1 foo
chr7    116030947   116032047   GeneA-2 foo
chr7    115824610   115825710   GeneA-3 foo
chrY    143933129   143933229   GeneC   foo
chr7    115994986   115996086   GeneA-4 foo
chrX    143933119   143934219   GeneB-2 foo
chr7    115801509   115802609   GeneA-5 foo
chr7    115846040   115847140   GeneA-6 foo
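The same decorate/sort/process/undecorate idea can be sketched in Python (a hypothetical helper, not from the answer; kf is the zero-based target column):

```python
def disambiguate_sorted(lines, kf=3):
    """Decorate each row with its original index, sort by column kf
    (the sort is stable, so ties keep input order), suffix each run of
    equal keys, then write the results back in the original order."""
    rows = [(i, line.rstrip("\n").split("\t")) for i, line in enumerate(lines)]
    rows.sort(key=lambda t: t[1][kf])
    out = [None] * len(rows)
    j = 0
    while j < len(rows):
        k = j
        while k < len(rows) and rows[k][1][kf] == rows[j][1][kf]:
            k += 1                        # rows[j:k] is one run of equal keys
        for n, (i, row) in enumerate(rows[j:k], 1):
            if k - j > 1:                 # only duplicated keys get a suffix
                row[kf] = f"{row[kf]}-{n}"
            out[i] = "\t".join(row)
        j = k
    return out
```

As in the pipeline, the sort groups equal keys together and the saved indices restore the original order; the two external `sort` invocations become one in-memory sort plus an index lookup.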

Answered by fra-san on October 31, 2021

Try this

awk -F'\t' -v OFS='\t' '{$4=$4 "-" (++count[$4])}1' file.tsv

This will store the number of occurrences of each value of the 4th field in a counter array count (where the value of the 4th field is used as "index"), and append the pre-incremented value of that counter to the 4th field, separated by a dash.

The above "simple" example has a disadvantage: it will add a disambiguation number even to those values in column 4 that appear only once in the file. To suppress that, the following two-pass approach will work (the command is broken over two lines to improve readability):

 awk -F'\t' -v OFS='\t' 'NR==FNR{f[$4]++}
      NR>FNR{if (f[$4]>1) {$4=$4 "-" (++count[$4])}; print}' file.tsv file.tsv

Note that the file to be processed is stated twice as argument, and will therefore be read twice.

  • The first time it is read (indicated by FNR, the per-file line counter, being equal to NR, the global line counter), we simply count how often every distinct value of column 4 appears in the file, and store that in an array f.
  • The second time the file is read, we perform the actual text processing just like in the "simple" approach and append the occurrence counter to column 4, but only if the total number of occurrences found in the first pass is larger than 1.

This approach avoids buffering the entire file, which can be an advantage if the file is very large. Processing of course takes longer, because the file is read twice.

As a general rule, shell loops are rarely necessary for text processing, since tools such as awk can perform the looping themselves far more efficiently.

Answered by AdminBee on October 31, 2021
