TransWikia.com

How to merge two csv files with common but differently ordered headers?

Unix & Linux Asked by Sagar Joshi on December 31, 2021

I have multiple files (>150), each with multiple columns (>150). Most of the headers are common to all files but occur in a different order, as in the example below:

File 1:

Col1 Col2 Col3 Col4 Col5

A    B    C    D    E

File 2:

Col1 Col4 Col3 Col5

P    Q    R    S

Desired output:

Col1 Col3 Col4 Col5

A    C    D    E

P    R    Q    S

Alternatively, groups of roughly 30-40 files share a common set of headers (still in a different order, though). If someone can help me sort the headers (and the corresponding data) so that they appear in the same order across a whole group of files, I can then remove the uncommon columns from the 4-5 groups and merge the common set.

3 Answers

This Perl code reads all the files to be merged, determines the headers common to every file, and then reorders each file's columns to match that common header when printing.

     perl -wMstrict -Mvars='*ARGV_orig,*comm_hdr,*prev,*h' -lne '
     BEGIN{
       @::ARGV_orig = @ARGV;

       $::prev = q//;

       sub trim {
          my ($str) = @_ ? @_ : $_;

          for($str) {
             s/^\s*//;s/\s*$//;
          }

          return $str;
       }

       sub intersection(\@\@) {
          @{$_[0]} > @{$_[1]} and @_ = reverse @_;

          my @smaller = @{ +shift };
          my @larger  = @{ +shift };

          my @common;

          for my $e (@smaller) {
             push @common, $e
                if grep { $_ eq $e } @larger;
          }

          return @common;
       }

       sub col_print_order {

          my @common_hdr = @{ $_[0]->{common_header} };
          my @header2prn = @{ $_[0]->{header_2print} };

          my @reorder;

          for my $e (@common_hdr) {
             if ( -1 < (my ($l) = grep { $header2prn[$_] eq $e } 0..$#header2prn) ) {
                push @reorder, $l;
             }
          }

          return @reorder;
       }

    }

    if ( $ARGV ne $::prev ) {
       $::h{$ARGV}{header} = $_;
       my @A = split;
       @::comm_hdr = @::comm_hdr ? intersection(@::comm_hdr, @A) : @A;
       $::prev = $ARGV;
    } else {
       push @{ $::h{$ARGV}{data} }, $_;
    }

    END{
       local $, = chr(32);

       my @comm_hdr_sorted = sort @::comm_hdr;
       print @comm_hdr_sorted;

       for my $argv (@::ARGV_orig) {
          my @current_header = split /\s+/, trim $::h{$argv}{header};

          my @order = col_print_order({
             common_header => \@comm_hdr_sorted,
             header_2print => \@current_header,
          });

          my @file = @{ $::h{$argv}{data} };

          for my $line_num ( 0..$#file ) {
             my $line = trim $file[$line_num];
             my @fields = split /\s+/, $line;
             print @fields[ @order ];
          }
       }
    }
 ' yourfile1 yourfile2 yourfile3 # ... specify all your filenames to be merged here

Output

Col1 Col3 Col4 Col5
A    C    D    E
P    R    Q    S

Answered by user218374 on December 31, 2021

With GNU awk, this handles an arbitrary number of files. It uses arrays of arrays, so gawk 4.0 or later is required. All file contents are stored in memory, so the practical limit depends on your system's memory capacity.

gawk '
    # examine the headers for this file
    FNR == 1 {
        num_files++
        delete this_headers
        for (i=1; i<=NF; i++) {
            all_headers[$i]++
            this_headers[i] = $i
        }
        next
    }
    # this is a line of data
    {
        n++
        for (i=1; i<=NF; i++) {
            data[n][this_headers[i]] = $i
        }
    }
    END {
        # find the headers that are common to all files
        for (header in all_headers) {
            if (all_headers[header] == num_files)
                common_headers[header]
        }
        # sort arrays by index, alphabetically
        PROCINFO["sorted_in"] = "@ind_str_asc"
        # print out the common headers
        for (header in common_headers) {
            printf "%s ", header
        }
        print ""
        # print out the data
        for (i=1; i<=n; i++) {
            for (header in common_headers) {
                printf "%s ", data[i][header]
            }
            print ""
        }
    }
' file1 file2

outputs

Col1 Col3 Col4 Col5 
A C D E 
P R Q S 
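If holding every row in memory is a concern, the same result can be produced in two passes: read only the header lines first, then stream the data rows. The following is a minimal Python sketch of that idea, not part of the gawk answer. The names `read_header` and `merge_two_pass` are made up for illustration, and it assumes whitespace-separated fields with no embedded spaces and rows that are at least as long as their header.

```python
import sys


def read_header(path):
    """Return the whitespace-separated column names from the first line."""
    with open(path) as handle:
        return handle.readline().split()


def merge_two_pass(paths, out=sys.stdout):
    """Write the columns common to all files, in alphabetical header order."""
    if not paths:
        return

    # Pass 1: read only the header line of every file.
    headers = [read_header(path) for path in paths]

    # Columns present in every file, sorted alphabetically.
    common = sorted(set(headers[0]).intersection(*headers[1:]))
    out.write(" ".join(common) + "\n")

    # Pass 2: stream the data rows, one line at a time.
    for path, header in zip(paths, headers):
        # Position of each common column within this particular file.
        order = [header.index(name) for name in common]
        with open(path) as handle:
            next(handle, None)  # skip the header line
            for line in handle:
                fields = line.split()
                if fields:  # ignore blank lines
                    out.write(" ".join(fields[i] for i in order) + "\n")
```

Calling `merge_two_pass(["file1", "file2"])` on the two sample files from the question should print the same three lines as above, without the trailing spaces.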

Answered by glenn jackman on December 31, 2021

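Save the code as a file named mergecols, make it executable, and run it with mergecols -C1=0,2,3,4 -C2=0,2,1,3 file1 file2. The -C1 and -C2 options list the zero-based column indexes to print from file1 and file2, in output order.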

#!/usr/bin/perl -s

# mergecols
# -C1=0,2,3,4   columns from file 1
# -C2=0,2,1,3   columns from file 2
# file1         input file 1
# file2         input file 2

($f1,$f2) = @ARGV;

@t1 = map { [split] } do { local @ARGV=($f1); <> };
@t2 = map { [split] } do { local @ARGV=($f2); <> };

@c1 = split /,/, $C1;
@c2 = split /,/, $C2;

for ( $i=0; $t1[$i] or $t2[$i]; $i++ ) {
   print join ' ', @{$t1[$i]}[@c1], "\n" if $t1[$i];
   print join ' ', @{$t2[$i]}[@c2], "\n" if $t2[$i];
}
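With more than 150 columns, working out the index lists by hand is error-prone. As a rough sketch (in Python rather than Perl, and not part of the script above), the lists could be derived from the header lines themselves. The function name `column_options` is hypothetical, and it assumes whitespace-separated headers with unique column names.

```python
def column_options(header_lines):
    """Return one comma-separated, zero-based index list per header line."""
    headers = [line.split() for line in header_lines]

    # Columns present in every header, sorted alphabetically.
    common = sorted(set(headers[0]).intersection(*headers[1:]))

    # For each file, the position of every common column in that file.
    return [
        ",".join(str(header.index(name)) for name in common)
        for header in headers
    ]


options = column_options([
    "Col1 Col2 Col3 Col4 Col5",  # header line of file1
    "Col1 Col4 Col3 Col5",       # header line of file2
])
for number, indexes in enumerate(options, start=1):
    print(f"-C{number}={indexes}")
# prints:
# -C1=0,2,3,4
# -C2=0,2,1,3
```

The two printed values match the options used in the example invocation. Note that mergecols as written reads only two files, so this helps with the index lists but not with the number of inputs.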

Answered by ingopingo on December 31, 2021
