How to merge two csv files with common but differently ordered headers?

Question

I have multiple files (>150), each with multiple columns (>150). Most of the headers are common to all files but appear in a different order, as in the example below:

File 1:

Col1 Col2 Col3 Col4 Col5

A    B    C    D    E

File 2:

Col1 Col4 Col3 Col5

P    Q    R    S

Desired output:

Col1 Col3 Col4 Col5

A    C    D    E

P    R    Q    S

Alternatively, groups of roughly 30-40 files share a common set of headers (still in a different order, though). If someone can help me sort the headers (and the corresponding data) so that they appear in the same order across each group of files, I can then remove the uncommon columns from the 4-5 groups and merge the common set.

user218374 · Answer

This Perl code reads all the files to be merged, determines the headers common to all of them, and then prints each file's rows with the columns rearranged into that common order.

perl -wMstrict -Mvars='*ARGV_orig,*comm_hdr,*prev,*h' -lne '
    BEGIN{
       @::ARGV_orig = @ARGV;
       $::prev = q//;

       # strip leading and trailing whitespace
       sub trim {
          my ($str) = @_ ? @_ : $_;
          for($str) {
             s/^\s*//;s/\s*$//;
          }
          return $str;
       }

       # elements common to two arrays (the prototype passes them by reference)
       sub intersection(\@\@) {
          @{$_[0]} > @{$_[1]} and @_ = reverse @_;
          my @smaller = @{ +shift };
          my @larger  = @{ +shift };
          my @common;
          for my $e (@smaller) {
             push @common, $e
                if grep { $_ eq $e } @larger;
          }
          return @common;
       }

       # for each common header, find its column index in the given file header
       sub col_print_order {
          my @common_hdr = @{ $_[0]->{common_header} };
          my @header2prn = @{ $_[0]->{header_2print} };
          my @reorder;
          for my $e (@common_hdr) {
             my ($l) = grep { $header2prn[$_] eq $e } 0..$#header2prn;
             push @reorder, $l if defined $l;
          }
          return @reorder;
       }
    }

    # the first line of each file is its header; the rest is data
    if ( $ARGV ne $::prev ) {
       $::h{$ARGV}{header} = $_;
       my @A = split;
       @::comm_hdr = @::comm_hdr ? intersection(@::comm_hdr, @A) : @A;
       $::prev = $ARGV;
    } else {
       push @{ $::h{$ARGV}{data} }, $_;
    }

    END{
       local $, = chr(32);

       my @comm_hdr_sorted = sort @::comm_hdr;
       print @comm_hdr_sorted;

       for my $argv (@::ARGV_orig) {
          my @current_header = split /\s+/, trim $::h{$argv}{header};

          my @order = col_print_order({
             common_header => \@comm_hdr_sorted,
             header_2print => \@current_header,
          });

          # a file with only a header line has no data rows
          my @file = @{ $::h{$argv}{data} || [] };

          for my $line_num ( 0..$#file ) {
             my $line = trim $file[$line_num];
             my @fields = split /\s+/, $line;
             print @fields[ @order ];
          }
       }
    }
 ' yourfile1 yourfile2 yourfile3 # ... specify all your filenames to be merged here

Output

Col1 Col3 Col4 Col5
A    C    D    E
P    R    Q    S
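
The two steps described in this answer, intersecting the header rows and then mapping each file's columns onto the common order, can also be sketched in a few lines of Python. This is an illustrative sketch rather than a translation of the Perl above: the function names common_headers and reorder_rows are hypothetical, and it assumes whitespace-separated fields with the header on the first line, as in the question's sample.

```python
def common_headers(headers):
    """Return the sorted column names present in every header list."""
    common = set(headers[0])
    for h in headers[1:]:
        common &= set(h)
    return sorted(common)


def reorder_rows(header, rows, wanted):
    """Yield each row with its fields rearranged into the `wanted` order."""
    position = {name: i for i, name in enumerate(header)}
    idx = [position[name] for name in wanted]
    for row in rows:
        yield [row[i] for i in idx]


# Sample data from the question: (header, data rows) per file.
file1 = (["Col1", "Col2", "Col3", "Col4", "Col5"], [["A", "B", "C", "D", "E"]])
file2 = (["Col1", "Col4", "Col3", "Col5"], [["P", "Q", "R", "S"]])

wanted = common_headers([file1[0], file2[0]])
merged = [wanted]
for header, rows in (file1, file2):
    merged.extend(reorder_rows(header, rows, wanted))

for row in merged:
    print(" ".join(row))
```

Like the Perl code, this sorts the header names as strings, so a name such as Col10 would sort before Col2.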

glenn jackman · Answer

This GNU awk script can handle an arbitrary number of files. All file content is stored in memory, so the practical limit is your system's memory capacity. It uses arrays of arrays, which require gawk 4.0 or later.

gawk '
    # examine the headers for this file
    FNR == 1 {
        num_files++
        delete this_headers
        for (i=1; i<=NF; i++) {
            all_headers[$i]++
            this_headers[i] = $i
        }
        next
    }
    # this is a line of data
    {
        n++
        for (i=1; i<=NF; i++) {
            data[n][this_headers[i]] = $i
        }
    }
    END {
        # find the headers that are common to all files
        for (header in all_headers) {
            if (all_headers[header] == num_files)
                common_headers[header]
        }
        # sort arrays by index, alphabetically
        PROCINFO["sorted_in"] = "@ind_str_asc"
        # print out the common headers
        for (header in common_headers) {
            printf "%s ", header
        }
        print ""
        # print out the data
        for (i=1; i<=n; i++) {
            for (header in common_headers) {
                printf "%s ", data[i][header]
            }
            print ""
        }
    }
' file1 file2

outputs

Col1 Col3 Col4 Col5 
A C D E 
P R Q S
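
If memory is a concern with 150+ large files, one possible variation is to read the files twice: once for the header lines only, and once to stream the data rows. The Python sketch below illustrates that idea under stated assumptions: merge_streaming and read_header are hypothetical names, fields are whitespace-separated, and every data row has at least as many fields as its header. The demo writes the question's sample data to temporary files.

```python
import io
import os
import tempfile


def read_header(path):
    """Read only the first line of a file and split it on whitespace."""
    with open(path) as fh:
        return fh.readline().split()


def merge_streaming(paths, out):
    """Write the common columns of all files to `out`, one line at a time."""
    # Pass 1: only the header lines are read, so memory use stays small.
    headers = {path: read_header(path) for path in paths}
    common = sorted(set.intersection(*(set(h) for h in headers.values())))
    out.write(" ".join(common) + "\n")

    # Pass 2: stream each file and emit its rows in the common order.
    for path in paths:
        idx = [headers[path].index(name) for name in common]
        with open(path) as fh:
            next(fh, None)  # skip the header line
            for line in fh:
                fields = line.split()
                if fields:  # ignore blank lines
                    out.write(" ".join(fields[i] for i in idx) + "\n")


# Demo with the question's sample data written to temporary files.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "file1": "Col1 Col2 Col3 Col4 Col5\nA B C D E\n",
        "file2": "Col1 Col4 Col3 Col5\nP Q R S\n",
    }
    paths = []
    for name, text in samples.items():
        path = os.path.join(tmp, name)
        with open(path, "w") as fh:
            fh.write(text)
        paths.append(path)

    buffer = io.StringIO()
    merge_streaming(paths, buffer)
    result = buffer.getvalue()

print(result, end="")
```

The trade-off is that each file is opened twice, in exchange for never holding more than one data line in memory.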

ingopingo · Answer

Save the code below as a file named mergecols, make it executable, and run it with:

mergecols -C1=0,2,3,4 -C2=0,2,1,3 file1 file2

#!/usr/bin/perl -s
# mergecols
# -C1=0,2,3,4 columns from file 1
# -C2=0,2,1,3 columns from file 2
# file1 input file 1
# file2 input file 2
($f1,$f2) = @ARGV;
@t1 = map { [split] } do { local @ARGV=($f1); <> };
@t2 = map { [split] } do { local @ARGV=($f2); <> };
@c1 = split /,/, $C1;
@c2 = split /,/, $C2;
for ( $i=0; $t1[$i] or $t2[$i]; $i++ ) {
    print join ' ', @{$t1[$i]}[@c1], "\n" if $t1[$i];
    print join ' ', @{$t2[$i]}[@c2], "\n" if $t2[$i];
}
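
With more than 150 columns, working out the -C1 and -C2 index lists by hand is error-prone. As a rough aid, the lists could be derived from the header lines themselves. The Python sketch below shows one way to do that; column_spec is a hypothetical helper, not part of mergecols, and it assumes whitespace-separated header names that all appear in the file.

```python
def column_spec(header_line, wanted):
    """Return the zero-based positions of `wanted` names, comma-separated."""
    names = header_line.split()
    return ",".join(str(names.index(name)) for name in wanted)


# Header lines and target column order from the question's sample.
wanted = ["Col1", "Col3", "Col4", "Col5"]
c1 = column_spec("Col1 Col2 Col3 Col4 Col5", wanted)
c2 = column_spec("Col1 Col4 Col3 Col5", wanted)

print(f"mergecols -C1={c1} -C2={c2} file1 file2")
```

For the sample headers this yields the same index lists as the command line shown in this answer.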
