
Remove duplicate lines from files recursively in place but leave one - make lines unique across files

Unix & Linux Asked by Ikrom on October 31, 2021

I have many folders, and those folders contain files. The same line may appear multiple times in a single file and/or in multiple files. The files are not sorted. So some lines are duplicated across multiple files, and those files sit in different folders.

I want to remove duplicate lines and keep only one occurrence of each across all files. The directory structure and file names should stay the same.

I’ve tried the following, but it only made lines unique within each single file, not across all files. This code deduplicates lines in each file and keeps the file name:

for i in $(find . -type f); do
    awk '!seen[$0]++' "$i" > tmp_file
    mv ./tmp_file "$i"
done

Question: how can I make lines unique across all files in all subfolders while keeping the file structure and names?

Here is a sample of my files. To simplify, I’m listing only the files here, but in reality they are located in the same or different folders.

Input:

$ cat File-1
1
2
3
1

$ cat File-2
2
3
4
1

$ cat File-3
2
4
5
6

Output:

$ cat File-1
1
2
3

$ cat File-2
4

$ cat File-3
5
6

In my case, retaining the first occurrence of a line is preferred but not required (the retained line can end up in any file).

2 Answers

What follows will only work if the number of files to process is small enough to make find run awk exactly once. It also assumes you can make a copy of the entire file tree (i.e. you are not storage-constrained).

Assuming your file tree is in the orig directory:

$ cp -pr orig tmp
$ cd tmp
$ find . -type f -exec awk '
  # Printed once per awk invocation; see the note on ARGC below
  BEGIN { print ARGC }
  # On entering a new input file, close the previous output file and
  # truncate the corresponding file in the original tree
  FILENAME != fn {
    close( "../orig/"fn )
    printf "" > ( "../orig/"FILENAME )
  }
  # Write a line back to the original tree only the first time it is seen anywhere
  !seen[$0]++ { print > ( "../orig/"FILENAME ) }
  { fn = FILENAME; }' {} +

Once you are satisfied with the result, you can rm -r tmp.

print ARGC is there to show how many times awk was invoked. ARGC is the number of elements in the array of command-line arguments (ARGV[0], i.e. the awk command name, plus the file operands); seeing it printed more than once means that find split the file list over several awk invocations, so global line deduplication failed.
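To see what that number counts, you can run a BEGIN-only program on the three sample files from the question (a program consisting only of a BEGIN block exits without reading its input):

$ awk 'BEGIN { print ARGC }' File-1 File-2 File-3
4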
Indeed, if you know the total number of files to process in advance, you can change that block into if ( (ARGC - 1) < total_number_of_files ) exit, so that no file gets modified when awk is going to be invoked more than once.
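As a minimal sketch of that suggestion (not part of the original command; the hard-coded 3 is only a placeholder matching the three sample files, substitute your real total), the command run from inside tmp would become:

$ find . -type f -exec awk '
  BEGIN { if ( (ARGC - 1) < 3 ) exit }
  FILENAME != fn {
    close( "../orig/"fn )
    printf "" > ( "../orig/"FILENAME )
  }
  !seen[$0]++ { print > ( "../orig/"FILENAME ) }
  { fn = FILENAME; }' {} +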

Answered by fra-san on October 31, 2021

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Copy;

my $headdir = "/some/path";
my @files   = ();
my $lines   = {};                # lines seen so far, across all files

# Collect every path under $headdir; no_chdir makes $_ the full path name
find( { wanted => sub { push @files, $_ }, no_chdir => 1 }, $headdir );

foreach my $file (@files) {
  next unless -f $file;
  # Keep a backup copy, then rewrite the original file in place
  copy( $file, "$file.old" ) or die "cannot copy $file: $!";
  open( my $fhin,  '<', "$file.old" ) or die "cannot read $file.old: $!";
  open( my $fhout, '>', $file )       or die "cannot write $file: $!";
  while (<$fhin>) {
    # Print a line only the first time it is seen in any file
    if ( not defined $lines->{$_} ) {
      print $fhout $_;
      $lines->{$_} = 1;
    }
  }
  close($fhin);
  close($fhout);
  #optional: unlink "$file.old";
}

EDIT: tested (only) with the files mentioned in the question; a tiny change to the code was necessary.
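For reference, a possible way to run it (dedup.pl is just an example name; set $headdir to the root of your tree first), with the .old backups left behind for inspection:

$ perl dedup.pl
$ find /some/path -name '*.old'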

Answered by Garo on October 31, 2021
