
How do I remove the first 300 million lines from a 700 GB txt file on a system with 1 TB disk space?

Unix & Linux Asked by Kris on December 4, 2020

How do I remove the first 300 million lines from a 700 GB text file
on a system with 1 TB disk space total, with 300 GB available? 
(My system has 2 GB of memory.) 
The answers I found use sed, tail, or head, but I think (please correct me) I cannot use them because the disk space is limited to 1 TB and they produce a new file and/or use a temporary file during processing.

The file contains database records in JSON format.

13 Answers

If you have enough space to compress the file, which should free a significant amount of space and allow you to do other operations, you can try this:

gzip file && zcat file.gz | tail -n +300000001 | gzip > newFile.gz

That will first gzip the original input file (file) to create file.gz. Then, you zcat the newly created file.gz, pipe it through tail -n +300000001 to remove the first 300 million lines, compress the result to save disk space, and save it as newFile.gz. The && ensures that you only continue if the gzip operation was successful (it will fail if you run out of space).

Note that text files are very compressible. For example, I created a test file using seq 400000000 > file, which prints the numbers from 1 to 400,000,000; this resulted in a 3.7G file. When I compressed it using the commands above, the compressed file was only 849M, and the newFile.gz I created was only 213M.

Correct answer by terdon on December 4, 2020

I'd do it as follows:

<?php
$fp1 = fopen("file.txt", "rb");
// skip past the first 300,000,000 lines:
for ($i = 0; $i < 300_000_000; ++ $i) {
    fgets($fp1);
}
// the next fgets($fp1) call will read line 300,000,001 :)
$fp2 = fopen("file.txt", "cb");
// copy all remaining lines from fp1 to fp2
while (false !== ($line = fgets($fp1))) {
    fwrite($fp2, $line);
}
fclose($fp1);
// truncate the file, removing everything after what was copied to fp2
ftruncate($fp2, ftell($fp2));
fclose($fp2);

Or, if I needed it to run fast for some reason, I'd do the same in C++ with mmap() memory mapping, which should run much faster:

#include <fstream>
#include <string>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>


int main(){
    const std::string target_file = "file.txt";
    const size_t lines_to_remove = 300000000;

    std::fstream fp1(target_file, std::fstream::in | std::fstream::out | std::fstream::binary);
    fp1.exceptions(std::fstream::failbit | std::fstream::badbit);
    fp1.seekg(0, std::fstream::end);
    const size_t total_file_size_before_truncation = fp1.tellg();
    fp1.seekg(0, std::fstream::beg);

    const int fd = open(target_file.c_str(), O_RDWR);
    char *content_mmaped = (char *)mmap(NULL, total_file_size_before_truncation, PROT_READ, MAP_PRIVATE, fd, 0);

    // find the offset of the first byte after the 300-millionth newline
    size_t line_no = 0;
    size_t start_pos = total_file_size_before_truncation;
    for(size_t i = 0; i < total_file_size_before_truncation; ++i){
        if(content_mmaped[i] == '\n'){
            ++line_no;
            if(line_no >= lines_to_remove){
                start_pos = i + 1;
                break;
            }
        }
    }

    // copy everything after that offset to the front of the file, then cut off the rest
    const size_t new_size = total_file_size_before_truncation - start_pos;
    fp1.write(&content_mmaped[start_pos], std::streamoff(new_size));
    fp1.close();
    munmap(content_mmaped, total_file_size_before_truncation);
    ftruncate(fd, new_size);
    close(fd);
}
  • this should run significantly faster than every other line-accurate answer here, except user431397's answer (but this works on any filesystem, unlike user431397's approach, which only works on certain filesystems)

(But if I don't need the speed, I would probably use the first approach, as the code is much easier to read and therefore less likely to contain bugs.)

Answered by hanshenrik on December 4, 2020

There are various approaches to removing the first lines. I recommend splitting the file into chunks, changing them (removing the first lines) and concatenating the chunks again.

In your case it would be very dangerous to change the file in-place. If something goes wrong you have no fallback option!

Here is my working solution (bash). You will probably need to make some improvements ...

function split_into_chunks {
    BIG_FILE=$1

    while [ $(stat -c %s $BIG_FILE) -gt 0 ]
    do
        CHUNK_FILE="chunk.$(ls chunk.* 2>/dev/null | wc -l)"
        tail -10 $BIG_FILE > $CHUNK_FILE
        test -s $CHUNK_FILE && truncate -s -$(stat -c %s $CHUNK_FILE) $BIG_FILE
    done
}

function concat_chunks {
    BIG_FILE=$1
    test ! -s $BIG_FILE || (echo "ERROR: target file is not empty"; return)

    for CHUNK_FILE in $(ls chunk.* | sort -t . -k2 -n -r)
    do
        cat $CHUNK_FILE >> $BIG_FILE
        rm $CHUNK_FILE
    done
}

Test:

$ seq 1000 > big-file.txt 
$ stat -c "%s %n" chunk.* big-file.txt 2>/dev/null | tail -12
3893 big-file.txt
$ md5sum big-file.txt; wc -l big-file.txt 
53d025127ae99ab79e8502aae2d9bea6  big-file.txt
1000 big-file.txt

$ split_into_chunks big-file.txt
$ stat -c "%s %n" chunk.* big-file.txt | tail -12
40 chunk.9
31 chunk.90
30 chunk.91
30 chunk.92
30 chunk.93
30 chunk.94
30 chunk.95
30 chunk.96
30 chunk.97
30 chunk.98
21 chunk.99
0 big-file.txt

$ # here you could change the chunks
$ # the test here shows that the file will be concatenated correctly again

$ concat_chunks big-file.txt
$ stat -c "%s %n" chunk.* big-file.txt 2>/dev/null | tail -12
3893 big-file.txt
$ md5sum big-file.txt; wc -l big-file.txt 
53d025127ae99ab79e8502aae2d9bea6  big-file.txt
1000 big-file.txt

Hint: You definitely need to make sure that all your chunks are not too small (very long processing time) and not too big (not enough disk space)! My example uses 10 lines per chunk - I assume that is too low for your task.

Answered by sealor on December 4, 2020

You can just read and write to the file in place and then truncate the file. There may even be a way to do this with CLI tools; I'm not sure, but here it is in Java (untested).

RandomAccessFile out = new RandomAccessFile("file.txt", "rw");
RandomAccessFile in = new RandomAccessFile("file.txt", "r");
String line = null;
long rows = 0;
while( (line=in.readLine()) != null ){
    if( rows >= 300000000 ) {   // keep lines 300,000,001 onward
        out.writeBytes(line);
        out.write('\n');        // readLine() strips the line terminator
    }
    rows++;
}
in.close();
out.setLength( out.getFilePointer() );
out.close();

Answered by Chris Seline on December 4, 2020

Think of Towers of Hanoi. Sort of.

First, move the lines you want into a new file:

find the start of line 300 million and 1
create a new, empty file
repeat {
  read a decent number of blocks from the end of the old file
  append the blocks to the end of the new file
  truncate the old file by that many blocks
} until you get to the start of line 300 million and 1.

You should now have a file that contains just the lines you want, but not in the right order.

So let's do the same thing again to put them into the right order:

Truncate the original file to zero blocks (i.e. delete the first 300 million lines)
repeat {
  read the same number of blocks from the end of the new file (except the first time, when you won't have an exact number of blocks unless the first 300 million lines were an exact number of blocks long)
  append those blocks to the end of the original file
  truncate the new file by that many blocks
} until you have processed the whole file.

You should now have just the lines you want, and in the right order.

Actual working code is left as an exercise for the reader.
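
For the curious, here is a rough Python sketch of the idea (this is not the answer author's code; the file names and block size are made up, and error handling is omitted). The point of the block-by-block shuffle is that the scratch file grows only as fast as the original shrinks, so peak disk usage stays at roughly the original file size plus one block:

import os

SRC = "file.txt"            # the big file (hypothetical name)
TMP = "file.txt.reversed"   # scratch file holding the tail, block-reversed
SKIP_LINES = 300_000_000
BLOCK = 64 * 1024 * 1024    # 64 MiB per move; tune to taste

# Step 0: find the byte offset where line SKIP_LINES + 1 starts.
with open(SRC, "rb") as f:
    for _ in range(SKIP_LINES):
        if not f.readline():
            break
    offset = f.tell()

# Step 1: move the wanted tail into TMP, one block at a time from the end,
# truncating SRC after each move so the two files never need extra space.
with open(SRC, "r+b") as src, open(TMP, "wb") as tmp:
    size = os.path.getsize(SRC)
    while size > offset:
        chunk = min(BLOCK, size - offset)
        src.seek(size - chunk)
        tmp.write(src.read(chunk))
        size -= chunk
        src.truncate(size)

# Step 2: SRC now holds only the unwanted head; empty it, then copy the
# blocks back from the end of TMP, which restores the original order.
with open(SRC, "r+b") as src, open(TMP, "r+b") as tmp:
    src.truncate(0)
    size = os.path.getsize(TMP)
    while size > 0:
        chunk = size % BLOCK or min(BLOCK, size)  # first move may be partial
        tmp.seek(size - chunk)
        src.write(tmp.read(chunk))
        size -= chunk
        tmp.truncate(size)
os.remove(TMP)

If the process is interrupted mid-way, no bytes are lost, but stitching the two files back together by hand would be fiddly.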

Answered by Ben Aveling on December 4, 2020

What about using vim for in-place editing?

Vim is already capable of reasoning about lines:

vim -c ":set nobackup nowritebackup" -c ":300000000delete" -c ":wq" filename

Explanation:

vim will execute the various commands passed to the -c switches as if they were typed in an interactive session.

So:

  1. we disable backup copy creation
  2. we delete lines 1 through 300,000,000 (the 1,300000000 range; vim numbers lines starting at 1)
  3. we save the file

That should do the trick. I have used vim in a similar fashion in the past, and it works. It may not be copy-paste safe; the OP should do some tests and possibly adapt the command to their needs.

Just to be sure, you might want to remove the -c ":wq" switches at the end, and visually inspect the file for correctness.

Answered by znpy on December 4, 2020

With ksh93:

tail -n +300000001 < file 1<>; file

The 1<>; operator is a ksh93-specific variation on the standard 1<> operator (which opens in read+write mode without truncation). It truncates the file, after the command has returned, at the position where the command left its stdout, provided that command was successful.

With other shells, you can always do the truncating-in-place-afterwards by hand with perl for instance:

{
  tail -n +300000001 &&
    perl -e 'truncate STDOUT, tell STDOUT'
} < file 1<> file

To get a progress bar, using pv:

{
  head -n 300000000 | pv -s 300000000 -lN 'Skipping 300M lines' > /dev/null &&
    cat | pv -N 'Rewriting the rest' &&
    perl -e 'truncate STDOUT, tell STDOUT'
} < file 1<> file

(We use head | pv and cat | pv because pv would refuse to work if its input and output pointed to the same file. pv -Sls 300000000 would also not work, as pv does not leave the file pointer just after the 300000000th line after exiting, the way head does (and is required to by POSIX for seekable files). pv | cat instead of cat | pv would allow pv to know how much it needs to read and give you an ETA, but it is currently bogus in that it does not take into account the cases where it is not reading from the start of the file, as is the case here.)

Note that those are dangerous as the file is being overwritten in place. There is a chance that you run out of disk space if the first 300M lines contained holes (shouldn't happen for a valid text file), and the rest of the file takes up more space than you have spare space on the FS.

Answered by Stéphane Chazelas on December 4, 2020

Another vote for a custom program if you really do need the task done. C or any powerful enough dynamic language like Perl or Python will do. I won't write out the full source here, but I will describe an algorithm that prevents data loss while you move data around (a rough sketch of step 1 follows the list):

  1. Read your big file from the end, counting line breaks. After gathering some predefined number of lines that you can safely fit in the free space, write this chunk out as a separate file and cut the big file's tail. Use the chunk's filename to store line numbers.
  2. After that you will end up with a completely emptied big file and lots of much smaller files taking up the same space.
  3. Count off your 300 million lines - you can delete all chunks corresponding to unnecessary lines right away, since you know which chunks contain which lines.
  4. If you don't actually need the big file, you can simply operate directly on the remaining chunks with whatever tools you need, using wildcards or stringing them together with cat as necessary.
  5. If you need the big file after all and the freed-up space is enough to store the sum of the remaining chunks after you've deleted the unnecessary ones, simply combine them with cp or cat.
  6. If you need the big file and there is not enough space, write another small program that does the reverse of step 1: save the list and individual length of each chunk to some list file, read the chunks one by one and append them to a newly created "big file", and each time a chunk has been appended, delete the separate small file containing it, thus allowing you to reassemble the file in place. If the process of writing a chunk is interrupted at any time, you can restart writing the big file by calculating the correct offset for any particular chunk, because you saved each chunk's size in advance.
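
A minimal Python sketch of step 1, just to make the idea concrete (the file names, chunk size and manifest format are all made up, and it assumes no single line is longer than CHUNK_BYTES):

import os

BIG = "file.txt"
CHUNK_BYTES = 10 * 1024 * 1024 * 1024   # ~10 GiB per chunk; must fit in free space

with open(BIG, "r+b") as f, open("chunks.manifest", "w") as manifest:
    chunk_no = 0
    size = f.seek(0, os.SEEK_END)
    while size > 0:
        # Jump back roughly CHUNK_BYTES, then move forward to the next line
        # boundary so every chunk starts on a full line (chunk 0 is the END
        # of the big file, chunk 1 the part before it, and so on).
        start = max(size - CHUNK_BYTES, 0)
        f.seek(start)
        if start > 0:
            f.readline()                 # the partial line stays in BIG
            start = f.tell()
        data = f.read(size - start)
        name = "chunk.%06d" % chunk_no
        with open(name, "wb") as out:
            out.write(data)
        manifest.write("%s %d\n" % (name, data.count(b"\n")))  # lines per chunk
        f.truncate(start)
        size = start
        chunk_no += 1

Summing the per-chunk line counts from the highest-numbered chunk (the start of the file) downwards tells you which chunks contain only unwanted lines and can be deleted outright (step 3).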

Answered by Oleg V. Volkov on December 4, 2020

I created a tool that may be of use to you: hexpeek is a hex editor designed for working with huge files and runs on any recent POSIX-like system (tested on Debian, CentOS, and FreeBSD).

One can use hexpeek or an external tool to find the 300-millionth newline. Then, assuming that X is the hexadecimal zero-indexed position of the first octet after the 300-millionth newline, the file can be opened in hexpeek and a single command 0,Xk will delete the first X octets in the file.
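
For example, X could be computed with a small throwaway script along these lines (this helper is not part of hexpeek; the 1 MiB read size is arbitrary):

import sys

path, n = sys.argv[1], int(sys.argv[2])      # e.g. file.txt 300000000
remaining = n        # newlines still to skip
offset = 0           # bytes consumed so far
with open(path, "rb") as f:
    while remaining > 0:
        block = f.read(1 << 20)
        if not block:
            sys.exit("file has fewer than %d lines" % n)
        count = block.count(b"\n")
        if count < remaining:
            remaining -= count
            offset += len(block)
        else:
            pos = 0
            for _ in range(remaining):       # walk to the last needed newline
                pos = block.index(b"\n", pos) + 1
            offset += pos
            remaining = 0
print(hex(offset))   # zero-indexed offset of the first octet after the Nth newline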

hexpeek requires no temporary file to perform this operation, although the optional backup mode does and would likely need to be disabled via the -backup flag (sadly, the current backup algorithm does not accommodate a rearrangement affecting more file space than is available for the backup file).

Of course, a custom C program can accomplish the same thing.

Answered by resiliware on December 4, 2020

The limiting factor for this problem is the amount of storage, wherever it is located. Significant RAM is not required, since fundamentally you can simply read one byte from wherever your file is stored and then either write or not write that byte (character) out to a new file, wherever that may reside. The input file and output file can be in totally separate places: on separate partitions, disks, or across a network. You do not need to read and write in the same folder. So for the attached program, you can simply give full path names for the input and output files to work around disk space limitations. You will be at the mercy of other limitations, such as disk or network I/O speed, but it will work. Taking a very long time to run is better than it not being possible at all.

  • Adjust LL, the hardcoded line length used to read in a whole line at a time from the text file; I set it to 2048 characters. Set it to 1000000 if you like, which would require 1 MB of RAM, should you have extremely long lines in the text file.
  • If your text file is ridiculously large... I often deal with up to 10 GB text files... consider running gzip -9 on it to create a mytextfile.gz. Being a text file, it will likely compress to 5% of its size, which is helpful given disk I/O speed vs CPU speed.
  • The new file, with the first N lines deleted, is written out as an uncompressed text file, so it will likely be huge.
  • This program is written in standard C; I kept it as simple as possible.
  • It checks and will not harm your original text file.
  • You do not have to compress your original text file for this to work; compressing it is optional.
  • You can have your original file on one disk or network location and write the output file with the first N lines deleted to some other disk or network location; just use full path names, for example:

delete_n_lines.x /home/ron/mybigfile.txt /some_nfs_mounted_disk/mybigfile_deletedlines.txt


/*  this file named    delete_n_lines.c

    compile by    gcc -W delete_n_lines.c -o delete_n_lines.x -lz

    have your huge text file already compressed via "gzip -9" to save disk space

    this program will also read a regular uncompressed text file
*/

# include <stdlib.h>
# include <stdio.h>
# include <string.h>
# include <zlib.h>

# define LL  2048   /* line length, number of characters up to '\n' */


int main ( int argc, char *argv[] )
{
   gzFile fin;
   FILE *fout;
   char line[LL];
   long int i, n = 0;
   long int n_lines_to_delete = 0;

   if ( argc != 4 )
   {
      printf("   Usage: %s  <infile> <outfile> <first_N_lines_to_delete>nn", argv[0] );
      exit( 0 );
   }

   n = sscanf( argv[3], "%ld", &n_lines_to_delete );
   if ( n == 0 )
   {
      printf("\n   Error: problem reading N lines to delete\n\n" );
      exit( 0 );
   }

   if ( strcmp( argv[1], argv[2] ) == 0 )
   {
      printf("n   Error: infile and outfile are the same.n" );
      printf("          don't do thatnn");
      exit( 0 );
   }

   fout = fopen( argv[2], "w" );
   if ( fout == NULL )
   {
      printf("n   Error: could not write to %snn", argv[2] );
      exit( 0 );
   }

   fin = gzopen( argv[1], "r" );
   if ( fin == NULL )
   {
      printf("n   Error: could not read %snn", argv[1] );
      fclose( fout );
      exit( 0 );
   }

   n = 0;
   gzgets( fin, line, LL );
   while ( ! gzeof( fin ) )
   {
      if ( n < n_lines_to_delete )
         n++;
      else
         fputs( line, fout );

      gzgets( fin, line, LL );
   }

   gzclose( fin );
   fclose( fout );

   printf("n   deleted the first %d lines of %s, output file is %snn", n, argv[1], argv[2] );


   return 0;
}

Answered by ron on December 4, 2020

You can do it with losetup, as an alternative to the dd method described in the other answer below. Again, this method is just as dangerous.

Again, the same test file and sizes apply (remove lines 1-300 from a 1000-line file):

$ seq 1 1000 > 1000lines.txt
$ stat -c %s 1000lines.txt
3893 # total bytes
$ head -n 300 1000lines.txt | wc -c
1092 # first 300 lines bytes
$ echo $((3893-1092))
2801 # target filesize after removal

Create a loop device:

# losetup --find --show 1000lines.txt
/dev/loop0
losetup: 1000lines.txt: 
Warning: file does not fit into a 512-byte sector; 
the end of the file will be ignored.
# head -n 3 /dev/loop0
1 
2 
3 
# tail -n 3 /dev/loop0
921
922
923

Whoops. There are numbers missing. What's going on?

Loop devices require their backing files to be a multiple of the sector size. Text files with lines don't usually fit that scheme, so in order not to miss the end-of-file content (the last partial sector), just append some more data first, then try again:

# head -c 512 /dev/zero >> 1000lines.txt
# losetup --find --show 1000lines.txt
/dev/loop1
losetup: 1000lines.txt: 
Warning: file does not fit into a 512-byte sector; 
the end of the file will be ignored.
# tail -n 3 /dev/loop1
999
1000

The warning persists but the content is complete now, so that's okay.

Create another one, this time with the 300 line offset:

# losetup --find --show --offset=1092 1000lines.txt
/dev/loop2
losetup: 1000lines.txt: 
Warning: file does not fit into a 512-byte sector; 
the end of the file will be ignored.
# head -n 3 /dev/loop2
301
302
303
# tail -n 3 /dev/loop2
999
1000

Here's the nice thing about loop devices. You don't have to worry about truncating the file by accident. You can also easily verify that your offsets are indeed correct before performing any action.

Finally, just copy it over, from offset device to full:

cp /dev/loop2 /dev/loop1

Dissolve loop devices:

losetup -d /dev/loop2 /dev/loop1 /dev/loop0

(Or: losetup -D to dissolve all loop devices.)

Truncate the file to target filesize:

truncate -s 2801 1000lines.txt

The result:

$ head -n 3 1000lines.txt 
301
302
303
$ tail -n 3 1000lines.txt 
998
999
1000

Answered by frostschutz on December 4, 2020

On some filesystems, like ext4 or XFS, you can use the fallocate() system call with FALLOC_FL_COLLAPSE_RANGE for that.
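
The collapse-range operation removes a byte range from the file and shifts the rest down without rewriting the data. As a rough illustration only (assuming 64-bit Linux with glibc; Python's standard library does not wrap this flag, so the sketch goes through ctypes), something like this could collapse the bulk of the unwanted prefix:

import ctypes, os

FALLOC_FL_COLLAPSE_RANGE = 0x08              # from <linux/falloc.h>

libc = ctypes.CDLL("libc.so.6", use_errno=True)
# int fallocate(int fd, int mode, off_t offset, off_t len); off_t is 64-bit here
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_long, ctypes.c_long]

def collapse_prefix(path, nbytes):
    """Drop the first nbytes of path, rounded DOWN to a filesystem block
    boundary (collapse-range only works on whole blocks). Returns the number
    of bytes actually removed."""
    blk = os.statvfs(path).f_bsize           # assumed equal to the fs block size
    nbytes -= nbytes % blk
    if nbytes == 0:
        return 0
    fd = os.open(path, os.O_RDWR)
    try:
        if libc.fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, nbytes) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)
    return nbytes

This is fast and needs no extra space, but it cannot cut exactly at the 300-millionth newline: the leftover partial block of unwanted bytes would still have to be shifted out with one of the in-place copy approaches on this page, or tolerated as a short garbage prefix.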

Answered by pink slime on December 4, 2020

Removing the first n lines (or bytes) can be done in-place using dd (or, alternatively, using loop devices). It does not use a temporary file and there is no size limit; however, it is dangerous since there is no record of progress, and any error leaves you with a broken file.

Example: Create a sample file with 1000 lines:

$ seq 1 1000 > 1000lines.txt
$ head -n 3 1000lines.txt
1
2
3
$ tail -n 3 1000lines.txt
998
999
1000

We want to remove the first 300 lines. How many bytes does it correspond to?

$ stat -c %s 1000lines.txt
3893 # total bytes
$ head -n 300 1000lines.txt | wc -c
1092 # first 300 lines bytes
$ echo $((3893-1092))
2801 # target filesize after removal

The file is 3893 bytes, we want to remove the first 1092 bytes, leaving us with a new file of 2801 bytes.

To remove these bytes, we use the GNU dd command with conv=notrunc, as otherwise the file would be truncated to zero length before dd could copy its contents:

$ dd conv=notrunc iflag=skip_bytes skip=1092 if=1000lines.txt of=1000lines.txt
5+1 records in
5+1 records out
2801 bytes (2.8 kB, 2.7 KiB) copied, 8.6078e-05 s, 32.5 MB/s

This removes the first 300 lines, but now the last 1092 bytes repeat, because the file is not truncated yet:

$ truncate -s 2801 1000lines.txt

This reduces the file to its final size, removing the duplicated lines at the end of the file.

The result:

$ stat -c %s 1000lines.txt 
2801

$ head -n 3 1000lines.txt
301
302
303

$ tail -n 3 1000lines.txt
998
999
1000

The process for a larger file is similar. You may need to set a larger blocksize for better performance (the blocksize option for dd is bs).

The main issue is determining the correct byte offset for the exact line number. In general it can only be done by reading and counting. With this method, you have to read the entire file at least once even if you are discarding a huge chunk of it.

Answered by frostschutz on December 4, 2020
