TransWikia.com

How do I split a file each time a regular expression occurs?

Unix & Linux Asked by edperez on November 26, 2021

I’m trying to get gawk to split a text file into a different file each time a paragraph contains an occurrence of a code of the form "7-04/PNLP-000001". So, for instance, if the original text file contains the following:

Proposición no de Ley 7-04/PNLP-000009, relativa a la línea Ave Sevilla-Córdoba-Madrid.
La señora PRESIDENTA
Proposición no de Ley 7-04/PNLP-000001, relativa a la restitución y mejora de los derechos y la cobertura social de los trabajadores del medio rural andaluz.
La señora PRESIDENTA

I would like to get a file with this content:

Proposición no de Ley 7-04/PNLP-000009, relativa a la línea Ave Sevilla-Córdoba-Madrid.
La señora PRESIDENTA

and another with this content:

Proposición no de Ley 7-04/PNLP-000001, relativa a la restitución y mejora de los derechos y la cobertura social de los trabajadores del medio rural andaluz.
La señora PRESIDENTA

I’m trying to do this with this code:

gawk '
        /^\n.+[0-9]-[0-9]{2}\/.+-[0-9]{6}$/
        {if (p) close (p)
        p = sprintf("split%05i.txt", ++i) }
            { print > p; }
    ' input.txt

However, this just creates one file per line, whatever its content. Does anyone know what I’m doing wrong? Thanks in advance!

2 Answers

I would do it like this:
perl -ne 'my $fh="/dev/stdout"; if(/7-04\/PNLP-(\d+)/) { close $fh; open($fh,">/path/to/outputfiles/file$1"); } ; print $fh $_;' < /path/to/inputfile

Answered by Garo on November 26, 2021

You're close:

awk '/[0-9]-[0-9]{2}\/[[:upper:]]+-[0-9]{6}/ {
       if (file) close (file)
       file = sprintf("split%05i.txt", ++i)
     }
     file {print > file}' input.txt

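You want the { if... } code block to run only for the lines that match the [0-9]... pattern, so its opening brace must be on the same line as the /.../. In your version, the newline after the pattern ends the rule: the pattern on its own gets the default action (print the line), and the block that follows has no pattern, so it runs for every line. That is why you get one file per line.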

The second code block, {print > file}, uses file as its condition, so it runs for every record once file has been set.
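To make the effect of the brace placement concrete, here is a minimal sketch that counts how often the block runs in each layout. The two-line sample and the simplified /PNLP/ pattern are assumptions chosen for illustration, not part of the original code.

```shell
#!/bin/sh
# Hypothetical two-line sample: only the first line contains the marker.
tmp=$(mktemp) || exit 1
printf '%s\n' 'head 7-04/PNLP-000009' 'body' > "$tmp"

# Pattern alone on its line: it forms a rule with the default action (print),
# and the block on the next line, having no pattern, runs for every record.
broken=$(awk '
  /PNLP/
  { n++ }
  END { print n }' "$tmp")

# Opening brace on the pattern line: the block runs only for matching records.
fixed=$(awk '
  /PNLP/ { n++ }
  END { print n }' "$tmp")

printf 'pattern on its own line:\n%s\n' "$broken"
printf 'pattern and block together:\n%s\n' "$fixed"
rm -f "$tmp"
```

In the first layout the matching line is also echoed to standard output and the counter reaches 2; in the second only the counter is printed and it stays at 1.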

Having \n in your pattern doesn't make sense here: awk processes the input one record at a time, and since the default record separator (RS) is \n, each record is the contents of one line and never contains a newline character. You also don't want to anchor your regexp (^ and $) here.
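A quick way to check this, sketched with a made-up two-line input, is to count the records and the records that match a literal newline:

```shell
#!/bin/sh
# NR is the number of records read; hits counts records containing a newline.
# "hits + 0" forces a numeric 0 when the pattern never matched.
out=$(printf 'one\ntwo\n' | awk '/\n/ { hits++ } END { print NR, hits + 0 }')
echo "$out"
```

With the default RS this prints 2 0: two records were read and neither contains a newline.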

I've replaced your .+ with [[:upper:]]+ so as to be more specific. With .+, it would match on blah 5-10/2 blah blah €1000000 for instance. You may need to adapt depending on what you want to accept in place of PNLP.

Note that it also matches on blah 1234-56/XX-1234567890 blah, as that line contains a substring that matches the pattern (4-56/XX-123456).
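One way to see which part of a line the pattern picks up is to print only the matched text. This sketch assumes a grep that supports -o, which GNU, BSD and BusyBox grep do although POSIX does not require it; in grep the / needs no escaping because it is not a delimiter there.

```shell
#!/bin/sh
# The sample line from the paragraph above.
line='blah 1234-56/XX-1234567890 blah'

# -E selects extended regular expressions, -o prints only the matched part.
matched=$(printf '%s\n' "$line" | grep -oE '[0-9]-[0-9]{2}/[[:upper:]]+-[0-9]{6}')
echo "$matched"
```

The surrounding digits are simply ignored, which is why the line counts as a match.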

I've removed the g in gawk as that code is not gawk specific. However note that there are still a few awk implementations that don't support the {2}/{6} operators above (even though that's a POSIX requirement), so if you know gawk is going to be available, you might as well use it to make sure it works.
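If gawk is not available and the local awk lacks the {n} operators, one possible workaround is to spell the repetitions out. The sketch below is an assumption-laden variant rather than the code above: it uses [A-Z] in place of [[:upper:]] (which presumes unaccented ASCII codes such as PNLP), %05d in place of %05i, and a scratch directory created with mktemp -d.

```shell
#!/bin/sh
# Work in a scratch directory so the split files do not clutter anything.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

# Sample input taken from the question.
cat > input.txt <<'EOF'
Proposición no de Ley 7-04/PNLP-000009, relativa a la línea Ave Sevilla-Córdoba-Madrid.
La señora PRESIDENTA
Proposición no de Ley 7-04/PNLP-000001, relativa a la restitución y mejora de los derechos y la cobertura social de los trabajadores del medio rural andaluz.
La señora PRESIDENTA
EOF

# Same logic as above, with {2} and {6} written out as repeated [0-9].
awk '/[0-9]-[0-9][0-9]\/[A-Z]+-[0-9][0-9][0-9][0-9][0-9][0-9]/ {
       if (file) close(file)                  # finish the previous output file
       file = sprintf("split%05d.txt", ++i)   # start the next one
     }
     file { print > file }' input.txt

ls split*.txt
```

On the sample input this leaves split00001.txt and split00002.txt in the scratch directory, each holding one heading line followed by its "La señora PRESIDENTA" line.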

Answered by Stéphane Chazelas on November 26, 2021
