How do I split a file each time a regular expression occurs?

Question

I'm trying to get gawk to split a text file into a different file each time a paragraph contains an occurrence of a code of the form "7-04/PNLP-000001". So, for instance, if the original text file contains the following:

Proposición no de Ley 7-04/PNLP-000009, relativa a la línea Ave Sevilla-Córdoba-Madrid.
La señora PRESIDENTA
Proposición no de Ley 7-04/PNLP-000001, relativa a la restitución y mejora de los derechos y la cobertura social de los trabajadores del medio rural andaluz.
La señora PRESIDENTA

I would like to get a file with this content:

Proposición no de Ley 7-04/PNLP-000009, relativa a la línea Ave Sevilla-Córdoba-Madrid.
La señora PRESIDENTA

and another with this content:

Proposición no de Ley 7-04/PNLP-000001, relativa a la restitución y mejora de los derechos y la cobertura social de los trabajadores del medio rural andaluz.
La señora PRESIDENTA

I'm trying to do this with this code:
gawk '
        /^\n.+[0-9]-[0-9]{2}\/.+-[0-9]{6}$/
        {if (p) close (p)
        p = sprintf("split%05i.txt", ++i) }
            { print > p; }
    ' input.txt

However, this just creates one file per line, whatever its content. Does anyone know what I'm doing wrong? Thanks in advance!

Garo · Answer

I would do it like this:
perl -ne 'if (/7-04\/PNLP-(\d+)/) { close $fh if $fh; open($fh, ">", "/path/to/outputfiles/file$1") or die "open: $!"; } print { $fh ? $fh : \*STDOUT } $_;' < /path/to/inputfile

Stéphane Chazelas · Answer

You're close:
awk '/[0-9]-[0-9]{2}\/[[:upper:]]+-[0-9]{6}/ {
       if (file) close (file)
       file = sprintf("split%05i.txt", ++i)
     }
     file {print > file}' input.txt

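You want the { if... } code block to be run only for the lines that match the [0-9]... pattern, so its opening { should be on the same line as the /.../. With the pattern on a line of its own, awk treats it as a separate rule with the default action (print the record), and the block that follows has no pattern, so it runs for every line. That is why you got one file per line.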
The second code block {print > file} is to be run for every record as long as file is set, using file as the condition.
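To see the effect of where the block starts, here is a small sketch. The two-line input and the message text are made up for illustration and are not part of the original problem:

```shell
# Pattern and block on separate lines: awk sees two independent rules.
# Rule 1 (/match/ alone) prints matching records; rule 2 runs for every record.
separate=$(printf 'match\nother\n' | awk '
/match/
{ print "block ran for: " $0 }')

# Pattern and block on the same line: one rule, run only for matching records.
joined=$(printf 'match\nother\n' | awk '
/match/ { print "block ran for: " $0 }')

printf 'separate:\n%s\n\njoined:\n%s\n' "$separate" "$joined"
```

In the first form the block runs for both input lines; in the second it runs only for the line containing "match".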
Having \n in your pattern here doesn't make sense as each record that awk processes in turn is the contents of each line (as the default record separator (RS) is \n), so a record is never going to contain a newline character. You also don't want to anchor your regexp here (^ and $).
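A quick way to check the point about newlines, as a sketch with made-up three-line input: count the records that match /\n/ under the default record separator.

```shell
# With the default RS, each record is one line with the newline stripped,
# so no record can match a regexp that requires a newline character.
newline_records=$(printf 'one\ntwo\nthree\n' | awk '/\n/ { n++ } END { print n + 0 }')
echo "$newline_records"
```

This prints 0: none of the three records contains a newline.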
I've replaced your .+ with [[:upper:]]+ so as to be more specific. With .+, it would match on blah 5-10/2 blah blah €1000000 for instance. You may need to adapt depending on what you want to accept in place of PNLP.
Note that it also matches on blah 1234-56/XX-1234567890 blah as that does contain a string that matches the pattern (the 4-56/XX-123456 part).
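If you want to see which part of a line the unanchored pattern picks up, one option is awk's match() function, which sets RSTART and RLENGTH. This is only an illustrative sketch: the {2} and {6} repetitions are written out in full, which should be equivalent and also runs on awk implementations without interval support.

```shell
# match() records where the match starts (RSTART) and how long it is (RLENGTH);
# substr() then extracts exactly the text the regexp matched.
matched=$(echo 'blah 1234-56/XX-1234567890 blah' | awk '
match($0, /[0-9]-[0-9][0-9]\/[[:upper:]]+-[0-9][0-9][0-9][0-9][0-9][0-9]/) {
  print substr($0, RSTART, RLENGTH)
}')
echo "$matched"
```

This prints 4-56/XX-123456, i.e. the pattern is satisfied by a substring in the middle of the longer numbers.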
I've removed the g in gawk as that code is not gawk specific. However note that there are still a few awk implementations that don't support the {2}/{6} operators above (even though that's a POSIX requirement), so if you know gawk is going to be available, you might as well use it to make sure it works.
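If you cannot rely on interval support, a possible fallback is to spell the repetitions out. The sketch below does that and runs the result on the sample text from the question; the scratch directory is an assumption made for the demonstration, and %05d is used as an equivalent integer conversion to %05i.

```shell
# Work in a scratch directory so the split files don't clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input taken from the question.
cat > input.txt <<'EOF'
Proposición no de Ley 7-04/PNLP-000009, relativa a la línea Ave Sevilla-Córdoba-Madrid.
La señora PRESIDENTA
Proposición no de Ley 7-04/PNLP-000001, relativa a la restitución y mejora de los derechos y la cobertura social de los trabajadores del medio rural andaluz.
La señora PRESIDENTA
EOF

# Same logic as the command above, with [0-9]{2} and [0-9]{6} written out.
awk '/[0-9]-[0-9][0-9]\/[[:upper:]]+-[0-9][0-9][0-9][0-9][0-9][0-9]/ {
       if (file) close(file)                  # finish the previous output file
       file = sprintf("split%05d.txt", ++i)   # start the next one
     }
     file { print > file }' input.txt

ls split*.txt
```

With this input you should end up with split00001.txt and split00002.txt, each holding one "Proposición" line and the "La señora PRESIDENTA" line that follows it.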
