TransWikia.com

Simple parser using Flex and C++

Code Review Asked on December 5, 2020

This is an alternative parser based on the specifications from this question. Briefly stated, the input file is a text file which has at least 33 fields separated by semicolons.

If the fourth field begins with either a T or an E, the line is valid and a subset of it written to the output file. Specifically, fields as numbered from $0$, should be output in this order: $ {0, 2, 3, 4, 5, 6, 10, 9, 11, 7, 32}$, each separated by a comma. All other fields are discarded.

One of the other answers there suggested that one could use a Flex-based parser instead. My own efforts were not faster, but I’m hoping that someone can review this and show me how to extract more speed from this version.

lexer.l

%{
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <algorithm>
#include <experimental/iterator>
#include <iterator>
#undef YY_DECL
#define YY_DECL int FileLexer::yylex()


class FileLexer : public yyFlexLexer {
public:
    FileLexer(std::istream& in, std::ostream& out) :
        yyFlexLexer{&in, &out},
        out{out}
    {}
    using FlexLexer::yylex;
    /// the yylex function is automatically created by Flex.
    virtual int yylex();
private:
    /// pointer to the current value
    std::vector<std::string> vec;
    std::ostream& out;
    unsigned fieldcount{0};
    bool valid{true};
};


%}
%option warn nodefault batch noyywrap c++
%option yyclass="FileLexer"

FIELD  [^;n]*
DELIM   ;
%%

{DELIM} { }
n      {
            if (valid && fieldcount >= 33) {
                std::copy(vec.begin(), vec.end(), std::experimental::make_ostream_joiner(out, ","));
                out << 'n';
            }
            vec.clear();
            fieldcount = 0;
            valid = true;
            return 1;
        }
{FIELD} {
            if (valid) {
                switch (fieldcount++) {
                    case 0:
                    case 1:
                    case 4:
                    case 5:
                    case 6:
                    case 7:
                    case 9:
                    case 32:
                        vec.push_back(yytext);
                        break;
                    case 3:
                        if (yytext[0] == 'E' || yytext[0] == 'T') {
                            vec.push_back(yytext);
                            valid = true;
                        } else {
                            valid = false;
                        }
                        break;
                    case 10:
                        {
                            auto n{vec.size()};
                            vec.push_back(yytext);
                            std::iter_swap(vec.begin()+n, vec.begin()+n-2);
                        }
                        break;
                    case 11:
                        {
                            auto n{vec.size()};
                            vec.push_back(yytext);
                            std::iter_swap(vec.begin()+n, vec.begin()+n-1);
                        }
                        break;

                }
            }
        }
%%
int main(int argc, char *argv[]) {
    if (argc >= 3) {
        std::ifstream in{argv[1]};
        std::ofstream out{argv[2]};
        FileLexer lexer{in, out};
        while (lexer.yylex() != 0)
        {}
    }
}

Compile with:

flex -o parsefile.cpp lexer.l
g++ -O2 -std=gnu++17 parsefile.cpp -o parsefile

This works, but it is slow (2.165 s) on my machine, with the same million-line input file as mentioned in my answer to the other question.

I tried it a few different ways, but I was unable to get a version that was faster than the PHP code in the other question. The switch statement logic is arguably a bit overly clever and stores only the needed fields in the desired order, but the speed was about the same as the straightforward implementation.

If it matters, I’m using gcc version 10.1 and flex 2.6.4 on a 64-bit Linux machine.

2 Answers

I see a few small issues in the C++ code, that probably won't give any large performance benefit. Flex is doing the heavy work of reading the input and parsing it, there's not much you can do about that.

Iterator arithmetic

Instead of:

case 10:
    {
        auto n{vec.size()};
        vec.push_back(yytext);
        std::iter_swap(vec.begin() + n, vec.begin() + n - 2);
    }

You can also do iterator arithmetic on the end iterator, thereby avoiding the need to get the size of the vector:

case 10:
    vec.push_back(yytext);
    std::iter_swap(vec.end() - 1, vec.end() - 3);

Don't return 1 after reading a newline character

There is no need to return from yylex() after reading a newline, just remove the return 1 statement. This avoids needing the while-loop in main().

Use emplace_back() instead of push_back()

This avoids having to create a temporary that is being copied into the vector.

Correct answer by G. Sliepen on December 5, 2020

Overview

There is an issue (Bug) here in that yytext points at the beginning of the lexeme. But the lexeme is not null ('') terminated. You need to pass a length if you want to pass the current token into vec

   vec.push_back(yytext);

   // should be:
   vec.emplace_back(yytext, yytext + yylen);

You had a bug in your call to the underlying base class.

    FileLexer(std::istream& in, std::ostream& out) :
        yyFlexLexer{&in, &out},
        out{out}
    {}

Sorry I fixed it before any answers. But you need to pass the address of the streams to yyFlexLexer.


Normally I would return a value for each lexeme (and move any complex processing into methods of FileLexer. BUT this is such a simple class I don't see any issue with your current implementation of putting all the code in the lexer directly (though I would remove the return 1; from the end of line marker to make it consistent with field processing).


This seems to be correct.

std::copy(vec.begin(), vec.end(), std::experimental::make_ostream_joiner(out, ","));
out << 'n';

But it is not obvious how it achieves it. Would be nice to have a comment that points out that field 10/11 are not added to the end but rather to a point not at the end of the end of the vector.


Answered by Martin York on December 5, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP