Parsing and validation of a CSV file

Question

I am trying to create a library where I need to validate and parse a file in a CSV-like format and then use this data to generate a Tree data structure.

At the moment I split the process into two steps:

1. Validate the file and store its lines in a list (in the parsing package)
2. Read the list plus a string array containing a subset of columns to generate the tree (in the hierarchy package)

I was wondering whether the two steps could be merged into one: I would then store the data only once, which should perform better, and I would store less of it because only the column subset is kept. My concern is that a single step would produce a merged package with too many responsibilities: validation and tree preparation.

Which option is better in terms of software architecture best practice?

Basilevs · Answer

Do nothing if you see no performance problems.
Otherwise, make the parser return an Iterator instead of a List. That way you only need enough memory to hold a single line, at the cost of complicating the parser's lifetime management.
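As a rough illustration of the iterator idea, here is a minimal Python sketch. The name parse_rows and its expected_columns parameter are made up for this example and are not taken from the question; it also assumes plain comma-delimited input that the standard csv module can read.

```python
import csv
from typing import Iterable, Iterator, List


def parse_rows(lines: Iterable[str], expected_columns: int) -> Iterator[List[str]]:
    """Validate and yield rows one at a time instead of building a list.

    `parse_rows` and `expected_columns` are hypothetical names for this sketch.
    """
    for line_number, row in enumerate(csv.reader(lines), start=1):
        # Validation happens lazily, only when the consumer asks for the next row.
        if len(row) != expected_columns:
            raise ValueError(
                f"line {line_number}: expected {expected_columns} fields, got {len(row)}"
            )
        yield row
```

The lifetime cost shows up at the call site: if lines is an open file, the file has to stay open until the consumer has finished iterating, and a validation error can surface halfway through building the tree rather than up front.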

FiddleStix · Answer

I need to validate and parse a file in a CSV-like format and then use this data to generate a Tree data structure

Your process is inherently two-stage.  Trying to merge these two separate things into one will certainly result in less legible and harder-to-maintain code.  It is almost certainly not necessary to write everything as one giant package/class/function to get good time and space efficiency.

Your first stage is to import a .csv file and parse it into some table-esque data structure. You can check that the file is well-formed (it is a text file, it is comma-delimited) and that it contains tabular data (all rows have the same number of columns, etc.) without knowing or caring what the data will later be used for.
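A minimal sketch of such a first stage in Python might look like the following. The function name read_table is hypothetical, and treating the first row as the reference width is an assumption of this example, not something the question states.

```python
import csv
import io
from typing import List


def read_table(text: str) -> List[List[str]]:
    """Parse CSV text into a list of rows and check that it is rectangular.

    `read_table` is a hypothetical name; nothing here knows about trees.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("file is empty")
    width = len(rows[0])  # assumption: the first row defines the column count
    for line_number, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            raise ValueError(
                f"line {line_number}: expected {width} columns, got {len(row)}"
            )
    return rows
```

Note that nothing in this stage refers to parents, children, or which columns matter later.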

Your second stage is to take tabular data (e.g. an array of arrays) and turn it into a tree.  At this point, your hierarchy package will be doing validation but it will only be validating the tree structure (e.g. every node except the root has one parent, etc.).  By this stage, you definitely don't care about invalid file formats or file-not-found errors.
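To make the second stage concrete, here is a hedged Python sketch. The function name build_tree, the idea of an id column and a parent column, and the convention that an empty parent marks the root are all assumptions made for this example; your format may encode the hierarchy differently.

```python
from typing import Dict, List, Tuple


def build_tree(
    table: List[List[str]], id_column: str, parent_column: str
) -> Tuple[str, Dict[str, List[str]]]:
    """Build a root id and a children map from an already-parsed table.

    Hypothetical layout: the first row is a header, and an empty parent
    cell marks the root.
    """
    header, *rows = table
    id_index = header.index(id_column)
    parent_index = header.index(parent_column)

    parent_of: Dict[str, str] = {}
    for row in rows:
        node = row[id_index]
        if node in parent_of:
            raise ValueError(f"duplicate node id: {node!r}")
        parent_of[node] = row[parent_index]

    roots = [node for node, parent in parent_of.items() if parent == ""]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")

    children: Dict[str, List[str]] = {node: [] for node in parent_of}
    for node, parent in parent_of.items():
        if parent == "":
            continue
        if parent not in parent_of:
            raise ValueError(f"node {node!r} has unknown parent {parent!r}")
        children[parent].append(node)

    # Every node must be reachable from the root; otherwise there is a cycle.
    seen = set()
    stack = [roots[0]]
    while stack:
        node = stack.pop()
        seen.add(node)
        stack.extend(children[node])
    if len(seen) != len(parent_of):
        raise ValueError("some nodes are not reachable from the root")

    return roots[0], children
```

All the errors raised here are about tree structure. File-level problems (missing file, ragged rows) are assumed to have been rejected before the table reaches this function.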

If you want to save space (and I would question whether you really need to), you can write your hierarchy package so that it frees each row of the tabular data as it builds the tree, or so that it modifies the tabular data in place to represent a tree (see the heap data structure for an example of a tree represented as an array).
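For reference, the array layout used by binary heaps can be sketched like this in Python. It only illustrates the idea of a tree stored in a flat array; it assumes a complete binary tree, which your hierarchy may well not be, and the helper names are invented for the example.

```python
from typing import List, Optional


def children_of(tree: List[str], index: int) -> List[str]:
    """Return the children of the node at `index` in a heap-style array."""
    # In this layout the children of position i live at 2*i + 1 and 2*i + 2.
    return [tree[i] for i in (2 * index + 1, 2 * index + 2) if i < len(tree)]


def parent_of(tree: List[str], index: int) -> Optional[str]:
    """Return the parent of the node at `index`, or None for the root."""
    return None if index == 0 else tree[(index - 1) // 2]
```

The point is that no separate node objects or pointers are needed: the position in the array encodes the parent/child relationship.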
