
How to organize data in workflows with user input between steps?

Data Science Asked by meow on July 2, 2021

I am building a multistep workflow.

  • User provides input files input.csv, parameters_global.yml, parameters_step1.yml, parameters_step2.yml.
  • Step 1 loads the data and parameters (input.csv, parameters_global.yml, parameters_step1.yml), processes the data, and produces step1_results.csv.
  • The user has to check the results, mark every line as OK or not OK, and save the result as step1_checked.csv.
  • Step 2 loads parameters_global.yml, parameters_step2.yml and step1_checked.csv, processes everything, and produces some step2_results.csv.

Repeat ad infinitum. Are there best practices for organizing the data and code of the different steps into input, output, and similar folders?

My current suggestion is:

  • input/01_step1name/, input/02_step2name/ for all user-provided extra input per step
  • output/01_step1name/, output/02_step2name/ for all generated output per step
  • scripts/01_step1name/01_firstscript.R, scripts/01_step1name/02_secondscript.R, scripts/01_step1name/03_thirdscript.py
  • all steps may use input and output from any preceding step
  • within one step, there shall be no additional user input (but potentially multiple scripts).
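To make the layout above concrete, here is a minimal Python sketch that derives the folders of a step from its number and name, and lists the folders of all preceding steps that a later step may read from. The helper names step_dirs and preceding_dirs and the project root "proj" are made up for illustration; only the input/, output/ and scripts/ naming scheme comes from the list above.

```python
from pathlib import Path


def step_dirs(root, number, name):
    """Return the input, output and scripts folders of one step.

    Follows the naming scheme <kind>/<NN>_<stepname>/ from the list above.
    """
    tag = f"{number:02d}_{name}"
    root = Path(root)
    return {kind: root / kind / tag for kind in ("input", "output", "scripts")}


def preceding_dirs(root, steps, number):
    """Return the input and output folders of every step before `number`.

    `steps` maps step numbers to step names, e.g. {1: "step1name"}.
    """
    dirs = []
    for n in sorted(steps):
        if n < number:
            d = step_dirs(root, n, steps[n])
            dirs += [d["input"], d["output"]]
    return dirs


if __name__ == "__main__":
    steps = {1: "step1name", 2: "step2name"}  # hypothetical step names
    print(step_dirs("proj", 2, steps[2])["output"])
    for folder in preceding_dirs("proj", steps, 2):
        print(folder)
```

Keeping the path logic in one place like this would mean that individual scripts never hard-code folder names, so renaming or renumbering a step touches a single mapping.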

There are also cases where a user might need to take an output and go to a measurement device, measure something based on that output, come back and add the results as a new input for the next step. I could "symbolize" the measurement by adding an extra step where there is nothing to execute, and manually put the measurements in the output folder when done. Alternatively, I could treat the measurement as user input and use the measurement results as input for the next step…
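Whichever of the two options is chosen, the next step could refuse to run until the manual work is actually finished. Below is a minimal Python sketch of such a gate for a checked results file. The column name "status", the labels "OK" and "NOT_OK", and the choice to pass on only the accepted rows are all assumptions for illustration, not part of the workflow described above.

```python
import csv
from pathlib import Path

ALLOWED = {"OK", "NOT_OK"}  # hypothetical labels for the manual check


def load_checked(path, column="status"):
    """Load a manually checked CSV and return the rows marked OK.

    Raises FileNotFoundError if the file has not been provided yet and
    ValueError if any row is still unmarked.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"manual step not finished: {path} is missing")

    with path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))

    # Line 1 of the file is the header, so data rows start at line 2.
    unmarked = [
        line
        for line, row in enumerate(rows, start=2)
        if (row.get(column) or "").strip() not in ALLOWED
    ]
    if unmarked:
        raise ValueError(f"{path}: rows without a valid mark on lines {unmarked}")

    return [row for row in rows if row[column].strip() == "OK"]
```

The same check would work for a measurement file: the step that consumes it only needs to know where the file is expected and which columns must be filled in.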

An added complication is that some, but not all, steps share a code base, e.g.

  • scripts/02_processingstep2/01_run.R sources a functions.R
  • scripts/03_processingstep3/01_run.R sources the same functions.R
  • scripts/04_processingstep4/01_run.R might use a completely different set of R libraries/environment/functions

I don’t really have a good concept to handle this… perhaps an extra

  • library/mytoolset/ with functions.R
  • library/othertoolset/ with different codebase
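For the Python scripts, a shared toolset under library/ could be loaded by path, roughly as in the sketch below. It assumes a Python counterpart functions.py next to functions.R, which is hypothetical, as is the helper name load_toolset. The R scripts would do the equivalent with source() on the same folder.

```python
import importlib.util
from pathlib import Path


def load_toolset(root, toolset, module="functions"):
    """Load library/<toolset>/<module>.py as a module object.

    Each step names the toolset it needs, so steps with different
    code bases never import each other's functions by accident.
    """
    path = Path(root) / "library" / toolset / f"{module}.py"
    spec = importlib.util.spec_from_file_location(f"{toolset}_{module}", path)
    loaded = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(loaded)  # runs the file, like source() in R
    return loaded
```

A script in scripts/02_processingstep2/ would then call something like load_toolset(project_root, "mytoolset") and use the returned module, while a step with a completely different environment simply never loads it.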

I know about Nextflow, but I feel it solves the problem of execution, which is not such a big deal for me, rather than the problem of organization.
