LunchBytes - Workflow Managers
From Robert Turner
Scientific workflows (for example in bioinformatics) often require several different scripts or pieces of software to be run and their outputs fed into each other. Ideally, we’d like to be able to understand the provenance of the output, without having to re-run the entire workflow every time a minor change is made to code or data. This is often difficult to achieve with bash or Python scripts (for example), but workflow managers can help.
In this session we will learn more about the problems of controlling workflows / pipelines using scripts, and about what workflow managers are, and how they can help.
The trouble with scripts (Will Furnass)
Scripts let us chain together several bits of software, often taking the output of one and feeding it into another and/or running the same software repeatedly with different inputs. This is often far better than entering commands into a prompt and then having to remember those commands and their sequence when doing the same job again.
However, glueing together the parts of a multi-step or multi-input workflow using scripts can be difficult: it’s all too easy to create scripts that aren’t portable between machines/systems, are brittle/unreliable, are difficult to read and test and aren’t modular or reusable.
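To make the problem concrete, here is a minimal sketch of the bookkeeping a script needs if it is to avoid re-running steps whose outputs are already up to date. The file names (`raw.txt`, `upper.txt`, `count.txt`) and the two toy steps are hypothetical, chosen only for illustration. The timestamp comparison is the same idea used by tools such as `make`, not something taken from any of the talks.

```python
import tempfile
from pathlib import Path


def is_stale(target: Path, sources: list[Path]) -> bool:
    """Return True if target is missing or older than any of its sources."""
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(src.stat().st_mtime > target_mtime for src in sources)


def run_step(name, func, sources, target, log):
    """Run one step only if its output is out of date; record what happened."""
    if is_stale(target, sources):
        func(sources, target)
        log.append(f"ran {name}")
    else:
        log.append(f"skipped {name}")


def to_upper(sources, target):
    # Hypothetical step 1: upper-case the raw input.
    target.write_text(sources[0].read_text().upper())


def count_chars(sources, target):
    # Hypothetical step 2: count the characters produced by step 1.
    target.write_text(str(len(sources[0].read_text())))


def run_workflow(workdir: Path) -> list[str]:
    """Chain the two steps by hand, passing files from one to the next."""
    log = []
    raw = workdir / "raw.txt"
    upper = workdir / "upper.txt"
    count = workdir / "count.txt"
    run_step("to_upper", to_upper, [raw], upper, log)
    run_step("count_chars", count_chars, [upper], count, log)
    return log


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "raw.txt").write_text("acgt")
        print(run_workflow(workdir))  # first run: both steps execute
        print(run_workflow(workdir))  # second run: both steps are skipped
```

Even for two steps, the dependency wiring and staleness checks have to be written and maintained by hand, and nothing here handles failures, parallel execution or cluster submission. That is the kind of boilerplate a workflow manager takes over.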
For which cases are scripts a good way of running a workflow and for which might they be problematic?
Nextflow (Magda Dabrowska)
Nextflow is a relatively new workflow manager for writing scalable computational pipelines. It is based on Java and written in ‘Groovy’ (a programming language which compiles to Java bytecode). However, it provides a multitude of tools and template scripts that give quick and easy access to typical workflows, without the need to learn its programming language. Major advantages of Nextflow are:
- Support for execution on multiple platforms without the need to tailor your script.
- High portability, with support for the executors used on the UoS HPC systems: SGE and SLURM.
- Easy project reproducibility, with support for containers such as Docker and Singularity, and for Conda environments.
- Effortless parallelism implicitly defined by workflow inputs and outputs.
- Ability to resume the execution from the last successful checkpoint.
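The idea of parallelism being implied by inputs and outputs can be illustrated with a small sketch. This is not Nextflow code (Nextflow pipelines are written in its own Groovy-based language); it is a toy scheduler built on the Python standard library, and the task names and the `run_pipeline` helper are hypothetical. Each task only declares which upstream results it consumes, and the scheduler works out from that which tasks can run at the same time.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_pipeline(tasks, max_workers=4):
    """Run tasks in dependency order, in parallel where the graph allows.

    `tasks` maps a task name to (function, list of upstream task names).
    Each function receives the results of its upstream tasks, in order.
    """
    sorter = TopologicalSorter({name: deps for name, (_, deps) in tasks.items()})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    results = {}
    running = {}  # maps each submitted future to its task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose inputs are all available.
            for name in sorter.get_ready():
                func, deps = tasks[name]
                args = [results[dep] for dep in deps]
                running[pool.submit(func, *args)] = name
            # Wait for at least one task to finish, then unlock its dependants.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)
    return results


def count_gc(seq):
    """Count G and C bases in a sequence string."""
    return sum(base in "GC" for base in seq)


# Hypothetical pipeline: two independent samples, then a combining step.
TASKS = {
    "sample_a": (lambda: "ACGT", []),
    "sample_b": (lambda: "GGCC", []),
    "gc_a": (count_gc, ["sample_a"]),
    "gc_b": (count_gc, ["sample_b"]),
    "total": (lambda a, b: a + b, ["gc_a", "gc_b"]),
}

if __name__ == "__main__":
    print(run_pipeline(TASKS)["total"])  # prints 6
```

Nothing in `TASKS` says "run these two in parallel"; the two samples are processed concurrently simply because neither depends on the other.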
Ruffus
Ruffus is one of the oldest of the modern-style workflow management systems. Ruffus pipelines consist of a series of Python functions that are linked together using a Python feature known as decorators. Ruffus is a lighter-weight alternative to systems such as Nextflow, while still offering the ability to create rich dependency graphs of tasks and orchestrate their submission to an HPC system. It embodies the philosophy that, for users to choose a reproducible pipeline in the real world when under pressure, using the pipeline must be as easy as, or easier than, the manual alternative. In my talk I will demonstrate the creation of a pipeline for a bioinformatics task consisting of multiple non-trivial steps within 15 minutes.
Common Workflow Language (Joe Heffer)
Common Workflow Language (CWL) is an open standard for describing how to run command line tools and connect them to create workflows.
The CWL standard is not specific to any language or technology, so it enables interoperability between the many different workflow languages and lets researchers share workflows. It is more abstract than any particular language implementation, aiming to support a superset of the features of other languages. CWL allows portability across these systems because the logical/computational parts of a workflow are decoupled from the code that runs them.
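The decoupling described above can be sketched in a few lines. Real CWL descriptions are written as YAML or JSON documents and executed by a CWL runner such as cwltool. The sketch below is a toy, not a CWL implementation: the field names in `ECHO_TOOL` mirror those of a CWL `CommandLineTool` description, but `build_command`, `dry_run` and `run_locally` are hypothetical helpers that handle only positional inputs with an optional prefix.

```python
import shlex
import subprocess

# A tool description held as plain data. The field names follow the CWL
# CommandLineTool layout; the interpreter below is a made-up, minimal one.
ECHO_TOOL = {
    "cwlVersion": "v1.2",
    "class": "CommandLineTool",
    "baseCommand": "echo",
    "inputs": {
        "message": {"type": "string", "inputBinding": {"position": 1}},
    },
    "outputs": [],
}


def build_command(tool, job):
    """Turn a tool description plus input values into an argument list."""
    base = tool["baseCommand"]
    argv = [base] if isinstance(base, str) else list(base)
    # Collect the inputs that appear on the command line, ordered by position.
    bound = [
        (spec["inputBinding"].get("position", 0), name)
        for name, spec in tool["inputs"].items()
        if "inputBinding" in spec
    ]
    for _, name in sorted(bound):
        prefix = tool["inputs"][name]["inputBinding"].get("prefix")
        if prefix:
            argv.append(prefix)
        argv.append(str(job[name]))
    return argv


def dry_run(tool, job):
    """One 'runner': show the shell command without executing it."""
    return shlex.join(build_command(tool, job))


def run_locally(tool, job):
    """Another 'runner': execute the command on the local machine."""
    completed = subprocess.run(
        build_command(tool, job), capture_output=True, text=True, check=True
    )
    return completed.stdout


if __name__ == "__main__":
    print(dry_run(ECHO_TOOL, {"message": "Hello world"}))
```

The description says what to run and how the inputs map onto the command line; each runner decides where and how to execute it. The same description could be handed to a different runner, for example one that submits to a cluster, without being changed.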