This lesson is still being designed and assembled (Pre-Alpha version)

Intermediate Workflows with the Common Workflow Language

Introduction

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • First learning objective. (FIXME)

FIXME

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Citing the tools in your workflow

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • give credit for all the tools used in their workflow(s)

By the end of this episode, learners should be able to explain the importance of correctly citing research software.

See this page.

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Turning a shell script into a workflow

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • identify tasks and data links in a script

  • recognize loops that can be converted into scatters

  • find and reuse existing CWL command line tool descriptions

By the end of this episode, learners should be able to convert a shell script into a CWL workflow.

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Customizing workflows

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • customize a workflow at any of the many levels

By the end of this episode, learners should be able to customize a workflow at any of the many levels:

  1. Change the input object
  2. Change the default values at the workflow level
  3. Add default values to existing inputs at the workflow level
  4. Change default values at the workflow step level
  5. Add hard-coded values (via default or valueFrom) at the workflow step level
  6. Change hard-coded values at the workflow step level
  7. Change default values in the CommandLineTool (CLT) description
  8. Change hard-coded values in the CLT description
  9. Change the container (add helper script)
  10. Change the tool source itself

Your colleague has given you a workflow that runs GATK HaplotypeCaller, and you must change it at various points to fit your needs.

Exercise 3:

In this workflow, add a default value for the reference in the inputs section.

cwlVersion: v1.0
class: Workflow
inputs:
  bam: File
  chromosome: string
  reference: File

Solution:

cwlVersion: v1.0
class: Workflow
inputs:
  bam: File
  chromosome: string
  reference:
    type: File
    default:
      class: File
      location: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

Exercise 5:

Default values can be set both on workflow inputs and on individual step inputs. Add default values for the reference and chromosome inputs in the steps section of the workflow. A step-level default needs no extra requirement; StepInputExpressionRequirement must be declared in the requirements section only when a step input uses valueFrom, so declaring it, as the solution below does, is harmless.

cwlVersion: v1.0
class: Workflow
inputs:
  bam: File
  chromosome: string
  reference: File
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
      chromosome: chromosome
      reference: reference
    out: [vcf]

Solution:

cwlVersion: v1.0
class: Workflow
requirements:
  StepInputExpressionRequirement: {}
inputs:
  bam: File
  chromosome: string
  reference: File
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
      chromosome:
        default: chr1
      reference:
        default:
          class: File
          location: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
    out: [vcf]
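
For comparison, item 5 in the list above also mentions valueFrom. The following is a minimal sketch of that variant, reusing the step and input names from this exercise and assuming GATK_HaplotypeCaller.cwl accepts them. A default is used only when the step input receives no value, whereas valueFrom always sets the value, and it is valueFrom that needs StepInputExpressionRequirement.

```yaml
cwlVersion: v1.0
class: Workflow
requirements:
  # Needed because a step input below uses valueFrom
  StepInputExpressionRequirement: {}
inputs:
  bam: File
  reference: File
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
      chromosome:
        # A constant string: the step always receives chr1
        valueFrom: chr1
      reference: reference
    out: [vcf]
```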

Exercise 9:

Using Docker images is a good way of creating reproducible workflows. When specifying DockerRequirement in the hints section of a workflow, you can name an image to pull or point to an image archive at a URL. dockerPull retrieves the named image with docker pull, typically from a registry such as Docker Hub. dockerLoad downloads an image archive from an HTTP URL and loads it with docker load.
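
As a rough sketch of the dockerLoad form, the fragment below could sit at the top level of a workflow. The URL is a hypothetical placeholder, not a real download location, and the sketch assumes the archive was created with docker save. Because there is no dockerPull to supply an image name, dockerImageId names the image to run once it has been loaded.

```yaml
hints:
  DockerRequirement:
    # Hypothetical URL of an image archive created with `docker save`
    dockerLoad: https://example.org/images/gatk4.tar
    # Name (or hash) of the image inside the archive, used for `docker run`
    dockerImageId: broadinstitute/gatk4
```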

In this workflow, use dockerPull to add a Docker image called “broadinstitute/gatk4”.

cwlVersion: v1.0
class: Workflow
inputs:
  bam: File
  chromosome: string
  sample: string
  reference: File
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
      intervals: chromosome
      reference_fasta: reference
    out: [vcf]

Solution:

cwlVersion: v1.0
class: Workflow
hints:
  DockerRequirement:
    dockerPull: broadinstitute/gatk4
inputs:
  bam: File
  chromosome: string
  sample: string
  reference: File
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
      intervals: chromosome
      reference_fasta: reference
    out: [vcf]

Now every step of the workflow that does not specify its own DockerRequirement will use the same Docker image!

Key Points

  • First key point. Brief Answer to questions. (FIXME)


CWL workflow descriptions

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • explain the difference between a CWL tool description and a CWL workflow

  • describe the relationship between a tool and its corresponding CWL document

  • exercise good practices when naming inputs and outputs

  • Be able to make understandable and valid names for inputs and outputs (not ‘input3’)

By the end of this episode, learners should be able to explain how a workflow document describes the input and output of a workflow and the flow of data between tools.

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Debugging workflows

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • interpret commonly encountered error messages

  • solve these common issues

By the end of this episode, learners should be able to recognize and fix simple bugs in their workflow code.

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Documenting your workflow

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • explain the importance of documenting a workflow

  • use description fields to document purpose, intent, and other factors at multiple levels within their workflow

  • recognise when it is appropriate to include this documentation

By the end of this episode, learners should be able to document their workflows to increase reusability.
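
A minimal sketch of documentation at several levels, using the label and doc fields on the workflow, its inputs, its outputs, and a step. The workflow reuses the GATK HaplotypeCaller example from earlier in this lesson; the descriptive texts are illustrative assumptions, not descriptions supplied by the tool's authors.

```yaml
cwlVersion: v1.0
class: Workflow
label: Call variants on one BAM file
doc: |
  Runs GATK HaplotypeCaller on a single BAM file, restricted to one
  chromosome. Illustrative text; adapt it to your own workflow.
inputs:
  bam:
    type: File
    label: Aligned reads
    doc: BAM file of reads aligned to the reference genome.
  chromosome:
    type: string
    doc: Name of the chromosome to call variants on, for example chr1.
  reference:
    type: File
    doc: Reference genome in FASTA format.
outputs:
  HaplotypeCaller_VCF:
    type: File
    doc: Variant calls produced by the GATK_HaplotypeCaller step.
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    doc: Calls variants with GATK HaplotypeCaller.
    in:
      input_bam: bam
      intervals: chromosome
      reference_fasta: reference
    out: [vcf]
```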

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Workflows as dependency graphs

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • explain that a workflow is a dependency graph

By the end of this episode, learners should be able to explain that a workflow is a dependency graph.
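
One way to picture the graph is the two-step sketch below. Each step is a node, and each reference of the form step/output is an edge. The file align.cwl and its input and output names are hypothetical, and the GATK step names follow the examples used elsewhere in this lesson.

```yaml
cwlVersion: v1.0
class: Workflow
inputs:
  reads: File
  reference: File
outputs:
  variants:
    type: File
    outputSource: call_variants/vcf
steps:
  align:
    run: align.cwl              # hypothetical tool description
    in:
      reads: reads
      reference: reference
    out: [bam]
  call_variants:
    run: GATK_HaplotypeCaller.cwl
    in:
      # This link makes call_variants depend on align
      input_bam: align/bam
      reference_fasta: reference
    out: [vcf]
```

The order in which the steps are written does not matter; a runner works out the execution order from the links, so call_variants can only start once align has produced its bam output.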

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Iterative workflow development

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • recognise that workflow development can be iterative, i.e. that it doesn’t have to happen all at once

By the end of this episode, learners should be able to recognise that workflow development can be iterative, i.e. that it doesn’t have to happen all at once.

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Capturing output

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • explain that only files explicitly mentioned in a description will be included in the output of a step/workflow

  • implement bulk capturing of all files produced by a step/workflow for debugging purposes

  • use STDIN and STDOUT as input and output

  • capture output written to a specific directory, the working directory, or the same directory where input is located

By the end of this episode, learners should be able to define the files that will be included as output of a workflow.
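
The objectives above can be sketched in one small tool description. It assumes the standard sort utility is available, feeds the input file to it on STDIN, captures STDOUT into a named file, and adds a catch-all output for debugging. The input and output names are illustrative.

```yaml
cwlVersion: v1.0
class: CommandLineTool
baseCommand: sort
# Feed the input file to the tool on standard input
stdin: $(inputs.unsorted.path)
# Redirect standard output into this file in the working directory
stdout: sorted.txt
inputs:
  unsorted: File
outputs:
  sorted:
    # Shortcut for "the file that stdout was redirected to"
    type: stdout
  everything:
    # Debugging aid: collect every file left in the working directory
    type: File[]
    outputBinding:
      glob: "*"
```

Only sorted and everything are returned; any file that no output matches is discarded when the step finishes. The catch-all assumes the working directory contains only files, since a subdirectory would not match the File[] type.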

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Describing requirements

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • identify all the requirements of a tool and define them in the tool description

  • use runtime parameters to access information about the runtime environment

  • define environment variables necessary for execution

  • use secondaryFiles or InitialWorkDirRequirement to access files in the same directory as another referenced file

  • use $(runtime.cores) to define the number of cores to be used

  • use type: File, instead of a string, to reference a filepath

By the end of this episode, learners should be able to describe all the requirements for running a tool.
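
A hedged sketch of several of these objectives together, assuming samtools is installed in the execution environment. It requests cores with ResourceRequirement, passes the number actually allocated to the tool through $(runtime.cores), sets an environment variable, and declares the input as a File rather than a string. The input and output names and the choice of LC_ALL are illustrative.

```yaml
cwlVersion: v1.0
class: CommandLineTool
requirements:
  ResourceRequirement:
    # Ask the runner for at least two cores
    coresMin: 2
  EnvVarRequirement:
    envDef:
      # Example environment variable; choose what your tool needs
      LC_ALL: C
baseCommand: [samtools, sort]
arguments:
  # Pass the allocated core count as the samtools thread option
  - prefix: "-@"
    valueFrom: $(runtime.cores)
inputs:
  alignments:
    # A File, not a string, so the runner can stage the file
    type: File
    inputBinding:
      position: 1
stdout: sorted.bam
outputs:
  sorted_bam:
    type: stdout
```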

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Scattering

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • explain what is meant by the scatter pattern in workflow design, and how it differs from the similar concept of parallel execution

  • identify when the scatter pattern appears in a workflow description

  • (FIXME: does the above cover the intended meaning of the following two points from the lesson development sprint?); running the same program on each file; running the same program the same way except for one parameter

By the end of this episode, learners should be able to implement scattering of steps in a workflow.

TODO Add pictures of cross product / matrix manipulation

Exercise 1

What if you had two arrays: a File array of BAMs and an array of chromosomes? How would you run every chromosome on each BAM?

Solution:

steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    scatter: [intervals, input_bam]
    scatterMethod: flat_crossproduct


Exercise 2

How does this change the inputs and outputs for the workflow?

Solution:

cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  # Both scattered inputs are now arrays
  bams: File[]
  chromosomes: string[]
outputs:
  HaplotypeCaller_VCFs:
    # flat_crossproduct yields a single flat array of results
    type:
      type: array
      items: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    scatter: [intervals, input_bam]
    scatterMethod: flat_crossproduct
    in:
      input_bam: bams
      intervals: chromosomes
    out: [vcf]


When scattering over multiple inputs, you need to say explicitly how the scatter should occur by setting scatterMethod. There are three scatter methods in CWL: dotproduct, flat_crossproduct and nested_crossproduct. dotproduct pairs the nth element of each array, so all scattered arrays must have the same length. flat_crossproduct and nested_crossproduct run the step on every combination of elements from the arrays. The difference between flat and nested is in the output type: flat creates a single array, whereas nested creates an array of arrays.
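
As a rough sketch of the nested variant, the workflow below scatters over an array of BAM files and an array of chromosomes. The input names bams and chromosomes are illustrative, and the sketch assumes GATK_HaplotypeCaller.cwl has inputs input_bam and intervals and a File output vcf, as in the exercises above.

```yaml
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  bams: File[]
  chromosomes: string[]
outputs:
  HaplotypeCaller_VCFs:
    # nested_crossproduct: an array of arrays of files
    type:
      type: array
      items:
        type: array
        items: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    # Nesting follows the order listed here: BAMs outside, chromosomes inside
    scatter: [input_bam, intervals]
    scatterMethod: nested_crossproduct
    in:
      input_bam: bams
      intervals: chromosomes
    out: [vcf]
```

With two BAM files and three chromosomes, this produces two inner arrays of three VCF files each. Switching to flat_crossproduct would give one array of six files and an output type of File[], while dotproduct would be rejected because the two arrays differ in length.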

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Adding your own script to a step

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • How to include and run a script in a step at runtime?

  • Which requirements need to be specified?

  • How to capture output of a script?

Objectives
  • Include and run a script in a step at runtime

  • Capture output of a script

By the end of this episode, learners should be able to include and run their own script in a step at runtime.

Exercise 1:

Which requirement from the following options is used to create a script at runtime?

A. InlineJavascriptRequirement
B. InitialWorkDirRequirement
C. ResourceRequirement
D. DockerRequirement

Solution

B. InitialWorkDirRequirement

Exercise 2:

Using the template below, add the missing instructions so that a script named script.sh with the specified contents is created at runtime.

InitialWorkDirRequirement:
  listing:
    - ------: script.sh
      ------: |
        echo "*Documenting input*" && \
        echo "Input received: $(inputs.message)" && \
        echo "Exit"

inputs:
  message:
    type: string

Solution:

InitialWorkDirRequirement:
  listing:
    - entryname: script.sh
      entry: |
        echo "*Documenting input*" && \
        echo "Input received: $(inputs.message)" && \
        echo "Exit"

inputs:
  message:
    type: string

Exercise 3:

The script uses echo, so its messages go to standard output. In the code block below, which type in the outputs section captures standard output?

class: CommandLineTool
cwlVersion: v1.1
requirements:
  DockerRequirement:
    dockerPull: 'debian:stable'
  InitialWorkDirRequirement:
    listing:
      - entryname: script.sh
        entry: |
          echo "*Documenting input*" && \
          echo "Input received: $(inputs.message)" && \
          echo "Exit"

inputs:
  message:
    type: string

stdout: "message.txt"

outputs:
  message:
    type: ----

Your options are:

A. File
B. Directory
C. stdout
D. string

Solution:

C. stdout

Exercise 4:

Fix the baseCommand in the following code snippet so that it executes the script we created in the previous exercises.

baseCommand: []

Solution:

baseCommand: [ sh, script.sh ]

Exercise 5:

CHALLENGE question. Extend the outputs section of the following CommandLineTool description to capture the script we have created along with the tool’s standard output.

This will help you inspect the generated script and is useful in more complex situations to troubleshoot related issues.

class: CommandLineTool
cwlVersion: v1.1
requirements:
  DockerRequirement:
    dockerPull: 'debian:stable'
  InitialWorkDirRequirement:
    listing:
      - entryname: script.sh
        entry: |
          echo "*Documenting input*" && \
          echo "Input received: $(inputs.message)" && \
          echo "Exit"

inputs:
  message:
    type: string

stdout: "message.txt"
baseCommand: ["sh", "script.sh"]

outputs:
  message:
    type: stdout

Solution:

class: CommandLineTool
cwlVersion: v1.1
requirements:
  DockerRequirement:
    dockerPull: 'debian:stable'
  InitialWorkDirRequirement:
    listing:
      - entryname: script.sh
        entry: |
          echo "*Documenting input*" && \
          echo "Input received: $(inputs.message)" && \
          echo "Exit"

inputs:
  message:
    type: string

stdout: "message.txt"
baseCommand: ["sh", "script.sh"]

outputs:
  message:
    type: stdout
  script:
    type: File
    outputBinding:
      glob: "script.sh"

Key Points

  • First key point. Brief Answer to questions. (FIXME)


Sketches as a design tool

Overview

Teaching: 0 min
Exercises: 0 min
Questions
  • Key question (FIXME)

Objectives
  • use cwlviewer online

  • generate Graphviz diagram using cwltool

  • exercise with the printout of a simple workflow; draw arrows on code; hand draw a graph on another sheet of paper

By the end of this episode, learners should be able to sketch their workflow, both by hand, and with an automated visualizer.

Key Points

  • First key point. Brief Answer to questions. (FIXME)