Introduction
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
First learning objective. (FIXME)
FIXME
Key Points
First key point. Brief Answer to questions. (FIXME)
Citing the tools in your workflow
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
give credit for all the tools used in their workflow(s)
By the end of this episode, learners should be able to explain the importance of correctly citing research software.
See this page.
Key Points
First key point. Brief Answer to questions. (FIXME)
Turning a shell script into a workflow
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
identify tasks and data links in a script
recognize loops that can be converted into scatters
find and reuse existing CWL command line tool descriptions
By the end of this episode, learners should be able to convert a shell script into a CWL workflow
Key Points
First key point. Brief Answer to questions. (FIXME)
Customizing workflows
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
customize a workflow at any of the many levels
By the end of this episode, learners should be able to customize a workflow at any of the many levels:
- Change the input object
- Change the default values at the workflow level
- Add default values to existing inputs at the workflow level
- Change default value at the Workflow step level
- Add hard coded values (via default or valueFrom) at the Workflow step level
- Change hard coded values at the Workflow step level
- Change default values in the CLT description
- Change hard coded values in the CLT description
- Change the container (add helper script)
- Change the tool source itself
You’ve been given a workflow by your colleague that runs GATK HaplotypeCaller, and you must change it at various points to fit your needs.
- Change the input object
Exercise 1:
In the input YAML file, your colleague has added a BAM input. You decide that you want to add a reference file input and a chromosome string input. Add those to this YAML file.
bam:
  class: File
  location: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00133/alignment/HG00133.unmapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
Solution:
bam:
  class: File
  location: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00133/alignment/HG00133.unmapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
reference:
  class: File
  location: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
chromosome: chr1
- Change the default values at the workflow level
Exercise 2:
Default values in a workflow can be used at both the input object level and the step level. Add the new reference and chromosome inputs to the workflow
cwlVersion: v1.0
class: Workflow
inputs:
  bam: File
Solution:
cwlVersion: v1.0
class: Workflow
inputs:
  bam: File
  chromosome: string
  reference: File
Exercise 3:
In this workflow, add a default value for the reference in the inputs section.
cwlVersion: v1.0
class: Workflow
inputs:
  bam: File
  chromosome: string
  reference: File
Solution:
cwlVersion: v1.0
class: Workflow
inputs:
  bam: File
  chromosome: string
  reference:
    type: File
    default:
      class: File
      location: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
- Change hard coded values at the workflow level
Exercise 4:
In this workflow, add the new inputs to the GATK_HaplotypeCaller step. Assume that the inputs to GATK_HaplotypeCaller.cwl are the same variables as what you stated before.
cwlVersion: v1.0
class: Workflow
inputs:
  bam: File
  chromosome: string
  reference: File
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
    out: [vcf]
Solution:
cwlVersion: v1.0
class: Workflow
inputs:
  bam: File
  chromosome: string
  reference: File
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
      chromosome: chromosome
      reference: reference
    out: [vcf]
- Change default value at the Workflow step level
Exercise 5:
Default values in a workflow can be used at both the input object level and the step level. Add default reference and chromosome values to the steps portion of the workflow. Note that StepInputExpressionRequirement must be declared in the requirements section only when a step input uses valueFrom; a plain default at the step level works without it.
cwlVersion: v1.0
class: Workflow
inputs:
  bam: File
  chromosome: string
  reference: File
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
      chromosome: chromosome
      reference: reference
    out: [vcf]
Solution:
cwlVersion: v1.0
class: Workflow
requirements:
  StepInputExpressionRequirement: {}
inputs:
  bam: File
  chromosome: string
  reference: File
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
      chromosome:
        default: chr1
      reference:
        default:
          class: File
          location: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
    out: [vcf]
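The way defaults interact across the levels used in Exercises 3 and 5 can be sketched as a small Python model. This is only an illustration of the fallback rules, not how any CWL runner is implemented; the function names and the shortened file names are hypothetical.

```python
def resolve_workflow_input(name, input_object, workflow_defaults):
    """A value in the input object wins; otherwise use the workflow-level default."""
    value = input_object.get(name)
    if value is None:
        value = workflow_defaults.get(name)
    return value


def resolve_step_input(source_value, step_default=None):
    """A step-level default applies only when there is no source or the source is null."""
    return step_default if source_value is None else source_value


# Hypothetical, shortened values standing in for the exercise inputs.
input_object = {"bam": "HG00133.bam", "chromosome": "chr2"}
workflow_defaults = {"chromosome": "chr1", "reference": "hg38.fa.gz"}

chromosome = resolve_workflow_input("chromosome", input_object, workflow_defaults)
reference = resolve_workflow_input("reference", input_object, workflow_defaults)

print(chromosome, reference)                                 # input object beats the default
print(resolve_step_input(None, step_default="chr1"))         # no source: step default is used
print(resolve_step_input(chromosome, step_default="chr1"))   # a source value beats the default
```

In this model a default never overrides a value that was actually supplied, which is why changing a default is the least intrusive way to customize a workflow.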
- Change hard coded values at the Workflow step level
- Change default values in the CLT description
- Change hard coded values in the CLT description
- Change the container
Exercise 6:
Using Docker images is a good way of creating reproducible workflows. When specifying DockerRequirement in the hints section of a workflow, you can pull an image from a registry or load one from a URL. Using dockerPull will retrieve the named image with docker pull, from Docker Hub unless another registry is given. Using dockerLoad will download a Docker image from an HTTP URL and import it with docker load.
In this workflow, use dockerPull to add a docker image called “broadinstitute/gatk4”
cwlVersion: v1.0
class: Workflow
inputs:
  bam: File
  chromosome: string
  sample: string
  reference: File
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
      intervals: chromosome
      reference_fasta: reference
    out: [vcf]
Solution:
cwlVersion: v1.0
class: Workflow
hints:
  DockerRequirement:
    dockerPull: broadinstitute/gatk4
inputs:
  bam: File
  chromosome: string
  sample: string
  reference: File
outputs:
  HaplotypeCaller_VCF:
    type: File
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    in:
      input_bam: bam
      intervals: chromosome
      reference_fasta: reference
    out: [vcf]
Now, every step of the workflow will use the same Docker image!
- Change the tool source itself
Key Points
First key point. Brief Answer to questions. (FIXME)
CWL workflow descriptions
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
explain the difference between a CWL tool description and a CWL workflow
describe the relationship between a tool and its corresponding CWL document
exercise good practices when naming inputs and outputs
Be able to make understandable and valid names for inputs and outputs (not ‘input3’)
By the end of this episode, learners should be able to explain how a workflow document describes the input and output of a workflow and the flow of data between tools
Key Points
First key point. Brief Answer to questions. (FIXME)
Debugging workflows
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
interpret commonly encountered error messages
solve these common issues
By the end of this episode, learners should be able to recognize and fix simple bugs in their workflow code.
Key Points
First key point. Brief Answer to questions. (FIXME)
Documenting your workflow
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
explain the importance of documenting a workflow
use description fields to document purpose, intent, and other factors at multiple levels within their workflow
recognise when it is appropriate to include this documentation
By the end of this episode, learners should be able to document their workflows to increase reusability.
Key Points
First key point. Brief Answer to questions. (FIXME)
Workflows as dependency graphs
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
explain that a workflow is a dependency graph
By the end of this episode, learners should be able to explain that a workflow is a dependency graph.
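As a rough illustration of the dependency-graph idea, the sketch below models a workflow as a mapping from each step to the steps whose outputs it consumes, then derives an execution order with Python's standard graphlib module. The step names are hypothetical, and this is not how a CWL runner schedules work; it only shows that the order of execution follows from the data links rather than from the order in which steps are written.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
dependencies = {
    "align": set(),
    "sort": {"align"},
    "call_variants": {"sort"},
    "qc_report": {"align"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Group steps into "waves": every step in a wave has all of its inputs
# available, so the steps within one wave could run at the same time.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

print(waves)
```

Here sort and qc_report land in the same wave because both depend only on align, even though nothing in the mapping says they should run together.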
Key Points
First key point. Brief Answer to questions. (FIXME)
Iterative workflow development
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
recognise that workflow development can be iterative i.e. that it doesn’t have to happen all at once
By the end of this episode, learners should be able to recognise that workflow development can be iterative i.e. that it doesn’t have to happen all at once.
Key Points
First key point. Brief Answer to questions. (FIXME)
Capturing output
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
explain that only files explicitly mentioned in a description will be included in the output of a step/workflow
implement bulk capturing of all files produced by a step/workflow for debugging purposes
use STDIN and STDOUT as input and output
capture output written to a specific directory, the working directory, or the same directory where input is located
By the end of this episode, learners should be able to define the files that will be included as output of a workflow.
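The first two objectives can be pictured with a small Python model: files produced in a working directory are kept only if they match a declared pattern, and a catch-all pattern captures everything for debugging. This is a sketch under stated assumptions, not the behaviour of any particular CWL runner; the file names and the collect_outputs helper are hypothetical.

```python
import tempfile
from pathlib import Path


def collect_outputs(workdir, patterns):
    """Return the names of files in workdir that match any declared glob pattern."""
    matched = set()
    for pattern in patterns:
        matched.update(p.name for p in Path(workdir).glob(pattern) if p.is_file())
    return sorted(matched)


with tempfile.TemporaryDirectory() as workdir:
    # Pretend a tool wrote these files while it ran.
    for name in ["calls.vcf", "calls.vcf.idx", "debug.log", "tmp.txt"]:
        Path(workdir, name).write_text("placeholder")

    declared = collect_outputs(workdir, ["*.vcf"])   # only what was explicitly described
    everything = collect_outputs(workdir, ["*"])     # bulk capture for debugging

print(declared)
print(everything)
```

Anything not matched by a declared pattern is simply left behind, which is why a missing output is often a pattern problem rather than a tool failure.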
Key Points
First key point. Brief Answer to questions. (FIXME)
Describing requirements
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
identify all the requirements of a tool and define them in the tool description
use runtime parameters to access information about the runtime environment
define environment variables necessary for execution
use secondaryFiles or InitialWorkDirRequirement to access files in the same directory as another referenced file
use $(runtime.cores) to define the number of cores to be used
use type: File, instead of a string, to reference a filepath
By the end of this episode, learners should be able to describe all the requirements for running a tool
Key Points
First key point. Brief Answer to questions. (FIXME)
Scattering
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
explain what is meant by the scatter pattern in workflow design, and how it differs from the similar concept of parallel execution
identify when the scatter pattern appears in a workflow description
(FIXME: does the above cover the intended meaning of the following two points from the lesson development sprint?); running the same program on each file; running the same program the same way except for one parameter
By the end of this episode, learners should be able to implement scattering of steps in a workflow.
TODO Add pictures of cross product / matrix manipulation
Exercise 1
What if you had two arrays, one an array of BAM files and the other an array of chromosomes? How would you run all chromosomes on each BAM?
Solution:
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    scatter: [intervals, input_bam]
    scatterMethod: flat_crossproduct
Exercise 2
How does this change the inputs and outputs for the workflow?
Solution:
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  bams: File[]
  chromosomes: string[]
outputs:
  HaplotypeCaller_VCFs:
    type: File[]
    outputSource: GATK_HaplotypeCaller/vcf
steps:
  GATK_HaplotypeCaller:
    run: GATK_HaplotypeCaller.cwl
    scatter: [intervals, input_bam]
    scatterMethod: flat_crossproduct
    in:
      input_bam: bams
      intervals: chromosomes
    out: [vcf]
When scattering on multiple inputs,
you need to explicitly say how the scatter should occur.
There are 3 scatter methods in CWL:
dotproduct, flat_crossproduct and nested_crossproduct.
dotproduct pairs the inputs element by element,
running the step on the nth item of each array, so all arrays must have the same length.
flat_crossproduct and nested_crossproduct will take both inputs and
run on every combination of both arrays.
The difference between flat and nested is in the output type.
Flat will create a single array output whereas
nested will create a nested array output.
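The three methods can be modelled in a few lines of Python. This is an illustrative sketch of how the jobs are paired up and how the results are shaped, not code from a CWL runner; the Python function names and the file names are hypothetical.

```python
from itertools import product


def dotproduct(*arrays):
    """Pair the nth element of every array; all arrays must have equal length."""
    if len({len(a) for a in arrays}) > 1:
        raise ValueError("dotproduct requires arrays of equal length")
    return list(zip(*arrays))


def flat_crossproduct(*arrays):
    """Every combination of elements, returned as one flat list of jobs."""
    return list(product(*arrays))


def nested_crossproduct(first, second):
    """Every combination, with one inner list per element of the first array."""
    return [[(a, b) for b in second] for a in first]


# Hypothetical inputs standing in for the BAM and chromosome arrays.
bams = ["a.bam", "b.bam"]
chromosomes = ["chr1", "chr2"]

print(dotproduct(bams, chromosomes))           # 2 jobs, paired by position
print(flat_crossproduct(bams, chromosomes))    # 4 jobs in a single list
print(nested_crossproduct(bams, chromosomes))  # 4 jobs, grouped per BAM
```

The two cross products run the same four jobs; only the nesting of the result differs, which is what determines whether the workflow output is an array or an array of arrays.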
Key Points
First key point. Brief Answer to questions. (FIXME)
Adding your own script to a step
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How to include and run a script in a step at runtime?
Which requirements need to be specified?
How to capture output of a script?
Objectives
Include and run a script in a step at runtime
Capture output of a script
By the end of this episode, learners should be able to include and run their own script in a step at runtime
Exercise 1:
Which Requirement from the following options is used to create a script at runtime?
A. InlineJavascriptRequirement B. InitialWorkDirRequirement C. ResourceRequirement D. DockerRequirement
Solution
B. InitialWorkDirRequirement
Exercise 2:
Using the template below, add the missing instructions so that a script named script.sh with the specified contents is created at runtime.
InitialWorkDirRequirement:
  listing:
    - ------: script.sh
      ------: |
        echo "*Documenting input*" && \
        echo "Input received: $(inputs.message)" && \
        echo "Exit"
inputs:
  message:
    type: string
Solution:
InitialWorkDirRequirement:
  listing:
    - entryname: script.sh
      entry: |
        echo "*Documenting input*" && \
        echo "Input received: $(inputs.message)" && \
        echo "Exit"
inputs:
  message:
    type: string
Exercise 3:
Since we are using echo in the script (as shown below), what is the appropriate type in the outputs section of the following code block to capture standard output?
class: CommandLineTool
cwlVersion: v1.1
requirements:
  DockerRequirement:
    dockerPull: 'debian:stable'
  InitialWorkDirRequirement:
    listing:
      - entryname: script.sh
        entry: |
          echo "*Documenting input*" && \
          echo "Input received: $(inputs.message)" && \
          echo "Exit"
inputs:
  message:
    type: string
stdout: "message.txt"
outputs:
  message:
    type: ----
Your options are: A. File B. Directory C. stdout D. string
Solution:
C. stdout
Exercise 4:
Fix the baseCommand in the following code snippet to execute the script we have created in previous exercises.
baseCommand: []
Solution:
baseCommand: [ sh, script.sh ]
Exercise 5:
CHALLENGE question. Extend the outputs section of the following CWL tool definition to capture the script we have created along with the tool’s standard output. This will help you inspect the generated script and is useful in more complex situations to troubleshoot related issues.
class: CommandLineTool
cwlVersion: v1.1
requirements:
  DockerRequirement:
    dockerPull: 'debian:stable'
  InitialWorkDirRequirement:
    listing:
      - entryname: script.sh
        entry: |
          echo "*Documenting input*" && \
          echo "Input received: $(inputs.message)" && \
          echo "Exit"
inputs:
  message:
    type: string
stdout: "message.txt"
baseCommand: ["sh", "script.sh"]
outputs:
  message:
    type: stdout
Solution:
class: CommandLineTool
cwlVersion: v1.1
requirements:
  DockerRequirement:
    dockerPull: 'debian:stable'
  InitialWorkDirRequirement:
    listing:
      - entryname: script.sh
        entry: |
          echo "*Documenting input*" && \
          echo "Input received: $(inputs.message)" && \
          echo "Exit"
inputs:
  message:
    type: string
stdout: "message.txt"
baseCommand: ["sh", "script.sh"]
outputs:
  message:
    type: stdout
  script:
    type: File
    outputBinding:
      glob: "script.sh"
Key Points
First key point. Brief Answer to questions. (FIXME)
Sketches as a design tool
Overview
Teaching: 0 min
Exercises: 0 min
Questions
Key question (FIXME)
Objectives
use cwlviewer online
generate Graphviz diagram using cwltool
exercise with the printout of a simple workflow; draw arrows on code; hand draw a graph on another sheet of paper
By the end of this episode, learners should be able to sketch their workflow, both by hand, and with an automated visualizer.
Key Points
First key point. Brief Answer to questions. (FIXME)