
pipelines-api-examples's Introduction

pipelines-api-examples

This repository contains examples for the [Google Genomics Pipelines API](https://cloud.google.com/genomics/reference/rest/v1alpha2/pipelines).

Alpha
This is an Alpha release of the Google Genomics API. This feature might be changed in backward-incompatible ways and is not recommended for production use. It is not subject to any SLA or deprecation policy.

The API provides an easy way to create, run, and monitor command-line tools, packaged in Docker containers, on Google Compute Engine virtual machines. You can use it much as you would a job scheduler.

The most common use case is to run an off-the-shelf tool or custom script that reads and writes files. Often those files live in Google Cloud Storage, and the tool needs to run independently over hundreds or thousands of them.

The typical flow for a pipeline is:

  1. Create a Compute Engine virtual machine
  2. Copy one or more files from Cloud Storage to a disk
  3. Run the tool on the file(s)
  4. Copy the output to Cloud Storage
  5. Destroy the Compute Engine virtual machine

You can submit batch operations from your laptop and have them run in the cloud. You can package the tool in Docker yourself or use existing Docker images.
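For example, a pipeline defined in a YAML file can be submitted with gcloud and then monitored with the poll.sh helper script from this repository. A minimal sketch (the file name, parameter names, and bucket below are placeholders):

# Submit a pipeline definition; inputs and outputs are NAME=VALUE pairs.
gcloud alpha genomics pipelines run \
  --pipeline-file my-pipeline.yaml \
  --inputs inputFile=gs://MY-BUCKET/input.bam \
  --outputs outputPath=gs://MY-BUCKET/output/

# The run command prints an operation ID; check its status with the helper script.
./poll.sh OPERATION-ID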

Prerequisites

  1. Clone or fork this repository.
  2. If you plan to create your own Docker images, then install Docker: https://docs.docker.com/engine/installation/#installation
  3. Follow the Google Genomics getting started instructions to set up your Google Cloud Project. The Pipelines API requires that the following are enabled in your project:
    1. Genomics API
    2. Cloud Storage API
    3. Compute Engine API
  4. Follow the Google Genomics getting started instructions to install and authorize the Google Cloud SDK.
  5. Install or update the Python client via pip install --upgrade google-api-python-client. For more detail, see https://cloud.google.com/genomics/v1/libraries.
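For reference, step 5 can be run and sanity-checked like this:

# Install or upgrade the Google API Python client library (prerequisite 5).
pip install --upgrade google-api-python-client

# Verify that the client library is importable.
python -c 'import googleapiclient.discovery; print("ok")'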

Examples

See Also

pipelines-api-examples's People

Contributors

binghamj, deflaux, eap, jbingham, mbookman, slagelwa


pipelines-api-examples's Issues

Enable multiple input JSONs

Cromwell now supports providing multiple JSON files of inputs, which provides significant convenience for managing inputs.

For example, for a given WDL, I can feed Cromwell one JSON file with input data files (inputs that typically change for every run of the pipeline), one JSON with analysis parameters (typically the same between runs), and one with performance-related parameters (e.g. Java heap size for each task) that amount to a sort of configuration, which I might want to swap out depending on the urgency of a project.

This would provide great value if it were supported by wdl_runner.

poll.sh helper script doesn't accept polling time argument

When called without arguments, the poll.sh script returns
Usage: ./poll.sh OPERATION-ID <poll-interval-seconds>, indicating that the user can specify the interval time as a second argument.

But it seems the usage check section (reproduced below) really only accepts a single argument, because of the $# -ne 1 condition:

if [[ $# -ne 1 ]]; then
  echo "Usage: $0 OPERATION-ID <poll-interval-seconds>"
  exit 1
fi
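A minimal sketch of how the check could accept the optional second argument instead (the variable name and default below are made up, not the script's current code):

if [[ $# -lt 1 || $# -gt 2 ]]; then
  echo "Usage: $0 OPERATION-ID <poll-interval-seconds>"
  exit 1
fi

# Fall back to an arbitrary default interval when no second argument is given.
POLL_INTERVAL_SECONDS="${2:-20}"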

Increasing java heap space in cromwell driver

I'm running into some memory issues where Cromwell itself is running out of heap space. One of our engineers tells me they sometimes ran out of memory when running large workflows (lots of inputs/outputs - large scatters) because the default is quite low.

I'm trying to get past this by tweaking cromwell_driver.py, adding a heap-space setting to the Java command in start():

  def start(self):
    """Start the Cromwell service."""
    if self.cromwell_proc:
      logging.info("Request to start Cromwell: already running")
      return

    self.cromwell_proc = subprocess.Popen([
        'java',
        '-Dconfig.file=' + self.cromwell_conf,
        '-Xmx4g',                                                    # <- line I added
        '-jar', self.cromwell_jar,
        'server'])

At the moment it's running, so at least I know I didn't break the code... I will confirm whether this gets past my memory issue.

In any case I thought it would be useful to document how one might be able to tweak Cromwell's memory settings. It might also be useful to expose this as a parameter of the wdl_runner in some form.

Updates to Jes template conf for Cromwell v24

There are some config changes required for using Cromwell v24.

The current google stanza in the Jes template looks like:

google {
  applicationName = "cromwell"
  cromwellAuthenticationScheme = "application_default"
}

but it needs to be:

google {
  application-name = "cromwell"
  auths = [
    {
      name = "application-default"
      scheme = "application_default"
    }
  ]
}

There's also the introduction of an additional required stanza:

engine {
  filesystems {
    gcs {
      auth = "application-default"
    }
  }
}

Docker installation issue in VMs

I noticed recently that I was having problems running a CWL workflow and getting Docker to install properly. I was using cwltool, and the get.docker.com script seemed to be failing. I switched it over to installing docker.io and all was good.

Update Cromwell Version

Cromwell recently had a new release and I would like to update the Cromwell version inside the wdl_runner to v26. I have an updated local docker image of the wdl_runner utilizing Cromwell v26, but not the permissions to push it to an appropriate repo. What's the best way to proceed?

genomics API spuriously creating VM instances

Hi there

I used Cromwell to kick off a batch of jobs using this workflow

I killed the Cromwell job, but the Genomics API is continuing to spin up VM instances.

I can't get the operation ID: when I run
gcloud --project=calico-uk-biobank alpha genomics operations list
the output is up to 300 MB of JSON so far...

Also, in our case there are no running operations visible in the web UI.
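A possible workaround, assuming the standard gcloud list flags and the alpha operations cancel command are available in your gcloud version:

# List a handful of operations at a time instead of dumping everything as JSON.
gcloud --project=calico-uk-biobank alpha genomics operations list \
  --limit 10 --format='value(name)'

# Cancel a specific operation by ID.
gcloud --project=calico-uk-biobank alpha genomics operations cancel OPERATION-ID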

E: Package 'docker.io' has no installation candidate

I was running the example provided by cwl_runner. It seems like docker.io can't be installed. Here is the error:

Copying gs://genomics-public-data/cwl-examples/gdc-dnaseq-cwl/workflows/dnaseq/transform.cwl...
/ [0/1 files][    0.0 B/ 19.1 KiB]   0% Done                                    
/ [1/1 files][ 19.1 KiB/ 19.1 KiB] 100% Done                                    
Operation completed over 1 objects/19.1 KiB.                                     
Copying gs://genomics-public-data/cwl-examples/gdc-dnaseq-cwl/input/gdc-dnaseq-input.json...
/ [0/1 files][    0.0 B/  363.0 B]   0% Done                                    
/ [1/1 files][  363.0 B/  363.0 B] 100% Done                                    
Operation completed over 1 objects/363.0 B.                                      
E: Package 'docker.io' has no installation candidate
sudo: easy_install: command not found
Failed to start docker.service: Unit docker.service not found.
/startup-bmLh_c/tmpy1X4h2: line 120: virtualenv: command not found
/startup-bmLh_c/tmpy1X4h2: line 121: cwl/bin/activate: No such file or directory
/startup-bmLh_c/tmpy1X4h2: line 122: pip: command not found
/startup-bmLh_c/tmpy1X4h2: line 130: cwl-runner: command not found
/startup-bmLh_c/tmpy1X4h2: line 132: deactivate: command not found
Building synchronization state...
Starting synchronization...
Copying file:///tmp/status-20359.txt [Content-Type=text/plain]...
/ [0 files][    0.0 B/    7.0 B]                                                
/ [1 files][    7.0 B/    7.0 B]                                                
Operation completed over 1 objects/7.0 B.                                        
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    41  100    41    0     0   4849      0 --:--:-- --:--:-- --:--:--  5125
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    12  100    12    0     0   1423      0 --:--:-- --:--:-- --:--:--  1500

How to deal with output directories

Hello,

When running a job with "gcloud alpha genomics pipelines run", my output consists of a couple of different directories:
/mnt/data/output/A
/mnt/data/output/B

Is there any way to copy the directories A and B to my GCS bucket without naming every file?

It fails because the pipeline tries: gsutil /mnt/data/output/* gs://my_bucket

Similar to the samtools example yaml, I have:
outputParameters:
- name: outputPath
  description: Cloud Storage path for where bamtofastq writes
  localCopy:
    path: output/*
    disk: datadisk

And:
gcloud alpha genomics pipelines run \
  --pipeline-file my.yaml \
  --inputs bamfiles.bam \
  --outputs outputPath=gs://cgc_bam_bucket_007/output/

I was thinking that in the docker cmd: >, the output dir could be tarred up, and then
the output is just a tarball. But it's not a great solution.
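For concreteness, a hypothetical sketch of that tarball workaround (the paths and names are made up):

# At the end of the docker cmd, bundle the directories into one file so the
# pipeline only has to delocalize a single output.
tar -czf /mnt/data/output/results.tar.gz -C /mnt/data/output A B
# outputPath would then point at output/results.tar.gz instead of output/*.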

Please help?

Enable WDL imports

Cromwell now has functionality that allows users to import WDL code within a WDL workflow, either individual tasks or even complete workflows (i.e. subworkflows). This is very useful, so it would be great to update the wdl_runner to take advantage of it. The main change needed is to add a parameter for the user to provide a zip file of the imports, and to plug that into Cromwell.

Here's an overview of how WDL imports work.

At the command line, you specify your "master" WDL as input and make a zip file of any dependent WDLs that contain either WDL tasks or entire workflows. The relevant docs are here: https://github.com/broadinstitute/cromwell/blob/23/README.md#imports

The main two use cases are:

  1. You have a library of single-task WDLs that you want to import rather than copy into your workflows. This would be a perfect application of the GATK wrappers we wrote to enable people to call tools without rewriting everything themselves. Here's a worked-out example from the doc:

For example, consider you have a directory of WDL files:

my_WDLs
└──cgrep.wdl
└──ps.wdl
└──wc.wdl

If you zip that directory to my_WDLs.zip, you have the option to pass it in as the last parameter in your run command and be able to reference these WDLs as imports in your primary WDL. For example, your primary WDL can look like this:

import "ps.wdl" as ps
import "cgrep.wdl"
import "wc.wdl" as wordCount

workflow threestep {
  call ps.ps as getStatus
  call cgrep.cgrep { input: str = getStatus.x }
  call wordCount { input: str = ... }
}

The command to run this WDL, without needing any inputs, workflow options, or metadata files, would look like:

$ java -jar cromwell.jar run threestep.wdl - - - /path/to/my_WDLs.zip
  2. Or you want to tie together multiple workflows, for example if you have one that reverts BAMs to unmapped BAMs, then our single-sample pipeline that takes uBAMs to make GVCFs per sample, then a third that runs joint genotyping on all the GVCFs. Sometimes you want to run them separately, sometimes all in a row, but you don't want to have one massive WDL that replicates the code from each in case you need to update individual segments (code drift alert!). So you use subworkflows, meaning you write one "master" WDL that is basically a container tying the three separate workflows together into a single runnable WDL, using import statements to load in entire workflows. There's a worked-out example in the doc at https://github.com/broadinstitute/cromwell/blob/23/README.md#sub-workflows.

Continuing our discussion for simplifying pipelines (and their examples)

Hi Matt (@mbookman),

So to continue our discussion from #10 (comment), I understand the REST interface here:

https://www.googleapis.com/discovery/v1/apis/genomics/v1alpha2/rest

But this is too cumbersome for bioinformaticians who just want a turn-key solution to run their tools. The examples are great, but we should have secondary, simplified ones, which would broaden the audience. That includes the ability to handle multiple input files, which can be done now even if the backend does not support it directly. We should also include examples of pipelines connected into workflows and of nested pipelines - and yes, there are several ways to do that :)

So with each example there should be pipelines like the one below, defined in a file that the program (Python/R/Java, etc.) will pick up and adapt to the REST interface. Here one provides only the necessary information, and the parser will transform the generalized names and also fill out the required fields on its own:

Pipeline:
  name: 'fastqc'
  CPU: 1
  RAM: 3.75 GB

  disks:
    name: 'datadisk'
    mountPoint: '/mnt/data'
    size: 500 GB
    persistent: true

  docker:
    image: 'gcr.io/PROJECT_ID_ARGUMENT/fastqc'
    cmd: ( 'mkdir /mnt/data/output && '
           'fastqc /mnt/data/input/* --outdir=/mnt/data/output/' )

  inputParameters:
    name: inputFile + [idx : 1...len(INPUT)]
    location:
      path: 'input/'
      disk: 'datadisk'

  outputParameters:
    name: 'outputPath'
    location:
      path: 'output/*'
      disk: 'datadisk'

pipelineArgs:
  RAM: 1 GB

  disks:
    name: 'datadisk'
    size: DISK_SIZE_ARGUMENT
    persistent: true

  inputs:
    inputFile + [idx : 1...len(INPUT)]
  outputs:
    path: OUTPUT_ARGUMENT

  logging:
    path: LOGGING_ARGUMENT
Let me know what you think.

Thanks,
Paul

Update poll.sh to support more "brief" output options

When an operation completes, the poll.sh script emits the entire operation.
This is often quite verbose, and one needs to scroll back in the terminal window to see simply whether the operation completed successfully or not.

gcloud supports some very nice output formatting options (see "gcloud topic formats"), so for example it would be helpful if poll.sh finished by passing the command-line option:

--format='yaml(done, error, metadata.events)'

instead of just --format yaml
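Assuming poll.sh ends by calling gcloud alpha genomics operations describe, the final call would then look something like:

gcloud alpha genomics operations describe OPERATION-ID \
  --format='yaml(done, error, metadata.events)'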

The user would see output like:

done: true
metadata:
  events:
  - description: start
    startTime: '2016-08-05T23:08:26.432090867Z'
  - description: pulling-image
    startTime: '2016-08-05T23:08:26.432154840Z'
  - description: localizing-files
    startTime: '2016-08-05T23:09:03.947223371Z'
  - description: running-docker
    startTime: '2016-08-05T23:09:03.947277516Z'
  - description: delocalizing-files
    startTime: '2016-08-06T00:26:22.863609038Z'
  - description: ok
    startTime: '2016-08-06T00:26:24.296178476Z'

Error when running samtools example

I am trying to run the samtools example. It ran successfully last Thursday, but today and yesterday it is giving me this error:

 "done": true,
 "error": {
  "code": 10,
  "message": "13: VM ggp-1907201874809182950 shut down unexpectedly."
 }

Any ideas?

test_index.sh error with "curl" command

I'm not sure whether this is a script error or a Cloud Shell error, but when trying to use this script while following one of the tutorial examples (in Cloud Shell), the following command failed:

curl -O ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot3_exon_targetted_GRCh37_bams/data/NA06986/alignment/NA06986.chromMT.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam

The problem may be the combination of curl and an FTP URL. Modifying the script or including some comments about other options would be helpful; e.g., the file can also be obtained from:

http://storage.googleapis.com/genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot3_exon_targetted_GRCh37_bams/data/NA06986/alignment/NA06986.chromMT.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam
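That is, in Cloud Shell either of these should work (the gsutil form assumes the usual storage.googleapis.com to gs:// bucket/object mapping):

# Fetch the file over HTTP instead of FTP.
curl -O http://storage.googleapis.com/genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot3_exon_targetted_GRCh37_bams/data/NA06986/alignment/NA06986.chromMT.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam

# Or copy the same object with gsutil.
gsutil cp gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot3_exon_targetted_GRCh37_bams/data/NA06986/alignment/NA06986.chromMT.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam .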
