
pipelines-api-examples's Introduction

pipelines-api-examples

This repository contains examples for the [Google Genomics Pipelines API](https://cloud.google.com/genomics/reference/rest/v1alpha2/pipelines).

Alpha
This is an Alpha release of the Google Genomics API. This feature might be changed in backward-incompatible ways and is not recommended for production use. It is not subject to any SLA or deprecation policy.

The API provides an easy way to create, run, and monitor command-line tools, packaged in Docker containers, on Google Compute Engine virtual machines. You can use it much as you would a job scheduler.

The most common use case is to run an off-the-shelf tool or custom script that reads and writes files. Often those files live in Google Cloud Storage, and the tool needs to run independently over hundreds or thousands of them.

The typical flow for a pipeline is:

  1. Create a Compute Engine virtual machine
  2. Copy one or more files from Cloud Storage to a disk
  3. Run the tool on the file(s)
  4. Copy the output to Cloud Storage
  5. Destroy the Compute Engine virtual machine

You can submit batch operations from your laptop and have them run in the cloud. You can package the tool in Docker yourself or use existing Docker images.
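For example, a pipeline defined in a YAML file can be submitted with gcloud and then monitored with the poll.sh helper script from this repository. A minimal sketch (the file name, parameter names, and bucket below are placeholders):

# Submit a pipeline definition; inputs and outputs are NAME=VALUE pairs.
gcloud alpha genomics pipelines run \
  --pipeline-file my-pipeline.yaml \
  --inputs inputFile=gs://MY-BUCKET/input.bam \
  --outputs outputPath=gs://MY-BUCKET/output/

# The run command prints an operation ID; check its status with the helper script.
./poll.sh OPERATION-ID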

Prerequisites

  1. Clone or fork this repository.
  2. If you plan to create your own Docker images, then install Docker: https://docs.docker.com/engine/installation/#installation
  3. Follow the Google Genomics getting started instructions to set up your Google Cloud Project. The Pipelines API requires that the following are enabled in your project:
    1. Genomics API
    2. Cloud Storage API
    3. Compute Engine API
  4. Follow the Google Genomics getting started instructions to install and authorize the Google Cloud SDK.
  5. Install or update the Python client via pip install --upgrade google-api-python-client. For more detail, see https://cloud.google.com/genomics/v1/libraries.
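For reference, step 5 can be run and sanity-checked like this:

# Install or upgrade the Google API Python client library (prerequisite 5).
pip install --upgrade google-api-python-client

# Verify that the client library is importable.
python -c 'import googleapiclient.discovery; print("ok")'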

Examples

See Also

pipelines-api-examples's People

Contributors

binghamj, deflaux, eap, jbingham, mbookman, slagelwa


pipelines-api-examples's Issues

Enable multiple input JSONs

Cromwell now supports providing multiple JSON files of inputs, which provides significant convenience for managing inputs.

For example, for a given WDL, I can feed Cromwell one JSON file with input data files (inputs that typically change for every run of the pipeline), one JSON with analysis parameters (typically the same between runs), and one with performance-related parameters (e.g. Java heap size for each task) that amount to a sort of configuration, which I might want to swap out depending on the urgency of a project.

This would provide great value if it were supported by wdl_runner.

poll.sh helper script doesn't accept polling time argument

When called without arguments, the poll.sh script returns
Usage: ./poll.sh OPERATION-ID <poll-interval-seconds>, indicating that the user can specify the interval time as a second argument.

But it seems the usage check section (reproduced below) really only accepts a single argument, because of the $# -ne 1 condition:

if [[ $# -ne 1 ]]; then
  echo "Usage: $0 OPERATION-ID <poll-interval-seconds>"
  exit 1
fi
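A minimal sketch of how the check could accept the optional second argument instead (the variable name and default below are made up, not the script's current code):

if [[ $# -lt 1 || $# -gt 2 ]]; then
  echo "Usage: $0 OPERATION-ID <poll-interval-seconds>"
  exit 1
fi

# Fall back to an arbitrary default interval when no second argument is given.
POLL_INTERVAL_SECONDS="${2:-20}"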

Increasing java heap space in cromwell driver

I'm running into some memory issues where Cromwell itself is running out of heap space. One of our engineers tells me they sometimes ran out of memory when running large workflows (lots of inputs/outputs - large scatters) because the default is quite low.

I'm trying to get past this by tweaking cromwell_driver.py, adding a heap-space setting to the Java command in start():

  def start(self):
    """Start the Cromwell service."""
    if self.cromwell_proc:
      logging.info("Request to start Cromwell: already running")
      return

    self.cromwell_proc = subprocess.Popen([
        'java',
        '-Dconfig.file=' + self.cromwell_conf,
        '-Xmx4g',                                                    # <- line I added
        '-jar', self.cromwell_jar,
        'server'])

At the moment it's running, so at least I know I didn't break the code... I will confirm whether this gets past my memory issue.

In any case I thought it would be useful to document how one might be able to tweak Cromwell's memory settings. It might also be useful to expose this as a parameter of the wdl_runner in some form.

Updates to Jes template conf for Cromwell v24

There are some config changes required for using Cromwell v24.

The current google stanza in the Jes template looks like:

google {
  applicationName = "cromwell"
  cromwellAuthenticationScheme = "application_default"
}

but it needs to be:

google {
  application-name = "cromwell"
  auths = [
    {
      name = "application-default"
      scheme = "application_default"
    }
  ]
}

There's also the introduction of an additional required stanza:

engine {
  filesystems {
    gcs {
      auth = "application-default"
    }
  }
}

Docker installation issue in VMs

I noticed recently that I was having problems running a CWL workflow and getting Docker to install properly. I was using cwltool, and the get.docker.com script seemed to be failing. I switched it over to installing docker.io and all was good.

Update Cromwell Version

Cromwell recently had a new release and I would like to update the Cromwell version inside the wdl_runner to v26. I have an updated local docker image of the wdl_runner utilizing Cromwell v26, but not the permissions to push it to an appropriate repo. What's the best way to proceed?

genomics API spuriously creating VM instances

Hi there

I used Cromwell to kick off a batch of jobs using this workflow

I killed the Cromwell job, but the Genomics API is continuing to spin up VM instances.

I can't get the operation ID: when I run
gcloud --project=calico-uk-biobank alpha genomics operations list
the output is up to 300 MB of JSON so far...

Also, in our case there are no running operations visible in the web UI.
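A possible workaround, assuming the standard gcloud list flags and the alpha operations cancel command are available in your gcloud version:

# List a handful of operations at a time instead of dumping everything as JSON.
gcloud --project=calico-uk-biobank alpha genomics operations list \
  --limit 10 --format='value(name)'

# Cancel a specific operation by ID.
gcloud --project=calico-uk-biobank alpha genomics operations cancel OPERATION-ID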

E: Package 'docker.io' has no installation candidate

I was running the example provided by cwl_runner. It seems like docker.io can't be installed. Here is the error:

Copying gs://genomics-public-data/cwl-examples/gdc-dnaseq-cwl/workflows/dnaseq/transform.cwl...
/ [0/1 files][    0.0 B/ 19.1 KiB]   0% Done                                    
/ [1/1 files][ 19.1 KiB/ 19.1 KiB] 100% Done                                    
Operation completed over 1 objects/19.1 KiB.                                     
Copying gs://genomics-public-data/cwl-examples/gdc-dnaseq-cwl/input/gdc-dnaseq-input.json...
/ [0/1 files][    0.0 B/  363.0 B]   0% Done                                    
/ [1/1 files][  363.0 B/  363.0 B] 100% Done                                    
Operation completed over 1 objects/363.0 B.                                      
E: Package 'docker.io' has no installation candidate
sudo: easy_install: command not found
Failed to start docker.service: Unit docker.service not found.
/startup-bmLh_c/tmpy1X4h2: line 120: virtualenv: command not found
/startup-bmLh_c/tmpy1X4h2: line 121: cwl/bin/activate: No such file or directory
/startup-bmLh_c/tmpy1X4h2: line 122: pip: command not found
/startup-bmLh_c/tmpy1X4h2: line 130: cwl-runner: command not found
/startup-bmLh_c/tmpy1X4h2: line 132: deactivate: command not found
Building synchronization state...
Starting synchronization...
Copying file:///tmp/status-20359.txt [Content-Type=text/plain]...
/ [0 files][    0.0 B/    7.0 B]                                                
/ [1 files][    7.0 B/    7.0 B]                                                
Operation completed over 1 objects/7.0 B.                                        
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    41  100    41    0     0   4849      0 --:--:-- --:--:-- --:--:--  5125
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    12  100    12    0     0   1423      0 --:--:-- --:--:-- --:--:--  1500

How to deal with output directories

Hello,

When running a job with "gcloud alpha genomics pipelines run", my output consists of a couple of different directories:
/mnt/data/output/A
/mnt/data/output/B

Is there any way to copy the directories A and B to my GCS bucket without naming every file?

It fails because the pipeline tries: gsutil /mnt/data/output/* gs://my_bucket

Similar to the samtools example yaml, I have:
outputParameters:
- name: outputPath
  description: Cloud Storage path for where bamtofastq writes
  localCopy:
    path: output/*
    disk: datadisk

And:
gcloud alpha genomics pipelines run \
  --pipeline-file my.yaml \
  --inputs bamfiles.bam \
  --outputs outputPath=gs://cgc_bam_bucket_007/output/

I was thinking that in the docker cmd: >, the output dir could be tarred up, and then
the output is just a tarball. But it's not a great solution.
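For concreteness, a hypothetical sketch of that tarball workaround (the paths and names are made up):

# At the end of the docker cmd, bundle the directories into one file so the
# pipeline only has to delocalize a single output.
tar -czf /mnt/data/output/results.tar.gz -C /mnt/data/output A B
# outputPath would then point at output/results.tar.gz instead of output/*.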

Please help?

Enable WDL imports

Cromwell now has functionality that allows users to import WDL code within a WDL workflow, either individual tasks or even complete workflows (i.e. subworkflows). This is very useful, so it would be great to update the wdl_runner to take advantage of it. The main change needed is to add a parameter for the user to provide a zip file of the imports, and to plug that into Cromwell.

Here's an overview of how WDL imports work.

At the command line, you specify your "master" WDL as input and make a zip file of any dependent WDLs that contain either WDL tasks or entire workflows. The relevant docs are here: https://github.com/broadinstitute/cromwell/blob/23/README.md#imports

The main two use cases are:

  1. You have a library of single-task WDLs that you want to import rather than copy into your workflows. This would be a perfect application of the GATK wrappers we wrote to enable people to call tools without rewriting everything themselves. Here's a worked-out example from the doc:

For example, consider you have a directory of WDL files:

my_WDLs
└──cgrep.wdl
└──ps.wdl
└──wc.wdl

If you zip that directory to my_WDLs.zip, you have the option to pass it in as the last parameter in your run command and be able to reference these WDLs as imports in your primary WDL. For example, your primary WDL can look like this:

import "ps.wdl" as ps
import "cgrep.wdl"
import "wc.wdl" as wordCount

workflow threestep {
  call ps.ps as getStatus
  call cgrep.cgrep { input: str = getStatus.x }
  call wordCount { input: str = ... }
}

The command to run this WDL, without needing any inputs, workflow options, or metadata files, would look like:

$ java -jar cromwell.jar run threestep.wdl - - - /path/to/my_WDLs.zip
  2. Or you want to tie together multiple workflows, for example if you have one that reverts BAMs to unmapped BAMs, then our single-sample pipeline that takes uBAMs to make GVCFs per sample, then a third that runs joint genotyping on all the GVCFs. Sometimes you want to run them separately, sometimes all in a row, but you don't want to have one massive WDL that replicates the code from each in case you need to update individual segments (code drift alert!). So you use subworkflows, meaning you write one "master" WDL that is basically a container tying the three separate workflows together into a single runnable WDL, using import statements to load in entire workflows. There's a worked-out example in the doc at https://github.com/broadinstitute/cromwell/blob/23/README.md#sub-workflows.

Continuing our discussion for simplifying pipelines (and their examples)

Hi Matt (@mbookman),

So to continue our discussion from #10 (comment), I understand the REST interface here:

https://www.googleapis.com/discovery/v1/apis/genomics/v1alpha2/rest

But this is too cumbersome for bioinformaticians who just want a turn-key solution to run their tools. The examples are great, but we should have secondary, simplified ones, which would broaden the audience. That includes the ability to handle multiple input files, which can be done now even if the backend does not support it directly. We should also include examples of pipelines connected into workflows and of nested pipelines - and yes, there are several ways to do that :)

So with each example there should be pipelines like the one below, defined in a file that the program (Python/R/Java, etc.) will pick up and adapt to the REST interface. Here one provides only the necessary information, and the parser will transform the generalized names and also fill out the required fields on its own:

Pipeline:
  name: 'fastqc'
  CPU: 1
  RAM: 3.75 GB

  disks:
    name: 'datadisk'
    mountPoint: '/mnt/data'
    size: 500 GB
    persistent: true

  docker:
    image: 'gcr.io/PROJECT_ID_ARGUMENT/fastqc'
    cmd: ( 'mkdir /mnt/data/output && '
           'fastqc /mnt/data/input/* --outdir=/mnt/data/output/' )

  inputParameters:
    name: inputFile + [idx : 1...len(INPUT)]
    location:
      path: 'input/'
      disk: 'datadisk'

  outputParameters:
    name: 'outputPath'
    location:
      path: 'output/*'
      disk: 'datadisk'

pipelineArgs:
  RAM: 1 GB

  disks:
    name: 'datadisk'
    size: DISK_SIZE_ARGUMENT
    persistent: true

  inputs:
    inputFile + [idx : 1...len(INPUT)]
  outputs:
    path: OUTPUT_ARGUMENT

  logging:
    path: LOGGING_ARGUMENT
Let me know what you think.

Thanks,
Paul

Update poll.sh to support more "brief" output options

When an operation completes, the poll.sh script emits the entire operation.
This is often quite verbose, and one needs to scroll back in the terminal window to see simply whether the operation completed successfully or not.

gcloud supports some very nice output formatting options (see "gcloud topic formats"), so for example it would be helpful if poll.sh finished by passing the command-line option:

--format='yaml(done, error, metadata.events)'

instead of just --format yaml
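Assuming poll.sh ends by calling gcloud alpha genomics operations describe, the final call would then look something like:

gcloud alpha genomics operations describe OPERATION-ID \
  --format='yaml(done, error, metadata.events)'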

The user would see output like:

done: true
metadata:
  events:
  - description: start
    startTime: '2016-08-05T23:08:26.432090867Z'
  - description: pulling-image
    startTime: '2016-08-05T23:08:26.432154840Z'
  - description: localizing-files
    startTime: '2016-08-05T23:09:03.947223371Z'
  - description: running-docker
    startTime: '2016-08-05T23:09:03.947277516Z'
  - description: delocalizing-files
    startTime: '2016-08-06T00:26:22.863609038Z'
  - description: ok
    startTime: '2016-08-06T00:26:24.296178476Z'

Error when running samtools example

I am trying to run the samtools example. It ran successfully last Thursday, but today and yesterday it is giving me this error:

 "done": true,
 "error": {
  "code": 10,
  "message": "13: VM ggp-1907201874809182950 shut down unexpectedly."
 }

Any ideas?

test_index.sh error with "curl" command

I'm not sure whether this is a script error or a Cloud Shell error, but when trying to use this script while following one of the tutorial examples (in Cloud Shell), the following command failed:

curl -O ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot3_exon_targetted_GRCh37_bams/data/NA06986/alignment/NA06986.chromMT.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam

The problem may be the combination of curl and an FTP URL. Modifying the script or including some comments about other options would be helpful; e.g., the file can also be obtained from:

http://storage.googleapis.com/genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot3_exon_targetted_GRCh37_bams/data/NA06986/alignment/NA06986.chromMT.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam
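That is, in Cloud Shell either of these should work (the gsutil form assumes the usual storage.googleapis.com to gs:// bucket/object mapping):

# Fetch the file over HTTP instead of FTP.
curl -O http://storage.googleapis.com/genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot3_exon_targetted_GRCh37_bams/data/NA06986/alignment/NA06986.chromMT.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam

# Or copy the same object with gsutil.
gsutil cp gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot3_exon_targetted_GRCh37_bams/data/NA06986/alignment/NA06986.chromMT.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam .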
