Serratus

Introduction Video (1:59)

Serratus Mountain in Squamish, BC. Canada

Background

The SARS-CoV-2 pandemic will infect millions and has already crippled the global economy.

While there is an intense research effort to sequence SARS-CoV-2 isolates and follow the evolution of the virus in real time, our understanding of coronavirus evolution is limited by the poor characterization of other members of the Coronaviridae family (genomes are available for only 53 of 436 CoV species).

We are re-analyzing all RNA-sequencing data in the NCBI Short Read Archive to discover new members of Coronaviridae and assemble their genomes. That is >1.12 million biological samples or 5.72 petabytes of sequencing data.

Architecture

serratus-overview

Contributing

Serratus is an Open-Science project. We welcome all scientists to contribute. See CONTRIBUTING.md

Email (ababaian AT bccrc DOT ca) or join Slack (type /join #serratus)

Setting up and running Serratus

0) Dependencies

AWS account

  1. Sign up for an AWS account (you can use the free tier)
  2. Create an IAM Admin User with Access Key. For Access type, use Programmatic access.
  3. Note the Access Key ID and Secret values.
  4. Create an EC2 key pair in the us-east-1 region. Retain the name of the key pair and the .pem file. Configure your ssh for easy AWS access (change serratus.pem to your identity file).

~/.ssh/config: Add these lines

Host *.compute.amazonaws.com *.compute-1.amazonaws.com aws_*
     User ec2-user
     IdentityFile ~/.ssh/serratus.pem
     StrictHostKeyChecking no
     UserKnownHostsFile /dev/null

Packer

  1. Download Packer as a binary. Extract it to a directory on your PATH (for example ~/.local/bin)

Terraform

  1. Download Terraform (>= v0.12.24) as a binary. Extract it to a directory on your PATH (for example ~/.local/bin)

1) Build Serratus AMIs with Packer

Pass AWS credentials to the pipeline via environment variables

export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
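Both variables must be visible to Packer and Terraform, so a quick pre-flight check can save a failed build. The following is a minimal sketch; `check_aws_env` is a hypothetical helper and is not part of the Serratus repository.

```shell
# Hypothetical helper: report which credential variables are unset or empty.
# It only tests for presence and never prints the values themselves.
check_aws_env() {
  missing=0
  for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    # Look up the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_aws_env && echo "credentials present"
```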

Use packer to build the serratus instance image (AMI)

cd serratus/packer
/path/to/packer build docker-ami.json
cd ../..

This will start up a t3.nano, build the AMI, and then terminate it. Currently this takes about 2 minutes, which should cost well under a penny. The final line of STDOUT will be the region and AMI. Retain this information.

Current stable AMI: us-east-1: ami-04c1625cf0bcb4159
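If the build output is saved to a file, the AMI ID can be recovered later without re-running the build. This sketch assumes the final line has the `region: ami-id` shape shown above; `last_ami` and the log file name `packer.log` are hypothetical.

```shell
# Hypothetical helper: read packer output on stdin and print the AMI ID from
# the last line that looks like "<region>: ami-<hex>".
last_ami() {
  sed -n 's/^[a-z0-9-]*: \(ami-[0-9a-f]*\)$/\1/p' | tail -n 1
}

# Usage (assumes the build output was captured with tee):
#   /path/to/packer build docker-ami.json | tee packer.log
#   last_ami < packer.log
```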

2) Build Serratus resources with Terraform

Set Terraform variables

Open terraform/main/terraform.tfvars in a text editor. Set these variables

  • dev_cidrs: Your public IP, followed by "/32". Use: curl ipecho.net/plain; echo
  • key_name: Your EC2 key pair name
  • dockerhub_account: (optional) Change this to your Docker Hub account to build your own images. Default images are in the serratusbio organization.
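The dev_cidrs value is your public IP with "/32" appended, which restricts access to that single address. A minimal sketch of building that string with basic validation follows; `to_cidr` is a hypothetical helper, and how the value is quoted inside terraform.tfvars depends on the variable's declared type, which is not shown here.

```shell
# Hypothetical helper: turn a bare IPv4 address into single-host CIDR form.
to_cidr() {
  ip=$1
  # Reject empty input, non-digit/dot characters, and misplaced dots.
  case $ip in
    ''|*[!0-9.]*|*..*|.*|*.) echo "not an IPv4 address: $ip" >&2; return 1 ;;
  esac
  # Split on dots and require exactly four fields, each at most 255.
  old_ifs=$IFS; IFS=.
  set -- $ip
  IFS=$old_ifs
  [ "$#" -eq 4 ] || { echo "not an IPv4 address: $ip" >&2; return 1; }
  for octet in "$@"; do
    [ "$octet" -le 255 ] 2>/dev/null || { echo "not an IPv4 address: $ip" >&2; return 1; }
  done
  printf '%s/32\n' "$ip"
}

# Usage:
#   to_cidr "$(curl -s ipecho.net/plain)"
```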

Create Serratus resources

Navigate to the top-level module and run terraform initialization and apply. Retain the scheduler DNS address (last output line).

cd terraform/main
terraform init
terraform apply
cd ../..

At the time of writing, this will create:

  • A t3.nano for the scheduler, with an Elastic IP
  • An S3 bucket to store intermediates
  • An ASG for serratus-dl, using c5.large with 50 GB of gp2 storage
  • An ASG for serratus-align, using c5.large
  • An ASG for serratus-merge, using t3.small
  • Security groups and IAM roles to tie it all together

All ASGs have a max size of 1. This can all be reconfigured in terraform/main/main.tf.

When terraform apply finishes, it outputs the scheduler's DNS address. Keep this for later.

3) Open SSH tunnel to the scheduler

The scheduler exposes ports 3000, 8000 and 9090. These ports are not exposed to the public internet. You will need to create an SSH tunnel to allow your local web browser and terminal to connect.

./create_tunnel.sh
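For readers unfamiliar with SSH port forwarding, the sketch below prints the kind of command such a tunnel needs. It is an illustration only and may differ from what create_tunnel.sh actually runs; `tunnel_cmd` is a hypothetical helper, and the host argument is the scheduler DNS address from the Terraform output.

```shell
# Hypothetical helper: print (not run) an ssh command that forwards the three
# scheduler ports to the same port numbers on localhost.
#   -N  do not run a remote command, only forward ports
#   -L  local_port:remote_host:remote_port
tunnel_cmd() {
  host=$1
  printf 'ssh -N'
  for port in 3000 8000 9090; do
    printf ' -L %s:localhost:%s' "$port" "$port"
  done
  printf ' %s\n' "$host"
}

# Usage:
#   tunnel_cmd "<scheduler DNS address>"
```

With the ~/.ssh/config entry from step 0, the user name and identity file are picked up automatically for matching AWS host names.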

Open a web browser for the UI:

  • Status Page: http://localhost:8000/jobs/
  • Grafana: http://localhost:3000/
  • Prometheus: http://localhost:9090/

May take a few minutes to boot. Make tea.
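Rather than refreshing the browser by hand, the status page can be polled until it answers. This is a minimal sketch assuming the tunnel is already open; `retry` is a hypothetical helper, and the attempt count and delay are arbitrary.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  attempts=$1; delay=$2; shift 2
  n=1
  while ! "$@"; do
    # Give up once the attempt budget is spent.
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Usage: try every 10 seconds, up to 30 times.
#   retry 30 10 curl -sf -o /dev/null http://localhost:8000/jobs/ && echo "scheduler is up"
```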

5) Loading SRA Accessions into Serratus

Once the scheduler is online, you can upload SRA accession data with curl, in the form of a SraRunInfo.csv file (NCBI SRA > Send to: File).

curl -s -X POST -T /path/to/SraRunInfo.csv localhost:8000/jobs/add_sra_run_info/

This should respond with a short JSON indicating the number of rows inserted, and the total number in the scheduler.
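To sanity-check the inserted-row count in that response, it can help to count the accessions in the CSV beforehand. The sketch below assumes the first column of the header line is named `Run`, as in typical RunInfo exports; `count_runs` is a hypothetical helper, and the scheduler may count rows differently (for example, if it skips duplicates).

```shell
# Hypothetical helper: count data rows in a RunInfo CSV.
# Skips blank lines and any header line (first field "Run"), which can repeat
# when several exports are concatenated into one file.
count_runs() {
  awk -F, '$0 !~ /^[ \t\r]*$/ && $1 != "Run" { n++ } END { print n + 0 }' "$1"
}

# Usage:
#   count_runs /path/to/SraRunInfo.csv
```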

In your web browser, refresh the status page. You should now see a list of accessions by state. If ASGs are online, they should start processing immediately. In a few seconds, the first entry will switch to "splitting" state, which means it's working.

6) Launch cluster nodes

With data loaded into the scheduler, manually set the number of serratus-dl, serratus-align and serratus-merge nodes to process the data. You can adjust the number of each node type with these scripts.

terraform/main/dl_set_capacity.sh 10
terraform/main/align_set_capacity.sh 10
terraform/main/merge_set_capacity.sh 1
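The contents of these scripts are not shown here. As a rough illustration of what resizing an Auto Scaling group involves, the sketch below prints an AWS CLI command that raises both the maximum and desired size, since the ASGs are created with a max size of 1. `asg_capacity_cmd` and the group name `example-dl-asg` are hypothetical; the real scripts and group names may differ.

```shell
# Hypothetical helper: print (not run) an AWS CLI command that sets the
# maximum and desired size of one Auto Scaling group to the same value.
asg_capacity_cmd() {
  name=$1; size=$2
  # Accept only non-negative integers for the capacity.
  case $size in
    ''|*[!0-9]*) echo "capacity must be a non-negative integer: $size" >&2; return 1 ;;
  esac
  printf 'aws autoscaling update-auto-scaling-group --auto-scaling-group-name %s --max-size %s --desired-capacity %s\n' \
    "$name" "$size" "$size"
}

# Usage:
#   asg_capacity_cmd example-dl-asg 10
```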

Example

Useful links

AWS-specific

SARS-CoV-2

Bloom Filters

Data Release Policy

To achieve our objective of providing high quality CoV sequence data to the global research effort, Serratus ensures:

  • All software development is open-source and freely available (GPLv3)
  • All sequencing data generated, raw and processed, will be freely available in the public domain in accordance with the Bermuda Principles of the Human Genome Project.
