serratus's Introduction

Serratus

Serratus is a collaborative Open Science project for ultra-rapid discovery of all viruses.

Serratus Mountain in Squamish, BC, Canada

Background

Forewarned is Forearmed.

Our primary goal is to generate the viral data to accelerate global research efforts in fighting emerging pathogens. We are re-analyzing all RNA-seq, metagenomics, metatranscriptomics and environmental sequencing data in the NCBI Sequence Read Archive to discover new viruses. That is >6 million biological samples, or >10 petabases of sequencing data.

Contribute to Serratus

The Serratus team is actively looking to collaborate with all scientists and developers.

Serratus Usage

Learn more on the Serratus Wiki

Data Release Policy

Our primary goal is to generate the viral data to accelerate the global research efforts in fighting emerging pathogens. To achieve this:

  • All software development is open-source and freely available (GPLv3)

  • All data, raw and processed, will be freely and immediately available in the public domain in accordance with the INSDC Policy.


serratus's People

Contributors

brietaylor, charlescongxu, hussius, justinchu, kinow, mathemage, pierrebarbera, rcedgar, taltman, tfmorris, thebatmanofbutler, victorlin


serratus's Issues

Error: No more processes: cannot run a single job.

It looks like one of the instances ran out of PIDs? Here's the message Artem found in the logs:

00:29:17
parallel: Warning: No more processes: Decreasing number of running jobs to 1.
parallel: Warning: No more processes: Decreasing number of running jobs to 1.

00:29:17
parallel: Warning: Raising ulimit -u or /etc/security/limits.conf may help.

00:29:17
Completed 256.0 KiB/67.9 MiB (688.9 KiB/s) with 1 file(s) remaining Completed 512.0 KiB/67.9 MiB (1.3 MiB/s) with 1 file(s) remaining Completed 768.0 KiB/67.9 MiB (1.9 MiB/s) with 1 file(s) remaining Completed 1.0 MiB/67.9 MiB (2.6 MiB/s) with 1 file(s) remaining Completed 1.2 MiB/67.9 MiB (3.2 MiB/s) with 1 file(s) remaining Completed 1.5 MiB/67.9 MiB (3.8 MiB/s) with 1 file(s) remain

00:29:17
parallel: Error: No more processes: cannot run a single job. Something is wrong at .

`bam` and `fastq` files via `fusera`

sratoolkit works natively with AWS/GCP to access SRA archive files. Most data is stored in this format and requires fastq-dump from sratoolkit to start the pipeline. Some modern SRA entries already contain bam or fastq files; these would be faster to access via fusera.

Add a "Try Fusera" step to access fq files first, then fall back on the sratoolkit fastq-dump, which is the current default.

Originally posted by @ababaian in
#5 (comment)

Reduce disk usage (or dynamically provision) in serratus-dl.sh

SRA files have vastly different sizes, and currently fastq-dump stores them as raw text, which appears to be (empirically speaking, sample size = 1) ~8x larger than the files on SRA. If we keep doing it this way, some of the SRA files will expand to ~220 GB for this step. Here are some possible options to mitigate:

  1. As suggested by Artem, dynamically provision and mount EBS volumes with needed storage. (not sure if possible)
  2. Give the instances fixed storage and implement a way for the scheduler to assign jobs to instances with available storage (complicated).
  3. Stream the files to and from S3 using pipes. Maybe possible using SRA-Pipeline code as inspiration (though how to do paired-end reads?)
  4. Use EFS (Elastic File System). EFS bandwidth for bursting is 50 KB/s per GB of stored data. You can pay for extra speed but it's not cheap. Not meant for temporary files like this.
  5. Overprovision storage for the instances and just pay for it. Easy. EBS is $0.13/TB-hr. Worst case scenario is two threads, 220GB each, times 2 (unsplit and split) ~= 1TB. C5.large Spot instances are ~$0.03/hr (OnDemand = $0.085). So we pay more for storage than compute, but it's not the end of the world.

cov2 pan-genome refinements

Objective

  • Refine cov1 to reduce the level of false-positives and increase alignment specificity
  • Reduce the representation of highly homologous sequences (99% sequence identity) in the pan-genome to decrease the alignment space

include "WGS" in "Assay Type" field from SRA

WGS data sets include viral RNA

see the following runs for SARS-CoV-2 specific datasets:

  • SRR11393704
  • SRR11410122
  • SRR11410124
  • SRR11433889
  • SRR11433885
  • SRR11433893
  • SRR11410125
  • SRR11433881
  • SRR11433890
  • SRR11433892
  • SRR11433886
  • SRR11433888
  • SRR11410121
  • SRR11433891
  • SRR11433882
  • SRR11433883
  • SRR11433887
  • SRR11410123
  • SRR11393275
  • SRR11393276
  • SRR11393278
  • SRR11393277
  • SRR11433884

TODO List

AWS Specific Tasks

  • sratoolkit works natively with AWS/GCP to access SRA archive files. Most data is stored in this format and requires fastq-dump from sratoolkit to start the pipeline. Some modern SRA entries already contain bam or fastq files; these would be faster to access via fusera. Add a "Try Fusera" step to access fq files first, then fall back on the sratoolkit fastq-dump, which is the current default.
  • Try to abstract away the entire serratus-align layer into AWS Lambda functions. See: 1 2.
  • Do a price comparison for serratus-dl between using C5.large and i3.large instances, the latter have 2 CPU, high ram and 400 GB internal storage as part of the machine. You need to mount the internal volume yourself Issue #6
  • From the SraRunInfo table we know the size of the file in MB. Can we dynamically/rapidly provision how much EBS space the EC2 instance has, so that it's, say, 5 GB + 3x the SRA size? I.e. for a 2 GB SRA it will have 5 + 3x2 = 11 GB. Then the next job is 100 GB so we'll need 5 + 3x100 = 305 GB on EBS, then the next job is 200 MB and we need ... Issue #4
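The sizing rule in the last bullet can be sketched as a tiny helper. The base overhead (5 GB) and 3x expansion factor are the values suggested above; the function name is hypothetical:

```python
def provisioned_ebs_gb(sra_size_gb, base_gb=5, expansion_factor=3):
    """Hypothetical EBS sizing rule: fixed base overhead plus a
    multiple of the compressed SRA size, to leave room for the
    ~8x fastq-dump expansion plus split output."""
    return base_gb + expansion_factor * sra_size_gb
```

For a 2 GB SRA this yields 11 GB, and for a 100 GB SRA it yields 305 GB, matching the examples above.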

BASH /Linux tasks

  • Build bowtie2 on ARM, to take advantage of ~1/2 price A1 instances. Also samtools and blastn.

Docker tasks

  • Research available BLAST containers and time the efficiency of each implementation.

Bioinformatics tasks

  • Find software implementations of Bloom Filters that we can use to stream data through and select possible CoV Reads
  • Create an annotated list of SRA libraries with known Coronaviruses in the samples (i.e. SARS cell culture, SARS-CoV-2 accessions).
  • Create a simulated test-data set, using a baseline human transcriptome (cell line?) 'spike in' simulated viral genome/transcripts at different quantities so we can test the sensitivity/specificity of our pipeline versus a known standard.

Taxonomy identifiers for Cov reference database

It would be useful to include taxonomy information in the pan-genome reference to enable analyses such as taxonomy-aware summaries of bowtie2 hits. Taxonomy descriptions are included in the full FASTA deflines, but not the taxonomy id (i.e., the integer accession in the NCBI Taxonomy database). The integer accession is better because it places the sequence in a tree structure. For example, MH726362.1 is "Porcine epidemic diarrhea virus isolate GDS05, complete genome" and the taxonomy id in the Genbank record is 28295. To implement this, we need a script which can take a long list (~33k) of sequence accessions (MH726362.1) and fetch their taxonomy identifiers. I'm guessing this can be done by a simple query using Entrez or something like that.

Even better would be to have the taxonomy identifier (ideal) or species name (good) of the host as well. This can be done by downloading the Genbank record and extracting the "/host" annotation which gives the species name, e.g. for MH726362.1 you would find /host="Sus scrofa". The taxonomy identifier of the host is not given, so an additional lookup (Entrez?) would be required to retrieve it.
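A sketch of the extraction step, assuming the GenBank flat-file record has already been downloaded (the Entrez fetch itself is not shown since it needs network access); the `extract_host` helper name is hypothetical:

```python
import re

def extract_host(genbank_text):
    """Pull the species name from a /host="..." qualifier in a
    GenBank flat-file record; returns None if the record has none."""
    m = re.search(r'/host="([^"]+)"', genbank_text)
    return m.group(1) if m else None

# Illustrative fragment of a GenBank record (not the full MH726362.1 entry)
record = '''FEATURES             Location/Qualifiers
     source          1..28033
                     /organism="Porcine epidemic diarrhea virus"
                     /host="Sus scrofa"'''
```

Mapping the extracted species name to its taxonomy id would still need the additional Entrez lookup described above.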

Simulated test set using a baseline human transcriptome

Objective

  • Create a 'benchmark' set of RNA-seq data of human (use real data from tissue) and a dilution series of 'spike in' simulated CoV sequences: 1e6, 1e5, 1e4 ... 1 viral genome copy per library.
  • Create a benchmark series for at least 3-5 divergent CoV species (see: genome phylogeny)
  • Implement a naming convention/flags in the reads such that we can rapidly separate and quantify human/CoV sequences from a BAM/SAM file using 'grep'
  • Must be delivered with a reproducible script to create new benchmark datasets rapidly, with minimal additional software requirements.
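One possible naming convention for the grep-able separation above, sketched as a minimal helper; the `HUM`/`COV` tags and field layout are hypothetical choices, not a decided scheme:

```python
def tag_read_name(name, source):
    """Prefix a simulated read name with its source label so SAM
    records can later be partitioned with a plain grep (or, in
    Python, a startswith check) on column 1."""
    return f"{source}:{name}"

# Toy SAM alignment lines (read name is the first tab-separated field)
sam_lines = [
    "COV:read1\t0\tMN908947.3\t100\t42\t50M\t*\t0\t0\tACGT\tIIII",
    "HUM:read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
cov_hits = [line for line in sam_lines if line.startswith("COV:")]
```

The same filter works from the shell as `grep '^COV:' hits.sam`.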

Write a performance / cost simulation for autoscaling

I'd like to write a small tool, using numpy, that would simulate performance and cost over time. The idea is to include the number of nodes running at a given time, how many are booting, processing data, shutting down, etc, total data stored on S3 over time. Then we could "plug in" different autoscaling strategies (do nothing, proportional, PI, etc), and observe how these affect final cost.

My gut feeling is that more aggressive scaling strategies will save us money. But I believe it'll be much more efficient to test our ideas in simulation (runs in seconds) than in the real world (runs in hours).

Artem also floated some ideas, like running small -> big or randomly ordering runs. We could test these as well.
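A minimal discrete-time sketch of such a simulator (pure Python rather than numpy, to stay self-contained; the boot delay, per-node processing rate, and strategy signatures are all made-up placeholders):

```python
def simulate(jobs, strategy, boot_time=2, rate_per_node=1.0, steps=200):
    """Toy model: each step the strategy picks a desired node count
    from the backlog; booting nodes come online after boot_time
    steps; 'cost' is total node-steps consumed until the backlog
    drains (or the step budget runs out)."""
    running, booting, cost = 0, [], 0
    for _ in range(steps):
        if jobs <= 0:
            break
        booting = [b - 1 for b in booting]
        running += sum(1 for b in booting if b == 0)
        booting = [b for b in booting if b > 0]
        desired = strategy(jobs)
        if desired > running + len(booting):
            booting += [boot_time] * (desired - running - len(booting))
        jobs -= running * rate_per_node
        cost += running + len(booting)
    return cost

flat = lambda backlog: 4                          # "do nothing" baseline
proportional = lambda backlog: max(1, int(backlog // 10))
```

Different strategies (proportional, PI, etc.) can then be plugged in and their costs compared over the same workload.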

Perform CoV Cross-validation (hold-out)

From Artem by email: "[T]he difference in number of reads mapping between CoV+ control libraries and mammal transcriptomes is very large."

I suspect this may be misleading. If I understand correctly, Cov+ datasets have known infections by known coronaviruses, mostly (all?) Cov-19, but we are looking for incidental infections by novel coronaviruses which by definition are not in the pan-genome. Possibly, the number of viral transcripts will tend to be much lower with an incidental infection. Certainly, a novel virus will have lower identity and uneven coverage to the pan-genome compared to a positive control with a Cov-19 infection.

To model what we might see in production, Cov+ datasets should be mapped to a pan-genome with the genome of the known infection (Cov-19) excluded, along with its genes and close relatives. This is hold-out validation, also called cross-validation.

A hold-out pan-genome can be constructed using usearch as follows:

usearch -usearch_global pan_genome.fa \
-db cov19_genome.fa \
-strand both \
-id 0.95 \
-uc hits.uc \
-notmatched holdout_pan_genome.fa

Here, 0.95 is the identity threshold: all sequences having >=95% identity to cov19_genome.fa are removed from the reference. hits.uc is a tsv file with one record for each pan-genome sequence indicating whether it matched or not.
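A sketch of post-processing hits.uc, assuming the standard usearch UC layout (record type in column 1, query label in column 9); the sample records below are illustrative, not real output:

```python
def split_uc(uc_lines):
    """Partition pan-genome sequence labels from a usearch .uc file
    into matched (record type 'H') and unmatched ('N') sets."""
    matched, unmatched = set(), set()
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        label = fields[8]
        (matched if fields[0] == "H" else unmatched).add(label)
    return matched, unmatched

uc = [
    "H\t0\t29903\t97.1\t+\t0\t0\t=\tMN908947.3\tcov19",
    "N\t*\t28033\t*\t*\t*\t*\t*\tMH726362.1\t*",
]
```

The unmatched set should correspond to the sequences written to holdout_pan_genome.fa.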

Unexpectedly high PUT/LIST usage on S3

From my own tests, I've noticed a large fraction of expense is in S3 Tier1 requests: PUT/COPY/LIST/POST. I don't believe we use COPY or POST anywhere, which leaves PUT and LIST.

Expected behaviour is 2 PUTs for unpaired blocks, and 4 for paired (counted in the code), plus a few LISTs at the accession level. But really, we're doing more like 10+. This is about a quarter of our cost right now.

We can pretty easily make the blocks bigger, to reduce the overhead of this (which we should do). But we should also try to understand aws s3 cp and aws s3 ls, in case these are doing unnecessary work.

Aggregate Stats by ASG in Prometheus / Grafana

Actual Behaviour:

Screenshot from 2020-04-17 21-42-43

Desired Behaviour

Pretty graphs, with min/max/mean/median values, aggregated beautifully by auto scaling group.

How to get there

We need the prometheus node query tool to somehow add an "autoscaling_group" value to each instance. Then, we can create aggregated queries on these values in grafana, and use its styling features to create a min/max/average plot.

This should also take some load off of Grafana (making hundreds of queries), my laptop (rendering hundreds of lines), and put it on Prometheus, which is optimized for aggregating tons of data, and I think it includes features for caching and such.

Migrate `serratus-public` to US-East-1

To save on egress costs, the serratus-public bucket is going to be migrated over to us-east-1 region. This requires that the entire bucket be deleted temporarily while move commands are performed and will mean upload/download is temporarily disrupted.

Check updated Automation of project `TODO List`

Hi @ababaian and @jefftaylor42 !

I have updated the preset Automation actions in the project TODO List.

I believe this better reflects your needs to manage this project and appears to be a logical way to assign the preset actions.

Please check in case you disagree with any of the actions and feel free to close this issue when you are satisfied.

Thanks and regards,
@mathemage

serratus-dl: Out of disk usage error

serratus-dl ran out of disk space while processing SRR11278425, leading to a split_err. An easy fix is to set disk space large enough to hold the largest dataset. But the two workers' files were only ~2.5 GB each? Should disk be a variable based on input data?

upload: ./SRR11278425.2.fq.0000000072 to s3://tf-serratus-work-20200416231833304500000001/fq-blocks/SRR11278425/SRR11278425.2.fq.0000000072
2020-04-16T23:35:46 fastq-dump.2.10.4 err: storage exhausted while writing file within file system module - system bad file descriptor error fd='8'

Screenshot from 2020-04-16 16-51-39

Why just Cov?

Seems to me we have a workflow which can be expanded to include many more families of virus with a small incremental effort. In terms of EC2 resources, there would be negligible additional cost because bowtie2 can map to a 3Gb pan-genome at similar speed to a 30Mb pan-genome. Why not leverage the big compute to capture as many novel viruses as possible?

Find a better way to reference the scheduler

Here's the process to make a change in the scheduler:

$ code, code, code
$ podman build
$ podman push
$ cd path-to-serratus-live/stage
$ terraform taint module.scheduler.aws_instance.scheduler
$ terraform apply
$ manually shut down any old instances from the ASGs, since AWS doesn't automatically recreate them when the user_data changes.

This process has a few problems. It includes a lot of manual steps (like deleting old instances from the ASGs), it's easy to forget things, and it just takes a long time to get through all these steps. The main issue is that the DNS name of the scheduler is generated on boot, by AWS, and passed to the workers via user_data. So if the scheduler is rebooted, it'll get a new DNS, and the workers won't know where to find it. Which means everything has to be restarted.

Add `Homebrew` to base Serratus container

Homebrew is a package manager for bioinformatics and will give us fast access to a suite of bioinformatics software within serratus.

Add this to the serratus-base dockerfile such that it's installed system wide. See Dockerfile README for an outline of the containers set-up.

Learning how to work with serratus repo

I'm trying to get my head around git, github and how to collaborate within the serratus project repo. To get a toe in the github water, I created a repo for the simulator I made to do some bowtie2 benchmarking:

https://github.com/rcedgar/simple_paired_read_simulator

I guess next steps for the simulation are:

(1) Upload the simulated reads and pan-genome to s3://serratus-public/ because the total size is ~60 MB, so they belong on S3, not in GitHub. I made a pan-genome myself rather than taking one from the repo, so it should be included to enable reproducing the results (if in fact this is needed for a simulation that may never be used again).

(2) Post a wrapper script to serratus repo which re-generates the simulated reads by invoking simple_paired_read_simulator.py

I'm not clear in detail how I should do (2). Artem kindly provided a quick summary that I don't understand: "Just keep this all to a separate branch "rce". The "master" branch is for production and experimental results. If there's files that need to be migrated between branches we can use "git checkout" in this case. " Can someone explain this in more detail, noting that I'm a git novice and do not yet understand branching and much else.

Hard Optimize `bowtie2` alignment parameters

Problem:

We are currently running bowtie2 --very-sensitive-local ... for detecting homologous CoV sequences. We need a method (script) to test an array of bowtie2 parameters for time, sensitivity and specificity of alignment.

Current settings are -D 20 -R 3 -N 0 -L 20 -i S,1,0.50; say we'd like to test the space of -D 5-25, -R 1-4, -N 0-1, -L 15-30.

Test Data

We will have simulated data soon for benchmarking, but for now we can use some real data. You can likely sub-sample these to ~100,000 reads from CoV+ and 100,000 from CoV- for coarse-grained optimization, and then do a few ultra-deep libraries in a smaller search space.

Output

The output should be

  1. Wall-clock / CPU time / User time for each setting.
  2. TP / FP / TN / FN
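Generating the sweep itself is straightforward; a sketch using the ranges suggested above (the step sizes are arbitrary choices for a coarse grid):

```python
from itertools import product

# Hypothetical coarse grid over the suggested bowtie2 parameter ranges;
# each string becomes one bowtie2 invocation to be timed and scored.
grid = [
    f"-D {d} -R {r} -N {n} -L {l} -i S,1,0.50"
    for d, r, n, l in product(
        range(5, 26, 5),    # -D 5..25
        range(1, 5),        # -R 1..4
        range(2),           # -N 0..1
        range(15, 31, 5),   # -L 15..30
    )
]
```

The current production setting (-D 20 -R 3 -N 0 -L 20) falls inside this grid, so it serves as a built-in baseline.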

Blacklist implementation

The blacklist is a useful resource. Currently it seems to be embedded rather obscurely in script code. Suggest that it be implemented as a plain-text file in the repo with two fields: accession and comment (why it is blacklisted). The full Covid reference would include an annotation (e.g. blacklist=yes) for the benefit of post-processing analysis rather than removing the sequence from the FASTA file. This is forwards-compatible: if a sequence was blacklisted after a dataset was run through bowtie2, the corresponding SAM records can be ignored, e.g. by the summarizer / candidate dataset identification module. The goal of including it in the master FASTA is to consolidate as much useful information as possible in one resource.

Allow adding public keys to terraform config

Right now we have a terraform variable:

key_name = "jeff@rosario"

This is annoying because:

  • You have to log into the console to import the key
  • It doesn't support more than one key (eg. for Jeff and Artem both logging in)

What I really want is to use the same format SSH uses:

public_keys = [
    "ssh-rsa <publickey> jeff@rosario",
    "ssh-rsa <publickey> artem@somewhere"
]

Terraform would then ensure that users can log in using these keys. Probably via user_data, but TBD.

Software implementations of `Bloom Filters`

Find software implementations of Bloom Filters that we can use to stream data through and select possible CoV Reads

Originally posted by @ababaian in
#5 (comment)

  • Curate a list of available bloom filter software. Include licenses, is it command line, language it's written in, website, academic paper reference if available.
  • Build each software package in a container and test how long it takes to process a benchmark set of RNA-sequencing libraries against a small test filter.
  • (If needed, we can alter the program to accept RNA-sequencing reads in fastq format from a data pipe.)

Originally posted by @ababaian in
#15 (comment)
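For intuition, a minimal k-mer Bloom filter can be sketched in a few lines. This is an illustration of the data structure, not a candidate for the production pipeline; the bit-array size, hash scheme, and word length are arbitrary placeholders:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: insert reference k-mers, then stream
    reads and keep any read sharing a k-mer with the reference.
    False positives are possible; false negatives are not."""
    def __init__(self, size=1 << 20, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = bytearray(size // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one keyed hash.
        for i in range(self.hashes):
            h = hashlib.blake2b(item.encode(), salt=str(i).encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def kmers(seq, k=18):
    """Yield all overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))
```

A streaming filter would add every k-mer of the CoV pan-genome, then pass through only reads with at least one k-mer hit.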

Implement a "Jump Server" for serratus

Current process

  1. Add my public key to terraform
  2. SSH into all the instances
  3. Add Artem's public key
  4. Run multiple SSH tunnels on my laptop to connect to:
  • Grafana
  • Prometheus
  • Scheduler

Desired process

Have one machine, a jump server, with everyone's keypairs installed and a light nginx webserver, with port forwards set up to the instances we need to see. Then we just need a single SSH tunnel to connect to all our services.

We could also separate it into its own public subnet, and make the rest of our application run in a private subnet, reducing surface area.

`Serratus` CLI front-end wrapper script

Currently to get serratus up and running requires a few steps including building containers, uploading them, initializing AWS credentials and access keys, then submitting a job (sraRunInfo table) to the scheduler via curl. This is detailed in the Readme.md

  • Create a wrapper script for Serratus such that you can type in serratus --sra <runInfo.csv> and it will go through the launch procedures and begin processing data.

  • Create a build script for Serratus such that you can type in serratus --build and it will make the AMI via packer, make containers if they don't exist, and create all the necessary resource files for serratus.

This should be implemented in POSIX-compliant bash.

Refactor worker nodes into ECS

There are a couple of nuisances with our current worker strategy, that I think would be helped by moving most of what we've done to an orchestration system like ECS.

  1. Log streams are currently generated by instance, so the logs from all N workers get interleaved (making it hard to find errors)
  2. When a worker crashes, it never gets replaced.
  3. Updating the container images is a right pain: docker kill, docker rm, docker pull, find / -name part-001, (cloud-init script) /path/to/part-001; versus pushing a new launch template and having fresh images in a couple of minutes.
  4. Ugly names for the ASGs (meaning we have to "discover" the names to adjust desired sizes), like tf-asg-tf-serratus-dl-20200304125312000001. This is currently necessary so that all instances get replaced when we change the user_data in the launch configuration; ECS would deal with sending the correct arguments to our scripts.

There are a couple things to work out though, first:

  • will we use Daemon or Replication jobs? Daemon doesn't solve 1, but replication doesn't solve 4. We need a way to force all images to be replaced if we change them.
  • ECS + Cloudwatch Logs
  • ...and more, maybe?

bvfilter bugs & fixes

I found and fixed some bugs in bvfilter. New binary posted here:

https://drive5.com/tmp/bvfilter.gz

The md5sum of the binary after gunzip should be bac9251f90e8bddb4f07e398a75ccb76.

Below is the bash script I used for testing. Note that the pipeline is single-threaded, uses bvfilter in memory-mapped mode, and uses bowtie2 in unpaired mode (-U) per Issue #50. I will post timing results from this test shortly, probably as an Issue (is there a better way?).

Notice that the -quiet option of bvfilter is required because otherwise it stupidly writes a copyright message to stdout and messes up the SAM header (this bug not yet fixed).

#!/bin/bash

SRR=SRR11454614

./bvfilter -load_bitvec cov.bv -output COV_BV

/usr/bin/time -o time.txt \
    fastq-dump -Z $SRR \
    | ./bvfilter -quiet -search_bitvec /dev/stdin -threads 1 \
        -ref COV_BV \
        -wordlength 18 \
        -sharedmem \
        -output /dev/stdout \
    | bowtie2 -p 1 \
        -x cov1.id99.polyamasked \
        --very-sensitive-local \
        -U /dev/stdin \
    > hits.sam

Good ideas to implement

  • Migrate over to us-east-1 region for closer access to NIH AWS systems and not get dinged for cross-region I/O

Instance doesn't signal the scheduler when shutting down

Steps to Reproduce:

  1. SSH into a running serratus-dl instance and run docker stop serratus-dl.
  2. SSH into a running serratus-dl instance and run sudo shutdown -h now
  3. Decrease the size of a running serratus-dl ASG from 1 to 0.

Expected behaviour:

In all cases, the instance should send a "terminated" message to the scheduler, before shutting down cleanly. (which tells the scheduler it can reschedule that job)

Actual behaviour:

The "terminated" signal isn't sent.

In case 1, Docker waits the full 10 seconds for the program to receive its signal, before hard-killing it.

serratus-Aligner based off Alpine Linux

The current AMI image for serratus-Aligner is based on Amazon Linux 2 (9 GB base image). Switching this to Alpine Linux (95 MB base image; ~400 MB with python + dev libs) will reduce storage overhead and load times.

The current issue is that Alpine Linux uses musl libc rather than glibc, which causes dependency issues with bowtie2 and perl (a dependency). I would get a recurring 'executable not found' error with pre-compiled bowtie2 and could not compile from source. See the early version of the makeAligner.sh script above to reproduce the attempt via Alpine Linux.

Benchmarking statistics for divergence simulation tests

Given cov0r.fa, and an incoming stream of aligned reads (BAM format), count the total number of reads that match per sequence. Counting to be done on the fly until EOF.

cov0r.fa consists of sequences along with non-complement reverse entries. The counts for each sequence+reverse pair can be used to calculate TP/FP/TN/FN.
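The counting step can be sketched as a streaming tally over SAM records (the reference name is column 3 of each alignment line; pairing counts back to cov0r.fa forward/reverse entries is left out of this sketch):

```python
from collections import Counter

def count_hits(sam_lines):
    """Stream SAM records and tally mapped reads per reference
    sequence; header lines ('@') and unmapped reads ('*' in the
    RNAME column) are skipped. Runs on the fly until EOF."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        rname = line.split("\t")[2]
        if rname != "*":
            counts[rname] += 1
    return counts
```

Fed from `sys.stdin` after `samtools view`, this gives per-sequence counts without buffering the whole alignment.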

cov1 Multiple Sequence Alignment

Preamble

See #27

  1. cov0r.fa
    • generate suffix trees for faster subsequence searching (python implementation)
    • compute for forward sequences only (we can reverse the incoming SAM reads to check for reverse matches)
    • compute for unique sequences only (out of 33,296 forward sequences, only 24,865 are unique)

Originally posted by @victorlin in #27 (comment)

The patched version of cov1r now removes perfect-match duplicates of sequences.

Objective

To further optimize cov1r, generate a multiple-sequence alignment of the current forward sequences and prune sequences that are >99% (or any arbitrary cut-off) similar. Since we don't need an exact match to 'detect' CoV reads, we can narrow the search space while retaining high levels of sensitivity.

To pair or not to pair

From Artem by email: "I'd be happy to abandon paired-end analysis and it would make our lives much simpler, if there is data showing that it's not going to be informative for detecting CoVs."

It seems certain to me that read pairing could have no meaningful benefit in the big compute, though it might in a second pass to analyze candidate datasets. If one read in a pair does(n't) have at least one alignment, bowtie2 will (not) find and report at least one alignment independently of the other read. This is equally true in both paired and unpaired mode. Essentially the only benefit of pairing is to increase MAPQ if the pair are mapped to locations that are close together consistent with the estimated range of construct lengths (say, in the range 200 - 500nt for a typical Illumina shotgun library). In rare cases, this resolves the location of one of the reads if it maps to repetitive sequence -- e.g., if R1 has a unique mapping but R2 maps to repeats that are 300nt and 1000nt distant, then the first repeat is probably right. In unpaired mode, bowtie2 would make an arbitrary choice between the two R2 alignments and would assign a very low MAPQ.

To re-state, if bowtie2 finds alignments for a paired read, it will almost always find alignments for R1 and R2 separately in unpaired mode. The only difference between paired and unpaired is that in paired mode the location may be better resolved and the MAPQ may be higher. We don't care about location or MAPQ for the first round, so these benefits have no value.

You can verify this by comparing paired and unpaired mode in the benchmark tests.

Preserving full headers in pan-genome

Per following comment in https://github.com/ababaian/serratus/blob/master/notebook/200420_cov2_pangenome.ipynb

"seqkit destroys the original headers so now each 'chromosome' or input sequence is referred to by it's accession ID only."

Seqkit truncates at the first white space. You can prevent it from truncating labels by replacing spaces by a character which never appears in a defline such as '@' and reversing out the change at the end of the processing.

This can be implemented by a short awk script which replaces space by @ on lines matching ^>, and reversed out by a similar script which replaces @ by space.
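A Python equivalent of the suggested awk scripts, for illustration (the function names and '@' placeholder follow the suggestion above; both are hypothetical choices):

```python
def protect_deflines(fasta_text, placeholder="@"):
    """Replace spaces with a placeholder on '>' header lines only,
    so tools that truncate labels at whitespace (e.g. seqkit) keep
    the full defline intact."""
    return "\n".join(
        line.replace(" ", placeholder) if line.startswith(">") else line
        for line in fasta_text.splitlines()
    )

def restore_deflines(fasta_text, placeholder="@"):
    """Reverse protect_deflines after processing is finished."""
    return "\n".join(
        line.replace(placeholder, " ") if line.startswith(">") else line
        for line in fasta_text.splitlines()
    )
```

This assumes the placeholder character never appears in a real defline, as the comment above requires.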

Explore Nextflow as an Alternative Architecture

Nextflow is a system for defining and running Genomics pipelines. It supports a variety of Grid Computing systems, as well as running directly on EC2.

I skimmed the docs to see how it handles file splitting:

By default chunks are kept in memory. When splitting big files specify the parameter file: true to save the chunks into files. See the documentation for details.

There's no mention of buckets or any other mechanism for splitting work across instances. That means one of two things:

  • It's a good fit, but doesn't support the feature set we need. We should consider contributing to that project to benefit the scientific community as a whole.
  • OR it's not a good fit, it's designed for smaller jobs than what we're doing.

I'm a huge fan of the pipeline description language—imagine being able to plug their example code into a system like Serratus and have it magically go to extreme scale—this would have huge implications both inside and outside of bioinformatics. It would probably be worth it to spend a day or two implementing a pipeline with their system, just to see if it's all we're imagining it to be.

Automatically spin down when finished

This is an edge case, albeit a useful one, to the idea that our ASGs should grow and shrink dynamically.

Current behaviour: Optimize costs by manually setting desired_size to 0, when a step in the pipeline is finished.

Desired behaviour: The scheduler should detect this condition, and set the desired_size to 0 automatically.

Caveats: If we decide to add more SRA data afterwards, we'll have to manually reset the desired_size.

Max RAM in EC2 compute nodes

Per CONTRIBUTING: "The workhorse for serratus is currently the C5.large EC2 instance on AWS... Each node has 2 vCPU and 4 GB of memory".

My bvfilter requires 8GB RAM. Does this rule out the current implementation of bvfilter, or can we run larger nodes if the filter gives a worthwhile efficiency improvement?

Improve monitoring with Grafana / Prometheus

As things get bigger, it'll be nice to have some proper monitoring so we can detect issues more easily (and not have to waste time randomly clicking around on Cloudwatch). Things we'll want to plot:

  • Per instance metrics not included by Cloudwatch such as: disk usage, memory usage, and better resolution network and CPU stats.
  • Scheduler internals: number of accessions / blocks in each state.
  • ASG stats: number of instances in various states. Desired vs. Actual
  • S3 Stats: Data storage

This is definitely a nice to have, but I think it's something that should be put together sooner rather than later, especially if we're going to build other projects on the same framework.

Improve the HTML status page

Right now the HTML status page is a giant static table. It's better than querying raw SQL, but could be much better. Here are some things that come to mind:

  • Use less precise start time, and a time delta. Example "2020-03-38 4:58:03.2 +328.628624s" (Sub second precision is nice for finding things in the logs)
  • Group blocks by accessions, with accordion boxes.
  • Show the alignment progress (number of blocks finished/error/total) as part of each accession line.
  • Colour-code things (green for success, red for failure? Or maybe purple/yellow for our colourblind friends?)

This isn't a definitive list, just some things I would love to have. If you have other ideas for how to display what's going on, I'd love to see them! :D

Write a script to copy outputs

I always find myself typing echo yes | terraform destroy, then realizing I haven't saved something important and frantically trying to save everything as terraform destroys it all.

It would be nice to have a script that:

  • Downloads and saves the .bam, .bai, and flagstat outputs from S3
  • Downloads and saves a dump of the scheduler state
  • Downloads and saves a dump of the cloudwatch logs
  • ...and more.

Which we can run, and then do terraform destroy with confidence that we haven't forgotten something.

S3 Tier1 Requests are 5x higher than expected

S3 Tier1 requests include PUT/COPY/POST/LIST, and these are priced at $0.005 / 1000 requests.

Expected behaviour:

  • 1 PUT / block in serratus-dl (for the fq-blocks)
  • 1 PUT / block in serratus-align (for the bam-blocks)
  • Several LISTs in serratus-merge, but not more than 10 / accession.

Actual behaviour:

  • Dataset with ~1600 blocks results in ~10,000 Tier1 requests, which represents ~22% of overall cost at the moment.
