bowtie-scaling's Introduction

Scaling read aligners to hundreds of threads on general-purpose processors

This repository contains scripts used to drive the experiments and compile the figures and tables for the manuscript "Scaling read aligners to hundreds of threads on general-purpose processors." All relevant scripts are in the thread_scaling/scripts subdirectory.

Generating reads

Links for downloading the reads are in Supplementary Note 2. The process of generating the reads involves downloading the source read files, sampling 100M reads from each, and randomizing the overall order of the reads.
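As a rough illustration of the sample-then-shuffle step, here is a minimal sketch that uses reservoir sampling over 4-line FASTQ records. It is not the code in reads.py; the function names, the seed parameter, and the choice of reservoir sampling are assumptions made for the example.

```python
import random
from itertools import islice


def fastq_records(lines):
    """Yield 4-line FASTQ records from an iterable of lines.

    A trailing partial record (fewer than 4 lines) is dropped.
    """
    it = iter(lines)
    while True:
        rec = tuple(islice(it, 4))
        if len(rec) < 4:
            return
        yield rec


def sample_and_shuffle(records, k, seed=0):
    """Reservoir-sample up to k records in one pass, then shuffle them."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    rng.shuffle(reservoir)
    return reservoir
```

Reservoir sampling keeps only k records in memory and needs a single pass over the input, which matters when the source files are much larger than the sample.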

The script we used to generate these shuffled samples is:

  • reads.py

The scripts used to submit SLURM jobs that run this script repeatedly and then concatenate the results are:

  • reads.sh
  • reads_cat.sh

Read file sizes were measured with ls -l and these are reported in Supplementary Table 2.

Thread scaling experiments

Running times were measured for all thread counts and for every combination of (a) configuration (aligner and arguments), (b) system (KNL or Broadwell), and (c) paired-end status. Results are shown in Figures 3-5, Tables 2-4 and Supplementary Figures 1-3. Important scripts driving this process are:

  • master.py master script for driving one or more configurations through a complete series of tests. Handles building the various configurations with appropriate preprocessor macros. Also handles preparing the read files for each run, conducting the runs, running top and/or iostat in the background during runs to collect system measurements, and killing runs when the time limit is exceeded.
  • stampede_knl/*.sh SLURM scripts for driving all the KNL-based configurations. These scripts depend on and invoke common.sh.
  • marcc_lbm/*.sh SLURM scripts for driving all the Broadwell-based configurations. These scripts depend on and invoke common.sh.
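The pattern of running a job under a time limit while a monitor collects measurements in the background can be sketched as follows. This is an illustration only, not the logic in master.py; the function name and its parameters are hypothetical, and the monitor command is whatever the caller supplies.

```python
import subprocess


def run_with_limit(cmd, time_limit_s, monitor_cmd=None, monitor_log=None):
    """Run cmd and kill it if it exceeds time_limit_s seconds.

    If monitor_cmd is given, it runs in the background for the duration of
    the job with its stdout written to monitor_log.
    Returns (returncode, timed_out).
    """
    monitor = None
    log = None
    if monitor_cmd is not None:
        log = open(monitor_log, "w")
        monitor = subprocess.Popen(
            monitor_cmd, stdout=log, stderr=subprocess.DEVNULL
        )
    try:
        proc = subprocess.Popen(cmd)
        try:
            proc.wait(timeout=time_limit_s)
            timed_out = False
        except subprocess.TimeoutExpired:
            proc.kill()   # time limit exceeded: stop the run
            proc.wait()   # reap the killed process
            timed_out = True
        return proc.returncode, timed_out
    finally:
        # Always stop the monitor, even if the job failed to start.
        if monitor is not None:
            monitor.terminate()
            monitor.wait()
        if log is not None:
            log.close()
```

A caller might pass something like `["top", "-b", "-d", "1"]` as the monitor command on a system with procps top; the exact monitoring invocation used in the experiments is not shown here.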

Important configuration files governing these experiments are in .tsv files. Each line of each file defines the repository, tag, preprocessor macros, aligner command-line arguments, and multithreading/multiprocessing balances to use for a configuration. Specifically:

  • bt_base.tsv defines the configurations for the Bowtie lock-type experiments described in Figure 3/Table 2.
  • bt.tsv defines the configurations for all other Bowtie experiments, as described in Figures 4 and 5 and Tables 3 and 4.
  • bt2_base.tsv like bt_base.tsv but for Bowtie 2.
  • bt2.tsv like bt.tsv but for Bowtie 2.
  • ht_base.tsv like bt_base.tsv but for HISAT.
  • ht.tsv like bt.tsv but for HISAT.
  • bwa.tsv defines the configurations for the BWA-MEM experiments described in Figure 5/Table 4.

These configurations are also described in Supplementary Note 1.
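For readers who want to inspect these files programmatically, a minimal loader might look like the sketch below. The column order and field names in `Config` are assumptions for illustration; check them against the actual .tsv files before relying on this.

```python
import csv
from typing import NamedTuple


class Config(NamedTuple):
    # Hypothetical column order; the real .tsv layout may differ.
    repo: str      # repository to build from
    tag: str       # tag to check out
    macros: str    # preprocessor macros
    args: str      # aligner command-line arguments
    balance: str   # multithreading/multiprocessing balance


def read_configs(handle):
    """Parse tab-separated configuration lines into Config tuples.

    Blank lines and lines starting with '#' are skipped (an assumption).
    """
    configs = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for lineno, row in enumerate(reader, start=1):
        if not row or row[0].startswith("#"):
            continue
        if len(row) != len(Config._fields):
            raise ValueError(
                "line %d: expected %d fields, got %d"
                % (lineno, len(Config._fields), len(row))
            )
        configs.append(Config(*row))
    return configs
```

Failing loudly on a wrong field count makes a mismatch between the assumed and actual layout obvious instead of silently shifting columns.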

The thread count series used in the experiments are in:

  • marcc_lbm/thread_series.txt for all Broadwell series
  • stampede_knl/thread_series.txt for all KNL series

Tabulating and plotting running time versus thread count

The KNL and Broadwell experiments write results to the stampede_knl/results and marcc_lbm/results subdirectories. These are tabulated into CSV files using the script:

  • tabulate.py

These CSV files are then used as inputs to the scaling_results.Rmd R Markdown notebook. We then run the R Markdown notebook to generate all the thread scaling plots. To find the code for generating these plots, look in the following named code blocks in scaling_results.Rmd:

  • baseline_plots_all
  • baseline_plots_all_unp
  • baseline_plots_all_pe
  • parsing_plots_all
  • parsing_plots_all_unp
  • parsing_plots_all_pe
  • final_plots_all
  • final_plots_all_unp
  • final_plots_all_pe

Tabulating peak throughputs

Using the same data used to generate Tables 2-4 and Supplementary Tables 1-3, we used the peak_throughput_table code block in the thread_scaling/scripts/scaling_results.Rmd R Markdown notebook to compile a master table giving the peak throughput for every combination of configuration, system and paired-end status.
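The idea behind such a table can be sketched in a few lines of Python. This is not the notebook's R code; the row keys (`config`, `system`, `paired`, `threads`, `reads`, `seconds`) are hypothetical, and throughput is assumed here to mean reads processed per second.

```python
def peak_throughput(rows):
    """Return {(config, system, paired): (peak reads/sec, threads at peak)}.

    Each row is a dict describing one run at one thread count.
    """
    peaks = {}
    for r in rows:
        key = (r["config"], r["system"], r["paired"])
        tput = r["reads"] / r["seconds"]
        # Keep the best throughput seen so far for this combination.
        if key not in peaks or tput > peaks[key][0]:
            peaks[key] = (tput, r["threads"])
    return peaks
```

The thread count is kept alongside the peak so that the table can also report where along the series the peak occurred.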

Measuring peak memory footprint

Since top is run in the background during thread scaling experiments, we can parse the top log to find the peak resident set size, as plotted in Supplementary Figure 4. The script for doing this is:

  • thread_scaling/scripts/peak_res.py
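A minimal sketch of this kind of parsing is shown below. It is not the code in peak_res.py. It assumes the log holds process lines in the default procps top layout (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND), with RES in KiB unless it carries an m, g or t suffix; the function names are hypothetical.

```python
# Multipliers that convert a suffixed RES value to KiB.
_UNITS = {"k": 1, "m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}


def res_to_kib(field):
    """Convert a RES field such as '204800', '512m' or '1.5g' to KiB."""
    suffix = field[-1].lower()
    if suffix in _UNITS:
        return float(field[:-1]) * _UNITS[suffix]
    return float(field)


def peak_res_kib(lines, command):
    """Return the largest RES (in KiB) seen for processes named command."""
    peak = 0.0
    for line in lines:
        parts = line.split()
        # Process lines start with a numeric PID and have at least 12 fields.
        if len(parts) < 12 or not parts[0].isdigit():
            continue
        if parts[11] != command:
            continue
        peak = max(peak, res_to_kib(parts[5]))
    return peak
```

Summary and header lines are skipped because their first field is not a numeric PID. If the logs were written with a different top configuration, the field positions would need adjusting.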

Reads per thread

The number of reads per thread used in each experiment, shown in Supplementary Table 1, was determined manually, with the goal of making every run last a minute or longer. These numbers were then coded into the scripts in thread_scaling/scripts/stampede_knl for the KNL experiments and thread_scaling/scripts/marcc_lbm for the Broadwell experiments.

Miscellaneous

  • check_blocked.py sanity-checks a file with padding appropriate for L-parsing.
  • get_reads.sh downloads all the read files from the links given in Supplementary Note 2. The files are downloaded compressed and must be decompressed before running the experiments.

Contributors

benlangmead, christopherwilks, val-antonescu