
dx-streaming-upload's Introduction

DNAnexus

DNAnexus Apps and Scripts

applets

  • binning_step0: BioBin Pipeline
    • biobin_pipeline
  • binning_step1: BioBin Pipeline
    • biobin_pipeline
  • binning_step2: BioBin Pipeline
    • biobin_pipeline
  • binning_step3: BioBin Pipeline
    • biobin_pipeline
  • impute2_group_join: Impute2_group_join
    • This app can be used to merge multiple imputed impute2 files
  • plato_biobin: PLATO BioBin Regression Analysis
    • PLATO_BioBin
  • vcf_batch: VCF Batch effect tester
    • vcf_batch

apps

  • association_result_annotation: Annotate GWAS, PheWAS Associations
    • association_result_annotation
  • biobin:
    • This app runs the latest development build of the rare variant binning tool BioBin.
  • generate_phenotype_matrix: Generate Phenotype Matrix
    • generate_phenotype_matrix
  • genotype_case_control: Generate Case/Control by Genotype
    • App provides case and control numbers for each genotype
  • impute2: imputation
    • This will perform imputation using Impute2
  • impute2_to_plink: Impute2 To PLINK
    • Convert Impute2 files to PLINK files
  • plato_single_variant: PLATO - Single Variant Analysis
    • App allows you to run a single-variant association test against a single phenotype (GWAS) or multiple phenotypes (PheWAS)
  • rl_sleeper_app: sleeper
    • This App provides some useful tools when working with data in DNAnexus. This App is designed to be run on the command line with "dx run --ssh RL_Sleeper_App" in the project containing the data you want to explore (use "dx select" to switch projects as needed).
  • shapeit2: SHAPEIT2
    • This app performs phasing using SHAPEIT2
  • strand_align: Strand Align
    • Strand Align prior to phasing
  • vcf_annotation_formatter:
    • Extracts and reformats VCF annotations (CLINVAR, dbNSFP, SIFT, SNPEff)
  • QC_apps subfolder:
    • drop_marker_sample: Drop Markers and/or Samples (PLINK)
      • drop_marker_sample
    • drop_relateds: Relatedness Filter (IBD)
      • drop_relateds
    • extract_marker_sample: Drop Markers and/or Samples (PLINK)
      • extract_marker_sample
    • maf_filter: Marker MAF Rate Filter (PLINK)
      • maf_filter
    • marker_call_filter: Marker Call Rate Filter (PLINK)
      • marker_call_filter
    • missing_summary: Missingness Summary (PLINK)
      • Returns missingness rate by sample
    • pca: Principal Component Analysis using SMARTPCA
      • pca
    • sample_call_filter: Sample Call Rate Filter (PLINK)
      • sample_call_filter

scripts

  • cat_vcf.py *
  • download_intervals.py *
  • download_part.py *
  • estimate_size.py *
  • interval_pad.py
    • This reads a BED file from standard input, pads the intervals, sorts them, and then outputs intervals guaranteed to be non-overlapping (a minimal sketch of this logic follows the list)
  • update_applet.sh *
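
A minimal sketch of the pad-sort-merge logic described for interval_pad.py, assuming a fixed pad size and simple 3-column BED input (this is an illustration, not the script itself):

import sys

def pad_and_merge(lines, pad=100):
    """Pad BED intervals, sort them, and merge overlaps so the output
    intervals are guaranteed to be non-overlapping."""
    intervals = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        intervals.append((chrom, max(0, start - pad), end + pad))

    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or abuts) the previous interval on the same chromosome: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return merged

if __name__ == "__main__":
    for chrom, start, end in pad_and_merge(sys.stdin):
        print("%s\t%d\t%d" % (chrom, start, end))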

sequencing

  • bcftools_view:
    • Calls "bcftools view". Still in experimental stages.
  • calc_ibd:
    • Calculates a pairwise IBD estimate from either VCF or PLINK files using PLINK 1.9.
  • call_bqsr: Base Quality Score Recalibration
  • call_genotypes:
    • Obsolete, do not use; use geno_p instead. Calls GATK GenotypeGVCFs.
  • call_hc:
  • call_vqsr:
  • cat_variants: combine_variants
    • Combines non-overlapping VCF files with the same subjects. A reimplementation of GATK CatVariants (GATK CatVariants available upon request)
  • combine_variants: combine_variants
  • gen_ancestry:
    • Determine Ancestry from PCA. Uses an eigenvector file and training dataset listing known ancestries. Runs QDA to determine posterior ancestries for all samples, even those in the training set.
  • gen_related_todrop:
    • Uses a PLINK IBD file to determine the minimal set of samples to drop in order to generate an unrelated sample set. Uses a minimum vertex cut algorithm on the graph of related samples to get this set.
  • geno_p:
  • merge_gvcfs:
  • plink_merge:
    • Merge PLINK bed/bim/fam files using PLINK 1.9
  • select_variants: VCF QC
  • variant_annotator: VCF QC
  • vcf_annotate: Annotate VCF File
    • Use a variety of tools to annotate a sites-only VCF.
  • vcf_concordance: VCF Concordance
  • vcf_gen_lof:
    • Subset a VCF from vcf_annotate based on the given annotations to get a sites-only VCF of loss-of-function variants.
  • vcf_pca:
    • Uses PLINK 1.9 and eigenstrat 6.0 to calculate principal components from VCF or PLINK bed/bim/fam files.
  • vcf_qc:
  • vcf_query:
    • Calls "bcftools query" to extract annotations from the VCF file. Used in the stripping of files for MEGAbase
  • vcf_sitesonly: VCF QC
    • Generates a sites-only file from full VCF files.
  • vcf_slice: Slice VCF File(s)
    • Return a small section of a VCF file (similar to tabix). For large output, many small regions, or subsetting samples, use subset_vcf instead.
  • vcf_summary: VCF Summary Statistics
    • Generate summary statistics for a VCF file (by sample and by variant)
  • vcf_to_plink:
    • Uses PLINK 1.9 to convert VCF files to PLINK bed/bim/fam files

dx-streaming-upload's People

Contributors

commandlinegirl, damien-black, davisfeng70, hartzell, jethror1, mhrvol, mlin, nainathangaraj, sclan, tomkinsc


dx-streaming-upload's Issues

Expose configuration for parameters of triggered pipeline

Email correspondence:

Can the config file expose parameters for the actual workflow that gets executed post-upload? Or is it only parameters for the uploader machinery that can go there? If you can specify workflow optional parameters, I'd love to have the illumina-demux --sequencing_center option customizable on a per-site basis.
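
For illustration only (nothing below exists in the role today): if the uploader's config carried a free-form input mapping for the downstream applet, incremental_upload.py could merge it into the dxpy run call, roughly like this. The downstream_input block, the record/applet/project IDs, and the key names are all hypothetical:

import dxpy

# Hypothetical per-site block read from the uploader's config file, e.g.
#   downstream_input: {"sequencing_center": "site-A"}
downstream_input = {"sequencing_center": "site-A"}

# Inputs the uploader computes itself (placeholder values here).
base_input = {"upload_sentinel_record": dxpy.dxlink("record-xxxx")}

applet = dxpy.DXApplet("applet-xxxx")                   # the applet triggered post-upload
job = applet.run({**base_input, **downstream_input},    # config-supplied values win on key clashes
                 project="project-xxxx")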

use full variable syntax for bare variables

In a few places bare variables are used in this role's playbook. When the playbook is run, Ansible warns that bare variables (ex. monitored_users) are deprecated, and should be replaced with jinja2-style variable syntax (ex. '{{monitored_users}}'). As of Ansible 2.1.0.0 the role playbooks still work, but they may not in the near future unless the bare variables are wrapped.

Clean up these uploads if error encountered?

Currently, we upload the RunInfo.xml and sample sheet in this script. We then call dx_sync_directory.py. If we encounter an error in that (say it throws an exception), then we will have not uploaded any files except these two. The next time the upload routine runs, we'll upload these two files again, encounter the error again, and it will carry on ad infinitum. Should we at least consider cleaning up these two files if the call to dx_sync_directory.py fails?

https://github.com/dnanexus-rnd/dx-streaming-upload/blob/8fec7ae047504b6b39996dd52e52dd684306e518/files/incremental_upload.py#L359-L362
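
A rough sketch of the suggested clean-up; upload_with_cleanup and the sync command are stand-ins for the real code paths in incremental_upload.py, not the actual function names:

import subprocess
import dxpy

def upload_with_cleanup(run_info_path, sample_sheet_path, project, sync_cmd):
    """Upload RunInfo.xml and the sample sheet, then run the sync step.
    If the sync step fails, remove the two files so the next attempt starts clean."""
    uploaded = []
    try:
        for path in (run_info_path, sample_sheet_path):
            dxfile = dxpy.upload_local_file(path, project=project, wait_on_close=True)
            uploaded.append(dxfile)
        subprocess.check_call(sync_cmd)    # e.g. the dx_sync_directory.py invocation
    except Exception:
        for dxfile in uploaded:
            try:
                dxfile.remove()            # drop the partial metadata uploads
            except dxpy.exceptions.DXError:
                pass                       # best-effort cleanup only
        raise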

Include timestamps in logs? Newlines for ua progress report?

Looking over the upload logs (monitor.log) for our install of this role, we've seen connection dropouts (unable to reach api.dnanexus.com) and various other transient issues. It would be helpful to have timestamps in the log file, similar to standard *nix syslog files.

Related to this, the logged progress lines for an active upload seem to use \r characters rather than \n, making it necessary to view the log with the carriage return characters replaced, via something like cat /home/miseq/monitor.log | tr '\r' '\n'. Ultimately, it looks like ua is intended for interactive operation, where a carriage return (\r) makes sense for updating the same line with the latest transfer information, but for headless/logged operation, newlines (\n) should really be used regardless of log verbosity.

Is the fix a simple matter of passing --verbose as part of the cronjob command, or will that result in too much noise in the log file? Would it make sense to add another arg option ("--log-output"?) to ua for non-interactive operation? The arg could be used in the ternary on this line in ua/main.cpp so newline characters are printed.

In case our issues are not due to connection dropouts, here are lines from monitor.log that relate to failed uploads:

Upload throttling is disabled.

Unable to connect to API server. Run 'ua --env' to see the current configuration.

Detailed message (for advanced users only):
DXConnectionError: 'Was unable to make the request: POST 'https://api.dnanexus.com:443/system/greet' . Details: '
*******
Error in using curl_easy_perform.
Error code (CURLcode) = 6
Error Message: 'Could not resolve: api.dnanexus.com (Could not contact DNS servers)'
********
'.', Curl error code = '6'
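
Until ua itself changes, one workaround would be a small filter in the cron pipeline that converts the carriage returns and prefixes timestamps. A rough sketch (the filename and cron wiring are illustrative, e.g. python monitor_runs.py ... 2>&1 | python timestamp_log.py >> monitor.log):

#!/usr/bin/env python
"""Read uploader output on stdin, treat \r like a line break, and
prefix each non-empty line with an ISO timestamp before writing it out."""
import sys
from datetime import datetime

def main():
    buf = ""
    while True:
        chunk = sys.stdin.read(1)
        if not chunk:
            break
        if chunk in ("\r", "\n"):
            if buf.strip():
                print("%s %s" % (datetime.now().isoformat(), buf))
                sys.stdout.flush()
            buf = ""
        else:
            buf += chunk

if __name__ == "__main__":
    main()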

Lock file created by playbook in tmpfs

When the playbook is run to set up a new instance of dx-streaming-upload it automatically creates a lock file in /var/lock/:

https://github.com/dnanexus-rnd/dx-streaming-upload/blob/a14c438db11c77b9c2e8ef5f693fa9b0fead338d/tasks/main.yml#L184-L186

Since /var/lock is a symlink to /run/lock, which is a tmpfs, this lock file gets cleared on every reboot, and dx-streaming-upload then fails when the file is missing at the start of the next hourly cron job. This requires either manually creating the required lock files again, or restarting dx-streaming-upload by running the playbook again.

This is not documented in the readme. Adding the same touch command from the playbook to /etc/rc.d/rc.local works on RedHat for creating the required lock files on boot. It would be good for this to be documented, or for dx-streaming-upload to recreate its own lock files if they do not already exist.
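
A minimal sketch of the suggested self-healing behaviour, assuming the cron wrapper knows the lock path it is about to flock (the path shown is illustrative):

import os

def ensure_lock_file(lock_path="/var/lock/dx-streaming-upload.lock"):
    """Recreate the lock file if /run/lock was wiped by a reboot,
    instead of letting the hourly cron job fail."""
    lock_dir = os.path.dirname(lock_path)
    if not os.path.isdir(lock_dir):
        os.makedirs(lock_dir)
    if not os.path.exists(lock_path):
        open(lock_path, "a").close()   # equivalent of the playbook's touch
    return lock_path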

Missing novaseq config from main.yml

During the Ansible set-up the config in the playbook is copied to the user config file, but the novaseq option is missing from it.

If monitor_runs.py is run via the command line and the novaseq key is added manually to the config file, then CopyComplete.txt is correctly picked up; but when running via Ansible this key is not set and the run never completes uploading.

I'm not sure about the logic of having a separate 'novaseq' option: could you not just check for either RTAComplete.txt or CopyComplete.txt in check_local_runs() of monitor_runs.py, and not need to specify novaseq as an input? This seems simpler and would achieve the same result of handling NovaSeq runs.

This novaseq option is also missing from the playbook options in the readme.
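
A minimal sketch of that check; check_local_runs() is the real function mentioned above, but this body only illustrates the suggestion and is not the current implementation:

import os

COMPLETION_MARKERS = ("RTAComplete.txt", "RTAComplete.xml", "CopyComplete.txt")

def run_is_complete(run_dir):
    """Treat a run as finished if any known completion marker exists,
    covering both older instruments and NovaSeq without a separate
    'novaseq' config flag."""
    return any(os.path.exists(os.path.join(run_dir, marker))
               for marker in COMPLETION_MARKERS)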

Stale run check in monitor_runs.py

From email correspondence:

In monitor_runs.py, the (previously monolithic) class of "unsynced runs" will be sub-divided into 3 classes. The script will attempt to trigger incremental_upload on runs of class 1, then class 2, and will not attempt to upload run folders of class 3.

  1. Completed run (has a RunInfo.xml and an RTAComplete.txt/xml file)
  2. In-progress run (has a RunInfo.xml file, but not an RTAComplete.txt/xml file)
  3. Stale run (has a RunInfo.xml file that was created more than Y times the expected duration of a sequencing run ago; both the sequencing runtime and Y will be user-specified, with tentative defaults of 3 and 24 hrs respectively)

In incremental_upload.py, once an upload has been triggered for an in-progress run (class 2 above), it will check whether the run goes stale, and time out if the run is not completed within T times the expected duration of the run. It will still potentially block the upload process for up to a maximum of Y times the expected run duration, however.
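
A rough sketch of that three-way classification (not the shipped code; I've read the tentative defaults as an expected runtime of 24 hrs with a multiplier Y of 3, and used RunInfo.xml's mtime as the creation time):

import os
import time

def classify_run(run_dir, expected_runtime_hrs=24, stale_factor=3):
    """Classify a run folder as 'complete', 'in_progress', or 'stale'.

    complete    -- RunInfo.xml plus an RTAComplete.txt/xml marker
    in_progress -- RunInfo.xml only, still within the expected window
    stale       -- RunInfo.xml older than stale_factor * expected runtime
    """
    run_info = os.path.join(run_dir, "RunInfo.xml")
    if not os.path.exists(run_info):
        return None  # not a run folder at all

    if any(os.path.exists(os.path.join(run_dir, m))
           for m in ("RTAComplete.txt", "RTAComplete.xml")):
        return "complete"

    age_hrs = (time.time() - os.path.getmtime(run_info)) / 3600.0
    if age_hrs > stale_factor * expected_runtime_hrs:
        return "stale"
    return "in_progress"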

Robustness of upload in low internet-bandwidth locale

Would there be an advantage for monitor_runs.py to pass --max-size (-M) to incremental_upload.py, based on a value in the monitor_runs.config file? As it is now, if a complete run directory is placed in a path watched by monitor_runs.py, a large tarball of the entire run is created. In such cases, will ua still upload chunks, or will it attempt to upload the entire large tarball?

In use cases like ours, where connection instability is an issue, we'd prefer to upload many small chunks so dropouts do not interrupt large file transfers.
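
A sketch of the pass-through being asked about; max_size is a hypothetical key in monitor_runs.config, while -M/--max-size is the existing incremental_upload.py flag referenced above:

def add_max_size(cmd, config):
    """Append the -M/--max-size cap to an incremental_upload.py command
    line when the monitoring config provides one."""
    max_size = config.get("max_size")       # hypothetical config key
    if max_size is not None:
        cmd = cmd + ["-M", str(max_size)]   # smaller tarballs mean smaller individual transfers
    return cmd

# e.g. add_max_size(["python", "incremental_upload.py", "..."], {"max_size": 500})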

Change install location to not be dependent on a user's home folder

Looking at the role more, this looks great! A couple of initial thoughts:

Currently in the work-in-progress branch, most files are copied to ~/. Perhaps the install location for the uploader should not be relative to the (ansible) user's home directory. Placing the files in /opt as per the Linux FHS might be appropriate if the permissions allow read/execute by the cronjob user.

Additionally, it might be nice to have the user specified for the cron jobs be a role variable (defaulting to root, or ansible_user? The variable ansible_ssh_user is deprecated in favor of ansible_user). Or maybe there should be the ability to specify multiple users, each with monitored directories they have read access to. Something like this (with a single user as the default in defaults/main.yml):

monitored_users:
  - username: user-one 
    monitored_directories:
      - /path/to/a/run/directory/parent/dir
      - /some/other/directory
  - username: user-two
    monitored_directories:
      - /miseqtwo/run/storage

Having multiple users might complicate dx token usage though, since the token persists in a user's home directory after logging in via dx login, and the cron user may or may not have logged in. Tokens could be passed in as part of the list of users, falling back to dx_token if one is not specified:

dx_token: "PROJECTTOKEN"
monitored_users:
  - username: user-one
    dx_user_token: "TOKENVALUE"
    monitored_directories:
      - /path/to/a/run/directory/parent/dir
      - /some/other/directory
  - username: user-two
    monitored_directories:
      - /miseqtwo/run/storage

Using the user list would require something like this:

# Logging into DNAnexus account
- name: Log in to DNAnexus account if token is provided
  shell: source ~/dx-toolkit/environment && dx login --token {{ item.dx_user_token if 'dx_user_token' in item else dx_token }} --noprojects
  become: yes
  become_user: "{{ item.username }}"
  args:
    executable: /bin/bash
  with_items: "{{ monitored_users }}"
  when: dx_token is defined

 #...

- name: set up CRON job to run every hour in deploy mode
  cron: >
    name="DNAnexus monitor runs (deploy)"
    special_time=hourly
    user="{{item.0.username}}"
    job="flock -n ~/dnanexus/cron.lock bash -ex -c 'source ~/dx-toolkit/environment; PATH=$PATH:~/dnanexus-upload-agent; python ~/dnanexus/scripts/monitor_runs.py -c ~/dnanexus/config/monitor_runs.config -p {{ upload_project }} -d {{item.1}} -v > ~/monitor.log 2>&1' > ~/dx-stream_cron.log 2>&1"
  with_subelements:
        - monitored_users
        - monitored_directories
  when: mode == "deploy"

Example debug output:

    - debug: msg="User {{item.0.username}} should monitor {{item.1}}"
      with_subelements:
        - monitored_users
        - monitored_directories

    - debug: msg="User {{ item.username }} has token {{ item.dx_user_token if 'dx_user_token' in item else dx_token }}"
      with_items: "{{monitored_users}}"
ok: [localhost] => (item=({u'username': u'user-one', u'dx_user_token': u'TOKENVALUE'}, u'/path/to/a/run/directory/parent/dir')) => {
    "item": [
        {
            "dx_user_token": "TOKENVALUE",
            "username": "user-one"
        },
        "/path/to/a/run/directory/parent/dir"
    ],
    "msg": "User user-one should monitor /path/to/a/run/directory/parent/dir"
}
ok: [localhost] => (item=({u'username': u'user-one', u'dx_user_token': u'TOKENVALUE'}, u'/some/other/directory')) => {
    "item": [
        {
            "dx_user_token": "TOKENVALUE",
            "username": "user-one"
        },
        "/some/other/directory"
    ],
    "msg": "User user-one should monitor /some/other/directory"
}
ok: [localhost] => (item=({u'username': u'user-two'}, u'/miseqtwo/run/storage')) => {
    "item": [
        {
            "username": "user-two"
        },
        "/miseqtwo/run/storage"
    ],
    "msg": "User user-two should monitor /miseqtwo/run/storage"
}

TASK [debug] *******************************************************************
ok: [localhost] => (item={u'username': u'user-one', u'monitored_directories': [u'/path/to/a/run/directory/parent/dir', u'/some/other/directory'], u'dx_user_token': u'TOKENVALUE'}) => {
    "item": {
        "dx_user_token": "TOKENVALUE",
        "monitored_directories": [
            "/path/to/a/run/directory/parent/dir",
            "/some/other/directory"
        ],
        "username": "user-one"
    },
    "msg": "User user-one has token TOKENVALUE"
}
ok: [localhost] => (item={u'username': u'user-two', u'monitored_directories': [u'/miseqtwo/run/storage']}) => {
    "item": {
        "monitored_directories": [
            "/miseqtwo/run/storage"
        ],
        "username": "user-two"
    },
    "msg": "User user-two has token PROJECTTOKEN"
}

Also, I'm not sure if this is something for the role, or for the DNAnexus project, but it might be nice to organize data on the DNAnexus side into folders based on the ansible_hostname, so we can quickly see which files have been uploaded from a given site. Something like:

  DNAnexus_project/
    node-3/
      runs/
      demux/
    {{ some_other_hostname }}/
      runs/
      demux/

Gracefully handle malformed run folders.

The general flow of this script is to monitor a folder for potential new Illumina runs, check to see if there is already a complete run uploaded to DNAnexus that matches those runs, and if not, schedule the runs to be uploaded.

Currently there is logic up front to get a list of potential run folders, and then once we have filtered down to the list of folders that need to be uploaded, we iterate through that list and upload.

During the process of identifying which folders need to be synced, we do some sanity checking. If we fail the sanity checks, the whole process terminates before even uploading any files.

Instead, we should check each folder independently, note folders that are problematic, but continue on with folders that are well formed.

Here are two such checks:
https://github.com/dnanexus-rnd/dx-streaming-upload/blob/8fec7ae047504b6b39996dd52e52dd684306e518/files/monitor_runs.py#L323
https://github.com/dnanexus-rnd/dx-streaming-upload/blob/8fec7ae047504b6b39996dd52e52dd684306e518/files/monitor_runs.py#L323
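
A generic sketch of that per-folder behaviour; sanity_check, upload_run and log are stand-ins, not functions from monitor_runs.py:

def sync_run_folders(folders, sanity_check, upload_run, log):
    """Check and upload each run folder independently; a malformed folder
    is logged and skipped instead of aborting the whole sync pass."""
    failed = []
    for folder in folders:
        try:
            sanity_check(folder)      # raises on malformed run folders
            upload_run(folder)
        except Exception as exc:
            log("Skipping malformed or failing run folder %s: %s" % (folder, exc))
            failed.append(folder)
    return failed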

A couple of beginner's questions

  1. Is there anything other than the explicit reference to the Ubuntu version of the dx-toolkit tarball in tasks/main.yml that makes this Ubuntu specific?

  2. Are the monitored_directories being written by the sequencers using the Run Copy Service?

  3. I'm interested in running this in support of a MiniSeq. Any gotchas I should watch out for?

Thanks
