miptools's Issues

Unit testing and automation

It would be great to add some unit testing at this stage. This can drastically cut down the time spent finding errors and tracking down their source, so that we can make revisions to the testing branch with confidence.
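As a concrete starting point, a minimal pytest file could exercise small pure helpers such as the probe/sample set parsing. The file and function names below are hypothetical, and the helper is a simplified sketch of the parsing logic, not the actual MIPTools code:

```python
# test_parsing.py -- minimal pytest sketch; names are hypothetical.


def clean_sets(raw: str) -> set:
    """Split a comma/semicolon separated list and strip stray whitespace."""
    parts = []
    for chunk in raw.split(","):
        parts.extend(chunk.strip().split(";"))
    return {p.strip() for p in parts if p.strip()}


def test_clean_sets_handles_spaces():
    assert clean_sets("DR1, VAR4") == {"DR1", "VAR4"}


def test_clean_sets_handles_semicolons():
    assert clean_sets("a;b, c") == {"a", "b", "c"}
```

Running `pytest` on files like this in CI would catch regressions in the string-handling code that has caused several of the bugs below.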

Documentation fixes for v0.4.0

This issue serves as a central place to list changes that need to be made for the stable release of the documentation (v0.4.0). The current documentation for this release is copied over from the development documentation, so in many cases, we need to remove references to new capabilities. The source code for these docs can be found in the v0.4.0-prod-docs branch.

  • Remove the download app. v0.4.0 of MIPTools only contains the original app for downloading data from BaseSpace, which runs a Python script.
  • Remove additional arguments from the wrangler app that refer to downsampling of UMIs.
  • Remove discussion about freezing software versioning on the installation page.
  • Update the license year
  • Remove -h flag from all man pages

mamba install freezes

Bug Description

When trying to build the container, mamba freezes when trying to install packages.

Reprex

gh repo clone bailey-lab/miptools
sudo singularity build --force miptools_dev.sif MIPTools.def

Additional Information

  • When printing info during the mamba installation, mamba seems to fail while installing openssl. It prints the following message: info libmamba Parsing MatchSpec openssl and never moves on from this.
  • I have attempted to build the development version and the latest stable release (v0.4.0) and both never move past this line.
  • I have also attempted to change mamba versions to no avail.

cc @aydemiro. Do you have any experience with this problem?

Optimal downsampling of UMIs during wrangling

Related Problem

When performing large-scale sequencing, the input for certain samples, and for particular MIPs within them, can be extremely deep (many reads for a given MIP in a given sample). This occurs when controls are repeatedly sequenced and merged together. The best place to subsample to reduce depth is after UMI determination and correction. The following script does the subsampling:

https://github.com/bailey-lab/MIPTools/blob/master/src/wrangler_downsample_umi.py

However, the subsampling is random, which is not optimal: it would be preferable for it to be deterministic. Also, the UMIs with the most read support are the best sequences to retain when subsampling.

Solution Requested

Modify the downsampling script to sort UMI sequences deterministically by the number of supporting reads and then trim off those with lower read support whenever the number of UMI sequences exceeds the input threshold.
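A minimal sketch of the requested behavior, assuming UMI counts are available as a mapping from UMI sequence to supporting-read count (names are illustrative, not the actual script's API):

```python
def downsample_umis(umi_counts: dict, max_umis: int) -> list:
    """Deterministically keep the max_umis best-supported UMIs.

    Sort by supporting-read count descending, breaking ties by sequence
    so repeated runs on the same input always return the same result.
    Hypothetical helper sketching the proposed change.
    """
    ranked = sorted(umi_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [umi for umi, _count in ranked[:max_umis]]


# Keep the 2 best-supported UMIs out of 3.
print(downsample_umis({"AAA": 5, "CCC": 9, "GGG": 1}, 2))  # -> ['CCC', 'AAA']
```

Because the sort key is fully determined by the input, rerunning the wrangler on the same data would always select the same UMI subset.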

Describe alternatives you've considered
I am not sure there is really justification for alternatives, unless one can argue for exploring the effect of suboptimally selecting UMI sequences.

Wrangler app bad substitution error

Bug Description

In #26, we implemented checks to ensure that arguments passed to the wrangler app had no spaces in them. Our solution relies on parameter expansion. Crucially, this is a non-POSIX feature. While our container does run using bash, this error would indicate that the parameter expansion is not working properly.

Reported by @iek.

Specify version number for installed software

We may want to consider fixing the versions of the installed software in our container. I know that this has been mentioned by @JeffAndBailey a couple of times in conversation. I am not sure that we need to fix the versions for every piece of software installed in the container but it may be useful to fix the versions for larger pieces of software.

Currently, we install software using apt-get, conda, mamba, git, wget, and some R packages via install.packages(). Of these, my initial thought would be to update system packages (installed via apt-get) on build but fix the versions of all other software.


The software we install with each system can be found below...

conda

  • mamba

mamba

  • r
  • python
  • bcftools
  • samtools
  • vcftools
  • freebayes
  • gatk4
  • a ton of other packages...

git

wget

install.packages()

Installing bcl2fastq without rebuilding the container

Currently, we advise users that if they would like to demultiplex files they must first download bcl2fastq, move the .zip file into the programs directory, and then re-build the container. It would be convenient if users were able to add custom software into the container without needing to re-build it, perhaps by binding to a certain location within the container.

This issue was discussed previously by @JeffAndBailey in #32 (comment) and by @aydemiro in #32 (comment):

Let's see if we can download and build bcl2fastq externally and install it as a working version with any needed libraries or accessory files. If that is possible, then our fixed builds sans bcl2fastq will be fine for reproducibility.

Originally posted by @JeffAndBailey in #32 (comment)

As for the bcl2fastq issue, I agree that we should explore building the software outside and providing the binary to the container as a binding. However, this is a compiled C++ program, and how to create a portable binary is beyond my capabilities at the moment. Nick is probably the best person to consult on this.

Originally posted by @aydemiro in #32 (comment)

Arguments with spaces cause errors

Bug Description

When arguments contain spaces, argparse is unable to parse the spaces and causes our code to crash. I encountered this error when attempting to run through the analysis test run. When running through the first section of the tutorial, the user is asked to specify the probe sets used before running the wrangler app. In the example, we write

probe_sets_used="DR1,VAR4"

However, if the user adds a space in between these two probe sets:

probe_sets_used="DR1, VAR4"

and tries to run the wrangler app using the following:

singularity run --app wrangler \
    -B $project_resources:/opt/project_resources \
    -B $fastq_dir:/opt/data \
    -B $wrangler_dir:/opt/analysis \
    $miptools -p $probe_sets_used -s $sample_sets_used -e $experiment_id \
    -l  $sample_list -c $cpu_number -m $min_capture_length

the app will fail with the following error: generate_wrangler_scripts.py: error: unrecognized arguments: VAR4.

Our code does aim to deal with these cases by cleaning up spaces in the following lines:

# get sample sets and probe sets to be processed.
# sample sets should be provided as a comma separated text
# sometimes semicolon is used and spaces may be included by mistake
# we'll check for these potential mistakes and create a list of sample sets
# to process.
sam_set = args["sample_sets"]
sam_set = sam_set.split(",")
sample_sets = []
for s in sam_set:
    sample_sets.extend(s.strip().split(";"))
sample_sets = set([s.strip() for s in sample_sets])
# probe sets are provided and will be processed similarly to sample sets
pr_set = args["probe_sets"]
pr_set = pr_set.split(",")
probe_sets = []
for p in pr_set:
    probe_sets.extend(p.strip().split(";"))
probe_sets = set([p.strip() for p in probe_sets])

However, the code fails before we are able to reach this stage. It seems that the error is triggered by argparse here:

args = vars(parser.parse_args())

Suggested Implementation

Seeing as we are unable to deal with spaces within the python file, I suggest we deal with any spaces and misformatted arguments within the wrangler app before feeding in our arguments to the python script. We can do simple string manipulation in bash as outlined here. For instance, to remove whitespace:

probe_sets_used="DR1, VAR4"
echo ${probe_sets_used// /}
#> DR1,VAR4
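If we ever prefer to do the cleanup on the Python side instead (a hypothetical alternative; the suggestion above does it in bash before argparse ever sees the value), the equivalent one-liner is:

```python
# Strip all whitespace from a comma-separated option value before parsing.
probe_sets_used = "DR1, VAR4"
cleaned = probe_sets_used.replace(" ", "")
print(cleaned)  # -> DR1,VAR4
```

Note, however, that this only helps once argparse has already accepted the value as a single argument, which is exactly what fails here; that is why the bash-side fix is suggested.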

Does v1.0.0 exist?

Bug Description

Does v1.0.0 exist?

Expected Behavior

You should be able to download a version that is linked from the public README, but this version does not appear to exist.

Reprex

Currently, the README says:

Download the latest stable release

Running singularity pull library://apascha1/miptools/miptools:v1.0.0 fails:

(base) [jbailey5@pbaileywebcit deleteme]$ singularity pull library://apascha1/miptools/miptools:v1.0.0
FATAL: While pulling library image: error fetching image: unable to download image: error downloading image: context canceled
(base) [jbailey5@pbaileywebcit deleteme]$ singularity pull library://apascha1/miptools/miptools:v1.0.0
FATAL: While pulling library image: error fetching image: image does not exist in the library: apascha1/miptools/miptools:v1.0.0 (amd64)

Wrangler: `--stitch-options` parsing

In #26, we addressed the fact that arguments might contain spaces by parsing the strings to remove whitespace. Following the clarification of the documentation for the --stitch-options (-x) argument of wrangler (#27), it makes sense to do some additional string parsing to ensure the arguments are passed in the correct format. Namely, in order to feed in arguments, we must have a leading comma. We can use an if statement to check if a leading comma is present and if it is not, add it.
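That check is simple to sketch (the helper name is hypothetical, not existing app code):

```python
def normalize_stitch_options(opts: str) -> str:
    """Ensure the stitch-options string has the leading comma the wrangler
    expects; add one only if it is missing. Hypothetical helper."""
    if not opts or opts.startswith(","):
        return opts
    return "," + opts


print(normalize_stitch_options("--overWriteDirs,--overWriteLog"))
# -> ,--overWriteDirs,--overWriteLog
```

The bash equivalent inside the app would be a similar `[[ $opts == ,* ]]` test before the value is forwarded to the script.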

`MIPWrangler` and `elucidator` shared libraries error

Bug Description

I have been unable to use MIPWrangler and elucidator using the development version of MIPTools.

Steps to Reproduce

After downloading the development version, run:

$ singularity shell miptools_dev.sif
Singularity> MIPWrangler
#> MIPWrangler: error while loading shared libraries: libbamtools.so.2.5.2: cannot open shared object file: No such file or directory

Expected Behavior

MIPWrangler and elucidator should work within singularity shell and within any apps that call the two programs.

Additional Notes

I suspect that this may have been caused by 26eca75 where we deleted the source code after installation of the binary. It may be the case that in order for these programs to run the binary needs access to the source code or other build artifacts.

Add `-h` flag to apps

In an effort to improve docs, we should add the -h flag to all the MIPTools apps. Then the user can type the flag and get a nicely formatted man page. An example format that we could use is below:

Description

Usage:
singularity run [options] --app [app_options]

Options:
See 'singularity run'.

App Options:
-a Required. Description
-b Optional argument. Default value.
-h Print the help page.

Examples:
# Set paths
$ resource_dir=/bin/MIPTools/base_resources
$ dir=/work/usr/example

# Run app
$ singularity run \
    -B $resource_dir:/opt/resources -B $dir:/opt/analysis \
    --app -h

Remove unused documentation branches

Maintenance Request

Now that we have ported over the documentation for this project to Read the Docs, we should clean up this project by removing outdated branches.

We can likely remove:

  • The prodution-docs branch which was used for storing the source code for the production documentation. At this point in time, this was v0.4.0.
  • The prod-gh-pages GitHub page branch which was the associated GitHub pages branch with the production docs.
  • The master gh-pages branch which had the GitHub pages information for the master branch.
  • Workflows for building the docs for GitHub pages.

Failure to assign replicates

Bug Description

Following the fixes in #22 where the sample sheet preparation was updated, I have been unable to generate a sample sheet without running into errors. Namely, the error occurs when the script attempts to assign a replicate number to each grouped sample and sample set here:

sample_sheet["Replicate"] = sample_sheet.groupby(
    ["sample_name", "sample_set"])["replicate"].transform(
    assign_replicate).astype(int)

Digging into the code a bit more, the new replicate column is filled with NaN. Somehow, the assign_replicate function is generating NaN instead of the true replicate number. This causes the astype(int) command to fail, as it cannot coerce NaN into an integer.
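For illustration, the failure mode can be reproduced in a few lines with made-up data (this is not the real sample sheet): any NaN in the column makes astype(int) raise.

```python
import numpy as np
import pandas as pd

# Made-up sample sheet fragment; one missing replicate produces NaN.
sample_sheet = pd.DataFrame({
    "sample_name": ["A", "A", "B"],
    "sample_set": ["S1", "S1", "S1"],
    "replicate": [1.0, 2.0, np.nan],
})

try:
    sample_sheet["replicate"].astype(int)
except ValueError as err:  # pandas raises a ValueError subclass on NaN
    print("astype(int) failed:", err)
```

So the astype(int) error is a symptom; the root cause is whatever makes assign_replicate (or its grouped input) produce NaN in the first place.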

Steps to Reproduce

  1. Build or download the development version of the container.
  2. Isolate a capture plate and sample plate from a sequencing run.
  3. Run:
    $ singularity exec -B /work/apascha1/sample-sheet-prep:/opt/analysis \
      miptools_dev.sif python /opt/src/sample_sheet_prep.py \
      --capture-plates capture_plates.tsv --sample-plates sample_plates.tsv --output-file samples.tsv

This should raise the following exception:

Exception: Error in assigning replicates. Please make sure the 'sample_name' and 'sample_set' fields have valid, non-empty values in all provided files.

Expected Behavior

There should be no NaNs in the replicate column after assigning replicates.


@aydemiro Do you have any idea what might be going wrong in the assign_replicate function?

Replace jupyter notebooks with scripts

Jupyter notebooks contain segmented python scripts with their outputs. Currently, they are used for many analysis steps for which simple scripts are better suited. Ideally, they should be reserved for plotting and quality control, and elsewhere replaced with scripts.

Downloading data from BaseSpace

Proposal

Currently, we use an outdated python script to download data from BaseSpace: https://github.com/bailey-lab/MIPTools/blob/70c9c26cd86af33f5eb75bdd0c4c43edfedc26d4/bin/BaseSpaceRunDownloader_v2.py. BaseSpace has released tools to work with data on the CLI. To prevent code breakage down the line, we propose using BaseSpace's tools.

Working With BaseSpace CLI

Below, I discuss some of my thoughts in reading through the BaseSpace documentation.

Installation

Installation is straightforward, however, we may consider changing the installation location. We may also need to change file permissions using chmod.

# Install
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $HOME/bin/bs

# Change file permission
chmod u+x $HOME/bin/bs

Authentication

Interactively, the user can run the authentication command and then go to the URL provided to sign in.

bs auth
#> Please go to this URL to authenticate:  https://basespace.illumina.com/oauth/device?code=6Cesj

The resulting config file will be stored in $HOME/.basespace.

However, there are a couple of additional factors to consider. We need to think about the best way to automate this process. A couple of notes to consider:

  1. We are able to specify the API server. It may be useful to let the user customize this depending on where they are located (e.g., US vs UK).
  2. The user can store config info in a file and then load it: bs load config. This may make it easier to inject credentials.

Downloading data

Downloading data is simple, but there are many options. What is the best strategy to implement for our purposes?

# Single run
bs download run -i <RunID> -o <output>

# Multiple runs in a project
bs download project -i <ProjectID> -o <output>

# Subset of a project
bs download project -i <ProjectID> -o <output> --extension=fastq.gz

Implementation

We can install the CLI tool into our container. We will need to modify the download app to call either a series of commands or a script rather than the python script. We can provide several options with default values:

Flag              Function
-s, --api-server  the API server
-i, --run-id      run ID
-o, --out-dir     output dir
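A sketch of how the download app might assemble the CLI call from those options. The helper name is hypothetical; --api-server is a bs global option, but double-check the BaseSpace CLI docs before relying on the exact flag spelling:

```python
def build_bs_command(run_id: str, out_dir: str, api_server: str = None) -> list:
    """Assemble a BaseSpace CLI invocation from the proposed app options.

    Hypothetical helper illustrating the mapping from app flags to the
    bs CLI; not existing MIPTools code.
    """
    cmd = ["bs", "download", "run", "-i", run_id, "-o", out_dir]
    if api_server:
        cmd += ["--api-server", api_server]
    return cmd


print(build_bs_command("12345", "fastq/"))
# -> ['bs', 'download', 'run', '-i', '12345', '-o', 'fastq/']
```

The app would then hand this list to something like subprocess.run, so no shell quoting issues arise even if paths contain spaces.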

Checks on user inputs

Related Problem

In the notebook analyses, users directly provide filenames, options, and settings. Typos and errors can introduce issues further downstream, at which point debugging can be difficult.

Solution Requested

All possible values should be printed before input is accepted. User input should be checked for potential issues immediately after it is accepted.
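A minimal sketch of such a guard (the function name and message wording are hypothetical):

```python
def check_choice(value: str, allowed: set, name: str = "option") -> str:
    """Print the possible values, then validate the user's choice
    immediately after it is accepted. Hypothetical input guard."""
    choices = sorted(allowed)
    print(f"Possible values for {name}: {choices}")
    if value not in allowed:
        raise ValueError(
            f"Invalid {name} {value!r}; expected one of {choices}")
    return value


check_choice("DR1", {"DR1", "VAR4"}, name="probe set")  # accepted
```

Failing fast like this turns a confusing downstream crash into an immediate, self-explanatory error at the point where the typo was made.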

Human readable printing

It would be helpful to have human-readable formatting when info is printed about demux. The lines of interest are below:

MIPTools/src/demux_qc.py

Lines 71 to 75 in 37664d5

print(("Total number of raw reads and reads passing filter were "
       "{0[NumberOfReadsRaw]} and {0[NumberOfReadsPF]}, "
       "respectively.").format(fsums.sum()))

MIPTools/src/demux_qc.py

Lines 90 to 92 in 37664d5

print(("There were {} undetermined reads. {} of these belong to "
       "possible primer pairs.").format(dsums["Read Count"].sum(),
                                        caught["Read Count"].sum()))

To implement this, we could use "{:,}".format(number).
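Python's format mini-language already supports a thousands separator, so the change is a one-character format spec:

```python
# Thousands separator via the format mini-language.
number = 1234567
print("{:,}".format(number))  # -> 1,234,567
print(f"{number:,}")          # equivalent f-string form
```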

Commas Considered Harmful

Bug Description

The pipeline produces sample sheets associated with wrangler files, and these might contain commas, have missing values, or contain quotation marks. The analysis notebooks misread these files and report that 0 haplotypes have been found. This can cause errors further down the pipeline, so researchers manually edit entries with commas to avoid the issue.

Expected Behavior

Ideally, any output from one tool should be accepted as input to the next without modification. For now, it may be sufficient to include a warning in the analysis notebooks, since the error message downstream doesn't indicate the cause of the issue.

Reprex

I was not able to replicate this exact issue, but several researchers have experienced a problem here.

# insert reprex here


Test data set analysis crashes

Bug Description

The test analysis template found here is outdated. Due to changes in pandas, one of the code chunks in the call genotypes and prevalences section will crash:

mutant_index = targeted_prev.columns.droplevel(["Gene", "Mutation Name", "ExonicFunc"])
expected_freq = expected_freq.loc[mutant_index]
i_map = {(i[0], i[4]): i for i in targeted_prev.columns}
expected_freq.index = pd.MultiIndex.from_tuples(
    [i_map[i] for i in expected_freq.index],
    names=targeted_prev.columns.names)
expected_freq.fillna(0, inplace=True)

#> KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike

Updating the Test Data Analysis

Comparing the test data set analysis file with the current analysis template, the two files are quite different from one another. It may be worthwhile to update the test data analysis file using the provided template. This would go hand in hand with the efforts in 69ff95f, which provides improved documentation on analyzing the test data.

Freebayes fails with 1 thread

Bug Description

When the freebayes_caller.py script is called from within the container with the default settings, the script crashes. I traced this to the number-of-threads option: when the number of threads is set to 1 (the default), the script fails. This likely means that freebayes_call() is unable to run with 1 thread and that some internal function is causing the bug.

Suggested Solution

We should set the default number of threads to a different number. We could use 20, which is the default in the Jupyter notebooks.
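The change could be as small as the argparse default. A sketch under that assumption (this is not the actual freebayes_caller.py source, and the flag spelling is illustrative):

```python
import argparse

# Default the thread count to 20 instead of the failing value of 1.
parser = argparse.ArgumentParser(description="freebayes caller sketch")
parser.add_argument("-t", "--threads", type=int, default=20,
                    help="number of threads to use (default: 20)")

args = parser.parse_args([])
print(args.threads)  # -> 20
```

This papers over the symptom, though; the underlying single-thread code path should still be investigated.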

Reprex

Follow the analysis pipeline guide and run through the Jupyter app until you reach the Freebayes step. At this point, you can run the freebayes_caller.py script.

# This command fails
singularity exec \
    -B test-data/DR1_project_resources:/opt/project_resources \
    -B test-data/pf_species_resources:/opt/species_resources \
    -B test-data/wrangler:/opt/data \
    -B test-data/variant:/opt/analysis \
    miptools_v0.4.0.sif \
    python /opt/src/freebayes_caller.py --threads=1

# This command works
singularity exec \
    -B test-data/DR1_project_resources:/opt/project_resources \
    -B test-data/pf_species_resources:/opt/species_resources \
    -B test-data/wrangler:/opt/data \
    -B test-data/variant:/opt/analysis \
    miptools_v0.4.0.sif \
    python /opt/src/freebayes_caller.py --threads=2

Build and Deploy Image

This issue will be used to implement the following solution to #9:

Solution: Manually build the container and upload it to the Sylabs Cloud. For this, we can use SeekDeep, which is able to build the image.
Limitations: We need to manually build the container rather than automating the process.

In particular, every two weeks, a bot will comment on this issue to remind the maintainers to build the container and upload it to the Sylabs Cloud.

Pulling the container will then be as easy as running the following:

singularity pull library://apascha1/miptools/miptools:{tag}

Wrangler: `--stitch-options` clarification

The --stitch-options (-x) argument's help indicates that it defines the "probe set to be processed." I suspect that this is actually incorrect as there is an argument --probe-sets (-p).

Experimenting with the generate wrangler script seems to indicate that --stitch-options allows the user to specify additional arguments when extracting the sequences and stitching forward and reverse reads using MIPWrangler mipSetupAndExtractByArm. @aydemiro can you clarify what --stitch-options is used for?

Possible Bug

If --stitch-options is indeed used to pass in additional arguments to MIPWrangler mipSetupAndExtractByArm there may be a bug in the code. I would expect that we would pass additional arguments using a list of arguments. For example: "--overWriteDirs,--overWriteLog". However, if we attempt to pass that into the wrangler app,

stitch_options="--overWriteDirs,--overWriteLog"

singularity run --app wrangler \
    -B $project_resources:/opt/project_resources \
    -B $fastq_dir:/opt/data \
    -B $wrangler_dir:/opt/analysis \
    $miptools -p $probe_sets_used -s $sample_sets_used -e $experiment_id \
    -l  $sample_list -c $cpu_number -m $min_capture_length -x $stitch_options

The code crashes with the following error: generate_wrangler_scripts.py: error: argument -x/--stitch-options: expected one argument. This is caused by argparse, and it is a known issue that argparse will not parse argument values with leading dashes correctly. To address this, we could ask users to input the names of the flags without the dashes and then add the dashes back within the Python script.
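A minimal sketch of that workaround (the helper name is hypothetical): accept comma-separated flag names without dashes and restore them before calling MIPWrangler.

```python
def add_dashes(flags: str) -> str:
    """Prepend '--' to each comma-separated flag name.

    Hypothetical helper: users would pass e.g. "overWriteDirs,overWriteLog"
    and the script restores the dashes that argparse cannot accept as an
    option value.
    """
    return ",".join("--" + f.lstrip("-") for f in flags.split(",") if f)


print(add_dashes("overWriteDirs,overWriteLog"))
# -> --overWriteDirs,--overWriteLog
```

An alternative worth noting is argparse's own escape hatch of passing the value as `-x=--overWriteDirs,...`, which sidesteps the leading-dash ambiguity without renaming anything.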

Use Github Actions to build and deploy container

Github Actions could be a useful tool for us to test, document, and even build our container.

While there are many actions we could eventually incorporate, the one that most catches my eye right now is the ability to build our container on each push. The https://github.com/singularityhub/github-ci repo provides three example actions that we could experiment with.

With these actions, we can build and test the container. We can also use the Github Package repository to store our built containers. This can make it easier for users to download the built container, rather than needing to build the container on their own machine...

Before I go ahead and try this out, @AshlinHarris what do you think about implementing this action and Github Actions in general?

`plotlywidget/extension` reports problems when enabled

Bug Report

In the development version of MIPTools, the notebook extension plotlywidget/extension reports problems. The extension still appears to be enabled (by checking jupyter nbextension list), however, there may be problems when trying to use this extension within any notebooks.

Reprex

The reprex provided here shows the message when running within the Singularity shell. This message also appears when trying to run the jupyter app.

$ singularity shell miptools_v0.4.0.sif

Singularity> jupyter nbextension enable plotlywidget/extension

#> Enabling notebook extension plotlywidget/extension...
#>       - Validating: problems found:
#>         - require?  X plotlywidget/extension

Build error

When attempting to build {MIPTools} v1.0.0, I get the following error:

Build Error

One potential culprit is the {McCOILR} package. I have created a PR OJWatson/McCOILR#7 which aims to address the build errors of {McCOILR}. Hopefully, this will resolve the {MIPTools} build error.

Unable to find C and C++ compilers during build

Bug Description

When attempting to build MIPTools, we are unable to build the elucidator program. The build process crashes with an error message indicating that the C and C++ compilers were not found. This is most likely due to nickjhathaway/elucidator@9bcfe9c, where the compiler versions were upgraded. Now, gcc-10 (the C compiler) and g++-10 (the C++ compiler) are required.

The MIPTools container is built on Ubuntu 20.04. We install GCC via apt-get and g++ via anaconda. However, for Ubuntu 20.04, the default version of GCC/g++ seems to be gcc-9/g++-9. To fix this bug, we likely need to explicitly install GCC/g++ version 10.

Steps to Reproduce

  1. Build the development container

Document files generated

It would be helpful for users if we could provide a list of all the files that MIPTools generated and a brief description of each one. There are many files generated so it can be confusing navigating them and determining which one will best serve your downstream analysis needs. Although depending on how you use MIPTools, different files may be generated, it would be ideal to have a "master list" of all files.

List of Files

(Generated from two recent runs)

  • aligned_haplotypes.csv
  • all_barcode_counts.csv
  • all_haplotypes.csv
  • alternate_AA_table.csv
  • alternate_AN_table.csv
  • alternate_table.csv
  • barcode_counts.csv
  • coverage_AA_table.csv
  • coverage_AN_table.csv
  • coverage_table.csv
  • genotypes_AA_table.csv
  • genotypes_AN_table.csv
  • genotypes_table.csv
  • haplotype_counts.csv
  • mapped_haplotypes.csv
  • offtarget_haplotypes.csv
  • reference_AA_table.csv
  • reference_AN_table.csv
  • reference_table.csv
  • repool.csv
  • run_meta.csv
  • sample_summary.csv
  • unique_haplotypes.csv

Please add any files I have missed!
