bailey-lab / miptools
A suite of computational tools used for molecular inversion probe design, data processing, and analysis.
Home Page: https://miptools.readthedocs.io
License: MIT License
It would be great to add some unit testing at this stage. This can drastically cut down the time spent finding errors and tracking down their source, so that we can make revisions to the testing branch with confidence.
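As a minimal sketch of what such a unit test could look like, using pytest conventions (the function and values here are illustrative, not existing MIPTools code):

```python
def clean_probe_sets(raw: str) -> str:
    """Remove whitespace from a comma-separated probe set string."""
    return raw.replace(" ", "")


def test_clean_probe_sets():
    # A stray space after the comma should be removed.
    assert clean_probe_sets("DR1, VAR4") == "DR1,VAR4"


test_clean_probe_sets()
```

Under pytest, files named `test_*.py` containing such functions would be collected and run automatically.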
This issue serves as a central place to list changes that need to be made for the stable release of the documentation (v0.4.0). The current documentation for this release is copied over from the development documentation, so in many cases, we need to remove references to new capabilities. The source code for these docs can be found in the `v0.4.0-prod-docs` branch.

- Remove the `-h` flag from all man pages.

When trying to build the container, mamba freezes when trying to install packages.
gh repo clone bailey-lab/miptools
sudo singularity build --force miptools_dev.sif MIPTools.def
info libmamba Parsing MatchSpec openssl
and never moves on from this.

cc @aydemiro, do you have any experience with this problem?
When performing large-scale sequencing, the input for certain samples, and for particular MIPs within them, can be extremely deep (many reads for a given MIP in a given sample). This occurs when controls are repeatedly sequenced and merged together. The best place to subsample to reduce depth is after UMI determination and correction. The following script does the subsampling:
https://github.com/bailey-lab/MIPTools/blob/master/src/wrangler_downsample_umi.py
However, the subsampling is random, which is not optimal; it would be preferable for it to be deterministic. Also, UMIs with the most read support are the optimal sequences to retain when subsampling.
Modify the downsampling script to sort UMI sequences deterministically by the number of supporting reads and then trim off those with lower read support when the number of UMI sequences exceeds the input threshold.
Describe alternatives you've considered
I am not sure there is really justification for alternatives, unless one can argue that one wants to explore the effect of suboptimally selecting UMI sequences.
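A minimal sketch of the proposed deterministic behavior, assuming UMI read counts are available as a mapping (the function name and data are illustrative): sort by read support descending, break ties on the UMI sequence itself so the order is reproducible, then keep at most `threshold` UMIs.

```python
def downsample_umis(umi_read_counts: dict, threshold: int) -> list:
    """Deterministically keep the `threshold` UMIs with the most read support.

    Ties in read count are broken by the UMI sequence so that repeated runs
    always produce the same result.
    """
    ranked = sorted(umi_read_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [umi for umi, _ in ranked[:threshold]]


counts = {"AACG": 3, "GGTT": 10, "CCAT": 10, "TTAA": 1}
print(downsample_umis(counts, 2))  # ['CCAT', 'GGTT']
```

Both 10-read UMIs outrank the others; the tie between them is resolved alphabetically, so the output is identical on every run.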
In #26, we implemented checks to ensure that arguments passed to the wrangler app had no spaces in them. Our solution relies on parameter expansion. Crucially, this is a non-POSIX feature. While our container does run using bash, this error would indicate that the parameter expansion is not working properly.
Reported by @iek.
We may want to consider fixing the versions of the installed software in our container. I know that this has been mentioned by @JeffAndBailey a couple of times in conversation. I am not sure that we need to fix the versions for every piece of software installed in the container but it may be useful to fix the versions for larger pieces of software.
Currently, we install software using `apt-get`, `conda`, `mamba`, `git`, `wget`, and some packages through R via the use of `install.packages()`. Of these, my initial thought would be to update system packages (installed via `apt-get`) on build but fix all other software.
For reference, the software installed by each tool:

- `conda`: `mamba`
- `mamba`: `r`, `python`, `bcftools`, `samtools`, `vcftools`, `freebayes`, `gatk4`
- `git`: `msa2vcf` (through `jvarkit`), `vt`, `MIPWrangler`, `elucidator`
- `wget`: `conda` (via `miniconda`)
- `install.packages()`: R packages
Currently, we advise users that if they would like to demultiplex files they must first download bcl2fastq
, move the .zip
file into the programs
directory, and then re-build the container. It would be convenient if users were able to add custom software into the container without needing to re-build it, perhaps by binding to a certain location within the container.
This issue was discussed previously by @JeffAndBailey in #32 (comment) and by @aydemiro in #32 (comment):
Let's see if we can download and build bcl2fastq externally and install it as a working version with any needed libraries or accessory files. If that is possible, then really our fixed builds sans bcl2fastq will be fine for reproducibility.
Originally posted by @JeffAndBailey in #32 (comment)
As for the bcl2fastq issue, I agree that we should explore building the software outside and providing the binary to the container as a binding. However, this is a compiled c++ program and how to create a portable binary is beyond my capabilities at the moment. Nick is probably the best person to consult on this.
Originally posted by @aydemiro in #32 (comment)
When arguments contain spaces, `argparse` is unable to parse them and our code crashes. I encountered this error when attempting to run through the analysis test run. When running through the first section of the tutorial, the user is asked to specify the probe sets used before running the wrangler app. In the example, we write
probe_sets_used="DR1,VAR4"
However, if the user adds a space in between these two probe sets:
probe_sets_used="DR1, VAR4"
and tries to run the wrangler app using the following:
singularity run --app wrangler \
-B $project_resources:/opt/project_resources \
-B $fastq_dir:/opt/data \
-B $wrangler_dir:/opt/analysis \
$miptools -p $probe_sets_used -s $sample_sets_used -e $experiment_id \
-l $sample_list -c $cpu_number -m $min_capture_length
the app will fail with the following error: `generate_wrangler_scripts.py: error: unrecognized arguments: VAR4`.
Our code does aim to deal with these cases by cleaning up spaces in the following lines:
MIPTools/src/generate_wrangler_scripts.py
Lines 79 to 96 in 09f52ef
However, the code fails before we are able to reach this stage. It seems that the error is triggered by `argparse` here:
MIPTools/src/generate_wrangler_scripts.py
Line 68 in 09f52ef
Seeing as we are unable to deal with spaces within the Python file, I suggest we deal with any spaces and misformatted arguments within the wrangler app before feeding our arguments to the Python script. We can do simple string manipulation in bash, as outlined here. For instance, to remove whitespace:
probe_sets_used="DR1, VAR4"
echo "${probe_sets_used// /}"
#> DR1,VAR4
Everything related to MIPTools (the program itself, related software libraries, documentation and tutorials, even non-developer functions like error reporting and feature requests) should ultimately be put on the GitHub page so that everything is visible and cohesive.
Here is a list of all the external sources I know of:
Does v1.0.0 exist? If a link is put in the README, users should be able to download this version, especially as the repository is public. Currently, the README instructs users to run:
singularity pull library://apascha1/miptools/miptools:v1.0.0
but this fails:
(base) [jbailey5@pbaileywebcit deleteme]$ singularity pull library://apascha1/miptools/miptools:v1.0.0
FATAL: While pulling library image: error fetching image: unable to download image: error downloading image: context canceled
(base) [jbailey5@pbaileywebcit deleteme]$ singularity pull library://apascha1/miptools/miptools:v1.0.0
FATAL: While pulling library image: error fetching image: image does not exist in the library: apascha1/miptools/miptools:v1.0.0 (amd64)
Some code for filtering, analysis, and plotting is duplicated between `src/PCA.py` and `src/plot_PCA.py`.
In #26, we addressed the fact that arguments might contain spaces by parsing the strings to remove whitespace. Following the clarification of the documentation for the `--stitch-options` (`-x`) argument of wrangler (#27), it makes sense to do some additional string parsing to ensure the arguments are passed in the correct format. Namely, in order to feed in arguments, we must have a leading comma. We can use an if statement to check whether a leading comma is present and, if it is not, add it.
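A minimal sketch of that check, assuming the argument is held in a shell variable (`ensure_leading_comma` is a hypothetical helper name; a POSIX `case` is used so it does not depend on bash):

```shell
ensure_leading_comma() {
  value="$1"
  case "$value" in
    ,*) ;;                 # leading comma already present; leave as-is
    *) value=",$value" ;;  # otherwise, prepend one
  esac
  printf '%s\n' "$value"
}

ensure_leading_comma "--overWriteDirs"   # prints ,--overWriteDirs
ensure_leading_comma ",--overWriteDirs"  # already present; unchanged
```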
I have been unable to use MIPWrangler and elucidator using the development version of MIPTools.
After downloading the development version, run:
$ singularity shell miptools_dev.sif
Singularity> MIPWrangler
#> MIPWrangler: error while loading shared libraries: libbamtools.so.2.5.2: cannot open shared object file: No such file or directory
MIPWrangler and elucidator should work within `singularity shell` and within any apps that call the two programs.
I suspect that this may have been caused by 26eca75 where we deleted the source code after installation of the binary. It may be the case that in order for these programs to run the binary needs access to the source code or other build artifacts.
In an effort to improve docs, we should add the `-h` flag to all the MIPTools apps. Then the user can type the flag and get a nicely formatted man page. An example format that we could use is below:
Description

Usage:
  singularity run [options] --app <app> <image> [app_options]

Options:
  See 'singularity run'.

App Options:
  -a    Required. Description.
  -b    Optional argument. Default value.
  -h    Print the help page.

Examples:
  # Set paths
  $ resource_dir=/bin/MIPTools/base_resources
  $ dir=/work/usr/example

  # Run app
  $ singularity run \
      -B $resource_dir:/opt/resources -B $dir:/opt/analysis \
      --app <app> <image> -h
Now that we have ported over the documentation for this project to Read the Docs, we should clean up this project by removing outdated branches.
We can likely remove:
- `prodution-docs`: the branch used for storing the source code for the production documentation (at this point in time, v0.4.0).
- `prod-gh-pages`: the GitHub Pages branch associated with the production docs.
- `gh-pages`: the branch which had the GitHub Pages information for the master branch.

Following the fixes in #22, where the sample sheet preparation was updated, I have been unable to generate a sample sheet without running into errors. Namely, the error occurs when the script attempts to assign a replicate number to each grouped sample and sample set here:
MIPTools/src/sample_sheet_prep.py
Lines 133 to 135 in 8e3ed00
Digging into the code a bit more, the new replicate column is filled with `NaN`. Somehow, the `assign_replicate` function is generating `NaN` instead of the true replicate number. This causes the `astype(int)` command to fail, as it cannot coerce `NaN` into an integer.
$ singularity exec -B /work/apascha1/sample-sheet-prep:/opt/analysis \
miptools_dev.sif python /opt/src/sample_sheet_prep.py \
--capture-plates capture_plates.tsv --sample-plates sample_plates.tsv --output-file samples.tsv
This should raise the following exception:
Exception: Error in assigning replicates. Please make sure the 'sample_name' and 'sample_set' fields have valid, non-empty values in all provided files.
There should be no `NaN`s in the replicate column after assigning replicates.
@aydemiro Do you have any idea what might be going wrong in the `assign_replicate` function?
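For context, here is a hypothetical sketch of how replicate assignment can silently produce `NaN` (the column names follow the sample sheet; the data is illustrative): a groupby-based numbering drops rows whose grouping keys are missing, so those rows never receive a replicate.

```python
import numpy as np
import pandas as pd

# Group by sample_name and sample_set, then number each occurrence.
# Rows whose grouping keys are NaN are excluded from the groupby, so
# their replicate value stays NaN and a later astype(int) fails.
df = pd.DataFrame({
    "sample_name": ["A", "A", "B", np.nan],
    "sample_set": ["S1", "S1", "S1", "S1"],
})
df["replicate"] = df.groupby(["sample_name", "sample_set"]).cumcount() + 1
print(df["replicate"].tolist())  # last entry is nan
```

If `assign_replicate` works along these lines, an empty `sample_name` or `sample_set` anywhere in the input files would explain the behavior.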
Jupyter notebooks contain segmented python scripts with their outputs. Currently, they are used for many analysis steps for which simple scripts are better suited. Ideally, they should be reserved for plotting and quality control, and elsewhere replaced with scripts.
Currently, we use an outdated python script to download data from BaseSpace: https://github.com/bailey-lab/MIPTools/blob/70c9c26cd86af33f5eb75bdd0c4c43edfedc26d4/bin/BaseSpaceRunDownloader_v2.py. BaseSpace has released tools to work with data on the CLI. To prevent code breakage down the line, we propose using BaseSpace's tools.
Below, I discuss some of my thoughts in reading through the BaseSpace documentation.
Installation is straightforward; however, we may consider changing the installation location. We may also need to change file permissions using `chmod`.
# Install
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $HOME/bin/bs
# Change file permission
chmod u+x $HOME/bin/bs
Interactively, the user can run the authentication command and then go to the URL provided to sign in.
bs auth
#> Please go to this URL to authenticate: https://basespace.illumina.com/oauth/device?code=6Cesj
The resulting config file will be stored in `$HOME/.basespace`.
However, there are a couple of additional factors to consider. We need to think about the best way to automate this process. A couple of notes to consider:

- A config can be loaded with `bs load config`. This may make it easier to inject credentials.

Downloading data is simple, but there are many options. What is the best strategy to implement for our purposes?
# Single run
bs download run -i <RunID> -o <output>
# Multiple runs in a project
bs download project -i <ProjectID> -o <output>
# Subset of a project
bs download project -i <ProjectID> -o <output> --extension=fastq.gz
We can install the CLI tool into our container. We will need to modify the download app to call either a series of commands or a script rather than the python script. We can provide several options with default values:
| Flag | Function |
| --- | --- |
| `-s`, `--api-server` | The API server |
| `-i`, `--run-id` | The run ID |
| `-o`, `--out-dir` | The output directory |
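A hypothetical sketch of the modified download app: parse the proposed options with `getopts`, then hand off to the BaseSpace CLI. The defaults are illustrative, the example invocation is simulated with `set --`, and the final command is only echoed (a dry run) rather than executed.

```shell
api_server="https://api.basespace.illumina.com"
run_id=""
out_dir="/opt/analysis"

# Example invocation (arguments are illustrative).
set -- -i 12345 -o /work/data

while getopts "s:i:o:" opt; do
  case "$opt" in
    s) api_server="$OPTARG" ;;
    i) run_id="$OPTARG" ;;
    o) out_dir="$OPTARG" ;;
  esac
done

# Dry run: show the BaseSpace CLI command that would be executed.
echo bs download run -i "$run_id" -o "$out_dir" --api-server "$api_server"
```

In the real app, the `echo` would be dropped so that `bs download run` actually executes inside the container.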
In the notebook analyses, users directly provide filenames, options, and settings. Typos and errors can introduce issues further downstream, at which point debugging can be difficult.
All possible values should be printed before input is accepted. User input should be checked for potential issues immediately after it is accepted.
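A minimal sketch of that pattern (names and values are illustrative, not actual MIPTools identifiers): hold the valid options, then validate user input immediately after it is accepted.

```python
VALID_PROBE_SETS = {"DR1", "VAR4"}  # hypothetical example values


def read_probe_sets(raw: str) -> list:
    """Split a comma-separated string and reject unknown probe sets."""
    chosen = [s.strip() for s in raw.split(",") if s.strip()]
    unknown = [s for s in chosen if s not in VALID_PROBE_SETS]
    if unknown:
        raise ValueError(
            f"Unknown probe sets {unknown}; valid options: {sorted(VALID_PROBE_SETS)}"
        )
    return chosen


print(read_probe_sets("DR1, VAR4"))  # ['DR1', 'VAR4']
```

A typo such as `"DR1, VAR5"` would fail immediately with a message listing the valid options, rather than surfacing as an opaque error further downstream.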
The pipeline produces sample sheets associated with wrangler files, and these might contain commas, have missing values, or contain quotation marks. The analysis notebooks misread these files and report that 0 haplotypes have been found. This can cause errors further along in the pipeline, so researchers will manually edit entries with commas to avoid the issue.
Ideally, any output from one tool should be accepted as input to the next without need for modification. For now, it may be sufficient to include a warning in the analysis notebooks, since the error message downstream doesn't indicate the cause of the issue.
I was not able to replicate this exact issue, but several researchers have experienced a problem here.
The test analysis template found here is currently outdated. Due to changes in pandas, one of the code chunks in the call genotypes and prevalences section will crash:
mutant_index = targeted_prev.columns.droplevel(["Gene", "Mutation Name", "ExonicFunc"])
expected_freq = expected_freq.loc[mutant_index]
i_map = {(i[0], i[4]): i for i in targeted_prev.columns}
expected_freq.index = pd.MultiIndex.from_tuples(
[i_map[i] for i in expected_freq.index],
names=targeted_prev.columns.names)
expected_freq.fillna(0, inplace=True)
#> KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
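A minimal demonstration of the pandas change and one possible fix (the data here is illustrative, not from the real template): modern pandas raises a `KeyError` when `.loc` receives labels missing from the index, whereas `.reindex` keeps the requested labels and fills the missing ones with `NaN`, which the template's later `fillna(0)` already handles.

```python
import pandas as pd

expected_freq = pd.Series([0.1, 0.2], index=["a", "b"])
mutant_index = pd.Index(["a", "c"])  # "c" is absent from expected_freq

# expected_freq.loc[mutant_index]  # KeyError in modern pandas

# reindex inserts NaN for missing labels instead of raising.
fixed = expected_freq.reindex(mutant_index).fillna(0)
print(fixed.to_dict())  # {'a': 0.1, 'c': 0.0}
```

In the template, replacing `expected_freq.loc[mutant_index]` with `expected_freq.reindex(mutant_index)` may therefore be sufficient, assuming rows absent from `expected_freq` should be treated as frequency 0.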
Comparing the analysis of the test data set file with the current analysis template, the files are quite different from one another. It may be worthwhile to update the test data analysis file using the provided template. This would go hand in hand with the efforts in 69ff95f, which provides improved documentation on analyzing the test data.
When the `freebayes_caller.py` script is called from within the container with the default settings, the script crashes. I traced this to the number-of-threads option: when the number of threads is set to 1 (the default), the script fails. This likely means that `freebayes_call()` is unable to run with 1 thread and that some internal function is causing this bug.
We should set the default number of threads to a different number. We could use 20, which is the default in the Jupyter notebooks.
Follow the analysis pipeline guide and run through the Jupyter app until you reach the Freebayes step. At this point, you can run the `freebayes_caller.py` script.
# This command fails
singularity exec \
-B test-data/DR1_project_resources:/opt/project_resources \
-B test-data/pf_species_resources:/opt/species_resources \
-B test-data/wrangler:/opt/data \
-B test-data/variant:/opt/analysis \
miptools_v0.4.0.sif \
python /opt/src/freebayes_caller.py --threads=1
# This command works
singularity exec \
-B test-data/DR1_project_resources:/opt/project_resources \
-B test-data/pf_species_resources:/opt/species_resources \
-B test-data/wrangler:/opt/data \
-B test-data/variant:/opt/analysis \
miptools_v0.4.0.sif \
python /opt/src/freebayes_caller.py --threads=2
This issue will be used to implement the following solution to #9:
Solution: Manually build and upload the container to ~~GitHub Packages~~ the Sylabs Cloud. For this, we can use Seekdeep, which is able to build the image.
Limitations: We need to manually build the container rather than automating the process.
In particular, every two weeks, a bot will comment on this issue to remind the maintainers to build the container and upload it to the Sylabs Cloud.
Pulling the container will then be as easy as running the following:
singularity pull library://apascha1/miptools/miptools:{tag}
The `--stitch-options` (`-x`) argument's help indicates that it defines the "probe set to be processed." I suspect that this is actually incorrect, as there is already an argument `--probe-sets` (`-p`).
Experimenting with the generate wrangler script seems to indicate that `--stitch-options` allows the user to specify additional arguments when extracting the sequences and stitching forward and reverse reads using `MIPWrangler mipSetupAndExtractByArm`. @aydemiro, can you clarify what `--stitch-options` is used for?
If `--stitch-options` is indeed used to pass additional arguments to `MIPWrangler mipSetupAndExtractByArm`, there may be a bug in the code. I would expect that we would pass additional arguments using a list of arguments, for example: `"--overWriteDirs,--overWriteLog"`. However, if we attempt to pass that into the wrangler app,
stitch_options="--overWriteDirs,--overWriteLog"
singularity run --app wrangler \
-B $project_resources:/opt/project_resources \
-B $fastq_dir:/opt/data \
-B $wrangler_dir:/opt/analysis \
$miptools -p $probe_sets_used -s $sample_sets_used -e $experiment_id \
-l $sample_list -c $cpu_number -m $min_capture_length -x $stitch_options
The code crashes with the following error: `generate_wrangler_scripts.py: error: argument -x/--stitch-options: expected one argument`. This is caused by `argparse`; it is a known issue that `argparse` will not correctly parse values that begin with dashes. To address this, we could ask users to input the names of the flags without the dashes and then add the dashes back within the python script.
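A sketch of that workaround (the `-x`/`--stitch-options` flag matches the wrangler script, but the parsing shown here is a standalone illustration): users supply the MIPWrangler option names without leading dashes, and the script adds them back before forwarding.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-x", "--stitch-options", default="")

# The value has no leading dash, so argparse accepts it without complaint.
args = parser.parse_args(["-x", "overWriteDirs,overWriteLog"])

# Re-add the dashes before passing the options on to MIPWrangler.
stitch_flags = ["--" + opt for opt in args.stitch_options.split(",") if opt]
print(stitch_flags)  # ['--overWriteDirs', '--overWriteLog']
```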
Github Actions could be a useful tool for us to test, document, and even build our container.
While there are many actions we could eventually incorporate, as of now, the one that most catches my eye is the ability to build our container each time we push. The https://github.com/singularityhub/github-ci repo provides three example actions that we could experiment with.
With these actions, we can build and test the container. We can also use the Github Package repository to store our built containers. This would make it easier for users to download the built container, rather than needing to build it on their own machine.
Before I go ahead and try this out, @AshlinHarris what do you think about implementing this action and Github Actions in general?
The MUSCLE multiple sequence alignment program underwent a major improvement to version 5 in late 2021. The authors state that:
Muscle v5 is much more accurate, is often faster, and scales to much larger datasets.
Given these improvements, we should upgrade the current binary included in MIPTools.
In the development version of MIPTools, the notebook extension `plotlywidget/extension` reports problems. The extension still appears to be enabled (by checking `jupyter nbextension list`); however, there may be problems when trying to use this extension within any notebooks.
The reprex provided here shows the message when running within the Singularity shell. This message also appears when trying to run the `jupyter` app.
$ singularity shell miptools_v0.4.0.sif
Singularity> jupyter nbextension enable plotlywidget/extension
#> Enabling notebook extension plotlywidget/extension...
#> - Validating: problems found:
#> - require? X plotlywidget/extension
When attempting to build {MIPTools} v1.0.0, I get the following error:
One potential culprit is the {McCOILR} package. I have created a PR OJWatson/McCOILR#7 which aims to address the build errors of {McCOILR}. Hopefully, this will resolve the {MIPTools} build error.
When attempting to build MIPTools, we are unable to build the elucidator program. The build process crashes with an error message indicating that the C and C++ compilers were not found. This is most likely due to nickjhathaway/elucidator@9bcfe9c, where the compiler versions were upgraded. Now, `gcc-10` (the C compiler) and `g++-10` (the C++ compiler) are required.
The MIPTools container is built on Ubuntu 20.04. We install GCC via `apt-get` and g++ via `anaconda`. However, for Ubuntu 20.04, the default version of GCC/g++ is `gcc-9`/`g++-9`. To fix this bug, we likely need to specify the installation of GCC/g++ version 10.
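A hedged sketch of the fix as a Singularity definition-file fragment (the header lines are illustrative; MIPTools' actual definition differs, and whether `update-alternatives` is needed depends on how the elucidator build locates its compilers):

```shell
Bootstrap: docker
From: ubuntu:20.04

%post
    # Install GCC/g++ 10 explicitly, since Ubuntu 20.04 defaults to 9.
    apt-get update && apt-get install -y gcc-10 g++-10
    # Point the default compiler names at version 10.
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 100
    update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 100
```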
It would be helpful to provide users with a list of all the files that MIPTools generates, along with a brief description of each one. Many files are generated, so it can be confusing to navigate them and determine which one will best serve your downstream analysis needs. Although different files may be generated depending on how you use MIPTools, it would be ideal to have a "master list" of all files.
(Generated from two recent runs)
Please add any files I have missed!