bailey-lab / miptools
A suite of computational tools used for molecular inversion probe design, data processing, and analysis.
Home Page: https://miptools.readthedocs.io
License: MIT License
It would be great to add some unit testing at this stage. This can drastically cut down the time spent finding errors and tracking down their source, so that we can make revisions to the testing branch with confidence.
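As a minimal sketch of what such a unit test could look like, using pytest conventions (the function and values here are illustrative, not existing MIPTools code):

```python
def clean_probe_sets(raw: str) -> str:
    """Remove whitespace from a comma-separated probe set string."""
    return raw.replace(" ", "")


def test_clean_probe_sets():
    # A stray space after the comma should be removed.
    assert clean_probe_sets("DR1, VAR4") == "DR1,VAR4"


test_clean_probe_sets()
```

Under pytest, files named `test_*.py` containing such functions would be collected and run automatically.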
This issue serves as a central place to list changes that need to be made for the stable release of the documentation (v0.4.0). The current documentation for this release is copied over from the development documentation, so in many cases, we need to remove references to new capabilities. The source code for these docs can be found in the `v0.4.0-prod-docs` branch.

- Remove the `-h` flag from all man pages.

When trying to build the container, mamba freezes when trying to install packages.
gh repo clone bailey-lab/miptools
sudo singularity build --force miptools_dev.sif MIPTools.def
info libmamba Parsing MatchSpec openssl
and never moves on from this.

cc @aydemiro, do you have any experience with this problem?
When performing large-scale sequencing, the input for certain samples, and for particular MIPs within them, can be extremely deep (many reads for a given MIP in a given sample). This occurs when controls are repeatedly sequenced and merged together. The best place to subsample to reduce depth is after UMI determination and correction. The following script does the subsampling:
https://github.com/bailey-lab/MIPTools/blob/master/src/wrangler_downsample_umi.py
However, the subsampling is random, which is not optimal; it would be preferable for it to be deterministic. Also, UMIs with the most read support are the optimal sequences to retain when subsampling.
Modify the downsampling script to sort UMI sequences deterministically by the number of supporting reads and then trim off those with lower read support when the number of UMI sequences exceeds the input threshold.
Describe alternatives you've considered
I am not sure there is really justification for alternatives, unless one can argue that one wants to explore the effect of suboptimally selecting UMI sequences.
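A minimal sketch of the proposed deterministic behavior, assuming UMI read counts are available as a mapping (the function name and data are illustrative): sort by read support descending, break ties on the UMI sequence itself so the order is reproducible, then keep at most `threshold` UMIs.

```python
def downsample_umis(umi_read_counts: dict, threshold: int) -> list:
    """Deterministically keep the `threshold` UMIs with the most read support.

    Ties in read count are broken by the UMI sequence so that repeated runs
    always produce the same result.
    """
    ranked = sorted(umi_read_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [umi for umi, _ in ranked[:threshold]]


counts = {"AACG": 3, "GGTT": 10, "CCAT": 10, "TTAA": 1}
print(downsample_umis(counts, 2))  # ['CCAT', 'GGTT']
```

Both 10-read UMIs outrank the others; the tie between them is resolved alphabetically, so the output is identical on every run.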
In #26, we implemented checks to ensure that arguments passed to the wrangler app had no spaces in them. Our solution relies on parameter expansion. Crucially, this is a non-POSIX feature. While our container does run using bash, this error would indicate that the parameter expansion is not working properly.
Reported by @iek.
We may want to consider fixing the versions of the installed software in our container. I know that this has been mentioned by @JeffAndBailey a couple of times in conversation. I am not sure that we need to fix the versions for every piece of software installed in the container but it may be useful to fix the versions for larger pieces of software.
Currently, we install software using `apt-get`, `conda`, `mamba`, `git`, `wget`, and some packages through R via the use of `install.packages()`. Of these, my initial thought would be to update system packages (installed via `apt-get`) on build but fix all other software.
For reference, the software installed by each tool:

- `conda`: `mamba`
- `mamba`: `r`, `python`, `bcftools`, `samtools`, `vcftools`, `freebayes`, `gatk4`
- `git`: `msa2vcf` (through `jvarkit`), `vt`, `MIPWrangler`, `elucidator`
- `wget`: `conda` (via `miniconda`)
- `install.packages()`: R packages
Currently, we advise users that if they would like to demultiplex files they must first download bcl2fastq
, move the .zip
file into the programs
directory, and then re-build the container. It would be convenient if users were able to add custom software into the container without needing to re-build it, perhaps by binding to a certain location within the container.
This issue was discussed previously by @JeffAndBailey in #32 (comment) and by @aydemiro in #32 (comment):
Let's see if we can download and build bcl2fastq externally and install it as a working version with any needed libraries or accessory files. If that is possible, then really our fixed builds sans bcl2fastq will be fine for reproducibility.
Originally posted by @JeffAndBailey in #32 (comment)
As for the bcl2fastq issue, I agree that we should explore building the software outside and providing the binary to the container as a binding. However, this is a compiled c++ program and how to create a portable binary is beyond my capabilities at the moment. Nick is probably the best person to consult on this.
Originally posted by @aydemiro in #32 (comment)
When arguments contain spaces, `argparse` is unable to parse them and our code crashes. I encountered this error when attempting to run through the analysis test run. When running through the first section of the tutorial, the user is asked to specify the probe sets used before running the wrangler app. In the example, we write
probe_sets_used="DR1,VAR4"
However, if the user adds a space in between these two probe sets:
probe_sets_used="DR1, VAR4"
and tries to run the wrangler app using the following:
singularity run --app wrangler \
-B $project_resources:/opt/project_resources \
-B $fastq_dir:/opt/data \
-B $wrangler_dir:/opt/analysis \
$miptools -p $probe_sets_used -s $sample_sets_used -e $experiment_id \
-l $sample_list -c $cpu_number -m $min_capture_length
the app will fail with the following error: `generate_wrangler_scripts.py: error: unrecognized arguments: VAR4`.
Our code does aim to deal with these cases by cleaning up spaces in the following lines:
MIPTools/src/generate_wrangler_scripts.py
Lines 79 to 96 in 09f52ef
However, the code fails before we are able to reach this stage. It seems that the error is triggered by `argparse` here:
MIPTools/src/generate_wrangler_scripts.py
Line 68 in 09f52ef
Seeing as we are unable to deal with spaces within the Python file, I suggest we deal with any spaces and misformatted arguments within the wrangler app before feeding our arguments to the Python script. We can do simple string manipulation in bash, as outlined here. For instance, to remove whitespace:
probe_sets_used="DR1, VAR4"
echo "${probe_sets_used// /}"
#> DR1,VAR4
Everything related to MIPTools (the program itself, related software libraries, documentation and tutorials, even non-developer functions like error reporting and feature requests) should ultimately be put on the GitHub page so that everything is visible and cohesive.
Here is a list of all the external sources I know of:
Does v1.0.0 exist? If a link is put in the README, users should be able to download this version, especially as the repository is public. Currently, the README instructs users to run:
singularity pull library://apascha1/miptools/miptools:v1.0.0
but this fails:
(base) [jbailey5@pbaileywebcit deleteme]$ singularity pull library://apascha1/miptools/miptools:v1.0.0
FATAL: While pulling library image: error fetching image: unable to download image: error downloading image: context canceled
(base) [jbailey5@pbaileywebcit deleteme]$ singularity pull library://apascha1/miptools/miptools:v1.0.0
FATAL: While pulling library image: error fetching image: image does not exist in the library: apascha1/miptools/miptools:v1.0.0 (amd64)
Some code for filtering, analysis, and plotting is duplicated between `src/PCA.py` and `src/plot_PCA.py`.
In #26, we addressed the fact that arguments might contain spaces by parsing the strings to remove whitespace. Following the clarification of the documentation for the `--stitch-options` (`-x`) argument of wrangler (#27), it makes sense to do some additional string parsing to ensure the arguments are passed in the correct format. Namely, in order to feed in arguments, we must have a leading comma. We can use an if statement to check whether a leading comma is present and, if it is not, add it.
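A minimal sketch of that check, assuming the argument is held in a shell variable (`ensure_leading_comma` is a hypothetical helper name; a POSIX `case` is used so it does not depend on bash):

```shell
ensure_leading_comma() {
  value="$1"
  case "$value" in
    ,*) ;;                 # leading comma already present; leave as-is
    *) value=",$value" ;;  # otherwise, prepend one
  esac
  printf '%s\n' "$value"
}

ensure_leading_comma "--overWriteDirs"   # prints ,--overWriteDirs
ensure_leading_comma ",--overWriteDirs"  # already present; unchanged
```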
I have been unable to use MIPWrangler and elucidator using the development version of MIPTools.
After downloading the development version, run:
$ singularity shell miptools_dev.sif
Singularity> MIPWrangler
#> MIPWrangler: error while loading shared libraries: libbamtools.so.2.5.2: cannot open shared object file: No such file or directory
MIPWrangler and elucidator should work within `singularity shell` and within any apps that call the two programs.
I suspect that this may have been caused by 26eca75 where we deleted the source code after installation of the binary. It may be the case that in order for these programs to run the binary needs access to the source code or other build artifacts.
In an effort to improve docs, we should add the `-h` flag to all the MIPTools apps. Then the user can type the flag and get a nicely formatted man page. An example format that we could use is below:
Description

Usage:
  singularity run [options] --app <app> <image> [app_options]

Options:
  See 'singularity run'.

App Options:
  -a    Required. Description.
  -b    Optional argument. Default value.
  -h    Print the help page.

Examples:
  # Set paths
  $ resource_dir=/bin/MIPTools/base_resources
  $ dir=/work/usr/example

  # Run app
  $ singularity run \
      -B $resource_dir:/opt/resources -B $dir:/opt/analysis \
      --app <app> <image> -h
Now that we have ported over the documentation for this project to Read the Docs, we should clean up this project by removing outdated branches.
We can likely remove:
- `prodution-docs`: the branch used for storing the source code for the production documentation (at this point in time, v0.4.0).
- `prod-gh-pages`: the GitHub Pages branch associated with the production docs.
- `gh-pages`: the branch which had the GitHub Pages information for the master branch.

Following the fixes in #22, where the sample sheet preparation was updated, I have been unable to generate a sample sheet without running into errors. Namely, the error occurs when the script attempts to assign a replicate number to each grouped sample and sample set here:
MIPTools/src/sample_sheet_prep.py
Lines 133 to 135 in 8e3ed00
Digging into the code a bit more, the new replicate column is filled with `NaN`. Somehow, the `assign_replicate` function is generating `NaN` instead of the true replicate number. This causes the `astype(int)` command to fail, as it cannot coerce `NaN` into an integer.
$ singularity exec -B /work/apascha1/sample-sheet-prep:/opt/analysis \
miptools_dev.sif python /opt/src/sample_sheet_prep.py \
--capture-plates capture_plates.tsv --sample-plates sample_plates.tsv --output-file samples.tsv
This should raise the following exception:
Exception: Error in assigning replicates. Please make sure the 'sample_name' and 'sample_set' fields have valid, non-empty values in all provided files.
There should be no `NaN`s in the replicate column after assigning replicates.
@aydemiro Do you have any idea what might be going wrong in the `assign_replicate` function?
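For context, here is a hypothetical sketch of how replicate assignment can silently produce `NaN` (the column names follow the sample sheet; the data is illustrative): a groupby-based numbering drops rows whose grouping keys are missing, so those rows never receive a replicate.

```python
import numpy as np
import pandas as pd

# Group by sample_name and sample_set, then number each occurrence.
# Rows whose grouping keys are NaN are excluded from the groupby, so
# their replicate value stays NaN and a later astype(int) fails.
df = pd.DataFrame({
    "sample_name": ["A", "A", "B", np.nan],
    "sample_set": ["S1", "S1", "S1", "S1"],
})
df["replicate"] = df.groupby(["sample_name", "sample_set"]).cumcount() + 1
print(df["replicate"].tolist())  # last entry is nan
```

If `assign_replicate` works along these lines, an empty `sample_name` or `sample_set` anywhere in the input files would explain the behavior.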
Jupyter notebooks contain segmented python scripts with their outputs. Currently, they are used for many analysis steps for which simple scripts are better suited. Ideally, they should be reserved for plotting and quality control, and elsewhere replaced with scripts.
Currently, we use an outdated python script to download data from BaseSpace: https://github.com/bailey-lab/MIPTools/blob/70c9c26cd86af33f5eb75bdd0c4c43edfedc26d4/bin/BaseSpaceRunDownloader_v2.py. BaseSpace has released tools to work with data on the CLI. To prevent code breakage down the line, we propose using BaseSpace's tools.
Below, I discuss some of my thoughts in reading through the BaseSpace documentation.
Installation is straightforward; however, we may consider changing the installation location. We may also need to change file permissions using `chmod`.
# Install
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $HOME/bin/bs
# Change file permission
chmod u+x $HOME/bin/bs
Interactively, the user can run the authentication command and then go to the URL provided to sign in.
bs auth
#> Please go to this URL to authenticate: https://basespace.illumina.com/oauth/device?code=6Cesj
The resulting config file will be stored in `$HOME/.basespace`.
However, there are a couple of additional factors to consider. We need to think about the best way to automate this process. A couple of notes to consider:

- A config can be loaded with `bs load config`. This may make it easier to inject credentials.

Downloading data is simple, but there are many options. What is the best strategy to implement for our purposes?
# Single run
bs download run -i <RunID> -o <output>
# Multiple runs in a project
bs download project -i <ProjectID> -o <output>
# Subset of a project
bs download project -i <ProjectID> -o <output> --extension=fastq.gz
We can install the CLI tool into our container. We will need to modify the download app to call either a series of commands or a script rather than the python script. We can provide several options with default values:
| Flag | Function |
| --- | --- |
| `-s`, `--api-server` | The API server |
| `-i`, `--run-id` | The run ID |
| `-o`, `--out-dir` | The output directory |
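A hypothetical sketch of the modified download app: parse the proposed options with `getopts`, then hand off to the BaseSpace CLI. The defaults are illustrative, the example invocation is simulated with `set --`, and the final command is only echoed (a dry run) rather than executed.

```shell
api_server="https://api.basespace.illumina.com"
run_id=""
out_dir="/opt/analysis"

# Example invocation (arguments are illustrative).
set -- -i 12345 -o /work/data

while getopts "s:i:o:" opt; do
  case "$opt" in
    s) api_server="$OPTARG" ;;
    i) run_id="$OPTARG" ;;
    o) out_dir="$OPTARG" ;;
  esac
done

# Dry run: show the BaseSpace CLI command that would be executed.
echo bs download run -i "$run_id" -o "$out_dir" --api-server "$api_server"
```

In the real app, the `echo` would be dropped so that `bs download run` actually executes inside the container.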
In the notebook analyses, users directly provide filenames, options, and settings. Typos and errors can introduce issues further downstream, at which point debugging can be difficult.
All possible values should be printed before input is accepted. User input should be checked for potential issues immediately after it is accepted.
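A minimal sketch of that pattern (names and values are illustrative, not actual MIPTools identifiers): hold the valid options, then validate user input immediately after it is accepted.

```python
VALID_PROBE_SETS = {"DR1", "VAR4"}  # hypothetical example values


def read_probe_sets(raw: str) -> list:
    """Split a comma-separated string and reject unknown probe sets."""
    chosen = [s.strip() for s in raw.split(",") if s.strip()]
    unknown = [s for s in chosen if s not in VALID_PROBE_SETS]
    if unknown:
        raise ValueError(
            f"Unknown probe sets {unknown}; valid options: {sorted(VALID_PROBE_SETS)}"
        )
    return chosen


print(read_probe_sets("DR1, VAR4"))  # ['DR1', 'VAR4']
```

A typo such as `"DR1, VAR5"` would fail immediately with a message listing the valid options, rather than surfacing as an opaque error further downstream.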
The pipeline produces sample sheets associated with wrangler files, and these might contain commas, have missing values, or contain quotation marks. The analysis notebooks misread these files and report that 0 haplotypes have been found. This can cause errors further along in the pipeline, so researchers will manually edit entries with commas to avoid the issue.
Ideally, any output from one tool should be accepted as input to the next without need for modification. For now, it may be sufficient to include a warning in the analysis notebooks, since the error message downstream doesn't indicate the cause of the issue.
I was not able to replicate this exact issue, but several researchers have experienced a problem here.
The test analysis template found here is currently outdated. Due to changes in pandas, one of the code chunks in the call genotypes and prevalences section will crash:
mutant_index = targeted_prev.columns.droplevel(["Gene", "Mutation Name", "ExonicFunc"])
expected_freq = expected_freq.loc[mutant_index]
i_map = {(i[0], i[4]): i for i in targeted_prev.columns}
expected_freq.index = pd.MultiIndex.from_tuples(
[i_map[i] for i in expected_freq.index],
names=targeted_prev.columns.names)
expected_freq.fillna(0, inplace=True)
#> KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
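A minimal demonstration of the pandas change and one possible fix (the data here is illustrative, not from the real template): modern pandas raises a `KeyError` when `.loc` receives labels missing from the index, whereas `.reindex` keeps the requested labels and fills the missing ones with `NaN`, which the template's later `fillna(0)` already handles.

```python
import pandas as pd

expected_freq = pd.Series([0.1, 0.2], index=["a", "b"])
mutant_index = pd.Index(["a", "c"])  # "c" is absent from expected_freq

# expected_freq.loc[mutant_index]  # KeyError in modern pandas

# reindex inserts NaN for missing labels instead of raising.
fixed = expected_freq.reindex(mutant_index).fillna(0)
print(fixed.to_dict())  # {'a': 0.1, 'c': 0.0}
```

In the template, replacing `expected_freq.loc[mutant_index]` with `expected_freq.reindex(mutant_index)` may therefore be sufficient, assuming rows absent from `expected_freq` should be treated as frequency 0.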
Comparing the analysis of the test data set file with the current analysis template, the files are quite different from one another. It may be worthwhile to update the test data analysis file using the provided template. This would go hand in hand with the efforts in 69ff95f, which provides improved documentation on analyzing the test data.
When the `freebayes_caller.py` script is called from within the container with the default settings, the script crashes. I traced this to the number-of-threads option: when the number of threads is set to 1 (the default), the script fails. This likely means that `freebayes_call()` is unable to run with 1 thread and that some internal function is causing this bug.
We should set the default number of threads to a different number. We could use 20, which is the default in the Jupyter notebooks.
Follow the analysis pipeline guide and run through the Jupyter app until you reach the Freebayes step. At this point, you can run the `freebayes_caller.py` script.
# This command fails
singularity exec \
-B test-data/DR1_project_resources:/opt/project_resources \
-B test-data/pf_species_resources:/opt/species_resources \
-B test-data/wrangler:/opt/data \
-B test-data/variant:/opt/analysis \
miptools_v0.4.0.sif \
python /opt/src/freebayes_caller.py --threads=1
# This command works
singularity exec \
-B test-data/DR1_project_resources:/opt/project_resources \
-B test-data/pf_species_resources:/opt/species_resources \
-B test-data/wrangler:/opt/data \
-B test-data/variant:/opt/analysis \
miptools_v0.4.0.sif \
python /opt/src/freebayes_caller.py --threads=2
This issue will be used to implement the following solution to #9:
Solution: Manually build and upload the container to ~~GitHub Packages~~ the Sylabs Cloud. For this, we can use Seekdeep, which is able to build the image.
Limitations: We need to manually build the container rather than automating the process.
In particular, every two weeks, a bot will comment on this issue to remind the maintainers to build the container and upload it to the Sylabs Cloud.
Pulling the container will then be as easy as running the following:
singularity pull library://apascha1/miptools/miptools:{tag}
The `--stitch-options` (`-x`) argument's help indicates that it defines the "probe set to be processed." I suspect that this is actually incorrect, as there is already an argument `--probe-sets` (`-p`).
Experimenting with the generate wrangler script seems to indicate that `--stitch-options` allows the user to specify additional arguments when extracting the sequences and stitching forward and reverse reads using `MIPWrangler mipSetupAndExtractByArm`. @aydemiro, can you clarify what `--stitch-options` is used for?
If `--stitch-options` is indeed used to pass additional arguments to `MIPWrangler mipSetupAndExtractByArm`, there may be a bug in the code. I would expect that we would pass additional arguments using a list of arguments, for example: `"--overWriteDirs,--overWriteLog"`. However, if we attempt to pass that into the wrangler app,
stitch_options="--overWriteDirs,--overWriteLog"
singularity run --app wrangler \
-B $project_resources:/opt/project_resources \
-B $fastq_dir:/opt/data \
-B $wrangler_dir:/opt/analysis \
$miptools -p $probe_sets_used -s $sample_sets_used -e $experiment_id \
-l $sample_list -c $cpu_number -m $min_capture_length -x $stitch_options
The code crashes with the following error: `generate_wrangler_scripts.py: error: argument -x/--stitch-options: expected one argument`. This is caused by `argparse`; it is a known issue that `argparse` will not correctly parse values that begin with dashes. To address this, we could ask users to input the names of the flags without the dashes and then add the dashes back within the python script.
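A sketch of that workaround (the `-x`/`--stitch-options` flag matches the wrangler script, but the parsing shown here is a standalone illustration): users supply the MIPWrangler option names without leading dashes, and the script adds them back before forwarding.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-x", "--stitch-options", default="")

# The value has no leading dash, so argparse accepts it without complaint.
args = parser.parse_args(["-x", "overWriteDirs,overWriteLog"])

# Re-add the dashes before passing the options on to MIPWrangler.
stitch_flags = ["--" + opt for opt in args.stitch_options.split(",") if opt]
print(stitch_flags)  # ['--overWriteDirs', '--overWriteLog']
```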
Github Actions could be a useful tool for us to test, document, and even build our container.
While there are many actions we could eventually incorporate, as of now, the one that most catches my eye is the ability to build our container each time we push. The https://github.com/singularityhub/github-ci repo provides three example actions that we could experiment with.
With these actions, we can build and test the container. We can also use the Github Package repository to store our built containers. This would make it easier for users to download the built container, rather than needing to build it on their own machine.
Before I go ahead and try this out, @AshlinHarris what do you think about implementing this action and Github Actions in general?
The MUSCLE multiple sequence alignment program underwent a major improvement to version 5 in late 2021. The authors state that:
Muscle v5 is much more accurate, is often faster, and scales to much larger datasets.
Given these improvements, we should upgrade the current binary included in MIPTools.
In the development version of MIPTools, the notebook extension `plotlywidget/extension` reports problems. The extension still appears to be enabled (by checking `jupyter nbextension list`); however, there may be problems when trying to use this extension within any notebooks.
The reprex provided here shows the message when running within the Singularity shell. This message also appears when trying to run the `jupyter` app.
$ singularity shell miptools_v0.4.0.sif
Singularity> jupyter nbextension enable plotlywidget/extension
#> Enabling notebook extension plotlywidget/extension...
#> - Validating: problems found:
#> - require? X plotlywidget/extension
When attempting to build {MIPTools} v1.0.0, I get the following error:
One potential culprit is the {McCOILR} package. I have created a PR OJWatson/McCOILR#7 which aims to address the build errors of {McCOILR}. Hopefully, this will resolve the {MIPTools} build error.
When attempting to build MIPTools, we are unable to build the elucidator program. The build process crashes with an error message indicating that the C and C++ compilers were not found. This is most likely due to nickjhathaway/elucidator@9bcfe9c, where the compiler versions were upgraded. Now, `gcc-10` (the C compiler) and `g++-10` (the C++ compiler) are required.
The MIPTools container is built on Ubuntu 20.04. We install GCC via `apt-get` and g++ via `anaconda`. However, for Ubuntu 20.04, the default version of GCC/g++ is `gcc-9`/`g++-9`. To fix this bug, we likely need to specify the installation of GCC/g++ version 10.
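A hedged sketch of the fix as a Singularity definition-file fragment (the header lines are illustrative; MIPTools' actual definition differs, and whether `update-alternatives` is needed depends on how the elucidator build locates its compilers):

```shell
Bootstrap: docker
From: ubuntu:20.04

%post
    # Install GCC/g++ 10 explicitly, since Ubuntu 20.04 defaults to 9.
    apt-get update && apt-get install -y gcc-10 g++-10
    # Point the default compiler names at version 10.
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 100
    update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 100
```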
It would be helpful to provide users with a list of all the files that MIPTools generates, along with a brief description of each one. Many files are generated, so it can be confusing to navigate them and determine which one will best serve your downstream analysis needs. Although different files may be generated depending on how you use MIPTools, it would be ideal to have a "master list" of all files.
(Generated from two recent runs)
Please add any files I have missed!