tbradley27 / filtar
Using RNA-Seq data to improve microRNA target prediction accuracy in animals
Home Page: https://tbradley27.github.io/FilTar/
License: GNU General Public License v3.0
Fix linting and formatting according to the criteria defined here:
At the moment, in order to generate multiple sequence alignments for each 3' UTR, FilTar downloads entire whole genome alignments and then extracts the relevant genomic co-ordinates (using the transcript co-ordinates of a reference species).
This is extremely inefficient and costly. FilTar should instead download MSAs on a UTR-by-UTR basis by more cleverly interacting with the UCSC server.
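One candidate approach, assuming the UCSC Kent utility mafFetch behaves as documented (it reads regions from a sorted BED file and retrieves only the overlapping MAF blocks from UCSC's public database, given a suitable hg.conf), would look something like this sketch (file names and track choice are illustrative):

```shell
# Hypothetical per-UTR fetch: utr_regions.bed holds the 3' UTR coordinates
# of the reference species; only the overlapping MAF blocks are retrieved,
# instead of downloading the whole-genome alignment.
mafFetch hg38 multiz100way utr_regions.bed utr_alignments.maf
```

Whether this scales to thousands of UTRs per run (and whether UCSC's usage policy permits it) would need to be checked before committing to this design.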
The main targetscan dependencies are rnaplfold (part of the broader viennarna package), the perl Statistics::Lite module, and the perl Bio::TreeIO package (which depends on Bio::Perl).
The major hindrance in overcoming this issue is that there is currently no conda package for Bio::TreeIO. Having Bio::TreeIO available on conda would allow us (in theory) to manage targetscan easily within the same conda environment.
Until we can do this, we have to awkwardly install a niche conda-forge distribution of perl in order to install cpanm, and then install Bio::TreeIO through cpanm, which is cumbersome. This method also requires users to awkwardly apply a patch to one of the BioPerl dependencies using CPAN (which is really not ideal!).
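The current workaround described above amounts to something like the following sketch (package names reflect current conda-forge/bioconda naming conventions, but are an assumption):

```shell
# Cumbersome current workaround: a conda-forge perl plus cpanm,
# then Bio::TreeIO from CPAN.
conda create -n filtar-perl -c conda-forge -c bioconda perl perl-app-cpanminus
conda activate filtar-perl
cpanm Bio::TreeIO   # pulls in BioPerl; one dependency may still need a manual CPAN patch
```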
An alternative to this approach would be to use the system perl owned by root to install perl dependencies in a user-defined library/path. However, snakemake itself pulls in perl as a dependency, so a conda-specific perl in the filtar conda environment is probably necessary. This approach is therefore not possible unless we forfeit the use of our general snakemake/filtar environment, which is not advisable.
From reading the bioconda documentation, it seems that the addition of packages to bioconda is done through an automated process using GitHub pull requests with checks by a CI application. Looking into building our own bioconda package would definitely be the next step in resolving this issue.
At the moment, there are no good options here.
Bedgraph files undergo a number of mutations which are not checked by APAtrap, so we have to validate them ourselves.
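A minimal sketch of the kind of per-line check we could run ourselves (the function name and accepted dialect are assumptions; it verifies column count, coordinate sanity, and a numeric value):

```python
def validate_bedgraph_line(line: str) -> bool:
    """Return True if a single bedGraph data line looks well-formed.

    Checks: exactly four whitespace-separated columns, integer
    start/end with 0 <= start < end, and a numeric value column.
    """
    fields = line.strip().split()
    if len(fields) != 4:
        return False
    chrom, start, end, value = fields
    try:
        start, end = int(start), int(end)
        float(value)
    except ValueError:
        return False
    return 0 <= start < end

print(validate_bedgraph_line("chr1\t100\t200\t3.5"))  # True
print(validate_bedgraph_line("chr1\t200\t100\t3.5"))  # False (inverted coordinates)
```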
At the moment, the application can easily be run on standard HPC architectures using standard HPC job schedulers (e.g. Slurm/LSF). However, in the current implementation, the whole pipeline is run as one job, with a single set of core requirements and a single set of memory requirements for each rule.
Snakemake, however, allows developers to implement HPC resource management and submission on a rule-by-rule basis. Taking advantage of this feature will reduce the overall cost of FilTar HPC submissions, likely reduce HPC queue waiting times, and probably make pipeline logging/administration easier as well.
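Per-rule resource declarations in Snakemake look roughly like this sketch (the rule and values are hypothetical); with a cluster profile, these feed into each job's scheduler submission rather than one global request:

```python
rule align_reads:
    input: "trimmed/{sample}.fastq.gz"
    output: "mapped/{sample}.bam"
    threads: 8
    resources:
        mem_mb=16000
    shell:
        "hisat2 -p {threads} -x index -U {input} | samtools sort -o {output}"
```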
Rather than try to have this inferred from running Salmon
Make use of the '--profile' flag in order to determine configuration schemes for cluster use
Necessary for mapping to miRNA name identifiers, but otherwise unnecessary and should be avoided
Add a simple diagram to give a very quick, general overview to users of the core processes undertaken during a single FilTar run
Encoding metadata in a serialisation language such as YAML has quite a few drawbacks: either you have to heavily normalise the data (making data association more difficult), or you have to encode the data in highly nested structures, which are also quite difficult to manipulate or reason about.
It would be better to encode all metadata in tsv format. This would necessitate rescripting of downstream workflow management processes. It is also possible that this rescripting will make workflow management simpler, more interpretable and easier to manage.
It is also a good idea to separate configuration data and metadata into different files, which is not currently the case.
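As a sketch of the proposed direction, flat TSV metadata (column names here are hypothetical) parses directly into a list of records with the standard library, with no nesting to unpick:

```python
import csv
import io

# Hypothetical sample metadata as flat TSV rather than nested YAML.
tsv = "sample\tspecies\tcondition\nSRR001\thsa\ttumour\nSRR002\thsa\tnormal\n"
records = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

print(records[0]["species"])  # hsa
print(len(records))           # 2
```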
The application in its current state (i.e. v1.2.3) implements conda environment management on a snakemake rule-by-rule basis. This is excessive, as the vast majority of dependencies for individual rules will not conflict with each other. This unnecessary complexity needlessly increases the overhead for the maintainer of the repository, and it is generally more confusing for end-users.
A better solution would be to implement a single main environment which most rules reference, and then add additional environments on a rule-by-rule basis where necessary. This approach may be too crude, however, and a compromise between the current and the proposed state may end up being implemented.
At the moment, FilTar can only process GTF files in the format specified by Ensembl/gencode. I would like to make FilTar more flexible in this regard.
Hi @TBradley27
This tool is really great. I have a few very basic questions. Can you please provide your input on them?
Thanks
A lot of rule-specific configuration at the moment (i.e. v1.2.3) is achieved by manually editing the relevant Snakefiles. This is not ideal, as we don't ever really want the user to have to go into the Snakefiles themselves and edit them in order to reconfigure how FilTar is executed.
If we bring this configuration backwards, closer to the root, and into purpose-built YAML config files, then this will make life a lot easier for the user. It is also a benefit to see the entire project configuration in one or a small number of YAML files rather than having to hunt for this information in nested file structures.
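A purpose-built config file might look like this sketch (all key names are illustrative, not FilTar's actual schema); rules would then read values through Snakemake's config object, e.g. config["prediction"]["algorithm"], instead of hard-coded values inside Snakefiles:

```yaml
species: hsa
prediction:
  algorithm: targetscan
  context_scores: true
trimming:
  enabled: true
```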
The tool in its present state (i.e. v1.2.3) by default downloads an entire genome into the data/ directory if the user doesn't already have that information contained within this directory.
This is generally bad practice, as many different applications act on genomes, and it is not reasonable to expect a fresh download of genomes on an application-by-application basis. The same principle could perhaps be extended to other sequence data files currently used by FilTar. Instead, it would be better if genome files were discoverable by the application through the use of symbolic links.
It would be best if some aspects of data download were made non-default unless explicitly requested by the user.
NB: One potential caveat is that it is not currently understood how snakemake acts upon symbolic links; this would need to be explored.
NB: Another caveat is that effectively de-automating pre-alignment processes will increase the overhead for the user with respect to making sure that genome and annotation files match. Careful exception handling ought to be implemented here.
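The symlink behaviour in question can be sketched quickly: a genome file kept in a shared location, exposed inside a data directory via a symbolic link. Ordinary reads follow the link transparently, but path- and link-aware checks see the link itself, which is the part that needs exploring for snakemake's path/timestamp handling:

```python
import os
import tempfile

# Stand-ins for a shared genome store and FilTar's data/ directory.
shared = tempfile.mkdtemp()
data = tempfile.mkdtemp()

genome = os.path.join(shared, "genome.fa")
with open(genome, "w") as fh:
    fh.write(">chr1\nACGT\n")

link = os.path.join(data, "genome.fa")
os.symlink(genome, link)

content = open(link).read()                                   # reads follow the link
is_link = os.path.islink(link)                                # but the path is a link
same_file = os.path.realpath(link) == os.path.realpath(genome)

print(content.startswith(">chr1"), is_link, same_file)  # True True True
```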
E.g. re-examine the 'mod' gtf files
Catch errors in cases in which there are typos or mis-specifications of target prediction algorithm names
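A minimal sketch of such a check (the function name and the set of supported algorithms are illustrative); close matches can even be suggested with the standard library:

```python
import difflib

SUPPORTED_ALGORITHMS = {"targetscan", "miranda"}  # illustrative set

def check_algorithm(name: str) -> str:
    """Raise a helpful error on typos or mis-specified algorithm names."""
    if name in SUPPORTED_ALGORITHMS:
        return name
    hint = difflib.get_close_matches(name, SUPPORTED_ALGORITHMS, n=1)
    suggestion = f" Did you mean '{hint[0]}'?" if hint else ""
    raise ValueError(f"Unknown target prediction algorithm: '{name}'.{suggestion}")

print(check_algorithm("targetscan"))  # targetscan
# check_algorithm("targetscn") raises:
#   ValueError: Unknown target prediction algorithm: 'targetscn'. Did you mean 'targetscan'?
```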
At the moment, the ability to install FilTar dependencies very much depends on whether the installation is attempted within a conda environment or not.
This requires further investigation
As each dependency is installed within its own conda environment, it is difficult to track the version of each dependency. Sometimes the dependency downloaded and installed is not the same as the version specified in the conda environment configuration file.
It would be good to determine a way of summarising dependency version information in a way which is easily accessible and interpretable to users.
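One starting point is to collect the pinned versions from the per-rule environment files into a single summary (the file content below is a stand-in, and this only covers the specified versions; the issue's point that installed versions can diverge would still need a separate runtime check):

```python
# Hypothetical contents of one per-rule conda environment file.
env_yaml = """\
channels:
  - bioconda
dependencies:
  - salmon=0.11.3
  - snakemake=5.4.0
"""

def pinned_versions(text: str) -> dict:
    """Extract name -> version pins from a conda environment file's text."""
    pins = {}
    for line in text.splitlines():
        entry = line.strip().lstrip("- ")
        if "=" in entry:
            name, _, version = entry.partition("=")
            pins[name] = version
    return pins

print(pinned_versions(env_yaml))  # {'salmon': '0.11.3', 'snakemake': '5.4.0'}
```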
Hi Dr. Thomas Bradley,
I found your paper, FilTar: using RNA-Seq data to improve microRNA target prediction accuracy in animals, and your program really fits my project.
On the first attempt with default parameters, the following error message is displayed:
Error in rule salmon_index_for_lib_types:
    jobid: 810
    output: results/salmon/indexes/lib_type_identification/hsa
    conda-env: /home/wooje/NGS_tools/FilTar-master/.snakemake/conda/3399c010
    shell:
        salmon index --threads 8 -t data/Homo_sapiens.GRCh38.cdna.all.fa -i results/salmon/indexes/lib_type_identification/hsa --type quasi -k 31
        (exited with non-zero exit code)

Error in rule salmon_index:
    jobid: 575
    output: results/salmon/indexes/hsa
    conda-env: /home/wooje/NGS_tools/FilTar-master/.snakemake/conda/3399c010
    shell:
        salmon index --threads 8 -t data/Homo_sapiens.GRCh38.cdna.all.fa -i results/salmon/indexes/hsa --type quasi -k 31
        (exited with non-zero exit code)
Because I also receive "Exception : [Error: RapMap-based indexing is not supported in this version of salmon.]", I may need to replace the latest version of Salmon with v0.11.3.
But I am new to computer programming, so I don't know how to replace it.
Could you please help me solve the problem?
Thank you.
Best regards,
Wooje
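One plausible way to pin Salmon to the older version mentioned above, assuming bioconda still packages v0.11.3, is a conda pin inside the environment FilTar uses (this is a sketch, not an official fix from the maintainer):

```shell
# Downgrade salmon to the pre-RapMap-removal version; package
# availability of this exact version on bioconda is an assumption.
conda install -c bioconda -c conda-forge salmon=0.11.3
```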
All scripts should have script-level documentation near the start of the script describing the general aim/purpose of that script
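For a Python script this could be as simple as a module docstring at the top of the file (the script name and wording here are hypothetical examples):

```python
"""trim_reads.py (hypothetical example)

Aim: trim adapter sequences from raw RNA-Seq reads prior to gene
expression quantification, writing trimmed FASTQ files to results/.
"""
```

Keeping the documentation in the docstring means tools such as pydoc and help() can surface it automatically.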
There is no reason why we should be excluding non-localised scaffolds and contigs
There are some studies which demonstrate that trimming of RNA-Seq reads does not have a substantial impact on the accuracy of the quantification of gene expression. For example, the following preprint:
https://www.biorxiv.org/content/10.1101/833962v1.abstract
Removing trimming from the workflow would substantially reduce overall workflow runtime.
Therefore, it would be beneficial if trimming in FilTar could be made optional at this stage, but kept as the default option nonetheless (until I give this topic more thoughtful consideration).
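An optional-but-default toggle could be sketched in Snakemake as follows (the "trim_reads" config key and all paths are hypothetical); when the flag is off, raw reads are passed straight through to quantification:

```python
def reads_for_quant(wildcards):
    if config.get("trim_reads", True):  # default: trimming on
        return "results/trimmed/{sample}.fastq.gz".format(sample=wildcards.sample)
    return "data/raw/{sample}.fastq.gz".format(sample=wildcards.sample)

rule quantify:
    input: reads_for_quant
    output: "results/quant/{sample}/quant.sf"
```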