tbradley27 / filtar
Using RNA-Seq data to improve microRNA target prediction accuracy in animals
Home Page: https://tbradley27.github.io/FilTar/
License: GNU General Public License v3.0
Fix linting and formatting according to the criteria defined here:
At the moment, in order to generate multiple sequence alignments for each 3' UTR, FilTar downloads entire whole genome alignments and then extracts the relevant genomic co-ordinates (using the transcript co-ordinates of a reference species).
This is extremely inefficient and costly. FilTar should instead download MSAs on a UTR-by-UTR basis by more cleverly interacting with the UCSC server.
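One candidate approach, assuming the UCSC Kent utility mafFetch behaves as documented (it reads regions from a sorted BED file and retrieves only the overlapping MAF blocks from UCSC's public database, given a suitable hg.conf), would look something like this sketch (file names and track choice are illustrative):

```shell
# Hypothetical per-UTR fetch: utr_regions.bed holds the 3' UTR coordinates
# of the reference species; only the overlapping MAF blocks are retrieved,
# instead of downloading the whole-genome alignment.
mafFetch hg38 multiz100way utr_regions.bed utr_alignments.maf
```

Whether this scales to thousands of UTRs per run (and whether UCSC's usage policy permits it) would need to be checked before committing to this design.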
The main targetscan dependencies are rnaplfold (part of the broader viennarna package), the perl Statistics::Lite module, and the perl Bio::TreeIO package (which depends on Bio::Perl).
The major hindrance in overcoming this issue is that there is currently no conda package for Bio::TreeIO. Having Bio::TreeIO available on conda would allow us (in theory) to manage targetscan easily within the same conda environment.
Until we can do this, we have to awkwardly install a niche conda-forge distribution of perl in order to install cpanm, and then install Bio::TreeIO through cpanm, which is cumbersome. This method also requires users to awkwardly apply a patch to one of the BioPerl dependencies using CPAN (which is really not ideal!).
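The current workaround described above amounts to something like the following sketch (package names reflect current conda-forge/bioconda naming conventions, but are an assumption):

```shell
# Cumbersome current workaround: a conda-forge perl plus cpanm,
# then Bio::TreeIO from CPAN.
conda create -n filtar-perl -c conda-forge -c bioconda perl perl-app-cpanminus
conda activate filtar-perl
cpanm Bio::TreeIO   # pulls in BioPerl; one dependency may still need a manual CPAN patch
```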
An alternative to this approach would be to use the system perl owned by root to install perl dependencies in a user-defined library/path. However, snakemake itself pulls in perl as a dependency, so a conda-specific perl in the filtar conda environment is probably necessary. This approach is therefore not possible unless we forfeit the use of our general snakemake/filtar environment, which is not advisable.
From reading the bioconda documentation, it seems that the addition of packages to bioconda is done through an automated process using GitHub pull requests with checks by a CI application. Looking into building our own bioconda package would definitely be the next step in resolving this issue.
At the moment, there are no good options here.
Bedgraph files undergo a number of mutations which are not checked by APAtrap, so we have to validate them ourselves.
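A minimal sketch of the kind of per-line check we could run ourselves (the function name and accepted dialect are assumptions; it verifies column count, coordinate sanity, and a numeric value):

```python
def validate_bedgraph_line(line: str) -> bool:
    """Return True if a single bedGraph data line looks well-formed.

    Checks: exactly four whitespace-separated columns, integer
    start/end with 0 <= start < end, and a numeric value column.
    """
    fields = line.strip().split()
    if len(fields) != 4:
        return False
    chrom, start, end, value = fields
    try:
        start, end = int(start), int(end)
        float(value)
    except ValueError:
        return False
    return 0 <= start < end

print(validate_bedgraph_line("chr1\t100\t200\t3.5"))  # True
print(validate_bedgraph_line("chr1\t200\t100\t3.5"))  # False (inverted coordinates)
```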
At the moment, the application can easily be run on standard HPC architectures using standard HPC job schedulers (e.g. Slurm/LSF). However, in the current implementation, the whole pipeline is run as one job, with a single set of core requirements and a single set of memory requirements for each rule.
Snakemake, however, allows developers to implement HPC resource management and submission on a rule-by-rule basis. Taking advantage of this feature will reduce the overall cost of FilTar HPC submissions, likely reduce HPC queue waiting times, and probably make pipeline logging/administration easier as well.
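Per-rule resource declarations in Snakemake look roughly like this sketch (the rule and values are hypothetical); with a cluster profile, these feed into each job's scheduler submission rather than one global request:

```python
rule align_reads:
    input: "trimmed/{sample}.fastq.gz"
    output: "mapped/{sample}.bam"
    threads: 8
    resources:
        mem_mb=16000
    shell:
        "hisat2 -p {threads} -x index -U {input} | samtools sort -o {output}"
```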
Rather than try to have this inferred from running Salmon
Make use of the '--profile' flag in order to determine configuration schemes for cluster use
Necessary for mapping to miRNA name identifiers, but otherwise unnecessary and should be avoided
Add a simple diagram to give a very quick, general overview to users of the core processes undertaken during a single FilTar run
Encoding metadata in a serialisation language such as YAML has quite a few drawbacks: either you have to heavily normalise the data (making data association more difficult), or you have to encode the data in highly nested structures, which are also quite difficult to manipulate or reason about.
It would be better to encode all metadata in tsv format. This would necessitate rescripting of downstream workflow management processes. It is also possible that this rescripting will make workflow management simpler, more interpretable and easier to manage.
It is also a good idea to separate configuration data and metadata into different files, which is not currently the case.
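As a sketch of the proposed direction, flat TSV metadata (column names here are hypothetical) parses directly into a list of records with the standard library, with no nesting to unpick:

```python
import csv
import io

# Hypothetical sample metadata as flat TSV rather than nested YAML.
tsv = "sample\tspecies\tcondition\nSRR001\thsa\ttumour\nSRR002\thsa\tnormal\n"
records = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

print(records[0]["species"])  # hsa
print(len(records))           # 2
```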
The application in its current state (i.e. v1.2.3) implements conda environment management on a snakemake rule-by-rule basis. This is excessive, as the vast majority of dependencies for individual rules will not conflict with each other. This unnecessary complexity needlessly increases the overhead for the maintainer of the repository, and it is generally more confusing for end-users.
A better solution would be to implement a single main environment which most rules reference, and then add additional environments on a rule-by-rule basis where necessary. This approach may be too crude, however, and a compromise between the current and the proposed state may end up being implemented.
At the moment, FilTar can only process GTF files in the format specified by Ensembl/gencode. I would like to make FilTar more flexible in this regard.
Hi @TBradley27
This tool is really great. I have a few very basic questions. Can you please provide your input on them?
Thanks
A lot of rule-specific configuration at the moment (i.e. v1.2.3) is achieved by manually editing the relevant Snakefiles. This is not ideal, as we don't ever really want the user to have to go into the Snakefiles themselves and edit them in order to reconfigure how FilTar is executed.
If we bring this configuration backwards, closer to the root, and into purpose-built YAML config files, then this will make life a lot easier for the user. It is also a benefit to see the entire project configuration in one or a small number of YAML files rather than having to hunt for this information in nested file structures.
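A purpose-built config file might look like this sketch (all key names are illustrative, not FilTar's actual schema); rules would then read values through Snakemake's config object, e.g. config["prediction"]["algorithm"], instead of hard-coded values inside Snakefiles:

```yaml
species: hsa
prediction:
  algorithm: targetscan
  context_scores: true
trimming:
  enabled: true
```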
The tool in its present state (i.e. v1.2.3) by default downloads an entire genome into the data/ directory if the user doesn't already have that information contained within this directory.
This is generally bad practice, as many different applications act on genomes, and it is not reasonable to expect a fresh download of genomes on an application-by-application basis. The same principle could perhaps be extended to other sequence data files currently used by FilTar. Instead, it would be better if genome files were discoverable by the application through the use of symbolic links.
It would be best if some aspects of data download were made non-default unless explicitly requested by the user.
NB: One potential caveat is that it is not currently understood how snakemake acts upon symbolic links; this would need to be explored.
NB: Another caveat is that effectively de-automating pre-alignment processes will increase the overhead for the user with respect to making sure that genome and annotation files match. Careful exception handling ought to be implemented here.
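The symlink behaviour in question can be sketched quickly: a genome file kept in a shared location, exposed inside a data directory via a symbolic link. Ordinary reads follow the link transparently, but path- and link-aware checks see the link itself, which is the part that needs exploring for snakemake's path/timestamp handling:

```python
import os
import tempfile

# Stand-ins for a shared genome store and FilTar's data/ directory.
shared = tempfile.mkdtemp()
data = tempfile.mkdtemp()

genome = os.path.join(shared, "genome.fa")
with open(genome, "w") as fh:
    fh.write(">chr1\nACGT\n")

link = os.path.join(data, "genome.fa")
os.symlink(genome, link)

content = open(link).read()                                   # reads follow the link
is_link = os.path.islink(link)                                # but the path is a link
same_file = os.path.realpath(link) == os.path.realpath(genome)

print(content.startswith(">chr1"), is_link, same_file)  # True True True
```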
E.g. re-examine the 'mod' gtf files
Catch errors in cases in which there are typos or mis-specifications of target prediction algorithm names
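A minimal sketch of such a check (the function name and the set of supported algorithms are illustrative); close matches can even be suggested with the standard library:

```python
import difflib

SUPPORTED_ALGORITHMS = {"targetscan", "miranda"}  # illustrative set

def check_algorithm(name: str) -> str:
    """Raise a helpful error on typos or mis-specified algorithm names."""
    if name in SUPPORTED_ALGORITHMS:
        return name
    hint = difflib.get_close_matches(name, SUPPORTED_ALGORITHMS, n=1)
    suggestion = f" Did you mean '{hint[0]}'?" if hint else ""
    raise ValueError(f"Unknown target prediction algorithm: '{name}'.{suggestion}")

print(check_algorithm("targetscan"))  # targetscan
# check_algorithm("targetscn") raises:
#   ValueError: Unknown target prediction algorithm: 'targetscn'. Did you mean 'targetscan'?
```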
At the moment, the ability to install FilTar dependencies very much depends on whether the installation is attempted within a conda environment or not.
This requires further investigation
As each dependency is installed within its own conda environment, it is difficult to track the version of each dependency. Sometimes the dependency downloaded and installed is not the same as the version specified in the conda environment configuration file.
It would be good to determine a way of summarising dependency version information in a way which is easily accessible and interpretable to users.
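One starting point is to collect the pinned versions from the per-rule environment files into a single summary (the file content below is a stand-in, and this only covers the specified versions; the issue's point that installed versions can diverge would still need a separate runtime check):

```python
# Hypothetical contents of one per-rule conda environment file.
env_yaml = """\
channels:
  - bioconda
dependencies:
  - salmon=0.11.3
  - snakemake=5.4.0
"""

def pinned_versions(text: str) -> dict:
    """Extract name -> version pins from a conda environment file's text."""
    pins = {}
    for line in text.splitlines():
        entry = line.strip().lstrip("- ")
        if "=" in entry:
            name, _, version = entry.partition("=")
            pins[name] = version
    return pins

print(pinned_versions(env_yaml))  # {'salmon': '0.11.3', 'snakemake': '5.4.0'}
```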
Hi Dr. Thomas Bradley,
I found your paper, FilTar: using RNA-Seq data to improve microRNA target prediction accuracy in animals, and your program really fits my project.
On the first attempt with default parameters, the following error message is displayed:
Error in rule salmon_index_for_lib_types:
    jobid: 810
    output: results/salmon/indexes/lib_type_identification/hsa
    conda-env: /home/wooje/NGS_tools/FilTar-master/.snakemake/conda/3399c010
    shell:
        salmon index --threads 8 -t data/Homo_sapiens.GRCh38.cdna.all.fa -i results/salmon/indexes/lib_type_identification/hsa --type quasi -k 31
        (exited with non-zero exit code)

Error in rule salmon_index:
    jobid: 575
    output: results/salmon/indexes/hsa
    conda-env: /home/wooje/NGS_tools/FilTar-master/.snakemake/conda/3399c010
    shell:
        salmon index --threads 8 -t data/Homo_sapiens.GRCh38.cdna.all.fa -i results/salmon/indexes/hsa --type quasi -k 31
        (exited with non-zero exit code)
Because I also receive "Exception : [Error: RapMap-based indexing is not supported in this version of salmon.]", I may need to replace the latest version of Salmon with v0.11.3.
But I am new to computer programming, so I don't know how to replace it.
Could you please help me solve the problem?
Thank you.
Best regards,
Wooje
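One plausible way to pin Salmon to the older version mentioned above, assuming bioconda still packages v0.11.3, is a conda pin inside the environment FilTar uses (this is a sketch, not an official fix from the maintainer):

```shell
# Downgrade salmon to the pre-RapMap-removal version; package
# availability of this exact version on bioconda is an assumption.
conda install -c bioconda -c conda-forge salmon=0.11.3
```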
All scripts should have script-level documentation near the start of the script describing the general aim/purpose of that script
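For a Python script this could be as simple as a module docstring at the top of the file (the script name and wording here are hypothetical examples):

```python
"""trim_reads.py (hypothetical example)

Aim: trim adapter sequences from raw RNA-Seq reads prior to gene
expression quantification, writing trimmed FASTQ files to results/.
"""
```

Keeping the documentation in the docstring means tools such as pydoc and help() can surface it automatically.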
There is no reason why we should be excluding non-localised scaffolds and contigs
There are some studies which demonstrate that trimming of RNA-Seq reads does not have a substantial impact on the accuracy of the quantification of gene expression. For example, the following preprint:
https://www.biorxiv.org/content/10.1101/833962v1.abstract
Removing trimming from the workflow would substantially reduce overall workflow runtime.
Therefore, it would be beneficial if trimming in FilTar could be made optional at this stage, but kept as the default option nonetheless (until I give this topic more thoughtful consideration).
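An optional-but-default toggle could be sketched in Snakemake as follows (the "trim_reads" config key and all paths are hypothetical); when the flag is off, raw reads are passed straight through to quantification:

```python
def reads_for_quant(wildcards):
    if config.get("trim_reads", True):  # default: trimming on
        return "results/trimmed/{sample}.fastq.gz".format(sample=wildcards.sample)
    return "data/raw/{sample}.fastq.gz".format(sample=wildcards.sample)

rule quantify:
    input: reads_for_quant
    output: "results/quant/{sample}/quant.sf"
```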