
FilTar's Issues

Reduce inefficiency in use of whole-genome alignments

At the moment, in order to generate multiple sequence alignments for each 3' UTR, FilTar downloads entire whole genome alignments and then extracts the relevant genomic co-ordinates (using the transcript co-ordinates of a reference species).

This is extremely inefficient and costly. FilTar should instead download MSAs on a UTR-by-UTR basis by querying the UCSC server for only the regions that are actually required.
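As a rough sketch of the region-wise approach (assuming, which would need to be verified, that the UCSC REST API at api.genome.ucsc.edu can serve the relevant multiz alignment track through its getData/track endpoint; the genome, track name and coordinates below are purely illustrative):

    # Sketch only: query the UCSC REST API for a single 3' UTR region instead of
    # downloading a whole-genome alignment. Whether the multiz alignment tracks
    # are available through this endpoint is an assumption to be verified.
    import requests

    def fetch_utr_alignment(genome, track, chrom, start, end):
        url = "https://api.genome.ucsc.edu/getData/track"
        params = {
            "genome": genome,  # e.g. "hg38"
            "track": track,    # e.g. "multiz100way" (illustrative)
            "chrom": chrom,
            "start": start,
            "end": end,
        }
        response = requests.get(url, params=params, timeout=60)
        response.raise_for_status()
        return response.json()

    # Example call for a single reference-species 3' UTR (coordinates illustrative):
    # alignment = fetch_utr_alignment("hg38", "multiz100way", "chr1", 155160000, 155162000)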

Manage TargetScan dependencies within a single conda environment

The main TargetScan dependencies are RNAplfold (part of the broader ViennaRNA package), the Perl Statistics::Lite module, and the Perl Bio::TreeIO module (which depends on BioPerl).

The major hindrance in overcoming this issue is that there is currently no implementation of Bio::TreeIO on conda. Having Bio::TreeIO on conda would (in theory) allow us to manage TargetScan easily within the same conda environment.

Until then, we have to install a niche conda-forge distribution of Perl in order to obtain cpanm, and then use cpanm to install Bio::TreeIO, which is cumbersome. This method also requires users to apply a patch to one of the BioPerl dependencies using CPAN, which is far from ideal.

An alternative would be to use the system Perl (owned by root) to install the Perl dependencies into a user-defined library path. However, Snakemake itself depends on Perl, so a conda-specific Perl in the FilTar conda environment is probably necessary. This approach is therefore only possible if we forfeit our general Snakemake/FilTar environment, which is not advisable.

From reading the bioconda documentation, it seems that the addition of packages to bioconda is done through an automated process using GitHub pull requests with checks by a CI application. Looking into building our own bioconda package would definitely be the next step in resolving this issue.

At the moment, there are no good options here.
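For illustration, the kind of single environment file that a Bio::TreeIO recipe would enable might look roughly like the following (the perl-bio-treeio package name is hypothetical; no such recipe currently exists):

    # Hypothetical single conda environment for the TargetScan rules.
    # "perl-bio-treeio" is an assumed package name, not an existing recipe.
    name: filtar-targetscan
    channels:
      - conda-forge
      - bioconda
    dependencies:
      - viennarna            # provides RNAplfold
      - perl
      - perl-statistics-lite
      - perl-bio-treeio      # the missing piece described above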

Improve cluster support/integration

At the moment, the application can very easily be run on standard HPC architectures using standard HPC job schedulers (e.g. SLURM or LSF). However, in the current implementation, the whole pipeline is run as one job, with a single set of core and memory requirements applied to every rule.

Snakemake, however, allows developers to implement HPC resource management and job submission on a rule-by-rule basis. Taking advantage of this feature would reduce the overall cost of FilTar HPC submissions, likely reduce HPC queue waiting times, and probably make pipeline logging and administration easier as well.
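As a sketch (the rule name, paths and values below are illustrative rather than taken from FilTar), each Snakemake rule can declare its own threads and memory, which the cluster submission command can then forward to the scheduler:

    # Illustrative rule with per-rule thread and memory declarations.
    rule salmon_quant_example:
        input:
            index="results/salmon/indexes/hsa",
            reads="results/trimmed/{sample}.fastq.gz"
        output:
            "results/salmon/{sample}/quant.sf"
        threads: 8
        resources:
            mem_mb=16000
        shell:
            "salmon quant -i {input.index} -l A -r {input.reads} "
            "-p {threads} -o results/salmon/{wildcards.sample}"

These declarations can then be passed to, for example, SLURM with something along the lines of snakemake --jobs 50 --cluster "sbatch --cpus-per-task={threads} --mem={resources.mem_mb}" (the exact flags depend on the scheduler and the Snakemake version in use).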

Encode metadata in TSV format rather than YAML

Encoding metadata in a serialisation language such as YAML has quite a few drawbacks: either you have to heavily normalise the data (making data association more difficult), or you have to encode the data in highly nested structures, which are also quite difficult to manipulate and reason about.

It would be better to encode all metadata in TSV format. This would necessitate rescripting of the downstream workflow management processes, but it is likely that this rescripting would also make workflow management simpler and more interpretable.
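A minimal sketch of the TSV-based pattern, assuming a hypothetical sample sheet at config/samples.tsv with a "sample" column (neither the path nor the column name reflects FilTar's current layout):

    # Sketch: read a flat TSV sample sheet at the top of the Snakefile.
    import pandas as pd

    samples = pd.read_csv("config/samples.tsv", sep="\t").set_index("sample", drop=False)

    rule all:
        input:
            expand("results/salmon/{sample}/quant.sf", sample=samples["sample"])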

It would also be a good idea to separate configuration data and metadata into different files, which is not currently the case.

Simplify conda environment management

The application in its current state (i.e. v1.2.3) implements conda environment management on a per-rule basis in Snakemake. This is excessive, as the vast majority of the dependencies of individual rules do not conflict with each other. The resulting complexity needlessly increases the overhead for the maintainer of the repository and is generally more confusing for end-users.

A better solution would be to implement a single main environment which most rules reference, and then add additional environments on a rule-by-rule basis where necessary. This approach may be too crude, however, and a compromise between the current and the proposed state may end up being implemented.
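A sketch of the proposed layout, with most rules pointing at one shared environment file and only exceptional rules keeping their own (rule names and file paths are illustrative):

    # Most rules reference a single shared environment...
    rule example_shared_env:
        output:
            "results/example_shared.txt"
        conda:
            "envs/main.yaml"        # shared environment used by most rules
        shell:
            "echo 'uses the shared environment' > {output}"

    # ...and only rules with conflicting dependencies get a dedicated one.
    rule example_dedicated_env:
        output:
            "results/example_dedicated.txt"
        conda:
            "envs/targetscan.yaml"  # rule-specific environment where necessary
        shell:
            "echo 'uses a dedicated environment' > {output}"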

A few general queries

Hi @TBradley27

This tool's really great. I have a few very basic questions. Can you please provide your input on them?

  1. Can I provide an independent GTF file (non-GENCODE) as transcript input?
  2. Can I specify the parameters for transcript expression?
  3. Finally, have you tried it on long-read FASTQ files?

Thanks

Move rule configuration values backwards into YAML config files

A lot of rule-specific configuration at the moment (i.e. v1.2.3) is achieved by manually editing the relevant Snakefiles. This is not ideal, as we never really want the user to have to go into the Snakefiles themselves and edit them in order to reconfigure how FilTar is executed.

If we bring this configuration backwards, closer to the root of the project, and into purpose-built YAML config files, then this will make life a lot easier for the user. It is also a benefit to be able to see the entire project configuration in a single YAML file (or a small number of them) rather than having to hunt for this information in nested file structures.
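For example (the key names below are illustrative, not FilTar's actual configuration schema), rule parameters would live in a purpose-built YAML file and be read through Snakemake's config object rather than edited directly in the Snakefiles:

    # Sketch: parameters read from config.yaml instead of being hard-coded.
    #
    # config.yaml might contain, for example:
    #   salmon:
    #     kmer_size: 31
    configfile: "config.yaml"

    rule salmon_index_example:
        input:
            "data/Homo_sapiens.GRCh38.cdna.all.fa"
        output:
            directory("results/salmon/indexes/hsa")
        params:
            kmer=config["salmon"]["kmer_size"]
        threads: 8
        shell:
            "salmon index --threads {threads} -t {input} -i {output} -k {params.kmer}"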

Make some data download functionality non-default

The tool in its present state (i.e. v1.2.3) by default downloads an entire genome into the data/ directory if the user does not already have that information contained within this directory.

This is generally bad practice, as many different applications act on genomes, and it is not reasonable to expect a fresh download of each genome on an application-by-application basis. The same principle could perhaps be extended to other sequence data files currently used by FilTar. Instead, it would be better if genome files were discoverable by the application through the use of symbolic links.
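For example (paths are purely illustrative), an existing local genome could be exposed to FilTar through a symbolic link rather than downloaded again:

    # Illustrative only: make an existing genome visible inside data/ via a
    # symbolic link instead of triggering a fresh download.
    ln -s /shared/reference/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
          data/Homo_sapiens.GRCh38.dna.primary_assembly.fa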

It would be for the best if some aspects of data download were made non-default unless explicitly requested by the user.

NB: One potential caveat is that it is not currently understood how Snakemake acts upon symbolic links, which would need to be explored.

NB: Another caveat is that effectively de-automating the pre-alignment processes will increase the overhead on the user with respect to making sure that genome and annotation files match. Careful exception handling ought to be implemented here.

Version tracking of dependencies

As each dependency is installed within its own conda environment, it is difficult to track the version of each dependency. Sometimes the version of a dependency that is downloaded and installed is not the same as the version specified in the conda environment configuration file.

It would be good to determine a way of summarising dependency version information in a form which is easily accessible and interpretable to users.
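One possible way to gather this information (a sketch; .snakemake/conda/ is where Snakemake normally creates the per-rule environments) is simply to list the resolved packages in every generated environment:

    # Sketch: print the installed package versions for each conda environment
    # that Snakemake has created for the workflow.
    for env in .snakemake/conda/*/; do
        echo "== ${env} =="
        conda list --prefix "${env}"
    done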

FilTar error message

Hi Dr. Thomas Bradley,

I found your paper, FilTar: using RNA-Seq data to improve microRNA target prediction accuracy in animals, and your program really fits my project.

On the first attempt with default parameters, the following error message is displayed:

Error in rule salmon_index_for_lib_types:
    jobid: 810
    output: results/salmon/indexes/lib_type_identification/hsa
    conda-env: /home/wooje/NGS_tools/FilTar-master/.snakemake/conda/3399c010
    shell:
        salmon index --threads 8 -t data/Homo_sapiens.GRCh38.cdna.all.fa -i results/salmon/indexes/lib_type_identification/hsa --type quasi -k 31
        (exited with non-zero exit code)

Error in rule salmon_index:
    jobid: 575
    output: results/salmon/indexes/hsa
    conda-env: /home/wooje/NGS_tools/FilTar-master/.snakemake/conda/3399c010
    shell:
        salmon index --threads 8 -t data/Homo_sapiens.GRCh38.cdna.all.fa -i results/salmon/indexes/hsa --type quasi -k 31
        (exited with non-zero exit code)

Because I also receive "Exception : [Error: RapMap-based indexing is not supported in this version of salmon.]", I may need to replace the latest version of Salmon with v0.11.3.
But I am new to computer programming, so I don't know how to do this.
Could you please help me solve the problem?

Thank you.

Best regards,
Wooje
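For reference, one possible workaround (unverified, and assuming the bioconda channel still hosts the older release) is to pin salmon to version 0.11.3 in the environment used for indexing, or to install it there directly:

    # Possible workaround (unverified): install the older salmon release into the
    # active environment, assuming bioconda still provides version 0.11.3.
    conda install -c bioconda salmon=0.11.3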

Make read trimming optional

There are some studies which demonstrate that trimming RNA-Seq reads does not have a substantial impact on the accuracy of gene expression quantification. For example, the following preprint:

https://www.biorxiv.org/content/10.1101/833962v1.abstract

Removing trimming from the workflow would substantially reduce overall workflow runtime.

Therefore, it would be beneficial if trimming in FilTar were made optional at this stage, while remaining the default behaviour (until I give the topic more thorough consideration).
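A sketch of one way this could be exposed, using a hypothetical trim_reads flag in the config file (defaulting to true, so that trimming remains the default) and an input function that switches between trimmed and raw reads; names and paths are illustrative:

    # Sketch: optional trimming controlled by a hypothetical "trim_reads" flag.
    configfile: "config.yaml"

    def reads_for_quantification(wildcards):
        # Use trimmed reads by default; fall back to raw reads if trimming is disabled.
        if config.get("trim_reads", True):
            return f"results/trimmed/{wildcards.sample}.fastq.gz"
        return f"data/{wildcards.sample}.fastq.gz"

    rule quantify_example:
        input:
            reads_for_quantification
        output:
            "results/salmon/{sample}/quant.sf"
        threads: 4
        shell:
            "salmon quant -i results/salmon/indexes/hsa -l A -r {input} "
            "-p {threads} -o results/salmon/{wildcards.sample}"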
