
transator-java's Introduction

TransATor

This repository contains the cheminformatics (PKS Structure Generator and Runner) and web parts (Web and REST) of TransATor. TransATor is a tool for identifying trans-AT PKS domains and then generating a first chemical hypothesis of the polyketide produced by the PKS enzyme in question. For more details see the wiki.

transator-java's People

Contributors

ayedo, ejnhelfrich, pcm32

transator-java's Issues

Task to be done for paper release

  • Produce new cladification HMMER models and set up directories (expected 0.5 days)
    • Details on #5 (spent so far 2 days)
    • We have a first version, but there are some issues with the cladification annotation files that need to be sorted out by Eric.
  • Python refactoring to new annotation scheme (0.5 - 1 day)
    • Details on #4 (spent so far 2 days, but touches upon other parts as well, like domain verifications, includes DH/PS)
  • Adapt Java code to Python refactoring (1 - 2 days) (spent so far 3 days, includes DH/PS)
  • Cheminformatics work (all refer to working document):
    • P1. DH/PS resolution, including testing (1 to 2 days)
      - Done through domain verification, with parts implemented in both Java and Python. DH/PS domains that don't pass the required verifications are dropped for the sake of molecule building. (1 day spent so far)
    • P1b. Use KR based stereo, including testing, might require some code re-design (how classes interact) (1 to 3 days)
    • P7. Work on termination rules implementation, using termination definition file (2 days)
    • P8. Work on sub features PK molecule changes (1 to 3 days)
    • P5. Add TE domains given to other domains models (0.5 days, surely less)
    • P6. Status of NRPS usage, write tests (1 day)
  • Other new issues that came up:
    • Fit the case where a starter monomer is found in the middle of the molecule into the new annotation scheme. This introduces some combinatorial aspects in the domain-based verification steps.
    • Fix the stereochemistry CDK exception seen in certain cases when moving to the latest CDK 2.0

Adapt Java code to Python refactoring

The Java modules need to be able to read the new annotation file and make use of it for shaping the PK molecule. 0.5 days used so far.

  • Release current version as 1.1 (the one that works with the flagged Python 0.1).
  • Add ability to read additional verification field for sequence feature.
  • Define in which sections of the Java code the annotation data is required.
    • Implement objects for CladeAnnotation
    • Implement ability to skip clades if they are not verified, use highest ranking
    • Update some of the tests to the new cladification annotation.
  • Identify clades that produce stereochemistry-related exceptions with non-planar bonds.
    • Move to newest CDK from 1.5.10, John suggests that this might fix this issue.
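
The clade-skipping item above can be read as "ignore unverified clades and take the best-ranked one that remains". The sketch below shows that selection logic only. It is written in Python for brevity although the task targets the Java code, and `CladeHit`, its fields, and the convention that rank 1 is the best hit are all assumptions made for illustration, not names from the repository.

```python
from typing import NamedTuple, Optional, Sequence


class CladeHit(NamedTuple):
    """Hypothetical container for one clade assignment of a domain."""
    clade_id: str
    rank: int        # assumed convention: 1 is the best-ranked hit
    verified: bool   # result of the domain verification step


def best_verified_clade(hits: Sequence[CladeHit]) -> Optional[CladeHit]:
    """Skip unverified clades and return the best-ranked remaining hit."""
    verified = [hit for hit in hits if hit.verified]
    if not verified:
        return None  # nothing passed verification
    return min(verified, key=lambda hit: hit.rank)
```

Returning `None` when no clade is verified leaves the fallback decision (drop the feature, or keep the top hit anyway) to the caller, since the task list does not settle it.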

Produce new cladification HMMER models and set up directories

  • Took 0.5 days to delve into code

  • Realised that we cannot use a single Newick file with the complete cladification; we need either a separate Newick file per clade or a FASTA-identifier-to-clade assignment, compatible with the IDs used in the annotation file.

  • Go into code and check what is needed, fix some bugs found (0.25)

  • Document what is needed (including the refactoring part, #4) (0.15)

  • Discuss needs with Eric (0.1)

  • Inspect new fasta to clade assignment file given by Eric (1 including fixes)

    • Fix errors
  • Try setup scripts with new input from Eric and document process (0.5)

  • Finish documentation for setup.

Python refactoring to new annotation scheme

The Python part needs to be refactored to use the new annotation scheme. Currently, the Python modules read the following from the annotation file:

  • Clade identifier
  • Description of the clade

The current annotation file is expected to be in the same path as the HMMER model, with an .annot file extension. For backward compatibility, we could add a flag for the new annotation file and read it if provided; otherwise, the previous annotation file is expected in its usual location. The class currently responsible for reading the annotation is hmmer/core/ModelAnnotator.

  • Write reader and related classes for new annotation scheme.
  • Add a backward-compatible ability to read the new annotation file, extracting from it the clade description that was previously obtained from the older annotation file.
    • Override the constructor of the class that processes it (ModelAnnotator) to obtain these descriptions from a CladeAnnotation object if provided.
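
The backward-compatible lookup described above could be sketched as follows. This is a minimal illustration, not the project's code: the flag name `--clade-annotation`, the function names, and the choice to replace the model's suffix with `.annot` (rather than append it) are all assumptions.

```python
import argparse
from pathlib import Path


def build_parser():
    parser = argparse.ArgumentParser(description="Annotation file selection (sketch)")
    parser.add_argument("--hmmer-model", required=True, help="Path to the HMMER model")
    # Hypothetical flag name; the text only says "a flag for the new annotation file".
    parser.add_argument("--clade-annotation", default=None,
                        help="Path to the new-scheme annotation file (optional)")
    return parser


def resolve_annotation(hmmer_model, clade_annotation=None):
    """Return (scheme, path): the new file if given, else the legacy .annot file."""
    if clade_annotation is not None:
        return "new", Path(clade_annotation)
    # Legacy behaviour (assumed): same location as the model, suffix swapped to .annot.
    return "legacy", Path(hmmer_model).with_suffix(".annot")
```

A caller would parse the arguments once and pass `args.hmmer_model` and `args.clade_annotation` to `resolve_annotation`, then hand the result to whichever reader matches the returned scheme.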

The Python code might need the following data from the annotation file:

  • Clade ID
  • Clade description as shown in tool
  • Mol file for monomer (this was previously based on the clade identifier, not anymore)
    • [ ] Make changes in code to use the mol file name given in the new annotation format, if provided. Only needed in the Java part.
  • Postprocessor: This is used by the Java part and should be passed along.
  • VerificationDomains: This probably should be used within the Python part, to correct the annotation given to the Java part. I have my doubts here.
    • Use annotation object with Domain_Verifier classes, instead of local loader previously implemented.
    • Test DomainVerifier classes with Annotation reader.
    • Invoke DomainVerifier classes from main script, to influence the resulting SeqObj's features.
      • Write test for SimpleFeatureWriter making sure that the verification column appears adequately, fix any issues
    • Compare output to expected outputs for some sequences, fixing missing annotations that arise.
  • TerminationRule: This is used by the Java part and should be passed along.
  • NonElongating: This is used by the Java part and should be passed along.
  • VerificationDomainIsMandatory: Used in the Python part.
    • Should be used after calling the DomainVerifier classes, possibly to execute some changes (either remove the feature, which is preferred, or change it) if the verification fails. Will be used only in the Java section, to decide whether to make use of the verifications done.
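
The fields listed above could be gathered in a single record type. The sketch below is one possible shape for it: the field names, types, and defaults are assumptions, and the `.mol` fallback only illustrates the "previously based on the clade identifier" behaviour, whose real naming rule is not given here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CladeAnnotation:
    """Sketch of one clade entry from the new annotation file (field names assumed)."""
    clade_id: str
    description: str                       # clade description as shown in the tool
    mol_file: Optional[str] = None         # monomer mol file, if the annotation names one
    postprocessor: Optional[str] = None    # passed along to the Java part
    verification_domains: List[str] = field(default_factory=list)
    termination_rule: Optional[str] = None  # passed along to the Java part
    non_elongating: bool = False            # passed along to the Java part
    verification_mandatory: bool = False

    def monomer_mol_file(self) -> str:
        """Use the annotated mol file if provided, else fall back to the clade ID."""
        # The "<clade_id>.mol" fallback is an assumed stand-in for the old behaviour.
        return self.mol_file or f"{self.clade_id}.mol"
```

Keeping the pass-through fields (postprocessor, termination rule, non-elongating) on the same object means the Python side can forward them without interpreting them.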

This also means that these fields need to make it into the new file that Python writes for the Java-CDK part (features file), or that Python generates a simplified file for Java and Java reads all these from the annotation file. One way to go would be to combine in the feature file the fields produced by Python from the sequence search and all the elements read from the annotation file, to avoid the risk of the Java part running with an incorrect annotation file. All the fields produced through the sequence search are stored initially in qualifiers inside SeqFeatures objects, which go inside the SeqRecords returned by the FeatureMarker classes in Query.core. These are in turn written to the .feature file passed to Java by the SimpleFeatWriter class in SimpleFeatWriter.core. This could be the place to add all the annotation elements if a unified output is to be used.

Alternatively, the annotation file can be passed to Java, alongside the file with the results of the sequence searches and the domain annotation alterations.
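
For the unified-output option, a rough sketch of merging search qualifiers with annotation fields into one tab-separated file is shown below. It uses plain dictionaries in place of SeqFeature qualifiers, and the column names and file layout are hypothetical; the real .feature format written by SimpleFeatWriter may differ.

```python
import csv
import io

# Hypothetical column names; the real feature file layout is not specified here.
SEARCH_COLUMNS = ["clade_id", "start", "end", "verification"]
ANNOTATION_COLUMNS = ["mol_file", "postprocessor", "termination_rule", "non_elongating"]


def write_unified_features(features, annotations, handle):
    """Write one row per feature, combining search qualifiers and annotation fields.

    features: list of dicts standing in for SeqFeature qualifiers.
    annotations: dict mapping clade_id to a dict of annotation fields.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(SEARCH_COLUMNS + ANNOTATION_COLUMNS)
    for qualifiers in features:
        annotation = annotations.get(qualifiers["clade_id"], {})
        row = [qualifiers.get(column, "") for column in SEARCH_COLUMNS]
        row += [annotation.get(column, "") for column in ANNOTATION_COLUMNS]
        writer.writerow(row)


if __name__ == "__main__":
    out = io.StringIO()
    write_unified_features(
        [{"clade_id": "Clade_1", "start": 10, "end": 200, "verification": "passed"}],
        {"Clade_1": {"mol_file": "m.mol", "non_elongating": False}},
        out,
    )
    print(out.getvalue())
```

Writing the annotation fields next to each feature means the Java part never has to open the annotation file itself, which is the risk the unified option is meant to remove.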
