
Comments (5)

JBerthelier commented on July 29, 2024

Hi Jack,

Automating this part is the major update planned for the new version of PiRATE,
but the script is still in progress, sorry.

Unfortunately, for now you have to do it yourself.

The goal of this step is to build your autonomous TE library for the annotation.

  1. Download the PASTEC output (.tabular) and open it in a spreadsheet.

  2. Sort the classified sequences by Order (e.g. TIR, LTR).

[If you have time, you can even sort them at the Superfamily level (e.g. LTR/Copia, LTR/Gypsy, TIR/hAT), according to the evidence section produced by PASTEC. This is what I did for the PiRATE paper.]
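
If a spreadsheet is impractical, the sorting by Order could also be sketched in Python. This is only a sketch under assumptions: it supposes the PASTEC output is tab-separated, with the sequence ID in the first column and the Order in the sixth column, so check these positions against your own file. The function name `ids_by_order` and the example rows are illustrative and not part of PiRATE.

```python
from collections import defaultdict


def ids_by_order(lines, id_col=0, order_col=5):
    """Group sequence IDs by Order from tab-separated classification lines.

    id_col and order_col are zero-based column positions. The defaults are
    an assumption about the PASTEC layout and may need adjusting.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(id_col, order_col):
            continue  # skip blank or truncated lines
        groups[fields[order_col]].append(fields[id_col])
    return dict(groups)


# Made-up rows standing in for a real PASTEC table.
example_rows = [
    "seq1\t5000\t+\tok\tI\tLTR\tcomplete\tevidence\n",
    "seq2\t3000\t.\tok\tII\tTIR\tincomplete\tevidence\n",
    "seq3\t4000\t+\tok\tI\tLTR\tcomplete\tevidence\n",
]
for order, ids in sorted(ids_by_order(example_rows).items()):
    print(order, ids)
```

With a real file, the same function can be fed an open file handle, e.g. `with open("pastec_output.tabular") as handle: groups = ids_by_order(handle)` (the file name is hypothetical). If the file starts with a header row, that row will show up as its own group and can simply be ignored.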

  3. Select the names/IDs of the sequences belonging to the same Order and extract their DNA sequences from the fasta file that you used for PASTEC. You have to build a fasta file for each Order
    (a fasta file for LTR, a fasta file for LINE, a fasta file for TIR, ...).
    You have to do this yourself on the Linux command line; for example, see http://seqanswers.com/forums/showthread.php?t=50014

  4. Don't forget to add the classification information to the header of each sequence (e.g. add "LTR" or "LINE").
    This can also be done on the Linux command line; for example, see
    https://www.unix.com/unix-for-dummies-questions-and-answers/242665-append-file-name-fasta-file-headers-linux.html

  5. The goal of the manual check with MCL is to cluster the TE sequences of each Order into TE families. This is recommended.

Run MCL on the fasta file of each Order of autonomous TEs; your TE sequences will be clustered into predicted TE families.
The purpose of BLASTn is only to check whether you agree with the MCL clustering.
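
For the per-Order extraction and header labelling that come before the MCL run, a minimal Python sketch is given here as an alternative to the shell one-liners linked above. It assumes you already have a mapping from sequence ID to Order (for example, read from the PASTEC table) and that the first word of each fasta header is the sequence ID. The function names and the `ORDER_` header prefix are illustrative choices, not a format required by PiRATE or MCL.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def tag_records(records, id_to_order):
    """Return {order: [(new_header, seq), ...]}, prefixing the Order to each header.

    Sequences whose ID is missing from id_to_order are skipped.
    """
    per_order = {}
    for header, seq in records:
        seq_id = header.split()[0]
        order = id_to_order.get(seq_id)
        if order is None:
            continue
        per_order.setdefault(order, []).append((f"{order}_{header}", seq))
    return per_order


def format_fasta(records, width=60):
    """Render (header, sequence) pairs as FASTA text wrapped at `width` columns."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)


# Made-up input standing in for the fasta file given to PASTEC.
example_fasta = [">seq1 some description\n", "ACGTACGT\n", ">seq2\n", "GGGGCCCC\n"]
example_orders = {"seq1": "LTR", "seq2": "TIR"}  # hypothetical classification
for order, recs in tag_records(read_fasta(example_fasta), example_orders).items():
    print(f"--- {order}.fasta ---")
    print(format_fasta(recs), end="")
```

Each value of the returned dictionary can then be written to its own file (one per Order) and passed to MCL.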

  6. Build your autonomous TE library with the classified (and clustered) autonomous TE sequences.
    If you have time, you can select only the longest (potentially complete) sequences of each cluster; they make good reference sequences for each autonomous TE family.
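
Picking the longest sequence of each cluster can be sketched as follows. This assumes you have a mapping from sequence ID to cluster ID; how that mapping is read from the MCL output depends on your header format, so it is passed in as a plain dictionary here. The function name `longest_per_cluster` is illustrative.

```python
def longest_per_cluster(records, cluster_of):
    """Return {cluster_id: (seq_id, sequence)} keeping the longest sequence per cluster.

    records    -- iterable of (seq_id, sequence) pairs
    cluster_of -- dict mapping seq_id to cluster ID (hypothetical input)
    Ties are resolved in favour of the sequence seen first.
    """
    best = {}
    for seq_id, seq in records:
        cluster = cluster_of.get(seq_id)
        if cluster is None:
            continue  # sequence was not assigned to any cluster
        if cluster not in best or len(seq) > len(best[cluster][1]):
            best[cluster] = (seq_id, seq)
    return best


# Made-up sequences and cluster assignments.
example_records = [("s1", "ACGT"), ("s2", "ACGTACGT"), ("s3", "GG")]
example_clusters = {"s1": "cluster1", "s2": "cluster1", "s3": "cluster2"}
for cluster, (seq_id, seq) in sorted(longest_per_cluster(example_records, example_clusters).items()):
    print(cluster, seq_id, len(seq))
```

Length is only a rough proxy for completeness, so the selected sequences are still worth a manual look.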

I advise you to first build a TE library containing only the potentially autonomous TE Orders (e.g. LTR, LINE, TIR).

  7. Then you can go on to the annotation with TEannot.

--> You can build other libraries containing both autonomous TE Orders and non-autonomous ones (SINE, MITE, LARD, TRIM), but the latter are difficult to detect and classify and need to be examined carefully.

I hope this will help you,

Best,

Jérémy

from pirate.

jphruska commented on July 29, 2024

Thanks Jeremy.

I will look into your recommendations.

Much appreciated

Jack


jphruska commented on July 29, 2024

Hi Jeremy--

Another quick question if you don't mind me asking.

My apologies in advance for the naive question, but I've only just begun working on TEs and annotations. I've run putative LINE sequences (according to PASTEC) through MCL, and I am wondering if you would help me interpret the output. What I have received now is the name of the sequences that match specific groups (which I would assume are the clusters?). In addition, the fasta headers are now renamed with the cluster ID they are assigned to. How are these clusters determined? Also, some clusters appear to have way more sequences than others. What would the clusters pertain to, exactly? How can I interpret this output in combination with the BLASTn search that follows?

Many thanks for your time, and again, apologies for the barrage of questions.

Best
Jack


JBerthelier commented on July 29, 2024

Hi Jack,

No problem about the question. Sorry for the delayed answer.

Thanks to MCL you get your potential groups/clusters/families of LINEs.

This will help you, for example, to identify which TE families have a large number of copies.

TE families are formed by similar copies of TEs (basically sharing at least 80% similarity over at least 80% of the sequence length).

That's why we propose clustering the sequences into families before the annotation.
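
To make the 80%/80% criterion concrete, here is a small Python sketch that applies it to BLASTn tabular hits (`-outfmt 6`, where columns 1 to 4 are query ID, subject ID, percent identity and alignment length). Two points are assumptions on my side: coverage is measured against the shorter of the two sequences, and each hit is judged on its own without merging several alignments between the same pair. The function names are illustrative.

```python
def passes_80_80(pident, aln_len, len_a, len_b, min_ident=80.0, min_cov=0.8):
    """True if one alignment has >= min_ident % identity and covers at least
    min_cov of the shorter sequence (the choice of 'shorter' is an assumption)."""
    shorter = min(len_a, len_b)
    if shorter <= 0:
        return False
    return pident >= min_ident and aln_len / shorter >= min_cov


def same_family_pairs(blast_lines, lengths):
    """Return (query, subject) pairs whose hit satisfies passes_80_80.

    blast_lines -- BLASTn tabular lines (-outfmt 6)
    lengths     -- dict mapping sequence ID to sequence length
    """
    pairs = []
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # ignore self-hits
        if query not in lengths or subject not in lengths:
            continue
        if passes_80_80(float(fields[2]), int(fields[3]), lengths[query], lengths[subject]):
            pairs.append((query, subject))
    return pairs


# Made-up hits: only the first one is long and similar enough.
example_hits = ["a\tb\t85.0\t900\n", "a\tc\t90.0\t300\n", "a\ta\t100.0\t1000\n"]
example_lengths = {"a": 1000, "b": 1100, "c": 1000}
print(same_family_pairs(example_hits, example_lengths))
```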

See the "Definition of family and subfamily" in the paper by Wicker et al. (2007), Nature Reviews Genetics, titled

"A unified classification system for eukaryotic transposable elements".

It's normal that some clusters have more sequences than others. It means that some TE families may have more copies in the genome or are easier to detect. Be careful with clusters composed of a single copy; they can be false positives.
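
A quick way to see the cluster sizes and to list the single-copy clusters is sketched here in Python. It assumes a dictionary mapping each sequence ID to its cluster ID, which you would fill from your own MCL output; the names are illustrative.

```python
from collections import Counter


def cluster_sizes(cluster_of):
    """Count how many sequences fall in each cluster."""
    return Counter(cluster_of.values())


def singleton_clusters(cluster_of):
    """Return the sorted IDs of clusters that contain exactly one sequence."""
    return sorted(c for c, n in cluster_sizes(cluster_of).items() if n == 1)


# Made-up assignments: cluster2 contains a single sequence.
example_clusters = {"s1": "cluster1", "s2": "cluster1", "s3": "cluster2"}
for cluster, size in cluster_sizes(example_clusters).most_common():
    print(cluster, size)
print("to check by hand:", singleton_clusters(example_clusters))
```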

About BLASTn:
You can submit the fasta output of MCL (with the cluster/family IDs) to BLASTn to check whether similar sequences are grouped together correctly. It's an optional manual check. If you think a sequence has been wrongly grouped by MCL, you can manually rename its cluster/family ID.

However, if you are not interested in having your LINE sequences grouped into TE families, you can skip this step and go directly to the annotation.

It really depends on the aim of your study.

Best,

Jérémy


jphruska commented on July 29, 2024

Thanks for the information, Jeremy.

All the best
Jack

