

Query regarding construction of libraries for annotation (pirate, closed)

jphruska commented on July 29, 2024

Query regarding construction of libraries for annotation


Comments (5)

JBerthelier commented on July 29, 2024

Hi Jack,

The automation of this part is the major update in the new version of PiRATE.
But the script is still under construction, sorry.

Unfortunately, you have to do it yourself for now.

The goal of this step is to build your autonomous TE library for the annotation.

You have to download the output of PASTEC (.tabular) and open it with a spreadsheet.
Sort the classified sequences according to their Order (e.g. TIR, LTR).

[If you have time, you can even sort them at the Superfamily level (e.g. LTR/Copia, LTR/Gypsy, TIR/hAT, according to the evidence section produced by PASTEC). This is what I did for the PiRATE paper.]

Select the names/IDs of the sequences belonging to the same Order and extract their DNA sequences from the fasta file that you used for PASTEC. You have to build a separate fasta file for each Order
(a fasta file for LTR, a fasta file for LINE, a fasta file for TIR ...).
You have to do it yourself from the Linux console --> for example http://seqanswers.com/forums/showthread.php?t=50014
Don't forget to add the classification information to the header of each sequence (e.g. add "LTR" or "LINE"),
again from the Linux console --> for example
https://www.unix.com/unix-for-dummies-questions-and-answers/242665-append-file-name-fasta-file-headers-linux.html
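If you prefer a script to the console one-liners, the extraction and header tagging could be sketched roughly like this in Python. This is only an illustration, not part of PiRATE: the function names are made up, and the column positions of the sequence ID and the Order in the PASTEC table (`id_col`, `order_col`) are assumptions that you should check against your own file.

```python
from collections import defaultdict

def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def orders_from_table(table_text, id_col=0, order_col=5):
    """Map sequence ID -> Order from a tab-separated classification table.

    The default column positions are assumptions, not a documented format.
    """
    orders = {}
    for line in table_text.splitlines():
        fields = line.split("\t")
        if len(fields) > max(id_col, order_col):
            orders[fields[id_col]] = fields[order_col]
    return orders

def split_by_order(records, orders):
    """Group records by Order and prefix each header with the Order name."""
    by_order = defaultdict(list)
    for header, seq in records:
        seq_id = header.split()[0]  # ID = first word of the header
        order = orders.get(seq_id)
        if order is None:
            continue  # sequence is not listed in the classification table
        by_order[order].append((f"{order}_{header}", seq))
    return dict(by_order)

def to_fasta(records):
    """Serialize (header, sequence) tuples back to FASTA text."""
    return "".join(f">{h}\n{s}\n" for h, s in records)
```

Each value returned by `split_by_order` can then be passed to `to_fasta` and written to its own file, one per Order.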
The goal of the manual check with MCL is to cluster the TE sequences of each Order into TE families. This is recommended.

Run MCL on the fasta file of each Order of autonomous TEs; your TE sequences will be clustered into predicted TE families.
The purpose of BLASTn is only to check whether you agree with the MCL clustering.

Build your autonomous TE library with the classified (and clustered) autonomous TE sequences.
If you have time, you can select only the longest (potentially complete) sequences of each cluster; they make good reference sequences for each autonomous TE family.
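Picking the longest sequence of each cluster can also be scripted. Here is a minimal sketch, assuming the cluster ID is the part of the fasta header before the first underscore; that header layout is hypothetical, so adapt `cluster_from_prefix` to whatever your MCL output really looks like.

```python
def cluster_from_prefix(header, sep="_"):
    """Hypothetical header layout: '<clusterID><sep><original name>'."""
    return header.split(sep, 1)[0]

def longest_per_cluster(records, cluster_of=cluster_from_prefix):
    """Return {cluster ID: (header, sequence)}, keeping the longest sequence.

    `records` is an iterable of (header, sequence) tuples. On a tie in
    length, the first sequence encountered is kept.
    """
    best = {}
    for header, seq in records:
        cluster = cluster_of(header)
        if cluster not in best or len(seq) > len(best[cluster][1]):
            best[cluster] = (header, seq)
    return best
```

Keep in mind that the longest copy is only potentially complete, so a quick look at the selected sequences is still worthwhile.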

I advise you to first build a TE library containing only the potentially autonomous TE Orders (e.g. LTR, LINE, TIR).

Then, you can go for the annotation with TEannot.

--> You can build another library containing both autonomous TE Orders and non-autonomous ones (SINE, MITE, LARD, TRIM), but the latter are difficult to detect and classify and need to be examined carefully.

I hope this will help you,

Best,

Jérémy


jphruska commented on July 29, 2024

Thanks Jeremy.

I will look into your recommendations.

Much appreciated

Jack


jphruska commented on July 29, 2024

Hi Jeremy--

Another quick question if you don't mind me asking.

My apologies in advance for the naive question, but I've only just begun working on TEs and annotations. I've run putative LINE sequences (according to PASTEC) through MCL, and I am wondering if you would help me interpret the output. What I have received now is the name of the sequences that match specific groups (which I would assume are the clusters?). In addition, the fasta headers are now renamed with the cluster ID they are assigned to. How are these clusters determined? Also, some clusters appear to have way more sequences than others. What would the clusters pertain to, exactly? How can I interpret this output in combination with the BLASTn search that follows?

Many thanks for your time, and again, apologies for the barrage of questions.

Best
Jack


JBerthelier commented on July 29, 2024

Hi Jack,

No problem at all. Sorry for the delayed answer.

Thanks to MCL you get your potential groups/clusters/families of LINEs.

This will help you, for example, to identify which TE families have a large number of copies.

A TE family is formed by similar TE copies (basically sharing at least 80% sequence similarity over at least 80% of their length).

That's why we propose clustering the sequences into families before the annotation.

You can see the "Definition of family and subfamily" in the paper by Wicker et al. 2007 (Nature Reviews Genetics), titled

"A unified classification system for eukaryotic transposable elements".
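To make the 80% / 80% criterion concrete, here is a small Python sketch of the test applied to one pairwise alignment. It is an illustration only: measuring coverage against the shorter of the two sequences and requiring at least 80 aligned bp are assumptions made for this example, and MCL itself does not work this way internally.

```python
def same_family(pident, aln_len, len_a, len_b,
                min_identity=80.0, min_coverage=0.80, min_aln_len=80):
    """Return True if one alignment satisfies the family thresholds.

    pident  : percent identity of the alignment (0-100)
    aln_len : alignment length in bp
    len_a, len_b : full lengths of the two sequences in bp
    Coverage is computed relative to the shorter sequence (an assumption).
    """
    shorter = min(len_a, len_b)
    if shorter <= 0:
        return False
    coverage = aln_len / shorter
    return (pident >= min_identity
            and coverage >= min_coverage
            and aln_len >= min_aln_len)
```

For example, two copies of 1000 bp and 1200 bp aligned over 900 bp at 85% identity pass the test, while the same pair aligned over only 500 bp does not.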

It's normal that some clusters have more sequences than others. This means that some TE families may have more copies in the genome or are easier to detect. Be careful with clusters composed of a single copy; they can be false positives.
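Counting the sequences per cluster and listing the single-copy clusters is easy to script. A rough sketch, again assuming (hypothetically) that the cluster ID is the part of each header before the first underscore:

```python
from collections import Counter

def cluster_of(header, sep="_"):
    """Hypothetical header layout: '<clusterID><sep><original name>'."""
    return header.split(sep, 1)[0]

def cluster_sizes(headers):
    """Count how many sequences carry each cluster ID."""
    return Counter(cluster_of(h) for h in headers)

def singletons(sizes):
    """Return the sorted IDs of clusters that contain exactly one sequence."""
    return sorted(c for c, n in sizes.items() if n == 1)
```

The clusters returned by `singletons` are the ones to inspect first as possible false positives.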

About BLASTn:
You can submit the fasta output of MCL (with the cluster/family IDs) to BLASTn in order to check whether similar sequences are grouped correctly. It's an optional manual check. If you judge that a sequence has been wrongly grouped by MCL, you can manually rename its cluster/family ID.
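One possible way to speed up this check is to run BLASTn of the clustered sequences against themselves with tabular output (`-outfmt 6`) and list the sequences whose best non-self hit belongs to a different cluster. The sketch below assumes the standard 12-column tabular layout and, hypothetically, that the cluster ID is the part of each sequence name before the first underscore. It only suggests candidates to look at; the decision to rename stays manual.

```python
def cluster_of(name, sep="_"):
    """Hypothetical naming layout: '<clusterID><sep><original name>'."""
    return name.split(sep, 1)[0]

def misgrouped(blast_tab):
    """Return {query: best_hit} where the best hit is in another cluster.

    `blast_tab` is the text of a BLAST tabular (outfmt 6) report:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    """
    best = {}
    for line in blast_tab.splitlines():
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query == subject:
            continue  # ignore self hits
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()
            if cluster_of(q) != cluster_of(s)}
```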

However, if you are not interested in having your LINE sequences grouped into TE families, you can skip this step and go directly to the annotation.

It really depends on the aim of your study.

Best,

Jérémy


jphruska commented on July 29, 2024

Thanks for the information, Jeremy.

All the best
Jack
