jberthelier / pirate

PiRATE (Pipeline to Retrieve and Annotate Transposable Elements)

Home Page: http://doi.org/10.17882/51795

pirate's Issues

How to upload large data from the virtual machine to Galaxy via FTP?

Hi,
I am having trouble uploading data. My Illumina data is ~70 GB, which is over 1 GB, so I tried to upload it via FTP. The data files are already on my virtual machine. Here are the two methods I tried:

1) I copied and pasted my files into the directory "[email protected]" of the PiRATE-VM (/home/Jeremy/Documents/[email protected]).
It doesn't work. I can see the files in [email protected], but nothing is listed when I select "Choose FTP files" to upload data in the Galaxy web interface.
2) I used FileZilla to transfer the files. Host: 134.246.55.42, username: my_email, password: ****.
It doesn't work. FileZilla reports "Couldn't connect to server".
However, when I used the host usegalaxy.org, I was able to connect to the server and upload files.

Am I doing something wrong? And how can I solve this problem and upload data via FTP?

Thank you for your attention; I hope you can suggest a solution.

RNA-seq data analysis

Hello @ Jérémy Berthelier
Is it possible to map RNA-seq data directly to the potentially autonomous TEs, without mapping to the genome assembly?

Install PiRATE

Hello,
I plan to install and use PiRATE with VirtualBox on my laptop.
However, I cannot download the PiRATE-VM OVA file to a local folder.
When I clicked on PiRATE-VM, a new window popped up with unknown scripts, which I guess is not a normal download.
If you could tell me how to download it, I would appreciate it.
Kyung

PiRATE availability

Hi there, I was wondering: is there a way to get the software in a form other than a virtual machine? I ask because the VM won't work with our cluster.

How to connect via FTP to the Virtual Machine to transfer my data

Several people have asked about this.

Here is my tutorial:

1) First, check your network configuration in VirtualBox for the PiRATE-VM

Select Global Settings from the File menu --> select the Network item in the list
- In "Attached to", check that "Bridged Network" is selected
- Go to "Advanced" (blue arrow)
- Check that Promiscuous Mode is set to "Allow All"
- Cable connected: checked

2) Start your PiRATE-VM

2.a - Just in case, restart proftpd:

Open a Konsole and run:
sudo /etc/init.d/proftpd reload
sudo /etc/init.d/proftpd restart

(the admin password is jeremy07)

2.b - Obtain your IP address
Open a Konsole and run the ifconfig command (https://www.computerhope.com/unix/uifconfi.htm)

Example:

eth0 Link encap:Ethernet HWaddr 09:00:12:90:e3:e5
inet addr:192.168.1.29 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::a00:27ff:fe70:e3f5/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:54071 errors:1 dropped:0 overruns:0 frame:0
TX packets:48515 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:22009423 (20.9 MiB) TX bytes:25690847 (24.5 MiB)

This will show you the "inet addr:" value, which is your IP address.
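
For anyone who prefers not to read the address off the screen, here is a minimal sketch that pulls the "inet addr:" value out of ifconfig text like the example above. The function name `find_ipv4_addresses` is made up for illustration, and the pattern assumes the older ifconfig output style shown in this post.

```python
import re
from typing import List


def find_ipv4_addresses(ifconfig_output: str) -> List[str]:
    """Return every IPv4 address that follows 'inet addr:' in ifconfig text."""
    return re.findall(r"inet addr:(\d{1,3}(?:\.\d{1,3}){3})", ifconfig_output)


# Sample text copied from the example output in this post.
sample = """eth0 Link encap:Ethernet HWaddr 09:00:12:90:e3:e5
inet addr:192.168.1.29 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::a00:27ff:fe70:e3f5/64 Scope:Link"""

print(find_ipv4_addresses(sample))
```

The IPv6 line is ignored because it starts with "inet6 addr:" rather than "inet addr:".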

Then go back to your physical machine

3) Open FileZilla and connect to the PiRATE-VM

Host: your IP address, Login: jeremy, Password: jeremy07, Port: 21
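
If a graphical client is not convenient, the same connection can probably be scripted. This is a minimal sketch using Python's standard ftplib with the connection settings listed above. The function name `upload_files`, the example address and the file name are placeholders, and whether uploaded files then appear under "Choose FTP files" in Galaxy is an assumption that was not checked here.

```python
from ftplib import FTP
from pathlib import Path


def upload_files(host, user, password, paths, port=21):
    """Upload each local file in `paths` to the FTP login directory.

    Returns the list of uploaded file names.
    """
    uploaded = []
    with FTP() as ftp:
        ftp.connect(host, port, timeout=30)
        ftp.login(user, password)
        for path in map(Path, paths):
            # Binary mode, so compressed FASTQ files are not altered in transit.
            with path.open("rb") as handle:
                ftp.storbinary(f"STOR {path.name}", handle)
            uploaded.append(path.name)
    return uploaded


# Example call (hypothetical address and file name):
# upload_files("192.168.1.29", "jeremy", "jeremy07", ["reads_R1.fastq.gz"])
```

Binary transfer is used on purpose: a text-mode transfer can rewrite line endings and damage compressed or sequence files.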

I just re-checked this method and it works for me.

Best!

Jeremy

Error with Helsearch

I have reformatted the input file to 60 bp per line, but I still get an error from Helsearch:

Fatal error: Exit code 1 () 

[formatdb] ERROR: ./working/GCAGGT_CGA_ACCTGC_5/GCAGGT_CGA_ACCTGC_5.nhrOutput Blast-def-line-set.E.<title> 

Invalid value(s) [9] in VisibleString [genome_P#12826044#12826244 ...] 

[formatdb] ERROR: ./working/GCAGGT_CGA_ACCTGC_5/

Query regarding construction of libraries for annotation

Hi Jeremy--

Once again, thanks for constructing PiRATE. I've very much enjoyed using it so far.

I do have another question, if you don't mind. More like a clarification. I've finished the classification step and have received my output from PASTEC. Now I'm starting to look into the manual check step, and had a question regarding how to proceed. My understanding is that the manual check is only meant to be performed on the autonomous sequences, using MCL and Blastn. In order to do so, is the expectation that one has to pull LTR, LINE and TIR sequences (using PASTEC information) manually and then feed that to MCL and Blastn? If so, was there a script that was used for this purpose, or is the expectation that we write that ourselves?

Thanks for the clarification.
Best
Jack

Is it necessary to have an assembled genome to run PiRATE?

Hi @JBerthelier,
I hope all is well with you. Before running the PiRATE pipeline, I would like to confirm whether an assembled genome is necessary to run it, because the species I am working on cannot be assembled due to its large genome size. Can I use approach 4 (build repeated elements) in step one, then proceed with the following steps as they are? I am asking because I checked the final step, TEannot, and it has two input requirements:

- Genomic sequences (enter the genome assembly)
- TE library (the manually constructed TE libraries), if I'm not wrong.

Your suggestions would be appreciated!
Thanks
Regards,
Majie

Error with Helsearch

A PiRATE user reported this error with Helsearch:

Fatal error: Exit code 1 ()
[formatdb] WARNING: Cannot add sequence number 1 (lcl|1_./working/GCCCTGA_CA_TCAGGGC_156) because it has zero-length.
[formatdb] FATAL ERROR: Fatal error when adding sequence to BLAST database.
rm: cannot remove './working/GCCCGC_AAGA_GCGGGC_2/GCCCGC_AAGA_GCGGGC_2.nhr': No such file or directory
rm: cannot remove './working/GCCCGC_AAGA_GCGGGC_2/GCCCGC_AAGA_GCGGGC_2.nsq': No such file or directory
[formatdb] ERROR: ./working/TCCAAT_AAT_ATTGGA_2/TCCAAT_AAT_ATTGGA_2.nhrOutput Blast-def-line-set.E.<title>
Invalid value(s) [9] in VisibleString [genome_P#2379729#2379929 ...]
[formatdb] ERROR: ./working/TCCAAT_AAT_ATTGGA_2/TCCAAT_AAT_ATTGGA_2.nhrOutput Blast-def-line-set.E.<title>
Invalid value(s) [9] in VisibleString [genome_P#28421039#28421239 ...]
cp: cannot stat '/home/jeremy/Pipeline/helsearch1.0/helitron_contain/helitron.fa': No such file or directory

To fix this error, the input sequences must be wrapped at 60 bp per line.
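
One possible way to produce such a file is sketched below. It is not part of PiRATE: the function name `rewrap_fasta` is made up for illustration, and the sketch only rewraps sequence lines to a fixed width while leaving header lines untouched.

```python
def rewrap_fasta(fasta_text: str, width: int = 60) -> str:
    """Rewrap every sequence in FASTA-formatted text to `width` characters per line."""
    out = []
    chunks = []

    def flush():
        # Join the pieces of the current sequence and cut them into fixed-width lines.
        seq = "".join(chunks)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
        chunks.clear()

    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            out.append(line)
        else:
            chunks.append(line)
    flush()
    return "\n".join(out) + "\n"


# Usage sketch (file names are hypothetical):
# with open("genome.fa") as src, open("genome_60.fa", "w") as dst:
#     dst.write(rewrap_fasta(src.read()))
```

The whole file is read into memory, which is fine for small assemblies; a very large genome would call for a streaming version.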

Best,

Weight below or above 1 GO

Hello @ Jérémy Berthelier
What does "weight below 1" (for the genome assembly) mean? And what about "weight of more than 1 GO" (for Illumina raw data)?

PiRATE DEFAULT DATABANKS

Hello @ Jérémy Berthelier
Do the default nucleotide and protein databanks of PiRATE include all the sequences from Repbase Update? I am planning to do a comprehensive genome-wide TE analysis in my study. Thank you in advance!

FASTQ to FASTA error

When trying to transform my FASTQ to FASTA I get the following error:

fastq_to_fasta: Invalid quality score value (char '#' ord 35 quality value -29) on line 4
gzip: stdout: Broken pipe
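
The numbers in this message are consistent with a quality-offset mismatch: the character '#' has code 35, and 35 - 64 = -29, which suggests the tool decoded the quality line with an offset of 64 while the file uses offset 33. The short check below reproduces that arithmetic; the function name `phred_score` is made up for illustration. As far as I know, FASTX-Toolkit tools accept a -Q33 option for offset-33 data, but whether the PiRATE wrapper exposes it is not confirmed here.

```python
def phred_score(quality_char: str, offset: int) -> int:
    """Decode one FASTQ quality character with the given ASCII offset."""
    return ord(quality_char) - offset


print(phred_score("#", 64))  # value reported in the error message
print(phred_score("#", 33))  # value under the offset-33 convention
```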

GFF output from TE annotation

Hi there,
Thank you for providing this package for TE annotation. It makes it very easy to use so many TE annotation tools. I have a quick question about the output of the different annotation programs. I found that they usually provide only a .fasta file as output. What would be the best way to retrieve the genomic locations of the TE annotations? For example, is there a GFF-format output that I can easily obtain?

Thank you

French keyboard in shell

I can't seem to change the keyboard layout in the shell to English. I can't find the / key, even after trying all the keys. Why not use a universal keyboard layout?

No internet connection

I am unable to get an internet connection in the PiRATE-VM.
I tried many network settings, including Bridged Adapter, but nothing changes.

Can someone help me?

Many thanks in advance

FTP is not available

Hi Jeremy,

I've tried to upload data via FTP; however, no FTP option appears on the tab for dropping data files. There is only a "Choose local file" button. I cannot upload the data via USB or drag and drop because I cannot access my local computer's drives once I am in the PiRATE VM. Do you have any suggestions?

thank you,

TEdenovo error with 60bp file

I receive the following error message when I try to run TEdenovo on my assembly:

Fatal error: Exit code 1 ()
cat: /home/jeremy/galaxy/tools/Pipeline/REPET/WORK/genome_Blaster_Grouper_Map/genome_Blaster_Grouper_Map_consensus.fa: No such file or directory
cat: /home/jeremy/galaxy/tools/Pipeline/REPET/WORK/genome_Blaster_Piler_Map/genome_Blaster_Piler_Map_consensus.fa: No such file or directory
cat: /home/jeremy/galaxy/tools/Pipeline/REPET/WORK/genome_Blaster_Recon_Map/genome_Blaster_Recon_Map_consensus.fa: No such file or directory

The assembly has FASTA lines no longer than 60 bp, and the sequence names are formatted as follows: >Ccostatus_symbiont_assembly_1

Any suggestions?

Password problem fixed - no password needed now

Some users had a problem logging into the PiRATE VM.

The cause was that we used an AZERTY keyboard, which caused problems for QWERTY and other keyboards.

The PiRATE VM has been updated to fix this problem.
No password is needed now, but please configure your keyboard in the PiRATE-VM.

Best,

transferring data to vm

I have a shared drive set up, but I can't get to it from the VM. All I want to do is access my data from the VM or transfer it there. The tutorial is not working for me at this stage. Is there any more information on how to do this, and on how to change the keyboard layout? I can get around launching Galaxy, but I can't find the $ key.

RepeatMasker encountered a line in an unrecognized format.

Hello,

Over the past couple of weeks I tried to run several detection steps with the PiRATE pipeline, and all of them reported the same error. It is about invalid characters in the FASTA file. Below is the error reported by RepeatMasker.

Fatal error: Exit code 1 () FastaDB::_cleanIndexAndCompact - WARNING: RepeatMasker encountered a line in an unrecognized format. The offending subsequence is "GTTGG)N;GPM". The offending line is "TGTACGGGTGTAATGATGTAATTGCTTTTGTATGTTGG)N;GPMXK6I,3". Seq

The long report displays a lot of errors like this. Each line reports different unrecognized characters; however, the FASTA file does not contain them. I searched for the reported lines and for the specific characters, in the files on both the VM and my PC, and found nothing.
I think these files may be corrupted after being input into the pipeline.

Am I doing something wrong with file management, or is there a problem with my .ova file?
P.S. I transferred the FASTA file with FileZilla.

Thank you!
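
One way to double-check a file before and after transfer is to scan it for characters outside the nucleotide alphabet. The sketch below is not part of PiRATE, and the names `IUPAC_NUCLEOTIDES` and `find_invalid_lines` are made up for illustration. Note that the tail of the offending line quoted above looks a little like a FASTQ quality string, so it may be worth checking whether sequence and quality lines were merged somewhere; that is only a guess.

```python
# IUPAC nucleotide codes, upper and lower case.
IUPAC_NUCLEOTIDES = set("ACGTUNRYSWKMBDHVacgtunryswkmbdhv")


def find_invalid_lines(fasta_text):
    """Return (line_number, offending_characters) for each sequence line
    that contains characters outside the IUPAC nucleotide alphabet."""
    problems = []
    for number, line in enumerate(fasta_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith(">"):
            continue
        bad = sorted(set(stripped) - IUPAC_NUCLEOTIDES)
        if bad:
            problems.append((number, "".join(bad)))
    return problems


# Usage sketch (file name is hypothetical):
# with open("genome.fa") as handle:
#     for number, chars in find_invalid_lines(handle.read()):
#         print(number, chars)
```

Running the same check on the copy inside the VM and on the original file would show whether the damage happens during transfer or later in the pipeline.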

Database bug in PiRATE?

I tried running the PASTEClassifier job in PiRATE, but got the following error message:

SSI index construction failed:
  primary keys not unique: 'PiRATEdb_Crypton_YR_NA' occurs more than once
Traceback (most recent call last):
  File "/home/jeremy/Pipeline/REPET_2.5/bin/PASTEClassifier.py", line 405, in <module>

I assume this has to do with the MySQL database? Any way to fix it?
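
The "primary keys not unique" wording suggests that a sequence file is being indexed by sequence name and that one name appears twice; that reading is an assumption and may have nothing to do with MySQL. A minimal sketch for listing duplicated names in a FASTA file follows. The function name `duplicate_fasta_ids` is made up for illustration, and which databank file to check is not stated in this thread.

```python
from collections import Counter


def duplicate_fasta_ids(fasta_text):
    """Return the sorted sequence identifiers that occur more than once.

    The identifier is taken as the first whitespace-delimited word of each header.
    """
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    counts = Counter(ids)
    return sorted(name for name, n in counts.items() if n > 1)


# Usage sketch (file name is hypothetical):
# with open("databank.fa") as handle:
#     print(duplicate_fasta_ids(handle.read()))
```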

Is it necessary to have an assembled genome to run the PiRATE pipeline?

- Genomic sequences (enter the genome assembly)
- TE library (the manually constructed TE libraries), if I'm not wrong.

Your suggestions would be appreciated!
Sorry, this issue was duplicated because of a server connection problem; please allow me to delete one of the copies.
Thanks
Regards,
Majie

Encountered error in Hmmer

Hello Jeremy, I encountered this error during my analysis with TE-HMMER.

Fatal error: Exit code 134 ()
Translate nucleic acid sequences
Fatal exception (source file p7_alidisplay.c, line 729):
backconverted subseq didnt end at expected length (scaffold255_4/PiRATEdb_Academ_HEL_NA)
/home/jeremy/galaxy/database/jobs_directory/001/1961/tool_script.sh: line 9: 10158 Aborted (core dumped) hmmsearch -A result_hmmer.stock /home/jeremy/galaxy/tools/Pipeline/Hmmer/PiRATE_hmmer.hmm genome_prot.fa > /home/jeremy/galaxy/database/files/003/dataset_3875.dat

Unable to upload data from my Local PC.

Dear @JBerthelier,
I have just successfully installed PiRATE, but I am now having a problem uploading data from my local device. Whenever I select the "Choose local file" option, it opens your directory instead of my local PC. I haven't used this platform before; could you please help me resolve this issue?
Many thanks!

Progress with Docker version

I'm really interested in using your pipeline for TE analysis of multiple different non-model organisms.

I'd prefer to use your tool through Docker on the command line to speed up throughput on a cluster, as I'm looking to do a comparative analysis of many genomes.

I was wondering how things are progressing with getting Pirate to Docker?

Thanks.

Cannot type the correct characters in Konsole

During the installation process, I could open the Konsole, but could not type "sh /home/jeremy/galaxy/run.sh" correctly.
It actually prints "sh !ho,e!jere,y!gqlqxy!run:sh".
My platform is Ubuntu 14.04 LTS, and I installed the correct VirtualBox and extension pack.
The Konsole menus are not in English, which makes them very hard to understand.
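
The garbled command is exactly what a French AZERTY layout produces when the keys are pressed at their US QWERTY positions, which matches the keyboard problem described in other issues here. The small check below reproduces the reported output from the intended command. The mapping covers only the four keys involved in this example and is meant as a diagnosis aid, not a fix.

```python
# Characters an AZERTY layout produces for keys pressed at their US QWERTY
# positions (only the four keys that occur in this command).
QWERTY_TO_AZERTY = str.maketrans({"/": "!", "m": ",", "a": "q", ".": ":"})

intended = "sh /home/jeremy/galaxy/run.sh"
print(intended.translate(QWERTY_TO_AZERTY))
```

If the printed string matches what appears in the Konsole, the VM is using an AZERTY layout and the keyboard setting inside the VM needs to be changed.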

How to set the number of processors and slots for the tools TEdenovo, PASTEC and TEannot (REPET package) in PiRATE

To set the number of processors and slots for the tools TEdenovo, PASTEC and TEannot (REPET package) in PiRATE:

In a Konsole, run:

sudo qconf -mq main

The password is jeremy07.

Then change the number of processors and slots according to your VM/machine.

Best

Jeremy

Statistics summary

Hi Jeremy,

Is it possible to generate a statistics summary similar to Table 1 in the paper? Or how can I easily extract the numbers of the different TEs, their proportions of the genome, etc.?

Cheers
Pepijn

TEannot - size genome

Hello!
I am trying to use PiRATE in my studies and I have an important question about the TE annotation step. Is the genome size really in Mb, as in the example given in the pipeline (85,000,000 Mb), or should the size be given in bp? I tried entering the size in Mb, but the statistical results did not make sense.
Thanks a lot
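
A quick unit check may help here: 85 Mb equals 85,000,000 bp, so the example value "85,000,000" has the magnitude of a size in bp rather than in Mb. Whether TEannot expects bp is not confirmed in this thread. The sketch below, which is not part of PiRATE and uses made-up function names, converts Mb to bp and counts the bases in an assembly so the two can be compared.

```python
def mb_to_bp(size_mb: float) -> int:
    """Convert a genome size in megabases to base pairs (1 Mb = 1,000,000 bp)."""
    return round(size_mb * 1_000_000)


def genome_size_bp(fasta_text: str) -> int:
    """Count the bases in FASTA-formatted text, ignoring header lines."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )


print(mb_to_bp(85))
print(genome_size_bp(">scaffold1\nACGTACGT\nACGT\n>scaffold2\nNNNN\n"))
```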

MGEScan-nonLTR error

Hi Jeremy--

Many thanks for helping me out with the uploading of files to the Virtual Box. I've managed to figure that out.

I was doing a preliminary run for nonLTRs using MGEScan-nonLTR and received the following error:

"FATAL: Failed to open sequence database file /home/Jeremy/galaxy/tools/Pipeline/MGEScan/WORK/output/f/output1.pep
Usage: hmmsearch [-options]
Available options are: -h : help; print brief help on version"

Would you happen to have any insight on what might be causing this error?

Thanks
Jack

Are the Galaxy tools available on a Tool Shed or elsewhere?

Hi PiRATE guys ;)

I hope all is going well for you. Maybe I haven't searched enough, but is there a way to find the code for the Galaxy tools somewhere, for example on a Galaxy Tool Shed or in another GitHub repository? I think this could be really useful for offering a cloud-based version of PiRATE, for example through usegalaxy.eu (and why not a dedicated subdomain like pirate.usegalaxy.eu), and for giving any Galaxy server the opportunity to offer it.

Best,

Yvan

TEdenovo error

I tried to run TEdenovo on my data set but received the following error:

There is no job registered for the following users: jeremy
Fatal error: Exit code 1 ()
cat: /home/jeremy/galaxy/tools/Pipeline/REPET/WORK/genome_Blaster_Piler_Map/genome_Blaster_Piler_Map_consensus.fa: No such file or directory

Script for eliminating redundancy with CD-HIT-EST

Hello, I would like to ask whether there is a script for eliminating redundancy with CD-HIT-EST. After the clustering, is there any manual curation that we are supposed to do?

resize virtual hard disk

Hello,

I was wondering whether it is possible to resize the virtual hard disk that is created in the PiRATE VM. In my case, the hard disk is limited to 300 GB.

I am using VirtualBox 5.2.18 on Ubuntu.

Many thanks
Jorge

Problem with TEannot

Hi!

I have been trying to run TEannot in a genome assembly, and the following error appeared:


Fatal error: Exit code 1 ()
ERROR: job '88457', supposedly still running, is not handled by SGE anymore
it was launched the 2018-10-22 19:22:42 (> 1.00 hours ago)
this problem can be due to:
* memory shortage, in that case, decrease the size of your jobs;
* timeout, in that case, decrease the size of your jobs;
* node failure or database error, in that case, launch the program again or ask your system administrator.

My genome size is ~270 Mb.

It also reports:

There is no job registered for the following users: jeremy

Thank you

Use Docker to make it more robust?

The VM is huge because of the GUI.

Could Docker be used to provide the virtual environment instead?

root password for the VM

I've downloaded the VM, but the internet connection is not working. I need to install the Guest Additions in order to share files with the host, and I need the root password to do so.

How to freely copy/move files between PiRATE and Ubuntu?

Dear Jeremy,
Could you tell me how to freely copy/move files between PiRATE and Ubuntu?
I use Ubuntu 18.04. I shared a Desktop/VM-shared-folder between the VM and Ubuntu.
Thank you very much,
wj

TEannot query

I have a quick question regarding TEannot. I just did the first run, but in the tutorial you do a second run. Which of the output files from the first run do you use for this?

Full length copy
Full length fragment
TE with one copy and more

Cheers
Pepijn

hmmer error/bug

Hi,
While running PiRATE, I came across this hmmer bug when running TE-HMMER. It is fixed in newer versions of hmmer, so I updated hmmer on the VM. I have not tried TE-HMMER with the new version yet, but if it works, maybe you could include an updated hmmer in the VM.

_mysql_exceptions.DatabaseError: ERROR: failed to connect to the MySQL database

Dear Jeremy,
PiRATE could run RepeatScout, SINE-finder, LTRharvest, TE-hmmer and RepeatMasker, but it could not run TEdenovo, Helsearch, MITE-Hunter or MGEScan-nonLTR.
When I run PASTEC, it produces the following errors, even when I am logged in as [email protected].

2019-11-23 03:24:26 - PASTEClassifier - INFO - START PASTEClassifier parallelized
2019-11-23 03:24:26 - PASTEClassifier - INFO - Fasta file name: genome.fa
2019-11-23 03:24:26 - PASTEClassifier - INFO - Running STEP 1 of PASTEClassifier parallelized: DetectTEFeatures
START DetectTEFeatures_parallelized Fasta file name: genome.fa split the sequences in batches Homogeneous in size 20kbp launch the programs on each batch
2019-11-23 03:24:35 - DbMySql - ERROR - ERROR 1049: Unknown database 'PASTEC_galaxy'

Fatal error: Exit code 1 ()
ERROR 1008 (HY000) at line 1: Can't drop database 'PASTEC_galaxy'; database doesn't exist
Traceback (most recent call last):
  File "/home/jeremy/Pipeline/REPET_2.5/bin/PASTEClassifier.py", line 405, in <module>
    iPASTEClassifier.run()
  File "/home/jeremy/Pipeline/REPET_2.5/bin/PASTEClassifier.py", line 384, in run
    iDF.run()
  File "/home/jeremy/Pipeline/REPET_2.5/commons/tools/DetectTEFeatures_parallelized.py", line 226, in run
    iDb = DbFactory.createInstance()
  File "/home/jeremy/Pipeline/REPET_2.5/commons/core/sql/DbFactory.py", line 37, in createInstance
    return DbMySql(cfgFileName = configFileName, verbosity = verbosity)
  File "/home/jeremy/Pipeline/REPET_2.5/commons/core/sql/DbMySql.py", line 166, in __init__
    raise DatabaseError(msg)
_mysql_exceptions.DatabaseError: ERROR: failed to connect to the MySQL database

long read sequence data

I was wondering whether it is possible now, or will be in a future version, to run this pipeline using long-read sequence data such as Nanopore or PacBio.

Usage without VM

We have a managed galaxy service which is connected to our HPC system.
Some users showed interest in using the PiRATE pipeline. As a workaround we started the VM but we want to run the pipeline on our managed Galaxy service.
Provided that all software dependencies are installed, what would be necessary to replicate the VM ?
Do I just need to copy the tools-conf.ini and the tools xmls ?

Run PiRATE without using Illumina raw data

Hello, I am wondering whether it is possible to run PiRATE using genome assemblies only. I only have PacBio long-read sequencing data, so I cannot use tools such as RepArch, since they require Illumina raw data.

Single Transposable elements library

Hello Berthelier
Forgive me for asking dozens of questions; I just want to make as much use of this tool as I can.
In your study you were interested in potentially autonomous TEs and focused mostly on them. In my case, however, the goal is to identify, classify and annotate TEs in genomes of interest. After classification, I want to group my elements as "class I retrotransposons", "class II DNA transposons" and "uncategorized elements". Following classification, I wish to cluster the TE consensus sequences into families using MCL, then create a TE library for each organism and perform a genome-wide annotation using the TEannot pipeline.

PiRATE pipeline.

Hello, can we use PiRATE for retrieval and annotation of transposable elements in fungal genomes, specifically Basidiomycota?

TE content (%)

Hello Jeremy,

I have decided to use a single TE library (one that excludes the unknown repeats from the PASTEC classification) for the TEannot step. How can I determine the TE content of my genome, e.g. Class I, Class II and total TE content? For the unknown repeats, how can I also determine their proportion in the genome?
From the PASTEC output, I will separate the elements into Class I, Class II and unknown repeats. However, the unknown repeats will be kept out of the TE library used for the TEannot step.
Best regards
Lukanyo
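
One generic way to turn annotated copies into percentages is to sum the bases covered by each class, counting overlapping copies once, and divide by the genome size. The sketch below assumes the annotation has already been reduced to (sequence name, start, end, class) tuples with 1-based inclusive coordinates, for example taken from a GFF file. That input format and the function names are assumptions, not something PiRATE is known to output directly.

```python
from collections import defaultdict


def covered_bp(intervals):
    """Sum the bases covered by 1-based inclusive (start, end) intervals,
    counting overlapping regions only once."""
    total = 0
    current_end = 0
    for start, end in sorted(intervals):
        if end <= current_end:
            continue  # fully contained in a region that was already counted
        total += end - max(start, current_end + 1) + 1
        current_end = end
    return total


def te_content_percent(copies, genome_size_bp):
    """Return {te_class: percentage of the genome covered by that class}."""
    by_class = defaultdict(lambda: defaultdict(list))
    for seq_name, start, end, te_class in copies:
        by_class[te_class][seq_name].append((start, end))
    return {
        te_class: 100.0 * sum(covered_bp(iv) for iv in per_seq.values()) / genome_size_bp
        for te_class, per_seq in by_class.items()
    }


copies = [
    ("chr1", 1, 100, "ClassI"),
    ("chr1", 51, 150, "ClassI"),   # overlaps the previous copy
    ("chr2", 1, 50, "ClassII"),
]
print(te_content_percent(copies, genome_size_bp=1000))
```

Copies of different classes that overlap each other are counted in both classes, so the total TE content should be computed by passing all copies under a single label rather than by adding up the per-class percentages.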

How to interpret the output of PASTEC in PiRATE

Dear Jeremy,
PASTEC usually gives consistent evidence for a TE's superfamily, but we have these:
CI=27; coding=(TE_BLRtx: I-2_BM:ClassI:LINE:I: 5.09%, L2-13_DRe:ClassI:LINE:Jockey: 5.92%, L2_Plat1r:ClassI:LINE:Jockey: 6.23%, L2_Plat1u:ClassI:LINE:Jockey: 6.33%; TE_BLRx: L2-13_DRe_1p:ClassI:LINE:Jockey: 8.19%, L2-24_ACar_1p:ClassI:LINE:Jockey: 18.50%); struct=(TElength: <700bps); other=(SSRCoverage=0.12)
CI=27; coding=(TE_BLRtx: I-2_BM:ClassI:LINE:I: 7.61%; TE_BLRx: L2-18_ACar_1p:ClassI:LINE:Jockey: 19.78%, L2-29_ACar_2p:ClassI:LINE:Jockey: 20.92%, L2-9_DRe_1p:ClassI:LINE:Jockey: 22.64%); struct=(TElength: <700bps); other=(SSRCoverage=0.12)

Should we treat these contigs as LINE/I or LINE/Jockey? They are not flagged as potentially chimeric; they are reported as OK in the output.
What is the meaning of TE_BLRtx and TE_BLRx? Do the percentages indicate similarity or divergence? Should we rely on TE_BLRtx or TE_BLRx?
Is this confidence index (CI) high or low? Is there a threshold for filtering out low-quality annotations?
Is the PASTEC output more accurate than that of RepeatMasker, which is based on similarity under the 80-80-80 rule?

Thank you very much and with best wishes,
Jie
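
To compare the evidence more easily, the coding field of lines like the two above can be tabulated per evidence type. The sketch below is only a parsing aid written against those two example lines. The function name `parse_coding` is made up, the field layout is inferred from the examples rather than from a PASTEC specification, and the sketch does not interpret what TE_BLRtx, TE_BLRx or the percentages mean.

```python
import re


def parse_coding(comment):
    """Parse the coding=(...) field of a PASTEC comment string.

    Returns {evidence_type: [(name, te_class, order, superfamily, percent), ...]}.
    """
    match = re.search(r"coding=\((.*?)\)", comment)
    if not match:
        return {}
    result = {}
    for section in match.group(1).split(";"):
        evidence, _, hits = section.strip().partition(":")
        parsed = []
        for hit in hits.split(","):
            parts = hit.strip().split(":")
            if len(parts) != 5:
                continue  # skip anything that does not match the example layout
            name, te_class, order, superfamily, percent = parts
            parsed.append(
                (name, te_class, order, superfamily, float(percent.strip().rstrip("%")))
            )
        result[evidence] = parsed
    return result


line = (
    "CI=27; coding=(TE_BLRtx: I-2_BM:ClassI:LINE:I: 7.61%; "
    "TE_BLRx: L2-18_ACar_1p:ClassI:LINE:Jockey: 19.78%, "
    "L2-29_ACar_2p:ClassI:LINE:Jockey: 20.92%, "
    "L2-9_DRe_1p:ClassI:LINE:Jockey: 22.64%); "
    "struct=(TElength: <700bps); other=(SSRCoverage=0.12)"
)
for evidence, hits in parse_coding(line).items():
    print(evidence, [(hit[3], hit[4]) for hit in hits])
```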

How to import raw Illumina files

Hello--

I am having a hard time figuring out how to upload my raw Illumina files (located on the desktop of the local computer). The manual suggests to "copy and paste" the files to "[email protected]", but I am not quite sure how to do this. Clicking and dragging my files from the desktop to this location doesn't seem to work.

Any help or clarification would be greatly appreciated.

Many thanks
Jack

PiRATE-VM

Hi,

I cannot import the PiRATE-VM OVA from the site into my VirtualBox. The PiRATE-VM file does not work. Could someone fix this?

thanks!
