jberthelier / pirate
PiRATE (Pipeline to Retrieve and Annotate Transposable Elements)
Home Page: http://doi.org/10.17882/51795
Hi,
I am having problems uploading data. My Illumina data is ~70 GB, which is over 1 GB, so I tried to use FTP to upload it. My data files are already on my virtual machine. Here are the two methods I tried:
1) Copy-paste my files into the directory “[email protected]” of the PiRATE-VM (/home/Jeremy/Documents/[email protected]).
It doesn't work. I can see the files in [email protected], but nothing is listed when I select "Choose FTP files" to upload data in the Galaxy web interface.
2) Use FileZilla to transfer the files. Host: 134.246.55.42, username: my_email, password: ****.
It doesn't work. FileZilla reports "Couldn't connect to server".
However, when I used the host usegalaxy.org, I managed to connect to the server and upload files.
Is there anything wrong with what I am doing? And how can I solve this problem and upload data via FTP?
Thank you for your attention; I hope you can suggest a solution.
Hello @ Jérémy Berthelier
Is it possible to map RNA-seq data directly to potentially autonomous TEs, without mapping to the genome assembly?
Hello,
I plan to install and use PiRATE on VirtualBox on my laptop.
However, I cannot download the PiRATE-VM OVA file to my local folder.
When I clicked on the PiRATE-VM link, a new window popped up with unknown scripts, which does not look like a normal download.
I would appreciate it if you could guide me on how to download it.
Kyung
Hi there, I was wondering: is there a way to get the software other than as a virtual machine? I ask because the VM won't work with our cluster.
Several people have asked about this.
Here is my tutorial:
1) First, check your network configuration in VirtualBox for the PiRATE-VM
Select Global Settings from the File menu --> Select the Network item in the list
- In "Attached to", check that "Bridged Network" is selected
- Go to "Advanced" (blue arrow)
- Check that Promiscuous Mode is set to "Allow All"
- Cable connected: checked
2) Start your PiRATE-VM
2.a - Just in case, restart proftpd:
open a Konsole terminal and run:
sudo /etc/init.d/proftpd reload
sudo /etc/init.d/proftpd restart
(the admin password is jeremy07)
2.b - Obtain your IP address
open a Konsole terminal and run the ifconfig command (https://www.computerhope.com/unix/uifconfi.htm)
Example:
eth0 Link encap:Ethernet HWaddr 09:00:12:90:e3:e5
inet addr:192.168.1.29 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::a00:27ff:fe70:e3f5/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:54071 errors:1 dropped:0 overruns:0 frame:0
TX packets:48515 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:22009423 (20.9 MiB) TX bytes:25690847 (24.5 MiB)
This will show you the "inet addr:" value, which is your IP address.
Then go back to your physical machine.
3) Open FileZilla and connect to the PiRATE-VM
Host: your IP address, login: jeremy, password: jeremy07, port: 21
I just re-checked this method and it works for me.
Best!
Jeremy
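For anyone who wants to script steps 2.b and 3, here is a minimal Python sketch. It assumes the old-style ifconfig output shown in the example above; the function names are hypothetical and not part of PiRATE. It extracts the "inet addr:" value and, optionally, checks from the physical machine whether port 21 answers before you open FileZilla.

```python
import re
import socket
from typing import List


def extract_ipv4_addresses(ifconfig_output: str) -> List[str]:
    """Return the IPv4 addresses found on 'inet addr:' lines, skipping loopback."""
    found = re.findall(r"inet addr:(\d{1,3}(?:\.\d{1,3}){3})", ifconfig_output)
    return [address for address in found if not address.startswith("127.")]


def ftp_port_is_open(host: str, port: int = 21, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Sample text copied from the ifconfig example in the tutorial.
sample = """eth0 Link encap:Ethernet HWaddr 09:00:12:90:e3:e5
inet addr:192.168.1.29 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::a00:27ff:fe70:e3f5/64 Scope:Link"""

addresses = extract_ipv4_addresses(sample)
print(addresses)
```

On the physical machine, calling ftp_port_is_open with the address found in the VM gives a quick yes/no before trying FileZilla. A False result points to the network settings in step 1 or to proftpd in step 2.a rather than to the login details.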
I have reformatted the input file to 60 bp per line, but I still get an error from Helsearch:
Fatal error: Exit code 1 ()
[formatdb] ERROR: ./working/GCAGGT_CGA_ACCTGC_5/GCAGGT_CGA_ACCTGC_5.nhrOutput Blast-def-line-set.E.<title>
Invalid value(s) [9] in VisibleString [genome_P#12826044#12826244 ...]
[formatdb] ERROR: ./working/GCAGGT_CGA_ACCTGC_5/
Hi Jeremy--
Once again, thanks for constructing PiRATE. I've very much enjoyed using it so far.
I do have another question, if you don't mind. More like a clarification. I've finished the classification step and have received my output from PASTEC. Now I'm starting to look into the manual check step, and had a question regarding how to proceed. My understanding is that the manual check is only meant to be performed on the autonomous sequences, using MCL and Blastn. In order to do so, is the expectation that one has to pull LTR, LINE and TIR sequences (using PASTEC information) manually and then feed that to MCL and Blastn? If so, was there a script that was used for this purpose, or is the expectation that we write that ourselves?
Thanks for the clarification.
Best
Jack
Hi @JBerthelier ,
I hope all is fine with you. Before running the PiRATE pipeline, I just want to confirm: is it necessary to have an assembled genome to run it? For the species I'm working on, an assembled genome is not possible because of its large genome size. Can I use approach 4 (build repeated elements) in step one and then proceed with the following steps as they are? I am asking because I checked the final step, TEannot, and there are two input requirements
A PiRATE user reported this error with Helsearch:
Fatal error: Exit code 1 ()
[formatdb] WARNING: Cannot add sequence number 1 (lcl|1_./working/GCCCTGA_CA_TCAGGGC_156) because it has zero-length.
[formatdb] FATAL ERROR: Fatal error when adding sequence to BLAST database.
rm: cannot remove './working/GCCCGC_AAGA_GCGGGC_2/GCCCGC_AAGA_GCGGGC_2.nhr': No such file or directory
rm: cannot remove './working/GCCCGC_AAGA_GCGGGC_2/GCCCGC_AAGA_GCGGGC_2.nsq': No such file or directory
[formatdb] ERROR: ./working/TCCAAT_AAT_ATTGGA_2/TCCAAT_AAT_ATTGGA_2.nhrOutput Blast-def-line-set.E.<title>
Invalid value(s) [9] in VisibleString [genome_P#2379729#2379929 ...]
[formatdb] ERROR: ./working/TCCAAT_AAT_ATTGGA_2/TCCAAT_AAT_ATTGGA_2.nhrOutput Blast-def-line-set.E.<title>
Invalid value(s) [9] in VisibleString [genome_P#28421039#28421239 ...]
cp: cannot stat '/home/jeremy/Pipeline/helsearch1.0/helitron_contain/helitron.fa': No such file or directory
To fix this error, the input sequences must be wrapped at 60 bp per line.
Best,
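One way to apply the 60 bp rule is a small Python sketch like the one below. It is not a PiRATE tool, and the function names are made up for illustration. It rewraps every sequence at 60 characters per line and also drops records with no sequence, on the assumption that empty records are what trigger the "zero-length" warning above.

```python
from typing import Iterable, List, Optional


def wrap_record(header: Optional[str], sequence: str, width: int) -> List[str]:
    """Return the FASTA lines for one record, or nothing if the record is empty."""
    if header is None or not sequence:
        return []
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return [header] + chunks


def rewrap_fasta(lines: Iterable[str], width: int = 60) -> List[str]:
    """Rewrap all sequences to `width` columns and skip zero-length records."""
    output: List[str] = []
    header: Optional[str] = None
    parts: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            output.extend(wrap_record(header, "".join(parts), width))
            header, parts = line, []
        else:
            parts.append(line)
    output.extend(wrap_record(header, "".join(parts), width))
    return output


# Tiny demonstration with one long record, one empty record and one short record.
demo = [">seq1", "A" * 130, ">empty", ">seq2", "ACGT"]
wrapped = rewrap_fasta(demo)
for line in wrapped:
    print(line)
```

For a real genome, read the lines from the input file and write the result to a new file, for example `open("genome_60.fa", "w").write("\n".join(rewrap_fasta(open("genome.fa"))) + "\n")`; both file names are placeholders. The sketch keeps the whole file in memory, which is fine for modest assemblies but not for very large ones.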
Hello @ Jérémy Berthelier
What does a weight below 1 (for a genome assembly) mean? Or a weight of more than 1 GB (for Illumina raw data)?
Hello @ Jérémy Berthelier
Do the default nucleotide and protein databanks of PiRATE include all the sequences from Repbase Update? I am planning to do a comprehensive genome-wide TE analysis in my study. Thank you in advance!
When trying to convert my FASTQ to FASTA, I get the following error:
fastq_to_fasta: Invalid quality score value (char '#' ord 35 quality value -29) on line 4
gzip: stdout: Broken pipe
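The numbers in that message are consistent with a quality-offset mismatch: '#' has character code 35, and 35 - 64 = -29, so the tool appears to be reading the file as Phred+64 while the data look like Phred+33. FASTX-Toolkit builds commonly accept a -Q33 option for this, though whether the version in the VM does is an assumption. As an alternative, here is a minimal Python sketch (hypothetical function names, not part of PiRATE) that converts FASTQ to FASTA without interpreting the quality line at all. It assumes plain 4-line FASTQ records.

```python
from typing import Iterable, List


def phred_score(char: str, offset: int) -> int:
    """Quality value of one FASTQ quality character for a given ASCII offset."""
    return ord(char) - offset


def fastq_to_fasta(lines: Iterable[str]) -> List[str]:
    """Convert 4-line FASTQ records to FASTA lines, ignoring the quality strings."""
    cleaned = [line.rstrip("\r\n") for line in lines]
    cleaned = [line for line in cleaned if line]  # drop blank lines
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input is not a multiple of 4 lines")
    fasta: List[str] = []
    for start in range(0, len(cleaned), 4):
        name, sequence, plus, _quality = cleaned[start:start + 4]
        if not name.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (start + 1))
        fasta.append(">" + name[1:])
        fasta.append(sequence)
    return fasta


# The same character gives different scores under the two common offsets.
print(phred_score("#", 64), phred_score("#", 33))

record = ["@read1", "ACGTN", "+", "IIII#"]
converted = fastq_to_fasta(record)
print(converted)
```

For gzip-compressed reads, `gzip.open(path, "rt")` from the standard library can supply the lines.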
Hi there,
Thank you for providing this package for TE annotation. It makes it super easy to use so many TE annotation tools. I have a quick question about the output of the different annotation programs. I found that they usually provide only a .fasta output file. What would be the best way to retrieve the genomic locations of the TE annotations? For example, is there a GFF output that I can easily get?
Thank you
YY
I can't seem to change the keyboard in the shell to English. I can't find the "/" key, and I have tried all the keys to find it. Why not use a universal keyboard?
I am unable to get an internet connection in the PiRATE-VM.
I tried many network settings, including Bridged Adapter, but nothing changes.
Can someone help me?
Many thanks in advance
Hi Jeremy,
I've tried to upload data via ftp; however, there is no ftp option that appears on the tab for data file drop. There is only a button for choose local file. I can not upload the data via USB or drag and drop because I can not access my local computer drives once I am on the PiRATE VM. Do you have any suggestions?
thank you,
I receive the following error message when I try to run TEdenovo on my assembly:
Fatal error: Exit code 1 ()
cat: /home/jeremy/galaxy/tools/Pipeline/REPET/WORK/genome_Blaster_Grouper_Map/genome_Blaster_Grouper_Map_consensus.fa: No such file or directory
cat: /home/jeremy/galaxy/tools/Pipeline/REPET/WORK/genome_Blaster_Piler_Map/genome_Blaster_Piler_Map_consensus.fa: No such file or directory
cat: /home/jeremy/galaxy/tools/Pipeline/REPET/WORK/genome_Blaster_Recon_Map/genome_Blaster_Recon_Map_consensus.fa: No such file or directory
The assembly's FASTA lines are no longer than 60 bp, and the read names are formatted as follows: >Ccostatus_symbiont_assembly_1
Any suggestions?
Some users had trouble logging in to the PiRATE VM.
The cause was that we used an AZERTY keyboard, which caused problems for QWERTY and other keyboards.
The PiRATE VM has been updated to fix this problem:
no password is needed now, but please configure your keyboard in the PiRATE-VM.
Best,
I have a shared drive set up, but I can't get to it from the VM. All I want to do is access my data from the VM or transfer it there. The tutorial is not working for me at this stage. Is there any more information on how to do this, and on how to change the keyboard? I can work around it to launch Galaxy, but I can't find the $ key.
Hello,
Over the past couple of weeks I tried to run several detection steps with the PiRATE pipeline, and all of them reported the same error. It is about invalid characters in the FASTA file. Below is the error reported by RepeatMasker.
Fatal error: Exit code 1 () FastaDB::_cleanIndexAndCompact - WARNING: RepeatMasker encountered a line in an unrecognized format. The offending subsequence is "GTTGG)N;GPM". The offending line is "TGTACGGGTGTAATGATGTAATTGCTTTTGTATGTTGG)N;GPMXK6I,3". Seq
The long report displays a lot of errors like this. Each line reports different unrecognized characters, yet the FASTA file does not contain them. I searched for the lines with this error and for the specific characters, in the files on both the VM and my PC, and found nothing.
I think these files may be getting corrupted after they are loaded into the pipeline.
Am I doing something wrong with file management, or is there a problem with my .ova file?
PS. I transferred the FASTA file with FileZilla.
Thank you!
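To check a FASTA file for characters like these before and after the transfer, a short Python sketch along the following lines may help. It assumes that only IUPAC nucleotide letters are acceptable in sequence lines; the function name is hypothetical and this is not what RepeatMasker itself runs.

```python
from typing import Iterable, List, Tuple

# IUPAC nucleotide codes, upper and lower case (an assumption about what is valid).
VALID_CHARACTERS = set("ACGTUNRYSWKMBDHV" + "acgtunryswkmbdhv")


def find_invalid_fasta_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, offending characters) for each bad sequence line."""
    problems: List[Tuple[int, str]] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line or line.startswith(">"):
            continue  # skip blank lines and headers
        offending = sorted(set(line) - VALID_CHARACTERS)
        if offending:
            problems.append((number, "".join(offending)))
    return problems


# The second sequence line is the offending line quoted in the RepeatMasker error.
example = [
    ">scaffold1",
    "ACGTACGTNNNN",
    "TGTACGGGTGTAATGATGTAATTGCTTTTGTATGTTGG)N;GPMXK6I,3",
]
report = find_invalid_fasta_lines(example)
for line_number, characters in report:
    print(line_number, repr(characters))
```

Running the same check on the copy on the PC and on the copy inside the VM should show whether the file changed during the transfer.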
I tried running the PASTEClassifier job in PiRATE, but got the following error message:
SSI index construction failed:
primary keys not unique: 'PiRATEdb_Crypton_YR_NA' occurs more than once
Traceback (most recent call last):
File "/home/jeremy/Pipeline/REPET_2.5/bin/PASTEClassifier.py", line 405, in <module>
I assume this has to do with the MySQL database? Any way to fix it?
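Since the message complains that a name "occurs more than once", one thing worth checking is whether a FASTA file used by the job contains repeated sequence names. The Python sketch below lists them. It is only a guess at the cause, the function name is hypothetical, and it treats the first word of each header as the sequence name.

```python
from collections import Counter
from typing import Iterable, List


def duplicate_fasta_ids(lines: Iterable[str]) -> List[str]:
    """Return the sequence names (first word after '>') that appear more than once."""
    names = [
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    ]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Small demonstration using the name quoted in the error message.
example = [
    ">PiRATEdb_Crypton_YR_NA first copy",
    "ACGT",
    ">some_other_element",
    "GGCC",
    ">PiRATEdb_Crypton_YR_NA",
    "TTAA",
]
duplicates = duplicate_fasta_ids(example)
print(duplicates)
```

If a name shows up in the result, renaming or removing the extra copy in that file is the obvious next thing to try.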
Hello Jeremy, I encountered this error during my analysis with TE-HMMER.
Fatal error: Exit code 134 ()
Translate nucleic acid sequences
Fatal exception (source file p7_alidisplay.c, line 729):
backconverted subseq didn't end at expected length (scaffold255_4/PiRATEdb_Academ_HEL_NA)
/home/jeremy/galaxy/database/jobs_directory/001/1961/tool_script.sh: line 9: 10158 Aborted (core dumped) hmmsearch -A result_hmmer.stock /home/jeremy/galaxy/tools/Pipeline/Hmmer/PiRATE_hmmer.hmm genome_prot.fa > /home/jeremy/galaxy/database/files/003/dataset_3875.dat
Dear @JBerthelier,
I have just successfully installed PiRATE, but I am now facing a problem with uploading data from my local device. Whenever I select the "choose local file" option, it always opens your directory instead of my local PC. I haven't used this platform before; could you please help me resolve this issue?
Many Thanks!
Hi
I'm really interested in using your pipeline for TE analysis of multiple different non-model organisms.
I'd prefer to use your tool through Docker on the command line to speed up throughput on a cluster, as I'm looking to do a comparative analysis of many genomes.
I was wondering how things are progressing with getting Pirate to Docker?
Thanks.
How to set the number of processors and slots for the tools TEdenovo, PASTEC and TEannot (REPET package) in PiRATE:
in a Konsole terminal:
sudo qconf -mq main
(the password is jeremy07)
then change the number of processors and slots according to your VM/machine
Best
Jeremy
Hi Jeremy,
Is it possible to generate a statistics summary similar to the table 1 in the paper? Or how can I easily extract the numbers of the different TE's and proportions of the genome etc?
Cheers
Pepijn
Hello!
I am trying to use it in my studies, and I have an important question about the TE annotation step. Is the genome size really in Mb, as in the pipeline's example (85,000,000 Mb), or should the size be given in bp? I tried entering the size in Mb, but the statistical results did not make sense.
Thanks a lot
Hi Jeremy--
Many thanks for helping me out with the uploading of files to the Virtual Box. I've managed to figure that out.
I was doing a preliminary run for nonLTRs using MGEScan-nonLTR and received the following error:
"FATAL: Failed to open sequence database file /home/Jeremy/galaxy/tools/Pipeline/MGEScan/WORK/output/f/output1.pep
Usage: hmmsearch [-options]
Available options are: -h : help; print brief help on version"
Would you happen to have any insight on what might be causing this error?
Thanks
Jack
Hi Pirate guys ;)
I hope all is going well for you. Maybe I haven't searched enough, but is there a way to find the code for the Galaxy tools somewhere, for example on a Galaxy ToolShed or another GitHub repo? I think it would be of real interest to offer a cloud-based version of PiRATE, for example through usegalaxy.eu (and why not a dedicated subdomain like pirate.usegalaxy.eu), and to give any Galaxy server the opportunity to offer it.
Best,
Yvan
I tried to run TEdenovo on my data set, but I am receiving the following error:
There is no job registered for the following users: jeremy
Fatal error: Exit code 1 ()
cat: /home/jeremy/galaxy/tools/Pipeline/REPET/WORK/genome_Blaster_Piler_Map/genome_Blaster_Piler_Map_consensus.fa: No such file or directory
Hello, I would like to ask whether there is a script for eliminating redundancy in the CD-HIT step. After the clustering, is there any manual curation that we are supposed to do?
Hello,
I was wondering whether it is possible to resize the virtual hard disk that is created in the PiRATE VM. In my case, the hard disk is limited to 300 GB.
I am using VirtualBox 5.2.18 on Ubuntu.
Many thanks
Jorge
Hi!
I have been trying to run TEannot on a genome assembly, and the following error appeared:
Fatal error: Exit code 1 ()
ERROR: job '88457', supposedly still running, is not handled by SGE anymore
it was launched the 2018-10-22 19:22:42 (> 1.00 hours ago)
this problem can be due to:
* memory shortage, in that case, decrease the size of your jobs;
* timeout, in that case, decrease the size of your jobs;
* node failure or database error, in that case, launch the program again or ask your system administrator.
My genome size is ~270 Mb.
It also reports:
There is no job registered for the following users: jeremy
Thank you
The VM is huge because of the GUI.
Could Docker be used to provide the virtual environment instead?
I've downloaded the VM, but the internet connection is not working. I need to install the Guest Additions in order to share files with the host, and I need the root password to do so.
Dear Jeremy,
Could you tell me how to freely copy or move files between PiRATE and Ubuntu?
I use Ubuntu 18.04. I set up a shared folder (Desktop/VM-shared-folder) between the VM and Ubuntu.
Thank you very much,
wj
Hi
I have a quick question regarding TEannot. I just did the first run, but in the tutorial you do a second run. Which of the output files from the first run do you use for this?
Cheers
Pepijn
Hi,
While running PiRATE, I came across this HMMER bug when running TE-HMMER. It is fixed in newer versions of HMMER, so I updated HMMER on the VM. I have not tried TE-HMMER with the new version yet, but if it works, maybe you could include an updated HMMER in the VM.
Dear Jeremy,
PiRATE could run RepeatScout, SINE-finder, LTRharvest, TE-HMMER and RepeatMasker, but it could not run TEdenovo, Helsearch, MITE-Hunter or MGEScan-nonLTR.
When I run PASTEC, I get the following errors, even when logged in as [email protected]:
2019-11-23 03:24:26 - PASTEClassifier - INFO - START PASTEClassifier parallelized
2019-11-23 03:24:26 - PASTEClassifier - INFO - Fasta file name: genome.fa
2019-11-23 03:24:26 - PASTEClassifier - INFO - Running STEP 1 of PASTEClassifier parallelized: DetectTEFeatures
START DetectTEFeatures_parallelized Fasta file name: genome.fa split the sequences in batches Homogeneous in size 20kbp launch the programs on each batch
2019-11-23 03:24:35 - DbMySql - ERROR - ERROR 1049: Unknown database 'PASTEC_galaxy'
Fatal error: Exit code 1 ()
ERROR 1008 (HY000) at line 1: Can't drop database 'PASTEC_galaxy'; database doesn't exist
Traceback (most recent call last):
  File "/home/jeremy/Pipeline/REPET_2.5/bin/PASTEClassifier.py", line 405, in <module>
    iPASTEClassifier.run()
  File "/home/jeremy/Pipeline/REPET_2.5/bin/PASTEClassifier.py", line 384, in run
    iDF.run()
  File "/home/jeremy/Pipeline/REPET_2.5/commons/tools/DetectTEFeatures_parallelized.py", line 226, in run
    iDb = DbFactory.createInstance()
  File "/home/jeremy/Pipeline/REPET_2.5/commons/core/sql/DbFactory.py", line 37, in createInstance
    return DbMySql(cfgFileName = configFileName, verbosity = verbosity)
  File "/home/jeremy/Pipeline/REPET_2.5/commons/core/sql/DbMySql.py", line 166, in __init__
    raise DatabaseError(msg)
_mysql_exceptions.DatabaseError: ERROR: failed to connect to the MySQL database
I was wondering whether it would be possible, now or in a future version, to run this pipeline with long-read sequence data such as Nanopore or PacBio.
We have a managed galaxy service which is connected to our HPC system.
Some users showed interest in using the PiRATE pipeline. As a workaround we started the VM but we want to run the pipeline on our managed Galaxy service.
Provided that all software dependencies are installed, what would be necessary to replicate the VM ?
Do I just need to copy the tools-conf.ini and the tool XML files?
Hello, I am wondering whether it is possible to run PiRATE using genome assemblies only. I only have PacBio long-read sequencing data, so I cannot use tools such as RepArch, since they require Illumina raw data.
Hello Berthelier
Forgive me for the dozens of questions; I just want to make as much use of this tool as I can.
In your study you were interested in potentially autonomous TEs and focused mostly on them. What if, in my case, the goal is to identify, classify and annotate TEs in genomes of interest? After classification, I want to group my elements as "class I retrotransposons", "class II DNA transposons" and "un-categorized elements". Following classification, I wish to cluster the TE consensus sequences into families using MCL, then create a TE library for each organism and perform a genome-wide annotation using the TEannot pipeline.
Hello, can we use PiRATE for retrieval and annotation of transposable elements in fungal genomes, specifically Basidiomycota?
Hello Jeremy,
I have decided to use a single TE library (one that excludes the unknown repeats from the PASTEC classification) for the TEannot step. How can I determine the TE content of my genome, e.g. Class I, Class II and total TE content? And how can I determine the proportion of unknown repeats in the genome?
From the PASTEC output, I decided to separate the elements into Class I, Class II and unknown repeats. However, the unknown repeats will be kept out of the TE library used for the TEannot step.
Best regards
Lukanyo
Dear Jeremy,
PASTEC usually gives consistent evidence for a TE's superfamily, but we have these:
CI=27; coding=(TE_BLRtx: I-2_BM:ClassI:LINE:I: 5.09%, L2-13_DRe:ClassI:LINE:Jockey: 5.92%, L2_Plat1r:ClassI:LINE:Jockey: 6.23%, L2_Plat1u:ClassI:LINE:Jockey: 6.33%; TE_BLRx: L2-13_DRe_1p:ClassI:LINE:Jockey: 8.19%, L2-24_ACar_1p:ClassI:LINE:Jockey: 18.50%); struct=(TElength: <700bps); other=(SSRCoverage=0.12)
CI=27; coding=(TE_BLRtx: I-2_BM:ClassI:LINE:I: 7.61%; TE_BLRx: L2-18_ACar_1p:ClassI:LINE:Jockey: 19.78%, L2-29_ACar_2p:ClassI:LINE:Jockey: 20.92%, L2-9_DRe_1p:ClassI:LINE:Jockey: 22.64%); struct=(TElength: <700bps); other=(SSRCoverage=0.12)
Thank you very much and with best wishes,
Jie
Hello--
I am having a hard time figuring out how to upload my raw illumina files (located on the desktop of the local computer). The manual suggests to "copy and paste" the files to "[email protected]", but I am not quite sure how to do this. Clicking and dragging my files from the desktop to this location doesn't seem to be working.
Any help or clarification would be greatly appreciated.
Many thanks
Jack
Hi,
I cannot import the PiRATE-VM OVA from the site into my VirtualBox. The PiRATE-VM file does not work. Could someone fix this?
thanks!