genetalks / gtz
A high performance and compression ratio compressor for genomic data, powered by GTXLab of Genetalks.
License: Other
gtz is fantastic compression software, but I wonder when it will run on the Arm architecture (v8/v9)?
I hit this error while decompressing an ordinary gz file. The gz file itself is fine, just fairly large (25 GB). Please look into it.
$ ls -sh /resdata/SCT_DATA/data_deliver/JY_combined_R2.fastq.gz
25G /resdata/SCT_DATA/data_deliver/JY_combined_R2.fastq.gz
$ zcat /resdata/SCT_DATA/data_deliver/JY_combined_R2.fastq.gz | head
@E00552:208:HNJFYCCXY:8:1101:2351:1379 2:N:0:NGAGGCTG
NATCAGAATGAGCTGGTGGGAACCTTGGGCAGCCAAACGGAGCGGCGTTCTGCACCATGTCCCATCCAGTGCTGCGAATCCACGCCCCGCAGCCCTGCCCCCCCGCGACAGCTCACACCATGGCTCGAGGACAAGGTGTTATCCCGACAC
+
#---<A-FJF-F-7--7-7-7-7F<F-<J---7AAAF-7---A-F--7-7-7-<-7A-7AJJF-AA<7-7-7-A--7---7<A---7--7A<7-7---7-7F7-)-)7)7-A<)--AF----))-)-))))--<-----7---7-)))))
@E00552:208:HNJFYCCXY:8:1101:2392:1379 2:N:0:NGAGGCTG
NCTCATCCCAGCAGGCCCTCCCTTAGCTGAGGGAATTCTTTTTCCCCTCCCTCCACCGACAAATATTGACAGGCACCCACCGAGGATGTGCAGAGCTCAGCCGCGGCTGCGGGGACTCAATTTGCAACAGACATGGACTCCCCCCTCACG
+
#<-A-----7----7----7F---<-7-<--77-<--A<--7-<7<A--A7-7A7AJ<----7---7<--<FF-77-7-<J-A-77-7FAA----7-7A--<-)-)))-))))))-)7---------7------7--7-)))))))-)))
@E00552:208:HNJFYCCXY:8:1101:2514:1379 2:N:0:NGAGGCTG
NAAGAATTTCAAAGCCTTCGCTAGTCTCCGTATGGCCCGTGCCAACGCCCGGCTCTTCGGCATACGGGCAAAAAGAGCCAAGGAAGCCGCAGAACAGGATGTTGAAAAGAAAAAATAAAGCCCTCCTGGGGACTTGGAATCAAAAAAAAA
If I have many fq.gz files in one directory, I would like to be able to use this command:
gtz *.gz --ref /data.genome.fa
Of course, I can also write a shell for loop to achieve the same thing.
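The for-loop workaround mentioned above could look roughly like the sketch below. The function name compress_each, the GTZ_CMD variable, and the example paths are hypothetical; only the "gtz FILE --ref REF" call shape comes from the command in this issue.

```shell
# compress_each REF DIR: call the compressor once per *.gz file in DIR.
# GTZ_CMD is an illustrative override; it defaults to plain `gtz`.
compress_each() {
    local ref="$1" dir="$2" cmd="${GTZ_CMD:-gtz}"
    local f
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue      # skip the literal pattern when nothing matches
        "$cmd" "$f" --ref "$ref" || echo "failed: $f" >&2
    done
}

# Example call (hypothetical paths), left commented out so the sketch is safe to source:
# compress_each /path/to/genome.fa .
```

Running one file per call also means a failure on one input does not stop the rest of the batch.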
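Reference file: genome.fa (hg19)
Download URL: https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
Reference used for the hisat2 run: hg19 genome_tran
Download URL: https://ccb.jhu.edu/software/hisat2/manual.shtml
The BAM file was already generated with the default hisat2 pipeline (not hisat2-gtz). gtz reports the following error when packing it: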
prepare compression... 100%, cost 23s (18|5)
Enabling high-rate compression mode with /home/.config/gtz/genome.fa-D2A70550489DE356A2CD6BFC40711204.bam.rbin2( hardware speedup )
[ ] 0%
RNAME "1" was not found in reference file: /home/Erythropoiesis_APA/HSC_APA/hsc_apa1/04-salmon/00-file/genome.fa
error: the reference file was detected to not match this BAM file!
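One possible cause, stated as an assumption: the BAM and the FASTA name their sequences differently. The UCSC hg19 chromFa files use names such as "chr1", while the message above reports RNAME "1". A minimal sketch for listing the names that appear in the BAM header but not in the reference follows; the helper names and file names are illustrative.

```shell
# fasta_names FILE: print the sequence name from each FASTA header line.
fasta_names() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# missing_in_ref NAMES_FILE FASTA: print names listed in NAMES_FILE
# (one per line) that do not occur as a sequence name in FASTA.
missing_in_ref() {
    local tmp
    tmp=$(mktemp)
    fasta_names "$2" | sort -u > "$tmp"
    sort -u "$1" | comm -23 - "$tmp"
    rm -f "$tmp"
}

# The BAM names can be taken from the @SQ header lines, for example:
#   samtools view -H in.bam \
#     | awk -F'\t' '$1 == "@SQ" { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4) }' \
#     > bam_names.txt
#   missing_in_ref bam_names.txt genome.fa
```

If every BAM name is reported as missing, the two files most likely follow different naming conventions rather than different assemblies.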
Hi
I tried 'gtz file.gz -c' and got the output as a file (file.gz.gtz). With -c, the gtz program should print to stdout instead. Could you investigate this issue?
I used gtz version: PROFESSIONAL-3.0.0-V-2020-12-09 01:58:41.
Best regards,
Piroon
http://gtz.io/gtz_public_0.2.2k_centos_release.tgz
http://gtz.io/gtz_public_0.2.2k_ubuntu_release.tgz
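Can the two gtz versions above be used on production data from Hiseq2500, Hiseq3000, and Xten?
Hello, I would like to ask how gtz is integrated with the software in its ecosystem.
My guess:
In the source code of tools such as bwa and tophat, the parameters are modified so that when the raw data is parsed as gtz format, it is decompressed first. After decompression, the data is in the fastq(gz) format that the original software accepts.
Or are other parts modified as well?
Thanks~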
Open two terminals to the same server at the same time. If one of them is running gtz, the other cannot run it.
Error message: -bash: gtz: command not found
For example, suppose I am compressing a file A.fq.gz to A.fq.gz.gtz and the job is interrupted near the end without my noticing. The A.fq.gz.gtz file has already been created, and its size is close to the normal size, so I cannot judge from the file size whether it is complete. Is there a tool that tells me whether a compressed .gtz file is complete?
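The thread does not say whether gtz ships an integrity check. A generic workaround, sketched below, is to record success outside the archive: a marker file is written only when the compression command exits with status 0. This assumes the command returns a non-zero status on failure; the helper name and the ".ok" suffix are hypothetical.

```shell
# compress_with_marker OUT CMD...: run CMD and create "OUT.ok" only on a clean exit.
compress_with_marker() {
    local out="$1"
    shift
    rm -f "$out.ok"                  # drop any stale marker from an earlier run
    if "$@"; then
        touch "$out.ok"              # the marker exists only after status 0
    else
        echo "compression failed, $out may be incomplete" >&2
        return 1
    fi
}

# Example call, using the -o option that appears elsewhere on this page:
# compress_with_marker A.fq.gz.gtz gtz A.fq.gz -o A.fq.gz.gtz
```

A job that is killed or loses power never reaches the touch step, so a .gtz file without its marker should be treated as suspect.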
I'm testing gtz. It seems that gtz uses all the cores of my computer. I think it would be best if one can specify the number of threads used by gtz.
According to the README file in gtz, STAR should work with gtz-compressed files, but it actually does not.
We downloaded the latest version of gtz and compressed a pair of RNA-seq files as follows,
$ gtz RNAseq_R1.fastq.gz -o RNAseq_R1.fastq.gz.gtz
$ gtz RNAseq_R2.fastq.gz -o RNAseq_R2.fastq.gz.gtz
without reference.
Then we tried to use STAR directly with the command
$ STAR --genomeDir /path/to/star_index --readFilesIn RNAseq_R1.fastq.gz.gtz RNAseq_R2.fastq.gz.gtz --readFilesCommand gtz -r
But it did not work.
So, we would like to know which version of STAR is required for gtz-compressed files.
-- Jang-il Sohn
Running GTZ installer "https://gtz.io/gtz_latest.run" modifies user's .bashrc file. This intrusion into user's configuration happens silently, without asking, warning or notifying the user.
Few reasons why it's a bad idea:
Fortunately, the "&& source ~/.bashrc" in the installation instructions gives it away.
This happens as of GTZ version "PROFESSIONAL-2.1.3-V-2020-03-18 07:11:20"
Installed GTX.Zip Professional stops working after 6 months (not sure if this is exact, or approximate). Expired GTZ shows message:
Powered by GTXLab of Genetalks. (built in PROFESSIONAL-2.1.2-V-2019-11-13 01:02:13 )
Warning:Invalid certificate!
Warning:The expiration date is:20200511
Warning:Please update the program from https://github.com/Genetalks/gtz .If you would like to use an unrestricted version, please contact [email protected] .
Upon seeing this message, the user is expected to download and install the latest version. (It's also mentioned in the license).
Why is this a problem for any serious use of GTX.Zip Professional?
Therefore currently GTX.Zip Professional is not suitable for data analysis pipelines (i.e., in the industry).
Therefore GTX.Zip Professional is useless for reproducible science.
I don't know what benefits expiration brings to GTZ developers to make it worth rendering it useless for both science and industry. Perhaps it is intended to motivate the user to purchase a commercial license. This by itself is OK, the problem is that the current README.md never mentions the expiration.
To avoid misleading the users, freely downloadable GTX.Zip Professional must be clearly marked as "trial" and "6 month expiration" should be prominently mentioned in the README.md.
Here is something I noticed: gtz makes full use of multiple CPU cores when compressing and decompressing gz files.
So, as the title says, is there a plan to add a function to gtz that decompresses a gz file to fastq using multiple cores?
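I used the professional edition to compress the provided sample.fq file with a FASTA reference. The compressed file is about 10% of the original size, not the 2% stated in the introduction. What do I need to do to reach the best compression ratio?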
$ gtz sample_bak.fq --ref GCF_000001405.37_GRCh38.p11_genomic.fna.gz
......
$ ls -l
-rw-r--r-- 1 charles charles 2183810290 6月 28 20:25 sample_bak.fq
-rw-r--r-- 1 charles charles 233007202 6月 29 10:35 sample_bak.fq.gtz
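For reference, the ratio implied by the two sizes in the listing above can be computed directly. The helper name ratio_percent is illustrative.

```shell
# ratio_percent COMPRESSED ORIGINAL: compressed size as a percentage of the original.
ratio_percent() {
    awk -v c="$1" -v o="$2" 'BEGIN { printf "%.2f\n", 100 * c / o }'
}

ratio_percent 233007202 2183810290   # prints 10.67
```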
I tried the following command, but it didn't work:
zcat sample_*_R1_*.fastq.gz | ./gtz -o sample_R1.fastq.gtz
output:
Powered by GTXLab of Genetalks. (built in PUBLIC-1.0-V-2018-08-06 02:48:41 )
Compressor initializing ...
nothing to compress.
Version: PROFESSIONAL-3.0.0-V-2020-11-30 03:40:16
Data: 70:/nfs/test-bug-lib/gtz/G_3_201201_0001/ERR1993376.1.fastq.gz
ref: GCA_900618405.1_N_brasiliensis_RM07_v1_5_4_0011_upd_genomic.fna.gz
Error: segmentation fault
AWS Beijing region instance type: c5.large
Data compressed with an older version: ERR1288730.fastq.gz.gtz
Decompression failure log:
ubuntu@ip-10-0-2-199:~/data/user1/data/data$ gtz -d ERR1288730.fastq.gz.gtz -z
Powered by GTXLab of Genetalks. (built in PROFESSIONAL-2.1.4-V-2020-06-30 01:41:41 )
Decompressor initializing ...
(1/1) ERR1288730.fastq.gz decompressing ...
Segmentation fault (core dumped)
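1. The "Operation Manual" link in the left navigation bar points to the wrong path. It should point to: https://gtz.io/#/compress
2. The operation manual is hard to use; the steps need to be simplified.
3. The wording of "1.8 Do not pack fasta files" in the operation manual is too colloquial.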
I'm trying to compress a fastq file with gtz. However, I got three completely different outcomes just by changing how the output is specified.
The first one was a success!
cat DMS_273.2_1.fastq | gtz -o ./G20481.DMS_273.2_1.fastq.gtz
Powered by GTXLab of Genetalks.
Compressor initializing ...
compressing ...
id: 375442529 / 3183320019
base: 819648004 / 8518710584
quality: 2158646143 / 8518710584
() source/compressed : 20468858971/3353746175. ratio : 16.385%
The cost time of compressing () is 00:06:11 (hh::mm:ss)
Compress finished, the total cost time is 00:06:11 (hh:mm:ss)
real 6m14.742s
user 134m18.192s
sys 1m8.160s
####################
This one caused an immediate error
cat DMS_273.2_1.fastq | gtz -o G20481.DMS_273.2_1.fastq.gtz
Powered by GTXLab of Genetalks.
Compressor initializing ...
gtz: line 8: 47524 Segmentation fault (core dumped)
real 0m0.198s
user 0m0.004s
sys 0m0.000s
#####################
This one was aborted
cat DMS_273.2_1.fastq | gtz -c > G20481.DMS_273.2_1.fastq.gtz
Powered by GTXLab of Genetalks.
Compressor initializing ...
compressing ...
id: 375442529 / 3183320019
base: 819648212 / 8518710584
quality: 2158646143 / 8518710584
() source/compressed : 20468858971/3353746396. ratio : 16.385%
The cost time of compressing () is 00:07:01 (hh::mm:ss)
Compress finished, the total cost time is 00:07:02 (hh:mm:ss)
terminate called without an active exception
gtz: line 8: 23992 Aborted (core dumped)
real 7m6.006s
user 136m5.360s
sys 1m6.200s
#############
This is the result with pigz:
cat DMS_273.2_1.fastq | pigz -p 4 -c > G20481.DMS_273.2_1.fastq.gz
real 6m4.838s
user 23m46.496s
sys 0m29.420s
#############
These are the final files:
3.2G Apr 4 19:55 G20481.DMS_273.2_1.fastq.gtz
5.1G Apr 4 19:36 G20481.DMS_273.2_1.fastq.gz
The compression ratio is very good.
hi
I'm testing the gtz program.
While running the program, a failure occurred due to a broken pipe.
bwa-gtz mem -t 12 -M -R "@rg\tPL:Illumina\tID:A00552\tSM:NA12878\tLB:NA12878" genome.fa NA12878_R1.fastq.gtz NA12878_R2.fastq.gtz | samtools view -bSu - | samtools sort -@ 22 -m 8G - /gtz_test/NA12878_gtz_for_bwa_use_ref_Thead12.sorted
The fault log is as follows.
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (22, 336272, 211, 7)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (177, 235, 305)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 561)
[M::mem_pestat] mean and std.dev: (246.27, 108.03)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 689)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (335, 386, 447)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (111, 671)
[M::mem_pestat] mean and std.dev: (392.79, 84.98)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 783)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (18, 37, 70)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 174)
[M::mem_pestat] mean and std.dev: (45.20, 36.32)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 226)
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RF
[M::mem_process_seqs] Processed 794702 reads in 386.185 CPU sec, 31.998 real sec
qsub_script_noSGE.sh: line 7: 46647 Done(141) bwa-gtz mem -t 12 -M -R "@rg\tPL:Illumina\tID:A00552\tSM:NA12878\tLB:NA12878" genome.fa NA12878_R1.fastq.gtz NA12878_R2.fastq.gtz
46648 Broken pipe | samtools view -bSu -
46649 Killed | samtools sort -@ 22 -m 8G - /gtz_test/NA12878_gtz_for_bwa_use_ref_nosge_Thead1.sorted
If you know how to solve it, I would appreciate it if you let me know.
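One possible reading of the log above, offered as an assumption rather than a diagnosis: samtools sort was killed first, which would then produce the broken pipe in samtools view and the exit status 141 (128 + SIGPIPE) in bwa-gtz. In samtools sort, -m sets the memory per thread, so "-@ 22 -m 8G" can ask for far more than 8G in total. The hypothetical helper below just does that arithmetic.

```shell
# sort_mem_total THREADS PER_THREAD_GIB: rough upper bound, in GiB, on the sort
# buffers when the -m value applies to each thread (helper name is illustrative).
sort_mem_total() {
    echo $(( $1 * $2 ))
}

sort_mem_total 22 8   # prints 176
```

If the machine has less memory than that, lowering -m or -@ is a cheap experiment before suspecting bwa-gtz itself.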
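For different genome versions of the same species, such as mouse mm9 and mm10, do I need to build a separate index for each version to compress and decompress the corresponding files? The genome FASTA files differ between versions, so how do I convert between versions?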
Hi,
I tried to compress a bam file with the following code:
~/software/gtz/GTX.Zip/gtz CO43.bam
But I encountered this issue:
Edition
License expiration time: 2024-06-02 19:45:43
Start compression: 1 of 1
FileName: CO43.bam, CompressType: bam, Threads: 96, Verify: No
[ERROR] Catch signal 11, clear and exit
[ERROR] Exit with exception, remove temp file
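Hello, I am currently using gtz to compress raw data, specifically wheat data. When the file is verified after compression, gtz occasionally hangs at the verification step (so far seen only in v1.2.2, twice). top shows the gtz process still running, but the verification step has been going for more than 1700 min, so I can only kill the process manually.
I cannot yet tell what the problem is, because a second compression attempt passed. For now I am just reporting the problem I ran into.
gtz gets stuck, even after several retries. Am I doing something wrong?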
Here are the details:
I ran the following command
$ nohup /home/XX/.config/GTZ/gtz possorted_genome_bam.bam --ref ../ref/cellranger_custom_hg38_ref_with_full_car_and_5utr.tar.gz &
It got stuck after the percentage had been increasing, with this message (taken from the nohup.out file):
first time use this Fasta, need convert it to binary... 72%
I have tried deleting the output .gtz file and starting again several times, but it still gets stuck.
This is what the working directory looks like:
XX@HH:/mnt/disks/sdb/bamfile$ ls -l
total 273824532
-rw------- 1 XX XX 1840055 Jul 15 19:54 nohup.out
-rw-r--r-- 1 XX XX 280394439771 May 19 00:20 possorted_genome_bam.bam
-rw-rw-r-- 1 XX XX 108 Jul 15 19:45 possorted_genome_bam.bam.gtz
XX@HH:/mnt/disks/sdb/bamfile$ ls -l ../ref
total 11120380
-rw-rw-r-- 1 XX XX 11387261138 Jul 15 07:50 cellranger_custom_hg38_ref_with_full_car_and_5utr.tar.gz
I have a ppc64le-based CPU. Is the source code or a compiled executable available for ppc64le?
I accidentally compressed BAM and fq files together, and now I cannot recover, view, or use the BAM file. Is there any way to fix this?
Hi,
I always get the following errors when trying to run on *.fq.gtz files:
"ERROR:Error: test.1.fq.gtz format error!(magic num error)
Started analysis of test.1.fq.gtz
Please waiting...
Analysis complete for test.1.fq.gtz
Failed to process file test.1.fq.gtz
java.lang.ArrayIndexOutOfBoundsException: -1
at uk.ac.babraham.FastQC.Modules.SequenceLengthDistribution.calculateDistribution(SequenceLengthDistribution.java:101)
at uk.ac.babraham.FastQC.Modules.SequenceLengthDistribution.raisesError(SequenceLengthDistribution.java:190)
at uk.ac.babraham.FastQC.Report.HTMLReportArchive.startDocument(HTMLReportArchive.java:336)
at uk.ac.babraham.FastQC.Report.HTMLReportArchive.<init>(HTMLReportArchive.java:84)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.analysisComplete(OfflineRunner.java:178)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:110)
at java.lang.Thread.run(Thread.java:748)
gtz itself is able to decompress the gtz files.
Any solutions? Thank you.
Jim
Same scenario as issue #20: the certificate has expired for V3.0.1, while no newer version is available...
This tool has very serious errors and perhaps should not have been released yet. I compressed my datasets with it and deleted the raw files, but the tool cannot decompress its own output. Fortunately, I have only used public datasets so far. I cannot imagine what would happen if it were applied to your own sequencing datasets without any other backup.
Here are the accession numbers of the libraries used:
ERR1276823
ERR1276808
ERR1276794
ERR1276796
ERR1276778
ERR1276771
ERR1276761
ERR1276759
ERR1276757
ERR1276743
ERR1276741
Almost all of them produce outputs of either zero or 250Mb.
I hope you can fix these problems.
Thanks for all your efforts in developing this tool.
Hello, I'm interested in creating a Bioconda recipe for GTZ, but I can't quite tell if the license permits this.
I noticed the license is very similar to a BSD-2 license, just without the permission to redistribute with modification. Would creating a Bioconda recipe which takes the gtz binary and throws away the lib directory that comes packaged with it, and instead installs python 2.7 through Conda, be considered modification and thus not be permitted?
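I previously used gtz to compress 20 TB of pig data, with a compression ratio of 16% (28% for gz). I would like to try the current high-ratio compression mode. Would it be convenient for your team to add a pig index? Thanks!
The pig genome can be downloaded from: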
ftp://ftp.ensembl.org/pub/release-94/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz