genetalks / gtz

170.0 17.0 39.0 1.57 GB

A high performance and compression ratio compressor for genomic data, powered by GTXLab of Genetalks.

License: Other

compression-rate genomic-data-compression fastq-compression gtx pacbio nanopore compress compression gene fastq

gtz's Issues

When will the program run on Arm architecture v8?

gtz is fantastic compression software, but I wonder when it will run on Arm architecture v8/v9?

fatal error:invalid gz file,exit

I hit this error when decompressing an ordinary gz file. The gz file itself is valid, just fairly large (25 GB). Please investigate.

$ ls -sh /resdata/SCT_DATA/data_deliver/JY_combined_R2.fastq.gz
25G /resdata/SCT_DATA/data_deliver/JY_combined_R2.fastq.gz
$ zcat /resdata/SCT_DATA/data_deliver/JY_combined_R2.fastq.gz | head
@E00552:208:HNJFYCCXY:8:1101:2351:1379 2:N:0:NGAGGCTG
NATCAGAATGAGCTGGTGGGAACCTTGGGCAGCCAAACGGAGCGGCGTTCTGCACCATGTCCCATCCAGTGCTGCGAATCCACGCCCCGCAGCCCTGCCCCCCCGCGACAGCTCACACCATGGCTCGAGGACAAGGTGTTATCCCGACAC
+
#---<A-FJF-F-7--7-7-7-7F<F-<J---7AAAF-7---A-F--7-7-7-<-7A-7AJJF-AA<7-7-7-A--7---7<A---7--7A<7-7---7-7F7-)-)7)7-A<)--AF----))-)-))))--<-----7---7-)))))
@E00552:208:HNJFYCCXY:8:1101:2392:1379 2:N:0:NGAGGCTG
NCTCATCCCAGCAGGCCCTCCCTTAGCTGAGGGAATTCTTTTTCCCCTCCCTCCACCGACAAATATTGACAGGCACCCACCGAGGATGTGCAGAGCTCAGCCGCGGCTGCGGGGACTCAATTTGCAACAGACATGGACTCCCCCCTCACG
+
#<-A-----7----7----7F---<-7-<--77-<--A<--7-<7<A--A7-7A7AJ<----7---7<--<FF-77-7-<J-A-77-7FAA----7-7A--<-)-)))-))))))-)7---------7------7--7-)))))))-)))
@E00552:208:HNJFYCCXY:8:1101:2514:1379 2:N:0:NGAGGCTG
NAAGAATTTCAAAGCCTTCGCTAGTCTCCGTATGGCCCGTGCCAACGCCCGGCTCTTCGGCATACGGGCAAAAAGAGCCAAGGAAGCCGCAGAACAGGATGTTGAAAAGAAAAAATAAAGCCCTCCTGGGGACTTGGAATCAAAAAAAAA

Can you add a new feature?

If I have many fq.gz files in one directory, I would like to be able to use this command:
gtz *.gz --ref /data.genome.fa
Of course, I can also write a shell for loop to achieve the same thing.
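
The loop mentioned in this request could look like the sketch below. It assumes only the `gtz <file> --ref <fasta>` call form quoted in this issue; the function name `compress_all` and the `GTZ_BIN` override variable are hypothetical names introduced for illustration.

```shell
# Compress every .gz file in a directory, one gtz call per file.
# GTZ_BIN (hypothetical) lets you point at a specific gtz binary.
compress_all() {
    local dir="$1" ref="$2" f
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue   # skip when the glob matches nothing
        "${GTZ_BIN:-gtz}" "$f" --ref "$ref" || echo "failed: $f" >&2
    done
}

# Usage: compress_all . /data.genome.fa
```

Running one file per call also means a failure on one input is reported without stopping the remaining files.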

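BAM files produced by the default hisat2 pipeline cannot be compressed with gtz

Reference file: genome.fa (hg19)
Download URL: https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz

Reference used for the hisat2 run: hg19 genome_tran
Download URL: https://ccb.jhu.edu/software/hisat2/manual.shtml

The BAM file was generated with the default hisat2 pipeline (not hisat2-gtz). Compressing it with gtz fails with the following error: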


prepare compression... 100%, cost 23s (18|5)
Enabling high-rate compression mode with /home/.config/gtz/genome.fa-D2A70550489DE356A2CD6BFC40711204.bam.rbin2( hardware speedup )
[                                                  ] 0%

RNAME "1" was not found in reference file: /home/Erythropoiesis_APA/HSC_APA/hsc_apa1/04-salmon/00-file/genome.fa
error: the reference file was detected to not match this BAM file!
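
One plausible cause of this error, not confirmed anywhere in this thread, is a sequence-naming mismatch: the BAM header names a sequence `1`, while UCSC-style FASTA files typically use names such as `chr1`. A rough way to check is to compare the `@SQ` names in the BAM header with the FASTA record names. The sketch assumes `samtools` is installed; the helper names `fasta_names`, `bam_names`, and `missing_in_fasta` are hypothetical.

```shell
# Print the sequence names of a FASTA file (text after '>' up to the first space).
fasta_names() {
    grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*//'
}

# Print the @SQ sequence names from a BAM header (needs samtools).
bam_names() {
    samtools view -H "$1" |
        awk -F'\t' '$1 == "@SQ" { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4) }'
}

# List BAM sequence names that have no exact match in the FASTA file.
missing_in_fasta() {
    local names
    names=$(mktemp)
    fasta_names "$2" > "$names"
    bam_names "$1" | grep -Fxv -f "$names"
    rm -f "$names"
}

# Usage: missing_in_fasta sample.bam genome.fa
```

If the last command prints names such as `1`, the BAM was aligned against a reference with different sequence names than the FASTA passed to `--ref`.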

Hi
I ran 'gtz file.gz -c' and got the output as a file (file.gz.gtz), whereas the gtz program should have written to stdout. Could you look into this issue?
I used gtz version: PROFESSIONAL-3.0.0-V-2020-12-09 01:58:41.

Best regards,
Piroon

Can gtz_public_0.2.2k be used in a production environment?

http://gtz.io/gtz_public_0.2.2k_centos_release.tgz
http://gtz.io/gtz_public_0.2.2k_ubuntu_release.tgz

Can the two gtz versions above be used on production data from Hiseq2500, Hiseq3000, and Xten?

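Technical question about ecosystem software

Hello, I would like to ask how gtz is integrated with its ecosystem software.

My guess:
In the source code of tools such as bwa and tophat, parameters are modified so that when the raw input is in gtz format, it is decompressed first. After decompression, the data is in the fastq(.gz) format that the original software accepts.
Or are other parts modified as well?
Thanks.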

Urgent error that needs to be fixed

When two terminals are open on the same server at the same time and one of them is running gtz, the other cannot run it.
Error message: -bash: gtz: command not found

How can I tell whether a compressed file is complete?

For example, suppose I am compressing A.fq.gz to A.fq.gz.gtz and the job is interrupted near the end without my noticing. A.fq.gz.gtz has still been created, and its size is already close to that of a finished file, so I cannot judge from the file size whether it is complete. Is there a tool that tells me whether the compressed file is complete?
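
Independent of any check built into gtz, a common shell pattern is to write to a temporary name and rename only when the compressor exits successfully, so that a file with the final name is always a finished one. The sketch below assumes that gtz returns a non-zero exit status on failure and that `-o` accepts an arbitrary output name; neither is confirmed in this thread. `safe_compress` and `GTZ_BIN` are hypothetical names.

```shell
# Compress into "<dst>.partial" and rename to <dst> only on success.
safe_compress() {
    local src="$1" dst="$2" tmp="$2.partial"
    if "${GTZ_BIN:-gtz}" "$src" -o "$tmp"; then
        mv -- "$tmp" "$dst"
    else
        rm -f -- "$tmp"
        echo "compression of $src failed; $dst not created" >&2
        return 1
    fi
}

# Usage: safe_compress A.fq.gz A.fq.gz.gtz
```

This does not detect a job that is killed outright (for example by a power loss), but in that case the leftover file carries the `.partial` suffix rather than the final name.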

gtz uses all the available cores

I'm testing gtz. It seems that gtz uses all the cores of my computer. I think it would be best if one can specify the number of threads used by gtz.

STAR with gtz is not working

According to the README file in gtz, STAR should work with gtz-compressed files, but it actually does not.

We downloaded the latest version of gtz and compressed a pair of RNA-seq files as follows,
$ gtz RNAseq_R1.fastq.gz -o RNAseq_R1.fastq.gz.gtz
$ gtz RNAseq_R2.fastq.gz -o RNAseq_R2.fastq.gz.gtz
without reference.

Then we tried to use STAR directly with the command
$ STAR --genomeDir /path/to/star_index --readFilesIn RNAseq_R1.fastq.gz.gtz RNAseq_R2.fastq.gz.gtz --readFilesCommand gtz -r
But it did not work.
So we would like to know which version of STAR is required for gtz-compressed files.

-- Jang-il Sohn

Installer modifies user's .bashrc without asking or warning the user

Running GTZ installer "https://gtz.io/gtz_latest.run" modifies user's .bashrc file. This intrusion into user's configuration happens silently, without asking, warning or notifying the user.

A few reasons why it's a bad idea:

It's extremely rude to silently mess with user's configuration files.
User may prefer other shell such as zsh (please don't get any ideas).
User may prefer to run gtz using full path, or configure their own aliases or symlinks.
User may already have another command with the same name.

Fortunately && source ~/.bashrc in installation instructions gives it away.

This happens as of GTZ version "PROFESSIONAL-2.1.3-V-2020-03-18 07:11:20"

GTX.Zip Professional stops working after 6 months

Installed GTX.Zip Professional stops working after 6 months (not sure if this is exact, or approximate). Expired GTZ shows message:

Powered by GTXLab of Genetalks. (built in PROFESSIONAL-2.1.2-V-2019-11-13 01:02:13 )
Warning:Invalid certificate!
Warning:The expiration date is:20200511
Warning:Please update the program from https://github.com/Genetalks/gtz .If you would like to use an unrestricted version, please contact [email protected] .

Upon seeing this message, the user is expected to download and install the latest version. (It's also mentioned in the license).

Why is this a problem for any serious use of GTX.Zip Professional?

Someone may use GTX.Zip Professional as part of data analysis system. Such system (possibly consisting of dozens of software tools) is tested and deployed in production environment. It's easy to miss the expiration message in EULA when constructing such system. Then in a few months the system suddenly stops working. By this time the people who designed the system may be unavailable, and the cost of investigating, fixing, and downtime may be high.

Therefore currently GTX.Zip Professional is not suitable for data analysis pipelines (i.e., in the industry).

Many journals require reproducible data analysis protocols when publishing results. A data analysis protocol must include versions of all software used. However, the exact version of GTZ used in the protocol will be unavailable for download by the time the paper is out. Even if a reader has the same version, it will expire and stop working by the time the paper is out.

Therefore GTX.Zip Professional is useless for reproducible science.

I don't know what benefits expiration brings to GTZ developers to make it worth rendering it useless for both science and industry. Perhaps it is intended to motivate the user to purchase a commercial license. This by itself is OK, the problem is that the current README.md never mentions the expiration.

To avoid misleading the users, freely downloadable GTX.Zip Professional must be clearly marked as "trial" and "6 month expiration" should be prominently mentioned in the README.md.

Suggestion: a parameter to decompress a gz file directly to fastq

Here are some things I noticed:

gtz makes full use of multiple CPU cores when compressing and decompressing
there is a --gz parameter

So, as the title says: is there a plan to add a function to gtz that decompresses a gz file to fastq using multiple cores?

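Question about compression ratio

I compressed the provided sample.fq file with the Professional edition, using a fasta reference. The compressed file is about 10% of the original size, not the 2% stated in the introduction. What do I need to do to reach the best compression ratio?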

$ gtz sample_bak.fq --ref GCF_000001405.37_GRCh38.p11_genomic.fna.gz
......
$ ls -l
-rw-r--r-- 1 charles charles 2183810290  6月 28 20:25 sample_bak.fq
-rw-r--r-- 1 charles charles  233007202  6月 29 10:35 sample_bak.fq.gtz
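
For reference, the ratio can be computed directly from the two byte counts in the listing. The helper name `ratio_pct` is hypothetical; the numbers are the ones shown in this issue.

```shell
# Print compressed size as a percentage of the source size.
ratio_pct() {
    awk -v c="$1" -v s="$2" 'BEGIN { printf "%.2f%%\n", 100 * c / s }'
}

ratio_pct 233007202 2183810290   # prints 10.67%
```

This matches the "about 10%" described in the issue.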

how to use standard input?

I tried the following command, but it didn't work:

zcat sample_*_R1_*.fastq.gz | ./gtz -o sample_R1.fastq.gtz

output:

Powered by GTXLab of Genetalks. (built in PUBLIC-1.0-V-2018-08-06 02:48:41 )
Compressor initializing ... 
nothing to compress.

High-ratio compression fails on ERR1993376.1, a set of rat third-generation sequencing data from NCBI

Version: PROFESSIONAL-3.0.0-V-2020-11-30 03:40:16
Data: 70:/nfs/test-bug-lib/gtz/G_3_201201_0001/ERR1993376.1.fastq.gz
Ref: GCA_900618405.1_N_brasiliensis_RM07_v1_5_4_0011_upd_genomic.fna.gz
Error: segmentation fault

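Old-version gtz data that requires downloading an rbin fails to decompress on an AWS Beijing-region machine with Segmentation fault (core dumped)

AWS Beijing-region instance type: c5.large
Old-version compressed data: ERR1288730.fastq.gz.gtz
Decompression failure log: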

ubuntu@ip-10-0-2-199:~/data/user1/data/data$ gtz -d ERR1288730.fastq.gz.gtz -z
Powered by GTXLab of Genetalks. (built in PROFESSIONAL-2.1.4-V-2020-06-30 01:41:41 )
Decompressor initializing ... 
(1/1) ERR1288730.fastq.gz decompressing ...
Segmentation fault (core dumped)

fastqc-gtz couldn't be installed without root permission

I tried to install fastqc-gtz on our school's cluster, since I don't have the root permission, I need to install it in my home dir, but it got the error.

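Level 2: GTZ.IO web page improvements

1. The "User Manual" entry in the left navigation bar points to the wrong path; it should point to: https://gtz.io/#/compress
2. The user manual is too hard to use; the steps need to be simplified
3. The wording of "1.8 Do not pack fasta files" in the user manual is too colloquial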

weird gtz behavior

I'm trying to compress a fastq file with gtz. However, I got three completely different outcomes just by changing how the output is specified.

The first one was a success!
cat DMS_273.2_1.fastq | gtz -o ./G20481.DMS_273.2_1.fastq.gtz
Powered by GTXLab of Genetalks.
Compressor initializing ...
compressing ...
id: 375442529 / 3183320019
base: 819648004 / 8518710584
quality: 2158646143 / 8518710584
() source/compressed : 20468858971/3353746175. ratio : 16.385%
The cost time of compressing () is 00:06:11 (hh::mm:ss)

Compress finished, the total cost time is 00:06:11 (hh:mm:ss)

real 6m14.742s
user 134m18.192s
sys 1m8.160s
####################
This one caused an immediate error
cat DMS_273.2_1.fastq | gtz -o G20481.DMS_273.2_1.fastq.gtz
Powered by GTXLab of Genetalks.
Compressor initializing ...
gtz: line 8: 47524 Segmentation fault (core dumped) $basepath/_gtz $@

real 0m0.198s
user 0m0.004s
sys 0m0.000s
#####################
This one was aborted
cat DMS_273.2_1.fastq | gtz -c > G20481.DMS_273.2_1.fastq.gtz
Powered by GTXLab of Genetalks.
Compressor initializing ...
compressing ...
id: 375442529 / 3183320019
base: 819648212 / 8518710584
quality: 2158646143 / 8518710584
() source/compressed : 20468858971/3353746396. ratio : 16.385%
The cost time of compressing () is 00:07:01 (hh::mm:ss)

Compress finished, the total cost time is 00:07:02 (hh:mm:ss)
terminate called without an active exception
gtz: line 8: 23992 Aborted (core dumped) $basepath/_gtz $@

real 7m6.006s
user 136m5.360s
sys 1m6.200s

#############
This is the status by pigz:
cat DMS_273.2_1.fastq | pigz -p 4 -c > G20481.DMS_273.2_1.fastq.gz

real 6m4.838s
user 23m46.496s
sys 0m29.420s

#############
This is the final file:
3.2G Apr 4 19:55 G20481.DMS_273.2_1.fastq.gtz
5.1G Apr 4 19:36 G20481.DMS_273.2_1.fastq.gz

The compression ratio is very good.

broken pipe issue

hi
I'm testing the gtz program.

While the program was running, a failure occurred due to a broken pipe.

bwa-gtz mem -t 12 -M -R "@rg\tPL:Illumina\tID:A00552\tSM:NA12878\tLB:NA12878" genome.fa NA12878_R1.fastq.gtz NA12878_R2.fastq.gtz | samtools view -bSu - | samtools sort -@ 22 -m 8G - /gtz_test/NA12878_gtz_for_bwa_use_ref_Thead12.sorted

The fault log is as follows.

[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (22, 336272, 211, 7)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (177, 235, 305)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 561)
[M::mem_pestat] mean and std.dev: (246.27, 108.03)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 689)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (335, 386, 447)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (111, 671)
[M::mem_pestat] mean and std.dev: (392.79, 84.98)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 783)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (18, 37, 70)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 174)
[M::mem_pestat] mean and std.dev: (45.20, 36.32)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 226)
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_pestat] skip orientation FF
[M::mem_pestat] skip orientation RF
[M::mem_process_seqs] Processed 794702 reads in 386.185 CPU sec, 31.998 real sec
qsub_script_noSGE.sh: line 7: 46647 Done(141) bwa-gtz mem -t 12 -M -R "@rg\tPL:Illumina\tID:A00552\tSM:NA12878\tLB:NA12878" genome.fa NA12878_R1.fastq.gtz NA12878_R2.fastq.gtz
46648 Broken pipe | samtools view -bSu -
46649 Killed | samtools sort -@ 22 -m 8G - /gtz_test/NA12878_gtz_for_bwa_use_ref_nosge_Thead1.sorted

If you know how to solve it, I would appreciate it if you let me know.
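
One possible reading of this log, offered as an assumption rather than a diagnosis: `samtools sort` is reported as `Killed`, which commonly happens when the kernel's out-of-memory killer stops a process, and the upstream stages then end with a broken pipe (exit status 141 is 128 + SIGPIPE). In samtools, `-m` sets the memory limit per thread, so `-@ 22 -m 8G` can ask for far more than 8 GB in total. The arithmetic, using a hypothetical helper name:

```shell
# Upper bound on sort buffer memory: threads x per-thread limit (in GiB).
sort_mem_gib() {
    echo $(( $1 * $2 ))
}

sort_mem_gib 22 8   # prints 176
```

If the machine has less memory than that, lowering `-m` or `-@` would be the first thing to try.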

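Do different genome versions of the same species each need their own index?

For different genome versions of the same species, such as mouse mm9 and mm10, do I need to build a separate index for each version in order to compress and decompress the corresponding files? The genome fasta files differ between versions, so how do I convert between versions?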

Compress bam file

Hi,

I tried to compress a bam file with the following code:

~/software/gtz/GTX.Zip/gtz CO43.bam

But I encountered this issue:

Powered by GTXLab of Genetalks. (built in STANDARD-v4.0.3 build time: 14:49:47 Sep 13 2022 license: v1.1.1 build 09132022.021212)

Edition
License expiration time: 2024-06-02 19:45:43

Compression capacity limit:
Total: 1 TB
Used: 1 GB 577.39 MB
Remaining: 1022 GB 446.61 MB

Start compression: 1 of 1
FileName: CO43.bam, CompressType: bam, Threads: 96, Verify: No

[ERROR] Catch signal 11, clear and exit
[ERROR] Exit with exception, remove temp file

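gtz v1.2.2 hangs during verification after compression in bin mode

Hello, I am using gtz to compress raw data. When compressing wheat data, the verification step that runs after compression occasionally hangs (so far seen only in v1.2.2, twice). top shows the gtz process still running, but the verification step had been going for more than 1700 min, so I had to kill the process manually.
I cannot tell yet what the problem is, because the file passed when I compressed it again. For now I am just reporting what I ran into.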

GTZ stuck on .bam file after 72%

gtz gets stuck, and retrying several times did not help. Am I doing something wrong?
Here are the details:

I ran the following command
$ nohup /home/XX/.config/GTZ/gtz possorted_genome_bam.bam --ref ../ref/cellranger_custom_hg38_ref_with_full_car_and_5utr.tar.gz &

It got stuck after the percentage had been increasing, with this message (taken from the nohup.out file):

first time use this Fasta, need convert it to binary... 72%

I have deleted the output .gtz file and started again several times, but it still gets stuck.
This is what the working directory looks like:

XX@HH:/mnt/disks/sdb/bamfile$ ls -l
total 273824532
-rw------- 1 XX XX 1840055 Jul 15 19:54 nohup.out
-rw-r--r-- 1 XX XX 280394439771 May 19 00:20 possorted_genome_bam.bam
-rw-rw-r-- 1 XX XX 108 Jul 15 19:45 possorted_genome_bam.bam.gtz
XX@HH:/mnt/disks/sdb/bamfile$ ls -l ../ref
total 11120380
-rw-rw-r-- 1 XX XX 11387261138 Jul 15 07:50 cellranger_custom_hg38_ref_with_full_car_and_5utr.tar.gz

Can I use it on a PowerPC-based CPU?

I have a ppc64le-based CPU. Is the source code or a compiled executable available for ppc64le?

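Cannot restore a bam file after compression

I accidentally compressed bam and fq files together, and now I cannot restore, view, or use the bam file. Is there any way to fix this?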

fastqc-gtz error

Hi,

I always get the following errors when running on *.fq.gtz files:

"ERROR:Error: test.1.fq.gtz format error!(magic num error)
Started analysis of test.1.fq.gtz
Please waiting...
Analysis complete for test.1.fq.gtz
Failed to process file test.1.fq.gtz
java.lang.ArrayIndexOutOfBoundsException: -1
at uk.ac.babraham.FastQC.Modules.SequenceLengthDistribution.calculateDistribution(SequenceLengthDistribution.java:101)
at uk.ac.babraham.FastQC.Modules.SequenceLengthDistribution.raisesError(SequenceLengthDistribution.java:190)
at uk.ac.babraham.FastQC.Report.HTMLReportArchive.startDocument(HTMLReportArchive.java:336)
at uk.ac.babraham.FastQC.Report.HTMLReportArchive.(HTMLReportArchive.java:84)
at uk.ac.babraham.FastQC.Analysis.OfflineRunner.analysisComplete(OfflineRunner.java:178)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:110)
at java.lang.Thread.run(Thread.java:748)

gtz itself is able to decompress these gtz files.

Any solutions? Thank you.

Jim

Certificate Expires

Same scenario as issue #20: the certificate for V3.0.1 has expired, and no newer version is available.

GTZ failed during decompression process

This tool has very serious errors and may not be ready for release. I compressed my datasets with it and deleted the raw files, but the tool cannot decompress its own output. Fortunately, I have only used public datasets so far. I cannot imagine what would happen if it were applied to your own sequencing datasets without any other backup.

Here are the accession numbers of the libraries used:
ERR1276823
ERR1276808
ERR1276794
ERR1276796
ERR1276778
ERR1276771
ERR1276761
ERR1276759
ERR1276757
ERR1276743
ERR1276741
Almost all of them produce outputs of zero or 250Mb.

I hope you can fix these problems.

Thanks for all your efforts of developing this tool.

Permitted to create Bioconda recipe?

Hello, I'm interested in creating a Bioconda recipe for GTZ, but I can't quite tell if the license permits this.

I noticed the license is very similar to a BSD-2 license, just without the permission to redistribute with modification. Would creating a Bioconda recipe which takes the gtz binary and throws away the lib directory that comes packaged with it, and instead installs python 2.7 through Conda be considered modification and thus not be permitted?

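Could you add an index for pig (Sus scrofa)?

I previously used gtz to compress 20T of pig data and got a compression ratio of 16% (gz gave 28%). I would like to try the current high-ratio compression mode. Would your team be able to add a pig index? Thanks!
The pig genome can be downloaded from: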
ftp://ftp.ensembl.org/pub/release-94/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
