andreasheger / gat
Genomic Association Tester
Dear Andreas,
I have been happily using GAT (gat-run & gat-compare) over the past months - thanks a lot for it!
I recently updated all of my code so that I could keep a more informative string in the "track" column in my gat-run output. However, when I got to gat-compare I no longer got any results. Inspecting the code, it seems that around lines 264-270 of gat-compare the code assumes that the "track" column only contains the string "merged". This would be the case when using --ignore-segment-tracks, as I was doing previously, but with my more "informative" track strings it now blocks things.
Would it be possible to change the code so that, similar to what is done in the in-file-comparison part, the list of track values is established and iterated over, removing this hard-coded limitation?
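For illustration, here is a minimal sketch of collecting the track values from a tab-separated table and iterating over them rather than assuming a single "merged" track. It is not gat-compare's actual code: the function name `rows_by_track` and the three-column example table are hypothetical, and only the "track" column name comes from the gat-run output.

```python
import csv
import io
from collections import defaultdict


def rows_by_track(handle):
    """Group rows of a tab-separated table by the value in its 'track' column."""
    # Skip comment lines (gat-run prefixes log lines with '#').
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["track"]].append(row)
    return grouped


# Hypothetical input with two informative track names instead of "merged".
example = io.StringIO(
    "track\tannotation\tobserved\n"
    "peaks_A\tgene\t10\n"
    "peaks_A\tdust\t3\n"
    "peaks_B\tgene\t7\n"
)

grouped = rows_by_track(example)
for track in sorted(grouped):
    annotations = [row["annotation"] for row in grouped[track]]
    print(track, annotations)
```

With this approach the comparison would run once per track found in the file, so a file containing only "merged" behaves as before.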
Thank you in advance for your help!
Best regards,
-- Alex
Dear Andreas,
I've come across a minor issue while using GAT over the past months.
For some reason, if the ID field (4th column) of the Isochores BED file is set to ".", GAT fails with the following Python error:
Traceback (most recent call last):
File "/software/UHTS/Analysis/gat/1.2/bin/gat-run.py", line 335, in <module>
sys.exit(main(sys.argv))
File "/software/UHTS/Analysis/gat/1.2/bin/gat-run.py", line 313, in main
annotator_results = fromSegments(options, args)
File "/software/UHTS/Analysis/gat/1.2/bin/gat-run.py", line 242, in fromSegments
num_threads=options.num_threads)
File "/software/UHTS/Analysis/gat/1.2/lib64/python2.7/site-packages/gat/__init__.py", line 915, in run
track, counts, counters, segs, annotations, workspace, outfiles)
File "/software/UHTS/Analysis/gat/1.2/lib64/python2.7/site-packages/gat/__init__.py", line 622, in sample
contig_annotations.fromIsochores()
File "GatEngine.pyx", line 2979, in GatEngine.IntervalCollection.fromIsochores (GatEngine/GatEngine.c:36284)
File "GatEngine.pyx", line 2767, in GatEngine.IntervalDictionary.fromIsochores (GatEngine/GatEngine.c:32377)
ValueError: too many values to unpack (expected 2)
It runs fine if the "." is set to some other constant (e.g. "isoc" for isochore).
This is not the case for the workspace BED file: "ws" or "." both work just fine.
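The workaround described above (replacing "." in the 4th column with a constant) can be sketched as a small preprocessing step. This is only an illustration of the workaround, not a fix inside GAT; the function name `replace_dot_names` and the example lines are hypothetical, and "isoc" is the placeholder mentioned above.

```python
def replace_dot_names(lines, placeholder="isoc"):
    """Yield BED lines with a '.' in the 4th column replaced by `placeholder`."""
    for line in lines:
        # Pass header, comment and empty lines through unchanged.
        if line.startswith(("#", "track", "browser")) or not line.strip():
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[3] == ".":
            fields[3] = placeholder
        yield "\t".join(fields) + "\n"


# Hypothetical isochore lines: one with ".", one already named, one without a name column.
bed = ["chr1\t0\t1000\t.\n", "chr1\t1000\t2000\tisoc\n", "chr2\t0\t500\n"]
fixed = list(replace_dot_names(bed))
print("".join(fixed), end="")
```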
It actually took me a while to find the source of this problem, so I just thought I'd point it out here!
Best regards,
-- Alex
Hello,
I am having issues running gat-compare.py on my regions. The output log of gat-compare says that there are no shared annotations between my count files, even though I definitely used the same annotation bed file to generate both count files. I have attached the count files below. I have also attached the output log file.
This is the code I used:
gat-compare.py anchorGAT_1_nucleotide-overlap.counts.tsv.gz anchorGAT_2_nucleotide-overlap.counts.tsv.gz
This is how I generated the count files:
gat-run.py --num-threads=4 --num-samples=1000 -s /home/qiadu/GATdata/RT_TADs/anchors_GAT/anchorset_forGAT_1.bed -a /home/qiadu/GATdata/RT_TADs/anchors_GAT/con.early.bed -w /home/qiadu/GATdata/RT_TADs/merged.genome.bed --log=GAT_1.log --output-counts-pattern=anchorGAT_1_%s.counts.tsv.gz
gat-run.py --num-threads=4 --num-samples=1000 -s /home/qiadu/GATdata/RT_TADs/anchors_GAT/anchorset_forGAT_2.bed -a /home/qiadu/GATdata/RT_TADs/anchors_GAT/con.early.bed -w /home/qiadu/GATdata/RT_TADs/merged.genome.bed --log=GAT_2.log --output-counts-pattern=anchorGAT_2_%s.counts.tsv.gz
anchorGAT_1_nucleotide-overlap.counts.tsv.gz
anchorGAT_2_nucleotide-overlap.counts.tsv.gz
gat-compare_anchor_output.txt
Any help would be appreciated. I might just be missing something obvious. Thank you.
Cheers,
Qian
Hi,
Running GAT with just regions that are not assembly gaps finishes in a few seconds without large memory requirements. However, passing uniquely mappable regions as an additional workspace causes the program to crash due to insufficient memory. I have increased the memory available to GAT to tens of GB and it still exits due to memory requirements. (I run the following command on a cluster that allows a maximum of 100GB of memory, reserving at least 40GB.)
gat-run.py --segments=M2_pooled_filtered_peaks.bed \
--annotations=annotations.bed \
--workspace=contigs_ungapped.bed \
--workspace=MM10-mappability_50.bed \
--output-counts-pattern=M2_pooled_filtered_%s.overlap.counts.tsv.gz \
--num-threads 8 --num-samples=10000 --log=gat.log --pvalue-method=norm > M2_pooled_filtered_gat.tsv
I am not entirely convinced that I am retrieving the uniquely mappable regions correctly, so it would be great if you could summarise how you derived them from the UCSC mappability tracks. I am working with mm10, for which UCSC doesn't have mappability tracks, so I am using gem-mappability to produce them for 50bp reads. The output (which I convert to wig and then bed) gives me around 8 million intervals (probably the reason GAT crashes!) with a score of 1 (uniquely mappable). If I understand mappability tracks correctly, any k-mer starting within the reported interval is unique, so I extend the region's end by 50bp.
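As an illustration of the extension step described above, the following sketch extends each interval at its end and then merges intervals that overlap or touch, which may also reduce the number of intervals passed as a workspace. It makes several assumptions: intervals are half-open (contig, start, end) tuples as in BED, the extension is a parameter (50 here, following the description above), and no clipping to contig length is done. The function name `extend_and_merge` and the example coordinates are hypothetical.

```python
def extend_and_merge(intervals, extension):
    """Extend each (contig, start, end) interval at its end, then merge overlaps."""
    merged = []
    # Sorting by contig and start lets a single pass merge neighbours.
    for contig, start, end in sorted(intervals):
        end += extension
        if merged and merged[-1][0] == contig and start <= merged[-1][2]:
            # Overlaps (or touches) the previous interval: grow it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([contig, start, end])
    return [tuple(m) for m in merged]


# Hypothetical uniquely mappable start-position intervals.
unique = [("chr1", 100, 120), ("chr1", 150, 160), ("chr1", 400, 410)]
result = extend_and_merge(unique, 50)
print(result)
```

Whether merging brings 8 million intervals down far enough to avoid the memory problem is not something this sketch can answer; it only shows the transformation.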
Could you please provide the mapability_36.filtered.bed.gz file you use in the tutorials (https://media.readthedocs.org/pdf/gat/latest/gat.pdf) as an example? Have you done any tests on how big the workspace and annotations can be without memory requirements getting huge?
Many thanks for your help.
P.S.: M2_pooled_filtered_peaks.bed and annotations.bed are 43492 and 4707 intervals, respectively.
Whatever I do, however I install and reinstall gat, I can't launch gat-compare:
Traceback (most recent call last):
File "/home/adc34/anaconda2/bin/gat-compare.py", line 4, in <module>
__import__('pkg_resources').run_script('gat==1.3.5', 'gat-compare.py')
File "/home/adc34/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 661, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/adc34/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1441, in run_script
exec(code, namespace, namespace)
File "/home/adc34/anaconda2/lib/python2.7/site-packages/gat-1.3.5-py2.7-linux-x86_64.egg/EGG-INFO/scripts/gat-compare.py", line 98, in <module>
import GatEngine
ImportError: No module named GatEngine
The frustration is overwhelming
This isn't a GAT issue so much as a clarification on how to build annotation files for use with gat-great. I would like to build an annotation for Hg38 but am struggling to get the 1MB extension that GREAT uses. Here is what I have tried so far:
Thanks!
Annie
Hi, Andreas,
I think I figured out how to make the isochores files work, but now I'm running into another issue. If my sample doesn't have a chromosome annotation that is present in the isochore file, the software crashes:
Traceback (most recent call last):
File "/lab/sw/ve/pl2/bin/gat-run.py", line 4, in <module>
__import__('pkg_resources').run_script('gat==1.3.5', 'gat-run.py')
File "/lab/sw/ve/pl2/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 738, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/lab/sw/ve/pl2/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1499, in run_script
exec(code, namespace, namespace)
File "/lab/sw/ve/pl2/lib/python2.7/site-packages/gat-1.3.5-py2.7-linux-x86_64.egg/EGG-INFO/scripts/gat-run.py", line 317, in <module>
sys.exit(main(sys.argv))
File "/lab/sw/ve/pl2/lib/python2.7/site-packages/gat-1.3.5-py2.7-linux-x86_64.egg/EGG-INFO/scripts/gat-run.py", line 295, in main
annotator_results = fromSegments(options, args)
File "/lab/sw/ve/pl2/lib/python2.7/site-packages/gat-1.3.5-py2.7-linux-x86_64.egg/EGG-INFO/scripts/gat-run.py", line 86, in fromSegments
segments, annotations, workspaces, isochores = IO.buildSegments(options)
File "/lab/sw/ve/pl2/lib/python2.7/site-packages/gat-1.3.5-py2.7-linux-x86_64.egg/gat/IO.py", line 180, in buildSegments
isochores.intersect(workspaces["collapsed"])
File "gat/Engine.pyx", line 3061, in gat.Engine.IntervalCollection.intersect (gat/Engine.c:44811)
File "gat/Engine.pyx", line 2797, in gat.Engine.IntervalDictionary.__delitem__ (gat/Engine.c:37344)
KeyError: 'chrX'
In this case, it's complaining that chrX is in the isochores file, but not in the input files. For this particular analysis, I was using an input from which I had removed the sex chromosomes. To solve this problem, I had to make a new isochore file without the sex chromosomes. The same thing happened with a female sample, where it raised the KeyError for chrY.
Ideally, we should be able to use the same full isochore file for each input, with GAT deciding internally which chromosome annotations to use.
Best,
Ricardo
Update: I was just looking at some previous logs and found the same error, even though I wasn't using isochores. It seems the issue is also triggered when chromosome annotations are not present in the workspace.
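Until this is handled internally, the manual workaround described above (making a reduced isochore file) could be automated by keeping only the lines whose contig also occurs in the workspace. This is a sketch of that preprocessing step under the assumption that both inputs are plain tab-separated BED lines; the function names and example data are hypothetical, and nothing here reflects GAT's internals.

```python
def contigs_in(bed_lines):
    """Return the set of contig names (column 1) found in BED lines."""
    return {
        line.split("\t", 1)[0]
        for line in bed_lines
        if line.strip() and not line.startswith("#")
    }


def restrict_to_contigs(bed_lines, allowed):
    """Keep only BED lines whose contig is in `allowed`."""
    return [line for line in bed_lines if line.split("\t", 1)[0] in allowed]


# Hypothetical workspace without sex chromosomes, and a full isochore file.
workspace = ["chr1\t0\t5000\n", "chr2\t0\t4000\n"]
isochores = ["chr1\t0\t2500\tlow\n", "chr2\t0\t4000\thigh\n", "chrX\t0\t3000\tlow\n"]

kept = restrict_to_contigs(isochores, contigs_in(workspace))
print("".join(kept), end="")
```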
I just installed GAT and tried running gat-run.py. I immediately got this error:
Traceback (most recent call last):
File "../scripts/gat-run.py", line 73, in <module>
import gat.SegmentList as SegmentList
ImportError: No module named SegmentList
SegmentList.pyx is in gat/gat, so I am not sure why this is happening. Might someone be able to help me out? Thanks!
Hi, I think there is an incompatibility between gat v1.3.5 and python3 when using multiple threads via the -t/--num-threads parameter. Using the test data in the gat source code:
gat-run.py -s gat/test/data/segments_single.bed.gz \
-a gat/test/data/annotations.bed.gz \
-w gat/test/data/workspace.bed.gz \
-n 100 \
-t 1
# job started at Mon Apr 8 09:51:53 2019 on dario-T7500 -- 6eff5a66-3f9f-4a7c-bba5-3c6bd34e0478
# pid: 40016, system: Linux 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64
# annotation_files : ['gat/test/data/annotations.bed.gz']
# annotations_label : None
# annotations_to_points : None
# bucket_size : 0
# cache : None
# conditional : unconditional
# conditional_expansion : None
# conditional_extension : None
# counters : []
# enable_split_tracks : False
# ignore_segment_tracks : True
# input_filename_counts : None
# input_filename_descriptions : None
# input_filename_results : None
# isochore_files : None
# loglevel : 1
# nbuckets : 100000
# null : default
# num_samples : 100
# num_threads : 1
# output_bed : []
# output_counts_pattern : None
# output_filename_pattern : %s
# output_force : False
# output_order : fold
# output_plots_pattern : None
# output_samples_pattern : None
# output_stats : []
# output_tables_pattern : %s.tsv.gz
# overlapping_annotations : False
# pseudo_count : 1.0
# pvalue_method : empirical
# qvalue_lambda : None
# qvalue_method : BH
# qvalue_pi0_method : smoother
# random_seed : None
# restrict_workspace : False
# sample_files : []
# sampler : annotator
# segment_files : ['gat/test/data/segments_single.bed.gz']
# shift_expansion : 2.0
# shift_extension : 0
# short_help : None
# stderr : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
# stdin : <_io.TextIOWrapper name='<stdin>' mode='r' encoding='UTF-8'>
# stdlog : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
# stdout : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
# timeit_file : None
# timeit_header : None
# timeit_name : all
# truncate_segments_to_workspace : False
# truncate_workspace_to_annotations : False
# workspace_files : ['gat/test/data/workspace.bed.gz']
## 2019-04-08 09:51:53,468 INFO segments: reading tracks from 1 files
## 2019-04-08 09:51:53,566 INFO segments: read 1 tracks from 1 files
## 2019-04-08 09:51:53,568 INFO annotations: reading tracks from 1 files
## 2019-04-08 09:51:53,786 INFO annotations: read 7 tracks from 1 files
## 2019-04-08 09:51:53,788 INFO workspaces: reading tracks from 1 files
## 2019-04-08 09:51:55,121 INFO workspaces: read 1 tracks from 1 files
## 2019-04-08 09:51:55,134 INFO collapsing workspaces
## 2019-04-08 09:51:55,135 INFO intervals loaded in 1 seconds
## 2019-04-08 09:51:55,144 INFO collecting observed counts
## 2019-04-08 09:51:55,145 INFO starting sampling
## 2019-04-08 09:51:55,145 INFO sampling: merged: 1/1
## 2019-04-08 09:51:55,145 INFO performing unconditional sampling
## 2019-04-08 09:51:55,147 INFO workspace without conditioning: 279844 segments, 2204303400 nucleotides
## 2019-04-08 09:51:55,148 INFO workspace after conditioning: 279844 segments, 2204303400 nucleotides
## 2019-04-08 09:51:55,149 INFO setting up shared data for multi-processing
## 2019-04-08 09:51:55,153 INFO sampling started
## 2019-04-08 09:51:55,154 INFO generating processpool with 1 threads for 100 items
## 2019-04-08 09:51:55,363 INFO 0/100 done ( 0.00)
## 2019-04-08 09:52:11,781 INFO sampling completed
## 2019-04-08 09:52:11,781 INFO retrieving private data
Traceback (most recent call last):
File "/home/dario/miniconda3/envs/tritume/bin/gat-run.py", line 317, in <module>
sys.exit(main(sys.argv))
File "/home/dario/miniconda3/envs/tritume/bin/gat-run.py", line 295, in main
annotator_results = fromSegments(options, args)
File "/home/dario/miniconda3/envs/tritume/bin/gat-run.py", line 218, in fromSegments
num_threads=options.num_threads)
File "/home/dario/miniconda3/envs/tritume/lib/python3.6/site-packages/gat/__init__.py", line 1010, in run
track, counts, counters, segs, annotations, workspace, outfiles)
File "/home/dario/miniconda3/envs/tritume/lib/python3.6/site-packages/gat/__init__.py", line 763, in sample
annotations.unshare()
File "gat/Engine.pyx", line 2688, in gat.Engine.IntervalContainer.unshare
TypeError: expected bytes, str found
The error "TypeError: expected bytes, str found" is presumably due to the difference between python 3 and 2 in handling strings.
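To make the string difference concrete: in Python 3, `str` and `bytes` are distinct types, so code that expects `bytes` rejects a `str` unless it is encoded first. The helper below is a generic sketch of such a conversion; `to_bytes` is a hypothetical name, and where (or whether) a conversion like this belongs in GAT is not known from the traceback.

```python
def to_bytes(value, encoding="ascii"):
    """Return `value` as bytes, encoding it first if it is a str."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# In Python 3 a text string and a byte string never compare equal.
name = "chr1"
print(name == b"chr1")
print(to_bytes(name) == b"chr1")
print(to_bytes(b"chr1") == b"chr1")
```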
Without multithreading things work fine:
gat-run.py -s gat/test/data/segments_single.bed.gz -a gat/test/data/annotations.bed.gz -w gat/test/data/workspace.bed.gz -n 100 -t 0
...
# job finished in 13 seconds at Mon Apr 8 10:04:49 2019 -- 14.57 0.07 0.00 0.00 -- a96d4d32-d60d-420c-b195-8ecc9866f65c
NB: This is using gat installed via conda as:
conda install gat=1.3.5=py36ha92aebf_2
Some weeks ago I submitted a bioconda recipe (v1.3.5-3) that sets the python version to 2.
Hi Andreas,
Sorry to bother you again!
We are trying to run GAT in python 3 on the CGAT cluster - it works for this pair of files in python 2 but in python 3 we get this error:
gat-run.py --segment-file=/ifs/projects/proj060/genomic_context_2/annotations/all_enhancers.bed --annotation-file=/ifs/projects/proj060/genomic_context_2/res/score5.bed --workspace-file=contigs.bed.gz > merged_no_IDR_vs_all_ensembl_exons.tsv
Traceback (most recent call last):
File "/ifs/devel/catherineg/py35-v1/conda/bin/gat-run.py", line 4, in <module>
__import__('pkg_resources').run_script('gat==1.3.3', 'gat-run.py')
File "/ifs/devel/catherineg/py35-v1/conda/lib/python3.5/site-packages/pkg_resources/__init__.py", line 739, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/ifs/devel/catherineg/py35-v1/conda/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1494, in run_script
exec(code, namespace, namespace)
File "/ifs/devel/catherineg/py35-v1/conda/lib/python3.5/site-packages/gat-1.3.3-py3.5-linux-x86_64.egg-info/scripts/gat-run.py", line 317, in <module>
sys.exit(main(sys.argv))
File "/ifs/devel/catherineg/py35-v1/conda/lib/python3.5/site-packages/gat-1.3.3-py3.5-linux-x86_64.egg-info/scripts/gat-run.py", line 295, in main
annotator_results = fromSegments(options, args)
File "/ifs/devel/catherineg/py35-v1/conda/lib/python3.5/site-packages/gat-1.3.3-py3.5-linux-x86_64.egg-info/scripts/gat-run.py", line 116, in fromSegments
restrict_workspace=options.restrict_workspace)
File "/ifs/devel/catherineg/py35-v1/conda/lib/python3.5/site-packages/gat/IO.py", line 246, in applyIsochores
segments.filter(workspaces["collapsed"])
File "gat/Engine.pyx", line 3051, in gat.Engine.IntervalCollection.filter (gat/Engine.c:44856)
RuntimeError: dictionary changed size during iteration
We have version 1.3.3 I think, which conda thinks is the latest version (although on github there is a 1.3.4).
Thanks very much,
Katy (and Catherine)
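For context on the error message itself: "dictionary changed size during iteration" is what Python 3 raises when keys are deleted from a dict while looping over it. In Python 2, `dict.keys()` returns a list copy, so the same loop runs without complaint, which would be consistent with the python 2 / python 3 difference reported above. The sketch below reproduces the pattern with a plain dict; it is an assumption that GAT's filter does something similar, and the function names and data are hypothetical.

```python
intervals = {"chr1": [(0, 100)], "chr2": [(0, 50)], "chrX": [(10, 20)]}
workspace = {"chr1", "chr2"}


def filter_unsafe(d, keep):
    """Delete keys while iterating over the live key view (fails in Python 3)."""
    for contig in d.keys():
        if contig not in keep:
            del d[contig]


def filter_safe(d, keep):
    """Iterate over a snapshot of the keys so deletions are allowed."""
    for contig in list(d.keys()):
        if contig not in keep:
            del d[contig]


try:
    filter_unsafe(dict(intervals), workspace)
    unsafe_failed = False
except RuntimeError as error:
    unsafe_failed = True
    print("unsafe:", error)

safe_copy = dict(intervals)
filter_safe(safe_copy, workspace)
print("safe:", sorted(safe_copy))
```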
When I tried to download the example data (http://www.cgat.org/~andreas/documentation/gat-examples/TutorialIntervalOverlap.tar.gz) from the GAT tutorial (https://gat.readthedocs.io/en/latest/tutorialIntervalOverlap.html), I found that I have no access to this data.
Forbidden
You don't have permission to access /~andreas/documentation/gat-examples/TutorialIntervalOverlap.tar.gz on this server.
Would you please check what went wrong?
As described here, I tried the --cache option but I got a "Segmentation fault (core dumped)" error. I am using the files provided in the gat tutorial and this is the full command I ran:
gat-run.py --segments=srf.hg19.bed.gz --annotations=jurkat.hg19.dhs.bed.gz --workspace=contigs.bed.gz --ignore-segment-tracks --num-samples=100 --log=gat.log --cache=gat.cache > gat.tsv
It runs fine without the cache option. @AndreasHeger what might be going wrong here? Can you please share your thoughts?
Thanks,
Sarang
Hi AndreasHeger,
Problem:
--workspace: BED file of all regions in genome (excluding regions composed of N's)
--segments: BED file of annotations in subset of contigs
contig_1001 21 792 RepeatMasker
contig_1001 27 34 dust
contig_1001 93 159 dust
contig_1001 246 255 dust
contig_1001 266 339 dust
contig_1001 415 422 dust
--annotation: BED file of annotations across the whole genome (same as above but for whole genome)
The output I get when running:
gat-run.py --ignore-segment-tracks --segments=segments.bed --annotations=annotations.bed --workspace=workspace.bed --num-samples=100 --log=gat.log --num-threads=8 > gat.out
is
track annotation observed expected CI95low CI95high stddev fold l2fold pvalue qvalue track_nsegments track_size track_density annotation_nsegments annotation_size annotation_density overlap_nsegments overlap_size overlap_density percent_overlap_nsegments_track percent_overlap_size_track percent_overlap_nsegments_annotation percent_overlap_size_annotation
merged ncrnas_predicted 2913 1709.1200 1300.0000 1994.0000 209.0009 1.7040 0.7689 1.0000e-02 1.0000e-02 62983 6935174 6.6911e+00 1025 163283 1.5754e-01 30 2913 2.8105e-03 0.0476 0.0420 2.9268 1.7840
merged gene 389744 170648.2000 163172.0000 177856.0000 5359.9760 2.2839 1.1915 1.0000e-02 1.0000e-02 62983 6935174 6.6911e+00 18574 37934616 3.6599e+01 278 389744 3.7603e-01 0.4414 5.6198 1.4967 1.0274
merged tandem 368130 158513.4400 154952.0000 162625.0000 2399.6840 2.3224 1.2156 1.0000e-02 1.0000e-02 62983 6935174 6.6911e+00 47134 4562430 4.4018e+00 4994 368130 3.5517e-01 7.9291 5.3082 10.5953 8.0687
merged RepeatMasker 1492404 610641.4800 602042.0000 620429.0000 6353.3404 2.4440 1.2892 1.0000e-02 1.0000e-02 62983 6935174 6.6911e+00 117147 21502336 2.0745e+01 8705 1492404 1.4399e+00 13.8212 21.5193 7.4308 6.9407
merged dust 3200967 1182955.4000 1172992.0000 1190872.0000 4343.2429 2.7059 1.4361 1.0000e-02 1.0000e-02 62983 6935174 6.6911e+00 382880 14706492 1.4189e+01 63463 3200967 3.0883e+00 100.7621 46.1555 16.5752 21.7657
I am confused: shouldn't percent_overlap_size_track and co. be 100% for all?
Thank you in advance.
cheers,
dom
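As a worked check of the percentage columns, the values in the ncrnas_predicted row above can be reproduced by dividing the overlap by the corresponding track or annotation total. That these are the definitions is an inference from the numbers in this output, not a statement from the GAT documentation; the helper name `percent` is hypothetical.

```python
def percent(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


# Values copied from the ncrnas_predicted row of the output above.
track_nsegments, track_size = 62983, 6935174
annotation_nsegments, annotation_size = 1025, 163283
overlap_nsegments, overlap_size = 30, 2913

overlap_percentages = {
    "percent_overlap_nsegments_track": percent(overlap_nsegments, track_nsegments),
    "percent_overlap_size_track": percent(overlap_size, track_size),
    "percent_overlap_nsegments_annotation": percent(overlap_nsegments, annotation_nsegments),
    "percent_overlap_size_annotation": percent(overlap_size, annotation_size),
}
for column, value in overlap_percentages.items():
    print("%s\t%.4f" % (column, value))
```

Under this reading, percent_overlap_size_track is the share of all segment nucleotides that fall within one particular annotation, so it would reach 100% only for an annotation covering every segment nucleotide.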