andreasheger / gat

Genomic Association Tester

gat's Issues

example data for GAT tutorial

When I tried to download the example data (http://www.cgat.org/~andreas/documentation/gat-examples/TutorialIntervalOverlap.tar.gz) from the GAT tutorial (https://gat.readthedocs.io/en/latest/tutorialIntervalOverlap.html), I found that I have no access to it.

Forbidden

You don't have permission to access /~andreas/documentation/gat-examples/TutorialIntervalOverlap.tar.gz on this server.

Could you please check what went wrong?

Incompatibility with python3 when using threads

Hi- I think there is an incompatibility between gat v1.3.5 and python3 when using multiple threads via the -t/--num-threads parameter. Using the test data in the gat source code:

gat-run.py -s gat/test/data/segments_single.bed.gz \
    -a gat/test/data/annotations.bed.gz \
    -w gat/test/data/workspace.bed.gz \
    -n 100 \
    -t 1

# job started at Mon Apr  8 09:51:53 2019 on dario-T7500 -- 6eff5a66-3f9f-4a7c-bba5-3c6bd34e0478
# pid: 40016, system: Linux 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64
# annotation_files                        : ['gat/test/data/annotations.bed.gz']
# annotations_label                       : None
# annotations_to_points                   : None
# bucket_size                             : 0
# cache                                   : None
# conditional                             : unconditional
# conditional_expansion                   : None
# conditional_extension                   : None
# counters                                : []
# enable_split_tracks                     : False
# ignore_segment_tracks                   : True
# input_filename_counts                   : None
# input_filename_descriptions             : None
# input_filename_results                  : None
# isochore_files                          : None
# loglevel                                : 1
# nbuckets                                : 100000
# null                                    : default
# num_samples                             : 100
# num_threads                             : 1
# output_bed                              : []
# output_counts_pattern                   : None
# output_filename_pattern                 : %s
# output_force                            : False
# output_order                            : fold
# output_plots_pattern                    : None
# output_samples_pattern                  : None
# output_stats                            : []
# output_tables_pattern                   : %s.tsv.gz
# overlapping_annotations                 : False
# pseudo_count                            : 1.0
# pvalue_method                           : empirical
# qvalue_lambda                           : None
# qvalue_method                           : BH
# qvalue_pi0_method                       : smoother
# random_seed                             : None
# restrict_workspace                      : False
# sample_files                            : []
# sampler                                 : annotator
# segment_files                           : ['gat/test/data/segments_single.bed.gz']
# shift_expansion                         : 2.0
# shift_extension                         : 0
# short_help                              : None
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
# stdin                                   : <_io.TextIOWrapper name='<stdin>' mode='r' encoding='UTF-8'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
# stdout                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# truncate_segments_to_workspace          : False
# truncate_workspace_to_annotations       : False
# workspace_files                         : ['gat/test/data/workspace.bed.gz']
## 2019-04-08 09:51:53,468 INFO segments: reading tracks from 1 files
## 2019-04-08 09:51:53,566 INFO segments: read 1 tracks from 1 files
## 2019-04-08 09:51:53,568 INFO annotations: reading tracks from 1 files
## 2019-04-08 09:51:53,786 INFO annotations: read 7 tracks from 1 files
## 2019-04-08 09:51:53,788 INFO workspaces: reading tracks from 1 files
## 2019-04-08 09:51:55,121 INFO workspaces: read 1 tracks from 1 files
## 2019-04-08 09:51:55,134 INFO collapsing workspaces
## 2019-04-08 09:51:55,135 INFO intervals loaded in 1 seconds
## 2019-04-08 09:51:55,144 INFO collecting observed counts
## 2019-04-08 09:51:55,145 INFO starting sampling
## 2019-04-08 09:51:55,145 INFO sampling: merged: 1/1
## 2019-04-08 09:51:55,145 INFO performing unconditional sampling
## 2019-04-08 09:51:55,147 INFO workspace without conditioning: 279844 segments, 2204303400 nucleotides
## 2019-04-08 09:51:55,148 INFO workspace after conditioning: 279844 segments, 2204303400 nucleotides
## 2019-04-08 09:51:55,149 INFO setting up shared data for multi-processing
## 2019-04-08 09:51:55,153 INFO sampling started
## 2019-04-08 09:51:55,154 INFO generating processpool with 1 threads for 100 items
## 2019-04-08 09:51:55,363 INFO 0/100 done ( 0.00)
## 2019-04-08 09:52:11,781 INFO sampling completed
## 2019-04-08 09:52:11,781 INFO retrieving private data
Traceback (most recent call last):
  File "/home/dario/miniconda3/envs/tritume/bin/gat-run.py", line 317, in <module>
    sys.exit(main(sys.argv))
  File "/home/dario/miniconda3/envs/tritume/bin/gat-run.py", line 295, in main
    annotator_results = fromSegments(options, args)
  File "/home/dario/miniconda3/envs/tritume/bin/gat-run.py", line 218, in fromSegments
    num_threads=options.num_threads)
  File "/home/dario/miniconda3/envs/tritume/lib/python3.6/site-packages/gat/__init__.py", line 1010, in run
    track, counts, counters, segs, annotations, workspace, outfiles)
  File "/home/dario/miniconda3/envs/tritume/lib/python3.6/site-packages/gat/__init__.py", line 763, in sample
    annotations.unshare()
  File "gat/Engine.pyx", line 2688, in gat.Engine.IntervalContainer.unshare
TypeError: expected bytes, str found

The error TypeError: expected bytes, str found is probably due to the difference between how Python 2 and Python 3 handle strings.

Without multithreading things work fine:

gat-run.py -s gat/test/data/segments_single.bed.gz -a gat/test/data/annotations.bed.gz -w gat/test/data/workspace.bed.gz -n 100 -t 0
...
# job finished in 13 seconds at Mon Apr  8 10:04:49 2019 -- 14.57  0.07  0.00  0.00 -- a96d4d32-d60d-420c-b195-8ecc9866f65c

NB: This is using gat installed via conda as:

conda install gat=1.3.5=py36ha92aebf_2

Some weeks ago I submitted a bioconda recipe (v1.3.5-3) that sets the Python version to 2.
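
For context, here is a minimal illustration of the kind of mismatch that produces this message. It is not GAT's code, and `to_bytes` is a hypothetical helper. In Python 3, `str` and `bytes` are distinct types, so code that expects `bytes` (for example a typed Cython argument) rejects a `str` unless it is encoded first.

```python
def to_bytes(name, encoding="ascii"):
    """Return name as bytes, encoding it first if it is a str.

    Hypothetical helper for illustration; not part of GAT.
    """
    if isinstance(name, str):
        return name.encode(encoding)
    return name


contig = "chr1"                        # a text string in Python 3
as_bytes = to_bytes(contig)            # encoded to bytes
round_trip = as_bytes.decode("ascii")  # decoded back to text

# Mixing the two types is an error in Python 3 (it was allowed in Python 2).
try:
    b"chr" + "1"
    mixing_failed = False
except TypeError:
    mixing_failed = True
```

Whether the fix in GAT is an encode on the Python side or a change in the Cython signature cannot be told from the traceback alone.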

dictionary changed size during iteration in Python 3

Hi Andreas,

Sorry to bother you again!

We are trying to run GAT in python 3 on the CGAT cluster - it works for this pair of files in python 2 but in python 3 we get this error:

gat-run.py --segment-file=/ifs/projects/proj060/genomic_context_2/annotations/all_enhancers.bed    --annotation-file=/ifs/projects/proj060/genomic_context_2/res/score5.bed --workspace-file=contigs.bed.gz > merged_no_IDR_vs_all_ensembl_exons.tsv
Traceback (most recent call last):
  File "/ifs/devel/catherineg/py35-v1/conda/bin/gat-run.py", line 4, in <module>
    __import__('pkg_resources').run_script('gat==1.3.3', 'gat-run.py')
  File "/ifs/devel/catherineg/py35-v1/conda/lib/python3.5/site-packages/pkg_resources/__init__.py", line 739, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/ifs/devel/catherineg/py35-v1/conda/lib/python3.5/site-packages/pkg_resources/__init__.py", line 1494, in run_script
    exec(code, namespace, namespace)
  File "/ifs/devel/catherineg/py35-v1/conda/lib/python3.5/site-packages/gat-1.3.3-py3.5-linux-x86_64.egg-info/scripts/gat-run.py", line 317, in <module>
    sys.exit(main(sys.argv))
  File "/ifs/devel/catherineg/py35-v1/conda/lib/python3.5/site-packages/gat-1.3.3-py3.5-linux-x86_64.egg-info/scripts/gat-run.py", line 295, in main
    annotator_results = fromSegments(options, args)
  File "/ifs/devel/catherineg/py35-v1/conda/lib/python3.5/site-packages/gat-1.3.3-py3.5-linux-x86_64.egg-info/scripts/gat-run.py", line 116, in fromSegments
    restrict_workspace=options.restrict_workspace)
  File "/ifs/devel/catherineg/py35-v1/conda/lib/python3.5/site-packages/gat/IO.py", line 246, in applyIsochores
    segments.filter(workspaces["collapsed"])
  File "gat/Engine.pyx", line 3051, in gat.Engine.IntervalCollection.filter (gat/Engine.c:44856)
RuntimeError: dictionary changed size during iteration

We have version 1.3.3 I think, which conda thinks is the latest version (although on github there is a 1.3.4).

Thanks very much,
Katy (and Catherine)
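
A minimal sketch of how this error typically arises, assuming (this is not confirmed from the GAT source) that entries are deleted from a dictionary while looping over its keys. In Python 2, `keys()` returns a list copy, so the loop is safe; in Python 3 it returns a live view, and deleting during iteration raises the RuntimeError above. The function names and the data are hypothetical.

```python
def drop_empty_unsafe(intervals_by_contig):
    """Deletes while iterating over the live key view: fails in Python 3."""
    for contig in intervals_by_contig.keys():
        if not intervals_by_contig[contig]:
            del intervals_by_contig[contig]
    return intervals_by_contig


def drop_empty(intervals_by_contig):
    """Iterates over a snapshot of the keys, so deleting is safe."""
    for contig in list(intervals_by_contig.keys()):
        if not intervals_by_contig[contig]:
            del intervals_by_contig[contig]
    return intervals_by_contig


try:
    drop_empty_unsafe({"chr1": [], "chr2": [(0, 10)]})
    unsafe_error = None
except RuntimeError as exc:
    unsafe_error = str(exc)

cleaned = drop_empty({"chr1": [], "chr2": [(0, 10)]})
```

Wrapping the keys in `list(...)` behaves the same in Python 2 and Python 3.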

Feature request: gat-compare should not assume the "track" column is equal to "merged"

Dear Andreas,

I have been happily using GAT (gat-run & gat-compare) over the past months - thanks a lot for it!

I recently updated all of my code so that I could keep a more informative string in the "track" column in my gat-run output. However, when I got to gat-compare I no longer got any results. Inspecting the code, it seems that around lines 264-270 of gat-compare the code assumes that the "track" column only contains the string "merged". This would be the case when using --ignore-segment-tracks, as I was doing previously, but now with my more "informative" track strings, this blocks things.

Would it be possible to change the code so that, similar to what is done in the in-file-comparison part, the list of track values is established and iterated over, removing this hard-coded limitation?

Thank you in advance for your help!

Best regards,

-- Alex
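
The request above could be sketched roughly as follows. This is not the gat-compare implementation: `rows_by_track` and `shared_tracks` are hypothetical helpers, the input is assumed to be a tab-separated table with `track` and `annotation` columns, and the numbers are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def rows_by_track(handle):
    """Group rows of a tab-separated table by track, then by annotation."""
    grouped = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        grouped[row["track"]][row["annotation"]] = row
    return grouped


def shared_tracks(first, second):
    """Track names present in both tables, instead of assuming 'merged'."""
    return sorted(set(first) & set(second))


# Illustrative input only; real files would be opened from disk.
table_a = "track\tannotation\tobserved\nsetA\tgene\t10\nsetB\tgene\t7\n"
table_b = "track\tannotation\tobserved\nsetA\tgene\t12\nsetC\tgene\t3\n"

first = rows_by_track(io.StringIO(table_a))
second = rows_by_track(io.StringIO(table_b))

pairs = []
for track in shared_tracks(first, second):
    for annotation in sorted(set(first[track]) & set(second[track])):
        pairs.append((track, annotation))
```

Each `(track, annotation)` pair in `pairs` is then a candidate for comparison between the two runs.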

making annotation files

This isn't a GAT issue so much as a clarification on how to build annotation files for use with gat-great. I would like to build an annotation for Hg38 but am struggling to get the 1 Mb extension that GREAT uses. Here is what I have tried so far:

1. Download TSS coordinates from Ensembl for Hg38 (including chromosome, TSS, gene ID, and strand).
2. Add 5 kb upstream and 1 kb downstream of each TSS using bedtools flank with strand; this gives the basal region.
3. Add 1 Mb up- and downstream of the basal region using bedtools slop with strand.

But now I am stuck on how to remove the overlapping regions. Is there a better approach for extending the regions by up to 1 Mb until they hit the next basal region? Or, if there is a script for building these types of annotations, it would be awesome if you were willing to share it!

Thanks!

Annie
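
One way to approach this without removing overlaps afterwards is to compute the limits directly. The sketch below is a simplified reading of the "basal plus extension" rule (basal region of 5 kb upstream and 1 kb downstream, extended by up to 1 Mb until the nearest basal region of another gene). It is not code from GAT or GREAT, `regulatory_domains` is a hypothetical function, and chromosome ends are only clipped at 0.

```python
from collections import defaultdict


def regulatory_domains(genes, upstream=5000, downstream=1000, extension=1000000):
    """genes: iterable of (chrom, tss, strand, gene_id).

    Returns a list of (chrom, start, end, gene_id) domains.
    """
    by_chrom = defaultdict(list)
    for chrom, tss, strand, gene_id in genes:
        if strand == "+":
            start, end = tss - upstream, tss + downstream
        else:
            start, end = tss - downstream, tss + upstream
        by_chrom[chrom].append((tss, max(start, 0), end, gene_id))

    domains = []
    for chrom, basal in by_chrom.items():
        basal.sort()
        n = len(basal)
        # Furthest basal end among genes to the left of each gene.
        left_limit = [0] * n
        running = 0
        for i in range(n):
            left_limit[i] = running
            running = max(running, basal[i][2])
        # Nearest basal start among genes to the right of each gene.
        right_limit = [None] * n
        running = None
        for i in range(n - 1, -1, -1):
            right_limit[i] = running
            if running is None or basal[i][1] < running:
                running = basal[i][1]
        for i, (tss, start, end, gene_id) in enumerate(basal):
            ext_start = max(start - extension, left_limit[i])
            ext_end = end + extension
            if right_limit[i] is not None:
                ext_end = min(ext_end, right_limit[i])
            # Never shrink the basal region itself.
            domains.append((chrom, min(start, ext_start), max(end, ext_end), gene_id))
    return domains


example = regulatory_domains([
    ("chr1", 10000, "+", "A"),
    ("chr1", 50000, "-", "B"),
])
```

With this rule, neighbouring domains still overlap in the intergenic space between their basal regions; only the basal regions of other genes act as stops.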

No module called GatEngine

Whatever I do, however I install and reinstall gat, I can't launch gat-compare:

Traceback (most recent call last):
  File "/home/adc34/anaconda2/bin/gat-compare.py", line 4, in <module>
    __import__('pkg_resources').run_script('gat==1.3.5', 'gat-compare.py')
  File "/home/adc34/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 661, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/adc34/anaconda2/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1441, in run_script
    exec(code, namespace, namespace)
  File "/home/adc34/anaconda2/lib/python2.7/site-packages/gat-1.3.5-py2.7-linux-x86_64.egg/EGG-INFO/scripts/gat-compare.py", line 98, in <module>
    import GatEngine
ImportError: No module named GatEngine

The frustration is overwhelming

Not an issue, but I am confused ...

Hi AndreasHeger,

Problem:

I want to calculate whether certain annotation features (genes, repeats, etc) are enriched/depleted in a particular subset of contigs in an assembly

--workspace: BED file of all regions in genome (excluding regions composed of N's)
--segments: BED file of annotations in subset of contigs

contig_1001    21      792     RepeatMasker
contig_1001    27      34      dust
contig_1001    93      159     dust
contig_1001    246     255     dust
contig_1001    266     339     dust
contig_1001    415     422     dust

--annotation: BED file of annotations across the whole genome (same as above but for whole genome)

The output I get when running:

gat-run.py --ignore-segment-tracks --segments=segments.bed --annotations=annotations.bed --workspace=workspace.bed --num-samples=100 --log=gat.log --num-threads=8 > gat.out

track   annotation        observed  expected      CI95low       CI95high      stddev     fold    l2fold  pvalue      qvalue      track_nsegments  track_size  track_density  annotation_nsegments  annotation_size  annotation_density  overlap_nsegments  overlap_size  overlap_density  percent_overlap_nsegments_track  percent_overlap_size_track  percent_overlap_nsegments_annotation  percent_overlap_size_annotation
merged  ncrnas_predicted  2913      1709.1200     1300.0000     1994.0000     209.0009   1.7040  0.7689  1.0000e-02  1.0000e-02  62983            6935174     6.6911e+00     1025                  163283           1.5754e-01          30                 2913          2.8105e-03       0.0476                           0.0420                      2.9268                                1.7840
merged  gene              389744    170648.2000   163172.0000   177856.0000   5359.9760  2.2839  1.1915  1.0000e-02  1.0000e-02  62983            6935174     6.6911e+00     18574                 37934616         3.6599e+01          278                389744        3.7603e-01       0.4414                           5.6198                      1.4967                                1.0274
merged  tandem            368130    158513.4400   154952.0000   162625.0000   2399.6840  2.3224  1.2156  1.0000e-02  1.0000e-02  62983            6935174     6.6911e+00     47134                 4562430          4.4018e+00          4994               368130        3.5517e-01       7.9291                           5.3082                      10.5953                               8.0687
merged  RepeatMasker      1492404   610641.4800   602042.0000   620429.0000   6353.3404  2.4440  1.2892  1.0000e-02  1.0000e-02  62983            6935174     6.6911e+00     117147                21502336         2.0745e+01          8705               1492404       1.4399e+00       13.8212                          21.5193                     7.4308                                6.9407
merged  dust              3200967   1182955.4000  1172992.0000  1190872.0000  4343.2429  2.7059  1.4361  1.0000e-02  1.0000e-02  62983            6935174     6.6911e+00     382880                14706492         1.4189e+01          63463              3200967       3.0883e+00       100.7621                         46.1555                     16.5752                               21.7657

I am confused:

Shouldn't percent_overlap_size_track and the related columns be 100% for all rows?

Thank you in advance.

cheers,

dom
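
As a worked check, the percentages in the first row (ncrnas_predicted) can be reproduced from the other columns. The formulas below are inferred from the column names and the numbers in the table, not taken from the GAT documentation, and `percent` is a hypothetical helper.

```python
# Values copied from the ncrnas_predicted row of the output above.
row = {
    "track_nsegments": 62983,
    "track_size": 6935174,
    "annotation_nsegments": 1025,
    "annotation_size": 163283,
    "overlap_nsegments": 30,
    "overlap_size": 2913,
}


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


# Reported as 0.0420: overlapping nucleotides relative to all segment nucleotides.
pct_size_track = percent(row["overlap_size"], row["track_size"])
# Reported as 1.7840: overlapping nucleotides relative to the annotation's size.
pct_size_annotation = percent(row["overlap_size"], row["annotation_size"])
# Reported as 0.0476 and 2.9268: the same ratios counted in segments.
pct_nseg_track = percent(row["overlap_nsegments"], row["track_nsegments"])
pct_nseg_annotation = percent(row["overlap_nsegments"], row["annotation_nsegments"])
```

Under this reading, percent_overlap_size_track is the share of the whole merged segment set that overlaps one annotation, so it would only reach 100% if a single annotation covered every nucleotide in the segments.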

Add strand specific analysis

Unable to cache sampling results

As described here, I tried the --cache option, but I got a "Segmentation fault (core dumped)" error. I am using the files provided in the GAT tutorial, and this is the full command I ran:

gat-run.py --segments=srf.hg19.bed.gz --annotations=jurkat.hg19.dhs.bed.gz --workspace=contigs.bed.gz --ignore-segment-tracks --num-samples=100 --log=gat.log --cache=gat.cache > gat.tsv

It runs fine without the cache option. @AndreasHeger what might be going wrong here? Can you please share your thoughts?

Thanks,
Sarang

Import Error

I just installed GAT and tried running gat-run.py. I immediately got this error:

Traceback (most recent call last):
  File "../scripts/gat-run.py", line 73, in <module>
    import gat.SegmentList as SegmentList
ImportError: No module named SegmentList

SegmentList.pyx is in gat/gat, so I am not sure why this is happening. Might someone be able to help me out? Thanks!

Minor issue: GAT hangs for missing IDs in isochore files

Dear Andreas,

I've come across a minor issue while using GAT over the past months.

For some reason if the ID field (4th column) of the Isochores BED file is set to ".", GAT hangs with the following Python error:

Traceback (most recent call last):
  File "/software/UHTS/Analysis/gat/1.2/bin/gat-run.py", line 335, in <module>
    sys.exit(main(sys.argv))
  File "/software/UHTS/Analysis/gat/1.2/bin/gat-run.py", line 313, in main
    annotator_results = fromSegments(options, args)
  File "/software/UHTS/Analysis/gat/1.2/bin/gat-run.py", line 242, in fromSegments
    num_threads=options.num_threads)
  File "/software/UHTS/Analysis/gat/1.2/lib64/python2.7/site-packages/gat/__init__.py", line 915, in run
    track, counts, counters, segs, annotations, workspace, outfiles)
  File "/software/UHTS/Analysis/gat/1.2/lib64/python2.7/site-packages/gat/__init__.py", line 622, in sample
    contig_annotations.fromIsochores()
  File "GatEngine.pyx", line 2979, in GatEngine.IntervalCollection.fromIsochores (GatEngine/GatEngine.c:36284)
  File "GatEngine.pyx", line 2767, in GatEngine.IntervalDictionary.fromIsochores (GatEngine/GatEngine.c:32377)
ValueError: too many values to unpack (expected 2)

It runs fine if the "." is set to some other constant (e.g. "isoc" for isochore).
This is not the case for the workspace BED file: "ws" or "." both work just fine.

It actually took me a while to find the source of this problem, so I just thought I'd point it out here!

Best regards,

-- Alex
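
Until this is handled inside GAT, the workaround described above (replacing "." with a constant) can be scripted. This is a minimal sketch: `fill_missing_names` is a hypothetical helper, and the placeholder "isoc" is simply the value reported to work in this issue.

```python
def fill_missing_names(bed_lines, placeholder="isoc"):
    """Replace a '.' in the 4th BED column with a placeholder name."""
    fixed = []
    for line in bed_lines:
        # Leave blank lines and comments untouched.
        if not line.strip() or line.startswith("#"):
            fixed.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[3] == ".":
            fields[3] = placeholder
        fixed.append("\t".join(fields) + "\n")
    return fixed


isochores = [
    "chr1\t0\t1000\t.\n",
    "chr1\t1000\t2000\thigh_gc\n",
]
patched = fill_missing_names(isochores)
```

For a real file, the lines would be read from the isochore BED file and the result written to a new file that is then passed to gat-run.py.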

Isochore processing can't handle missing chromosome annotations

Hi, Andreas,

I think I figured out how to make the isochores files work, but now I'm running into another issue. If my sample doesn't have a chromosome annotation that is present in the isochore file, the software crashes:

Traceback (most recent call last):
  File "/lab/sw/ve/pl2/bin/gat-run.py", line 4, in <module>
    __import__('pkg_resources').run_script('gat==1.3.5', 'gat-run.py')
  File "/lab/sw/ve/pl2/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 738, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/lab/sw/ve/pl2/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1499, in run_script
    exec(code, namespace, namespace)
  File "/lab/sw/ve/pl2/lib/python2.7/site-packages/gat-1.3.5-py2.7-linux-x86_64.egg/EGG-INFO/scripts/gat-run.py", line 317, in <module>
    sys.exit(main(sys.argv))
  File "/lab/sw/ve/pl2/lib/python2.7/site-packages/gat-1.3.5-py2.7-linux-x86_64.egg/EGG-INFO/scripts/gat-run.py", line 295, in main
    annotator_results = fromSegments(options, args)
  File "/lab/sw/ve/pl2/lib/python2.7/site-packages/gat-1.3.5-py2.7-linux-x86_64.egg/EGG-INFO/scripts/gat-run.py", line 86, in fromSegments
    segments, annotations, workspaces, isochores = IO.buildSegments(options)
  File "/lab/sw/ve/pl2/lib/python2.7/site-packages/gat-1.3.5-py2.7-linux-x86_64.egg/gat/IO.py", line 180, in buildSegments
    isochores.intersect(workspaces["collapsed"])
  File "gat/Engine.pyx", line 3061, in gat.Engine.IntervalCollection.intersect (gat/Engine.c:44811)
  File "gat/Engine.pyx", line 2797, in gat.Engine.IntervalDictionary.__delitem__ (gat/Engine.c:37344)
KeyError: 'chrX'

In this case, it's complaining that chrX is in the isochores file, but not in the input files. For this particular analysis, I was using an input where I removed the sex chromosomes. In order to solve this problem, I had to make a new isochore file without the sex chromosomes. The same thing happened with a female sample, where it gave a KeyError for chrY.

Ideally, we should use the same full isochore file for each input, and GAT decides internally which chromosome annotations to use.

Best,
Ricardo

Update: I was just looking at some previous logs and found the same error, even though I wasn't using isochores. It seems this issue is also caused by chromosome annotations not being present in the workspace.
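
The manual workaround described above (building a reduced isochore file per input) can be automated by keeping only the contigs that also occur in the workspace. This is a sketch of that workaround, not a fix inside GAT; `contig_names` and `restrict_to_contigs` are hypothetical helpers and the intervals are illustrative.

```python
def contig_names(bed_lines):
    """Collect the contig names (first column) used in a BED file."""
    names = set()
    for line in bed_lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t")[0].strip())
    return names


def restrict_to_contigs(bed_lines, keep):
    """Keep only the BED lines whose contig is in the set `keep`."""
    return [
        line for line in bed_lines
        if line.strip() and line.split("\t")[0].strip() in keep
    ]


workspace = ["chr1\t0\t1000\n", "chr2\t0\t500\n"]
isochores = ["chr1\t0\t1000\tlow_gc\n", "chrX\t0\t800\thigh_gc\n"]

filtered = restrict_to_contigs(isochores, contig_names(workspace))
```

The same filter could be applied to any input file that names contigs missing from the workspace.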

Memory consumption explodes when using mappability tracks

Hi,

Running GAT with just regions that are not assembly gaps finishes in a few seconds without large memory requirements. However, passing uniquely mappable regions as an additional workspace causes the program to crash due to insufficient memory. I have increased the memory available to GAT to tens of GB and it still exits due to memory requirements. (I run the following command on a cluster allowing a maximum of 100 GB of memory and reserving at least 40 GB.)

gat-run.py --segments=M2_pooled_filtered_peaks.bed \
    --annotations=annotations.bed \
    --workspace=contigs_ungapped.bed \
    --workspace=MM10-mappability_50.bed \
    --output-counts-pattern=M2_pooled_filtered_%s.overlap.counts.tsv.gz \
    --num-threads 8 --num-samples=10000 --log=gat.log --pvalue-method=norm > M2_pooled_filtered_gat.tsv

I am not entirely convinced that I am retrieving the uniquely mappable regions correctly, so it would be great if you could summarise how you derived them from the UCSC mappability tracks. I am working with mm10, for which UCSC doesn't have mappability tracks, so I am using gem-mappability to produce them for 50bp reads. The output (which I convert to wig and then bed) gives me around 8 million intervals (probably the reason GAT crashes!) with a score of 1 (uniquely mappable). If I understand mappability tracks correctly, any k-mer starting within the reported interval is unique, so I extend the region's end by 50bp.

Could you please provide the mapability_36.filtered.bed.gz file you use in the tutorials (https://media.readthedocs.org/pdf/gat/latest/gat.pdf) as an example? Have you done any tests on how big the workspace and annotation can be without memory requirements getting huge?

Many thanks for your help.

P.S.: M2_pooled_filtered_peaks.bed and annotations.bed are 43492 and 4707 intervals, respectively.
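
If the 8 million intervals are the problem, one thing to try is merging intervals that overlap after the 50bp extension, which can reduce the interval count considerably. The sketch below shows the extend-and-merge step on (chrom, start, end) tuples. `extend_and_merge` is a hypothetical helper, and whether this is enough to bring GAT's memory use down has not been tested here.

```python
def extend_and_merge(intervals, extend=50):
    """Extend each interval's end, then merge overlapping or touching ones.

    intervals: iterable of (chrom, start, end) tuples.
    Returns a sorted list of merged (chrom, start, end) tuples.
    """
    merged = []
    extended = sorted((chrom, start, end + extend) for chrom, start, end in intervals)
    for chrom, start, end in extended:
        last = merged[-1] if merged else None
        if last is not None and last[0] == chrom and start <= last[2]:
            # Overlaps (or touches) the previous interval: grow it.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(item) for item in merged]


mappable = [
    ("chr1", 0, 100),
    ("chr1", 120, 200),
    ("chr1", 400, 500),
    ("chr2", 0, 10),
]
merged = extend_and_merge(mappable, extend=50)
```

On real data the same result can be obtained on BED files with a tool such as bedtools merge after sorting.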

gat-compare, not identifying shared annotations between count files

Hello,

I am having issues running gat-compare.py on my regions. The output log of gat-compare says that there are no shared annotations between my count files, even though I definitely used the same annotation bed file to generate both count files. I have attached the count files below. I have also attached the output log file.

This is the code I used:
gat-compare.py anchorGAT_1_nucleotide-overlap.counts.tsv.gz anchorGAT_2_nucleotide-overlap.counts.tsv.gz

This is how I generated the count files:
gat-run.py --num-threads=4 --num-samples=1000 -s /home/qiadu/GATdata/RT_TADs/anchors_GAT/anchorset_forGAT_1.bed -a /home/qiadu/GATdata/RT_TADs/anchors_GAT/con.early.bed -w /home/qiadu/GATdata/RT_TADs/merged.genome.bed --log=GAT_1.log --output-counts-pattern=anchorGAT_1_%s.counts.tsv.gz

gat-run.py --num-threads=4 --num-samples=1000 -s /home/qiadu/GATdata/RT_TADs/anchors_GAT/anchorset_forGAT_2.bed -a /home/qiadu/GATdata/RT_TADs/anchors_GAT/con.early.bed -w /home/qiadu/GATdata/RT_TADs/merged.genome.bed --log=GAT_2.log --output-counts-pattern=anchorGAT_2_%s.counts.tsv.gz

anchorGAT_1_nucleotide-overlap.counts.tsv.gz
anchorGAT_2_nucleotide-overlap.counts.tsv.gz
gat-compare_anchor_output.txt

Any help would be appreciated. I might just be missing something obvious. Thank you.
Cheers,
Qian