jianhong / chippeakanno Goto Github PK

R 96.85% TeX 3.15%

chippeakanno's Introduction

ChIPpeakAnno

Batch annotation and visualization of peaks from ChIP-seq, ATAC-seq, and NAD-seq experiments or any experiments resulted in large number of chromosome ranges

Installation

To install this package, start R and enter:

library(BiocManager)
BiocManager::install("ChIPpeakAnno")

Documentation

To view documentation of ChIPpeakAnno, start R and enter:

browseVignettes("ChIPpeakAnno")

chippeakanno's People

Contributors

Stargazers

Watchers

Forkers

gusevfe junhuili1017 wook2014 hukai916

chippeakanno's Issues

genomicElementDistribution

Hello,

I had no issues running this.
Code ran perfectly, it is straight forward and simple.

I'm more trying to inquire as to the definitions for gene downstream.
What is it actually meant by "gene downstream"?
downstream of the ORF?

toGRanges() no longer working in response to UCSC update

Hello,

I have been using ChIPPeakAnno for ATAC-seq analysis, and it has worked perfectly for me multiple times in the past couple of months. However, about 1 week ago the toGRanges() function stopped working.

To reproduce this issue:

library(ChIPpeakAnno)
library("EnsDb.Mmusculus.v79")
annoData <- toGRanges(EnsDb.Mmusculus.v79)

This produces the following error message: "Error in .order_seqlevels(chrom_sizes[, "chrom"]) : !anyNA(m31) is not TRUE"

Multiple users in the past week have been experiencing similar issues using different packages, as has been discussed at length here. It seems that the issue is with a bad scaffold which messes with .order_seqlevels and seqLevelsStyle(), but I am not experienced enough with R to fix these issues on my own.

If you have any suggestion for a solution or workaround, I would greatly appreciate it!

opened in error

Sorry, there was an error in my code.

Error with `bindingType` in `annotatePeakInBatch`

Hi,

I would like to select a specific bindingType when using the annotatePeakInBatch function but I receive this error.

Annotate peaks by annoPeaks, see ?annoPeaks for details. maxgap will be ignored. Error in (function (peaks, annoData, bindingType = c("nearestBiDirectionalPromoters", : length(intersect(seqlevelsStyle(peaks), seqlevelsStyle(annoData))) > .... is not TRUE

disjointExons removed in Bioconductor 3.17

Hi Jianhong,

Thanks so much for your great package. I noticed that in the bioconductor development version 3.17, disjointExons method is removed from ensembldb and GenomicFeatures packages, causing your ChIPpeakAnno package to fail during build.

Would it be possible to fix this issue simply by changing your code like the following?

from R/privateUtil.R, line 616-618

disjointExons(ranges,
                                      aggregateGenes=FALSE,
                                      includeTranscripts=TRUE)

GenomicFeatures::exonicParts(ranges,
                                      linked.to.single.gene.only=TRUE)

Commits related to the removal of disjointExons:
ensembldb
GenomicFeatures

Thanks a lot for your time!

cvglists rtracklayer import error: Error in stop_if_wrong_length(what, ans_len) 'ranges' must have the length of the object to construct

Hi,

Greetings!

I am getting an error at the 'cvglists' create error: This was happening with my data, so I ran the demo data and I get the same error. The error withh traceback is below.

Error in stop_if_wrong_length(what, ans_len) :
'ranges' must have the length of the object to construct (79152) or length 1

12. | stop(wmsg(what, " must have the length of the object ", "to construct (", ans_len, ") or length 1"))

11. | stop_if_wrong_length(what, ans_len)

10. | new_GRanges("GRanges", seqnames = seqnames, ranges = ranges, strand = strand, mcols = mcols, seqinfo = seqinfo)

9. | GRanges(rep(seqnames(which), nhits), C_ans[[1L]], seqinfo = si)

8. | .local(con, format, text, ...)
-- | --

7. | import(FileForFormat(con, format), ...)
-- | --

6. | import(FileForFormat(con, format), ...)

5. | FUN(X[[i]], ...)

4. | FUN(X[[i]], ...)

3. | 	 lapply(X = X, FUN = FUN, ...)

2. | sapply(file.path(path, files), import, format = "BigWig", which = feature.recentered, as = "RleList")

1. | sapply(file.path(path, files), import, format = "BigWig", which = feature.recentered, as = "RleList")

The error happens with these steps: It occurs only when building "cvglists"

path <- system.file("extdata", package="ChIPpeakAnno")
files <- dir(path, "broadPeak")
data <- sapply(file.path(path, files), toGRanges, format="broadPeak")
(names(data) <- gsub(".broadPeak", "", files))

ol <- findOverlapsOfPeaks(data, connectedPeaks="keepAll")
features <- ol$peaklist[[length(ol$peaklist)]]
feature.recentered <- reCenterPeaks(features, width=4000)

library(rtracklayer)
files <- dir(path, "bigWig")
if(.Platform$OS.type != "windows"){
  cvglists <- sapply(file.path(path, files), import,
                     format="BigWig",
                     which=feature.recentered,
                     as="RleList")
}else{## rtracklayer can not import bigWig files on Windows
  load(file.path(path, "cvglist.rds"))
}

It looks like a print out rather than an error. But my heatmap is empty after the error!

Any help is appreciated !

Bug in makeVennDiagram.R

On line 373, the argument otherCount is passed to the plotting function, but the correct argument is otherCounts (see line 113).

This causes a "0" to be plotted in the Venn Diagram when otherCount is 0.

Annotation for dm6 does not contain gene_name

Hello Jianhong,

I have enjoyed using ChIPpeakAnno however working with dm6, the annoData does not contain a gene name. This is prohibiting me from extracting useful information directly with the more interpretable gene symbol. I can certainly find another way to convert the Flybase IDs if needed, but your examples in the user guide suggest that you can get the gene_name directly. Please let me know what I need to do to get this information!

Thank you.
Ashley

> library("TxDb.Dmelanogaster.UCSC.dm6.ensGene")
> txdb <- TxDb.Dmelanogaster.UCSC.dm6.ensGene
> annoData <- toGRanges(txdb, feature="gene")
> annoData
GRanges object with 17699 ranges and 0 metadata columns:
              seqnames            ranges strand
                 <Rle>         <IRanges>  <Rle>
  FBgn0000003    chr3R   6822498-6822796      +
  FBgn0000008    chr2R 22136968-22172834      +
  FBgn0000014    chr3R 16807214-16830049      -
  FBgn0000015    chr3R 16927212-16972236      -
  FBgn0000017    chr3L 16615866-16647882      -
          ...      ...               ...    ...
  FBgn0285994    chr3L 21576460-21576661      -
  FBgn0286004     chrX 19558607-19558693      +
  FBgn0286005    chr2R 18180151-18180205      -
  FBgn0286006    chr2R 12170373-12170462      -
  FBgn0286007    chr2R 12170153-12170244      -
  -------
  seqinfo: 1870 sequences (1 circular) from dm6 genome

findOverlapsOfPeaks error: Inputs contain duplicated ranges

I am having an issue finding overlaps between two GRanges objects. First GRanges object consists of peaks found by DiffBind and transformed to GRanges

peaks example without metadata columns:

peaks[1:6]

GRanges object with 6 ranges and 36 metadata columns:
         seqnames          ranges strand | 
            <Rle>       <IRanges>  <Rle> |       
  Peak_1     chr1 3105094-3105294      * |
  Peak_2     chr1 3221027-3221227      * |
  Peak_3     chr1 3400310-3400510      * |
  Peak_4     chr1 3426765-3426965      * |
  Peak_5     chr1 3433906-3434106      * | 
  Peak_6     chr1 3445765-3445965      * |

The second GRanges object is derived from gencode gtf

# load .gtf into txdb object
txdb = GenomicFeatures::makeTxDbFromGFF(
  "gencode.vM23.annotation.gtf",
  format="gtf", 
  dataSource="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.annotation.gtf.gz",
  organism="Mus musculus",
  circ_seqs = "chrM"
  )

# extract transcript TSS 
GRanges_TSS = compEpiTools::TSS(txdb)

I try overlapping both peak sets, explicitly selecting unique ranges for both GRanges objects:

overlaps = findOverlapsOfPeaks(unique(peaks[1:6]), unique(GRanges_TSS))

And I get an error:

Error in FUN(X[[i]], ...) : Inputs contains duplicated ranges. 
             please recheck your inputs.

Would it be possible to pinpoint the cause of this? Thank you!

Regarding oligoSummary()

Hi Jianhong,

Is there a reference for :

ChIPpeakAnno/R/oligoSummary.R

Line 214 in df99dec

std <- sqrt(mu/N * (1-mu/N) / N)

Thanks,

oligoSummary not populating the motif matrix output field

Hello,

I'm trying to run oligoSummary, in order to use motifStack as shown in the vignette, but for some reason when I run it the $motif section is not being populated, and just says NA. The other sections are being filled.

Code:

summary of the short oligos

freqs <- oligoFrequency(Hsapiens$chr1, MarkovOrder=3)
os <- oligoSummary(seq, oligoLength=6, MarkovOrder=3,
quickMotif=FALSE, freqs=freqs)

freqs
A C G T AAA AAC AAG AAT ACA
0.2630721269 0.1886631689 0.1886316745 0.2634647638 0.0334716679 0.0128935366 0.0179881397 0.0215514929 0.0178128784
ACC ACG ACT AGA AGC AGG AGT ATA ATC
0.0107751226 0.0022441549 0.0145656929 0.0200149752 0.0130548281 0.0167887847 0.0145635986 0.0176340383 0.0119421488
....etc

seq
GRanges object with 8029 ranges and 5 metadata columns:
seqnames ranges strand | peakNames score upstream downstream
|
[1] chr1 180700-181906 * | gr2__sig_R2_pea..,gr1__sig_R1_pea.. 178.0 20 20
[2] chr1 909801-910737 * | gr2__sig_R2_pea..,gr1__sig_R1_pea.. 54.5 20 20
[3] chr1 923791-925753 * | gr2__sig_R2_pea..,gr1__sig_R1_pea.. 49.0 20 20
[4] chr1 928875-929592 * | gr1__sig_R1_pea..,gr2__sig_R2_pea.. 40.5 20 20
[5] chr1 939838-943041 * | gr2__sig_R2_pea..,gr1__sig_R1_pea.. 23.5 20 20
... ... ... ... . ... ... ... ...
[8025] chrX 153672778-153674038 * | gr2__sig_R2_pea..,gr1__sig_R1_pea.. 24.5 20 20
[8026] chrX 153689045-153689654 * | gr1__sig_R1_pea..,gr2__sig_R2_pea.. 34.5 20 20
[8027] chrX 153879164-153879902 * | gr1__sig_R1_pea..,gr2__sig_R2_pea.. 26.5 20 20
[8028] chrX 155026647-155027494 * | gr2__sig_R2_pea..,gr1__sig_R1_pea.. 112.0 20 20
[8029] chrY 4034003-4034678 * | gr1__sig_R1_pea..,gr2__sig_R2_pea.. 61.0 20 20
sequence

[1] NNNNNNNNNNNNNNNNNNNN..
[2] TTGCCTCCCACAGGCTGACA..
[3] GCCTGACCAACATGGTGAAA..
[4] GTGCACAACGACACGATGTC..
[5] TGGGCGCAGAGCTGTTCCTG..
... ...
[8025] TCCGGCCCTGGAAGCCCTGG..
[8026] GGCACTGCTGTCGACGGCAA..
[8027] TTGTCTCCGAGCAACTTAAG..
[8028] AAATAACTCAAGTTTTTTTC..
[8029] TAATCCGTAAGGAACTTAAT..

seqinfo: 24 sequences from an unspecified genome

os file I am getting:

os
$zscore
aaaaaa aaaaac aaaaag aaaaat aaaaca aaaacc aaaacg aaaact aaaaga
-1.386023e+03 -6.786546e+01 -8.377371e+01 -1.091797e+02 -4.729527e+01 -3.655862e+01 -9.319195e+00 -4.210806e+01 -4.731548e+01
aaaagc aaaagg aaaagt aaaata aaaatc aaaatg aaaatt aaacaa aaacac
-3.692042e+01 -5.187627e+01 -4.000877e+01 -4.078904e+01 -4.320826e+01 -4.690237e+01 -5.486493e+01 -1.205061e+01 -2.717602e+01
...etc

$counts
aaaaaa aaaaac aaaaag aaaaat aaaaca aaaacc aaaacg aaaact aaaaga aaaagc aaaagg aaaagt aaaata aaaatc aaaatg aaaatt aaacaa aaacac
4231 3063 3887 4702 3660 2429 688 2803 3782 2465 2726 2868 4509 2631 3666 4179 3327 2359
aaacag aaacat aaacca aaaccc aaaccg aaacct aaacga aaacgc aaacgg aaacgt aaacta aaactc aaactg aaactt aaagaa aaagac aaagag aaagat
2806 2703 2622 2578 415 2135 504 445 515 573 2015 2409 2553 2411 3972 2157 2909 2535
...etc

$expCnt
aaaaaa aaaaac aaaaag aaaaat aaaaca aaaacc aaaacg aaaact aaaaga aaaagc aaaagg
8021.51382 5789.99652 6687.38991 7376.03652 5605.44329 4066.77318 958.78499 4664.63641 5704.78352 4118.56570 4981.71410
aaaagt aaaata aaaatc aaaatg aaaatt aaacaa aaacac aaacag aaacat aaacca aaaccc
4638.68760 6076.86164 4549.56036 5597.14974 6229.18700 3866.52858 3569.03014 4436.17871 3611.63569 3316.70012 2987.08929
...etc

$motifs
[1] NA
NB all of these sections are the same length (>3000 entries), I just shortened them for this post.

Any idea what might be causing this issue, and how I can fix it?

Many thanks,
Jess

R version - 4.2.0
ChIPpeakAnno version - 3.32.0

error in enrichmentPlot() when providing a partially empty GO object

I recently encountered an error during a call to enrichmentPlot():

> x <- getEnrichedGO(annotatedPeakList, orgAnn="org.Hs.eg.db", condense=TRUE)
> enrichmentPlot(x)
Error in `[.data.frame`(.ele, !is.na(.ele$pvalue), c(cn.id, cn.term, "pvalue",  :  undefined columns selected

Upon investigation I noticed that one of the list items in x was an empty dataframe:

> x$mf
 [1] go.id              go.term            Definition         Ontology           count.InDataset   
 [6] count.InGenome     pvalue             totaltermInDataset totaltermInGenome  EntrezID          
<0 rows> (or 0-length row.names)

I could successfully plot with enrichmentPlot() by adding a row of NAs:

> x$mf[nrow(x$mf)+1,] <- NA
> enrichmentPlot(x)

It might be worth adding a test for an empty df into enrichmentPlot(), e.g. similar to this:

for (item in c("bp", "mf", "cc")) {
     if (nrow(x[[item]]) == 0) {
      x[[item]][nrow(x[[item]])+1,] <- NA
    }
}

assignChromosomeRegion

Hi there,

the assignChromosomeRegion function assigned the peak region to exon, 5'utr, 3'utr, intron, etc. Since exon include CDS, 5'utr and 3'utr, I wonder how you define exon here, does it overlap with 5'utr and 3'utr?

Thanks.

save featureAlignedHeatmap output to picture file ?

Hi,
I don't find how to save featureAlignedHeatmap output to a picture file.
I try the method with png function followed by the featureAlignedHeatmap command and dev.off() but the image is white.
I also try ggsave but there are only legend and track but no heatmaps.

Thank you in advance for your help.

How to set the circle size same proportion of the number of peaks when there are more than 2 sets using `makeVennDiagram`?

Hello,

library(ChIPpeakAnno)

Control_Menin <- toGRanges("peaks/changeformat/4934_Control_Menin_peaks.narrowPeak",
                               format="narrowPeak", header=FALSE)
Control_Menin$score <- as.numeric(Control_Menin$score)
Control_Menin_reduce <- reduce(Control_Menin, drop.empty.ranges=FALSE, min.gapwidth=1L,
                      with.revmap=FALSE,
                      with.inframe.attrib=FALSE, ignore.strand=FALSE)


Control_Set1a <- toGRanges("peaks/changeformat/4934_Control_Set1a_peaks.narrowPeak",
                           format="narrowPeak", header=FALSE)
Control_Set1a$score <- as.numeric(Control_Set1a$score)
Control_Set1a_reduce <- reduce(Control_Set1a, drop.empty.ranges=FALSE, min.gapwidth=1L,
                               with.revmap=FALSE,
                               with.inframe.attrib=FALSE, ignore.strand=FALSE)

MLL4_MERGED <- toGRanges("peaks/changeformat/SICER_HOMER_MLL4_MERGED.bed",
                             format="BED", header=FALSE)
MLL4_MERGED$score <- as.numeric(MLL4_MERGED$score)
MLL4_MERGED_reduce <- reduce(MLL4_MERGED, drop.empty.ranges=FALSE, min.gapwidth=1L,
                                 with.revmap=FALSE,
                                 with.inframe.attrib=FALSE, ignore.strand=FALSE)


ol <- findOverlapsOfPeaks(Control_Menin_reduce, Control_Set1a_reduce, MLL4_MERGED_reduce)
makeVennDiagram(ol, NameOfPeaks=c("Control_Menin", "Control_Set1a",
                                  "MLL4_MERGED"),
                fill=c("#CC79A7", "#56B4E9", "#F0E442"),
                col=c("#D55E00", "#0072B2", "#E69F00"), 
                cat.col=c("#D55E00", "#0072B2", "#E69F00"))

When there are only 2 sets, makeVennDiagram show the circle size in the same proportion as the number of peaks.

But when there are 3 sets or more, makeVennDiagram show the circle size all the same.

How to set the circle size in the same proportion as the number of peaks when there are more than 2 sets using makeVennDiagram?
Thanks

makeVennDiagram fails with error 'Unexpected parameter length for "col"'

I have a list beds where each element of the list is a GR object generated by ChIPpeakAnno::toGRanges. When I run ChIPpeakAnno::findOverlapsOfPeaks and ChIPpeakAnno::makeVennDiagram on just two elements, I get a Venn diagram as expected. When I try to run these functions with three elements of beds I get the following error:

ol <- ChIPpeakAnno::findOverlapsOfPeaks(
+   beds[[1]], 
+   beds[[2]], 
+   beds[[3]],
+   maxgap = 500
+ )
> ChIPpeakAnno::makeVennDiagram(
+   ol,
+   fill = c("pink", "lightblue"), # circle fill color
+   col = c("red", "blue"), #circle border color
+   cat.col = c("red", "blue")
+ )
Missing totalTest! totalTest is required for HyperG test. 
If totalTest is missing, pvalue will be calculated by estimating 
the total binding sites of encoding region of human.
totalTest = humanGenomeSize * (2%(codingDNA) + 
             1%(regulationRegion)) / ( 2 * averagePeakWidth )
          = 3.3e+9 * 0.03 / ( 2 * averagePeakWidth)
          = 5e+7 /averagePeakWidth
Error in VennDiagram::draw.triple.venn(area1 = length(A), area2 = length(B),  : 
  Unexpected parameter length for "col"

What does this error mean, and how can I fix it? Thanks.

overlapofpeak issue

I've been having issues uploading either bed file (GR1) or the genome (anno).
Don't know what the issue is.
How can I fix it?

######################################### R console ####################

## acquiring bed file ##
bed <- "C:/Users/gerar/Desktop/lab.data/ATAC-seq.files/DiffBind.noGenes.bed"
## bed to Granges ##
gr1 <- toGRanges(bed, format = 'BED', header = T)
duplicated or NA names found. Rename all the names by numbers.
Warning message:
In formatStrand(strand) : All the characters for strand,
other than '1', '-1', '+', '-' and '',
will be converted into ''.
.head(gr1)
Error in .head(gr1) : could not find function ".head"
head(gr1)
GRanges object with 6 ranges and 1 metadata column:
seqnames ranges strand | score
|
X0001 chr1 24611619-24616113 * | 0
X0002 chr17 6429603-6430857 * | 0
X0003 chr6 47650642-47651681 * | 0
X0004 chr7 7049462-7050448 * | 0
X0005 chrY 90825663-90829168 * | 0
X0006 chr12 104718856-104719450 * | 0

seqinfo: 22 sequences from an unspecified genome; no seqlengths

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
anno <- toGRanges(TxDb.Mmusculus.UCSC.mm10.knownGene, feature = 'gene')
## Acquiring GFF file ##
#library(GenomicFeatures)
#txdb <- makeTxDbFromGFF('C:/Users/gerar/Desktop/lab.data/ATAC-seq.files/GRCm38.99.gff3')
#anno <- toGRanges(txdb)
head(anno)
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand

100009600 chr9 21062393-21073096 -
100009609 chr7 84935565-84964115 -
100009614 chr10 77711457-77712009 +
100009664 chr11 45808087-45841171 +
100012 chr4 144157557-144162663 -
100017 chr4 134741554-134768024 -

seqinfo: 66 sequences (1 circular) from mm10 genome

ol <- findOverlapsOfPeaks(gr1, anno)
Error in FUN(X[[i]], ...) : Inputs contains duplicated ranges.
please recheck your inputs.

makeVennDiagram produces undesired console output

Hi,

I'm using the makeVennDiagram function as in the following code, but together with the plot, it prints lots of output in console.

This is not ideal because it makes it difficult to read the output when including the function in a rmakrdown.

Is there a sort of verbose argument to remove this annoying output?

mol <- findOverlapsOfPeaks(gr1, gr2)
makeVennDiagram(mol, NameOfPeaks=names(mrrplist), by="region",
    fill=c(colors[2], colors[7]), # circle fill color
    col=c(colors[4], colors[3]), #circle border color
    cat.col=c(colors[4], colors[3]))

This is a screenshot of the output:

Thanks,
Dario

How to understand otherExon

Hi Jianhong,

I have some questions about the concept of otherExon of the genomicElementDistribution function, how can I understand this concept, and what kind of peaks will be considered to come from belonging to otherExon.

Thanks,

LeeLee

getAllPeakSequence() incongrous length between seq and myPeakList

Hello -- I wanted to share a bug I found and the solution that worked for me:

> seqNow <- getAllPeakSequence(MSC_NT_FLI1, genome = Hsapiens)
Error in `[[<-`(`*tmp*`, name, value = c(MSC_NT_FLI1_1 = "GAACCCGGCAGGGGCGGGAAGACGCAGGAGTGGGGAGGCGGAACCGGGACCCCGCAGAGCCCGGGTCCCTGCGCCCCACAAGCCTTGGCTTCCCTGCTAGGGCCGGGCAAGGCCGGGTGCAGGGCGCGGCTCCAGGGAGGAAGCTCCGGGGCGAGCCCAAGACGCCTCCCGGGCGGTCGGGGCCCAGCGGCGGCGTTCGCAGTGGAGCCGGGCACCGGGCAGCGGCCGCGGAACACCAGCTTGGCGCAGGCTTCTCGGTCAGGAACGGTCCCGGGCCTCCCGCCCGCCTCCCTCCAGCCCCTCCGGGTCCCCTACTTCGCCCCGCCAGGCCCCCACGACCCTACTTCCCGCGGCCCCGGACGCCTCCTCACCTGCGAGCCGCCCTCCCGGAAGCTCCCGCCGCCGCTTCCGCTCTGCCGGAGCCGCTGGGTCCTAGCCCCGCCGCCCCCAGTCCGCCCGCGCCTCCGGGTCCTAACGCCGCCGCTCGCCCTCCACTGCGCCCTCCCCGAGCGCGGCTCCAGGACCCCGTCGACCCGGAGCGCTGTCCTGTCGGGCCGAGTCGCGGGCCTGGGCACGGAACTCACGCTCACTCCGAGCTCCCGACGTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCTACCCGTGCTTTCTGCTCTGCAGACCCTCTTCCTAGACCTCCGTCCTTTGTCCCATCGCTGCCTTCCCCTCAAGCTCAGGGCCAAGCTGTCCGCCAACCTC",  : 
  19581 elements in value to replace 19719 elements
In addition: Warning message:
In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': HSCHR1_CTG1_UNLOCALIZED, HSCHR1_CTG2_UNLOCALIZED, HSCHR1_CTG3_UNLOCALIZED, HSCHR1_CTG4_UNLOCALIZED, HSCHR1_CTG5_UNLOCALIZED, HSCHR1_CTG6_UNLOCALIZED, HSCHR1_CTG7_UNLOCALIZED, HSCHR1_CTG8_UNLOCALIZED, HSCHR1_CTG9_UNLOCALIZED, HSCHR2_RANDOM_CTG1, HSCHR2_RANDOM_CTG2, HSCHR3UN_CTG2, HSCHR4_RANDOM_CTG4, HSCHR5_RANDOM_CTG1, HSCHR9_UNLOCALIZED_CTG1, HSCHR9_UNLOCALIZED_CTG2, HSCHR9_UNLOCALIZED_CTG3, HSCHR9_UNLOCALIZED_CTG4, HSCHR11_CTG1_UNLOCALIZED, HSCHR14_CTG1_UNLOCALIZED, HSCHR14_CTG2_UNLOCALIZED, HSCHR14_CTG3_UNLOCALIZED, HSCHR14_CTG4_UNLOCALIZED, HSCHR14_CTG5_UNLOCALIZED, HSCHR14_CTG6_UNLOCALIZED, HSCHR14_CTG7_UNLOCALIZED, HSCHR14_CTG8_UNLOCALIZED, HSCHR15_RANDOM_CTG1, HSCHR16_RANDOM_CTG1, HSCHR17_RANDOM_CTG3, HSCHR17_RANDOM_CTG4, HSCHR17_RANDOM_CTG5, HSCHR22_UNLOCALIZED_CTG1, HSCHR22_UNLOCALIZED_CTG2, HSCHR22_UNLOCALIZED_CTG3, HSCHR22_UNLOCALIZED_CTG4, HSCHR22_UNLOCALIZED_CTG5, HSCHR22_UNLOCALIZED_CTG6, HSCH [... truncated]

The issue is that seq from seq <- getSeq(genome, myPeakList, as.character=TRUE) may not be the same length as myPeakList. This can be fixed by adding another line: myPeakList <- myPeakList[names(myPeakList) %in% names(seq),]

Best,
Henry

toGRanges() can't read in file name

Hi Jianhong,

This is related to the issue posted at https://support.bioconductor.org/p/9141976/#9142000.

I tested with the following commands:

bed <- "overlap_flag_coordinates_with_r_duplicates_removed.bed"
bed_df <- read.table(bed, header = FALSE, sep="\t",stringsAsFactors=FALSE, quote="")

#1: 
toGRanges(bed, format="BED")
# Error: each range must have an end that is greater or equal to its start minus one

#2:
toGRanges(bed_df)
# Error: Error in (function (data, colNames = NULL, format = "", ...)  : 
  colname must contain space/seqnames, start and end.
# I manually checked, the (ends - starts) are strictly larger than 1.

#3:
toGRanges(bed, format="narrowPeak")
# this works

ChIPpeakAnno_3.26.4

The above suggests that toGRanges() might have some glitch with dealing with "BED" format.

invalid class "CompressedGRangesList" object

Hey,

Having an issue with 'toGranges', which doesn't seem to like the TxDb objects I've passed to it. I've tried both human and mouse TxDb objects to make sure it wasn't an issue with the TxDb object itself (both commands below), just showing the error log for the human command (error logs were identical)

annoData <- toGRanges(TxDb.Hsapiens.UCSC.hg38.knownGene, feature="gene")
annoData <- toGRanges(TxDb.Mmusculus.UCSC.mm10.knownGene, feature="gene")

> annoData <- toGRanges(TxDb.Hsapiens.UCSC.hg38.knownGene, feature="gene")
  348 genes were dropped because they have exons located on both strands of the same reference
  sequence or on more than one reference sequence, so cannot be represented by a single genomic
  range.
  Use 'single.strand.genes.only=FALSE' to get all the genes in a GRangesList object, or use
  suppressMessages() to suppress this message.
Error in validObject(.Object) : 
  invalid class "CompressedGRangesList" object: 
    improper partitioning
In addition: Warning message:
In normarg_mcols(value, class(x), length(x)) :
  You supplied a metadata column of length 197382 to set on an object of length 247541. However
  please note that the latter is not a multiple of the former.

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0  ChIPpeakAnno_3.22.4                      
 [3] Biostrings_2.58.0                         XVector_0.29.3                           
 [5] org.Mm.eg.db_3.11.4                       TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
 [7] GenomicFeatures_1.42.0                    AnnotationDbi_1.52.0                     
 [9] Biobase_2.50.0                            bumphunter_1.30.0                        
[11] locfit_1.5-9.4                            iterators_1.0.13                         
[13] foreach_1.5.1                             GenomicRanges_1.40.0                     
[15] GenomeInfoDb_1.26.0                       IRanges_2.24.0                           
[17] S4Vectors_0.28.0                          BiocGenerics_0.36.0                      
[19] dplyr_1.0.2                              

loaded via a namespace (and not attached):
 [1] ProtGenerics_1.22.0         bitops_1.0-6                matrixStats_0.57.0         
 [4] bit64_4.0.5                 progress_1.2.2              httr_1.4.2                 
 [7] tools_4.0.3                 doRNG_1.8.2                 R6_2.5.0                   
[10] DBI_1.1.0                   lazyeval_0.2.2              ade4_1.7-16                
[13] tidyselect_1.1.0            prettyunits_1.1.1           bit_4.0.4                  
[16] curl_4.3                    compiler_4.0.3              VennDiagram_1.6.20         
[19] graph_1.68.0                formatR_1.7                 xml2_1.3.2                 
[22] DelayedArray_0.16.0         rtracklayer_1.50.0          RBGL_1.66.0                
[25] askpass_1.1                 rappdirs_0.3.1              stringr_1.4.0              
[28] digest_0.6.27               Rsamtools_2.6.0             pkgconfig_2.0.3            
[31] MatrixGenerics_1.2.0        dbplyr_1.4.4                ensembldb_2.14.0           
[34] limma_3.44.3                BSgenome_1.58.0             regioneR_1.20.1            
[37] rlang_0.4.8                 rstudioapi_0.11             RSQLite_2.2.1              
[40] generics_0.0.2              BiocParallel_1.24.0         RCurl_1.98-1.2             
[43] magrittr_1.5                GO.db_3.11.4                GenomeInfoDbData_1.2.4     
[46] futile.logger_1.4.3         Matrix_1.2-18               Rcpp_1.0.5                 
[49] lifecycle_0.2.0             stringi_1.5.3               MASS_7.3-53                
[52] SummarizedExperiment_1.20.0 zlibbioc_1.36.0             BiocFileCache_1.14.0       
[55] grid_4.0.3                  blob_1.2.1                  crayon_1.3.4               
[58] lattice_0.20-41             splines_4.0.3               multtest_2.44.0            
[61] hms_0.5.3                   pillar_1.4.6                seqinr_4.2-4               
[64] rngtools_1.5                codetools_0.2-16            biomaRt_2.46.0             
[67] futile.options_1.0.1        XML_3.99-0.5                glue_1.4.2                 
[70] lambda.r_1.2.4              BiocManager_1.30.10         idr_1.2                    
[73] vctrs_0.3.4                 openssl_1.4.3               purrr_0.3.4                
[76] assertthat_0.2.1            xfun_0.18                   AnnotationFilter_1.14.0    
[79] survival_3.2-7              tibble_3.0.4                GenomicAlignments_1.26.0   
[82] tinytex_0.26                memoise_1.1.0               ellipsis_0.3.1

Bug in annotatePeakInBatch (fix found): invalid class "DFrame" object: column names should not be NULL

Hello -- I found a bug in annotatePeakInBatch when using a different annotation data from the one shown in the vignette. Here is my code:

> binOverFeature(BRCA2, annotationData = exons)
Error in validObject(result) : invalid class "DFrame" object: 
    column names should not be NULL
In addition: Warning message:
In annotatePeakInBatch(Peaks, AnnotationData = annotationData, output = "overlapping",  :
  Found duplicated names in myPeakList. 
                    Changing the peak names ...

I traced the issue to this line in annotatePeakInBatch:

subjectHits$output <- 
            dist[!is.na(dist$subjectHits),"output"]

The problem is fixed by changing this to:

mcols(subjectHits)$output <- 
            dist[!is.na(dist$subjectHits),"output"]

hello，I don't know how to deal with it，can you help me figure it out. thanks

library(ChIPseeker)
library(ChIPpeakAnno)
setwd("F:/atac/narrowpeak")
list.files('F:/atac/narrowpeak',".narrowPeak")
tmp = lapply(list.files('F:/atac/narrowpeak',".narrowPeak"), function(x){
return(readPeakFile(file.path('F:/atac/narrowpeak', x)))
})
tmp
[[1]]
GRanges object with 139237 ranges and 7 metadata columns:
seqnames ranges strand | V4 V5 V6 V7 V8 V9 V10
|
[1] chr1 135821-136120 * | 1_peak_1 25 . 3.40463 4.85833 2.58011 84
[2] chr1 137973-139817 * | 1_peak_2a 157 . 7.94413 19.27968 15.71903 384
[3] chr1 137973-139817 * | 1_peak_2b 188 . 8.79529 22.50781 18.89067 1158
[4] chr1 140946-141278 * | 1_peak_3 42 . 4.25578 7.10628 4.23111 191
[5] chr1 143264-143660 * | 1_peak_4 42 . 4.25578 7.10628 4.23111 280
... ... ... ... . ... ... ... ... ... ... ...
[139233] chrX 138957140-138957339 * | 1_peak_127003 36 . 3.97207 6.32712 3.63172 98
[139234] chrX 138975416-138975873 * | 1_peak_127004 36 . 3.97207 6.32712 3.63172 130
[139235] chrX 138992372-138992803 * | 1_peak_127005 21 . 3.12091 4.17304 2.11568 273
[139236] chrX 138998481-138998715 * | 1_peak_127006 30 . 3.68835 5.57713 3.08387 125
[139237] chrX 139004630-139005065 * | 1_peak_127007 21 . 3.12091 4.17304 2.11568 154

seqinfo: 687 sequences from an unspecified genome; no seqlengths

[[2]]
GRanges object with 137755 ranges and 7 metadata columns:
seqnames ranges strand | V4 V5 V6 V7 V8 V9 V10
|
[1] chr1 138181-138448 * | 2_peak_1 33 . 4.19834 5.93764 3.37236 109
[2] chr1 139027-139671 * | 2_peak_2 71 . 6.10668 10.62839 7.16524 143
[3] chr1 142613-142812 * | 2_peak_3 22 . 3.43501 4.30215 2.20563 80
[4] chr1 143008-145741 * | 2_peak_4a 80 . 6.48835 11.65180 8.09798 219
[5] chr1 143008-145741 * | 2_peak_4b 144 . 8.77836 18.26648 14.46424 503
... ... ... ... . ... ... ... ... ... ... ...
[137751] chrX 138928118-138928380 * | 2_peak_124867 40 . 4.58001 6.81229 4.02146 149
[137752] chrX 138962022-138962357 * | 2_peak_124868 33 . 4.19834 5.93764 3.37236 103
[137753] chrX 138971639-138971838 * | 2_peak_124869 40 . 4.58001 6.81229 4.02146 86
[137754] chrX 138992405-138992997 * | 2_peak_124870 33 . 4.19834 5.93764 3.37236 192
[137755] chrX 139006685-139006884 * | 2_peak_124871 22 . 3.43501 4.30215 2.20563 122

seqinfo: 640 sequences from an unspecified genome; no seqlengths

[[3]]
GRanges object with 211102 ranges and 7 metadata columns:
seqnames ranges strand | V4 V5 V6 V7 V8 V9 V10
|
[1] chr1 136058-136257 * | 3_peak_1 48 . 4.76810 7.40445 4.83368 104
[2] chr1 137932-141148 * | 3_peak_2a 89 . 6.60198 12.24618 8.94679 370
[3] chr1 137932-141148 * | 3_peak_2b 141 . 8.43586 17.67540 14.10315 589
[4] chr1 137932-141148 * | 3_peak_2c 80 . 6.23520 11.22568 8.03155 1144
[5] chr1 137932-141148 * | 3_peak_2d 41 . 4.40132 6.52308 4.13515 2657
... ... ... ... . ... ... ... ... ... ... ...
[211098] chrX 138997833-138998134 * | 3_peak_174522 17 . 2.93421 3.36781 1.72653 152
[211099] chrX 138998257-138998780 * | 3_peak_174523 34 . 4.03454 5.67568 3.47359 187
[211100] chrX 139004391-139005190 * | 3_peak_174524 34 . 4.03454 5.67568 3.47359 121
[211101] chrX 139006713-139007225 * | 3_peak_174525a 41 . 4.40132 6.52308 4.13515 114
[211102] chrX 139006713-139007225 * | 3_peak_174525b 41 . 4.40132 6.52308 4.13515 410

seqinfo: 686 sequences from an unspecified genome; no seqlengths
ol <- findOverlapsOfPeaks(tmp[[1]],tmp[[2]],tmp[[3]])

Error in FUN(X[[i]], ...) : Inputs contains duplicated ranges.
please recheck your inputs.

Questions on the how genomicElementDistribution works

hi @jianhong ,

First of all, thank you for providing such a nice tool! I am currently using the tool to analyze ATAC-seq and it works beautifully. I have some questions on the results from genomicElementDistribution.

In "genomicEelementDistribution", there are options named as "promoterRegion" and "geneDownstream". I wonder if promoterRegion is calculated based on the transcription start site (5'end of gene) of gene. Second, I wonder if either promoterRegion and geneDownstream option consider strand (+ or - strand) of gene.

so for example, let's say there is a gene name as "Tubulin" whose coordinates of start and end are 1000 and 2000 (start and end coordinates specified in GTF file), respectively, and strand is "-" and options for promoterRegion and geneDownstream are set as below

promoterRegion=c(upstream=500, downstream=500),
geneDownstream=c(upstream=0, downstream=200))

If the strandness of gene is considered, promoter region would be 1500-2500 (5'end of Tubulin is 2000 considering that the strand is "-") and downstream region would be 800-1000 (3'end of Tubulin in 1000 considering that the strand is "-").
So it would be great if you can confirm that genomicElementDistribution consider the strandness of gene or not.

Lastly, as far as I understand, If the peaks overlap multiple features (exon, intron, etc), single feature is assigned to the peak based on the pre-determined priority. It would be great if you can tell me how the features are prioritized.

In my R code, I make txdb from gencode gene annotation using the makeTxdbFromGFF as shown below

R code

setwd(dir)
library(ChIPpeakAnno)
library(GenomicFeatures)
gtf <- "gencode.v29.annotation.gtf" ## genocde annoataion
txdb <- makeTxDbFromGFF('gencode.v29.annotation.gtf')
test_peak_file <- "ATAC-seq_peaks.bed"
test_peak <- toGRanges(test_peak_file, format="BED")
out<- genomicElementDistribution(test_peak,
TxDb = txdb,
promoterRegion=c(upstream=5000, downstream=5000),
geneDownstream=c(upstream=0, downstream=2000))

Please let me know if there is any flaw in the code.
Thank you and have a nice day!

`findOverlapsOfPeaks(gr1, gr2)` Error in FUN(X[[i]], ...) : Inputs contains duplicated ranges. please recheck your inputs.

Hello:

library(ChIPpeakAnno)

gr1 <- toGRanges("macs2/MSKPCa3_STAT1_IGO_11795_B_1_peaks.narrowPeak", format="narrowPeak", header=FALSE)
gr1
## one can also try import from rtracklayer

gr2 <- toGRanges("macs2/MSKPCa3_STAT3_IGO_11795_B_2_peaks.narrowPeak", format="narrowPeak", header=FALSE)
## must keep the class exactly same as gr1$score, i.e., numeric.
gr2$score <- as.numeric(gr2$score)
gr2


ol <- findOverlapsOfPeaks(gr1, gr2)

Can you please let me know how to fix this error?

Error in FUN(X[[i]], ...) : Inputs contains duplicated ranges. 
             please recheck your inputs.

Thanks

genomicElementDistribution not found

Good work I was running code getting error as below ChiPpeakAnno installed properly still getting error can you please help me to resolve the issue, library loaded without ant error

Error in genomicElementDistribution(peaks, TxDb = TxDb.Hsapiens.UCSC.hg19.knownGene, :
could not find function "genomicElementDistribution"

Thank you

Differences between bedtools overlap / findOverlaps and findOverlapsOfPeaks

Thanks for the great tool!

I'm a bit unclear why there is a difference between the number of overlapping peaks found by bedtools overlap and findOverlaps (which find the exact same number of overlapping peaks), and that of findOverlapsOfPeaks, which seems to with miss a few of the peaks or filter them out. I am just running the functions with default parameters. I have tried altering the connectedPeaks parameter, but still get the same number of overlapping peaks. Can you tell me why there are missing peaks and how I can recover them?

jianhong / chippeakanno Goto Github PK

chippeakanno's Introduction

ChIPpeakAnno

Installation

Documentation

chippeakanno's People

Contributors

Stargazers

Watchers

Forkers

chippeakanno's Issues

summary of the short oligos

R code

Recommend Projects

Recommend Topics

Recommend Org