rnabioco / valr

Genome Interval Arithmetic in R

Home Page: http://rnabioco.github.io/valr/

License: Other

R 69.08% C++ 29.94% C 0.71% Rez 0.28%

genome interval-arithmetic bedtools

valr's Introduction

valr

valr provides tools to read and manipulate genome intervals and signals, similar to the BEDtools suite.

Installation

The latest stable version can be installed from CRAN:

install.packages('valr')

The latest development version can be installed from github:

# install.packages("devtools")
devtools::install_github('rnabioco/valr')

valr Example

Functions in valr have similar names to their BEDtools counterparts, and so will be familiar to users coming from the BEDtools suite. Unlike other tools that wrap BEDtools and write temporary files to disk, valr tools run natively in memory. Similar to pybedtools, valr has a terse syntax:

library(valr)
library(dplyr)

snps <- read_bed(valr_example("hg19.snps147.chr22.bed.gz"))
genes <- read_bed(valr_example("genes.hg19.chr22.bed.gz"))

# find snps in intergenic regions
intergenic <- bed_subtract(snps, genes)
# find distance from intergenic snps to nearest gene
nearby <- bed_closest(intergenic, genes)

nearby |>
  select(starts_with("name"), .overlap, .dist) |>
  filter(abs(.dist) < 5000)
#> # A tibble: 1,047 × 4
#>    name.x      name.y   .overlap .dist
#>    <chr>       <chr>       <int> <int>
#>  1 rs530458610 P704P           0  2579
#>  2 rs2261631   P704P           0  -268
#>  3 rs570770556 POTEH           0  -913
#>  4 rs538163832 POTEH           0  -953
#>  5 rs190224195 POTEH           0 -1399
#>  6 rs2379966   DQ571479        0  4750
#>  7 rs142687051 DQ571479        0  3558
#>  8 rs528403095 DQ571479        0  3309
#>  9 rs555126291 DQ571479        0  2745
#> 10 rs5747567   DQ571479        0 -1778
#> # ℹ 1,037 more rows

valr's Issues

Parallelize with RcppParallel

by group using GroupedDataFrame.

http://gallery.rcpp.org/tags/parallel/

implement bed_absdist

Modify reldist.cpp and bed_reldist.r to add absdist_impl and bed_absdist. This should be easy.

See Fig 3 in http://journals.plos.org/ploscompbiol/article?id=10.1371%2Fjournal.pcbi.1002529

pass groups through bed_map

from the README:

x <- tss %>%
  ...
  group_by(win_id)

bed_map(x, y, ...) %>%
  group_by(win_id.x) %>%
  ...

Consider passing groups within bed_map so that the second group_by wouldn't be required. Might be surprising as the groups coming out would be suffixed.

implement bed_coverage

http://bedtools.readthedocs.io/en/latest/content/tools/coverage.html

implement gtf_to_bed12

using tidyr. update vignettes.

bed_merge implementations

library(dplyr)

# row-wise table construction needs tribble(), not tibble()
bed_tbl <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 100,    500,
  "chr1", 300,    400,
  "chr1", 399,    401,
  "chr1", 200,    600,
  "chr1", 800,   1000,
  "chr2", 100,    200,
  "chr2", 150,    250,
  "chr3", 500,   1000
)

bed_tbl %>% group_by(chrom) %>%
  mutate(.overlap = lead(start) - end,
         .overlap = ifelse(.overlap < 0, -1, 1),
         .rank = dense_rank(.overlap))
#> Source: local data frame [8 x 5]
#> Groups: chrom [3]
#> 
#>   chrom start   end .overlap .rank
#>   (chr) (dbl) (dbl)    (dbl) (int)
#> 1  chr1   100   500       -1     1
#> 2  chr1   300   400       -1     1
#> 3  chr1   399   401       -1     1
#> 4  chr1   200   600        1     2
#> 5  chr1   800  1000       NA    NA
#> 6  chr2   100   200       -1     1
#> 7  chr2   150   250       NA    NA
#> 8  chr3   500  1000       NA    NA

Need to:

identify runs of 1s (maybe a purrr function for this)
assign min(start), max(end) to those groups
impl overlap constraint
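
The three steps above can be sketched in base R. This is only an illustration, not valr's implementation: `merge_intervals` and its `max_dist` argument are hypothetical names, and book-ended intervals (start equal to a previous end) are assumed to merge. Instead of looking only at `lead(start) - end`, it compares each start with the running maximum of earlier ends, so an interval nested inside an earlier, longer one stays in the same run.

```r
# Hypothetical helper (not part of valr): merge overlapping intervals per chrom.
merge_intervals <- function(df, max_dist = 0) {
  df <- df[order(df$chrom, df$start, df$end), ]
  pieces <- lapply(split(df, df$chrom), function(d) {
    # largest end seen among all earlier intervals on this chrom
    prev_max_end <- c(-Inf, head(cummax(d$end), -1))
    # a new run starts when the gap to everything seen so far exceeds max_dist
    run_id <- cumsum(d$start - prev_max_end > max_dist)
    data.frame(
      chrom = d$chrom[1],
      start = as.numeric(tapply(d$start, run_id, min)),  # min(start) per run
      end   = as.numeric(tapply(d$end, run_id, max)),    # max(end) per run
      stringsAsFactors = FALSE
    )
  })
  res <- do.call(rbind, pieces)
  rownames(res) <- NULL
  res
}

# Same intervals as the example table above.
bed_tbl <- data.frame(
  chrom = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr2", "chr2", "chr3"),
  start = c(100, 300, 399, 200, 800, 100, 150, 500),
  end   = c(500, 400, 401, 600, 1000, 200, 250, 1000),
  stringsAsFactors = FALSE
)

merged <- merge_intervals(bed_tbl)
merged
```

Setting `max_dist` above zero would also merge intervals separated by small gaps, which is one possible reading of the "overlap constraint" item.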

pull tests from the bedtools2 repo

If you are writing tests I would pull the examples straight from the bedtools2 repo. I figure they will have hit all of the relevant corner cases we should be testing against.

Put the link to the test at the top of the test_ file. e.g.

https://github.com/arq5x/bedtools2/blob/master/test/cluster/test-cluster.sh

cc @kriemo @mackie90125

Pure R impl of intersect

This could be done with a custom join operation. However, dplyr does not currently support NSE for joins. See tidyverse/dplyr#557. This is apparently in the works.

tools to implement

bedtools suite

Anything from bedops?

bed_cluster docs / tests

need to be implemented

unnecessary join warning in bed_window()

Somewhere in bed_window one of the functions is generating a chrom column as a <fctr> when it should be a <chr>. This causes the join warning below. Minor.

library(valr)

genome <- read_genome(system.file('extdata', 'hg19.chrom.sizes.gz', package = 'valr'))

x <- bed_random(genome, n = 100)
y <- bed_random(genome, n = 100)

# a few intersections
bed_intersect(x, y)
#> Source: local data frame [0 x 6]
#> 
#> Variables not shown: chrom <fctr>, start.x <int>, end.x <int>, start.y
#>   <int>, end.y <int>, .overlap <int>.

# can be expanded by casting a wider net
bed_window(x, y, genome, both = 1e6)
#> Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
#> character vector and factor, coercing into character vector
#> Source: local data frame [6 x 6]
#> 
#>   chrom   start.x     end.x   start.y     end.y .overlap
#>   <chr>     <int>     <int>     <int>     <int>    <int>
#> 1  chr1 178630052 178631052 179274609 179275609     1000
#> 2 chr16  19722181  19723181  19845923  19846923     1000
#> 3 chr16  79075460  79076460  78546496  78547496     1000
#> 4 chr22  38638987  38639987  38556828  38557828     1000
#> 5  chr3  95547777  95548777  95802366  95803366     1000
#> 6  chr8 102562319 102563319 102338405 102339405     1000

Suffix colnames returned from bed_intersect

Think about the "right" way to suffix colnames. See #20 #14

bed_map dot spec / pass colname suffixes to bed_intersect

The bed_intersect call in bed_map yields new columns with .y suffixes, causing surprising behavior:

bed_tbl <- tibble::frame_data(
~chrom, ~start, ~end,
"chr1", 100, 250,
"chr2", 250, 500)

signal_tbl <- tibble::frame_data(
~chrom, ~start, ~end, ~value,
"chr1", 100, 250, 10,
"chr1", 150, 250, 20,
"chr2", 250, 500, 500)

bed_map(bed_tbl, signal_tbl, sum = sum(value))
#> Error: object 'value' not found
bed_map(bed_tbl, signal_tbl, sum = sum(value.y))
#> Source: local data frame [2 x 4]
#> 
#>   chrom start   end   sum
#>   <chr> <dbl> <dbl> <dbl>
#> 1  chr1   100   250    30
#> 2  chr2   250   500   500

The way around is to enable passing of suffix values to bed_intersect:

bed_intersect(x, y, suffix.y = '')

remove magrittr pipe from library

This will be a fair bit of work.

https://twitter.com/hadleywickham/status/603883121197514752?ref_src=twsrc%5Etfw

C stack usage error bed_flank()

I ran into another issue debugging bed_flank(), when using 1e6 random intervals. I'll work on both #59 and this error.

genome <- read_genome("inst/extdata/genome.txt.gz") 
x <- bed_random(genome)
bed_flank(x, genome, both = 100)
#> Error: C stack usage  34924379 is too close to the limit

use start for merge_id

Testing revealed that cluster IDs are out of order because .merge_id is lexicographically ordered by dense_rank.

It seems OK to just propagate the start value for merged intervals, as it will likely always be combined with group_by(chrom), yielding a unique combination per chrom.

need good example using `dplyr::do`

see http://varianceexplained.org/r/tidy-genomics-biobroom/ for inspiration

Don't expose Rcpp methods

It's not a good idea to have the *_impl methods exposed in the API. If someone passes e.g. an ungrouped dataframe, it will crash RStudio.

library(valr)
# this function is available
closest_impl

The Rcpp methods in dplyr are not exposed, but I don't understand why.

library(dplyr)
# this function is not
select_impl

One guess is that the Rcpp methods are hidden by the S3 objects on the dplyr side.
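
Another possible explanation, based on how R namespaces work in general rather than on anything stated in this thread: `library()` only attaches objects that a package's NAMESPACE file exports. If dplyr does not export its `*_impl` functions while valr does (for example through a broad export pattern, which is an assumption here), that would produce the difference seen above. The sketch below uses the base `stats` package purely to illustrate exported versus internal functions.

```r
ns <- asNamespace("stats")

# Names that library(stats) makes visible on the search path.
exported <- getNamespaceExports("stats")

# Everything defined in the namespace that is NOT exported.
internal <- setdiff(ls(ns, all.names = TRUE), exported)

# Keep only the internal functions; these are reachable with `:::`
# (equivalently, get() on the namespace), but not by bare name.
internal_funs <- Filter(function(nm) is.function(get(nm, envir = ns)), internal)

length(internal_funs) > 0
```

Under this reading, the fix would be to keep the `*_impl` functions out of the exports and call them only from the R wrappers.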

closest implementation

This is the bedtools2 closest algorithm: https://github.com/arq5x/bedtools2/blob/master/src/utils/NewChromsweep/CloseSweep.cpp

bed_closest is probably more easily implemented with an interval tree than a sweep algorithm, as there are a huge number of checks needed to confirm the relative ordering of intervals in the sweep case.

Need to think about how to cache these trees (per session?) if possible, as they are somewhat expensive to build and could be reused.

EKG has a minimal cxx interval tree implementation https://github.com/ekg/intervaltree/blob/master/IntervalTree.h
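
Whichever approach is chosen, a brute-force reference could be handy for checking a tree or sweep implementation on small inputs. The sketch below is an assumption-laden illustration, not the bedtools2 algorithm or valr's code: `closest_brute` is a hypothetical name, intervals are treated as half-open, overlapping or book-ended intervals get distance 0, and the sign convention (negative for upstream) is a guess.

```r
# Hypothetical O(n * m) reference: nearest y interval for each x interval.
closest_brute <- function(x, y) {
  rows <- lapply(seq_len(nrow(x)), function(i) {
    cand <- y[y$chrom == x$chrom[i], , drop = FALSE]
    if (nrow(cand) == 0) return(NULL)
    # signed gap: positive if y is downstream of x, negative if upstream,
    # zero if the intervals overlap or touch
    dist <- ifelse(cand$start >= x$end[i], cand$start - x$end[i],
            ifelse(cand$end <= x$start[i], cand$end - x$start[i], 0))
    j <- which.min(abs(dist))  # ties resolve to the first candidate
    data.frame(chrom = x$chrom[i],
               start.x = x$start[i], end.x = x$end[i],
               start.y = cand$start[j], end.y = cand$end[j],
               .dist = dist[j], stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

x <- data.frame(chrom = "chr1", start = 100, end = 200,
                stringsAsFactors = FALSE)
y <- data.frame(chrom = c("chr1", "chr1", "chr2"),
                start = c(10, 220, 100),
                end   = c(50, 300, 200),
                stringsAsFactors = FALSE)

res <- closest_brute(x, y)
res
```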

clean up function scopes before release

anything from dplyr should be used without qualification, i.e. mutate not dplyr::mutate.

anything else should be qualified with a package name i.e., from purrr, broom etc.

expand support for peak formats

narrowPeak
https://genome.ucsc.edu/FAQ/FAQformat.html#format12

broadPeak
https://genome.ucsc.edu/FAQ/FAQformat.html#format13

try to get appveyor working

before CRAN submission. stubs are in the appveyor branch.

build fails

installing to /private/var/folders/_1/4wg2xbj12_dft9p3kq4005n40000gn/T/RtmpvDL3Vf/devtools_install_beea4498c931/valr/libs
* DONE (valr)
Error in dyn.load(dllfile) : 
  unable to load shared object '/Users/jayhesselberth/devel/valr/src/valr.so':
  dlopen(/Users/jayhesselberth/devel/valr/src/valr.so, 6): Symbol not found: __ZN5dplyr23DataFrameSubsetVisitorsC1ERKN4Rcpp14DataFrame_ImplINS1_15PreserveStorageEEERKNS1_6VectorILi16ES3_EE
  Referenced from: /Users/jayhesselberth/devel/valr/src/valr.so
  Expected in: flat namespace
 in /Users/jayhesselberth/devel/valr/src/valr.so
Calls: suppressPackageStartupMessages ... <Anonymous> -> load_all -> load_dll -> library.dynam2 -> dyn.load

> devtools::session_info()
Session info ------------------------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.0 (2016-05-03)
 system   x86_64, darwin15.4.0        
 ui       RStudio (0.99.1172)         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Denver              
 date     2016-05-12                  

Packages ----------------------------------------------------------------------------------------------------------------------------------------------
 package   * version date       source        
 curl        0.9.7   2016-04-10 CRAN (R 3.3.0)
 devtools  * 1.11.1  2016-04-21 CRAN (R 3.3.0)
 digest      0.6.9   2016-01-08 CRAN (R 3.3.0)
 git2r       0.15.0  2016-05-11 CRAN (R 3.3.0)
 htmltools   0.3.5   2016-03-21 CRAN (R 3.3.0)
 httr        1.1.0   2016-01-28 CRAN (R 3.3.0)
 jsonlite    0.9.20  2016-05-10 CRAN (R 3.3.0)
 knitr       1.13    2016-05-09 CRAN (R 3.3.0)
 memoise     1.0.0   2016-01-29 CRAN (R 3.3.0)
 R6          2.1.2   2016-01-26 CRAN (R 3.3.0)
 Rcpp        0.12.4  2016-03-26 CRAN (R 3.3.0)
 rmarkdown   0.9.6   2016-05-01 CRAN (R 3.3.0)
 withr       1.0.1   2016-02-04 CRAN (R 3.3.0)
 yaml        2.1.13  2014-06-12 CRAN (R 3.3.0)

Shiny demo

Develop a demo of valr in shiny.

Feature aggregation (e.g., ChIP-seq signal around TSSs).
Interval summaries (e.g., by chrom, selectable in DT)
Heatmaps of correlations / jaccards in d3heatmap
Use flexdashboard to tie it all together

build error

The build fails with the CRAN version of dplyr. While the workaround is to install an alternate version of dplyr found at jayhesselberth/dplyr, this extra functionality should somehow be included in the valr package so that users don't have to overwrite their version of dplyr.

> install.packages("Projects/valr", repos = NULL, type = "source")
Installing package into ‘/Users/dpastling/Library/R/3.3/library’
(as ‘lib’ is unspecified)
* installing *source* package ‘valr’ ...
** libs
clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I../inst/include -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/dpastling/Library/R/3.3/library/Rcpp/include" -I"/Users/dpastling/Library/R/3.3/library/BH/include" -I"/Users/dpastling/Library/R/3.3/library/dplyr/include"   -fPIC  -Wall -mtune=core2 -g -O2 -c RcppExports.cpp -o RcppExports.o
In file included from RcppExports.cpp:4:
In file included from ./../inst/include/valr.h:5:
In file included from /Users/dpastling/Library/R/3.3/library/dplyr/include/dplyr.h:120:
/Users/dpastling/Library/R/3.3/library/dplyr/include/dplyr/GroupedDataFrame.h:58:25: error: use of undeclared identifier 'build_index_cpp'
                data_ = build_index_cpp( data_) ;
                        ^
In file included from RcppExports.cpp:4:
In file included from ./../inst/include/valr.h:5:
In file included from /Users/dpastling/Library/R/3.3/library/dplyr/include/dplyr.h:149:
/Users/dpastling/Library/R/3.3/library/dplyr/include/dplyr/Collecter.h:228:42: error: use of undeclared identifier 'get_time_classes'
            Parent::data.attr("class") = get_time_classes() ;
                                         ^
/Users/dpastling/Library/R/3.3/library/dplyr/include/dplyr/Collecter.h:255:37: error: use of undeclared identifier 'get_time_classes'
            return collapse<STRSXP>(get_time_classes()) ;
                                    ^
/Users/dpastling/Library/R/3.3/library/dplyr/include/dplyr/Collecter.h:365:54: error: use of undeclared identifier 'get_date_classes'
                return new TypedCollecter<INTSXP>(n, get_date_classes()) ;
                                                     ^
/Users/dpastling/Library/R/3.3/library/dplyr/include/dplyr/Collecter.h:371:55: error: use of undeclared identifier 'get_date_classes'
                return new TypedCollecter<REALSXP>(n, get_date_classes()) ;
                                                      ^
/Users/dpastling/Library/R/3.3/library/dplyr/include/dplyr/Collecter.h:396:54: error: use of undeclared identifier 'get_date_classes'
                return new TypedCollecter<INTSXP>(n, get_date_classes() ) ;
                                                     ^
/Users/dpastling/Library/R/3.3/library/dplyr/include/dplyr/Collecter.h:404:55: error: use of undeclared identifier 'get_date_classes'
                return new TypedCollecter<REALSXP>(n, get_date_classes() ) ;
                                                      ^
7 errors generated.
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package ‘valr’
* removing ‘/Users/dpastling/Library/R/3.3/library/valr’
Warning message:
In install.packages("Projects/valr", repos = NULL, type = "source") :
  installation of package ‘Projects/valr’ had non-zero exit status


> devtools::session_info()
Session info ------------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.0 (2016-05-03)
 system   x86_64, darwin13.4.0        
 ui       AQUA                        
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Denver              
 date     2016-05-14                  

Packages ----------------------------------------------------------------------------------------------------------------------------------
 package  * version date       source        
 devtools * 1.11.1  2016-04-21 CRAN (R 3.3.0)
 digest     0.6.9   2016-01-08 CRAN (R 3.3.0)
 memoise    1.0.0   2016-01-29 CRAN (R 3.3.0)
 withr      1.0.1   2016-02-04 CRAN (R 3.3.0)

support VCF inputs

Deal with naked chroms (i.e. 1 in VCF vs chr1 in BED).

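read_vcf <- function(vcf) {
  res <- read_tsv(vcf, col_names = c('chrom', ...)) %>%
    mutate(.chrom = str_c('chr', chrom))

  # flag the result so downstream methods know the chroms are "naked"
  attr(res, "is_vcf") <- TRUE

  # return the table (the attr<- assignment alone would return TRUE)
  res
}

# downstream methods can compare `chrom == .chrom` if `is_vcf == TRUE`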

random intervals should be proportional to chrom size

Scale the chrom_rng so that the chances of drawing a chromosome are proportional to chrom size. Maybe make this an option?

http://stackoverflow.com/questions/1761626/weighted-random-numbers
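
A minimal sketch of size-weighted sampling, using the `prob` argument of base R's `sample()`. It is not valr's `bed_random()`: the function name `random_intervals`, the `weighted` switch, and the toy genome are all hypothetical, and every chromosome is assumed to be at least `length` bases long.

```r
# Hypothetical genome table with chrom names and sizes.
genome <- data.frame(
  chrom = c("chr1", "chr2", "chr3"),
  size  = c(1000000, 500000, 10000),
  stringsAsFactors = FALSE
)

random_intervals <- function(genome, n = 10, length = 1000, weighted = TRUE) {
  # weight each chrom by its share of the genome, or sample uniformly
  prob <- if (weighted) genome$size / sum(genome$size) else NULL
  idx <- sample(seq_len(nrow(genome)), size = n, replace = TRUE, prob = prob)
  # latest start that still keeps the interval inside its chromosome
  max_start <- genome$size[idx] - length
  start <- floor(runif(n, min = 0, max = max_start + 1))
  data.frame(chrom = genome$chrom[idx], start = start, end = start + length,
             stringsAsFactors = FALSE)
}

set.seed(1)
x <- random_intervals(genome, n = 1000)
table(x$chrom)
```

Exposing `weighted` as an argument keeps the current uniform behavior available, in line with the "maybe make this an option" suggestion.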

bed_fisher tests / checks

Need tests for bed_fisher.

Also need to confirm that the simplistic approach for estimating "possible intervals" is accurate. Current implementation does not give 37 as outlined on the BEDtools doc page.

http://bedtools.readthedocs.org/en/latest/content/tools/fisher.html

support BAM inputs

https://bioconductor.org/packages/release/bioc/html/Rsamtools.html
https://cran.r-project.org/web/packages/rbamtools/rbamtools.pdf

NSE for bed_map

Think about how to implement arguments / NSE for bed_map:

bed_map(sum(signal_colname))
bed_map(max(signal_colname))
bed_map(count(signal_colname))

bed_jaccard tests

bed_intersect output not compatible with downstream dplyr functions

Jay, thanks for giving me the opportunity to contribute to this project. I noticed that the new implementation of bed_intersect() (85174fa) does not append suffixes (.x or .y) to the name, score, or strand columns. This is problematic when piping the output to dplyr for further analysis, as it results in an error. I don't know Cxx well enough yet, otherwise I would suggest a fix. Thanks!

x <- tibble::frame_data(
  ~chrom, ~start, ~end, ~name, ~score, ~strand,
  "chr1", 500,    1000, '.',   '.',     '+',
  "chr1", 1000,   1500, '.',   '.',     '-',
  "chr2", 1000,   1200, '.',   '.',     '-'
)

y <- tibble::frame_data(
  ~chrom, ~start, ~end, ~name, ~score, ~strand,
  "chr1", 400,    450, '.',   '.',     '+',
  "chr1", 1000,   1200, '.',   '.',     '-',
  "chr1", 1100,    1500, '.',   '.',     '+',
  "chr2", 1300,   1500, '.',   '.',     '-'
)


bed_intersect(x, y) 

Source: local data frame [3 x 12]

  chrom start.x end.x  name score strand start.y end.y  name score strand .overlap
  <chr>   <dbl> <dbl> <chr> <chr>  <chr>   <dbl> <dbl> <chr> <chr>  <chr>    <int>
1  chr1     500  1000     .     .      +    1000  1200     .     .      -        0
2  chr1    1000  1500     .     .      -    1000  1200     .     .      -      200
3  chr1    1000  1500     .     .      -    1100  1500     .     .      +      400

bed_intersect(x,y) %>% select(start.y)
Error: found duplicated column name: name, score, strand

Passing grouping variables to bed_map

Need a way to pass grouping variables to bed_map, e.g., window intervals from makewindows:

tss_intervals %>%
  bed_flank(size = 1000) %>%
  bed_makewindows(win_size = 50, win_id = 'num') %>%
  bed_map(chip_signal, groups = .win_id, sum = sum(value))

# `...` may appear only once in a signature, so name the table arguments
bed_map <- function(bed_tbl, signal_tbl, groups, ...) {
  bed_intersect(bed_tbl, signal_tbl, suffix_y = '') %>%
    group_by_(groups) %>%
    summarize_(.dots = lazyeval::lazy_dots(...)) %>%
    rename(start = start.x, end = end.x)
}

Need a more flexible way to specify the default grouping by chrom, start.x, end.x. Maybe have a unique ID for each intersection and group_by that?

Implement BEDtools tutorial as valr vignette

http://quinlanlab.org/tutorials/bedtools/bedtools.html

Pure R impl of complement

library(dplyr)

# row-wise table construction needs tribble(), not tibble()
genome <- tibble::tribble(
  ~chrom, ~size,
  "chr1", 100000,
  "chr2", 200000
)

bed_tbl <- tibble::tribble(
  ~chrom, ~start, ~end,
  "chr1", 50,     250,
  "chr1", 500,    1000,
  "chr2", 1,      1000,
  "chr2", 2000,   5000
)

lags <- bed_tbl %>% group_by(chrom) %>% mutate(.prev_end = lag(end))

first <- lags %>% filter(is.na(.prev_end) & start > 1) %>%
  mutate(.start = 1, .end = start) %>%
  mutate(start = .start, end = .end) %>%
  select(-.prev_end, -.start, -.end)

internal <- lags %>%
  filter(!is.na(.prev_end)) %>%
  mutate(.start = .prev_end, .end = start) %>%
  mutate(start = .start, end = .end) %>%
  select(-.prev_end, -.start, -.end)

final <- lags %>%
  summarize(max.end = max(end)) %>%
  left_join(genome, by='chrom') %>%
  filter(size != max.end) %>%
  mutate(start = max.end, end = size) %>%
  select(-size, -max.end)
#> Joining by: "chrom"

compl <- bind_rows(list(first, internal, final)) %>% arrange(chrom, start)

compl
#> Source: local data frame [5 x 3]
#> 
#>   chrom start   end
#>   (chr) (dbl) (dbl)
#> 1  chr1     1 5e+01
#> 2  chr1   250 5e+02
#> 3  chr1  1000 1e+05
#> 4  chr2  1000 2e+03
#> 5  chr2  5000 2e+05

bed_map tests

test_that("ops on y columns work on original names (#14)")

package name ideas

Suggestions for names.

Ideally package names are pronounceable. Rbedtools doesn't exactly roll off the tongue.

bedr ("better") is taken.

valr - manipulates intervals
the alternative, rinter, sounds like "winter" with a speech impediment
bred - mixing R and bed tools.
berd - uhh ...

trim option for bound_intervals

Add a trim option to bound_intervals to adjust the coordinates of out-of-bounds intervals, instead of removing them.
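
One way the option could behave, sketched in base R. The function below is hypothetical (`bound_intervals_sketch` is not the valr function), assumes 0-based starts as in BED, and assumes the genome table has `chrom` and `size` columns.

```r
# Hypothetical sketch: drop or trim intervals that fall outside chrom bounds.
bound_intervals_sketch <- function(x, genome, trim = FALSE) {
  size <- genome$size[match(x$chrom, genome$chrom)]
  if (trim) {
    # clamp coordinates to [0, chrom size]
    x$start <- pmax(x$start, 0)
    x$end <- pmin(x$end, size)
    # still drop unknown chroms and intervals left empty after clamping
    keep <- !is.na(size) & x$start < x$end
  } else {
    # keep only intervals that lie fully within bounds
    keep <- !is.na(size) & x$start >= 0 & x$end <= size
  }
  x[keep, , drop = FALSE]
}

genome <- data.frame(chrom = "chr1", size = 1000, stringsAsFactors = FALSE)
x <- data.frame(
  chrom = c("chr1", "chr1", "chr1"),
  start = c(-100, 100, 900),
  end   = c(50, 200, 1100),
  stringsAsFactors = FALSE
)

removed <- bound_intervals_sketch(x, genome)               # keeps 1 interval
trimmed <- bound_intervals_sketch(x, genome, trim = TRUE)  # keeps all 3, clamped
```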

bed_intersect inverse returns 0 rows

df.1 <- frame_data(
  ~chrom, ~start, ~end,
  "A",   1,      100,
  "B",   50,     150
)

df.2 <- frame_data(
  ~chrom, ~start, ~end,
  "A",    1,       25,
  "B",    45,      85
)

intersection <- bed_intersect(df.1, df.2)
intersection
#> Source: local data frame [2 x 6]
#> 
#>   chrom start.x end.x start.y end.y .overlap
#>   <chr>   <dbl> <dbl>   <dbl> <dbl>    <int>
#> 1     A       1   100       1    25       24
#> 2     B      50   150      45    85       35

reverse <- bed_intersect(df.1, df.2, invert = TRUE)
reverse
#> Source: local data frame [0 x 3]
#> Groups: chrom [2]
#> 
#> Variables not shown: chrom <chr>, start <dbl>, end <dbl>.

glyph function

make this pretty and then use throughout vignettes. would be helpful to add x, y, and .fun labels to the plot.

library(ggplot2)
library(dplyr)
library(tibble)
library(valr)

bed_glyph <- function(x, y, .fun, ...) {

  x <- mutate(x, bin = 3)
  y <- mutate(y, bin = 2)

  res <- eval(.fun(x, y, ...)) %>% mutate(bin = 1)

  comb <- bind_rows(x, y, res) 

  ggplot(comb) + 
    geom_rect(aes(xmin = start, xmax = end,
                  ymin = bin, ymax = bin + 0.9,
                  fill= bin)) + theme_bw()

}

x <- tribble(
  ~chrom, ~start, ~end,
  'chr1',      1,      100
)

y <- tribble(
  ~chrom, ~start, ~end,
  'chr1',      50,     75
)

bed_glyph(x, y, bed_subtract)

duplicate intervals in bed_intersect

x <- tibble::frame_data(
~chrom, ~start, ~end,
"chr1", 100,    500,
"chr1", 175,    200
)

y <- tibble::frame_data(
~chrom, ~start, ~end,
"chr1", 150,    400,
"chr1", 151,    401
)

bed_intersect(x, y)
#> Source: local data frame [6 x 6]
#> 
#>   chrom start.x end.x start.y end.y .overlap
#>   <chr>   <dbl> <dbl>   <dbl> <dbl>    <int>
#> 1  chr1     100   500     150   400      250
#> 2  chr1     100   500     151   401      250
#> 3  chr1     175   200     150   400       25
#> 4  chr1     175   200     151   401       25
#> 5  chr1     175   200     150   400       25
#> 6  chr1     175   200     151   401       25

error in bed_flank vignette example

Error: Duplicate identifiers for rows (1284808, 1498582), (3284808, 3498582), (1386858, 1732629), (3386858, 3732629), (1136478, 1735205), (3136478, 3735205), (1256268, 1422468), (3256268, 3422468), (1298131, 1653327), (3298131, 3653327), (1628578, 1849474), (3628578, 3849474), (284808, 498582), (2284808, 2498582), (386858, 732629), (2386858, 2732629), (136478, 735205), (2136478, 2735205), (256268, 422468), (2256268, 2422468), (298131, 653327), (2298131, 2653327), (628578, 849474), (2628578, 2849474)
8. stop("Duplicate identifiers for rows ", paste(str, collapse = ", "), call. = FALSE)
7. spread_.data.frame(data, key_col, value_col, fill = fill, convert = convert, drop = drop, sep = sep)
6. NextMethod()
5. as_data_frame(NextMethod())
4. spread_.tbl_df(data, key_col, value_col, fill = fill, convert = convert, drop = drop, sep = sep)
3. spread_(data, key_col, value_col, fill = fill, convert = convert, drop = drop, sep = sep)
2. tidyr::spread(res, key, value) at bed_flank.r#121
1. bed_flank(x, genome, both = 100)

travis check failing with g++ errors in random.cpp

R impl is too slow.

differing numbers of chromosomes prevents intersections (bed_intersect)

Intersections are not reported by bed_intersect when x and y contain different numbers of chromosomes.

x <- tibble::frame_data(
  ~chrom, ~start, ~end,
  "chr2", 100,    500
)

y <- tibble::frame_data(
  ~chrom, ~start, ~end,
  "chr1", 10,     20,
  "chr2", 100,    500
)

bed_intersect(x, y)
#> # A tibble: 0 x 6
#> # ... with 6 variables: chrom <chr>, start.x <dbl>, end.x <dbl>,
#> #   start.y <dbl>, end.y <dbl>, .overlap <int>

make column name parameters on the Rcpp side

There are many places where chrom, start and end are hard-coded on the Rcpp side. Could make these into parameters that could be passed.

OTOH, this doesn't have to change; we could just enforce the existence of these names on the R side.

build fails with forked version of dplyr

build fails with forked version of dplyr jayhesselberth/dplyr

> devtools::install_github('jayhesselberth/dplyr')
> devtools::install_github('eddelbuettel/BH')
> devtools::install_github('jayhesselberth/valr')

...

g++ -std=c++0x \
    -I/vol4/home/astlingd/R/lib64/R/include \
    -DNDEBUG \
    -I../inst/include \
    -I/usr/local/include \
    -I"/vol4/home/astlingd/R/lib64/R/library/Rcpp/include" \
    -I"/vol4/home/astlingd/R/lib64/R/library/BH/include" \
    -I"/vol4/home/astlingd/R/lib64/R/library/dplyr/include" \
    -fpic  -g -O2 -c RcppExports.cpp -o RcppExports.o
In file included from ../inst/include/valr.h:6,
                 from RcppExports.cpp:4:
../inst/include/IntervalTree.h: In constructor ‘IntervalTree<T, K>::IntervalTree()’:
../inst/include/IntervalTree.h:63: error: ‘nullptr’ was not declared in this scope
../inst/include/IntervalTree.h: In copy constructor ‘IntervalTree<T, K>::IntervalTree(const IntervalTree<T, K>&)’:
../inst/include/IntervalTree.h:76: error: ‘nullptr’ was not declared in this scope
../inst/include/IntervalTree.h: In member function ‘IntervalTree<T, K>& IntervalTree<T, K>::operator=(const IntervalTree<T, K>&)’:
../inst/include/IntervalTree.h:87: error: ‘nullptr’ was not declared in this scope
../inst/include/IntervalTree.h: In constructor ‘IntervalTree<T, K>::IntervalTree(std::vector<Interval<T, K>, std::allocator<Interval<T, K> > >&, size_t, size_t, K, K, size_t)’:
../inst/include/IntervalTree.h:101: error: ‘nullptr’ was not declared in this scope
../inst/include/IntervalTree.h:140: error: parse error in template argument list
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package ‘valr’
* removing ‘/vol4/home/astlingd/R/lib64/R/library/valr’
Error: Command failed (1)


> devtools::session_info()
Session info -------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.3 (2015-12-10)
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       <NA>                        
 date     2016-05-14                  

Packages -----------------------------------------------------------------------
 package  * version date       source        
 curl       0.9.7   2016-04-10 CRAN (R 3.2.3)
 devtools * 1.11.1  2016-04-21 CRAN (R 3.2.3)
 digest     0.6.9   2016-01-08 CRAN (R 3.2.2)
 git2r      0.14.0  2016-03-13 CRAN (R 3.2.3)
 httr       1.1.0   2016-01-28 CRAN (R 3.2.3)
 memoise    1.0.0   2016-01-29 CRAN (R 3.2.3)
 R6         2.1.2   2016-01-26 CRAN (R 3.2.3)
 withr      1.0.1   2016-02-04 CRAN (R 3.2.3)

makewindows ids

Pretty sure the win_id param is not needed.

# name
bed_makewindows(x, genome, win_size = 10) %>% group_by(name)
# num
bed_makewindows(x, genome, win_size = 10) %>% group_by(win_id)
# namenum
bed_makewindows(x, genome, win_size = 10) %>% group_by(name, win_id)

Identify new starts within the incl and excl bounds.
Calculate original sizes from passed intervals and add to the random starts.