gogonzo / runner


:runner: R package performing simple running calculations

Home Page: https://gogonzo.github.io/runner/

R 66.82% C++ 32.39% Makefile 0.79%

runner's Introduction

💫 About Me:

I develop R packages.

runner's Issues

multiple columns in a sliding window, runner package, xts format

I'm trying to do something similar to w1, which contains the columns of a sliding window of size k.

set.seed(1)
library(xts)
tm <- Sys.time()+1:5000
x <- rnorm(5000) |> cumsum() |> round(1) |> xts(tm) |> to.minutes5(name = NULL) |> align.time()+100

# window len
k <- 3
w1 <- as.zoo(x) |> rollapplyr(width=k, FUN=c, fill=NA)

I'm trying to get the same result as w1 using the runner package, but I'm doing something wrong. This solution does not work, although the documentation indicates that the package works with the xts format.

library(runner)
w2 <- runner(x, k=k, f=c, na_pad = T)

I made a workaround that works, but I would like to know how to do this correctly and without redundancy.

w2 <- x |> 
      coredata() |> 
      runner(k=k, f=c, na_pad = T) |> 
      do.call(what="rbind") |> 
      xts(index(x))

comparison (1) is possible only for atomic and list types

I have a somewhat complex problem that is hard to reproduce.

In my package finfeatures, I use runner a lot https://github.com/MislavSag/finfeatures

Everything works as expected.

Then I wrote a plumber API endpoint that uses functions from my finfeatures package.

When I try to call the endpoint (send a POST request), I get an error:

Error in x[[1]] == as.name("runner"): comparison (1) is possible only for atomic and list types

This same function works locally, but doesn't work through plumber requests.

I have found the problem is in this part: https://github.com/gogonzo/runner/search?q=x%5B%5B1%5D%5D+%3D%3D+as.name%28%22runner%22%29

My code that uses runner https://github.com/MislavSag/finfeatures/blob/main/R/RollingGeneric.R

Do you have any idea what could be the problem?

Issue when using sum_run over time series where number of rows matches index

Hi there, I've encountered an issue where sum_run does not work correctly on a time series when k exactly equals the number of rows. I'm using this code:
mutate(`4WkSumHours` = sum_run(Hours, k = 28, idx = PeriodEndDate))
where k is the number of days to look back, idx is a column of Date objects, and Hours is a column of numeric values. In all other cases, where k does not equal the number of rows, the window sum works as intended and only sums hours from rows whose PeriodEndDate falls within the 28 days before the current row's PeriodEndDate.

Here's a sample of output showing the expected behaviour with 34 rows of data and the incorrect behaviour with 28 rows of data:
windowSum.xlsx
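One way to narrow down a report like this is to compare the built-in sum_run with the generic runner() applied to base sum over the same windows. The sketch below uses made-up weekly data (period_end and hours are hypothetical names, not the reporter's spreadsheet) and deliberately has fewer rows than k, so it shows the intended behaviour rather than reproducing the bug.

```r
library(runner)

# Hypothetical weekly timesheet: one row per period end date.
period_end <- as.Date("2024-01-07") + 7 * 0:5
hours <- c(1, 2, 3, 4, 5, 6)

# Built-in running sum over the 28 days ending at each date.
fast <- sum_run(hours, k = 28, idx = period_end)

# Same windows, evaluated with the generic runner() and base sum().
reference <- runner(hours, k = 28, idx = period_end, f = sum)

# Positions where the two disagree would point at the built-in.
which(fast != reference)
```

Running the same comparison on data with exactly 28 rows should show whether the discrepancy is specific to sum_run.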

add more general support for irregularly spaced data

The following was written based on the documentation, not on actually testing the package.

First of all, I think it's great that someone has created a general-purpose package for running functions. This is a key part of the R data wrangling ecosystem that, as far as I know, has been missing for a long time. I certainly missed it on many occasions.

Currently this package supports regularly spaced time series (or other regularly spaced data) by assuming that the index of each data point corresponds to its position in time/space/whatever.
It also supports irregularly spaced data through a second input variable that explicitly defines the position of each data point. This (I assume) also covers, e.g., time series with more than one measurement per point in time.
At the moment, however, this latter system is restricted to cases where time/space/whatever is measured either as a Date object or as an integer variable. It would be great if floating-point values could also be used.
One could then also have a second variable of the same type that specifies "where" the windowed function should be computed. For example, one could input a number of irregularly spaced temperature measurements, use "mean" as the running function, and define regularly spaced points as "where" this function should be computed. The output would then be a regularly spaced time series of average temperatures.
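In releases of runner that provide the at argument, the temperature scenario can be sketched roughly as follows. The data are invented, and integer time points are used because the request above concerns index types that were not yet supported at the time.

```r
library(runner)

# Hypothetical temperature readings taken at irregular integer time points.
time_point <- c(1L, 2L, 5L, 6L, 9L, 12L)
temperature <- c(10, 12, 14, 16, 18, 20)

# Average over the 4 time units ending at each regularly spaced output point.
regular_points <- c(4L, 8L, 12L)
avg_temperature <- runner(
  x = temperature,
  k = 4,
  idx = time_point,
  at = regular_points,
  f = mean
)
```

Here idx states where each measurement sits and at states where the function is evaluated, so the output has one value per element of regular_points.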

feature-req./question

Is there a more elegant and robust way to get a data frame back from runner as output than this? I tried simplify='array' with no effect.

test<-runner(x=df,k=sSize,f=rollingAnal,simplify=FALSE,na_pad=TRUE)
            # View(test)
            # stop()

            sampleList  <-list()
            listIndex   <-0    
            for(i in 1:length(test)){
                myDf<-test[[i]]
                if(is.data.frame(myDf)==T){
                    if(nrow(myDf)>0){
                        listIndex<-listIndex+1
                        sampleList[[listIndex]]<-myDf
                    }
                }    
            }

            myOut<-do.call(data.table::rbindlist,list(sampleList,fill=FALSE,idcol=NULL))
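The loop above can probably be shortened by filtering the list in one step and stacking what remains. This is only a sketch: df and summarise_window are hypothetical stand-ins for the reporter's data and rollingAnal function.

```r
library(runner)

# Hypothetical input and window function; each call returns a one-row data frame.
df <- data.frame(id = 1:6, value = c(2, 4, 6, 8, 10, 12))
summarise_window <- function(w) {
  data.frame(last_id = max(w$id), mean_value = mean(w$value))
}

windows <- runner(df, k = 3, f = summarise_window, simplify = FALSE, na_pad = TRUE)

# Keep only the elements that are non-empty data frames, then stack them.
is_result <- vapply(
  windows,
  function(w) is.data.frame(w) && nrow(w) > 0,
  logical(1)
)
out <- do.call(rbind, windows[is_result])
```

data.table::rbindlist(windows[is_result]) could replace the do.call(rbind, ...) step when data.table is already in use.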

Optional `at`/`where` argument

Optional argument holding the data points at which the calculation should be performed.

lag_run to use negative lag

fix lag_run to use negative lag

Running which function name

Hi @gogonzo is whicht_run() from v0.2.0 now equivalent to which_run()?

Not working or bad example in docs

library(runner)

# sample data
x <- cumsum(rnorm(10))
data <- data.frame(
  date = Sys.Date() + cumsum(sample(1:3, 10, replace = TRUE)), # unequally spaced time series
  y = 3 * x + rnorm(10),
  x = cumsum(rnorm(10))
)

# solution
data$pred <- runner(
  data,
  na_pad = TRUE,
  lag = 1,
  k = 4,
  f = function(data) {
    predict(
      lm(y ~ x, data = data)
    )[nrow(data)]
  }
)
lm( y ~ x, data = data[1:4,]) %>% predict(., newdata = data[5,]) == data[5, "pred"]

Unexpected behavior with idx as dates and at = "1 months"

I expected the following code to give me the distinct count of category for all rows with a date in January and another count for all rows with a date in February. Using 1 month for both k and at just demonstrates the issue. In my use case I need k = 12.

library(runner)
library(tidyverse)

df <- read.table(text = "  user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       27 2016-02-05   orange
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple
       27 2016-02-10   orange", header = TRUE)

df <- arrange(df, date)

runner(df, k = "1 months", idx = as.Date(df$date), f = function(x) length(unique(x$category)), at = "1 months")

Here, runner returns: 1 3, instead of the expected 3 1

The problem seems to stem in part from how the at argument is used to create the reporting periods:

seq(min(as.Date(df$date)), max(as.Date(df$date)), by = "1 month")

returns "2016-01-01" "2016-02-01"

From the docs I'm not sure how 1 month is taken off of those dates to look back, but I assume it uses Date functions, so the windows above would each start at the first of the prior month.

From the docs, the results are "correct", but not what I would expect, or what most people mean when they say something like a rolling 12 month count.

Passing a list of end of month dates as the value of at also doesn't work, because subtracting 1 month from some months will include days from the end of prior months, for instance

seq(as.Date("2016-02-28"), length = 2, by = "-1 month")

returns "2016-02-28" "2016-01-28" so data for the second month would include anything near the end of the first month.

So far I cannot find a solution using runner with dates as the index. I tried using Clock at year-month precision, but runner rejects Clock's data type with an error when it is used for idx. Allowing Clock would work as expected when using Clock's year_month_day constructor with just a year and a month:

library(clock)
seq(year_month_day(2016, 2), length = 2, by  = -1)

returns "2016-02" "2016-01"

Thus far, the only way I can see to do this is by converting the dates to Clock's year month type, then ranking and using the rank as index. So this works:

df <- df %>% mutate(ymRank = dense_rank(date_group(as.Date(date), "month")))

runner(df, k = 1, idx = df$ymRank, f = function(x) length(unique(x$category)), at = seq(min(df$ymRank), max(df$ymRank), by = 1))

producing the desired: 3 1

And this also works with a k > 1 window size, such as:

runner(df, k = 2, idx = df$ymRank, f = function(x) length(unique(x$category)), at = seq(min(df$ymRank), max(df$ymRank), by = 1))

producing: 3 4

At a minimum this use case should be mentioned in the docs or in the set of examples. Allowing Clock dates would be better though.
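The same idea can be sketched without clock or dplyr by turning each date into an integer month counter, which also keeps gaps between months intact (unlike a dense rank). The data below are a shortened version of the table in the report, and month_index is a name introduced here for illustration.

```r
library(runner)

# Shortened version of the data in the report, sorted by date.
dates <- as.Date(c(
  "2016-01-01", "2016-01-05", "2016-01-07",
  "2016-01-16", "2016-02-05", "2016-02-10"
))
category <- c("apple", "pear", "plum", "apple", "orange", "orange")

# Integer month counter: consecutive calendar months differ by exactly 1.
month_index <- as.integer(format(dates, "%Y")) * 12L +
  as.integer(format(dates, "%m"))
report_months <- unique(month_index)

count_distinct <- function(w) length(unique(w))

# Distinct categories per calendar month, and per rolling 2-month window.
distinct_1m <- runner(category, k = 1, idx = month_index,
                      at = report_months, f = count_distinct)
distinct_2m <- runner(category, k = 2, idx = month_index,
                      at = report_months, f = count_distinct)
```

With k = 12 the same call would give a rolling 12-month distinct count aligned to calendar months.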

calculate FUN at every by-th time point rather than every point.

I am trying to make runner return a list of windows as follows:

1 2 3 4 5
5 6 7 8 9
9 10 11 12 13

This is the behaviour of rollapply as in the following example

require(zoo)
TS <- zoo(c(1:20))
rollapply(TS, width = 5, by = 4, FUN = print, align = "left")

I thought I could get the same results using the lag argument in runner

list1<-runner(
  x=c(1:20),
  k = 5,
  lag=4,
  simplify = F
)

list2<-runner(
  x=c(1:20),
  k = 5,
  lag=-4,
  simplify = F
)

What am I missing, or is this feature not available?

Thanks
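The lag argument shifts every window rather than skipping output points, so it is unlikely to give this result. A possible way to evaluate only at every 4th position, assuming a release that has the at argument, is sketched here; idx is passed explicitly so that at refers to positions in x.

```r
library(runner)

x <- 1:20

# Windows of length 5 ending at positions 5, 9, 13 and 17.
windows <- runner(
  x = x,
  k = 5,
  idx = seq_along(x),
  at = seq(5L, 20L, by = 4L),
  f = function(w) w,
  simplify = FALSE
)
```

Each list element then holds one window, starting with 1 to 5, then 5 to 9, and so on.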

allow negative lag (lead)

Allow lag argument to be negative and shift window accordingly
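A small illustration of the intended behaviour, assuming a release in which lag_run accepts negative values: a positive lag looks back, a negative lag looks ahead.

```r
library(runner)

x <- 1:5

# Positive lag: each element is replaced by the previous one.
lagged <- lag_run(x, lag = 1)

# Negative lag (lead): each element is replaced by the next one.
led <- lag_run(x, lag = -1)
```

Positions that have no element to take the value from are filled with NA.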

Include bookdown

Parallelization with big datasets

Hello @gogonzo ,

I am working on environmental exposure data, and I need to produce averages of a single exposure for different time windows and at selected samples of dates. I have pollutant data available at the individual level for a very long period, i.e. daily levels for 11 years, for several tens of thousands of individuals.

To achieve this I looked into one of your examples, specifically the last one, i.e. Aggregating values from another data.frame in grouped_df.

As you can see below, I have prepared a working example based on your code using n individuals.
The exp_series dataset contains my toy daily exposure data for each individual (4017 is the number of days from 01/01/2005 to 31/12/2015), and the sample_dates dataset contains a sample of dates for each subject within that follow-up period, for which I want to extract the average exposure over the prior 365 days with a 1-day lag.
I ran the simulation under 2 separate conditions, n = 10k (k = 1000) and n = 100k, with the number of sample dates fixed at 10k.

System specifications of the machine I used to run the simulation on:

Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz, 2301 Mhz, 6 Core(s), 12 Logical Processor(s) with Windows 10.

Working example:

library(parallel)
library(runner)

#Sample data
set.seed(111)
exp_series <- data.frame(
  id = as.character(rep(x=1:100000, each=4017)), #change x= 10000
  date = rep(seq(as.Date('2005-01-01'),
                 as.Date('2015-12-31'), by = 'day'),times=100000),#change times= 10000
  exp = rep(rnorm(n=100000, mean=10, sd=5),times=4017)#change n= 10000
)


sample_dates <- data.frame(
  Event_id = as.character(replicate(n=10000,
sample(x=1:100000,size = 1,replace = TRUE))), #change x=10000
  Event_date = sample(
    seq(as.Date('2005-01-01'),
        as.Date('2015-12-31'), by = 'day'),
    size =10000,replace = TRUE)
)


You can see below the results of my simulation in terms of time and warnings produced.  

#Uncomment when running parallelized code
#numCores <- detectCores()
#cl <- makeCluster(numCores)


#Exposure summaries dataset
system.time(exposure_summaries <- sample_dates %>%
  group_by(Event_id) %>%
  mutate(
    mean = runner(
      x = exp_series[exp_series$id ==  Event_id[1],], 
      k = "365 days", 
      lag = "1 days",
      idx =exp_series$date[exp_series$id == Event_id[1]],
      at = Event_date,
      f = function(x) {mean(x$exp)},
      na_pad=FALSE#,
     #Uncomment when running parallelized code
     #cl = cl 
    )
  )
)

#Uncomment when running parallelized code
#stopCluster(cl)

Results on my machine:

##################
#Results from testing: serial vs parallelization

###
#exp_series with 10k rows

#YES parallelization:
#user   system  elapsed 
#12267.75  1409.73 13952.49

#NO parallelization:
#user  system elapsed 
#9678.58  269.07 9950.24 
###


###
#exp_series with 100k rows

#YES parallelization:
# user   system  elapsed 
# 99763.36   3943.07 106903.92 

#NO parallelization:
#user  system elapsed 
# Warning: stack imbalance in '==', 96 then 95
# Warning: stack imbalance in '[', 93 then 92
# Warning: stack imbalance in 'NextMethod', 84 then 83
# Warning: stack imbalance in '[', 68 then 67
# user   system  elapsed 
# 96603.98  2732.89 99313.77
###

I was not able to find the reason for the warnings above in the 100k simulation, but they did not appear when running the parallelized code.
The interesting thing for me is that parallelizing the code does not speed up the running time; it actually slows it down! I know that this usually happens when the amount of work to be processed is not big enough, but here it seems to be big enough.

Might the problem be that, in this case, I would ideally have to parallelize the process by groups to speed up the code? That is, use one core per individual and compute the averages for several individuals in parallel. On my machine that would allow working on 24 individuals at the same time.

What do you think about this? I really need to speed this up!!
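A sketch of the group-wise idea, on toy data with a 3-day window instead of 365 days: the series is split by individual once, so each event only touches its own individual's rows instead of filtering the full exp_series inside every group. The helper mean_exposure is hypothetical, and the loop is serial here.

```r
library(runner)

# Toy stand-ins for the data sets in the report.
exp_series <- data.frame(
  id = rep(c("a", "b"), each = 10),
  date = rep(as.Date("2020-01-01") + 0:9, times = 2),
  exp = c(1:10, 11:20)
)
sample_dates <- data.frame(
  Event_id = c("a", "b"),
  Event_date = as.Date(c("2020-01-06", "2020-01-10"))
)

# Split once by individual so every event looks up a small table.
series_by_id <- split(exp_series, exp_series$id)

# Mean exposure over the k days before the event date, with a 1-day lag.
mean_exposure <- function(i) {
  s <- series_by_id[[as.character(sample_dates$Event_id[i])]]
  runner(
    x = s$exp,
    k = 3,
    lag = 1,
    idx = s$date,
    at = sample_dates$Event_date[i],
    f = mean
  )
}

sample_dates$mean <- vapply(
  seq_len(nrow(sample_dates)),
  mean_exposure,
  numeric(1)
)
```

Because each call depends only on one individual's data, the vapply step is the natural place to hand work to something like parallel::parLapply, provided the required objects are exported to the workers. Whether that pays off depends on how much data has to be copied to each worker.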

Can't get it constant

I read the docs on:
https://gogonzo.github.io/runner/articles/apply_any_r_function.html

... but I can't get it.

Is it possible to keep the window size constant while using an index?
For example this gives me varying window sizes:
do.call(runner,list(x=df,k=2,idx='id',f=doThatShit,simplify=FALSE,na_pad=FALSE))
while....
do.call(runner,list(x=df,k=2,f=doThatShit,simplify=FALSE,na_pad=FALSE))
gives me a constant window, but not in the right interval.
I would need an offset of 5 instead of 1 step forward, or exactly 5 days without spreading dates across windows. How would we do this?

The x argument works only with data.frame

In the documentation, it says x argument accepts (vector, data.frame, matrix)

But for me only data.frame works.

Data:

x <- structure(c(203.37, 203.09, 201.85, 205.26, 207.51, 207, 206.6, 
                 208.95, 208.83, 207.93, 210.39, 211, 210.36, 210.15, 210.04, 
                 208.08, 208.56, 207.74, 204.84, 202.54, 205.62, 205.47, 208.73, 
                 208.55, 209.31, 209.07, 209.35, 209.32, 209.56, 208.69, 210.68, 
                 208.53, 205.61, 209.62, 208.35, 206.95, 205.34, 205.87, 201.88, 
                 202.9, 205.03, 208.03, 204.86, 200.02, 201.67, 203.5, 206.02, 
                 205.68, 205.21, 207.4, 205.93, 203.87, 201.02, 201.36, 198.82, 
                 194.05, 191.925, 192.11, 193.66, 188.83, 191.93, 187.81, 188.06, 
                 185.65, 186.69, 190.52, 187.64, 190.2, 188.13, 189.11, 193.72, 
                 193.65, 190.16, 191.3, 191.6, 187.95, 185.42, 185.43, 185.27, 
                 182.86, 186.63, 189.78, 192.88, 192.09, 192, 194.78, 192.32, 
                 193.2, 195.54, 195.09, 193.56, 198.11, 199, 199.775, 200.43, 
                 200.59, 198.4, 199.38, 199.54, 202.76), class = c("xts", "zoo"
                 ), from = "20151016  10:34:29", to = "20201016  10:34:29", src = "IB", updated = structure(1602840870.87554, class = c("POSIXct", 
                                                                                                                                        "POSIXt")), index = structure(c(1445205600, 1445292000, 1445378400, 
                                                                                                                                                                        1445464800, 1445551200, 1445814000, 1445900400, 1445986800, 1446073200, 
                                                                                                                                                                        1446159600, 1446418800, 1446505200, 1446591600, 1446678000, 1446764400, 
                                                                                                                                                                        1447023600, 1447110000, 1447196400, 1447282800, 1447369200, 1447628400, 
                                                                                                                                                                        1447714800, 1447801200, 1447887600, 1447974000, 1448233200, 1448319600, 
                                                                                                                                                                        1448406000, 1448578800, 1448838000, 1448924400, 1449010800, 1449097200, 
                                                                                                                                                                        1449183600, 1449442800, 1449529200, 1449615600, 1449702000, 1449788400, 
                                                                                                                                                                        1450047600, 1450134000, 1450220400, 1450306800, 1450393200, 1450652400, 
                                                                                                                                                                        1450738800, 1450825200, 1450911600, 1451257200, 1451343600, 1451430000, 
                                                                                                                                                                        1451516400, 1451862000, 1451948400, 1452034800, 1452121200, 1452207600, 
                                                                                                                                                                        1452466800, 1452553200, 1452639600, 1452726000, 1452812400, 1453158000, 
                                                                                                                                                                        1453244400, 1453330800, 1453417200, 1453676400, 1453762800, 1453849200, 
                                                                                                                                                                        1453935600, 1454022000, 1454281200, 1454367600, 1454454000, 1454540400, 
                                                                                                                                                                        1454626800, 1454886000, 1454972400, 1455058800, 1455145200, 1455231600, 
                                                                                                                                                                        1455577200, 1455663600, 1455750000, 1455836400, 1456095600, 1456182000, 
                                                                                                                                                                        1456268400, 1456354800, 1456441200, 1456700400, 1456786800, 1456873200, 
                                                                                                                                                                        1456959600, 1457046000, 1457305200, 1457391600, 1457478000, 1457564400,

Reproducible example:

# Doesn't work: xts
test <- runner::runner(
  x = Cl(close),
  f = function(x) {
    exuber::radf(x, minw = 10, lag = 1)
  },
  k = 50,
  na_pad = TRUE
)

# Doesn't work: vector
test <- runner::runner(
  x = as.vector(Cl(close)),
  f = function(x) {
    exuber::radf(x, minw = 10, lag = 1)
  },
  k = 50,
  na_pad = TRUE
)

# Doesn't work: matrix
test <- runner::runner(
  x = zoo::coredata(Cl(close)),
  f = function(x) {
    exuber::radf(x, minw = 10, lag = 1)
  },
  k = 50,
  na_pad = TRUE
)

# Works df!
test <- runner::runner(
  x = as.data.frame(zoo::coredata(Cl(close))),
  f = function(x) {
    exuber::radf(x, minw = 10, lag = 1)
  },
  k = 50,
  na_pad = TRUE
)

Check if the indexes are sorted in ascending order

idx has to be sorted in ascending order for the statistics to be computed correctly. Every function that takes an idx argument should validate this at the start and throw an error if it does not hold.
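A minimal sketch of such a check in plain R. The helper name check_idx_sorted and the error message are made up here and are not the package's actual implementation.

```r
# Hypothetical validator: stop when idx is not in non-decreasing order.
check_idx_sorted <- function(idx) {
  if (length(idx) > 1 && is.unsorted(idx, na.rm = TRUE)) {
    stop("idx has to be sorted in ascending order")
  }
  invisible(TRUE)
}

check_idx_sorted(c(1, 2, 2, 5))                    # passes, ties are allowed
check_idx_sorted(as.Date("2020-01-01") + c(0, 3))  # Dates work as well
```

An unsorted input such as c(3, 1, 2) would raise the error.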

Check lengths of arguments

Add functions checking length of k, lag and idx.

length(k) == length(x) | length(k) == 1
length(lag) == length(x) | length(lag) == 1
length(idx) == length(x) | is.null(idx)
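The three conditions above could be wrapped in one helper along these lines. The function name check_lengths and the messages are illustrative assumptions, not the package's code.

```r
# Hypothetical helper enforcing the length rules listed above.
check_lengths <- function(x, k, lag, idx = NULL) {
  n <- length(x)
  if (!(length(k) == n || length(k) == 1)) {
    stop("length of k should be 1 or equal to length of x")
  }
  if (!(length(lag) == n || length(lag) == 1)) {
    stop("length of lag should be 1 or equal to length of x")
  }
  if (!(is.null(idx) || length(idx) == n)) {
    stop("length of idx should be equal to length of x")
  }
  invisible(TRUE)
}

check_lengths(x = 1:10, k = 3, lag = 0)               # passes
check_lengths(x = 1:10, k = 1:10, lag = 0, idx = 1:10) # passes
```

A call such as check_lengths(1:10, k = 1:3, lag = 0) would stop with the first message.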

Only use full windows.

Unfortunately I don't understand the docs...

https://mran.microsoft.com/snapshot/2020-03-10/web/packages/runner/vignettes/apply_any_r_function.html

I drank some beers, but it does not help.

How can I tell runner to only use full windows defined by k?
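As far as the documentation goes, the na_pad argument controls this: with na_pad = TRUE, windows that are shorter than k return NA instead of being evaluated. A small sketch for the plain (no idx) case:

```r
library(runner)

x <- 1:5

# Default: the first windows are shorter than k but still evaluated.
partial <- runner(x, k = 3, f = sum)

# na_pad = TRUE: incomplete windows return NA, only full windows are computed.
full_only <- runner(x, k = 3, f = sum, na_pad = TRUE)
```

Dropping the NA entries afterwards leaves only the results from full windows.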

Fix grouping issue to pass `.` as grouped

Example


data <- read.table(
  text = "REQUEST.DATE Title.ID ID.Index Copies
 1 2013-07-09          2        1      1
 2 2013-08-07          2        2      2
 3 2013-08-20          2        3      3
 4 2013-09-08          2        4      4
 5 2013-09-28          2        5      5
 6 2013-12-27          2        6      5
 7 2014-02-10          2        7      5
 8 2014-03-12          2        8      5
 9 2014-03-14          2        9      5
10 2014-08-27          2       10      5
11 2014-04-27          6        1      1
12 2014-08-01          6        2      2
13 2014-11-13          6        3      2
14 2015-02-14          6        4      2
15 2015-05-14          6        5      2", 
  header = TRUE)

data$REQUEST.DATE <- as.Date(as.character(data$REQUEST.DATE))

library(dplyr)
library(runner)
data %>%
  group_by(Title.ID) %>%
  mutate(
    Copies = runner(
      x = .,
      f = function(x) {
        browser() # 15 rows instead of 10/5 (size of groups)
      }
    )
  )

https://dplyr.tidyverse.org/reference/group_map.html

Implement runner for data.frames

Find quick solution for grouped at.
Simplify operations like this one

Update Documentation

Fix vignette

Update the vignette to reflect recent changes

extend idx argument to numeric values

Change int type to float or double in all functions

From discussion #32

Version with parallel execution not in CRAN?

Hello there! I took a look at the README and #57, and it seems that this cool feature is already available, but the latest version on CRAN is 0.3.7. I was wondering whether CRAN is outdated, or whether this version was never released.

Thanks for your work.

Parallel execution

I would like to suggest a new feature: parallel execution of rolling functions. It would help with big datasets.

Create `runner.data.table` for `.SD`

To automatically dispatch proper method and avoid unnecessary object checking.

x[[1]] == as.name("runner")

Is there an explanation for the message?

test        <-do.call(runner,list(x=df,k=daSize,f=custom_fun_runner,simplify=FALSE,na_pad=TRUE))
View(test)
stop()

Error in x[[1]] == as.name("runner") : 
  comparison (1) is possible only for atomic and list types
     x

Does runner have a problem with do.call()?

Win 7, latest R, runner 0.4.1
This works perfectly when called with df <- doRollingAnal(df):

doRollingAnal=function(df){
	#https://cran.r-project.org/web/packages/runner/vignettes/apply_any_r_function.html
	myWindow <- 5
	colVect  <- c('date','n1','n2')
	df       <- df %>% dplyr::select(any_of(colVect)) %>% 
	tidyr::nest(data=everything()) %>%
		dplyr::mutate(testData=purrr::map(data, ~ runner::runner(x=.,k=myWindow,f=rollingAnal,simplify = FALSE,na_pad=TRUE))) %>%
		tidyr::unnest(c(testData)) 
	df <- do.call("rbind",df[['testData']])
	df <- na.omit(df)
	return(df)
},

However, calling it with df <- do.call(doRollingAnal, list(df)) produces:

  Error in `mutate_cols()`:
! Problem with `mutate()` column `testData`.
i `testData = purrr::map(...)`.
x comparison (1) is possible only for atomic and list types
Caused by error in `x[[1]] == as.name("runner")`:
! comparison (1) is possible only for atomic and list types
Run `rlang::last_error()` to see where the error occurred.
     x
  1. +-base::do.call(pslScore$doRollingAnal, list(partA)) at R/tests/scoring.R:2920:12
  2. +-`<rfMthdDf>`(`<df[,5668]>`)
  3. | \-... %>% tidyr::unnest(c(testData)) at C:\Users\test.R:706:12
  4. +-tidyr::unnest(., c(testData))
  5. +-dplyr::mutate(...)
  6. +-dplyr:::mutate.data.frame(...)
  7. | \-dplyr:::mutate_cols(.data, ..., caller_env = caller_env())
  8. |   +-base::withCallingHandlers(...)
  9. |   \-mask$eval_all_mutate(quo)
 10. +-purrr::map(...)
 11. | \-.f(.x[[i]], ...)
 12. |   +-runner::runner(...)
 13. |   \-runner:::runner.data.frame(...)
 14. |     \-runner:::set_from_attribute_difftime(x, k)
 15. |       \-runner:::get_runner_call_arg_names()
 16. |         +-base::which(...)
 17. |         \-base::vapply(...)
 18. |           \-runner FUN(X[[i]], ...)
 19. +-base::.handleSimpleError(...)
 20. | \-dplyr h(simpleError(msg, call))
 21. |   \-rlang::abort(...)
 22. |     \-rlang:::signal_abort(cnd, .file)
 23. |       \-base::stop(fallback)
 24. \-global `<fn>`()
 25.   \-lobstr::cst()

Thank you.
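The traceback shows the failing comparison being reached from runner's data.frame method. A possible workaround, offered only as a sketch, is to roll over row positions with a plain integer vector, so that S3 dispatch does not select the data.frame method at all. The data and the helper roll_rows are hypothetical.

```r
library(runner)

# Hypothetical data standing in for the reporter's df.
df <- data.frame(n1 = c(2, 4, 6, 8, 10, 12), n2 = 6:1)

# Roll over row positions; the window function indexes df itself.
roll_rows <- function(df, k) {
  runner(
    x = seq_len(nrow(df)),
    k = k,
    f = function(rows) mean(df$n1[rows])
  )
}

rolled <- roll_rows(df, k = 3)
```

Inside f, df[rows, ] gives the same window of rows that the data.frame method would have passed, so an existing window function can usually be reused with a one-line wrapper.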

include static code tests

add and fix spelling

apply rolling with function that has 2 vector args

Is it possible to solve this problem with your package? https://stackoverflow.com/questions/64062969/rolling-over-function-with-2-vector-arguments

Q: multiple values having same idx

Hi,
I am wondering what the use case is for multiple values having the same idx.

Error in sys.call(-runner_call_idx)

I've been getting the following error message:
Error in sys.call(-runner_call_idx) : invalid 'which' argument

sys.call(-runner_call_idx)
get_runner_call_arg_names()
set_from_attribute_difftime(x, k)
runner.data.frame(x, lag = "1 months", k = "4 months", idx = x$date, f = function(x) { cor(x$a, x$b) })
runner::runner(x, lag = "1 months", k = "4 months", idx = x$date, f = function(x) { cor(x$a, x$b) })

I've checked if it had something to do with my code but I get the same error message when running the following code from the tutorial:

x <- data.frame(
date = seq.Date(Sys.Date(), Sys.Date() + 365, length.out = 20),
a = rnorm(20),
b = rnorm(20)
)

runner::runner(
x,
lag = "1 months",
k = "4 months",
idx = x$date,
f = function(x) {
cor(x$a, x$b)
}
)

Any help on how to solve this is much appreciated :)

First result incorrectly omitted with na_pad = TRUE

As you can see below, test is non-NA only from row 13, with value 2021-03-13 13:15:00.

I believe that for a 1 hour window it could start from row 12, with value 2021-03-13 13:10:00.

slot <- seq.POSIXt(from = as.POSIXct("2021-03-13 13:10:00"), 
                     to = as.POSIXct("2021-03-13 14:10:00"),
                     by = "5 min")
test <- runner::runner(x = slot, idx = slot, k = "1 hour", 
                       f = function(x) x[1], na_pad = TRUE)
test <- as.POSIXct(test, origin = "1970-1-1")
print(data.frame(slot, test))

#                   slot                test
# 1  2021-03-13 13:10:00                <NA>
# 2  2021-03-13 13:15:00                <NA>
# 3  2021-03-13 13:20:00                <NA>
# 4  2021-03-13 13:25:00                <NA>
# 5  2021-03-13 13:30:00                <NA>
# 6  2021-03-13 13:35:00                <NA>
# 7  2021-03-13 13:40:00                <NA>
# 8  2021-03-13 13:45:00                <NA>
# 9  2021-03-13 13:50:00                <NA>
# 10 2021-03-13 13:55:00                <NA>
# 11 2021-03-13 14:00:00                <NA>
# 12 2021-03-13 14:05:00                <NA>
# 13 2021-03-13 14:10:00 2021-03-13 13:15:00

Also, for some reason the POSIXct class is stripped from test, but this is less of a problem.

runner to accept output of any type

Extend function for other data types

grouped_df - k value ignored if not using run_by

Hi,

I came across an issue with the grouped_df/dplyr usage of runner. Using the example from the vignette, the following works fine and produces the desired windows:

x <- cumsum(rnorm(20))
y <- 3 * x + rnorm(20)
date <- Sys.Date() + cumsum(sample(1:3, 20, replace = TRUE)) # unequaly spaced time series
group <-  rep(c("a", "b"), each = 10)


data.frame(date, group, y, x) %>%
  group_by(group) %>%
  run_by(idx = "date", k = "5 days") %>%
  mutate(
    alpha_5 = runner(
      x = ., 
      f = function(x) {
        print(nrow(x))
        coefficients(lm(x ~ y, x))[1]
      }
    ),
    beta_5 = runner(
      x = ., 
      f = function(x) {
        print(nrow(x))
        coefficients(lm(x ~ y, x))[1]
      }
    )
  )

# Output:
[1] 1
[1] 2
[1] 3
[1] 2
[1] 2
[1] 2
[1] 3
[1] 3
[1] 3
[1] 3
[1] 1
[1] 2
[1] 3
[1] 3
[1] 3
[1] 4
[1] 4
[1] 4
[1] 3

However, if I move idx and k into the runner calls, the window always starts at the first index and grows with each iteration:

library(runner)
x <- cumsum(rnorm(20))
y <- 3 * x + rnorm(20)
date <- Sys.Date() + cumsum(sample(1:3, 20, replace = TRUE)) # unequaly spaced time series
group <-  rep(c("a", "b"), each = 10)


data.frame(date, group, y, x) %>%
  group_by(group) %>%
  #run_by(idx = "date", k = "5 days") %>%
  mutate(
    alpha_5 = runner(
      x = ., idx = "date", k = "5 days",
      f = function(x) {
        print(nrow(x))
        coefficients(lm(x ~ y, x))[1]
      }
    ),
    beta_5 = runner(
      x = ., idx = "date", k = "5 days",
      f = function(x) {
        print(nrow(x))
        coefficients(lm(x ~ y, x))[1]
      }
    )
  )

# Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

This might not be visible to users in nearly all contexts, but it created strange results in my analysis, which led me to raise this issue.

The results differ quite drastically, though. Is this probably a bug?

Best
Thorsten

RStudio crashes when using Rcpp inside runner

I have an Rcpp function that works as expected:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double backtest_cpp_gpt(NumericVector returns, NumericVector indicator, double threshold) {
  int n = indicator.size();
  NumericVector sides(n, 1.0); // Initialize with 1s

  for(int z = 1; z < n; ++z) { // Start from 1 since we look back one period
    if(!NumericVector::is_na(indicator[z-1]) && indicator[z-1] > threshold) {
      sides[z] = 0;
    }
  }

  double cum_returns = 1.0; // Start cumulative returns at 1
  for(int z = 0; z < n; ++z) {
    cum_returns *= (1 + returns[z] * sides[z]);
  }
  return cum_returns - 1; // Adjust for initial value
}

// [[Rcpp::export]]
NumericVector opt(NumericVector returns,
                  List indicators,
                  NumericVector thresholds) {
  // Define the number of indicators and the number of thresholds
  // and loop over them to calculate the backtest results
  int n_thresholds = thresholds.size();
  NumericVector results(n_thresholds);

  // Loop over n_thresholds and calculate the backtest results
  for(int j = 0; j < n_thresholds; ++j) {
    NumericVector indicator_ = indicators[j];
    results[j] = backtest_cpp_gpt(returns, indicator_, thresholds[j]);
  }

  return results;
}

Called on its own, the function works as expected:

opt(returns_wf, vars_wf_l, thresholds_wf)

but when I use this function inside runner

runner(
x = df,
f = function(x) {
opt(x$returns, vars_wf_l, thresholds_wf)
},
at = 100
)

it crashes RStudio.

allow lag_run to pick nearest value lower and greater than lag

Currently it always picks the value that is closer to the actual lag.