
marketmatching's People

Contributors

klarsen1, olivroy, tarekrached, thisismath


marketmatching's Issues

no valid data in post-period using dynamic time warping

I am getting the following error when I pass the output of best_matches to inference:

Error in stopif(length(post_period) == 0, TRUE, "ERROR: no valid data in the post period") : 
  ERROR: no valid data in the post period
In addition: Warning message:
In max(date) : no non-missing arguments to max; returning -Inf

This is the code I am using to call best_matches.

mm_uk <- best_matches(data=joins_data_mm_uk,
                      id_variable="regions",
                      date_variable="dates",
                      markets_to_be_matched=c("UK"),
                      matching_variable="kpi",
                      parallel=FALSE,
                      warping_limit=1, # warping limit=1
                      dtw_emphasis=1, # rely only on dtw for pre-screening
                      matches=10, # request 10 matches
                      start_match_period="2022-06-01",
                      end_match_period="2022-08-07")

This error only seems to crop up when I set the dtw_emphasis to 1. With any other value of dtw_emphasis, the inference function runs without any problem.

What are the steps you would recommend to try and debug this? From what I can see, there is valid data in the post period for both of the selected control markets.
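For what it's worth, here is the kind of coverage check I ran (a quick sketch using the column names from the call above):

library(dplyr)
# Count rows per region after the end of the match period
joins_data_mm_uk %>%
  filter(dates > as.Date("2022-08-07")) %>%
  count(regions)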

Thanks!

Best Match Error

I'm seeing this error when I call best_matches.

Error in { : task 150 failed - "replacement has 1 row, data has 0"

Any ideas?

mm <- best_matches(data = radio,
                   id_variable = "market",
                   date_variable = "week",
                   matching_variable = "installs",
                   parallel = TRUE,
                   warping_limit = 1,
                   dtw_emphasis = 1,
                   matches = 5,
                   start_match_period = "2017-07-03",
                   end_match_period = "2018-09-10")
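One generic debugging step worth trying (an assumption on my part, not a confirmed fix): "task N failed" is how foreach reports worker errors, so re-running sequentially usually surfaces the underlying error with a full traceback:

mm <- best_matches(data = radio,
                   id_variable = "market",
                   date_variable = "week",
                   matching_variable = "installs",
                   parallel = FALSE,  # sequential run surfaces the real error
                   warping_limit = 1,
                   dtw_emphasis = 1,
                   matches = 5,
                   start_match_period = "2017-07-03",
                   end_match_period = "2018-09-10")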

Question - Not an Issue!

Loved the post and this package - thanks for sharing it!

One question I had was the following: to your knowledge, is there anything special about structural time series models that makes them especially suitable for causal analysis like this, where a model is used to forecast the time series and the differential between the forecast and the observed series is the estimate of the causal effect? For example (since I don't really understand structural time series models), could we use a block bootstrap on the same type of data with, say, a neural net as the model (the bootstrap providing the prediction intervals)? A rough sketch of what I mean follows.
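To make the question concrete, here is a minimal moving-block bootstrap sketch (illustrative only; block_len, n_boot, and fit_predict are hypothetical choices, with fit_predict standing in for whatever model, e.g. a neural net, is used):

block_bootstrap <- function(y, block_len = 7, n_boot = 500, fit_predict) {
  n <- length(y)
  starts <- seq_len(n - block_len + 1)
  replicate(n_boot, {
    # stitch randomly chosen contiguous blocks together until we have n points
    idx <- unlist(lapply(sample(starts, ceiling(n / block_len), replace = TRUE),
                         function(s) s:(s + block_len - 1)))[seq_len(n)]
    fit_predict(y[idx])  # fit the model on the resample and return its forecast
  })
}
# quantile() over the replicate() output would then give the prediction intervals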

Deprecated ggplot2 feature

Hi,

Thanks for this awesome package. It is exactly what I have been looking for to help streamline my causal inference analyses.

I am getting the warning message below when using the latest version. I guess it is nothing too serious, as it relates to ggplot visual components, but I wanted to raise the issue in any case.

The warning only seems to appear when the argument analyze_betas = TRUE is specified in the inference function.

Cheers

Warning message:
The <scale> argument of guides() cannot be FALSE. Use "none" instead as of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the MarketMatching package.
Please report the issue to the authors.
This warning is displayed once every 8 hours.
Call lifecycle::last_lifecycle_warnings() to see where this warning was generated.
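For reference, the deprecated pattern and its replacement look like this (a minimal standalone sketch, not the package's actual code):

library(ggplot2)
p <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) + geom_point()
p + guides(colour = FALSE)   # deprecated as of ggplot2 3.3.4
p + guides(colour = "none")  # the replacement the warning suggests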

Weights for data points

Thank you for this library! Is there any way to incorporate weights for the data points (e.g., if the daily statistics are based on very different sample sizes, one might want to weight the values accordingly in the analysis)?

v1.1.3 in CRAN

Hello,
I'd like to test the prospective_power function, but it seems the latest version is not available on CRAN yet, and I cannot compile packages from GitHub on my work laptop.

Thanks!

Number of controls is limited to five

Edit: Inspecting the source code, I found the control_matches parameter of the inference() function, which resolves my issue; the call is sketched below. However, this parameter is not documented anywhere, so I spent way too much time on this :/
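For anyone else hitting this, the fix looks something like the following (a sketch based on the parameter found in the source):

results <- MarketMatching::inference(matched_markets = mm,
                                     test_market = "CPH",
                                     control_matches = 8,  # lift the default cap of 5
                                     end_post_period = "2015-10-01")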

I've been trying to add more than 5 controls using MarketMatching. Setting matches = 8 in best_matches() correctly returns 8 controls for each test market. However, the inference() function always seems to use at most 5 controls. I have also seen this behavior with other datasets that have hundreds of potential controls. I made sure that this is not an issue of the underlying CausalImpact package or other required / implicitly loaded packages (BoomSpikeSlab, bsts). The following code reproduces the issue. Notice that adjusting matches to a value below 5 has the expected effect:

data(weather, package="MarketMatching")
mm <- best_matches(data=weather,
                   id_variable="Area",
                   date_variable="Date",
                   matching_variable="Mean_TemperatureF",
                   parallel=FALSE,
                   warping_limit=1, # warping limit=1
                   dtw_emphasis=1, # rely only on dtw for pre-screening
                   matches=8,    # Values below 5 have an effect on inference, but not above!?
                   start_match_period="2014-01-01",
                   end_match_period="2014-10-01")

results <- MarketMatching::inference(matched_markets = mm,
                                    test_market = "CPH",
                                    end_post_period = "2015-10-01")

head(results$CausalImpactObject$model$bsts.model$coefficient)


I'm working with this environment:
> 
> R version 3.5.1 (2018-07-02)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 14393)
> 
> Matrix products: default
> 
> locale:
> [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
> [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
> [5] LC_TIME=German_Germany.1252    
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> 
> other attached packages:
>  [1] MarketMatching_1.1.1 bindrcpp_0.2.2       forcats_0.3.0       
>  [4] stringr_1.3.1        dplyr_0.7.6          purrr_0.2.5         
>  [7] readr_1.1.1          tidyr_0.8.1          tibble_1.4.2        
> [10] ggplot2_3.0.0        tidyverse_1.2.1      CausalImpact_1.2.4
> [13] bsts_0.8.0           xts_0.11-1           zoo_1.8-4           
> [16] BoomSpikeSlab_1.0.0  Boom_0.8             MASS_7.3-50         
> [19] RevoUtils_11.0.1     RevoUtilsMath_11.0.0
> 
> loaded via a namespace (and not attached):
>  [1] Rcpp_0.12.18      lubridate_1.7.4   lattice_0.20-35   foreach_1.4.4    
>  [5] assertthat_0.2.0  digest_0.6.15     IRdisplay_0.5.0   R6_2.2.2         
>  [9] cellranger_1.1.0  plyr_1.8.4        repr_0.15.0       backports_1.1.2  
> [13] evaluate_0.11     httr_1.3.1        pillar_1.3.0      rlang_0.2.1      
> [17] lazyeval_0.2.1    uuid_0.1-2        readxl_1.1.0      data.table_1.12.0
> [21] rstudioapi_0.8    labeling_0.3      munsell_0.5.0     proxy_0.4-22     
> [25] broom_0.5.0       compiler_3.5.1    modelr_0.1.2      pkgconfig_2.0.1  
> [29] base64enc_0.1-3   htmltools_0.3.6   tidyselect_0.2.4  codetools_0.2-16 
> [33] dtw_1.20-1        crayon_1.3.4      withr_2.1.2       grid_3.5.1       
> [37] nlme_3.1-137      jsonlite_1.5      gtable_0.2.0      magrittr_1.5     
> [41] scales_1.0.0      cli_1.0.0         stringi_1.2.4     reshape2_1.4.3   
> [45] doParallel_1.0.14 xml2_1.2.0        IRkernel_0.8.12   iterators_1.0.10 
> [49] tools_3.5.1       glue_1.3.0        hms_0.4.2         parallel_3.5.1   
> [53] colorspace_1.3-2  rvest_0.3.2       pbdZMQ_0.3-3      bindr_0.1.1      
> [57] haven_1.1.2      

MAPE function

I looked into the MAPE helper function:

mape_no_zeros <- function(test, ref){
  d <- cbind.data.frame(test, ref)
  d <- subset(d, abs(test)>0)
  return(mean(abs(lm(test ~ ref, data=d)$residuals)/d$test))
}

If test contains negative values, wouldn't we be calculating each pair's deviation as |residual|/test instead of |residual/test|, letting negative terms into the mean? I.e., shouldn't it be:

return(mean(abs(lm(test ~ ref, data=d)$residuals/d$test)))

https://en.wikipedia.org/wiki/Mean_absolute_percentage_error
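A tiny demonstration of the sign issue (made-up numbers; three points so the lm residuals are non-zero):

test <- c(-2, 3, 5)
ref  <- c(-1, 2, 4)
res  <- lm(test ~ ref)$residuals
mean(abs(res)/test)  # the test = -2 term enters the mean with a negative sign
mean(abs(res/test))  # every term non-negative, as MAPE intends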

How to get the value at which the response would have been significant?

library(MarketMatching)
data(weather, package="MarketMatching")
mm <- best_matches(data=weather,
                   id_variable="Area",
                   date_variable="Date",
                   matching_variable="Mean_TemperatureF",
                   parallel=FALSE,
                   warping_limit=1, # warping limit=1
                   dtw_emphasis=1, # rely only on dtw for pre-screening
                   matches=5, # request 5 matches
                   start_match_period="2014-01-01",
                   end_match_period="2014-10-01")
                   
results <- MarketMatching::inference(matched_markets = mm,
                                    test_market = "CPH",
                                    end_post_period = "2015-10-01")
                                    
pred = sum(results$Predictions$Predicted)
actual = sum(results$Predictions$Response)
lower = sum(results$Predictions$lower_bound)
upper = sum(results$Predictions$upper_bound)
effect = actual-pred

Is it possible to determine at what cumulative response value the response would have become significant?

Let's say I have a predicted cumulative value of 19007, with a confidence interval of [16053, 21967].
The cumulative response value is 18497, which is not significant at the 95% confidence level. That makes sense, of course, since the response falls within the interval.
However, I want to know how far off the response value was from reaching statistical significance. Is this value presented in the package output?
Is it maybe as easy as comparing against the summed bound values, as in the sketch below?
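In other words, something like this (a sketch using the sums computed above; it assumes the summed pointwise bounds approximate the cumulative interval, which I'm not sure is valid):

# The response is below the prediction, so significance here would mean the
# cumulative response falling below the lower bound of the interval.
gap_to_significance <- actual - lower  # 18497 - 16053 = 2444
gap_to_significance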

Unexpected error in MarketMatching::inference()

Thanks first of all for the useful package; I've used it on several cases and bumped into this error recently.

I tried debugging and found my way down to the CausalImpact() call; however, I can't find an explanation for the error.

Error:

Error in data.frame(y.model, cum.y.model, point.pred, cum.pred) : 
  arguments imply differing number of rows: 122, 123

In MarketMatching::inference(), the ts variable that is created and passed on to CausalImpact::CausalImpact(ts, pre.period, post.period, alpha = alpha, model.args = bsts_modelargs) has a dim() of 122 10, so I'm not sure where the 123 comes from.

Apologies for not providing a reproducible example as data is confidential; further, toy problems don't have this issue.

Look forward to insights.

Prior SD Determination

I was reading your original post, and it looks like you must be running the CausalImpact package over a grid of prior SDs and plotting MAPE and DW. Is this done just for the pre-period? If so, do you think it would be better to determine the best level out of sample (cross-validation)? Do you find it useful to change the SD based on these findings? A sketch of the grid I have in mind is below.
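Something like the following is what I imagine (a sketch; prior_level_sd is an inference() argument, but reading MAPE and DW off the returned object this way is my assumption):

for (sd in c(0.001, 0.01, 0.1, 1)) {                      # grid of prior SDs
  res <- MarketMatching::inference(matched_markets = mm,  # mm from best_matches()
                                   test_market = "CPH",
                                   prior_level_sd = sd,
                                   end_post_period = "2015-10-01")
  print(c(prior_sd = sd, MAPE = res$MAPE, DW = res$DW))
}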

Expose full functionality of bsts

This package is really helpful for the convenience of the analysis flow. It would be improved, though, by exposing all the parameters of bsts - for example, the seasonality component (see the sketch below).
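For context, this is the kind of bsts flexibility the request refers to (a standalone sketch with toy data, not MarketMatching's actual internals):

library(bsts)
set.seed(1)
y  <- rnorm(104) + sin(2 * pi * (1:104) / 52)  # two years of weekly toy data
ss <- AddLocalLevel(list(), y)
ss <- AddSeasonal(ss, y, nseasons = 52)        # the seasonality component mentioned above
model <- bsts(y, state.specification = ss, niter = 500)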

calculation of MAPE

return(mean(abs(stats::lm(test ~ ref, data=d)$residuals)/abs(d$test)))

Hey @klarsen1,

I love this wrapper you've created - it's very useful! I'm trying to understand the MAPE calculation. Is there a reason you're fitting an lm and taking the residuals of that lm? I tried to reproduce the numbers this helper function reads out, but could not do so.

# make some fake data
v1 <- c(1,3,6)
v2 <- c(2,3,7)


mape_no_zeros <- function(test, ref){
  d <- cbind.data.frame(test, ref)
  d <- subset(d, abs(test)>0)
  return(mean(abs(stats::lm(test ~ ref, data=d)$residuals)/abs(d$test)))
}


mape_hand_calc <- function(test, ref){
  # following https://en.wikipedia.org/wiki/Mean_absolute_percentage_error
  if(length(test) != length(ref)){
    stop("inputs need to be same length")
  }
  
  # setup
  n_points <- length(ref)
  rolling_sum <- 0
  
  # for each time point, get difference between reference and pred value, divide by ref
  # add abs of this to running total
  for(i in 1:n_points){
    rolling_sum <- rolling_sum + abs((ref[i] - test[i]) / ref[i])
  }
  return(rolling_sum / n_points)
}

# show results
mape_hand_calc(v2, v1)
mape_hand_calc(v1, v2)
mape_no_zeros(v1, v2)
mape_no_zeros(v2, v1)

Also, I don't mean this as a contribution; it's just a question.

RelativeDistance normalised by 'ref' rather than 'test'

Hi,

On line 47 of functions.R, in the calculate_distances function, the dtw distance is normalised using the abs(sum()) of the query time series:

dist <- dtw(test, ref, window.type=sakoeChibaWindow, window.size=warping_limit)$distance / abs(sum(test))

However, I can only replicate the RelativeDistance values returned in BestMatches if the dtw distance is divided by the abs(sum()) of the reference time series; naively this makes sense, as the reference series is common amongst all possible matches.

I don't understand how this is the case given the code, though, as the ref and test vars are defined correctly - the mkts variable has all columns in the correct/expected positions, i.e. query at index 1 and ref at 2:

mkts <- create_market_vectors(data, ThisMarket, ThatMarket)
test <- mkts[[1]]
ref <- mkts[[2]]

I.e. in mkts.csv.zip, column 1 is my query series and column 2 is my ref (checked and double-checked!). The BestMatches dataframe returned by best_matches() lists the RelativeDistance as 0.6599712, but this value is equal to:

dtw(mkts[[1]], mkts[[2]], window.type=sakoeChibaWindow, window.size=1)$distance / abs(sum(mkts[[2]]))

as opposed to:

dtw(mkts[[1]], mkts[[2]], window.type=sakoeChibaWindow, window.size=1)$distance / abs(sum(mkts[[1]]))

Apologies if I'm missing something obvious.

Issue with MarketMatching::inference method

Hi Team,
Can someone help me with this error, or explain more about when it occurs?

Error in stopif(length(post_period) == 0, TRUE, "ERROR: no valid data in the post period") :
  ERROR: no valid data in the post period

Scenario:
When I use the MarketMatching::inference method, I get this error.

Currently, I am specifying matches=3, and one of the matches (markets) returned by MarketMatching::best_matches has all-zero values in the column being matched (matching_variable).

Note: it doesn't have any null/NaN values, and there are no gaps in the pre or post period.

Thanks in Advance

Error when downloading/installing the repo/library in R.

When running the following line to install the package:
devtools::install_github("klarsen1/MarketMatching", build_vignettes=TRUE)

I get the following error:

Downloading GitHub repo klarsen1/MarketMatching@master
Error in FUN(X[[i]], ...) : 
  Invalid comparison operator in dependency: >= 

I am using R version 3.4.1 and devtools_2.0.1.

Thanks a lot for looking into this issue.

ERROR: no valid data in the post period

Hello, I'm also having issues processing my dataset. It's really weird, because the same dataset works fine when I process a different event.

The whole dataset time frame goes from 22/07/01 to 22/09/27.
The intervention happened between 22/08/15 and 22/09/11, which leaves 16 days for the post period.
The date column is parsed as a date, so that's not the issue.

For my use case, this has only happened with this particular analysis so far. I have run other analyses using the same dataset but different combinations of test/control markets and events.

Here is the summary() of the data passed to the best_matches method:

id_var             date_var            match_var      
 Length:10340       Min.   :2022-07-01   Min.   :  1.000  
 Class :character   1st Qu.:2022-07-21   1st Qu.:  1.000  
 Mode  :character   Median :2022-08-12   Median :  2.000  
                    Mean   :2022-08-11   Mean   :  5.399  
                    3rd Qu.:2022-09-03   3rd Qu.:  5.000  
                    Max.   :2022-09-27   Max.   :145.000 

I also checked the dataset itself to see if there are missing dates or something; that's also not the case.

Specify id variable as factor in PlotActuals

Hi there!
I frequently work with the MarketMatching package (and I love it!). One minor bug I frequently come across: numeric id variables result in a wrongly specified plot as the outcome of the inference function. In the code that generates the PlotActuals plot, this could easily be fixed by casting the colour variable to a factor right in the plot. (It is of course also possible to convert the id variable to a factor earlier on, but in my experience this often leads to other errors, such as an incorrect assignment of factor levels to id variables.) The following tiny change to line 585 and below should do the trick:

plotdf <- data[data$id_var %in% c(test_market, control_market),]
results[[12]] <- ggplot(data=plotdf, aes(x=date_var, y=match_var, colour=as.factor(id_var))) +
  geom_line() +
  theme_bw() + theme(legend.title = element_blank(), axis.title.x = element_blank()) + ylab("") + xlab("Date") +
  geom_vline(xintercept=as.numeric(MatchingEndDate), linetype=2) +
  scale_y_continuous(labels = scales::comma, limits=c(ymin, ymax))

Would it be possible to implement this change in the next version of MarketMatching?

Cannot install package

I followed the description in the repo on how to install the package, but:

> install.packages("MarketMatching")
Warning in install.packages :
  package ‘MarketMatching’ is not available (for R version 3.4.4)

I also tried to install from GitHub, but that fails on the dependencies Boom and BoomSpikeSlab:

Warning in install.packages : dependencies ‘Boom’, ‘BoomSpikeSlab’ are not available

Documentation typo

Lines 374 and 375 of functions.R should be:

#' \item{\code{TestData}}{A \code{vector} with the test market data}
#' \item{\code{ControlData}}{A \code{data.frame} with the data for the control markets}

Currently it is this:

#' \item{\code{TestData}}{A \code{data.frame} with the test market data}
#' \item{\code{TestData}}{A \code{data.frame} with the data for the control markets}

How to Leave Out Dates from Training and Inference for Validation?

Does MarketMatching allow you to select a post period start date when running inference? I would ideally like to use best_matches to identify some good control markets over a training period, and then test out whether the correlation still holds on a validation period. In order to do this, I need to leave out some dates between the matching period and the inference period. Is there a way to do this? Thank you.

no valid data in the post period

I have a dataset with dates spanning from 2018-02-02 to 2018-05-24 and an intervention on 2018-05-04.

I've set
start_match_period="2018-02-02"
end_match_period="2018-05-03"
end_post_period = "2018-05-24"
which should be correct but I keep getting this error and have no idea why:

Error in stopif(length(post_period) == 0, TRUE, "ERROR: no valid data in the post period") :
ERROR: no valid data in the post period
In addition: Warning message:
In max(date) : no non-missing arguments to max; returning -Inf

@klarsen1 can you please help me out?
