gslab-econ / gslab_r


gslab_r's Issues

SaveData cannot take in data.table objects

Error message when exporting data.table objects using SaveData:

expert_iran |>
  SaveData(key = c("city", 'date'),
           outfile = file.path(outdir, 'expert_iran_protests.csv'),
           logfile = file.path(outdir, 'expert_iran_protests.log'),
           appendlog = F)

Error: i evaluates to a logical vector length 2 but there are 460 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.

A temporary fix is to convert the data.table to a data.frame before calling SaveData.

I envision 3 options:

1. Do nothing; the user has to call setDF first in order to use SaveData.
2. Do almost nothing, except add one line similar to setDF(df) at the very beginning of the main function in SaveData.R.
3. Modify (possibly significantly) SaveData.R so that it can take in both data.table and data.frame objects in a computationally efficient manner.
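
The second option above could look roughly like the following sketch. The helper name as_plain_df is hypothetical and not part of gslab_r. It uses base R's as.data.frame() instead of setDF(), which leaves the caller's object unchanged (setDF() converts by reference). The example uses a data.frame with an extra class as a stand-in for a data.table, so that it runs without the data.table package installed.

```r
# Hypothetical helper, not part of gslab_r: coerce any data.frame subclass
# (for example a data.table) to a plain data.frame before further processing.
as_plain_df <- function(df) {
  if (!is.data.frame(df)) {
    stop("Input must be a data.frame or a subclass of it.")
  }
  # as.data.frame() drops the subclass; the caller's object is not modified
  as.data.frame(df)
}

# Stand-in for a data.table: a data.frame carrying an additional class
dt_like <- structure(
  data.frame(city = c("a", "b"), value = 1:2),
  class = c("data_frame_subclass", "data.frame")
)

plain <- as_plain_df(dt_like)
class(plain)
```

Calling such a helper on the first line of the main function would keep the rest of SaveData.R unchanged, at the cost of one copy of the data.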

MD5 changes by operating system

In another project we noticed that MD5 hashes change for the same R object depending on the operating system. We also noticed the same issue for outputted CSV files.

When computing hashes on outputted CSV files, we were able to get a consistent hash across operating systems by specifying the same line ending on each of them.

The issue remains for hashes computed inside R.
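
For the CSV case, a minimal sketch of pinning the line ending is shown below. The helper name write_csv_eol is hypothetical. It opens the file connection in binary mode so that the requested line ending is written as-is on every platform, and it hashes the file with tools::md5sum(). It does not address the hashes computed on R objects inside R.

```r
# Hypothetical helper: write a CSV with an explicit line ending.
# The connection is opened in binary mode ("wb") so that "\n" is not
# translated to "\r\n" on Windows.
write_csv_eol <- function(df, path, eol = "\n") {
  con <- file(path, open = "wb")
  on.exit(close(con))
  write.csv(df, con, row.names = FALSE, eol = eol)
  invisible(path)
}

df <- data.frame(id = 1:3, name = c("a", "b", "c"))

path_lf   <- tempfile(fileext = ".csv")
path_lf2  <- tempfile(fileext = ".csv")
path_crlf <- tempfile(fileext = ".csv")

write_csv_eol(df, path_lf)                  # LF line endings
write_csv_eol(df, path_lf2)                 # same content, same line endings
write_csv_eol(df, path_crlf, eol = "\r\n")  # CRLF line endings

hash_lf   <- unname(tools::md5sum(path_lf))
hash_lf2  <- unname(tools::md5sum(path_lf2))
hash_crlf <- unname(tools::md5sum(path_crlf))

hash_lf == hash_lf2   # identical bytes give identical hashes
hash_lf == hash_crlf  # only the line ending differs, yet the hash changes
```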

@santiagohermo, @miikapaal fyi.

Variable order in SaveData logs

SaveData reorders variables alphabetically when writing the log with variable summaries. In this issue we will make a minor modification to keep the original column order of the data frame.

Finalize SaveData() function

In this issue, I'll bring in the SaveData() function from the Protests project.

Develop GSLabMLE package

Following #2, the goal is to migrate the gslab_mle MATLAB library to an R package, using the standard workflow outlined in the README.md.

Investigate potential bug in sortbykey parameter in SaveData

The purpose of this issue (#28) is to address a potential bug in the sortbykey parameter of the SaveData package, whereby row indices may not be properly sorted given a set of key inputs when the argument is set to TRUE. Any proposed revisions should be self-contained within ~/SaveData/R/SaveData.R. The relevant development branch is issue28_sbk_savedata.
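
One way to test for the suspected behaviour is to check, after saving, whether the rows really are in key order. The sketch below is an assumption about how such a check could be written in base R; is_sorted_by_key is a hypothetical name and is not taken from SaveData.R.

```r
# Hypothetical check: TRUE if the rows of df are already ordered by the
# key columns. order() leaves ties in their original position, so a sorted
# data frame yields the identity permutation 1, 2, ..., n.
is_sorted_by_key <- function(df, key) {
  # unname() prevents column names from being matched to order()'s own
  # arguments (for example a column called "decreasing")
  ord <- do.call(order, unname(as.list(df[key])))
  identical(ord, seq_len(nrow(df)))
}

sorted_df <- data.frame(
  city = c("a", "a", "b", "b"),
  date = c(1, 2, 1, 2)
)
unsorted_df <- sorted_df[c(2, 1, 4, 3), ]

is_sorted_by_key(sorted_df, c("city", "date"))
is_sorted_by_key(unsorted_df, c("city", "date"))
```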

cc @jmshapir @rcalvo12

Create binned scatter function

Per the comment here, the goal is to adapt the binned scatter function(s) we use in ad-price-driver for cross-project/repo usage. Relevant code and documentation can be found here (note that not all files in the linked directory are for creating or testing binscatter.R).

@jmshapir fyi.

Improve dependency structure for SaveData

Per an e-mail suggestion today from Matthias, we thought that switching to the packagename::function convention when calling functions from other packages would avoid bugs in cases where another preloaded package has a function whose name is the same as one of the functions we are calling in SaveData.

Here are the changes that Matthias suggested.

One issue is that the unit tests fail when I apply these changes, and I couldn't immediately fix it. @jmshapir, my gslab_r bandwidth over the remaining time may not suffice to complete this issue, so I am not self-assigning, but please let me know if you disagree.
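
The kind of name clash that the packagename::function convention protects against can be illustrated with a small base R sketch. This is not one of the suggested changes; it only shows the mechanism, using a user-defined sd that masks stats::sd.

```r
# A user-defined function that masks stats::sd on the search path
sd <- function(x) "masked"

unqualified <- sd(c(1, 2, 3))         # resolves to the masking function
qualified   <- stats::sd(c(1, 2, 3))  # always resolves to the stats package

rm(sd)  # remove the masking function again

unqualified
qualified
```

The qualified call returns the standard deviation regardless of what else is attached, which is the behaviour the issue wants for every external call in SaveData.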

Implement autofill function

In this issue we'll bring the autofill function we've written in Protests to this repo.

RDS capitalization

Allow alternative capitalizations of the .RDS file extension, so that ".Rds" is also accepted as a valid extension.
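
A minimal sketch of a case-insensitive extension check, assuming the extension is read from the output path; is_rds_path is a hypothetical helper name.

```r
# Hypothetical helper: TRUE when the file extension is "rds" in any
# capitalization (.RDS, .Rds, .rds, ...)
is_rds_path <- function(path) {
  tolower(tools::file_ext(path)) == "rds"
}

checks <- is_rds_path(c("data.RDS", "data.Rds", "data.rds", "data.csv"))
checks
```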

Team discussion on OOP system in R

@gentzkow wants us to put our heads together to discuss which OOP system we want to use to create our gslab_r packages. Our ultimate goal is to migrate the MATLAB classes gslab_model, gslab_mle and gslab_mde to R packages. The discussion should get everyone familiar with the OOP system in R, and decide which system we want to use to create our packages.

In my next comment I will briefly summarize what I have learned about the different OOP systems in R and describe my current implementation of gslab_model. We should look at the whole gslab_model, gslab_mle, and gslab_mde system and think about which features will be hard to implement in one system or the other.

Function for assigning scalars to variables from an input file

In this issue we will write a helper function for assigning scalars to R variables from an input file, i.e. a functionality similar to the one in loadglob.
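
As a rough illustration of what such a helper might do, the sketch below reads a text file with one "name value" pair per line and assigns each value as a numeric scalar. The file format, the function name load_scalars, and its arguments are assumptions made for this example and are not taken from loadglob or from this repo.

```r
# Hypothetical helper: read "name value" pairs and assign them as numeric
# scalars in the given environment. The file format is an assumption.
load_scalars <- function(path, envir = parent.frame()) {
  lines <- trimws(readLines(path))
  lines <- lines[nzchar(lines)]              # skip empty lines
  parts <- strsplit(lines, "[[:space:]]+")
  for (p in parts) {
    if (length(p) != 2) {
      stop("Each line must contain exactly one name and one value.")
    }
    assign(p[1], as.numeric(p[2]), envir = envir)
  }
  # Return the names that were assigned, invisibly
  invisible(vapply(parts, function(p) p[1], character(1)))
}

input_path <- tempfile(fileext = ".txt")
writeLines(c("alpha 0.5", "", "n_draws 100"), input_path)

params <- new.env()
assigned <- load_scalars(input_path, envir = params)
params$alpha
params$n_draws
```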

Migrate numerical_derivatives from MATLAB to R

Migrate from here. Should follow the standard workflow.

Update CRAN mirror

Berkeley shut down their CRAN mirror, so I will move all calls to it to the Washington University in St. Louis CRAN mirror.

Create and improve the basic classes

The goal of this task is to make a first pass at creating and improving the basic gslab_r classes (Model, ModelData, ModelEstimationOutput), which parallel the classes in the MATLAB library gslab_model.

Requests for SaveData

I have been working with large data and found some limitations in SaveData that prompted me to use a local version with a few additional features. I think you may be interested in adding some of them to the function. Below is a list of suggestions.

Support `feather` and `fst` file formats to save data

Issue: CSV files are saved quickly with data.table, but they use a lot of disk space. The feather format (from arrow) and the fst format (from fst) are also fast, sometimes even faster, and they compress the files, potentially saving a lot of disk space.

Proposal: Add these formats to the data dictionary.

Shortcomings of log file generation

Speed

Issue: While I haven't tested formally, I think the way the package computes summary statistics can be slow for large datasets.

Proposal: Change the computation of sumstats here. For that, make df a data.table object, and change the code to something like this:


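# Assumes `dt` is a data.table and that data.table and dplyr are installed.
# Names of the numeric columns
numeric_vars <- names(dplyr::select_if(dt, is.numeric))

# One row per numeric variable; columns are mean, sd, min, max (in that order)
numeric_sum <- t(rbind(dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = numeric_vars],
                       dt[, lapply(.SD, sd,   na.rm = TRUE), .SDcols = numeric_vars],
                       dt[, lapply(.SD, min,  na.rm = TRUE), .SDcols = numeric_vars],
                       dt[, lapply(.SD, max,  na.rm = TRUE), .SDcols = numeric_vars]))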

Optionally exclude vars

Issue: Sometimes there is a variable for which you don't want to include the summary statistics (because data are confidential, for example).

Proposal: Add a mask_vars optional argument that takes a list of variables and excludes them from the numeric sumstats computation.
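
A minimal base R sketch of how a mask_vars argument could work is given below. The function name summarise_numeric is hypothetical, and the statistics computed (mean, sd, min, max) follow the proposal above rather than the actual SaveData code.

```r
# Hypothetical sketch: summary statistics for numeric columns, skipping any
# column listed in mask_vars. Returns NULL if no column is left to summarise.
summarise_numeric <- function(df, mask_vars = character(0)) {
  is_num <- vapply(df, is.numeric, logical(1))
  # setdiff() keeps the original column order of the remaining variables
  vars <- setdiff(names(df)[is_num], mask_vars)
  if (length(vars) == 0) {
    return(NULL)
  }
  out <- vapply(
    df[vars],
    function(x) c(mean = mean(x, na.rm = TRUE),
                  sd   = stats::sd(x, na.rm = TRUE),
                  min  = min(x, na.rm = TRUE),
                  max  = max(x, na.rm = TRUE)),
    numeric(4)
  )
  t(out)  # one row per variable, one column per statistic
}

df <- data.frame(
  id     = 1:4,
  income = c(10, 20, 30, 40),   # treated as confidential in this example
  name   = letters[1:4]
)

res <- summarise_numeric(df, mask_vars = "income")
res
```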

Data with no numeric variables

Issue: In this case the output has a lot of NULLs where the sumstats should go.

Proposal: Do not compute sumstats, and exclude them from the log file, if all variables in the dataset are non-numeric.
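
The proposal amounts to guarding the summary section of the log behind a check for numeric columns. The sketch below uses a made-up log layout and a hypothetical function name, format_log, purely to show the guard; it does not reproduce the real log format.

```r
# Hypothetical sketch: build log lines, adding a summary section only when
# the data contain at least one numeric column. The layout is illustrative.
format_log <- function(df) {
  types <- vapply(df, function(x) class(x)[1], character(1))
  lines <- c("variable: type", sprintf("%s: %s", names(df), types))

  is_num <- vapply(df, is.numeric, logical(1))
  if (any(is_num)) {
    means <- vapply(df[is_num], mean, numeric(1), na.rm = TRUE)
    lines <- c(lines, "variable: mean", sprintf("%s: %g", names(means), means))
  }
  lines
}

text_only <- data.frame(city = c("a", "b"), country = c("x", "y"))
mixed     <- data.frame(city = c("a", "b"), value = c(1, 3))

log_text_only <- format_log(text_only)  # no summary section
log_mixed     <- format_log(mixed)      # includes the summary section
```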

Data loaded from STATA

Issue: Sometimes when you load data, the class of a column includes a label and other attributes. This distorts the type column of the log file.

Proposal: Clean the extra column attributes before generating the log file (or maybe even before saving).
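
A possible cleaning step is sketched below. It assumes the data were read with the haven package, which stores Stata metadata in the "label", "labels" and "format.stata" attributes and marks labelled columns with the "haven_labelled" class. The example builds such a column by hand so that it runs without haven installed; strip_stata_attributes is a hypothetical name.

```r
# Hypothetical sketch: remove Stata-related attributes so that class() reports
# the underlying type. Assumes haven-style attributes and class names.
strip_stata_attributes <- function(df) {
  df[] <- lapply(df, function(x) {
    if (inherits(x, "haven_labelled")) {
      attributes(x) <- NULL          # back to the bare underlying vector
    } else {
      attr(x, "label") <- NULL
      attr(x, "format.stata") <- NULL
    }
    x
  })
  df
}

# Build a column that mimics what a labelled Stata variable looks like
df <- data.frame(id = 1:3, income = c(10, 20, 30))
attr(df$income, "label") <- "Household income"
class(df$income) <- c("haven_labelled", "vctrs_vctr", "double")

type_before <- class(df$income)[1]
clean <- strip_stata_attributes(df)
type_after <- class(clean$income)[1]

type_before
type_after
```

Note that dropping the attributes also discards value labels, so this may be better applied only to the copy used for the log.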

Allow flexible sorting with SaveData()

Currently SaveData() sorts the data based on the keys provided and saves. We'd like to allow specifying an optional sort variable. The default variables used for sorting will still be the keys.
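
The intended interface could look roughly like the sketch below, where the sort columns default to the keys. The function name sort_for_save and the argument name sortby are assumptions for illustration, not the names used in SaveData().

```r
# Hypothetical sketch: sort by `sortby`, which defaults to the key columns.
sort_for_save <- function(df, key, sortby = key) {
  missing_cols <- setdiff(c(key, sortby), names(df))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  # unname() prevents column names from being matched to order()'s arguments
  ord <- do.call(order, unname(as.list(df[sortby])))
  out <- df[ord, , drop = FALSE]
  rownames(out) <- NULL   # reset row names so they match the new order
  out
}

df <- data.frame(
  city  = c("b", "a", "b", "a"),
  date  = c(2, 2, 1, 1),
  value = c(3, 4, 1, 2)
)

by_key   <- sort_for_save(df, key = c("city", "date"))
by_value <- sort_for_save(df, key = c("city", "date"), sortby = "value")
```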