gslab_r's People

Contributors

arosenbe, ew487, jmshapir, matthiasweigand, rcalvo12, veli-m-andirin, yuchuan2016, zhizhongpu

gslab_r's Issues

SaveData cannot take in data.table objects

Error message when exporting data.table objects using SaveData:

expert_iran |>
  SaveData(key = c("city", "date"),
           outfile = file.path(outdir, "expert_iran_protests.csv"),
           logfile = file.path(outdir, "expert_iran_protests.log"),
           appendlog = FALSE)

Error: i evaluates to a logical vector length 2 but there are 460 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.

A temporary fix is to convert the data.table to a data.frame before calling SaveData.

I envision 3 options:

  1. Do nothing; the user has to call setDF first in order to use SaveData.
  2. Do almost nothing, except add one line similar to setDF(df) at the very beginning of the main function in SaveData.R.
  3. Modify (possibly significantly) SaveData.R so that it accepts both data.table and data.frame objects in a computationally efficient manner.
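
Option 2 could look roughly like the sketch below. It assumes the failure comes from data.frame-style indexing being applied to a data.table (with a single index, a data.frame selects columns while a data.table filters rows); the issue does not confirm this. The helper name coerce_to_data_frame is hypothetical and is not a function in SaveData.R.

```r
# Hypothetical helper: normalize any data.frame subclass (data.table, tibble)
# to a plain data.frame before the rest of the saving logic runs.
coerce_to_data_frame <- function(df) {
  if (!is.data.frame(df)) {
    stop("Input must be a data.frame or a subclass of it.")
  }
  # as.data.frame() returns a plain data.frame and, unlike
  # data.table::setDF(), does not modify the caller's object by reference.
  as.data.frame(df)
}

# Stand-in for a data.table: a data.frame carrying an extra leading class.
dt_like <- structure(
  data.frame(city = c("a", "b"), date = c(1, 2)),
  class = c("my_table", "data.frame")
)
plain <- coerce_to_data_frame(dt_like)
class(plain)
```

Compared with setDF, this leaves the user's object as a data.table after the call, at the cost of an extra copy for large inputs.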

MD5 changes by operating system

In another project we noticed that MD5 hashes change for the same R object depending on the operating system. We also noticed the same issue for outputted CSV files.

When computing hashes on outputted CSV files, we were able to get the same hash on every operating system by forcing the same line ending everywhere.

The issue remains for hashes computed inside R.
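
For the CSV case, the line-ending fix can be sketched as follows. This is only an illustration: it normalizes CRLF to LF at hashing time rather than at write time, and md5_normalized is a hypothetical helper name. It does not address hashes of in-memory R objects.

```r
# Hypothetical helper: MD5 of a text file after converting CRLF to LF,
# so files that differ only in line endings hash identically.
md5_normalized <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  if (length(bytes) > 1) {
    cr <- as.raw(0x0d)
    lf <- as.raw(0x0a)
    # Drop every CR that is immediately followed by an LF.
    drop <- bytes == cr & c(bytes[-1] == lf, FALSE)
    bytes <- bytes[!drop]
  }
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(bytes, tmp)
  unname(tools::md5sum(tmp))
}

# Two files with the same content but different line endings.
unix_file <- tempfile(fileext = ".csv")
windows_file <- tempfile(fileext = ".csv")
writeBin(charToRaw("city,date\nA,1\n"), unix_file)
writeBin(charToRaw("city,date\r\nA,1\r\n"), windows_file)

same_hash <- md5_normalized(unix_file) == md5_normalized(windows_file)
```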

@santiagohermo, @miikapaal fyi.

Variable order in SaveData logs

SaveData reorders variables alphabetically when writing the log with variable summaries. In this issue we will make a minor modification to keep the original column ordering of the data frame.
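
A minimal sketch of one way to restore the original ordering, assuming the log summary is held in a table with one row per variable and a column named variable. That structure is an assumption, and restore_column_order is a hypothetical helper, not part of SaveData.R.

```r
# Hypothetical helper: reorder the rows of a per-variable summary table so
# they follow the column order of the data frame being saved.
restore_column_order <- function(summary_table, df) {
  summary_table[match(names(df), summary_table$variable), , drop = FALSE]
}

df <- data.frame(zip = 1:2, age = c(30, 40), name = c("x", "y"))

# Summary rows as they might come out after an alphabetical sort.
alphabetical <- data.frame(
  variable = c("age", "name", "zip"),
  type     = c("numeric", "character", "integer"),
  stringsAsFactors = FALSE
)

in_data_order <- restore_column_order(alphabetical, df)
```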

Create binned scatter function

Per the comment here, the goal is to adapt the binned scatter function(s) we use in ad-price-driver for cross-project/repo usage. Relevant code and documentation can be found here (note that not all files in the linked directory are for creating/testing binscatter.R).
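
For readers unfamiliar with the technique, a generic binned scatter can be sketched in base R as below. This is not the binscatter.R implementation from ad-price-driver; the function name binned_means and its arguments are hypothetical, and it omits features such as controls or weights.

```r
# Hypothetical helper: split x into n_bins equal-count bins and return the
# mean of x and y within each bin.
binned_means <- function(x, y, n_bins = 20) {
  stopifnot(length(x) == length(y), n_bins >= 1)
  keep <- !is.na(x) & !is.na(y)
  x <- x[keep]
  y <- y[keep]
  # Rank-based bins avoid duplicate quantile breaks when x has ties.
  bin <- ceiling(rank(x, ties.method = "first") * n_bins / length(x))
  data.frame(
    bin    = sort(unique(bin)),
    x_mean = as.numeric(tapply(x, bin, mean)),
    y_mean = as.numeric(tapply(y, bin, mean))
  )
}

# The resulting points can be passed to plot() or ggplot2.
points <- binned_means(x = 1:100, y = 2 * (1:100), n_bins = 10)
```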

@jmshapir fyi.

Improve dependency structure for SaveData

Per an e-mail suggestion today from Matthias, we thought that switching to the packagename::function convention when calling functions from other packages would avoid bugs in cases where another preloaded package has a function whose name is the same as one of the functions we are calling in SaveData.
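
The kind of bug this convention avoids can be illustrated with a small base R example. It masks stats::sd deliberately; it is not taken from SaveData or from Matthias's suggested changes.

```r
# A user script or another attached package defines a function that
# shadows stats::sd.
sd <- function(x, ...) "masked"

values <- c(2, 4, 4, 4, 5, 5, 7, 9)

unqualified <- sd(values)         # resolves to the masking definition above
qualified   <- stats::sd(values)  # always resolves to the stats package
```

The qualified call returns the numeric standard deviation regardless of what is defined or attached elsewhere, which is the behavior a library function usually wants.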

Here are the changes that Matthias suggested.

One issue is that the unit tests fail when I apply these changes, and I couldn't immediately fix it. @jmshapir, my gslab_r bandwidth over the remaining time may not suffice to complete this issue, so I am not self-assigning, but please let me know if you disagree.

RDS capitalization

Allow alternative capitalization schemes for the .RDS file extension, so that ".Rds" is also accepted as a valid extension.
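
One way to do this is a case-insensitive extension check, sketched below. The helper name is_rds_file is hypothetical; how SaveData currently detects the extension is not shown in this issue.

```r
# Hypothetical helper: TRUE when the file extension is "rds" in any
# capitalization (".rds", ".Rds", ".RDS", ...).
is_rds_file <- function(path) {
  tolower(tools::file_ext(path)) == "rds"
}

checks <- is_rds_file(c("data.RDS", "data.Rds", "data.rds", "data.csv"))
```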

Team discussion on OOP system in R

@gentzkow wants us to put our heads together to discuss which OOP system we want to use to create our gslab_r packages. Our ultimate goal is to migrate the MATLAB classes gslab_model, gslab_mle and gslab_mde to R packages. The discussion should get everyone familiar with the OOP system in R, and decide which system we want to use to create our packages.

In my next comment I will briefly summarize what I have learned about the different OOP systems in R and describe my current implementation of gslab_model. We should look at the whole gslab_model, gslab_mle, and gslab_mde system and think about which features will be hard to implement in each OOP system.
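
As a starting point for the discussion, the sketch below writes the same toy class in two of the systems that ship with R, S3 and Reference Classes (S4 is the third). All class and function names are hypothetical and are not taken from gslab_model. The sketch highlights one difference that may matter for the migration: S3 objects are copied on modification, while Reference Class objects are mutable.

```r
library(methods)  # provides setRefClass

# --- S3: a list with a class attribute; methods dispatch on the class ---
new_toy_model <- function(param_names) {
  structure(list(param_names = param_names), class = "ToyModel")
}
n_params <- function(model) UseMethod("n_params")
n_params.ToyModel <- function(model) length(model$param_names)

s3_model <- new_toy_model(c("alpha", "beta"))
s3_copy <- s3_model
s3_copy$param_names <- "gamma"   # modifies the copy only (value semantics)

# --- Reference Classes: mutable objects with methods attached to them ---
ToyModelRC <- setRefClass(
  "ToyModelRC",
  fields  = list(param_names = "character"),
  methods = list(n_params = function() length(param_names))
)

rc_model <- ToyModelRC$new(param_names = c("alpha", "beta"))
rc_alias <- rc_model
rc_alias$param_names <- "gamma"  # also changes rc_model (reference semantics)
```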

Update CRAN mirror

Berkeley shut down their CRAN mirror, so I will move all calls to it to the Washington University in St. Louis CRAN mirror.

Create and improve the basic classes

The goal of this task is to make a first pass at creating and improving the basic gslab_r classes (Model, ModelData, ModelEstimationOutput), which parallel the classes in the MATLAB library gslab_model.

Requests for SaveData

I have been working with large data and found some limitations in SaveData that prompted me to use a local version with a few additional features. You may be interested in adding some of them to the function. A list of suggestions follows.

Support feather and fst file formats to save data

Issue: CSV files are saved quickly with data.table, but they use a lot of disk space. The feather format (from arrow) and the fst format (from fst) are also very fast, sometimes even faster, and they compress the files, potentially saving a lot of disk space.

Proposal: Add these formats to the data dictionary.
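
A minimal sketch of how the additional formats could be registered, assuming the "data dictionary" is a lookup from file extension to a writer function (the actual structure in SaveData.R may differ). The names writers and write_by_extension are hypothetical. The CSV writer uses base R here for self-containment, where SaveData would presumably keep its data.table writer.

```r
# Hypothetical lookup table from lowercase file extension to a writer.
# The arrow and fst calls are only evaluated when that format is requested,
# so those packages are needed only for their own formats.
writers <- list(
  csv     = function(df, path) utils::write.csv(df, path, row.names = FALSE),
  rds     = function(df, path) saveRDS(df, path),
  feather = function(df, path) arrow::write_feather(df, path),
  fst     = function(df, path) fst::write_fst(df, path)
)

write_by_extension <- function(df, path) {
  ext <- tolower(tools::file_ext(path))
  if (!ext %in% names(writers)) {
    stop("Unsupported file extension: ", ext)
  }
  writers[[ext]](df, path)
  invisible(path)
}

# Usage with a format that needs no extra packages.
example_df <- data.frame(a = 1:2, b = c("x", "y"))
csv_path <- tempfile(fileext = ".csv")
write_by_extension(example_df, csv_path)
```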

Shortcomings of log file generation

Speed

Issue: While I haven't tested formally, I think the way the package computes summary statistics can be slow for large datasets.

Proposal: Change the computation of sumstats here. For that, convert df to a data.table object (called dt below), and change the code to something like this:

library(data.table)

# Names of the numeric columns
numeric_vars <- names(dt)[vapply(dt, is.numeric, logical(1))]

# One row per variable; one column per statistic
numeric_sum <- t(rbind(dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = numeric_vars],
                       dt[, lapply(.SD, sd,   na.rm = TRUE), .SDcols = numeric_vars],
                       dt[, lapply(.SD, min,  na.rm = TRUE), .SDcols = numeric_vars],
                       dt[, lapply(.SD, max,  na.rm = TRUE), .SDcols = numeric_vars]))
colnames(numeric_sum) <- c("mean", "sd", "min", "max")

Optionally exclude vars

Issue: Sometimes there is a variable for which you don't want to include the summary statistics (because data are confidential, for example).

Proposal: Add a mask_vars optional argument that takes a list of variables and excludes them from the numeric sumstats computation.
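
A minimal base R sketch of the proposed argument follows. The function name summarize_numeric and the output layout are hypothetical; only the mask_vars argument comes from the proposal.

```r
# Hypothetical helper: summary statistics for numeric columns, skipping
# any column named in mask_vars.
summarize_numeric <- function(df, mask_vars = character(0)) {
  is_num <- vapply(df, is.numeric, logical(1))
  vars <- setdiff(names(df)[is_num], mask_vars)
  stat <- function(f) {
    vapply(vars, function(v) as.numeric(f(df[[v]], na.rm = TRUE)), numeric(1))
  }
  data.frame(
    variable = vars,
    mean = stat(mean),
    sd   = stat(sd),
    min  = stat(min),
    max  = stat(max),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

survey <- data.frame(
  id   = 1:3,
  wage = c(10, 20, 30),          # treated as confidential in this example
  name = c("a", "b", "c")
)
masked_summary <- summarize_numeric(survey, mask_vars = "wage")
```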

Data with no numeric variables

Issue: In this case the output has a lot of NULLs where the sumstats should go.

Proposal: If all variables in the dataset are non-numeric, do not compute sumstats and exclude them from the log file.

Data loaded from STATA

Issue: When data are loaded from Stata files, the class of a column sometimes includes a label and other attributes. This distorts the type column of the log file.

Proposal: Clean the extra column attributes before generating the log file (or maybe even before saving).
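
The cleaning step could be sketched as below. It assumes the data were read with the haven package, which is known to attach label, labels, and format.stata attributes and a haven_labelled class; the issue does not say which reader was used. The helper name strip_import_attributes is hypothetical.

```r
# Hypothetical helper: drop label-style attributes from every column so that
# class() reports the underlying storage type. The default attribute and
# class names are the ones set by haven; adjust them for other readers.
strip_import_attributes <- function(df,
                                    drop_attrs = c("label", "labels", "format.stata"),
                                    drop_classes = c("haven_labelled", "vctrs_vctr")) {
  for (col in names(df)) {
    x <- df[[col]]
    for (a in drop_attrs) attr(x, a) <- NULL
    if (inherits(x, drop_classes)) {
      x <- unclass(x)  # fall back to the implicit class of the stored values
    }
    df[[col]] <- x
  }
  df
}

# Simulate a labelled column without requiring haven to be installed.
labelled <- data.frame(id = 1:2)
labelled$income <- structure(
  c(1000, 2000),
  label = "Household income",
  class = c("haven_labelled", "double")
)
cleaned <- strip_import_attributes(labelled)
```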

Allow flexible sorting with SaveData()

Currently SaveData() sorts the data based on the keys provided and saves. We'd like to allow specifying an optional sort variable. The default variables used for sorting will still be the keys.
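
A minimal sketch of the intended behavior, where the sort variables default to the keys. The function name sort_for_save and the argument name sort_vars are hypothetical; the actual SaveData() signature may end up different.

```r
# Hypothetical helper: sort by sort_vars, which defaults to the keys.
sort_for_save <- function(df, key, sort_vars = key) {
  missing_vars <- setdiff(sort_vars, names(df))
  if (length(missing_vars) > 0) {
    stop("Unknown sort variables: ", paste(missing_vars, collapse = ", "))
  }
  # unname() keeps column names from being read as arguments of order().
  ordering <- do.call(order, unname(lapply(sort_vars, function(v) df[[v]])))
  df[ordering, , drop = FALSE]
}

cities <- data.frame(
  city = c("b", "a", "a"),
  date = c(1, 2, 1),
  pop  = c(5, 9, 7)
)

by_key <- sort_for_save(cities, key = c("city", "date"))
by_pop <- sort_for_save(cities, key = c("city", "date"), sort_vars = "pop")
```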
