gslab_r's Issues
SaveData cannot take in data.table objects
Error message when exporting data.table objects using SaveData:
expert_iran |>
  SaveData(key = c("city", "date"),
           outfile = file.path(outdir, "expert_iran_protests.csv"),
           logfile = file.path(outdir, "expert_iran_protests.log"),
           appendlog = FALSE)
Error: i evaluates to a logical vector length 2 but there are 460 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
A temporary fix is to convert the data.table to a data.frame before calling SaveData.
I envision 3 options:
- Do nothing; the user has to call setDF first in order to use SaveData.
- Do almost nothing, except add one line similar to setDF(df) at the very beginning of the main function in SaveData.R.
- Modify (possibly significantly) SaveData.R so that it can take in both data.table and data.frame objects in a computationally efficient manner.
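The first two options both come down to coercing the input before the rest of the function touches it. A minimal sketch of the two ways to do that with data.table follows; the object `dt` and its columns are made-up stand-ins, and nothing here calls SaveData itself.

```r
library(data.table)

# Toy stand-in for the data.table from the report (hypothetical columns)
dt <- data.table(city = c("A", "B"),
                 date = as.Date(c("2020-01-01", "2020-01-02")),
                 n_events = c(3L, 1L))

# Copying conversion: returns a plain data.frame and leaves dt untouched
df <- as.data.frame(dt)

# In-place conversion by reference (no copy); note that dt itself would
# stop being a data.table after this call
# setDF(dt)

class(df)
```

as.data.frame() is the safer choice inside a library function because it does not change the caller's object; setDF() avoids the copy but has that side effect.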
MD5 changes by operating system
In another project we noticed that MD5 hashes change for the same R object depending on the operating system. We also noticed the same issue for outputted CSV files.
When computing hashes on outputted CSV files, we were able to get the same hash across operating systems by using the same line ending on all of them.
The issue remains for hashes computed inside R.
@santiagohermo, @miikapaal fyi.
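For the CSV case, a small sketch of why the line ending matters. It writes the same content once with LF and once with CRLF endings, hashes both with tools::md5sum, and then hashes the CRLF file again after normalizing it. The helper name normalize_eol is hypothetical, and this does not address hashes of in-memory R objects.

```r
# Write identical CSV content with LF and with CRLF line endings
lf_file   <- tempfile(fileext = ".csv")
crlf_file <- tempfile(fileext = ".csv")
writeBin(charToRaw("id,x\n1,2\n"), lf_file)
writeBin(charToRaw("id,x\r\n1,2\r\n"), crlf_file)

hash_lf   <- unname(tools::md5sum(lf_file))
hash_crlf <- unname(tools::md5sum(crlf_file))

# Hypothetical helper: rewrite a file with LF-only line endings
normalize_eol <- function(path) {
  text <- rawToChar(readBin(path, "raw", n = file.size(path)))
  text <- gsub("\r\n", "\n", text, fixed = TRUE)
  out <- tempfile(fileext = ".csv")
  writeBin(charToRaw(text), out)
  out
}

hash_normalized <- unname(tools::md5sum(normalize_eol(crlf_file)))

hash_lf == hash_crlf        # FALSE: the bytes differ
hash_lf == hash_normalized  # TRUE: same bytes after normalizing
```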
Variable order in SaveData logs
SaveData reorders variables alphabetically when writing the log with variable summaries. In this issue we will make a minor modification to keep the original column order of the data frame.
Finalize SaveData() function
In this issue, I'll bring in the SaveData() function from the Protests project.
Develop GSLabMLE package
Investigate potential bug in sortbykey parameter in SaveData
The purpose of this issue (#28) is to address a potential bug in the sortbykey parameter of the SaveData package, whereby row indices may not be properly sorted given a set of key inputs when the argument is set to TRUE. Any proposed revisions should be self-contained within ~/SaveData/R/SaveData.R. The relevant development branch is issue28_sbk_savedata.
Create binned scatter function
Improve dependency structure for SaveData
Per an e-mail suggestion today from Matthias, we thought that switching to the packagename::function convention when calling functions from other packages would avoid bugs in cases where another preloaded package has a function whose name is the same as one of the functions we are calling in SaveData.
Here are the changes that Matthias suggested.
One issue is that the unit tests fail when I apply these changes, and I couldn't immediately fix it. @jmshapir, my gslab_r bandwidth over the remaining time may not suffice to complete this issue, so I am not self-assigning, but please let me know if you disagree.
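To illustrate the kind of bug the packagename::function convention guards against, here is a small sketch using base R only. The masking definition of sd is made up for the example; it stands in for any preloaded package that exports a function with a clashing name.

```r
# A definition that masks stats::sd on the search path (made up for the example)
sd <- function(x, ...) "masked"

x <- c(1, 2, 3, 4)

unqualified <- sd(x)         # resolves to the masking definition
qualified   <- stats::sd(x)  # always resolves to the stats implementation

unqualified
qualified
```

The qualified call is unaffected by whatever else is attached, which is the behavior we want inside SaveData.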
Implement autofill function
In this issue we'll bring the autofill function we've written in Protests to this repo.
RDS capitalization
Allow alternative capitalization schemes for .RDS files, so that ".Rds" is also accepted as a valid extension.
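One way to do this is to compare the extension case-insensitively. A minimal sketch, assuming the check is based on the file extension; the function name is_rds_path is hypothetical and not taken from the package.

```r
# Hypothetical helper: TRUE when the file extension is "rds" in any capitalization
is_rds_path <- function(path) {
  tolower(tools::file_ext(path)) == "rds"
}

paths  <- c("data.RDS", "data.Rds", "data.rds", "data.csv")
is_rds <- is_rds_path(paths)
is_rds
```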
Team discussion on OOP system in R
@gentzkow wants us to put our heads together to discuss which OOP system we want to use to create our gslab_r packages. Our ultimate goal is to migrate the MATLAB classes gslab_model, gslab_mle and gslab_mde to R packages. The discussion should get everyone familiar with the OOP system in R, and decide which system we want to use to create our packages.
I will briefly summarize what I have learned about the different OOP systems in R and my current implementation of gslab_model in my next comment. We should look at the whole gslab_model, gslab_mle, and gslab_mde system and think about which features will be hard to implement in one system or the other.
Function for assigning scalars to variables from an input file
In this issue we will write a helper function for assigning scalars to R variables from an input file, i.e. a functionality similar to the one in loadglob.
Migrate numerical_derivatives from MATLAB to R
Migrate from here. Should follow the standard workflow.
Update CRAN mirror
Berkeley shut down their CRAN mirror, so I will move all calls to it to the Washington University in St. Louis CRAN mirror.
Create and improve the basic classes
The goal of this task is to make a first pass on creating and improving the basic gslab_r classes (Model, ModelData, ModelEstimationOutput), which parallel the classes in the MATLAB library gslab_model.
Requests for SaveData
I have been working with large data and found some limitations in SaveData that prompted me to use a local version with a few different functionalities. I think you may be interested in adding some of those to the function. Find below a list of suggestions.
Support feather and fst file formats to save data
Issue: csv files are saved quickly with data.table but they use a lot of space on disk. The feather format (from arrow) and the fst format (from fst) are also really fast, sometimes even faster, and they also compress the files, potentially saving a lot of space on disk.
Proposal: Add these formats to the data dictionary.
Shortcomings of log file generation
Speed
Issue: While I haven't tested formally, I think the way the package computes summary statistics can be slow for large datasets.
Proposal: Change the computation of sumstats here. For that, make df a data.table object, and change the code to something like this:
# Names of the numeric columns of the data.table dt
numeric_vars <- names(dplyr::select_if(dt, is.numeric))
# One row per statistic (mean, sd, min, max), transposed so each variable is a row
numeric_sum <- t(rbind(dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = numeric_vars],
                       dt[, lapply(.SD, sd, na.rm = TRUE), .SDcols = numeric_vars],
                       dt[, lapply(.SD, min, na.rm = TRUE), .SDcols = numeric_vars],
                       dt[, lapply(.SD, max, na.rm = TRUE), .SDcols = numeric_vars]))
Optionally exclude vars
Issue: Sometimes there is a variable for which you don't want to include the summary statistics (because data are confidential, for example).
Proposal: Add a mask_vars optional argument that takes a list of variables and excludes them from the numeric sumstats computation.
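A rough sketch of how mask_vars could work, in base R. The helper summarize_numeric and the example columns are hypothetical; this is not the current SaveData code, only an illustration of dropping masked variables before the statistics are computed.

```r
# Hypothetical helper: mean, sd, min and max for numeric columns,
# skipping any column named in mask_vars
summarize_numeric <- function(df, mask_vars = character()) {
  is_num <- vapply(df, is.numeric, logical(1))
  vars   <- setdiff(names(df)[is_num], mask_vars)
  rows <- lapply(df[vars], function(x) {
    c(mean = mean(x, na.rm = TRUE),
      sd   = stats::sd(x, na.rm = TRUE),
      min  = min(x, na.rm = TRUE),
      max  = max(x, na.rm = TRUE))
  })
  # One row per variable; NULL when no variable is left to summarize
  do.call(rbind, rows)
}

df <- data.frame(id     = c("a", "b", "c"),
                 income = c(10, 20, 30),   # treated as confidential below
                 age    = c(30, 40, 50))

sumstats <- summarize_numeric(df, mask_vars = "income")
sumstats
```

Here income never enters the computation, so its statistics cannot leak into the log.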
Data with no numeric variables
Issue: In this case the output has a lot of NULLs where the sumstats should go.
Proposal: Do not compute sumstats, and exclude them from the log file, if all variables in the dataset are non-numeric.
Data loaded from STATA
Issue: Sometimes when you load data the class of the column includes a label and other attributes. This distorts the type column of the log file.
Proposal: Clean the extra column attributes before generating the log file (or maybe even before saving).
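A minimal sketch of the cleaning step in base R. The helper strip_extra_attributes is hypothetical, and the labelled column is simulated by hand rather than read from a Stata file. Factors and date columns are left alone here because they need their attributes; that exception is an assumption of this sketch.

```r
# Hypothetical helper: drop labels and extra classes from plain columns,
# leaving factors and date/time columns untouched
strip_extra_attributes <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col) || inherits(col, c("Date", "POSIXct"))) return(col)
    attributes(col) <- NULL
    col
  })
  df
}

# Simulate a numeric column carrying a label and an extra class
df <- data.frame(income = c(1.5, 2.5))
attr(df$income, "label") <- "Household income"
class(df$income) <- c("labelled", "numeric")

class_before <- class(df$income)
cleaned      <- strip_extra_attributes(df)
class_after  <- class(cleaned$income)

class_before
class_after
```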
Allow flexible sorting with SaveData()
Currently SaveData() sorts the data based on the keys provided and saves. We'd like to allow specifying an optional sort variable. The default variables used for sorting will still be the keys.
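A sketch of what the interface could look like, in base R. The function sort_for_save and the argument name sort_vars are hypothetical; the only behavior taken from the description above is that sorting defaults to the keys.

```r
# Hypothetical helper: sort by sort_vars, which defaults to the keys
sort_for_save <- function(df, key, sort_vars = key) {
  # unname() so column names are not mistaken for arguments of order()
  ord <- do.call(order, unname(as.list(df[sort_vars])))
  df[ord, , drop = FALSE]
}

df <- data.frame(city = c("B", "A", "A"),
                 date = c(1, 2, 1),
                 size = c(5, 9, 7))

by_key  <- sort_for_save(df, key = c("city", "date"))
by_size <- sort_for_save(df, key = c("city", "date"), sort_vars = "size")

by_key$size   # 7 9 5
by_size$size  # 5 7 9
```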