
chunked's Introduction

chunked

R is a great tool, but processing data in large text files is cumbersome. chunked helps you to process large text files with dplyr while loading only a part of the data in memory. It builds on the excellent R package LaF.

Processing commands are written in dplyr syntax, and chunked (using LaF) takes care of processing the data chunk by chunk, using far less memory than reading the whole file at once. chunked is useful for select-ing columns, mutate-ing columns and filter-ing rows. It is less helpful for group-ing and summarize-ing large text files, but it works well as a data pre-processing step.

Install

‘chunked’ can be installed with

install.packages('chunked')

the beta version with:

install.packages('chunked', repos=c('https://cran.rstudio.com', 'https://edwindj.github.io/drat'))

and the development version with:

devtools::install_github('edwindj/chunked')

Enjoy! Feedback is welcome…

Usage

Text file -> process -> text file

The most common use case is processing a large text file: selecting or adding columns, filtering rows, and writing the result back to a text file.

  read_chunkwise("./large_file_in.csv", chunk_size=5000) %>% 
  select(col1, col2, col5) %>%
  filter(col1 > 10) %>% 
  mutate(col6 = col1 + col2) %>% 
  write_chunkwise("./large_file_out.csv")

chunked will process the above statement in chunks of 5000 records. This differs from, for example, read.csv, which reads all data into memory before processing it.

Text file -> process -> database

Another option is to use chunked as a preprocessing step before loading the data into a database.

con <- DBI::dbConnect(RSQLite::SQLite(), 'test.db')
db <- dbplyr::src_dbi(con)

tbl <- 
  read_chunkwise("./large_file_in.csv", chunk_size=5000) %>% 
  select(col1, col2, col5) %>%
  filter(col1 > 10) %>% 
  mutate(col6 = col1 + col2) %>% 
  write_chunkwise(db, 'my_large_table')
  
# tbl now points to the table in sqlite.

Database -> process -> text file

chunked can also be used to export a database table chunkwise to a text file. Note, however, that in this case the processing takes place in the database and the chunkwise restrictions apply only to the writing.
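A minimal sketch of this scenario (a hedged example, assuming a chunked version whose read_chunkwise has a method for database tbls, and reusing the test.db database and my_large_table table created above):

library(chunked)
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), 'test.db')
db  <- dbplyr::src_dbi(con)

tbl(db, 'my_large_table') %>%                  # lazy reference to the database table
  read_chunkwise(chunk_size = 5000) %>%        # pull it back in chunks of 5000 rows
  write_chunkwise('./my_large_table_out.csv')  # write each chunk to the text file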

Lazy processing

chunked will not start processing until collect or write_chunkwise is called.

data_chunks <- 
  read_chunkwise("./large_file_in.csv", chunk_size=5000) %>% 
  select(col1, col3)
  
# won't start processing until
collect(data_chunks)
# or
write_chunkwise(data_chunks, "test.csv")
# or
write_chunkwise(data_chunks, db, "test")

Syntax completion of variables of a chunkwise file in RStudio works like a charm…

Dplyr verbs

chunked implements the following dplyr verbs:

  • filter
  • select
  • rename
  • mutate
  • mutate_each
  • transmute
  • do
  • tbl_vars
  • inner_join
  • left_join
  • semi_join
  • anti_join

Since data is processed in chunks, some dplyr verbs are not implemented:

  • arrange
  • right_join
  • full_join

summarize and group_by are implemented but generate a warning: they operate on each chunk, not on the whole data set. However, this makes it easy to process a large file in two steps: aggregate each chunk, then aggregate the per-chunk results again, as the example below shows.

  • summarize
  • group_by

tmp <- tempfile()
write.csv(iris, tmp, row.names=FALSE, quote=FALSE)
iris_cw <- read_chunkwise(tmp, chunk_size = 30) # read in chunks of 30 rows for this example

iris_cw %>% 
  group_by(Species) %>%            # group in each chunk
  summarise( m = mean(Sepal.Width) # and summarize in each chunk
           , w = n()
           ) %>% 
  as.data.frame %>%                  # since each Species has 50 records, the results span multiple chunks
  group_by(Species) %>%              # group the per-chunk results
  summarise(m = weighted.mean(m, w)) # and summarize it again

chunked's People

Contributors

edwindj, hadley, leoniedu


chunked's Issues

Error in .local(x, ...) : Line ended while open quote

I'm getting this error upon reading in a file and then immediately collecting it.

read_chunkwise(file.path(rdir, 'frs', paste0('file', '.csv'))) %>% collect()

However, the odd thing is that I can read the file with fread and it works totally fine; for some reason read_chunkwise() gives this error. Any thoughts as to why this might be happening?

Add reshape: gather (melt) and spread (dcast)

Could you please add functions to reshape the data from wide to long format and vice versa? This operation goes by the names reshape, gather, melt, spread, and dcast.

When done with dplyr or data.table, it crashes with large datasets.
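Wide-to-long reshaping would actually suit per-chunk processing, since each row is reshaped independently; spread/dcast would not, because one output row may need values from several chunks. A heavily hedged sketch of what a per-chunk gather could look like, reusing the internal chunked:::record() pattern from the sample_n proposal further down (illustrative only, not part of chunked, and untested against its internal record/play mechanism):

library(chunked)
library(tidyr)

# hypothetical per-chunk gather (standard-evaluation variant, so the column
# names are plain strings); chunked:::record() is an internal helper
gather_.chunkwise <- function(.data, key_col, value_col, gather_cols, ...){
  cmd <- lazyeval::lazy(tidyr::gather_(.data, key_col, value_col, gather_cols))
  chunked:::record(.data, cmd)
}

# usage sketch (file and column names are placeholders)
# read_chunkwise("wide.csv", chunk_size = 5000) %>%
#   gather_("measure", "value", c("col1", "col2", "col3")) %>%
#   write_chunkwise("long.csv")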

`read_chunkwise` fails on `tbl` object

Dear Edwin,

I encountered an error while trying to reproduce your code from the package's useR! 2016 slides, page 11, 'Scenario 3: DB -> TXT'.

library("chunked")
library("RSQLite")

dbcon <- dbConnect(SQLite(), "test.sqlite")
dbWriteTable(dbcon, "mtcars", mtcars)
dbDisconnect(dbcon)


tbl <-
  ( src_sqlite("test.sqlite") %>%
      tbl("mtcars")
  ) %>%
  read_chunkwise(chunk_size = 10, format = "tbl") %>%
  write_chunkwise('test2.csv')

This produces the error:

Error in UseMethod("read_chunkwise") : 
  no applicable method for 'read_chunkwise' applied to an object of class "c('tbl_sqlite', 'tbl_sql', 'tbl_lazy', 'tbl')"

Reading the read_chunkwise function makes me think that no method is present for a tbl object. Am I making a mistake?

rsqlite version: 1.1.2
dplyr version: 0.5.0
chunked version: 0.3

Regards,
Srikanth KS

Can't read a file chunkwise

I'm trying to read the attached file like so:

test <- read_chunkwise('br_reporting_2017.csv', format='csv')

But it's giving me this warning:

1: In FUN(X[[i]], ...) :
  Unsupported type 'logical'; using default type 'string'
2: In FUN(X[[i]], ...) :
  Unsupported type 'logical'; using default type 'string'

And when I try to access the test value, I get:

Error in UseMethod("groups") : 
  no applicable method for 'groups' applied to an object of class "NULL"

Do you know what might be going on here?

here's the attached file: https://s3.amazonaws.com/rcrainfo-ftp/Production/2021-03-08T08-09-38-0500/Biennial%20Report/BR_REPORTING_2017/BR_REPORTING_2017.zip

data missing - no `fill = TRUE` option

When using read_csv_chunkwise, if data is missing in the first line of the file (or the second line, when header = TRUE), there is no option to pass fill = TRUE and the following error is shown:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 13 elements

Can this be solved somehow?

Can't read csvs with NAs in numeric columns

When reading data stored as a CSV with NAs in a numeric column, this error is returned:

Error in .local(x, ...) : 
Conversion to double failed; line=1; column=17; string='NA'

I'm experimenting with using mutate to force the NA to be read as missing; I'll update the issue if I find a workaround.
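One possible workaround, sketched here under stated assumptions (LaF::laf_open_csv and chunked's read_laf_chunkwise are real functions, but the column names and types below are placeholders for the actual file):

library(LaF)
library(chunked)

# declare the problematic column as a string so LaF does not try to parse 'NA'
laf <- laf_open_csv("data.csv",
                    column_types = c("integer", "string", "double"),
                    column_names = c("id", "flaky_numeric", "x"),
                    skip = 1)                                # skip the header line

read_laf_chunkwise(laf, chunk_size = 5000) %>%
  mutate(flaky_numeric = as.numeric(flaky_numeric)) %>%      # 'NA' strings become real NAs
  write_chunkwise("clean.csv")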

Functions need examples and more verbose documentation

Bare-bones documentation without clear error messages means others can't learn and use this.

In FUN(X[[i]], ...) : Unsupported type 'logical'; using default type 'string'

Error in .local(x, ...) :
Conversion to int failed; line=957; column=52; string='V061'

Error in .local(x, ...) : Line has too many columns

chunked tests give an error with testthat 0.12.0

The new version of testthat (0.12.0) gives an error due to a change in testthat's interface. This is solved in the "testthat" branch, but the fix is not backwards compatible, so it can only be uploaded once the new testthat is released.

dplyr 0.8.0

When checking reverse dependencies against the release candidate of dplyr, chunked fails as shown below.

The reason is that filter_() gained a .preserve argument. You should be able to do something like this (not tested):

filter_.chunkwise <- function(.data, ..., .dots, .preserve = FALSE){
  .dots <- lazyeval::all_dots(.dots, ...)
  cmd <- if(packageVersion("dplyr") < "0.7.99" ) {
    lazyeval::lazy(filter_(.data, .dots=.dots))
  } else {
    lazyeval::lazy(filter_(.data, .dots=.dots, .preserve = .preserve))
  }
  record(.data, cmd)
}

so that chunked works against both the current dplyr and the upcoming 0.8.0.

> revdepcheck::revdep_details(revdep = "chunked")
══ Reverse dependency check ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════ chunked 0.4 ══

Status: BROKEN

── Newly failing

✖ checking examples ... ERROR
✖ checking tests ...

── Before ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
0 errors ✔ | 0 warnings ✔ | 0 notes ✔

── After ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
❯ checking examples ... ERROR
  Running examples in ‘chunked-Ex.R’ failed
  The error most likely occurred in:
  
  > ### Name: read_csv_chunkwise
  > ### Title: Read chunkwise data from text files
  > ### Aliases: read_csv_chunkwise read_csv2_chunkwise read_table_chunkwise
  > ###   read_laf_chunkwise
  > 
  > ### ** Examples
  > 
  > # create csv file for demo purpose
  > in_file <- file.path(tempdir(), "in.csv")
  > write.csv(women, in_file, row.names = FALSE, quote = FALSE)
  > 
  > #
  > women_chunked <-
  +   read_chunkwise(in_file) %>%  #open chunkwise connection
  +   mutate(ratio = weight/height) %>%
  +   filter(ratio > 2) %>%
  +   select(height, ratio) %>%
  +   inner_join(data.frame(height=63:66)) # you can join with data.frames!
  > 
  > # no processing done until
  > out_file <- file.path(tempdir(), "processed.csv")
  > women_chunked %>%
  +   write_chunkwise(file=out_file)
  Error: `.preserve` (`.preserve = FALSE`) must not be named, do you need `==`?
  Execution halted

❯ checking tests ...
  See below...

── Test failures ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── testthat ────

> library(testthat)
> library(chunked)
Loading required package: dplyr

Attaching package: 'dplyr'

The following object is masked from 'package:testthat':

    matches

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

> 
> test_check("chunked")
── 1. Error: filter(): can filter rows (@test-verbs.R#28)  ─────────────────────
`.preserve` (`.preserve = FALSE`) must not be named, do you need `==`?
1: expect_equal(tbl_women %>% filter(height > 65) %>% as.data.frame, women %>% filter(height > 
       65)) at testthat/test-verbs.R:28
2: quasi_label(enquo(object), label)
3: eval_bare(get_expr(quo), get_env(quo))
4: tbl_women %>% filter(height > 65) %>% as.data.frame
5: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
6: eval(quote(`_fseq`(`_lhs`)), env, env)
7: eval(quote(`_fseq`(`_lhs`)), env, env)
8: `_fseq`(`_lhs`)
9: freduce(value, `_function_list`)
10: withVisible(function_list[[k]](value))
...
23: filter_(.data, .dots = .dots)
24: filter_.data.frame(.data, .dots = .dots) at /Users/romain/git/tidyverse/dplyr/R/manip.r:122
25: filter(.data, !!!dots, .preserve = .preserve) at /Users/romain/git/tidyverse/dplyr/R/dataframe.R:66
26: filter.data.frame(.data, !!!dots, .preserve = .preserve) at /Users/romain/git/tidyverse/dplyr/R/manip.r:113
27: as.data.frame(filter(tbl_df(.data), ..., .preserve = .preserve)) at /Users/romain/git/tidyverse/dplyr/R/dataframe.R:61
28: filter(tbl_df(.data), ..., .preserve = .preserve) at /Users/romain/git/tidyverse/dplyr/R/dataframe.R:61
29: filter.tbl_df(tbl_df(.data), ..., .preserve = .preserve) at /Users/romain/git/tidyverse/dplyr/R/manip.r:113
30: bad_eq_ops(bad, "must not be named, do you need `==`?") at /Users/romain/git/tidyverse/dplyr/R/tbl-df.r:50
31: glubort(fmt_wrong_eq_ops(named_calls), ..., .envir = .envir) at /Users/romain/git/tidyverse/dplyr/R/error.R:37
32: .abort(text) at /Users/romain/git/tidyverse/dplyr/R/error.R:51

══ testthat results  ═══════════════════════════════════════════════════════════
OK: 39 SKIPPED: 0 FAILED: 1
1. Error: filter(): can filter rows (@test-verbs.R#28) 

Error: testthat unit tests failed
Execution halted

2 errors ✖ | 0 warnings ✔ | 0 notes ✔

Implementing sample_n and sample_frac

We could implement a chunkwise sample_n / sample_frac with:

library(tidyverse)
big <- rerun(1000, iris) %>% bind_rows()
path <- tempfile()
write_csv(big, path)

library(chunked)
sample_n.chunkwise <- function(.data, size){
  cmd <- lazyeval::lazy(sample_n(.data, size))
  chunked:::record(.data, cmd)
}

read_csv_chunkwise(path) %>% 
  sample_n(1) %>% 
  collect() 

The sample would be done in each chunk that way.

What do you think about that?
If it sounds like a good idea, let me know and I'll send you a PR.

Reading compressed files

Great package! As a potential enhancement, it would be useful if one could process compressed files (e.g., gzipped txt files) using chunked.
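chunked reads files through LaF, which works on plain, uncompressed text files, so gzipped input would need to be decompressed first. A hedged workaround sketch in the meantime (R.utils::gunzip is a real function; file names and the processing step are placeholders):

library(chunked)

tmp <- tempfile(fileext = ".csv")
R.utils::gunzip("large_file.csv.gz", destname = tmp, remove = FALSE)  # decompress to a temp file

read_chunkwise(tmp, chunk_size = 5000) %>%
  select(col1, col2) %>%                     # placeholder processing step
  write_chunkwise("large_file_out.csv")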

txt file support?

Hello,

I was wondering if you might be able to support reading .txt files, as I have data which is delimited in chunks and would be a good test case for your package.
Data format:

beer/name: Sausa Weizen
beer/beerId: 47986
beer/brewerId: 10325
beer/ABV: 5.00
beer/style: Hefeweizen
review/appearance: 2.5
review/aroma: 2
review/palate: 1.5
review/taste: 1.5
review/overall: 1.5
review/time: 1234817823
review/profileName: stcules
review/text: A lot of foam. But a lot. In the smell some banana, and then lactic and tart. Not a good start. Quite dark orange in color, with a lively carbonation (now visible, under the foam). Again tending to lactic sourness. Same for the taste. With some yeast and banana.

beer/name: Red Moon
beer/beerId: 48213
beer/brewerId: 10325
beer/ABV: 6.20
beer/style: English Strong Ale
review/appearance: 3
review/aroma: 2.5
review/palate: 3
review/taste: 3
review/overall: 3
review/time: 1235915097
review/profileName: stcules
review/text: Dark red color, light beige foam, average. In the smell malt and caramel, not really light. Again malt and caramel in the taste, not bad in the end. Maybe a note of honey in teh back, and a light fruitiness. Average body. In the aftertaste a light bitterness, with the malt and red fruit. Nothing exceptional, but not bad, drinkable beer.

Is this what chunked was designed for, or is this a bad use case? Thanks.

Merging large files

I have an R workflow that now needs to merge some large csv files into a single file. I am running out of memory, so the merge has started crashing, and these files are getting larger each month, so it needs a fix. I could write a script in Perl or something similar, but I would prefer to keep the whole workflow in R. It's on Windows, so some of the cmd script solutions would be horrible (and hard to check the formats as I go).

I've been looking around to see how to read files in chunks using R, and your library seems perfect. So I have a question more than an issue: do you think it would be possible to use it to merge files (in this case it is a known set of file names each time)? I can't yet see a way to do it.
Thanks very much for a useful library; we are now using it to filter extracts out of some very large source files, and it has already simplified quite a bit of our data cleaning and processing.

chunked and validate together

Hi Edwin

I really enjoyed your talk about chunked at useR 2016, as well as the related talk about validate from Mark van der Loo. I was wondering if the two of you have considered collaborating on an example of how to use the two packages together.

Consider the obvious use case of doing data quality checks on a bunch of large CSV files in a single directory. chunked could allow each large CSV to be processed in memory-efficient chunks. Within the processing step for each chunk, validate could "confront" all of the rows in the chunk with a comprehensive list of quality checks.

I played around a bit trying to create such an example myself, but I couldn't figure out how to separate the reading/writing of the CSV itself from the writing of the validate results (the "confrontation"). Despite my inability to make it work, it strikes me that the two packages combined could be a huge help in doing quality checks.

Thanks for the hard work in putting the chunked package together. I look forward to seeing more from the "data cleaning" dynamic duo!

All the best
Kyle
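One possible shape for such an example, added here as a hedged sketch rather than a tested answer: it assumes chunked's do() passes each chunk to the expression the way dplyr's do() does, and the file name and rules are placeholders. validator(), confront() and summary() are the real validate API.

library(chunked)
library(dplyr)
library(validate)

rules <- validator(col1 > 0, !is.na(col2))   # placeholder quality checks

# confront every chunk with the rules and keep only the per-chunk rule summaries
chunk_summaries <-
  read_chunkwise("big_file.csv", chunk_size = 5000) %>%
  do(summary(confront(., rules))) %>%
  collect()

# aggregate the per-chunk summaries into file-level counts per rule
chunk_summaries %>%
  group_by(name) %>%
  summarise(items = sum(items), passes = sum(passes), fails = sum(fails))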

Unsupported type 'logical'; using default type 'string'

I'm trying to save a DataFrame to CSV. The generated file contains only the header, and I get the error described in the title of this issue.

Steps:

install.packages("sparklyr")
install.packages('dplyr')
install.packages('chunked', repos=c('https://cran.rstudio.com', 'http://edwindj.github.io/drat'))

library(sparklyr)
library(dplyr)
library(config)
library(DBI)
library(chunked)
library(crassy) 

conf <- spark_config()

conf$spark.executor.memoryOverhead <- "2g"

conf$spark.executor.memory <- "4g"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 4
#conf$spark.shuffle.service.enabled <- TRUE
#conf$spark.dynamicAllocation.enabled <- TRUE
conf$spark.dynamicAllocation.enabled <- FALSE
conf$sparklyr.defaultPackages <- c("com.datastax.spark:spark-cassandra-connector_2.11:2.4.1", "org.mongodb.spark:mongo-spark-connector_2.11:2.4.0", "com.databricks:spark-csv_2.11:1.3.0")
conf$spark.serializer <- "org.apache.spark.serializer.KryoSerializer"

# Connect to spark
sc <- spark_connect(master = "spark://myspark:7077", 
                    spark_home = "/spark/home",
                    version = "2.4.0",
                    config = conf)

csv_file_path <- "/home/data/events.csv"
mongo_dbname <- "mydb"
mongo_collection <- "events"

sql_txt <- "SELECT id_api, cast(geometry.coordinates as string) as geo, isoDateTime FROM mongo_waze_tbl"


mongo_uri <- paste("mongodb://foo:bar*@10.8.0.5/",mongo_dbname,".",mongo_collection, "?readPreference=primaryPreferred",sep = "")

load <- invoke(spark_get_session(sc), "read") %>%
  invoke("format", "com.mongodb.spark.sql.DefaultSource") %>%
  invoke("option", "spark.mongodb.input.uri", mongo_uri) %>%
  invoke("option", "keyspace", mongo_dbname) %>%
  invoke("option", "table", mongo_collection) %>%
  invoke("option", "header", TRUE) %>%
  invoke("load")

mongo_df <- sparklyr:::spark_partition_register_df(sc, load, "mongo_waze_tbl", 0, FALSE)

mongo_flat_df <- tbl(sc, sql(sql_txt))
mongo_flat_chunked_df <- read_chunkwise(mongo_flat_df, chunk_size = 5000)
write_chunkwise(mongo_flat_chunked_df,csv_file_path)

Output:

Warning messages:
1: In FUN(X[[i]], ...) :
  Unsupported type 'logical'; using default type 'string'
2: In FUN(X[[i]], ...) :
  Unsupported type 'logical'; using default type 'string'
3: In FUN(X[[i]], ...) :
  Unsupported type 'logical'; using default type 'string'
4: In FUN(X[[i]], ...) :
  Unsupported type 'logical'; using default type 'string'
5: In FUN(X[[i]], ...) :
  Unsupported type 'logical'; using default type 'string'
6: In FUN(X[[i]], ...) :
  Unsupported type 'logical'; using default type 'string'
7: In FUN(X[[i]], ...) :
  Unsupported type 'logical'; using default type 'string'
