duckdb / duckdb-r
The duckdb R package
Home Page: https://r.duckdb.org/
License: Other
It seems that when installing a duckdb extension (like httpfs) using r-wasm/webr and coatless/quarto-webr, duckdb must use HTTPS for the connection to work. However, it currently uses plain HTTP: http://extensions.duckdb.org/
Is it possible to force HTTPS?
Should I report this problem elsewhere?
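One thing that may be worth trying is repointing extension installs at the HTTPS endpoint before installing. This is only a sketch: it assumes the `custom_extension_repository` setting is available in the DuckDB build shipped with webr.

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
# Repoint extension installs at the HTTPS endpoint; the setting name may
# differ between DuckDB versions, so treat this as an assumption.
dbExecute(con, "SET custom_extension_repository = 'https://extensions.duckdb.org'")
dbExecute(con, "INSTALL httpfs")
dbDisconnect(con, shutdown = TRUE)
```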
Any known tricks to speed up install time?
Also, this happens during R CMD check:
* checking whether package 'duckdb' can be installed ...
[35m/35m] OK
35 minutes is quite a blocker for comfortably testing changes...
OS: macOS
DuckDB version: 0.9.2-1
> tbl(con,sql("SELECT * FROM range(100) AS tt(x)")) %>%
+ summarise(h=sql("histogram(x)"))
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! rapi_prepare: Unknown column type for prepare: MAP(BIGINT, UBIGINT)
Run `rlang::last_trace()` to see where the error occurred.
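A workaround sketch, assuming you are fine with receiving the histogram as text rather than a native R structure: cast the MAP to VARCHAR on the DuckDB side, so the driver never has to transfer a MAP column.

```r
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(duckdb::duckdb())
# Casting the MAP to VARCHAR sidesteps the unsupported column type;
# the result arrives as a single string, not an R list.
tbl(con, sql("SELECT * FROM range(100) AS tt(x)")) %>%
  summarise(h = sql("CAST(histogram(x) AS VARCHAR)")) %>%
  collect()
dbDisconnect(con, shutdown = TRUE)
```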
Starting with the relational API.
Opt-in for dbGetQuery(). The relational API and duckplyr are currently missing too many pieces.
DuckDB's ability to execute SELECT statements directly on files (CSV, Parquet, etc.) worked fine with dbplyr up until 2.3.4. However, with the upgrade to 2.4.0 it fails. In the example below I'm querying a specific file called metadata.0.2.3.parquet, but you can replace it with any parquet file and the same issue will occur:
duckdb::duckdb() |>
DBI::dbConnect(drv = _, read_only = TRUE) |>
dplyr::tbl("metadata.0.2.3.parquet")
It looks like you tried to incorrectly use a table in a schema as source.
ℹ If you want to specify a schema use `in_schema()` or `in_catalog()`.
ℹ If your table actually contains "." in the name use `check_from = FALSE` to silence this message.
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! rapi_prepare: Failed to prepare query SELECT "metadata.0.2.3.parquet".*
FROM "metadata.0.2.3.parquet"
LIMIT 11
Error: Binder Error: Referenced table "metadata.0.2.3.parquet" not found!
Candidate tables: "metadata"
Run `rlang::last_trace()` to see where the error occurred.
According to tidyverse/dbplyr#1390, this bug is something that needs to be fixed in the duckdb package.
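Until that is fixed, one workaround sketch is to bypass dbplyr's identifier parsing entirely and hand DuckDB an explicit read_parquet() query (the file name here is just the one from the example above):

```r
con <- duckdb::duckdb() |>
  DBI::dbConnect(drv = _, read_only = TRUE)
# Querying the file explicitly avoids treating the dotted file name as a
# schema-qualified table identifier.
dplyr::tbl(con, dplyr::sql("SELECT * FROM read_parquet('metadata.0.2.3.parquet')"))
```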
I've started implementing duckdb at work using {targets} and RStudio. However, between a couple of coworkers, various workflows persistently grind to a halt with:
Last error: rapi_startup: Failed to open database: IO Error: Cannot open file "...": The process cannot access the file because it is being used by another process.
We've tried being extra vigilant by always connecting with:
con <- DBI::dbConnect(
duckdb::duckdb(),
dbdir = ...,
read_only = TRUE
)
And always disconnecting with:
DBI::dbDisconnect(con, shutdown = TRUE)
But when we need to update tables in the database and connect with read_only = FALSE, the 'used by another process' error always appears, and we can't ascertain where the lingering connection is, even after disconnecting, clearing our environments, and restarting R. Is this possibly an RStudio thing? Any guidance on managing connections would be greatly appreciated.
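In the meantime, a defensive pattern that may help is to scope every connection so it is shut down even when the code in between errors. This is a sketch using {withr}; `with_duckdb` is a hypothetical helper name, not part of the package.

```r
with_duckdb <- function(dbdir, code, read_only = TRUE) {
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = dbdir, read_only = read_only)
  # Runs when with_duckdb() exits, whether `code` succeeds or throws.
  withr::defer(DBI::dbDisconnect(con, shutdown = TRUE))
  code(con)
}

# Usage: the connection never outlives the call.
with_duckdb("mydb.duckdb", function(con) {
  DBI::dbGetQuery(con, "SELECT 42 AS answer")
}, read_only = FALSE)
```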
The latter give headaches with other R tooling.
duckdb crashes when trying to perform queries on a large (~150 GB) parquet database
Apologies, I cannot reproduce this with a smaller dataset. I can reproduce the same errors using remote S3 access, though, which might be a bit faster than downloading a full copy of the dataset; alternatively, you could probably download a copy from my server in a few days. Reproducible examples of both remote and local access are below. (The remote example goes via arrow because I cannot yet get direct access to S3 buckets via duckdb alone.)
##
library(arrow)
library(dplyr)
library(duckdb)
path <- arrow::s3_bucket("ebird/observations", endpoint_override = "minio.carlboettiger.info", anonymous=TRUE)
obs <- arrow::open_dataset(path) |> to_duckdb()
tmp <- obs |>
group_by(sampling_event_identifier, scientific_name) |>
summarize(count = sum(observation_count, na.rm=TRUE),
.groups = "drop")
tmp <- tmp |> compute() # crashes
Or, after downloading from the above public S3 bucket (https://minio.carlboettiger.info/ebird/Mar-2022/observations), try local parquet access:
## pure local duckdb crashes on simple operations:
library(duckdb)
library(dplyr)
parquet <- file.path("/home/shared-data/ebird/observations/*.parquet")
conn <- dbConnect(duckdb(), "/home/shared-data/ebird/db")
dbExecute(conn = conn, paste0("PRAGMA memory_limit='12GB'"))
view_query <- paste0("CREATE VIEW 'observations' AS SELECT * FROM parquet_scan('",
parquet, "');")
DBI::dbSendQuery(conn, view_query)
obs <- tbl(conn, "observations")
tmp <- obs |>
group_by(sampling_event_identifier, scientific_name) |>
summarize(count = sum(observation_count, na.rm=TRUE),
.groups = "drop")
tmp <- tmp |> compute()
With the memory limit PRAGMA in place, the example doesn't crash the R session but does throw an OOM error:
Error: duckdb_execute_R: Failed to run query
Error: Out of Memory Error: could not allocate block of 262144 bytes
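One knob that may help here, in addition to the memory limit, is letting DuckDB spill intermediate results to disk. A sketch, using the `conn` from the example above; the spill path is a placeholder:

```r
# With a temp directory set, a larger-than-memory aggregation can spill
# to disk instead of failing with an allocation error.
DBI::dbExecute(conn, "PRAGMA memory_limit='12GB'")
DBI::dbExecute(conn, "PRAGMA temp_directory='/tmp/duckdb_spill'")
```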
Have you tried the master branch?
By the way,
install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)
no longer works, as there no longer seems to be a master-builds download?
Is there any integration with the nanoarrow package and ADBC?
It does not seem to be well documented.
The readme suggests running R CMD INSTALL . on this repository to get "the bleeding edge", but this doesn't get the latest version of the underlying duckdb C++ library.
It would be nice to have shorthand instructions for installing the package with the latest versions of both duckdb-r and duckdb, and to have those instructions added to the installation section of duckdb-web, the more so since there seems to be a large delay between releases of libduckdb and the R package.
After installing 0.9.99.9000, array references (such as array_value(1,2,3)) return "unknown column type" from dbGetQuery(). For example:
con <- dbConnect(duckdb())
dbGetQuery(con,"select array_value(1,2,3)")
returns:
Error: rapi_prepare: Unknown column type for prepare: INTEGER[3]
Whereas a list reference:
dbGetQuery(con,"select list_value(1,2,3)")
displays the list:
list_value(1, 2, 3)
1 1, 2, 3
Array creation in general appears to work within ddb 0.9.99.9000, however.
Hello, first of all thanks for this amazing package.
Before the 0.10.0 (stable) version, I had silent errors when exporting a MariaDB table into DuckDB:
LOAD mysql;
ATTACH 'host=127.0.0.1 user=username password=pwd port=0 database=db_name port=3306' AS mysqldb (TYPE mysql, READ_ONLY);
CREATE OR REPLACE TABLE %s AS SELECT * FROM mysqldb.%s;
This resulted in DuckDB tables with some empty values.
Since 0.10.0 (stable) it works using the DuckDB CLI.
But as the .duckdb file format generated by the 0.10.0 (stable) version is not recognized by the duckdb R package version duckdb_0.9.2-1, I have to export the tables to parquet and then instantiate a new DuckDB instance importing those parquet files:
> con <- dbConnect(duckdb("mydb.duckdb", read_only = TRUE))
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'drv' in selecting a method for function 'dbConnect': rapi_startup: Failed to open database: Serialization Error: Failed to deserialize: field id mismatch, expected: 201, got: 103
Any chance we get a 0.10.0 (stable) compatible package in R ?
Thanks a lot in advance.
The relevant info can also be hard-coded in the R code.
This is a toy example, but relevant for some CRAN packages with the default setting of max_expression_depth. The symptoms are the same as when evaluating rel7.
Ideally, we would already see an error when constructing rel5. However, the system lets me construct rel5 and even rel6; only the construction of rel7 fails, with the same error as the evaluation of rel5. Is this an off-by-two error, or something more serious?
duckplyr can fall back to dplyr if the error happens at construction, but not at evaluation -- by then it is too late. An error on construction of rel5, or perhaps even rel4, would fix the downstream problem. How to achieve this?
duckdb <- asNamespace("duckdb")
con <- DBI::dbConnect(duckdb::duckdb())
experimental <- FALSE
df1 <- tibble::tibble(id = 1L)
DBI::dbExecute(con, "SET max_expression_depth TO 5")
#> [1] 0
rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_project(
rel1,
list({
tmp_expr <- duckdb$expr_reference("id")
duckdb$expr_set_alias(tmp_expr, "id")
tmp_expr
})
)
rel3 <- duckdb$rel_project(
rel2,
list({
tmp_expr <- duckdb$expr_reference("id")
duckdb$expr_set_alias(tmp_expr, "id")
tmp_expr
})
)
rel4 <- duckdb$rel_project(
rel3,
list({
tmp_expr <- duckdb$expr_reference("id")
duckdb$expr_set_alias(tmp_expr, "id")
tmp_expr
})
)
rel4
#> DuckDB Relation:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [id as id]
#> Projection [id as id]
#> Projection [id as id]
#> r_dataframe_scan(0x11cca4278)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - id (INTEGER)
rel5 <- duckdb$rel_project(
rel4,
list({
tmp_expr <- duckdb$expr_reference("id")
duckdb$expr_set_alias(tmp_expr, "id")
tmp_expr
})
)
rel5
#> DuckDB Relation:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [id as id]
#> Projection [id as id]
#> Projection [id as id]
#> Projection [id as id]
#> r_dataframe_scan(0x11cca4278)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - id (INTEGER)
rel6 <- duckdb$rel_project(
rel5,
list({
tmp_expr <- duckdb$expr_reference("id")
duckdb$expr_set_alias(tmp_expr, "id")
tmp_expr
})
)
rel6
#> DuckDB Relation:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [id as id]
#> Projection [id as id]
#> Projection [id as id]
#> Projection [id as id]
#> Projection [id as id]
#> r_dataframe_scan(0x11cca4278)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - id (INTEGER)
rel7 <- duckdb$rel_project(
rel6,
list({
tmp_expr <- duckdb$expr_reference("id")
duckdb$expr_set_alias(tmp_expr, "id")
tmp_expr
})
)
#> Error: {"exception_type":"Binder","exception_message":"Max expression depth limit of 5 exceeded. Use \"SET max_expression_depth TO x\" to increase the maximum expression depth."}
rel7
#> Error in eval(expr, envir, enclos): object 'rel7' not found
duckdb$rel_to_altrep(rel6)
#> Error: Error evaluating duckdb query: Parser Error: Maximum tree depth of 5 exceeded in logical planner
duckdb$rel_to_altrep(rel5)
#> Error: Error evaluating duckdb query: Parser Error: Maximum tree depth of 5 exceeded in logical planner
duckdb$rel_to_altrep(rel4)
#> id
#> 1 1
Created on 2024-03-10 with reprex v2.1.0
Hi, I think it's been about a week since DuckDB 0.9.0 was released. Are there any plans to release the R package on CRAN?
I have an issue with a duckdb database created through the R interface. I noticed that the database file keeps growing even if I delete all the tables. A garbage-collector warning, generated by calling gc(), asks to use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv), but after executing these commands the file size was unchanged and the warning appeared again on the next connection.
Here is a reproducible example:
> library(duckdb)
> con <- dbConnect(duckdb(), dbdir = "dbtest.duckdb", read_only = FALSE) # file size 12K
> dbWriteTable(con,"tbl",data.frame(x=rnorm(10000000))) # file size: 85.7 Mb
> dbListTables(con)
[1] "tbl"
> dbRemoveTable(con,'tbl') # file size: 85.7 Mb
> dbListTables(con)
character(0)
> dbDisconnect(con, shutdown=T) # file size: 85.7 Mb
> gc()
used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 3792014 202.6 6845970 365.7 NA 6612186 353.2
Vcells 6318596 48.3 21276863 162.4 98304 17944806 137.0
Warning message:
Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this.
# file size is still 85.7 Mb
> duckdb::duckdb_shutdown(duckdb('dbtest.duckdb'))
> gc() # no change in the file size
Any advice?
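As far as I understand, DuckDB marks freed blocks for reuse but does not shrink the file in place. One workaround sketch for reclaiming disk space is to round-trip the remaining data through EXPORT DATABASE / IMPORT DATABASE into a fresh file (paths below are placeholders):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb(), dbdir = "dbtest.duckdb")
# Dump whatever tables are left (schema + data) to a directory...
dbExecute(con, "EXPORT DATABASE 'dbtest_export'")
dbDisconnect(con, shutdown = TRUE)

# ...and rebuild a compact database file from that dump.
con2 <- dbConnect(duckdb(), dbdir = "dbtest_compact.duckdb")
dbExecute(con2, "IMPORT DATABASE 'dbtest_export'")
dbDisconnect(con2, shutdown = TRUE)
```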
Here is my SessionInfo():
> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Amsterdam
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] duckdb_0.9.2-1 DBI_1.2.0
loaded via a namespace (and not attached):
[1] terra_1.7-55 vctrs_0.6.5 cli_3.6.2 rlang_1.1.3
[5] ncdf4_1.22 png_0.1-8 purrr_1.0.2 generics_0.1.3
[9] glue_1.7.0 plyr_1.8.9 maxEnt_1.0-8 sdm_1.2-18
[13] sp_2.0-0 fansi_1.0.6 grid_4.3.2 tibble_3.2.1
[17] interp_1.1-4 lifecycle_1.0.4 compiler_4.3.2 dplyr_1.1.3
[21] codetools_0.2-19 RColorBrewer_1.1-3 Rcpp_1.0.12 pkgconfig_2.0.3
[25] tidyr_1.3.0 rstudioapi_0.15.0 latticeExtra_0.6-30 lattice_0.22-5
[29] R6_2.5.1 tidyselect_1.2.0 utf8_1.2.4 pillar_1.9.0
[33] parallel_4.3.2 magrittr_2.0.3 jpeg_0.1-10 tools_4.3.2
[37] deldir_1.0-9 raster_3.6-23
I haven't tried all the metacharacters, but the following commands in the R version of duckdb:
dbGetQuery(con,"select regexp_matches('blah','\?')")
dbGetQuery(con,"select regexp_matches('blah','\.')")
dbGetQuery(con,"select regexp_matches('blah','\*')")
dbGetQuery(con,"select regexp_matches('blah','\+')")
all return an error of the form:
Error: '\?' is an unrecognized escape in character string (<input>:1:48)
Even though all these commands work in the CLI version. Am I doing something wrong?
Thanks.
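The error comes from the R parser, not from DuckDB: `\?` is not a valid escape sequence in an R string literal. Doubling the backslash, so that DuckDB receives the same `\?` you would type into the CLI, should work:

```r
# "\\?" in R source is the two-character string \? once parsed, which is
# what the CLI receives when you type \? directly.
dbGetQuery(con, "select regexp_matches('blah', '\\?')")
dbGetQuery(con, "select regexp_matches('bl.h', '\\.')")
```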
Hive partitioning (hive_partitioning) is a new feature of the duckdb CLI.
Is this currently supported in the R package? If not, are there any plans to do so?
Thanks to all the developers.
It would be great if a pkgdown website could be added.
I saw this post from ibis:
https://ibis-project.org/posts/ibis-duckdb-geospatial/
It would be great if the R client also had an integration with the spatial extension. (For now, it seems to be difficult to handle because values are converted to R's raw type.)
Perhaps integrating with the geoarrow package would make sense?
@paleolimbot Sorry for tagging you, but do you have any perspectives on such integrations?
@krlmlr looking into the test failure you mentioned here darwin-eu-dev/PatientProfiles#479, I think this has been caused by a slight change in behaviour with date manipulation (see below where the current cran version returns a date but the new version gives a date time).
library(duckdb)
#> Warning: package 'duckdb' was built under R version 4.2.3
#> Loading required package: DBI
#> Warning: package 'DBI' was built under R version 4.2.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.2.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(DBI)
library(CDMConnector)
#> Warning: package 'CDMConnector' was built under R version 4.2.3
con <- DBI::dbConnect(duckdb(), path = ":memory:")
packageVersion("duckdb")
#> [1] '0.9.2.1'
test_data <- data.frame(person = 1L,
date_1 = as.Date("2001-01-01"))
db_test_data <- copy_to(con, test_data, overwrite = TRUE)
db_test_data <- db_test_data %>%
dplyr::mutate(date_2 = date_1 + years(1))
db_test_data
#> # Source: SQL [1 x 3]
#> # Database: DuckDB v0.9.2 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#> person date_1 date_2
#> <int> <date> <date>
#> 1 1 2001-01-01 2002-01-01
class(db_test_data %>%
pull("date_2"))
#> [1] "Date"
Created on 2024-03-12 with reprex v2.0.2
library(duckdb)
#> Warning: package 'duckdb' was built under R version 4.2.3
#> Loading required package: DBI
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.2.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(DBI)
library(CDMConnector)
#> Warning: package 'CDMConnector' was built under R version 4.2.3
con <- DBI::dbConnect(duckdb(), path = ":memory:")
packageVersion("duckdb")
#> [1] '0.10.0'
test_data <- data.frame(person = 1L,
date_1 = as.Date("2001-01-01"))
db_test_data <- copy_to(con, test_data, overwrite = TRUE)
db_test_data <- db_test_data %>%
dplyr::mutate(date_2 = date_1 + years(1))
db_test_data
#> # Source: SQL [1 x 3]
#> # Database: DuckDB v0.10.0 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#> person date_1 date_2
#> <int> <date> <dttm>
#> 1 1 2001-01-01 2002-01-01 00:00:00
class(db_test_data %>%
pull("date_2"))
#> [1] "POSIXct" "POSIXt"
Created on 2024-03-12 by the reprex package (v2.0.1)
Instead of:
con <- dbConnect(duckdb())
on.exit(dbDisconnect(con, shutdown = TRUE))
we'd have:
con <- local_con()
or some other name, to follow the guidelines from https://testthat.r-lib.org/articles/test-fixtures.html
Something like this (borrowed from https://github.com/hannes/duckdb-rfuns):
local_con <- function() {
con <- dbConnect(duckdb())
withr::defer_parent(dbDisconnect(con, shutdown=TRUE))
con
}
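For completeness, usage inside a test would then look like this (a sketch; assumes {testthat} and {withr} are available):

```r
test_that("queries work against a throwaway connection", {
  con <- local_con()
  expect_equal(DBI::dbGetQuery(con, "SELECT 42 AS x")$x, 42L)
  # No explicit teardown needed: withr::defer_parent() disconnects when
  # this test block exits.
})
```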
I was wondering if it would be possible to expose the "INSERT OR IGNORE/REPLACE" functionality to the append and write functions so that the conflicts are managed at a row wise level.
Thanks
The DBI document says:
The following arguments are not part of the dbWriteTable() generic (to improve compatibility across backends) but are part of the DBI specification:
- row.names (default: FALSE)
- overwrite (default: FALSE)
- append (default: FALSE)
- field.types (default: NULL)
- temporary (default: FALSE)
They must be provided as named arguments. See the "Specification" and "Value" sections for details on their usage.
But now dbWriteTable__duckdb_connection_character_data.frame can specify these options without names.
I think we need to change the position of the dots argument (...).
duckdb-r/R/dbWriteTable__duckdb_connection_character_data.frame.R
Lines 9 to 17 in 8386b9c
curl -I https://cran.r-project.org/bin/macosx/big-sur-arm64/contrib/4.3/duckdb_0.9.2-1.tgz
...
Content-Length: 85023346
The file is over 80 MB. When unpacked, it contains ca. 300 MB of debug symbols (duckdb.so.dSYM).
We should investigate whether it is possible to not include those in the download.
CC @hadley
con <- dbConnect(duckdb())
dbWriteTable(con, "mtcars", mtcars)
# the following works:
tbl(con,"mtcars")
# the following also seems to work, but it prints an error: "It looks like you tried to incorrectly use a table in a schema as source."
tbl(con,"main.mtcars")
# The error is printed whenever you use tbl() with a schema name
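For reference, the message can be avoided by being explicit about the schema, as the message itself suggests (a sketch; assumes {dbplyr} is installed):

```r
# Being explicit with in_schema() tells dbplyr the dot is a schema
# separator, not part of the table name.
dplyr::tbl(con, dbplyr::in_schema("main", "mtcars"))
```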
Will simplify future investigations.
To what extent is the SQL generated by rel_to_sql() supposed to be runnable? A simple example involving row_number() already gives parse errors.
con <- DBI::dbConnect(duckdb::duckdb())
experimental <- FALSE
invisible(DBI::dbExecute(con, "CREATE MACRO \"==\"(a, b) AS a = b"))
df1 <- data.frame(a = 1)
rel1 <- duckdb:::rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb:::rel_project(
rel1,
list({
tmp_expr <- duckdb:::expr_window(duckdb:::expr_function("row_number", list()), list(), list(), offset_expr = NULL, default_expr = NULL)
duckdb:::expr_set_alias(tmp_expr, "___row_number")
tmp_expr
})
)
sql <- duckdb:::rel_to_sql(rel2)
writeLines(sql)
#> SELECT row_number(, NULL, NULL) OVER () AS ___row_number FROM (SELECT * FROM r_dataframe_scan(0x107deb930, (experimental = false))) AS dataframe_4427004208_1881778879
DBI::dbGetQuery(con, sql)
#> Error: Parser Error: syntax error at or near ","
#> LINE 1: SELECT row_number(, NULL, NULL) OVER () AS ___row_number ...
#> ^
Created on 2023-04-10 with reprex v2.0.2
macOS aarch64
7bdaddf8e4504405218c8521c7c38d9c3abf33f6
R
Kirill Müller
cynkra GmbH
master branch?
For BIGINT columns, the help page ?duckdb suggests that a field bigint = "integer64" can be given either in dbConnect(duckdb(bigint = "integer64")) or in dbConnect(duckdb(), bigint = "integer64"), but it is only respected in the first case.
The help page code is at https://github.com/duckdb/duckdb/blob/f7f5cf30aa5dfbc900c98c57bef2d729f2bf1b56/tools/rpkg/man/duckdb.Rd#L31, and the implementation showing that bigint will often be ignored is at https://github.com/duckdb/duckdb/blob/f7f5cf30aa5dfbc900c98c57bef2d729f2bf1b56/tools/rpkg/R/dbConnect__duckdb_driver.R#L55-L58
Install dependencies:
install.packages(c('duckdb', 'bit64'))
library(duckdb)
bigint is used when given in duckdb() ...
con = dbConnect(duckdb(bigint = "integer64"))
dbWriteTable(con, "one", data.frame(i64 = bit64::as.integer64(1:5)))
dbGetQuery(con, "DESCRIBE one")
## column_name column_type null key default extra
## 1 i64 BIGINT YES <NA> <NA> NA
...but not when passed to dbConnect():
con = dbConnect(duckdb(), bigint = "integer64")
dbWriteTable(con, "one", data.frame(i64 = bit64::as.integer64(1:5)))
dbGetQuery(con, "DESCRIBE one")
## column_name column_type null key default extra
## 1 i64 DOUBLE YES <NA> <NA> NA
aarch64-apple-darwin21.6.0
0.8.1.1
R
Martin Morgan
Roswell Park Comprehensive Cancer Center
master branch?
con <- dbConnect(duckdb())
dbGetQuery(con, "select array_value(1,2,3)")
#> Error: rapi_prepare: Unknown column type for prepare: INTEGER[3]
Behavior differs between relational and SQL. @Tmonster: why would that be?
duckdb <- asNamespace("duckdb")
con <- DBI::dbConnect(duckdb::duckdb())
experimental <- FALSE
invisible(DBI::dbExecute(con, "CREATE MACRO \"!=\"(x, y) AS x <> y"))
df1 <- data.frame(a = 1)
rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_filter(
rel1,
list(
duckdb$expr_function(
"!=",
list(
duckdb$expr_reference("a"),
if ("experimental" %in% names(formals(duckdb$expr_constant))) {
duckdb$expr_constant("", experimental = experimental)
} else {
duckdb$expr_constant("")
}
)
)
)
)
rel2
#> DuckDB Relation:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Filter [!=(a, '')]
#> r_dataframe_scan(0x152320908)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - a (DOUBLE)
duckdb$rel_to_altrep(rel2)
#> [1] a
#> <0 rows> (or 0-length row.names)
Created on 2023-11-16 with reprex v2.0.2
✗ echo 'CREATE MACRO "!="(x, y) AS x <> y; SELECT 1.0 != '"''"' AS a' | build/debug/duckdb
Error: near line 1: Conversion Error: Could not convert string "" to DECIMAL(2,1)
✗ echo 'CREATE MACRO "!="(x, y) AS x <> y; SELECT '"''"' != 1.0 AS a' | build/debug/duckdb
Error: near line 1: Conversion Error: Could not convert string "" to DECIMAL(2,1)
Currently there is a separate function named duckdb_register_arrow available apart from duckdb_register, but I am wondering if these can be made into one S3 generic function.
> con <- dbConnect(duckdb::duckdb())
> dbGetQuery(con, "SELECT map([1,2],['a','b']) AS x;")
Error: rapi_prepare: Unknown column type for prepare: MAP(INTEGER, VARCHAR)
Supporting this might lead to faster execution in duckplyr.
Needs duckdb/duckdb#8600.
con <- DBI::dbConnect(duckdb::duckdb())
experimental <- FALSE
invisible(
DBI::dbExecute(con, "CREATE MACRO \"___eq_na_matches_na\"(x, y) AS (x IS DISTINCT FROM y)")
)
df1 <- data.frame(a = 1L)
rel1 <- duckdb:::rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb:::rel_set_alias(rel1, "lhs")
rel3 <- duckdb:::rel_from_df(con, df1, experimental = experimental)
rel4 <- duckdb:::rel_set_alias(rel3, "rhs")
rel5 <- duckdb:::rel_join(
rel2,
rel4,
list(
duckdb:::expr_function(
"___eq_na_matches_na",
list(duckdb:::expr_reference("a", rel2), duckdb:::expr_reference("a", rel4))
)
),
"anti"
)
rel5
#> DuckDB Relation:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Join REGULAR ANTI ___eq_na_matches_na(lhs.a, rhs.a)
#> r_dataframe_scan(0x12821dee8)
#> r_dataframe_scan(0x12821dee8)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - a (INTEGER)
duckdb:::rel_to_altrep(rel5)
#> Error in row.names.data.frame(x): Error evaluating duckdb query: Not implemented Error: Unimplemented comparison type for join!
Created on 2023-08-17 with reprex v2.0.2
macOS aarch64
0ff709bdc628ea24111265eb66d74220ce3bb6df
R
Kirill Müller
cynkra GmbH
master branch?
I have tested with a master build.
duckdb-only and duckdb-CLI-only examples pending.
Could it be that the join is O(n^2) in this particular case?
options(conflicts.policy = list(warn = FALSE))
library(duckplyr)
data <- data.frame(a = seq_len(30000))
arrow::write_parquet(data, "data.parquet")
x <- duckplyr_df_from_file("data.parquet", "read_parquet")
y <- duckplyr_df_from_file("data.parquet", "read_parquet")
bench::mark(
iterations = 1,
collect(arrange(left_join(x, y, by = "a"), a)),
collect(arrange(left_join(x, y, by = "a", na_matches = "never"), a))
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Order [a ASC]
#> Projection [___coalesce(lhs.a_x, rhs.a_y) as a]
#> Join REGULAR LEFT ___eq_na_matches_na(lhs.a_x, rhs.a_y)
#> Projection [a as a_x]
#> read_parquet(data.parquet)
#> Projection [a as a_y]
#> read_parquet(data.parquet)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - a (INTEGER)
#>
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Order [a ASC]
#> Projection [___coalesce(lhs.a_x, rhs.a_y) as a]
#> Join REGULAR LEFT ==(lhs.a_x, rhs.a_y)
#> Projection [a as a_x]
#> read_parquet(data.parquet)
#> Projection [a as a_y]
#> read_parquet(data.parquet)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - a (INTEGER)
#>
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Order [a ASC]
#> Projection [___coalesce(lhs.a_x, rhs.a_y) as a]
#> Join REGULAR LEFT ___eq_na_matches_na(lhs.a_x, rhs.a_y)
#> Projection [a as a_x]
#> read_parquet(data.parquet)
#> Projection [a as a_y]
#> read_parquet(data.parquet)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - a (INTEGER)
#>
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Order [a ASC]
#> Projection [___coalesce(lhs.a_x, rhs.a_y) as a]
#> Join REGULAR LEFT ==(lhs.a_x, rhs.a_y)
#> Projection [a as a_x]
#> read_parquet(data.parquet)
#> Projection [a as a_y]
#> read_parquet(data.parquet)
#>
#> ---------------------
#> -- Result Columns --
#> ---------------------
#> - a (INTEGER)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 "collect(arrange(left_join(x, … 628.13ms 628.13ms 1.59 1.66MB 0
#> 2 "collect(arrange(left_join(x, … 4.48ms 4.48ms 223. 6.9KB 0
Created on 2023-11-06 with reprex v2.0.2
Hello,
I cannot install some extensions with R 4.3.1 on Windows.
I use duckdb_0.8.1-3 with DBI_1.1.3. I also tested v0.9.0, but in that version more extensions failed to install.
In v0.8.1 the extensions icu, json, httpfs and fts work fine, but postgres_scanner and spatial fail to install. It looks like they don't exist in the repository for the windows_amd64_rtools platform. I tried to install them manually from the windows_amd64 repository (without the _rtools suffix), but they crashed the R session in RStudio when loaded.
I tested the same thing on Linux amd64 and arm64, and all extensions work fine in versions 0.8.1 and 0.9.0.
> dbExecute(con,"install postgres_scanner;")
Error: rapi_execute: Failed to run query
Error: HTTP Error: Failed to download extension "postgres_scanner" at URL "http://extensions.duckdb.org/v0.8.1/windows_amd64_rtools/postgres_scanner.duckdb_extension.gz"
Extension "postgres_scanner" is an existing extension.
Are you using a development build? In this case, extensions might not (yet) be uploaded.
> dbExecute(con,"load postgres_scanner;")
Error: rapi_execute: Failed to run query
Error: HTTP Error: Failed to download extension "postgres_scanner" at URL "http://extensions.duckdb.org/v0.8.1/windows_amd64_rtools/postgres_scanner.duckdb_extension.gz"
Extension "postgres_scanner" is an existing extension.
Are you using a development build? In this case, extensions might not (yet) be uploaded.
> dbGetQuery(con,"from duckdb_extensions();") |> as_tibble() |> select(extension_name,loaded,installed)
# A tibble: 14 × 3
extension_name loaded installed
<chr> <lgl> <lgl>
1 autocomplete FALSE FALSE
2 fts TRUE TRUE
3 httpfs FALSE TRUE
4 icu TRUE TRUE
5 inet FALSE FALSE
6 jemalloc FALSE FALSE
7 json TRUE TRUE
8 motherduck FALSE FALSE
9 parquet TRUE TRUE
10 postgres_scanner FALSE FALSE
11 spatial FALSE FALSE
12 sqlite_scanner FALSE FALSE
13 tpcds FALSE FALSE
14 tpch FALSE FALSE
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Spain.utf8 LC_CTYPE=Spanish_Spain.utf8
[3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.utf8
time zone: Europe/Madrid
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3 purrr_1.0.2
[6] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.3 tidyverse_2.0.0
[11] duckdb_0.8.1-3 DBI_1.1.3
loaded via a namespace (and not attached):
[1] tcltk_4.3.1 writexl_1.4.2 rlang_1.1.1 magrittr_2.0.3
[5] shinydashboard_0.7.2 snakecase_0.11.1 compiler_4.3.1 vctrs_0.6.3
[9] pkgconfig_2.0.3 shape_1.4.6 fastmap_1.1.1 backports_1.4.1
[13] ellipsis_0.3.2 utf8_1.2.3 shinyjqui_0.4.1 promises_1.2.1
[17] rmarkdown_2.25 tzdb_0.4.0 nloptr_2.0.3 missMDA_1.18
[21] bit_4.0.5 xfun_0.40 glmnet_4.1-8 jomo_2.7-6
[25] jsonlite_1.8.7 flashClust_1.01-2 later_1.3.1 pan_1.9
[29] broom_1.0.5 parallel_4.3.1 cluster_2.1.4 R6_2.5.1
[33] FactoInvestigate_1.8 stringi_1.7.12 boot_1.3-28.1 rpart_4.1.19
[37] estimability_1.4.1 Rcpp_1.0.11 assertthat_0.2.1 iterators_1.0.14
[41] knitr_1.44 base64enc_0.1-3 httpuv_1.6.11 Matrix_1.6-1.1
[45] splines_4.3.1 nnet_7.3-19 timechange_0.2.0 tidyselect_1.2.0
[49] rstudioapi_0.15.0 yaml_2.3.7 miniUI_0.1.1.1 doParallel_1.0.17
[53] codetools_0.2-19 lattice_0.21-8 withr_2.5.0 shiny_1.7.5
[57] evaluate_0.21 survival_3.5-7 pillar_1.9.0 mice_3.16.0
[61] corrplot_0.92 DT_0.29 foreach_1.5.2 generics_0.1.3
[65] hms_1.1.3 munsell_0.5.0 scales_1.2.1 minqa_1.2.6
[69] xtable_1.8-4 leaps_3.1 glue_1.6.2 janitor_2.2.0
[73] emmeans_1.8.8 scatterplot3d_0.3-44 tools_4.3.1 lme4_1.1-34
[77] colourpicker_1.3.0 mvtnorm_1.2-3 grid_4.3.1 colorspace_2.1-0
[81] nlme_3.1-163 repr_1.1.6 cli_3.6.1 fansi_1.0.4
[85] arrow_13.0.0.1 gtable_0.3.4 digest_0.6.33 ggrepel_0.9.3
[89] FactoMineR_2.8 htmlwidgets_1.6.2 skimr_2.1.5 htmltools_0.5.6
[93] Factoshiny_2.4 lifecycle_1.0.3 multcompView_0.1-9 mitml_0.4-5
[97] mime_0.12 bit64_4.0.5 MASS_7.3-60
> list.files("xxx\\AppData\\Roaming\\R\\data\\R\\duckdb\\.duckdb\\extensions\\v0.8.1\\windows_amd64_rtools\\")
[1] "fts.duckdb_extension" "httpfs.duckdb_extension" "icu.duckdb_extension"
[4] "json.duckdb_extension"
Thank you
Hi,
I saw some code (here) that used config = list() to pass parameters to a duckdb connection.
I tried it, but for me the parameters are ignored, as illustrated below where the memory limit is 432GB instead of the requested 1GB.
Passing parameters using dbExecute(con, "PRAGMA threads=1; PRAGMA memory_limit='1GB';") appears to work.
Maybe config = list() is not a thing; I just wanted to make sure:
library(dplyr)
library(dbplyr)
library(duckdb)
duckdb_path <- "/devroot/sandbox/tmp/duckdb.duckdb"
con <- dbConnect(duckdb::duckdb(dbdir = duckdb_path), config=list("memory_limit"="1GB", "threads" = "1") )
dbGetQuery(con, "select current_setting('memory_limit')") # 432.6GB
dbGetQuery(con, "select current_setting('threads')") # 64
dbExecute(con, "PRAGMA threads=1; PRAGMA memory_limit='1GB';")
dbGetQuery(con, "select current_setting('memory_limit')") # 1GB
dbGetQuery(con, "select current_setting('threads')") # 1
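As a data point, the driver constructor duckdb::duckdb() also accepts a config list. Below is a sketch of passing the settings there instead of to dbConnect(); whether this behaves differently on 0.9.2-1 is exactly what is in question here, so treat it as an experiment rather than a fix (memory_limit and threads are standard DuckDB settings):

```r
library(DBI)
library(duckdb)

# Experiment: pass the settings via the driver constructor's `config`
# argument rather than via dbConnect().
drv <- duckdb::duckdb(config = list(memory_limit = "1GB", threads = "1"))
con <- dbConnect(drv)

# Check whether the settings were applied:
dbGetQuery(con, "SELECT current_setting('memory_limit')")
dbGetQuery(con, "SELECT current_setting('threads')")

dbDisconnect(con, shutdown = TRUE)
```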
Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.0.2 (2020-06-22)
os OpenShift Enterprise
system x86_64, linux-gnu
ui RStudio
language (EN)
collate en_CA.UTF-8
ctype en_CA.UTF-8
tz America/Toronto
date 2024-02-23
rstudio 2023.06.0+421.pro1 Mountain Hydrangea (server)
pandoc NA
─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P cli 3.6.2 2023-12-11 [?] RSPM (R 4.0.5)
P DBI * 1.2.1 2024-01-12 [?] RSPM (R 4.0.5)
P dbplyr * 2.4.0 2023-10-26 [?] RSPM (R 4.0.5)
P dplyr * 1.1.4 2023-11-17 [?] RSPM (R 4.0.5)
P duckdb * 0.9.2-1 2023-11-28 [?] RSPM (R 4.0.5)
P fansi 1.0.4 2023-01-22 [?] CRAN (R 4.0.2)
P generics 0.1.3 2022-07-05 [?] RSPM (R 4.0.5)
P glue 1.6.2 2022-02-24 [?] RSPM (R 4.0.5)
P lifecycle 1.0.3 2022-10-07 [?] RSPM (R 4.0.5)
magrittr 2.0.3 2022-03-30 [1] RSPM (R 4.0.5)
P pillar 1.9.0 2023-03-22 [?] RSPM (R 4.0.5)
P pkgconfig 2.0.3 2019-09-22 [?] RSPM (R 4.0.3)
R6 2.5.1 2021-08-19 [1] RSPM (R 4.0.5)
renv 1.0.3 2023-09-19 [1] RSPM (R 4.0.2)
P rlang 1.1.3 2024-01-10 [?] RSPM (R 4.0.5)
P rstudioapi 0.15.0 2023-07-07 [?] RSPM (R 4.0.5)
P sessioninfo 1.2.2 2021-12-06 [?] RSPM (R 4.0.5)
P tibble 3.2.1 2023-03-20 [?] RSPM (R 4.0.5)
P tidyselect 1.2.0 2022-10-10 [?] RSPM (R 4.0.5)
P utf8 1.2.3 2023-01-31 [?] RSPM (R 4.0.5)
P vctrs 0.6.5 2023-12-01 [?] RSPM (R 4.0.5)
[1] xxxxxxx /renv/library/R-4.0/x86_64-pc-linux-gnu
[2] /opt/R/4.0.2/lib/R/library
P ── Loaded and on-disk path mismatch.
I don't understand the code and the test:
duckdb-r/R/backend-dbplyr__duckdb_connection.R
Lines 370 to 379 in aef914b
duckdb-r/tests/testthat/test_tbl__duckdb_connection.R
Lines 30 to 43 in aef914b
I have the following questions:
1. The PRAGMA runs with cache = TRUE; is this desired? Should we also run it with cache = FALSE?
2. Does the PRAGMA have a permanent side effect on the connection (yes, according to the test)? Should we reset it afterwards?
3. ident_q() is discouraged and also has no effect here because we're using NextMethod(). What's the intention here, do we need it?
4. Pragmas passed via tbl() seem to be broken via tidyverse/dbplyr#1384; should we perhaps offer a different, more robust API? We could implement a dedicated function that documents all available pragmas and perhaps even offers them as autocomplete.

In my scripts where I connect to a duckdb database, I often get the warning "Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this."
When I use duckdb in combination with a Shiny app, or need to work with many different databases sequentially, this clutters the console because the warning is repeated n times.
con <- DBI::dbConnect(duckdb::duckdb(), "my-db.db")
on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
# do something with the database
# eventually I get
#> Warning: Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this.
The responsible line is src/database.cpp#L12.
Would it be possible to have an option to silence the warning? Maybe something like options(duckdb.silence.disconnect_warning = TRUE)?
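Until such an option exists, one possible workaround is a calling handler that muffles only this specific warning and lets everything else through. This is my own sketch, not duckdb API; the matched message text is copied from the warning above:

```r
# Sketch: muffle only duckdb's garbage-collection warning; all other
# warnings still propagate normally.
muffle_duckdb_gc <- function(w) {
  if (grepl("Database is garbage-collected", conditionMessage(w))) {
    invokeRestart("muffleWarning")
  }
}

# Wrap the code that may trigger the warning:
withCallingHandlers(
  {
    # ... work with duckdb connections here ...
    gc()
  },
  warning = muffle_duckdb_gc
)

# For app-wide use (e.g. inside a Shiny app), the handler could instead be
# registered once at top level (R >= 4.0):
#   globalCallingHandlers(warning = muffle_duckdb_gc)
```

Note that a local withCallingHandlers() only helps if the garbage collection happens inside the wrapped expression; for warnings raised by a later gc(), the global variant would be needed.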
Hi, I'm still hitting this error on both the CRAN version and the duckdb_0.8.1-9000 dev version.
# using github
remotes::install_github("duckdb/duckdb-r")
# minimal reproducible example
library(duckdb)
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
dbWriteTable( con , 'my_table' , my_df )
console output:
> library(duckdb)
Loading required package: DBI
> con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
> my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
> dbWriteTable( con , 'my_table' , my_df )
Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update
In addition: Warning message:
Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this.
>
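A possible workaround (my suggestion, not an official fix): the strings here are latin1-encoded bytes, and duckdb expects valid UTF-8, so converting the character columns before writing should avoid the byte-sequence error:

```r
# Sketch: convert latin1-encoded character columns to UTF-8 before
# handing the data frame to duckdb.
to_utf8 <- function(x) {
  if (is.character(x)) iconv(x, from = "latin1", to = "UTF-8") else x
}

my_df <- structure(
  list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"),
  row.names = 16L, class = "data.frame"
)
my_df[] <- lapply(my_df, to_utf8)
# my_df$no_municipio_esc is now valid UTF-8, so
# dbWriteTable(con, "my_table", my_df) should no longer trip the
# unicode validation.
```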
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8 LC_NUMERIC=C LC_TIME=English_United States.utf8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] duckdb_0.8.1-9000 DBI_1.1.3
loaded via a namespace (and not attached):
[1] compiler_4.3.1 tools_4.3.1
>
Hi,
I have a problem with duckdb in R:
library(duckdb)
con <- dbConnect(duckdb::duckdb())
dbExecute(con, "install 'httpfs';")
dbExecute(con, "load 'httpfs';")
dbExecute(con, "install 'httpfs';") succeeds,
but dbExecute(con, "load 'httpfs';") fails:
Error: rapi_execute: Failed to run query
Error: IO Error: Extension "C:\Users.....\AppData\Roaming\R\data\R\duckdb.duckdb\extensions\2414840843\windows_amd64_rtools\httpfs.duckdb_extension" could not be loaded: Le module spécifié est introuvable. ("The specified module could not be found.")
Regards
Hi,
Is it possible to save (or register as a view?) the result of a dplyr query that uses a duckdb table as its source?
In this basic example, I have to collect or compute before saving to a new table.
Is it possible to bypass this?
(My queries are often complex, so rewriting them in SQL is not a realistic option for me.)
duck <- dbConnect(duckdb())
duck |>
tbl(sql("select * from nested_table")) |>
pivot_wider(names_from = key, values_from = c(str, int)) |>
to_arrow() |>
compute() |>
to_duckdb(duck, "unested_table", auto_disconnect = FALSE)
Thank you.
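For what it's worth, dbplyr's compute() can materialize a lazy query as a table on the DuckDB side without round-tripping through R or Arrow. A sketch under the assumption that the whole pipeline stays translatable to SQL (dbplyr translates pivot_wider() by first querying the distinct keys); the toy table and names are illustrative:

```r
library(dplyr)

duck <- DBI::dbConnect(duckdb::duckdb())

# Toy stand-in for the real nested_table (illustrative data).
DBI::dbWriteTable(duck, "nested_table",
  data.frame(id = c(1, 1), key = c("a", "b"),
             str = c("x", "y"), int = c(10L, 20L)))

# Sketch: materialize the lazy result server-side instead of collecting.
# `temporary = FALSE` persists it as a real table named "unnested_table".
duck |>
  tbl("nested_table") |>
  tidyr::pivot_wider(names_from = key, values_from = c(str, int)) |>
  compute(name = "unnested_table", temporary = FALSE)

DBI::dbGetQuery(duck, "SELECT * FROM unnested_table")
DBI::dbDisconnect(duck, shutdown = TRUE)
```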
When the connection is created with disk-based storage, the following gives the infamous garbage-collection warning (#34, #58) when gc()
is called, despite the connection already having been correctly shut down as directed:
dir <- tempfile()
con <- DBI::dbConnect(duckdb::duckdb(), dir)
DBI::dbDisconnect(con, shutdown=TRUE)
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 732512 39.2 1403478 75 1403478 75.0
# Vcells 1236192 9.5 8388608 64 2328637 17.8
# Warning message:
# Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this.
Without the dbdir argument, we do not get a warning:
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbDisconnect(con, shutdown=TRUE)
gc()
Any advice on suppressing this warning? (I have a few packages that wrap duckdb and users are frequently confused by seeing this warning, mistaking it for an error).
The sql
function used internally would be useful for performing processing via DuckDB (e.g., reading Parquet files).
Lines 1 to 15 in d243b53
Would you consider exporting this with a name like duckdb_query
?
This is related to #34, but while that issue asks about suppressing the warning in general, this issue is about why we see the warning on instantiation at all. (Per #34 (comment) this might be somewhat of a dupe, but I see a difference between suppressing the warning in general and fixing why we see it here in the first place.)
(fn <- tempfile(fileext = ".duckdb"))
# [1] "/tmp/RtmpxVwd7K/file1a8c4190c110f.duckdb"
gc() ; con <- DBI::dbConnect(duckdb::duckdb(), dbdir=fn) ; gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 359465 19.2 664345 35.5 664345 35.5
# Vcells 698733 5.4 8388608 64.0 1815226 13.9
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 538607 28.8 1206704 64.5 664345 35.5
# Vcells 1037401 8.0 8388608 64.0 1815226 13.9
# Warning message:
# Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this.
I cannot find a way to instantiate the connection without this warning. Yes, it's just a warning, but ... I don't know how to prevent users of my packages/code from seeing the warning (suppressWarnings(..)
doesn't work here).
sessioninfo::session_info()
# ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# setting value
# version R version 4.3.2 (2023-10-31)
# os Ubuntu 23.10
# system x86_64, linux-gnu
# ui X11
# language (EN)
# collate C.UTF-8
# ctype C.UTF-8
# tz America/New_York
# date 2023-12-20
# pandoc 2.17.1.1 @ /usr/bin/pandoc
# ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
# package * version date (UTC) lib source
# cli 3.6.1 2023-03-23 [1] RSPM (R 4.3.0)
# DBI 1.1.3 2022-06-18 [1] RSPM (R 4.3.0)
# duckdb 0.9.2-1 2023-11-28 [1] RSPM (R 4.3.2)
# sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.3.0)
# [1] /home/r2/R/x86_64-pc-linux-gnu-library/4.3
# [2] /opt/R/4.3.2/lib/R/library
# ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Hi,
Thanks for all the great work on maintaining this package. I can tell from the work being done on 0.10.0 that it requires considerable effort, and it's greatly appreciated.
"Out-of-tree-extensions" appear to be missing on Windows as reported in other issues (e.g. #23). I made a quick script to check which extensions are not available based on duckdb_extensions()
and came up with the following list for version 0.10.0 and 0.9.2. As pointed out by @Mause, most of these are "Out-of-tree-extensions".
Version 0.10.0:

| Extension | Linux amd64 | Linux amd64_gcc4 | Linux arm64 | OSX amd64 | OSX arm64 | Windows amd64 | Windows amd64_rtools |
|---|---|---|---|---|---|---|---|
| arrow | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| autocomplete | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| aws | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| azure | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| excel | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| fts | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| httpfs | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| iceberg | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| icu | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| inet | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| json | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| motherduck | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| mysql_scanner | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| parquet | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| postgres_scanner | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| spatial | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| sqlite_scanner | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| substrait | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| tpcds | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| tpch | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

Version 0.9.2:

| Extension | Linux amd64 | Linux amd64_gcc4 | Linux arm64 | OSX amd64 | OSX arm64 | Windows amd64 | Windows amd64_rtools |
|---|---|---|---|---|---|---|---|
| arrow | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| autocomplete | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| aws | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| azure | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| excel | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| fts | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| httpfs | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| iceberg | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| icu | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| inet | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| json | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| motherduck | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| mysql_scanner | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| parquet | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| postgres_scanner | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| spatial | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| sqlite_scanner | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| substrait | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| tpcds | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| tpch | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
RAM use reaches 32 GB when writing a 16 GB file, and also momentarily reaches 32 GB when reading the same 16 GB file. From #72 (comment), by @SimonCoulombe.
library(DBI)
library(dplyr)
library(dbplyr)
library(duckdb)
duckdb_path <- "/devroot/sandbox/tmp/duckdb.duckdb"
con <- dbConnect(duckdb::duckdb(dbdir = duckdb_path))
dbExecute(con, "PRAGMA threads=1; PRAGMA memory_limit='1GB';")
# run this once to create the duckdb file then restart session:
if (FALSE){
bigdata <- data.table::rbindlist(rlang::rep_along(1:3e6, list(iris)))
dim(bigdata) # 450M rows, 5 columns
lobstr::obj_size(bigdata) # 16.20 GB in RAM
dbWriteTable(con, "straight_from_memory", bigdata)
}
bigdata <- tbl(con, "straight_from_memory") %>% collect()
I am trying to install the duckdb
R package. Every time I try, it takes an hour and then fails. I have tried installing from CRAN, from R-universe, and directly from the source tar.gz file; none of them works. The truncated installation output is below (the full output exceeds GitHub's issue character limit of 65,536 characters).
Interestingly, devtools::install_github("duckdb/duckdb-r")
works perfectly, but it doesn't come with any extensions. How can I get it to work from CRAN?
> install.packages("duckdb", repos = c("https://duckdb.r-universe.dev", "https://cloud.r-project.org"))
Installing package into 'C:/Documents/R/win-library/4.1'
(as 'lib' is unspecified)
There is a binary version available but the source version is later:
binary source needs_compilation
duckdb 0.7.1-1 0.9.2-1 TRUE
installing the source package 'duckdb'
trying URL 'https://duckdb.r-universe.dev/src/contrib/duckdb_0.9.2-1.tar.gz'
Content type 'application/x-gzip' length 3962978 bytes (3.8 MB)
downloaded 3.8 MB
* installing *source* package 'duckdb' ...
** using staged installation
** libs
Warning: this package has a non-empty 'configure.win' file,
so building only the main architecture
/mingw64/bin/g++ -std=gnu++17 -I"C:/DOCUME~1/R/R-41~1.3/include" -DNDEBUG -Iinclude -I../inst/include -DDUCKDB_DISABLE_PRINT -DDUCKDB_R_BUILD -Iduckdb/src/include -Iduckdb/third_party/fmt/include -Iduckdb/third_party/fsst -Iduckdb/third_party/re2 -Iduckdb/third_party/miniz -Iduckdb/third_party/utf8proc/include -Iduckdb/third_party/utf8proc -Iduckdb/third_party/hyperloglog -Iduckdb/third_party/skiplist -Iduckdb/third_party/fastpforlib -Iduckdb/third_party/tdigest -Iduckdb/third_party/libpg_query/include -Iduckdb/third_party/libpg_query -Iduckdb/third_party/concurrentqueue -Iduckdb/third_party/pcg -Iduckdb/third_party/httplib -Iduckdb/third_party/fast_float -Iduckdb/third_party/mbedtls -Iduckdb/third_party/mbedtls/include -Iduckdb/third_party/mbedtls/library -Iduckdb/third_party/jaro_winkler -Iduckdb/third_party/jaro_winkler/details -Iduckdb/extension/parquet/include -Iduckdb/third_party/parquet -Iduckdb/third_party/thrift -Iduckdb/third_party/snappy -Iduckdb/third_party/zstd/include -Iduckdb/third_party/mbedtls -Iduckdb/third_party/mbedtls/include -I../inst/include -Iduckdb -DDUCKDB_EXTENSION_PARQUET_LINKED -DDUCKDB_BUILD_LIBRARY -DDUCKDB_PLATFORM_RTOOLS=1 -O2 -Wall -mfpmath=sse -msse2 -mstackrealign -c duckdb/ub_src_catalog.cpp -o duckdb/ub_src_catalog.o
. . .
/mingw64/bin/g++ -std=gnu++17 -I"C:/DOCUME~1/R/R-41~1.3/include" -DNDEBUG -Iinclude -I../inst/include -DDUCKDB_DISABLE_PRINT -DDUCKDB_R_BUILD -Iduckdb/src/include -Iduckdb/third_party/fmt/include -Iduckdb/third_party/fsst -Iduckdb/third_party/re2 -Iduckdb/third_party/miniz -Iduckdb/third_party/utf8proc/include -Iduckdb/third_party/utf8proc -Iduckdb/third_party/hyperloglog -Iduckdb/third_party/skiplist -Iduckdb/third_party/fastpforlib -Iduckdb/third_party/tdigest -Iduckdb/third_party/libpg_query/include -Iduckdb/third_party/libpg_query -Iduckdb/third_party/concurrentqueue -Iduckdb/third_party/pcg -Iduckdb/third_party/httplib -Iduckdb/third_party/fast_float -Iduckdb/third_party/mbedtls -Iduckdb/third_party/mbedtls/include -Iduckdb/third_party/mbedtls/library -Iduckdb/third_party/jaro_winkler -Iduckdb/third_party/jaro_winkler/details -Iduckdb/extension/parquet/include -Iduckdb/third_party/parquet -Iduckdb/third_party/thrift -Iduckdb/third_party/snappy -Iduckdb/third_party/zstd/include -Iduckdb/third_party/mbedtls -Iduckdb/third_party/mbedtls/include -I../inst/include -Iduckdb -DDUCKDB_EXTENSION_PARQUET_LINKED -DDUCKDB_BUILD_LIBRARY -DDUCKDB_PLATFORM_RTOOLS=1 -O2 -Wall -mfpmath=sse -msse2 -mstackrealign -c cpp11.cpp -o cpp11.o
Error in system(sprintf("%s -Pg %s", nm, shQuote(f)), intern = TRUE) :
'""' not found
Calls: <Anonymous> -> lapply -> FUN -> strsplit -> system
Execution halted
make: *** [C:/DOCUME~1/R/R-41~1.3/share/make/winshlib.mk:35: symbols.rds] Error 1
ERROR: compilation failed for package 'duckdb'
* removing 'C:/Documents/R/win-library/4.1/duckdb'
* restoring previous 'C:/Documents/R/win-library/4.1/duckdb'
Warning in install.packages :
installation of package 'duckdb' had non-zero exit status
Here is my sessionInfo()
> sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1 readr_2.1.4 tidyr_1.3.0
[8] tibble_3.2.1 ggplot2_3.4.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] rstudioapi_0.14 magrittr_2.0.3 hms_1.1.3 tidyselect_1.2.0 munsell_0.5.0 timechange_0.2.0 colorspace_2.1-1
[8] R6_2.5.1 rlang_1.1.0 fansi_0.5.0 tools_4.1.3 pak_0.6.0 grid_4.1.3 gtable_0.3.1
[15] utf8_1.2.3 cli_3.6.0 withr_2.5.0 lifecycle_1.0.3 processx_3.8.0 tzdb_0.3.0 callr_3.7.3
[22] vctrs_0.6.4 ps_1.7.2 glue_1.6.2 stringi_1.7.5 compiler_4.1.3 pillar_1.9.0 generics_0.1.3
[29] scales_1.2.1 pkgconfig_2.0.3