duckdb-r's Introduction

DuckDB

DuckDB is a high-performance analytical database system. It is designed to be fast, reliable, portable, and easy to use. DuckDB provides a rich SQL dialect, with support far beyond basic SQL. DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (arrays, structs), and more. For more information on using DuckDB, please refer to the DuckDB documentation.
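
A quick illustration from R (a sketch of my own, not part of the README; it assumes only the duckdb package is installed):

library(duckdb)
con <- dbConnect(duckdb())
# A window function computing a running sum over a generated range:
dbGetQuery(con, "SELECT x, sum(x) OVER (ORDER BY x) AS running_sum FROM range(3) t(x)")
dbDisconnect(con, shutdown = TRUE)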

Installation

If you want to install and use DuckDB, please see our website for installation and usage instructions.

Data Import

For CSV files and Parquet files, data import is as simple as referencing the file in the FROM clause:

SELECT * FROM 'myfile.csv';
SELECT * FROM 'myfile.parquet';

Refer to our Data Import section for more information.

SQL Reference

The website contains a reference of functions and SQL constructs available in DuckDB.

Development

Development requires CMake, Python 3, and a C++11-compliant compiler. Run make in the root directory to compile the sources. For development, use make debug to build a non-optimized debug version. After making changes, run make unit and make allunit to verify that your version works properly. To test performance, run BUILD_BENCHMARK=1 BUILD_TPCH=1 make and then execute several standard benchmarks from the root directory with ./build/release/benchmark/benchmark_runner. The details of the benchmarks are in our Benchmark Guide.

Please also refer to our Build Guide and Contribution Guide.

Support

See the Support Options page.

duckdb-r's People

Contributors

eitsupi, hannes, jangorecki, krlmlr, lnkuiper, lschneiderbauer, m-muecke, maelle, mytherin, olivroy, romainfrancois, szarnyasg, tau31, tmonster

duckdb-r's Issues

duckdb R client uses excessive RAM and crashes

What happens?

duckdb crashes when trying to perform queries on a large (~150 GB) parquet database

To Reproduce

Apologies, I cannot reproduce this with a smaller dataset. I can reproduce the same errors using remote S3 access, though, which might be a bit faster than downloading a full copy of the dataset; alternatively, you could probably download a copy from my server in a few days. Reproducible examples of both remote and local access are below. (The remote example goes via arrow because I cannot yet get direct access to S3 buckets via duckdb alone.)

## remote access via arrow
library(arrow)
library(dplyr)
library(duckdb)
path <- arrow::s3_bucket("ebird/observations", endpoint_override = "minio.carlboettiger.info", anonymous=TRUE)
obs <- arrow::open_dataset(path) |> to_duckdb()
tmp <- obs |> 
  group_by(sampling_event_identifier, scientific_name) |>
  summarize(count = sum(observation_count, na.rm=TRUE),
            .groups = "drop") 
tmp <- tmp |> compute() # crashes

Or, after downloading from the above public S3 bucket (https://minio.carlboettiger.info/ebird/Mar-2022/observations), try local parquet access:

## pure local duckdb crashes on simple operations:
library(duckdb)
library(dplyr)
parquet <- file.path("/home/shared-data/ebird/observations/*.parquet")
conn <- dbConnect(duckdb(), "/home/shared-data/ebird/db")
dbExecute(conn, "PRAGMA memory_limit='12GB'")
view_query <- paste0("CREATE VIEW 'observations' AS SELECT * FROM parquet_scan('",
                     parquet, "');")
DBI::dbSendQuery(conn, view_query)
obs <- tbl(conn, "observations")

tmp <- obs |> 
  group_by(sampling_event_identifier, scientific_name) |>
  summarize(count = sum(observation_count, na.rm=TRUE),
            .groups = "drop") 
tmp <- tmp |> compute()

With the memory limit PRAGMA in place, the example doesn't crash the R session but does throw an OOM error:

Error: duckdb_execute_R: Failed to run query
Error: Out of Memory Error: could not allocate block of 262144 bytes

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • DuckDB Version: 0.3.4 (latest)
  • DuckDB Client: R

Before Submitting

  • Have you tried this on the latest master branch?
  • Python: pip install duckdb --upgrade --pre
  • R: install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)
  • Other Platforms: You can find binaries here or compile from source.

By the way, install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL) no longer works, as the master-builds download no longer seems to exist.

  • Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

MAP type is not working in R

> con <- dbConnect(duckdb::duckdb())
> dbGetQuery(con, "SELECT map([1,2],['a','b']) AS x;")
Error: rapi_prepare: Unknown column type for prepare: MAP(INTEGER, VARCHAR)
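
A possible workaround until MAP is supported (an assumption on my part, not an official fix): convert the map to a supported type in SQL before it reaches R.

con <- dbConnect(duckdb::duckdb())
# CAST to VARCHAR so the result arrives as a plain string column:
dbGetQuery(con, "SELECT CAST(map([1,2],['a','b']) AS VARCHAR) AS x;")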

Change in behaviour with date manipulation

@krlmlr, looking into the test failure you mentioned in darwin-eu-dev/PatientProfiles#479, I think this has been caused by a slight change in behaviour in date manipulation (see below, where the current CRAN version returns a date but the new version returns a date-time).

library(duckdb)
#> Warning: package 'duckdb' was built under R version 4.2.3
#> Loading required package: DBI
#> Warning: package 'DBI' was built under R version 4.2.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.2.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(DBI)
library(CDMConnector)
#> Warning: package 'CDMConnector' was built under R version 4.2.3

con <- DBI::dbConnect(duckdb(), path = ":memory:")

packageVersion("duckdb")
#> [1] '0.9.2.1'

test_data <- data.frame(person = 1L,
                        date_1 = as.Date("2001-01-01"))

db_test_data <- copy_to(con, test_data, overwrite = TRUE)
db_test_data <- db_test_data  %>%
  dplyr::mutate(date_2 = date_1 + years(1)) 
db_test_data
#> # Source:   SQL [1 x 3]
#> # Database: DuckDB v0.9.2 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#>   person date_1     date_2    
#>    <int> <date>     <date>    
#> 1      1 2001-01-01 2002-01-01
class(db_test_data %>% 
        pull("date_2"))
#> [1] "Date"

Created on 2024-03-12 with reprex v2.0.2

library(duckdb)
#> Warning: package 'duckdb' was built under R version 4.2.3
#> Loading required package: DBI
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.2.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(DBI)
library(CDMConnector)
#> Warning: package 'CDMConnector' was built under R version 4.2.3

con <- DBI::dbConnect(duckdb(), path = ":memory:")

packageVersion("duckdb")
#> [1] '0.10.0'

test_data <- data.frame(person = 1L,
                        date_1 = as.Date("2001-01-01"))

db_test_data <- copy_to(con, test_data, overwrite = TRUE)
db_test_data <- db_test_data  %>%
  dplyr::mutate(date_2 = date_1 + years(1)) 
db_test_data
#> # Source:   SQL [1 x 3]
#> # Database: DuckDB v0.10.0 [eburn@Windows 10 x64:R 4.2.1/:memory:]
#>   person date_1     date_2             
#>    <int> <date>     <dttm>             
#> 1      1 2001-01-01 2002-01-01 00:00:00
class(db_test_data %>% 
        pull("date_2"))
#> [1] "POSIXct" "POSIXt"

Created on 2024-03-12 by the reprex package (v2.0.1)
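
If downstream code needs a Date on both versions, one workaround sketch (a coercion in the pipeline, not a fix in duckdb itself) is to cast explicitly:

db_test_data %>%
  dplyr::mutate(date_2 = as.Date(date_1 + years(1)))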

garbage-collection warning despite using shutdown -- only when dbdir is specified

When the connection is created with disk-based storage, the following gives the infamous garbage-collection warning (#34, #58) when gc() is called, despite the connection already having been correctly shut down as directed:

dir <- tempfile()
con <- DBI::dbConnect(duckdb::duckdb(), dir)
DBI::dbDisconnect(con, shutdown=TRUE)
gc()

#          used (Mb) gc trigger (Mb) max used (Mb)
# Ncells  732512 39.2    1403478   75  1403478 75.0
# Vcells 1236192  9.5    8388608   64  2328637 17.8
# Warning message:
# Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 

Without the dbdir argument, we do not get a warning:

con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbDisconnect(con, shutdown=TRUE)
gc()

Any advice on suppressing this warning? (I have a few packages that wrap duckdb and users are frequently confused by seeing this warning, mistaking it for an error).

Array returns "unknown column type" in 0.9.99.9000 when getting the results of a query

After installing 0.9.99.9000, array references (such as array_value(1,2,3)) return "unknown column type" to dbGetQuery(). For example:

con <- dbConnect(duckdb())
dbGetQuery(con,"select array_value(1,2,3)")

returns:

Error: rapi_prepare: Unknown column type for prepare: INTEGER[3]

Whereas a list reference:

dbGetQuery(con,"select list_value(1,2,3)")

displays the list:

  list_value(1, 2, 3)
1             1, 2, 3

Array creation in general appears to work within duckdb 0.9.99.9000, however.

Cannot install extensions postgres_scanner or spatial in MS Windows

Hello,

I cannot install some extensions in R 4.3.1 on Windows.
I use duckdb_0.8.1-3 with DBI_1.1.3. I also tested v0.9.0, but in that version even more extensions failed to install.

In v0.8.1 the extensions icu, json, httpfs and fts work fine, but postgres_scanner and spatial fail to install. It looks like they don't exist in the repository for the windows_amd64_rtools platform. I tried to install them manually from the repository for the windows_amd64 platform (without the _rtools suffix), but they made the R session crash in RStudio when I loaded them.

I tested the same thing on Linux amd64 and arm64, and all extensions work fine in versions 0.8.1 and 0.9.0.

> dbExecute(con,"install postgres_scanner;")
Error: rapi_execute: Failed to run query
Error: HTTP Error: Failed to download extension "postgres_scanner" at URL "http://extensions.duckdb.org/v0.8.1/windows_amd64_rtools/postgres_scanner.duckdb_extension.gz"
Extension "postgres_scanner" is an existing extension.

Are you using a development build? In this case, extensions might not (yet) be uploaded.
> dbExecute(con,"load postgres_scanner;")
Error: rapi_execute: Failed to run query
Error: HTTP Error: Failed to download extension "postgres_scanner" at URL "http://extensions.duckdb.org/v0.8.1/windows_amd64_rtools/postgres_scanner.duckdb_extension.gz"
Extension "postgres_scanner" is an existing extension.

Are you using a development build? In this case, extensions might not (yet) be uploaded.
> dbGetQuery(con,"from duckdb_extensions();") |> as_tibble() |> select(extension_name,loaded,installed)
# A tibble: 14 × 3
   extension_name   loaded installed
   <chr>            <lgl>  <lgl>    
 1 autocomplete     FALSE  FALSE    
 2 fts              TRUE   TRUE     
 3 httpfs           FALSE  TRUE     
 4 icu              TRUE   TRUE     
 5 inet             FALSE  FALSE    
 6 jemalloc         FALSE  FALSE    
 7 json             TRUE   TRUE     
 8 motherduck       FALSE  FALSE    
 9 parquet          TRUE   TRUE     
10 postgres_scanner FALSE  FALSE    
11 spatial          FALSE  FALSE    
12 sqlite_scanner   FALSE  FALSE    
13 tpcds            FALSE  FALSE    
14 tpch             FALSE  FALSE    
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8   
[3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
[5] LC_TIME=Spanish_Spain.utf8    

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.3     purrr_1.0.2    
 [6] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.3   tidyverse_2.0.0
[11] duckdb_0.8.1-3  DBI_1.1.3      

loaded via a namespace (and not attached):
 [1] tcltk_4.3.1          writexl_1.4.2        rlang_1.1.1          magrittr_2.0.3      
 [5] shinydashboard_0.7.2 snakecase_0.11.1     compiler_4.3.1       vctrs_0.6.3         
 [9] pkgconfig_2.0.3      shape_1.4.6          fastmap_1.1.1        backports_1.4.1     
[13] ellipsis_0.3.2       utf8_1.2.3           shinyjqui_0.4.1      promises_1.2.1      
[17] rmarkdown_2.25       tzdb_0.4.0           nloptr_2.0.3         missMDA_1.18        
[21] bit_4.0.5            xfun_0.40            glmnet_4.1-8         jomo_2.7-6          
[25] jsonlite_1.8.7       flashClust_1.01-2    later_1.3.1          pan_1.9             
[29] broom_1.0.5          parallel_4.3.1       cluster_2.1.4        R6_2.5.1            
[33] FactoInvestigate_1.8 stringi_1.7.12       boot_1.3-28.1        rpart_4.1.19        
[37] estimability_1.4.1   Rcpp_1.0.11          assertthat_0.2.1     iterators_1.0.14    
[41] knitr_1.44           base64enc_0.1-3      httpuv_1.6.11        Matrix_1.6-1.1      
[45] splines_4.3.1        nnet_7.3-19          timechange_0.2.0     tidyselect_1.2.0    
[49] rstudioapi_0.15.0    yaml_2.3.7           miniUI_0.1.1.1       doParallel_1.0.17   
[53] codetools_0.2-19     lattice_0.21-8       withr_2.5.0          shiny_1.7.5         
[57] evaluate_0.21        survival_3.5-7       pillar_1.9.0         mice_3.16.0         
[61] corrplot_0.92        DT_0.29              foreach_1.5.2        generics_0.1.3      
[65] hms_1.1.3            munsell_0.5.0        scales_1.2.1         minqa_1.2.6         
[69] xtable_1.8-4         leaps_3.1            glue_1.6.2           janitor_2.2.0       
[73] emmeans_1.8.8        scatterplot3d_0.3-44 tools_4.3.1          lme4_1.1-34         
[77] colourpicker_1.3.0   mvtnorm_1.2-3        grid_4.3.1           colorspace_2.1-0    
[81] nlme_3.1-163         repr_1.1.6           cli_3.6.1            fansi_1.0.4         
[85] arrow_13.0.0.1       gtable_0.3.4         digest_0.6.33        ggrepel_0.9.3       
[89] FactoMineR_2.8       htmlwidgets_1.6.2    skimr_2.1.5          htmltools_0.5.6     
[93] Factoshiny_2.4       lifecycle_1.0.3      multcompView_0.1-9   mitml_0.4-5         
[97] mime_0.12            bit64_4.0.5          MASS_7.3-60         
> list.files("xxx\\AppData\\Roaming\\R\\data\\R\\duckdb\\.duckdb\\extensions\\v0.8.1\\windows_amd64_rtools\\")
[1] "fts.duckdb_extension"    "httpfs.duckdb_extension" "icu.duckdb_extension"   
[4] "json.duckdb_extension" 

Thank you

"Out-of-tree-extensions" missing for Windows

Hi,
Thanks for all the great work on maintaining this package. I can tell from the work being done on 0.10.0 that it requires considerable effort, and it's greatly appreciated.

"Out-of-tree" extensions appear to be missing on Windows, as reported in other issues (e.g. #23). I made a quick script to check which extensions are not available based on duckdb_extensions() and came up with the following lists for versions 0.10.0 and 0.9.2. As pointed out by @Mause, most of these are out-of-tree extensions.

0.10.0

[Availability table: rows are the extensions arrow, autocomplete, aws, azure, excel, fts, httpfs, iceberg, icu, inet, json, motherduck, mysql_scanner, parquet, postgres_scanner, spatial, sqlite_scanner, substrait, tpcds, and tpch; columns are Linux (amd64, amd64_gcc4, arm64), OSX (amd64, arm64), and Windows (amd64, amd64_rtools). The per-platform availability marks did not survive extraction.]

0.9.2

[The corresponding availability table for 0.9.2 covers the same extensions and platforms; its availability marks likewise did not survive extraction.]

Backslash doesn't escape metacharacters in duckdb for R

I haven't tried all the metacharacters, but the following commands in the R version of duckdb:

dbGetQuery(con,"select regexp_matches('blah','\?')")
dbGetQuery(con,"select regexp_matches('blah','\.')")
dbGetQuery(con,"select regexp_matches('blah','\*')")
dbGetQuery(con,"select regexp_matches('blah','\+')")

all return an error of the form:

Error: '\?' is an unrecognized escape in character string (<input>:1:48)

This is despite all of these commands working in the CLI version. Am I doing something wrong?

Thanks.
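
For what it's worth, my reading (an assumption, not a confirmed diagnosis) is that this is R string-literal escaping rather than a duckdb bug: the error comes from the R parser, and a literal backslash must itself be escaped in R source, so the regex \? is written "\\?".

library(duckdb)
con <- dbConnect(duckdb::duckdb())
# "\\?" in R source yields the two characters \? that the regex engine sees:
dbGetQuery(con, "select regexp_matches('blah','\\?')")
dbGetQuery(con, "select regexp_matches('blah','\\.')")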

Construction of deep relational trees

This is a toy example, but relevant for some CRAN packages with the default setting of max_expression_depth. The symptoms are the same as when evaluating rel7.

Ideally, we would already see an error when constructing rel5. However, the system lets me construct rel5 and even rel6; only the construction of rel7 fails, with the same error as the evaluation of rel5. Is this an off-by-two error, or something more serious?

duckplyr can fall back to dplyr if the error happens at construction, but not at evaluation -- by then it is too late. An error on construction of rel5, or perhaps even rel4, would fix the downstream problem. How can this be achieved?

duckdb <- asNamespace("duckdb")
con <- DBI::dbConnect(duckdb::duckdb())
experimental <- FALSE
df1 <- tibble::tibble(id = 1L)

DBI::dbExecute(con, "SET max_expression_depth TO 5")
#> [1] 0


rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_project(
  rel1,
  list({
    tmp_expr <- duckdb$expr_reference("id")
    duckdb$expr_set_alias(tmp_expr, "id")
    tmp_expr
  })
)
rel3 <- duckdb$rel_project(
  rel2,
  list({
    tmp_expr <- duckdb$expr_reference("id")
    duckdb$expr_set_alias(tmp_expr, "id")
    tmp_expr
  })
)
rel4 <- duckdb$rel_project(
  rel3,
  list({
    tmp_expr <- duckdb$expr_reference("id")
    duckdb$expr_set_alias(tmp_expr, "id")
    tmp_expr
  })
)
rel4
#> DuckDB Relation: 
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [id as id]
#>   Projection [id as id]
#>     Projection [id as id]
#>       r_dataframe_scan(0x11cca4278)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - id (INTEGER)
rel5 <- duckdb$rel_project(
  rel4,
  list({
    tmp_expr <- duckdb$expr_reference("id")
    duckdb$expr_set_alias(tmp_expr, "id")
    tmp_expr
  })
)
rel5
#> DuckDB Relation: 
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [id as id]
#>   Projection [id as id]
#>     Projection [id as id]
#>       Projection [id as id]
#>         r_dataframe_scan(0x11cca4278)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - id (INTEGER)
rel6 <- duckdb$rel_project(
  rel5,
  list({
    tmp_expr <- duckdb$expr_reference("id")
    duckdb$expr_set_alias(tmp_expr, "id")
    tmp_expr
  })
)
rel6
#> DuckDB Relation: 
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Projection [id as id]
#>   Projection [id as id]
#>     Projection [id as id]
#>       Projection [id as id]
#>         Projection [id as id]
#>           r_dataframe_scan(0x11cca4278)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - id (INTEGER)
rel7 <- duckdb$rel_project(
  rel6,
  list({
    tmp_expr <- duckdb$expr_reference("id")
    duckdb$expr_set_alias(tmp_expr, "id")
    tmp_expr
  })
)
#> Error: {"exception_type":"Binder","exception_message":"Max expression depth limit of 5 exceeded. Use \"SET max_expression_depth TO x\" to increase the maximum expression depth."}
rel7
#> Error in eval(expr, envir, enclos): object 'rel7' not found
duckdb$rel_to_altrep(rel6)
#> Error: Error evaluating duckdb query: Parser Error: Maximum tree depth of 5 exceeded in logical planner
duckdb$rel_to_altrep(rel5)
#> Error: Error evaluating duckdb query: Parser Error: Maximum tree depth of 5 exceeded in logical planner
duckdb$rel_to_altrep(rel4)
#>   id
#> 1  1

Created on 2024-03-10 with reprex v2.1.0

expose ON CONFLICT for Append and Write

I was wondering if it would be possible to expose the "INSERT OR IGNORE/REPLACE" functionality in the append and write functions, so that conflicts are managed at a row level.

Thanks
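
In the meantime, a workaround sketch (duckdb_register() and DuckDB's INSERT OR IGNORE are existing features; the table definition and names here are made up for illustration):

con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "CREATE TABLE tbl (id INTEGER PRIMARY KEY, val VARCHAR)")
# Register the new rows as a view and let SQL resolve conflicts row-wise:
duckdb::duckdb_register(con, "new_rows", data.frame(id = 1L, val = "a"))
DBI::dbExecute(con, "INSERT OR IGNORE INTO tbl SELECT * FROM new_rows")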

Expose the internal default connection and `sql` function?

The sql() function used internally would be useful for performing processing via DuckDB (e.g., reading Parquet files).

duckdb-r/R/sql.R

Lines 1 to 15 in d243b53

#' Run a SQL query
#'
#' `sql()` runs an arbitrary SQL query and returns a data.frame with the query results
#'
#' @param sql A SQL string
#' @param conn An optional connection, defaults to the built-in default
#' @return A data frame with the query result
#' @noRd
#' @examples
#' print(duckdb::sql("SELECT 42"))
sql <- function(sql, conn = default_connection()) {
  stopifnot(dbIsValid(conn))
  dbGetQuery(conn, sql)
}

Would you consider exporting this with a name like duckdb_query?

dbWriteTable's arguments order

The DBI document says:

The following arguments are not part of the dbWriteTable() generic (to improve compatibility across backends) but are part of the DBI specification:

  • row.names (default: FALSE)
  • overwrite (default: FALSE)
  • append (default: FALSE)
  • field.types (default: NULL)
  • temporary (default: FALSE)

They must be provided as named arguments. See the "Specification" and "Value" sections for details on their usage.

But currently dbWriteTable__duckdb_connection_character_data.frame allows these options to be specified without names.
I think we need to change the position of the dots (...); see the sketch after the current signature below.

dbWriteTable__duckdb_connection_character_data.frame <- function(conn,
                                                                 name,
                                                                 value,
                                                                 row.names = FALSE,
                                                                 overwrite = FALSE,
                                                                 append = FALSE,
                                                                 field.types = NULL,
                                                                 temporary = FALSE,
                                                                 ...) {
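
A sketch of the signature I have in mind (only the argument order changes; placing ... before the DBI-specified arguments forces callers to name them):

dbWriteTable__duckdb_connection_character_data.frame <- function(conn,
                                                                 name,
                                                                 value,
                                                                 ...,
                                                                 row.names = FALSE,
                                                                 overwrite = FALSE,
                                                                 append = FALSE,
                                                                 field.types = NULL,
                                                                 temporary = FALSE) {
  # body unchanged
}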

Debug symbols in CRAN builds

curl -I https://cran.r-project.org/bin/macosx/big-sur-arm64/contrib/4.3/duckdb_0.9.2-1.tgz 
...
Content-Length: 85023346

The file is over 80 MB. When unpacked, it contains ca. 300 MB of debug symbols (duckdb.so.dSYM).

We should investigate whether it is possible to not include those in the download.

CC @hadley

Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update

Hi, I'm still hitting this error on both the CRAN version and the duckdb_0.8.1-9000 dev version.

# using github
remotes::install_github("duckdb/duckdb-r")


# minimal reproducible example
library(duckdb)
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
dbWriteTable( con , 'my_table' , my_df )

console output:

> library(duckdb)
Loading required package: DBI
> con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
> my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
> dbWriteTable( con , 'my_table' , my_df )
Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update
In addition: Warning message:
Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 
> 
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] duckdb_0.8.1-9000 DBI_1.1.3        

loaded via a namespace (and not attached):
[1] compiler_4.3.1 tools_4.3.1   
> 
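
A possible workaround on the R side (my assumption, not a confirmed diagnosis: the strings are latin1-encoded but unmarked, so duckdb receives bytes that are not valid UTF-8) is to convert the encoding before writing:

library(duckdb)
con <- dbConnect(duckdb::duckdb(), dbdir = "my-db.duckdb")
my_df <- structure(list(no_municipio_esc = "Est\xe2ncia",
                        no_municipio_prova = "Est\xe2ncia"),
                   row.names = 16L, class = "data.frame")
# Convert every character column from latin1 to UTF-8:
my_df[] <- lapply(my_df, function(x) {
  if (is.character(x)) iconv(x, from = "latin1", to = "UTF-8") else x
})
dbWriteTable(con, "my_table", my_df)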

"Bleeding edge" install instructions don't pull latest duckdb

The readme suggests running R CMD INSTALL . on this repository to get "the bleeding edge", but this doesn't get the latest version of the underlying duckdb C++ library.

It would be nice to have shorthand instructions for installing the package with the latest version of both duckdb-r and duckdb, and to have those instructions added to the installation instructions in duckdb-web, all the more since there seems to be a large delay between releases of libduckdb and the R package.

How to deal with `IO Error: Cannot open file...used by another process`?

I've started implementing duckdb at work using {targets} and RStudio. However, between a couple of coworkers, various workflows persistently grind to a halt with:

Last error: rapi_startup: Failed to open database: IO Error: Cannot open file "...": The process cannot access the file because it is being used by another process.

We've tried being extra vigilant by always connecting with:

  con <- DBI::dbConnect(
    duckdb::duckdb(),
    dbdir = ...,
    read_only = TRUE
  )

And always disconnecting with:

DBI::dbDisconnect(con, shutdown = TRUE)

But when we need to update tables in the database and connect with read_only = FALSE, the 'used by another process' error always appears, and we can't ascertain where the lingering connection is, even after disconnecting, clearing our environments, and restarting R. Is this possibly an RStudio thing? Any guidance on managing connections would be greatly appreciated.
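
One defensive pattern we have been considering (a sketch of my own; the helper name with_duckdb is made up, not an API): scope every connection to a single function call so no handle can outlive the task that opened it.

with_duckdb <- function(dbdir, code, read_only = FALSE) {
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = dbdir, read_only = read_only)
  # Guarantee the connection is shut down even if `code` errors:
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
  code(con)
}

# Usage: with_duckdb("mydb.duckdb", function(con) DBI::dbListTables(con))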

[r] `bigint` advertised in `duckdb()` and `dbConnect()`, but only the former is supported

What happens?

For BIGINT columns the help page ?duckdb suggests a field bigint = "integer64" can be present either in dbConnect(duckdb(bigint = "integer64")) or dbConnect(duckdb(), bigint = "integer64"), but this is only respected in the first case.

The help page code is at https://github.com/duckdb/duckdb/blob/f7f5cf30aa5dfbc900c98c57bef2d729f2bf1b56/tools/rpkg/man/duckdb.Rd#L31 and the implementation showing that bigint will often be ignored is at https://github.com/duckdb/duckdb/blob/f7f5cf30aa5dfbc900c98c57bef2d729f2bf1b56/tools/rpkg/R/dbConnect__duckdb_driver.R#L55-L58

To Reproduce

Install dependencies:

install.packages(c('duckdb', 'bit64'))
library(duckdb)

bigint used when in duckdb()...

con = dbConnect(duckdb(bigint = "integer64"))
dbWriteTable(con, "one", data.frame(i64 = bit64::as.integer64(1:5)))
dbGetQuery(con, "DESCRIBE one")
##   column_name column_type null  key default extra
## 1         i64      BIGINT  YES <NA>    <NA>    NA

...but not when in dbConnect()

con = dbConnect(duckdb(), bigint = "integer64")
dbWriteTable(con, "one", data.frame(i64 = bit64::as.integer64(1:5)))
dbGetQuery(con, "DESCRIBE one")
##   column_name column_type null  key default extra
## 1         i64      DOUBLE  YES <NA>    <NA>    NA

OS:

aarch64-apple-darwin21.6.0

DuckDB Version:

0.8.1.1

DuckDB Client:

R

Full Name:

Martin Morgan

Affiliation:

Roswell Park Comprehensive Cancer Center

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

dbplyr 2.4.0 breaks duckdb's ability to treat files as tables

DuckDB's ability to execute SELECT statements directly on files (CSV, Parquet, etc.) worked fine with dbplyr up until 2.3.4. However, with the upgrade to 2.4.0, it fails. In the example below I'm querying a specific file called metadata.0.2.3.parquet, but you can replace it with any parquet file and the same issue will occur:

duckdb::duckdb() |>
    DBI::dbConnect(drv = _, read_only = TRUE) |>
    dplyr::tbl("metadata.0.2.3.parquet")

It looks like you tried to incorrectly use a table in a schema as source.
ℹ If you want to specify a schema use `in_schema()` or `in_catalog()`.
ℹ If your table actually contains "." in the name use `check_from = FALSE` to silence this message.
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! rapi_prepare: Failed to prepare query SELECT "metadata.0.2.3.parquet".*
FROM "metadata.0.2.3.parquet"
LIMIT 11
Error: Binder Error: Referenced table "metadata.0.2.3.parquet" not found!
Candidate tables: "metadata"
Run `rlang::last_trace()` to see where the error occurred.

According to tidyverse/dbplyr#1390, this bug is something that needs to be fixed in the duckdb package.
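
Until that fix lands, one workaround (a sketch using raw SQL, which dbplyr passes through untouched, so the file name is never parsed as a schema-qualified identifier):

library(dplyr)
con <- DBI::dbConnect(duckdb::duckdb(), read_only = TRUE)
tbl(con, sql("SELECT * FROM 'metadata.0.2.3.parquet'"))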

Comparing strings with numbers in the relational API

Behavior differs between relational and SQL. @Tmonster: why would that be?

Relational

duckdb <- asNamespace("duckdb")
con <- DBI::dbConnect(duckdb::duckdb())
experimental <- FALSE
invisible(DBI::dbExecute(con, "CREATE MACRO \"!=\"(x, y) AS x <> y"))
df1 <- data.frame(a = 1)

rel1 <- duckdb$rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb$rel_filter(
  rel1,
  list(
    duckdb$expr_function(
      "!=",
      list(
        duckdb$expr_reference("a"),
        if ("experimental" %in% names(formals(duckdb$expr_constant))) {
          duckdb$expr_constant("", experimental = experimental)
        } else {
          duckdb$expr_constant("")
        }
      )
    )
  )
)
rel2
#> DuckDB Relation: 
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Filter [!=(a, '')]
#>   r_dataframe_scan(0x152320908)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (DOUBLE)
duckdb$rel_to_altrep(rel2)
#> [1] a
#> <0 rows> (or 0-length row.names)

Created on 2023-11-16 with reprex v2.0.2

SQL

✗ echo 'CREATE MACRO "!="(x, y) AS x <> y; SELECT 1.0 != '"''"' AS a' | build/debug/duckdb
Error: near line 1: Conversion Error: Could not convert string "" to DECIMAL(2,1)
✗ echo 'CREATE MACRO "!="(x, y) AS x <> y; SELECT '"''"' != 1.0 AS a' | build/debug/duckdb
Error: near line 1: Conversion Error: Could not convert string "" to DECIMAL(2,1)

Unable to install duckdb R package.

I am trying to install the duckdb R package. Every time I try, it takes an hour and then fails. I have tried to install from CRAN, R-Universe, and directly from the source tar.gz file. None of them works. The truncated installation output is below (because it exceeds the GitHub issue character limit of 65,536 characters).

Interestingly, devtools::install_github("duckdb/duckdb-r") works perfectly, but it doesn't come with any extensions. How can I get it to work from CRAN?

> install.packages("duckdb", repos = c("https://duckdb.r-universe.dev", "https://cloud.r-project.org"))
Installing package into 'C:/Documents/R/win-library/4.1'
(as 'lib' is unspecified)

  There is a binary version available but the source version is later:
        binary  source needs_compilation
duckdb 0.7.1-1 0.9.2-1              TRUE

installing the source package 'duckdb'

trying URL 'https://duckdb.r-universe.dev/src/contrib/duckdb_0.9.2-1.tar.gz'
Content type 'application/x-gzip' length 3962978 bytes (3.8 MB)
downloaded 3.8 MB

* installing *source* package 'duckdb' ...
** using staged installation
** libs
Warning: this package has a non-empty 'configure.win' file,
so building only the main architecture

/mingw64/bin/g++  -std=gnu++17 -I"C:/DOCUME~1/R/R-41~1.3/include" -DNDEBUG -Iinclude -I../inst/include -DDUCKDB_DISABLE_PRINT -DDUCKDB_R_BUILD -Iduckdb/src/include -Iduckdb/third_party/fmt/include -Iduckdb/third_party/fsst -Iduckdb/third_party/re2 -Iduckdb/third_party/miniz -Iduckdb/third_party/utf8proc/include -Iduckdb/third_party/utf8proc -Iduckdb/third_party/hyperloglog -Iduckdb/third_party/skiplist -Iduckdb/third_party/fastpforlib -Iduckdb/third_party/tdigest -Iduckdb/third_party/libpg_query/include -Iduckdb/third_party/libpg_query -Iduckdb/third_party/concurrentqueue -Iduckdb/third_party/pcg -Iduckdb/third_party/httplib -Iduckdb/third_party/fast_float -Iduckdb/third_party/mbedtls -Iduckdb/third_party/mbedtls/include -Iduckdb/third_party/mbedtls/library -Iduckdb/third_party/jaro_winkler -Iduckdb/third_party/jaro_winkler/details -Iduckdb/extension/parquet/include -Iduckdb/third_party/parquet -Iduckdb/third_party/thrift -Iduckdb/third_party/snappy -Iduckdb/third_party/zstd/include -Iduckdb/third_party/mbedtls -Iduckdb/third_party/mbedtls/include -I../inst/include -Iduckdb -DDUCKDB_EXTENSION_PARQUET_LINKED -DDUCKDB_BUILD_LIBRARY -DDUCKDB_PLATFORM_RTOOLS=1         -O2 -Wall  -mfpmath=sse -msse2 -mstackrealign  -c duckdb/ub_src_catalog.cpp -o duckdb/ub_src_catalog.o

. . .

/mingw64/bin/g++  -std=gnu++17 -I"C:/DOCUME~1/R/R-41~1.3/include" -DNDEBUG -Iinclude -I../inst/include -DDUCKDB_DISABLE_PRINT -DDUCKDB_R_BUILD -Iduckdb/src/include -Iduckdb/third_party/fmt/include -Iduckdb/third_party/fsst -Iduckdb/third_party/re2 -Iduckdb/third_party/miniz -Iduckdb/third_party/utf8proc/include -Iduckdb/third_party/utf8proc -Iduckdb/third_party/hyperloglog -Iduckdb/third_party/skiplist -Iduckdb/third_party/fastpforlib -Iduckdb/third_party/tdigest -Iduckdb/third_party/libpg_query/include -Iduckdb/third_party/libpg_query -Iduckdb/third_party/concurrentqueue -Iduckdb/third_party/pcg -Iduckdb/third_party/httplib -Iduckdb/third_party/fast_float -Iduckdb/third_party/mbedtls -Iduckdb/third_party/mbedtls/include -Iduckdb/third_party/mbedtls/library -Iduckdb/third_party/jaro_winkler -Iduckdb/third_party/jaro_winkler/details -Iduckdb/extension/parquet/include -Iduckdb/third_party/parquet -Iduckdb/third_party/thrift -Iduckdb/third_party/snappy -Iduckdb/third_party/zstd/include -Iduckdb/third_party/mbedtls -Iduckdb/third_party/mbedtls/include -I../inst/include -Iduckdb -DDUCKDB_EXTENSION_PARQUET_LINKED -DDUCKDB_BUILD_LIBRARY -DDUCKDB_PLATFORM_RTOOLS=1         -O2 -Wall  -mfpmath=sse -msse2 -mstackrealign  -c cpp11.cpp -o cpp11.o
Error in system(sprintf("%s -Pg %s", nm, shQuote(f)), intern = TRUE) : 
  '""' not found
Calls: <Anonymous> -> lapply -> FUN -> strsplit -> system
Execution halted
make: *** [C:/DOCUME~1/R/R-41~1.3/share/make/winshlib.mk:35: symbols.rds] Error 1
ERROR: compilation failed for package 'duckdb'
* removing 'C:/Documents/R/win-library/4.1/duckdb'
* restoring previous 'C:/Documents/R/win-library/4.1/duckdb'
Warning in install.packages :
  installation of package 'duckdb' had non-zero exit status

Here is my sessionInfo()

> sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2     purrr_1.0.1     readr_2.1.4     tidyr_1.3.0    
 [8] tibble_3.2.1    ggplot2_3.4.1   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] rstudioapi_0.14  magrittr_2.0.3   hms_1.1.3        tidyselect_1.2.0 munsell_0.5.0    timechange_0.2.0 colorspace_2.1-1
 [8] R6_2.5.1         rlang_1.1.0      fansi_0.5.0      tools_4.1.3      pak_0.6.0        grid_4.1.3       gtable_0.3.1    
[15] utf8_1.2.3       cli_3.6.0        withr_2.5.0      lifecycle_1.0.3  processx_3.8.0   tzdb_0.3.0       callr_3.7.3     
[22] vctrs_0.6.4      ps_1.7.2         glue_1.6.2       stringi_1.7.5    compiler_4.1.3   pillar_1.9.0     generics_0.1.3  
[29] scales_1.2.1     pkgconfig_2.0.3 

any tricks to speed up install time

Any known tricks to speed up install time?
Also, this happens during R CMD check:

* checking whether package 'duckdb' can be installed ...
 [35m/35m] OK

35 minutes is quite a blocker for comfortably testing changes...
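
One commonly suggested trick (an assumption about your setup: a multi-core machine, plus ccache installed if you use the second part): parallelize compilation and cache object files, since duckdb ships many C++ translation units.

# Parallel make for installs in this session:
Sys.setenv(MAKEFLAGS = "-j8")
# Optionally, in ~/.R/Makevars, route the compiler through ccache so
# rebuilding after small changes is nearly free:
#   CXX = ccache g++
install.packages("duckdb")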

Revisit `tbl.duckdb_connection()`

I don't understand the code and the test:

# Customized handling for tbl() to allow the use of replacement scans
# @param src .con A \code{\link{dbConnect}} object, as returned by \code{dbConnect()}
# @param from Table or parquet/csv -files to be registered
# @param cache Enable object cache for parquet files
tbl.duckdb_connection <- function(src, from, cache = FALSE, ...) {
  ident_q <- pkg_method("ident_q", "dbplyr")
  if (!inherits(from, "sql") & !DBI::dbExistsTable(src, from)) from <- ident_q(from)
  if (cache) DBI::dbExecute(src, "PRAGMA enable_object_cache")
  NextMethod("tbl")
}

test_that("Object cache can be enabled for parquet files with dplyr::tbl()", {
skip_if_not_installed("dbplyr")
con <- DBI::dbConnect(duckdb())
on.exit(DBI::dbDisconnect(con, shutdown = TRUE))
DBI::dbExecute(con, "SET enable_object_cache=False;")
tab1 <- dplyr::tbl(con, "data/userdata1.parquet", cache = TRUE)
expect_true(DBI::dbGetQuery(con, "SELECT value FROM duckdb_settings() WHERE name='enable_object_cache';") == "true")
DBI::dbExecute(con, "SET enable_object_cache=False;")
tab2 <- dplyr::tbl(con, "'data/userdata1.parquet'", cache = FALSE)
expect_true(DBI::dbGetQuery(con, "SELECT value FROM duckdb_settings() WHERE name='enable_object_cache';") == "false")
})

I have the following questions:

  • What is this object cache for parquet files?
  • We're only executing the PRAGMA with cache = TRUE; is this desired? Should we also run it with cache = FALSE?
  • Does the PRAGMA have a permanent side effect on the connection (yes, according to the test)? Should we reset it afterwards?
  • To my understanding, the use of ident_q() is discouraged, and it also has no effect here because we're using NextMethod(). What's the intention here, and do we need it?
  • Extra arguments to tbl() seem to be broken via tidyverse/dbplyr#1384; should we perhaps offer a different, more robust API? We could implement a dedicated function that documents all available pragmas and perhaps even offers them as autocomplete.

CC @mgirlich + @hannes.

Duckdb 0.10.0 (stable) Release for R ?

Hello, first of all thanks for this amazing package.
Before the 0.10.0 (stable) version, I had silent errors when exporting a MariaDB table into DuckDB:

LOAD mysql;
ATTACH 'host=127.0.0.1 user=username password=pwd port=0 database=db_name port=3306' AS mysqldb (TYPE mysql, READ_ONLY);
CREATE OR REPLACE TABLE %s AS SELECT * FROM mysqldb.%s;

This resulted in DuckDB tables with some empty values.

Since 0.10.0 (stable), it works using the DuckDB CLI.
But as the .duckdb file format generated by the 0.10.0 (stable) version is not recognized by the R package version duckdb_0.9.2-1, I have to export the tables to Parquet and then instantiate a new DuckDB instance importing those Parquet files:

> con <- dbConnect(duckdb("mydb.duckdb", read_only = TRUE))
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'drv' in selecting a method for function 'dbConnect': rapi_startup: Failed to open database: Serialization Error: Failed to deserialize: field id mismatch, expected: 201, got: 103

Any chance we get a 0.10.0 (stable)-compatible package in R?

Thanks a lot in advance.

parameters passed using dbConnect( config = list() ) are ignored

Hi,
I saw some code (here) that used config = list() to pass parameters to a duckdb connection.
I tried it, but for me they are ignored, as illustrated below where the memory limit is 432 GB instead of the requested 1 GB.

Passing parameters using dbExecute(con, "PRAGMA threads=1; PRAGMA memory_limit='1GB';") does work.

Maybe config = list() is not a thing; I just wanted to make sure:

library(dplyr)
library(dbplyr)
library(duckdb)

duckdb_path <- "/devroot/sandbox/tmp/duckdb.duckdb"

con <- dbConnect(duckdb::duckdb(dbdir = duckdb_path), config=list("memory_limit"="1GB", "threads" = "1") )
dbGetQuery(con, "select current_setting('memory_limit')")  #                         432.6GB
dbGetQuery(con, "select current_setting('threads')")  # 64
dbExecute(con, "PRAGMA threads=1; PRAGMA memory_limit='1GB';")
dbGetQuery(con, "select current_setting('memory_limit')")    # 1GB
dbGetQuery(con, "select current_setting('threads')")   # 1


 Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.2 (2020-06-22)
 os       OpenShift Enterprise
 system   x86_64, linux-gnu
 ui       RStudio
 language (EN)
 collate  en_CA.UTF-8
 ctype    en_CA.UTF-8
 tz       America/Toronto
 date     2024-02-23
 rstudio  2023.06.0+421.pro1 Mountain Hydrangea (server)
 pandoc   NA

─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 ! package     * version date (UTC) lib source
 P cli           3.6.2   2023-12-11 [?] RSPM (R 4.0.5)
 P DBI         * 1.2.1   2024-01-12 [?] RSPM (R 4.0.5)
 P dbplyr      * 2.4.0   2023-10-26 [?] RSPM (R 4.0.5)
 P dplyr       * 1.1.4   2023-11-17 [?] RSPM (R 4.0.5)
 P duckdb      * 0.9.2-1 2023-11-28 [?] RSPM (R 4.0.5)
 P fansi         1.0.4   2023-01-22 [?] CRAN (R 4.0.2)
 P generics      0.1.3   2022-07-05 [?] RSPM (R 4.0.5)
 P glue          1.6.2   2022-02-24 [?] RSPM (R 4.0.5)
 P lifecycle     1.0.3   2022-10-07 [?] RSPM (R 4.0.5)
   magrittr      2.0.3   2022-03-30 [1] RSPM (R 4.0.5)
 P pillar        1.9.0   2023-03-22 [?] RSPM (R 4.0.5)
 P pkgconfig     2.0.3   2019-09-22 [?] RSPM (R 4.0.3)
   R6            2.5.1   2021-08-19 [1] RSPM (R 4.0.5)
   renv          1.0.3   2023-09-19 [1] RSPM (R 4.0.2)
 P rlang         1.1.3   2024-01-10 [?] RSPM (R 4.0.5)
 P rstudioapi    0.15.0  2023-07-07 [?] RSPM (R 4.0.5)
 P sessioninfo   1.2.2   2021-12-06 [?] RSPM (R 4.0.5)
 P tibble        3.2.1   2023-03-20 [?] RSPM (R 4.0.5)
 P tidyselect    1.2.0   2022-10-10 [?] RSPM (R 4.0.5)
 P utf8          1.2.3   2023-01-31 [?] RSPM (R 4.0.5)
 P vctrs         0.6.5   2023-12-01 [?] RSPM (R 4.0.5)

[1] xxxxxxx /renv/library/R-4.0/x86_64-pc-linux-gnu
 [2] /opt/R/4.0.2/lib/R/library

 P ── Loaded and on-disk path mismatch.

How to write a new table from a dplyr query without collecting in R

Hi,

is it possible to save a view of / register a result from a dplyr query with a duckdb table as source?
In this basic example, I collect or compute before saving to a new table.
Is it possible to bypass this?
(My queries are often complex, so doing this via SQL is not a real option for me.)

duck <- dbConnect(duckdb())

duck |> 
  tbl(sql("select * from nested_table")) |>
  pivot_wider(names_from = key, values_from = c(str, int)) |>
  to_arrow() |>
  compute() |>
  to_duckdb(duck, "unested_table", auto_disconnect = FALSE) 
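
One pattern that might achieve this (a sketch; my assumption is that compute() on a dbplyr lazy table accepts a table name and a temporary flag):

duck |>
  tbl(sql("select * from nested_table")) |>
  pivot_wider(names_from = key, values_from = c(str, int)) |>
  compute(name = "unnested_table", temporary = FALSE)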

Thank you.

Relational: Anti-join with `IS DISTINCT FROM`

What happens?

Supporting this might lead to faster execution in duckplyr.

To Reproduce

Needs duckdb/duckdb#8600.

con <- DBI::dbConnect(duckdb::duckdb())
experimental <- FALSE
invisible(
  DBI::dbExecute(con, "CREATE MACRO \"___eq_na_matches_na\"(x, y) AS (x IS DISTINCT FROM y)")
)
df1 <- data.frame(a = 1L)

rel1 <- duckdb:::rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb:::rel_set_alias(rel1, "lhs")
rel3 <- duckdb:::rel_from_df(con, df1, experimental = experimental)
rel4 <- duckdb:::rel_set_alias(rel3, "rhs")
rel5 <- duckdb:::rel_join(
  rel2,
  rel4,
  list(
    duckdb:::expr_function(
      "___eq_na_matches_na",
      list(duckdb:::expr_reference("a", rel2), duckdb:::expr_reference("a", rel4))
    )
  ),
  "anti"
)
rel5
#> DuckDB Relation: 
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Join REGULAR ANTI ___eq_na_matches_na(lhs.a, rhs.a)
#>   r_dataframe_scan(0x12821dee8)
#>   r_dataframe_scan(0x12821dee8)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (INTEGER)
duckdb:::rel_to_altrep(rel5)
#> Error in row.names.data.frame(x): Error evaluating duckdb query: Not implemented Error: Unimplemented comparison type for join!

Created on 2023-08-17 with reprex v2.0.2

OS:

macOS aarch64

DuckDB Version:

0ff709bdc628ea24111265eb66d74220ce3bb6df

DuckDB Client:

R

Full Name:

Kirill Müller

Affiliation:

cynkra GmbH

Have you tried this on the latest master branch?

I have tested with a master build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have

Returning a duckdb table to R via tbl() prints an error if a schema name is used

con <- dbConnect(duckdb())
dbWriteTable(con, "mtcars", mtcars)

# the following works:
tbl(con,"mtcars")

# the following also seems to work, but it prints an error: "It looks like you tried to incorrectly use a table in a schema as source."
tbl(con,"main.mtcars")

# The error is printed whenever you use tbl() with a schema name
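
The message itself points at the supported spelling; this variant (a sketch using dbplyr's in_schema()) avoids the error:

tbl(con, dbplyr::in_schema("main", "mtcars"))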

R: rel_to_sql() generates invalid SQL for `r_dataframe_scan()`

What happens?

To what extent is the SQL generated by rel_to_sql() supposed to be runnable? A simple example involving row_number() already gives parse errors.

To Reproduce

con <- DBI::dbConnect(duckdb::duckdb())
experimental <- FALSE
invisible(DBI::dbExecute(con, "CREATE MACRO \"==\"(a, b) AS a = b"))
df1 <- data.frame(a = 1)

rel1 <- duckdb:::rel_from_df(con, df1, experimental = experimental)
rel2 <- duckdb:::rel_project(
  rel1,
  list({
    tmp_expr <- duckdb:::expr_window(duckdb:::expr_function("row_number", list()), list(), list(), offset_expr = NULL, default_expr = NULL)
    duckdb:::expr_set_alias(tmp_expr, "___row_number")
    tmp_expr
  })
)
sql <- duckdb:::rel_to_sql(rel2)
writeLines(sql)
#> SELECT row_number(, NULL, NULL) OVER () AS ___row_number FROM (SELECT * FROM r_dataframe_scan(0x107deb930, (experimental = false))) AS dataframe_4427004208_1881778879
DBI::dbGetQuery(con, sql)
#> Error: Parser Error: syntax error at or near ","
#> LINE 1: SELECT row_number(, NULL, NULL) OVER () AS ___row_number ...
#>                           ^

Created on 2023-04-10 with reprex v2.0.2

OS:

macOS aarch64

DuckDB Version:

7bdaddf8e4504405218c8521c7c38d9c3abf33f6

DuckDB Client:

R

Full Name:

Kirill Müller

Affiliation:

cynkra GmbH

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree

garbage-collection warning on instantiation

This is related to #34, but while that issue asks about suppressing the warning in general, this issue is about why we see that warning on instantiation. (Perhaps from #34 (comment), this might be somewhat of a dupe, but I see a difference between suppressing the warning in general, and fixing why we see the warning here in the first place.)

(fn <- tempfile(fileext = ".duckdb"))
# [1] "/tmp/RtmpxVwd7K/file1a8c4190c110f.duckdb"
gc() ; con <- DBI::dbConnect(duckdb::duckdb(), dbdir=fn) ; gc()
#          used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 359465 19.2     664345 35.5   664345 35.5
# Vcells 698733  5.4    8388608 64.0  1815226 13.9
#           used (Mb) gc trigger (Mb) max used (Mb)
# Ncells  538607 28.8    1206704 64.5   664345 35.5
# Vcells 1037401  8.0    8388608 64.0  1815226 13.9
# Warning message:
# Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 

I cannot find a way to instantiate the connection without this warning. Yes, it's just a warning, but ... I don't know how to prevent users of my packages/code from seeing the warning (suppressWarnings(..) doesn't work here).


sessioninfo::session_info()
# ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#  setting  value
#  version  R version 4.3.2 (2023-10-31)
#  os       Ubuntu 23.10
#  system   x86_64, linux-gnu
#  ui       X11
#  language (EN)
#  collate  C.UTF-8
#  ctype    C.UTF-8
#  tz       America/New_York
#  date     2023-12-20
#  pandoc   2.17.1.1 @ /usr/bin/pandoc
# ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#  package     * version date (UTC) lib source
#  cli           3.6.1   2023-03-23 [1] RSPM (R 4.3.0)
#  DBI           1.1.3   2022-06-18 [1] RSPM (R 4.3.0)
#  duckdb        0.9.2-1 2023-11-28 [1] RSPM (R 4.3.2)
#  sessioninfo   1.2.2   2021-12-06 [1] RSPM (R 4.3.0)
#  [1] /home/r2/R/x86_64-pc-linux-gnu-library/4.3
#  [2] /opt/R/4.3.2/lib/R/library
# ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Make `duckdb_register` as a generic function

Currently there is a separate function named duckdb_register_arrow available apart from duckdb_register, but I am wondering if these could be made into one S3 generic function.
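
A minimal sketch of what I mean (the wrapper name and dispatch classes here are illustrative only, not a concrete API proposal):

duckdb_register2 <- function(conn, name, value, ...) {
  UseMethod("duckdb_register2", value)
}
duckdb_register2.data.frame <- function(conn, name, value, ...) {
  duckdb::duckdb_register(conn, name, value)
}
duckdb_register2.ArrowTabular <- function(conn, name, value, ...) {
  duckdb::duckdb_register_arrow(conn, name, value)
}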

Terribly slow joins with custom condition over Parquet files

duckdb-only and duckdb-CLI-only example pending.

Could it be that the join is O(n^2) in this particular case?

options(conflicts.policy = list(warn = FALSE))
library(duckplyr)

data <- data.frame(a = seq_len(30000))
arrow::write_parquet(data, "data.parquet")

x <- duckplyr_df_from_file("data.parquet", "read_parquet")
y <- duckplyr_df_from_file("data.parquet", "read_parquet")

bench::mark(
  iterations = 1,
  collect(arrange(left_join(x, y, by = "a"), a)),
  collect(arrange(left_join(x, y, by = "a", na_matches = "never"), a))
)
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Order [a ASC]
#>   Projection [___coalesce(lhs.a_x, rhs.a_y) as a]
#>     Join REGULAR LEFT ___eq_na_matches_na(lhs.a_x, rhs.a_y)
#>       Projection [a as a_x]
#>         read_parquet(data.parquet)
#>       Projection [a as a_y]
#>         read_parquet(data.parquet)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (INTEGER)
#> 
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Order [a ASC]
#>   Projection [___coalesce(lhs.a_x, rhs.a_y) as a]
#>     Join REGULAR LEFT ==(lhs.a_x, rhs.a_y)
#>       Projection [a as a_x]
#>         read_parquet(data.parquet)
#>       Projection [a as a_y]
#>         read_parquet(data.parquet)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (INTEGER)
#> 
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Order [a ASC]
#>   Projection [___coalesce(lhs.a_x, rhs.a_y) as a]
#>     Join REGULAR LEFT ___eq_na_matches_na(lhs.a_x, rhs.a_y)
#>       Projection [a as a_x]
#>         read_parquet(data.parquet)
#>       Projection [a as a_y]
#>         read_parquet(data.parquet)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (INTEGER)
#> 
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Order [a ASC]
#>   Projection [___coalesce(lhs.a_x, rhs.a_y) as a]
#>     Join REGULAR LEFT ==(lhs.a_x, rhs.a_y)
#>       Projection [a as a_x]
#>         read_parquet(data.parquet)
#>       Projection [a as a_y]
#>         read_parquet(data.parquet)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (INTEGER)
#> # A tibble: 2 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 "collect(arrange(left_join(x, … 628.13ms 628.13ms      1.59    1.66MB        0
#> 2 "collect(arrange(left_join(x, …   4.48ms   4.48ms    223.       6.9KB        0

Created on 2023-11-06 with reprex v2.0.2

garbage-collection issue: the file size is not reduced and keep growing despite deleting all the tables!

I have an issue with the duckdb database created through the R interface. I realised that the database file keeps growing even if I delete all the tables. The garbage-collector warning produced by calling gc() asks me to use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv), but after executing these commands the file size was unchanged, and the warning message appeared again on the next connection.

Here is a reproducible example:

> library(duckdb)
> con <- dbConnect(duckdb(), dbdir = "dbtest.duckdb", read_only = FALSE) # file size 12K

> dbWriteTable(con,"tbl",data.frame(x=rnorm(10000000))) # file size:  85.7 Mb

> dbListTables(con)
[1] "tbl"

> dbRemoveTable(con,'tbl') # file size: 85.7 Mb

> dbListTables(con) 
character(0)

> dbDisconnect(con, shutdown=T) # file size: 85.7 Mb

> gc()
          used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells 3792014 202.6    6845970 365.7         NA  6612186 353.2
Vcells 6318596  48.3   21276863 162.4      98304 17944806 137.0
Warning message:
Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 

# file size is still 85.7 Mb

> duckdb::duckdb_shutdown(duckdb('dbtest.duckdb'))
> gc() # no change in the file size

Any advice?
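
One route that might reclaim the space (a sketch; it assumes duckdb does not shrink files in place, so the live data has to be copied into a fresh file via EXPORT/IMPORT DATABASE):

library(duckdb)
con <- dbConnect(duckdb(), dbdir = "dbtest.duckdb")
# Dump schema and data to a directory of flat files...
dbExecute(con, "EXPORT DATABASE 'dbtest-export'")
dbDisconnect(con, shutdown = TRUE)

# ...then re-import into a brand-new, compact database file:
con2 <- dbConnect(duckdb(), dbdir = "dbtest-new.duckdb")
dbExecute(con2, "IMPORT DATABASE 'dbtest-export'")
dbDisconnect(con2, shutdown = TRUE)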

Here is my SessionInfo():

> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Amsterdam
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] duckdb_0.9.2-1 DBI_1.2.0     

loaded via a namespace (and not attached):
 [1] terra_1.7-55        vctrs_0.6.5         cli_3.6.2           rlang_1.1.3        
 [5] ncdf4_1.22          png_0.1-8           purrr_1.0.2         generics_0.1.3     
 [9] glue_1.7.0          plyr_1.8.9          maxEnt_1.0-8        sdm_1.2-18         
[13] sp_2.0-0            fansi_1.0.6         grid_4.3.2          tibble_3.2.1       
[17] interp_1.1-4        lifecycle_1.0.4     compiler_4.3.2      dplyr_1.1.3        
[21] codetools_0.2-19    RColorBrewer_1.1-3  Rcpp_1.0.12         pkgconfig_2.0.3    
[25] tidyr_1.3.0         rstudioapi_0.15.0   latticeExtra_0.6-30 lattice_0.22-5     
[29] R6_2.5.1            tidyselect_1.2.0    utf8_1.2.4          pillar_1.9.0       
[33] parallel_4.3.2      magrittr_2.0.3      jpeg_0.1-10         tools_4.3.2        
[37] deldir_1.0-9        raster_3.6-23   

``histogram(arg)`` aggregate function not working in R

OS: OSX
DuckDB version: 0.9.2-1

> tbl(con,sql("SELECT * FROM range(100) AS tt(x)")) %>%
+ summarise(h=sql("histogram(x)"))
Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! rapi_prepare: Unknown column type for prepare: MAP(BIGINT, UBIGINT)
Run `rlang::last_trace()` to see where the error occurred.
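
A workaround that may help here too (an assumption on my part, in the same family as the MAP issue above): stringify the aggregate in SQL so R receives a supported column type.

tbl(con, sql("SELECT * FROM range(100) AS tt(x)")) %>%
  summarise(h = sql("CAST(histogram(x) AS VARCHAR)"))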

Excessive RAM usage for `DBI::dbWriteTable()` and `dplyr::collect()`

32GB RAM use when writing a 16GB file, and also RAM use reaching 32GB momentarily when reading the same 16GB file. From #72 (comment), by @SimonCoulombe.

library(DBI)
library(dplyr)
library(dbplyr)
library(duckdb)

duckdb_path <- "/devroot/sandbox/tmp/duckdb.duckdb"

con <- dbConnect(duckdb::duckdb(dbdir = duckdb_path))
dbExecute(con, "PRAGMA threads=1; PRAGMA memory_limit='1GB';")

# run this once to create the duckdb file  then restart session:
if (FALSE){
  bigdata <-  data.table::rbindlist(rlang::rep_along(1:3e6, list(iris)))
  dim(bigdata) # 450M rows, 5 columns
  lobstr::obj_size(bigdata) # 16.20 GB in RAM

  dbWriteTable(con, "straight_from_memory", bigdata)
}


bigdata <- tbl(con, "straight_from_memory") %>% collect()

Release 0.9.0 on CRAN?

Hi, I think it's been about a week since DuckDB 0.9.0 was released. Are there any plans to release the R package on CRAN?

dbExecute(con, "load 'httpfs';") Failed

Hi,
I have a problem with duckdb in R:

library(duckdb)
con <- dbConnect(duckdb::duckdb())

dbExecute(con, "INSTALL 'fts';")

dbExecute(con, "LOAD 'fts';")

dbExecute(con, "install 'httpfs';")
dbExecute(con, "load 'httpfs';")

dbExecute(con, "install 'httpfs';") is OK
but dbExecute(con, "load 'httpfs';") Failed

Error: rapi_execute: Failed to run query
Error: IO Error: Extension "C:\Users.....\AppData\Roaming\R\data\R\duckdb.duckdb\extensions\2414840843\windows_amd64_rtools\httpfs.duckdb_extension" could not be loaded: Le module spécifié est introuvable. (English: "The specified module could not be found.")

Regards

Option to silence warning "Database is garbage-collected ..."

In my scripts where I connect to a duckdb, I oftentimes get the warning Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this..

When I use duckdb in combination with a shiny app, or need to work with many different databases sequentially, this clutters the console as the warning is repeated n times.

con <- DBI::dbConnect(duckdb::duckdb(), "my-db.db")
on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)

# do something with the database
# eventually I get
#> Warning: Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this.

The responsible line is src/database.cpp#L12.

Is it possible to have an option to silence the warning? Maybe something like options(duckdb.silence.disconnect_warning = TRUE)?
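
In the meantime, a stopgap that may work (a sketch of my own; it matches on the warning text, which could change between releases):

muffle_duckdb_gc <- function(expr) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      # Swallow only the specific duckdb finalizer warning:
      if (grepl("Database is garbage-collected", conditionMessage(w))) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

muffle_duckdb_gc(gc())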
