
💂 sergeant

Tools to Transform and Query Data with ‘Apache’ ‘Drill’

** IMPORTANT **

Version 0.7.0+ (a.k.a. the main branch) splits off the JDBC interface into a separate package, sergeant.caffeinated (GitHub).

Description

Drill + sergeant is (IMO) a streamlined alternative to Spark + sparklyr if you don’t need the ML components of Spark (i.e. you just need to query “big data” sources, interface with parquet, or combine disparate data source types such as json, csv, parquet, and rdbms for aggregation). Drill also has support for spatial queries.

Drill SQL queries that reference parquet files on a local Linux or macOS workstation can often be more performant than doing the same data ingestion & wrangling work in R (especially for large or disparate data sets). Drill can also help streamline workflows that involve wrangling many tiny JSON files on a daily basis.

Drill can be obtained from https://drill.apache.org/download/ (use “Direct File Download”). Drill can also be installed via Docker. For local installs on Unix-like systems, /usr/local/drill is a common choice for the install directory.
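The Docker route can be sketched as follows; the image name and tag are assumptions based on the Drill-1.15-era docs (newer releases publish under apache/drill), so check Docker Hub for the current image:

```shell
# Assumed image/tag; newer Drill releases publish under apache/drill.
# Exposes the web UI / REST API on port 8047, which sergeant talks to.
docker run -i --name drill-1.15.0 -p 8047:8047 --detach -t drill/apache-drill:1.15.0 /bin/bash
```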

Drill embedded (started via the $DRILL_BASE_DIR/bin/drill-embedded script) is a super-easy way to get started playing with Drill on a single workstation, and many workflows can “get by” using Drill this way.

There are a few convenience wrappers for various informational SQL queries (like drill_version()). Please file a PR if you add more.

Some of the REST API functions that are geared more toward controlling Drill than toward data ops aren’t implemented. Please file a PR if you need those.

The following functions are implemented:

DBI (REST)

  • A “just enough” feature complete R DBI driver has been implemented using the Drill REST API, mostly to facilitate the dplyr interface. Use the RJDBC driver interface if you need more DBI functionality.
  • This also means that SQL functions unique to Drill have also been “implemented” (i.e. made accessible to the dplyr interface). If you have custom Drill SQL functions that need to be implemented, please file an issue on GitHub. Many should work without explicit support, but some may require a custom interface.

dplyr: (REST)

  • src_drill: Connect to Drill (using dplyr) + supporting functions

Note that a number of Drill SQL functions have been mapped to R functions (e.g. grepl) to make it easier to transition from non-database-backed SQL ops to Drill. See the help on drill_custom_functions for more info on these helper Drill custom function mappings.

Drill APIs:

  • drill_connection: Setup parameters for a Drill server/cluster connection
  • drill_active: Test whether Drill HTTP REST API server is up
  • drill_cancel: Cancel the query that has the given queryid
  • drill_functions: Show all the available Drill built-in functions & UDFs (Apache Drill 1.15.0+ required)
  • drill_jdbc: Connect to Drill using JDBC
  • drill_metrics: Get the current memory metrics
  • drill_options: List the name, default, and data type of the system and session options
  • drill_popts: Show all the available Drill options (1.15.0+)
  • drill_profile: Get the profile of the query that has the given query id
  • drill_profiles: Get the profiles of running and completed queries
  • drill_query: Submit a query and return results
  • drill_set: Set Drill SYSTEM or SESSION options
  • drill_settings_reset: Changes (optionally, all) session settings back to system defaults
  • drill_show_files: Show files in a file system schema
  • drill_show_schemas: Returns a list of available schemas
  • drill_stats: Get Drillbit information, such as port numbers
  • drill_status: Get the status of Drill
  • drill_storage: Get the list of storage plugin names and configurations
  • drill_system_reset: Changes (optionally, all) system settings back to system defaults
  • drill_threads: Get information about threads
  • drill_uplift: Turn columnar query results into a type-converted tbl
  • drill_use: Change to a particular schema
  • drill_version: Identify the version of Drill running

Helpers

  • ctas_profile: Generate a Drill CTAS Statement from a Query
  • drill_up: Start a Dockerized Drill instance
  • drill_down: Stop a Dockerized Drill instance by container id
  • showall_drill: Show all dead and running Drill Docker containers
  • stopall_drill: Prune all dead and running Drill Docker containers

Installation

install.packages("sergeant", repos = "https://cinc.rud.is")
# or
devtools::install_git("https://git.rud.is/hrbrmstr/sergeant.git")
# or
devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant")
# or
devtools::install_gitlab("hrbrmstr/sergeant")
# or
devtools::install_bitbucket("hrbrmstr/sergeant")
# or
devtools::install_github("hrbrmstr/sergeant")

Usage

dplyr interface

library(sergeant)
library(tidyverse)

# use "localhost" if running standalone on the same system; otherwise use the host or IP of your Drill server
ds <- src_drill("localhost")
db <- tbl(ds, "cp.`employee.json`") 

# without `collect()`:
count(db, gender, marital_status)
##  # Source:   lazy query [?? x 3]
##  # Database: DrillConnection
##  # Groups:   gender
##    gender marital_status     n
##    <chr>  <chr>          <dbl>
##  1 F      S                297
##  2 M      M                278
##  3 M      S                276
##  4 F      M                304

count(db, gender, marital_status) %>% collect()
##  # A tibble: 4 x 3
##  # Groups:   gender [2]
##    gender marital_status     n
##    <chr>  <chr>          <dbl>
##  1 F      S                297
##  2 M      M                278
##  3 M      S                276
##  4 F      M                304

group_by(db, position_title) %>%
  count(gender) -> tmp2

group_by(db, position_title) %>%
  count(gender) %>%
  ungroup() %>%
  mutate(full_desc = ifelse(gender == "F", "Female", "Male")) %>%
  collect() %>%
  select(Title = position_title, Gender = full_desc, Count = n)
##  # A tibble: 30 x 3
##     Title                  Gender Count
##     <chr>                  <chr>  <dbl>
##   1 President              Female     1
##   2 VP Country Manager     Male       3
##   3 VP Country Manager     Female     3
##   4 VP Information Systems Female     1
##   5 VP Human Resources     Female     1
##   6 Store Manager          Female    13
##   7 VP Finance             Male       1
##   8 Store Manager          Male      11
##   9 HQ Marketing           Female     2
##  10 HQ Information Systems Female     4
##  # … with 20 more rows

arrange(db, desc(employee_id)) %>% print(n = 20)
##  # Source:     table<cp.`employee.json`> [?? x 20]
##  # Database:   DrillConnection
##  # Ordered by: desc(employee_id)
##     employee_id full_name first_name last_name position_id position_title store_id department_id birth_date hire_date
##     <chr>       <chr>     <chr>      <chr>     <chr>       <chr>          <chr>    <chr>         <chr>      <chr>    
##   1 999         Beverly … Beverly    Dittmar   17          Store Permane… 8        17            1914-02-02 1998-01-…
##   2 998         Elizabet… Elizabeth  Jantzer   17          Store Permane… 8        17            1914-02-02 1998-01-…
##   3 997         John Swe… John       Sweet     17          Store Permane… 8        17            1914-02-02 1998-01-…
##   4 996         William … William    Murphy    17          Store Permane… 8        17            1914-02-02 1998-01-…
##   5 995         Carol Li… Carol      Lindsay   17          Store Permane… 8        17            1914-02-02 1998-01-…
##   6 994         Richard … Richard    Burke     17          Store Permane… 8        17            1914-02-02 1998-01-…
##   7 993         Ethan Bu… Ethan      Bunosky   17          Store Permane… 8        17            1914-02-02 1998-01-…
##   8 992         Claudett… Claudette  Cabrera   17          Store Permane… 8        17            1914-02-02 1998-01-…
##   9 991         Maria Te… Maria      Terry     17          Store Permane… 8        17            1914-02-02 1998-01-…
##  10 990         Stacey C… Stacey     Case      17          Store Permane… 8        17            1914-02-02 1998-01-…
##  11 99          Elizabet… Elizabeth  Horne     18          Store Tempora… 6        18            1976-10-05 1997-01-…
##  12 989         Dominick… Dominick   Nutter    17          Store Permane… 8        17            1914-02-02 1998-01-…
##  13 988         Brian Wi… Brian      Willeford 17          Store Permane… 8        17            1914-02-02 1998-01-…
##  14 987         Margaret… Margaret   Clendenen 17          Store Permane… 8        17            1914-02-02 1998-01-…
##  15 986         Maeve Wa… Maeve      Wall      17          Store Permane… 8        17            1914-02-02 1998-01-…
##  16 985         Mildred … Mildred    Morrow    16          Store Tempora… 8        16            1914-02-02 1998-01-…
##  17 984         French W… French     Wilson    16          Store Tempora… 8        16            1914-02-02 1998-01-…
##  18 983         Elisabet… Elisabeth  Duncan    16          Store Tempora… 8        16            1914-02-02 1998-01-…
##  19 982         Linda An… Linda      Anderson  16          Store Tempora… 8        16            1914-02-02 1998-01-…
##  20 981         Selene W… Selene     Watson    16          Store Tempora… 8        16            1914-02-02 1998-01-…
##  # … with more rows, and 6 more variables: salary <chr>, supervisor_id <chr>, education_level <chr>,
##  #   marital_status <chr>, gender <chr>, management_role <chr>

mutate(db, position_title = tolower(position_title)) %>%
  mutate(salary = as.numeric(salary)) %>%
  mutate(gender = ifelse(gender == "F", "Female", "Male")) %>%
  mutate(marital_status = ifelse(marital_status == "S", "Single", "Married")) %>%
  group_by(supervisor_id) %>%
  summarise(underlings_count = n()) %>%
  collect()
##  # A tibble: 112 x 2
##     supervisor_id underlings_count
##     <chr>                    <dbl>
##   1 0                            1
##   2 1                            7
##   3 5                            9
##   4 4                            2
##   5 2                            3
##   6 20                           2
##   7 21                           4
##   8 22                           7
##   9 6                            4
##  10 36                           2
##  # … with 102 more rows

REST API

dc <- drill_connection("localhost") 

drill_active(dc)
##  [1] TRUE

drill_version(dc)
##  [1] "1.15.0"

drill_storage(dc)$name
##   [1] "cp"       "dfs"      "drilldat" "hbase"    "hdfs"     "hive"     "kudu"     "mongo"    "my"       "s3"

drill_query(dc, "SELECT * FROM cp.`employee.json` limit 100")
##  # A tibble: 100 x 16
##     employee_id full_name first_name last_name position_id position_title store_id department_id birth_date hire_date
##     <chr>       <chr>     <chr>      <chr>     <chr>       <chr>          <chr>    <chr>         <chr>      <chr>    
##   1 1           Sheri No… Sheri      Nowmer    1           President      0        1             1961-08-26 1994-12-…
##   2 2           Derrick … Derrick    Whelply   2           VP Country Ma… 0        1             1915-07-03 1994-12-…
##   3 4           Michael … Michael    Spence    2           VP Country Ma… 0        1             1969-06-20 1998-01-…
##   4 5           Maya Gut… Maya       Gutierrez 2           VP Country Ma… 0        1             1951-05-10 1998-01-…
##   5 6           Roberta … Roberta    Damstra   3           VP Informatio… 0        2             1942-10-08 1994-12-…
##   6 7           Rebecca … Rebecca    Kanagaki  4           VP Human Reso… 0        3             1949-03-27 1994-12-…
##   7 8           Kim Brun… Kim        Brunner   11          Store Manager  9        11            1922-08-10 1998-01-…
##   8 9           Brenda B… Brenda     Blumberg  11          Store Manager  21       11            1979-06-23 1998-01-…
##   9 10          Darren S… Darren     Stanz     5           VP Finance     0        5             1949-08-26 1994-12-…
##  10 11          Jonathan… Jonathan   Murraiin  11          Store Manager  1        11            1967-06-20 1998-01-…
##  # … with 90 more rows, and 6 more variables: salary <chr>, supervisor_id <chr>, education_level <chr>,
##  #   marital_status <chr>, gender <chr>, management_role <chr>

drill_query(dc, "SELECT COUNT(gender) AS gct FROM cp.`employee.json` GROUP BY gender")

drill_options(dc)
##  # A tibble: 179 x 6
##     name                                                        value    defaultValue accessibleScopes kind   optionScope
##     <chr>                                                       <chr>    <chr>        <chr>            <chr>  <chr>      
##   1 debug.validate_iterators                                    FALSE    false        ALL              BOOLE… BOOT       
##   2 debug.validate_vectors                                      FALSE    false        ALL              BOOLE… BOOT       
##   3 drill.exec.functions.cast_empty_string_to_null              FALSE    false        ALL              BOOLE… BOOT       
##   4 drill.exec.hashagg.fallback.enabled                         FALSE    false        ALL              BOOLE… BOOT       
##   5 drill.exec.hashjoin.fallback.enabled                        FALSE    false        ALL              BOOLE… BOOT       
##   6 drill.exec.memory.operator.output_batch_size                16777216 16777216     SYSTEM           LONG   BOOT       
##   7 drill.exec.memory.operator.output_batch_size_avail_mem_fac… 0.1      0.1          SYSTEM           DOUBLE BOOT       
##   8 drill.exec.storage.file.partition.column.label              dir      dir          ALL              STRING BOOT       
##   9 drill.exec.storage.implicit.filename.column.label           filename filename     ALL              STRING BOOT       
##  10 drill.exec.storage.implicit.filepath.column.label           filepath filepath     ALL              STRING BOOT       
##  # … with 169 more rows

drill_options(dc, "json")
##  # A tibble: 10 x 6
##     name                                                    value defaultValue accessibleScopes kind    optionScope
##     <chr>                                                   <chr> <chr>        <chr>            <chr>   <chr>      
##   1 store.hive.maprdb_json.optimize_scan_with_native_reader FALSE false        ALL              BOOLEAN BOOT       
##   2 store.json.all_text_mode                                TRUE  false        ALL              BOOLEAN SYSTEM     
##   3 store.json.extended_types                               TRUE  false        ALL              BOOLEAN SYSTEM     
##   4 store.json.read_numbers_as_double                       FALSE false        ALL              BOOLEAN BOOT       
##   5 store.json.reader.allow_nan_inf                         TRUE  true         ALL              BOOLEAN BOOT       
##   6 store.json.reader.print_skipped_invalid_record_number   TRUE  false        ALL              BOOLEAN SYSTEM     
##   7 store.json.reader.skip_invalid_records                  TRUE  false        ALL              BOOLEAN SYSTEM     
##   8 store.json.writer.allow_nan_inf                         TRUE  true         ALL              BOOLEAN BOOT       
##   9 store.json.writer.skip_null_fields                      TRUE  true         ALL              BOOLEAN BOOT       
##  10 store.json.writer.uglify                                TRUE  false        ALL              BOOLEAN SYSTEM

Working with parquet files

drill_query(dc, "SELECT * FROM dfs.`/usr/local/drill/sample-data/nation.parquet` LIMIT 5")
##  # A tibble: 5 x 4
##    N_NATIONKEY N_NAME    N_REGIONKEY N_COMMENT           
##          <dbl> <chr>           <dbl> <chr>               
##  1           0 ALGERIA             0 haggle. carefully f 
##  2           1 ARGENTINA           1 al foxes promise sly
##  3           2 BRAZIL              1 y alongside of the p
##  4           3 CANADA              1 eas hang ironic, sil
##  5           4 EGYPT               4 y above the carefull

Including multiple parquet files in different directories (note the wildcard support):

drill_query(dc, "SELECT * FROM dfs.`/usr/local/drill/sample-data/nations*/nations*.parquet` LIMIT 5")
##  # A tibble: 5 x 5
##    dir0      N_NATIONKEY N_NAME    N_REGIONKEY N_COMMENT           
##    <chr>           <dbl> <chr>           <dbl> <chr>               
##  1 nationsSF           0 ALGERIA             0 haggle. carefully f 
##  2 nationsSF           1 ARGENTINA           1 al foxes promise sly
##  3 nationsSF           2 BRAZIL              1 y alongside of the p
##  4 nationsSF           3 CANADA              1 eas hang ironic, sil
##  5 nationsSF           4 EGYPT               4 y above the carefull

Drill has built-in support for spatial ops

Via: https://github.com/k255/drill-gis

A common use case is to select data within the boundary of a given polygon:

drill_query(dc, "
select columns[2] as city, columns[4] as lon, columns[3] as lat
    from cp.`sample-data/CA-cities.csv`
    where
        ST_Within(
            ST_Point(columns[4], columns[3]),
            ST_GeomFromText(
                'POLYGON((-121.95 37.28, -121.94 37.35, -121.84 37.35, -121.84 37.28, -121.95 37.28))'
                )
            )
")
##  # A tibble: 7 x 3
##    city        lon          lat       
##    <chr>       <chr>        <chr>     
##  1 Burbank     -121.9316233 37.3232752
##  2 San Jose    -121.8949555 37.3393857
##  3 Lick        -121.8457863 37.2871647
##  4 Willow Glen -121.8896771 37.3085532
##  5 Buena Vista -121.9166227 37.3213308
##  6 Parkmoor    -121.9307898 37.3210531
##  7 Fruitdale   -121.932746  37.31086

sergeant Metrics

Lang  # Files  (%)  LoC  (%)  Blank lines  (%)  # Lines  (%)
Rmd   1        1    55   1    54           1    89       1

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

sergeant's People

Contributors: ahine, alistaire47, hrbrmstr, jameslamb, scottcame, trinker

sergeant's Issues

Problem with installing sergeant

Hi,

I am encountering problems in installing sergeant.

Upon installing

devtools::install_github("hrbrmstr/sergeant")

I am getting the error below:
Error: package or namespace load failed for 'sergeant':
.onLoad failed in loadNamespace() for 'rJava', details:
call: inDL(x, as.logical(local), as.logical(now), ...)
error: unable to load shared object 'C:/Users/herma/Documents/R/win-library/3.5/rJava/libs/i386/rJava.dll':
LoadLibrary failure: %1 is not a valid Win32 application.

Error: loading failed
Execution halted
*** arch - x64
ERROR: loading failed for 'i386'

  • removing 'C:/Users/herma/Documents/R/win-library/3.5/sergeant'
    In R CMD INSTALL

version output:
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 5.0
year 2018
month 04
day 23
svn rev 74626
language R
version.string R version 3.5.0 (2018-04-23)
nickname Joy in Playing

Support field type metadata prior to rapidjsonr switch

Current plan is to use the new metadata field and deliberately convert result sets using the specified field types and provide the following warning (once per R session) whenever BIGINT is encountered:

Warning message:
One or more columns are of type BIGINT. The sergeant package is in the process of switching to the use of the rapidjsonr package in an effort to provide support for this data type. Until then, BIGINT columns will still be converted to numeric since that's how jsonlite::fromJSON() works.

If you really need BIGINT/integer64 support, consider using the R ODBC interface to Apache Drill with the MapR ODBC drivers.

This informational warning will only be shown once per R session.

This will go away once #26 is done.

Switch to rapidjsonr

As of Drill 1.15.0 we now know the data types of things returned so we can use this to have proper 64-bit integers as well as proper types all 'round.

table identifier escaping causing dplyr interface errors

Not sure exactly what changed, but some version somewhere changed so I'm now getting errors when using the dplyr interface because it's putting double-quotes around the table identifier. Reprex:

> library(sergeant)
> tbl(src_drill('localhost'), 'cp.`employee.json`')
Request failed [500]. Retrying in 1 seconds...
Request failed [500]. Retrying in 2 seconds...
Request failed [500]. Retrying in 1.8 seconds...
Request failed [500]. Retrying in 1.7 seconds...
# Source:   table<cp.`employee.json`> [?? x 8]
# Database: DrillConnection
Warning message:
PARSE ERROR: Encountered "FROM \"" at line 1, column 1.
SQL Query: SELECT *
FROM "cp.`employee.json`"
^
LIMIT 11

Original Query:

  1: SELECT *
  2: FROM "cp.`employee.json`"
  3: LIMIT 11

Query Profile Error Link:
http://localhost:8047/profiles/772b3c22-8d04-43d5-b733-9e2ae0bcbb50
> con <- drill_connection('localhost')

If I try to run that query directly, it fails:

> drill_query(
+     con,
+     'SELECT *
+      FROM "cp.`employee.json`"
+      LIMIT 11'
+ )
  |======================================================================| 100%
Request failed [500]. Retrying in 1.4 seconds...
  |======================================================================| 100%
Request failed [500]. Retrying in 1.9 seconds...
  |======================================================================| 100%
Query ==> SELECT *      FROM "cp.`employee.json`"      LIMIT 11
PARSE ERROR: Encountered "FROM \"" at line 1, column 6.

SQL Query: SELECT *
     FROM "cp.`employee.json`"
     ^
     LIMIT 11

[Error Id: dc495019-3219-4b88-b5f8-9a4fd657dc0f ]

whereas if I remove the quotes around cp.`employee.json` it works:

> drill_query(
+     con,
+     'SELECT *
+      FROM cp.`employee.json`
+      LIMIT 11'
+ )
  |======================================================================| 100%
# A tibble: 11 x 16
   employee_id full_name first_name last_name position_id position_title
         <dbl> <chr>     <chr>      <chr>           <dbl> <chr>
 1           1 Sheri No… Sheri      Nowmer              1 President
 2           2 Derrick … Derrick    Whelply             2 VP Country Ma…
 3           4 Michael … Michael    Spence              2 VP Country Ma…
 4           5 Maya Gut… Maya       Gutierrez           2 VP Country Ma…
 5           6 Roberta … Roberta    Damstra             3 VP Informatio…
 6           7 Rebecca … Rebecca    Kanagaki            4 VP Human Reso…
 7           8 Kim Brun… Kim        Brunner            11 Store Manager
 8           9 Brenda B… Brenda     Blumberg           11 Store Manager
 9          10 Darren S… Darren     Stanz               5 VP Finance
10          11 Jonathan… Jonathan   Murraiin           11 Store Manager
11          12 Jewel Cr… Jewel      Creek              11 Store Manager
# … with 10 more variables: store_id <dbl>, department_id <dbl>,
#   birth_date <chr>, hire_date <chr>, salary <dbl>, supervisor_id <dbl>,
#   education_level <chr>, marital_status <chr>, gender <chr>,
#   management_role <chr>

Results are the same in the Drill shell. I assume this is a result of a version or setting changing somewhere, but so far I can't locate it. What I'm running:

software version
macOS 10.14.6
Apache Drill 1.18.0
R 4.0.3
sergeant 0.9.0
dbplyr 2.0.0
dplyr 1.0.2

Happy to write a PR to fix this if this is something that can be fixed here, but so far I can't figure out what has to change.

Should dplyr idioms work if `tbl()` source is a custom SQL statement?

Are dplyr idioms expected to work if the src of a tbl() call is a Drill SELECT statement instead of a bare table name? It looks like column name evaluation isn't working as expected in these cases. A reprex of the behavior I see under v0.8.0 is below:

library(sergeant)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

ds <- src_drill()
emp_full <- tbl(ds, "cp.`employee.json`")
emp_full %>% select(full_name)
#> # Source:   lazy query [?? x 1]
#> # Database: DrillConnection
#>    full_name        
#>    <chr>            
#>  1 Sheri Nowmer     
#>  2 Derrick Whelply  
#>  3 Michael Spence   
#>  4 Maya Gutierrez   
#>  5 Roberta Damstra  
#>  6 Rebecca Kanagaki 
#>  7 Kim Brunner      
#>  8 Brenda Blumberg  
#>  9 Darren Stanz     
#> 10 Jonathan Murraiin
#> # ... with more rows

emp_partial <- tbl(ds, "SELECT full_name from cp.`employee.json`") #edits? we don't need no stinkin' edits!
emp_partial %>% select(full_name)
#> Error in .f(.x[[i]], ...): object 'full_name' not found

Created on 2019-01-18 by the reprex package (v0.2.1)

drill_query errors with semicolon completion

Hi! Maybe I just lack experience with SQL APIs, but I find it odd that a query passed to drill_query without a terminating semicolon will work but with one will error out:

library(sergeant)

dc <- drill_connection() 

dc %>% drill_query('SELECT full_name, salary FROM cp.`employee.json` LIMIT 3')
#> # A tibble: 3 x 2
#>         full_name salary
#> *           <chr>  <dbl>
#> 1    Sheri Nowmer  80000
#> 2 Derrick Whelply  40000
#> 3  Michael Spence  40000

dc %>% drill_query('SELECT full_name, salary FROM cp.`employee.json` LIMIT 3;')
#> Query ==> SELECT full_name, salary FROM cp.`employee.json` LIMIT 3;
#> PARSE ERROR: Encountered ";" at line 1, column 57.
#> Was expecting one of:
#>     <EOF> 
#>     "OFFSET" ...
#>     "FETCH" ...
#>     
#> 
#> SQL Query SELECT full_name, salary FROM cp.`employee.json` LIMIT 3;
#>                                                         ^
#> 
#> 
#> [Error Id: a172e1a9-833c-4d89-9b83-de4e250cd1c5 on 192.168.1.3:31010]

whereas the Drill shell only works with a semicolon (without it expects a continuation of the query):

0: jdbc:drill:zk=local> SELECT full_name, salary FROM cp.`employee.json` LIMIT 3;
+--------------------+----------+
|     full_name      |  salary  |
+--------------------+----------+
| Sheri Nowmer       | 80000.0  |
| Derrick Whelply    | 40000.0  |
| Michael Spence     | 40000.0  |
+--------------------+----------+
3 rows selected (0.102 seconds)

and when run through the web interface both are fine.

Looking at query.r, the whole query is passed via httr::POST, so the error seems to be a product of how the API interprets queries. I'm not sure if this is intended behavior (the Drill API docs don't use a semicolon, but the SQL docs do), but from an R perspective, it seems like a good idea to either

  • make note of the restriction/behavior in ?drill_query or
  • add query <- sub(";$", "", query) or the like to its code so it works regardless.

Thoughts?
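The second option above is a tiny guard; a minimal sketch (helper name hypothetical), which strips a single trailing semicolon before the query is POSTed:

```r
# Hypothetical helper: drop a trailing semicolon (and surrounding
# whitespace) so queries work whether or not the caller terminates them.
strip_trailing_semicolon <- function(query) {
  sub(";\\s*$", "", trimws(query))
}

strip_trailing_semicolon("SELECT full_name FROM cp.`employee.json` LIMIT 3;")
# "SELECT full_name FROM cp.`employee.json` LIMIT 3"
```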

dplyr operations fail on columns whose name is a reserved word

To translate a query like:

select `Year`,`Month` from dfs.`..../AIRLINE/` LIMIT 5;

I would use something like:

airline %>% select(Year) 

which fails with:

warning message:
In .local(res, ...) : PARSE ERROR: Encountered "Year" at line 1, column 9.
Was expecting one of:
...
...

due to missing quotes.

Same for

airline %>% select(`Year`) 
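Drill itself escapes reserved identifiers with backticks rather than double quotes, so the fix amounts to emitting backtick-quoted column names. A hypothetical helper illustrating the escaping the dplyr layer would need to produce:

```r
# Hypothetical helper: wrap an identifier in backticks (Drill's quote
# character), doubling any embedded backticks.
quote_drill_identifier <- function(x) {
  paste0("`", gsub("`", "``", x, fixed = TRUE), "`")
}

quote_drill_identifier("Year")
# "`Year`"
```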

Implement write table methods

It would be nice if the sergeant package could create new tables as I can do with other database drivers. Something like this:

library(sergeant)
library(dplyr)

conn <- dbConnect(Drill())

# Load original data
original_data <- tbl(conn, "`dfs.downloads`.`original_data`")

# Do some operation
edited_data <- original_data %>%
  filter(name == "John")

# Write the result to a table without loading it into R
dbWriteTable(conn, name = "`dfs.downloads`.`edited_data`", value = edited_data)
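Drill can already materialize query results via CREATE TABLE AS (CTAS) on a writable workspace (e.g. dfs.tmp), so until a dbWriteTable() method exists this can be emulated by sending a CTAS statement through drill_query(). A sketch with a hypothetical build_ctas() helper; running the result requires a live connection:

```r
# Hypothetical sketch: emulate a "write table" by issuing a Drill
# CREATE TABLE AS (CTAS) statement against a writable workspace.
# build_ctas() only assembles the SQL string.
build_ctas <- function(target, select_sql) {
  sprintf("CREATE TABLE %s AS\n%s", target, select_sql)
}

sql <- build_ctas(
  "dfs.tmp.`edited_data`",
  "SELECT * FROM dfs.downloads.`original_data` WHERE name = 'John'"
)
cat(sql)
# drill_query(dc, sql)  # with a live dc <- drill_connection(...)
```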

Support dbQuoteIdentifier method for `src_drill` class

Tested under 0.8.0 branch

It looks like the src_drill() class doesn't support the dbQuoteIdentifier() method. This keeps the super nifty glue::glue_sql() function from being as useful as it could be.

library(sergeant)
con <- src_drill()

fields_to_select <- c("first_name", "last_name")

glue::glue_sql("SELECT {`fields_to_select`*} FROM cp.`employee.json` LIMIT 20", .con = con)
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'dbQuoteIdentifier' for signature '"src_drill", "character"'

Created on 2019-01-10 by the reprex package (v0.2.1)

drill_jdbc() returns error

Hi Bob,

I tried
drill_jdbc("localhost")

and it returns error:

Error in stop("Package `", package, "` required for `", fun, "`.\n", "Please install and try again.",  : 
  argument "fun" is missing, with no default

Looking at the code in utils.r

try_require <- function(package, fun) {
  if (requireNamespace(package, quietly = TRUE)) {
    library(package, character.only = TRUE)
    return(invisible())
  }

  stop("Package `", package, "` required for `", fun , "`.\n", # nocov start
    "Please install and try again.", call. = FALSE) # nocov end
}

The arg fun is required and has no default.

When try_require is called, the 2nd arg fun is missing.
I suggest setting fun = "" in the function definition.

Regards,
Herman
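The suggested fix is small: give fun a default so the stop() message can always be built even when the second argument is omitted. A sketch of the patched helper:

```r
# Patched version of the helper from utils.r: `fun` now defaults to ""
# so stop() never fails on a missing argument.
try_require <- function(package, fun = "") {
  if (requireNamespace(package, quietly = TRUE)) {
    library(package, character.only = TRUE)
    return(invisible())
  }

  stop("Package `", package, "` required for `", fun, "`.\n",
       "Please install and try again.", call. = FALSE)
}
```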

make_server not getting connection URL correctly

I could PR a fix for this, but I'm not sure how much it has to do with package version compatibility and how much other refactoring is required related to dbplyr/dplyr updates. Currently using CRAN version of everything.

library(sergeant)
ncdr <- src_drill(host = "localhost", port=22023L)

# The connection appears to work
ncdr
#> src:  DrillConnection
#> tbls: INFORMATION_SCHEMA, cp.default, dfs.default, dfs.root, dfs.tmp, sys

#But I can't do a query
snaps <- drill_query(ncdr, "SELECT * FROM INFORMATION_SCHEMA")
#> Error: length(url) == 1 is not TRUE

# It looks like there's no URL to send the query to
sergeant:::make_server(ncdr)
#> character(0)

#This seems to be because `make_server` doesn't get the host url correctly
sergeant:::make_server
#> function (drill_con) 
#> {
#>     sprintf("%s://%s:%s", ifelse(drill_con$ssl[1], "https", "http"), 
#>         drill_con$host, drill_con$port)
#> }
#> <environment: namespace:sergeant>
ncdr$host
#> NULL
ncdr$con@host
#> [1] "localhost"
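One possible patch, sketched with hypothetical names: fall back to the slots on the wrapped DBI connection object when the fields aren't present on the src itself. The mock S4 class below only stands in for the real connection object so the sketch is self-contained:

```r
library(methods)

# Mock of the DBI connection object a src_drill wraps (slot names assumed).
setClass("MockDrillConnection",
         slots = c(host = "character", port = "integer", ssl = "logical"))

`%||%` <- function(a, b) if (is.null(a)) b else a

# Hypothetical patched helper: look on the src object first, then fall
# back to the slots of the wrapped connection in drill_con$con.
make_server2 <- function(drill_con) {
  host <- drill_con$host %||% drill_con$con@host
  port <- drill_con$port %||% drill_con$con@port
  ssl  <- drill_con$ssl  %||% drill_con$con@ssl
  sprintf("%s://%s:%s", if (isTRUE(ssl[1])) "https" else "http", host, port)
}

make_server2(list(con = new("MockDrillConnection",
                            host = "localhost", port = 22023L, ssl = FALSE)))
# "http://localhost:22023"
```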
Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.3.2 (2016-10-31)
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       <NA>                        
#>  date     2017-07-23
#> Packages -----------------------------------------------------------------
#>  package    * version    date       source                            
#>  assertthat   0.2.0      2017-04-11 cran (@0.2.0)                     
#>  backports    1.1.0      2017-05-22 cran (@1.1.0)                     
#>  base       * 3.3.2      2016-11-01 local                             
#>  bindr        0.1        2016-11-13 CRAN (R 3.3.2)                    
#>  bindrcpp     0.2        2017-06-17 CRAN (R 3.3.2)                    
#>  colorspace   1.3-2      2016-12-14 cran (@1.3-2)                     
#>  curl         2.8.1      2017-07-21 CRAN (R 3.3.2)                    
#>  datasets   * 3.3.2      2016-11-01 local                             
#>  DBI        * 0.7        2017-06-18 CRAN (R 3.3.2)                    
#>  dbplyr     * 1.1.0      2017-06-27 CRAN (R 3.3.2)                    
#>  devtools     1.13.2     2017-06-02 CRAN (R 3.3.2)                    
#>  digest       0.6.12     2017-01-27 CRAN (R 3.3.1)                    
#>  dplyr      * 0.7.2      2017-07-20 CRAN (R 3.3.2)                    
#>  evaluate     0.10       2016-10-11 CRAN (R 3.3.1)                    
#>  glue         1.1.1      2017-06-21 CRAN (R 3.3.2)                    
#>  graphics   * 3.3.2      2016-11-01 local                             
#>  grDevices  * 3.3.2      2016-11-01 local                             
#>  hms          0.3        2016-11-22 cran (@0.3)                       
#>  htmltools    0.3.6      2017-04-28 CRAN (R 3.3.2)                    
#>  httr         1.2.1      2016-07-03 CRAN (R 3.3.2)                    
#>  jsonlite     1.5        2017-06-01 CRAN (R 3.3.2)                    
#>  knitr        1.16       2017-05-18 cran (@1.16)                      
#>  magrittr     1.5        2014-11-22 CRAN (R 3.1.2)                    
#>  memoise      1.0.0.9001 2016-12-13 Github (hadley/memoise@e392c7b)   
#>  methods    * 3.3.2      2016-11-01 local                             
#>  munsell      0.4.3      2016-02-13 CRAN (R 3.3.0)                    
#>  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.3.2)                    
#>  plyr         1.8.4      2016-06-08 CRAN (R 3.3.0)                    
#>  purrr        0.2.2.2    2017-05-11 CRAN (R 3.3.2)                    
#>  R6           2.2.2      2017-06-17 CRAN (R 3.3.2)                    
#>  Rcpp         0.12.12    2017-07-15 CRAN (R 3.3.2)                    
#>  readr        1.1.1      2017-05-16 CRAN (R 3.3.2)                    
#>  rlang        0.1.1      2017-05-18 cran (@0.1.1)                     
#>  rmarkdown    1.3.9002   2017-02-06 Github (rstudio/rmarkdown@c4908dc)
#>  rprojroot    1.2        2017-01-16 CRAN (R 3.3.1)                    
#>  scales       0.4.1      2016-11-09 cran (@0.4.1)                     
#>  sergeant   * 0.5.2      2017-07-17 CRAN (R 3.3.2)                    
#>  stats      * 3.3.2      2016-11-01 local                             
#>  stringi      1.1.5      2017-04-07 cran (@1.1.5)                     
#>  stringr      1.2.0      2017-02-18 cran (@1.2.0)                     
#>  tibble       1.3.3      2017-05-28 CRAN (R 3.3.2)                    
#>  tools        3.3.2      2016-11-01 local                             
#>  utils      * 3.3.2      2016-11-01 local                             
#>  withr        1.0.2      2016-06-20 CRAN (R 3.3.1)                    
#>  yaml         2.1.14     2016-11-12 CRAN (R 3.3.1)

Improve drill_up() interactive experience

I'm thinking through how the drill_up() interactive experience could be made a bit easier. That function returns a stevedore object, which the user is responsible for terminating (as it says in the fine documentation 😉). If the user doesn't capture the stevedore object coming back from the drill_up() call, they have to get() and then stop() the container manually, which could be further complicated if they also failed to capture the container ID from the drill_up() call.

Any thoughts on the best way of documenting a virtuous workflow? Can't protect the user completely here, but I want to make it a bit clearer for a user who doesn't have previous experience with stevedore. Perhaps a more verbose message when drill_up() is in interactive mode? A vignette? (To be clear, I'm considering writing the PR myself...not just adding things to your queue!)

This may or may not be motivated by this author's first stumblings through the stevedore interface. 😜
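For anyone else stumbling here, the cleanup story might look roughly like this (a sketch assuming drill_up() returns a stevedore container handle, as the docs describe, and that the container is named "drill"):

```r
# Capture the handle so the container can be stopped cleanly later.
dr <- drill_up(data_dir = "~/data")

# ... do Drill work ...

# stevedore container objects expose stop()/remove() methods:
dr$stop()
dr$remove()

# If the handle was lost, it can usually be recovered by name:
docker <- stevedore::docker_client()
drill  <- docker$container$get("drill")
drill$stop()
```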

Support for authentication in drill_connection

Hi! I read in your README that sergeant tools were not designed with authentication/encryption in mind due to the lack of need for support. We have been using sergeant at my work to run drill queries against our hdfs for a while, and are currently in the process of migrating to a new drill endpoint that requires authentication. We would love to continue using the sergeant library to run these queries, so if you have the ability to implement authentication we would greatly appreciate it!

BIGINT column shows up as col_integer()

I am not sure if this is a bug or not yet. But I have a parquet table with columns of datatype BIGINT.
When querying using drill_connection(), they show up as col_integer(). I am not sure what the range of col_integer is, but if it's indeed 4 bytes then there will be loss of data when querying BIGINT columns whose values exceed the integer type's min/max.
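For context: base R integers are 32-bit signed, so anything beyond ±2,147,483,647 overflows a col_integer() parse. One possible workaround (a sketch only; `con`, the column name, and the table path are placeholders) is to cast the column to VARCHAR in the query and convert client-side with the bit64 package:

```r
library(bit64)

# Sketch: keep full 64-bit precision by shipping the column as text,
# then converting to integer64 in R.
res <- drill_query(
  con,
  "SELECT CAST(big_col AS VARCHAR) AS big_col FROM dfs.`/path/to/table`"
)
res$big_col <- as.integer64(res$big_col)
```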

drill_up fails (maybe due to Docker Toolbox?)

I am eager to try out what is described in the blog post about the 0.8.0 version, but after installing as instructed

devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant", ref="0.8.0")

and then running the code below, I got a failure:

> library(sergeant)
> library(tidyverse)
── Attaching packages ───────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  2.0.1     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.3.1     ✔ forcats 0.3.0
── Conflicts ──────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
> # start Docker in terminal, then open 
> dr <- drill_up(data_dir = here::here("data-raw"))
Drill container started. Waiting for the service to become active (this may take up to 30s).
Error in drill_up(data_dir = here::here("data-raw")) : 
  Could not connect to Drill container.
> 

I have been able to run Drill manually, mapping the local file system to my directory of interest.
On the Docker side I have an old MBP, so I use Docker Toolbox.
(That may be the reason for the failure.)
If I run drill image from the command line I get the following:

> docker run drill/apache-drill                                                                                         (master|✚2✭)
Jan 27, 2019 2:25:49 PM org.jline.utils.Log logr
WARNING: Unable to create a system terminal, creating a dumb terminal (enable debug logging for more information)
Apache Drill 1.15.0
"The only truly happy people are children, the creative minority and Drill users."
0: jdbc:drill:zk=local> Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl

I tried the hello_world Docker example using the stevedore package and it works...so somewhat Docker (at least the easy bit) seems to work.

I then decided to dig into drill_up implementation and executed line by line to debug.

> docker <- stevedore::docker_client()
> (drill <- docker$container$run(image = "drill/apache-drill:1.15.0", name = "drill", cmd = "/usr/local/bin/bash", volumes = sprintf("%s:/data", here::here("data-raw"))))
E> Jan 27, 2019 3:01:03 PM org.jline.utils.Log logr
E> WARNING: Unable to create a system terminal, creating a dumb terminal (enable debug logging for more information)
E> Apache Drill 1.15.0
E> "Say hello to my little Drill."
E> Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl
O> 0: jdbc:drill:zk=local> <docker_run_output>
  $container:
    <docker_container>
      id: 2db3d96f27f55d35017dd0361531b37e25c80fbbe9e01cf731654c15adb1a773
      name: drill

  $logs:
    Jan 27, 2019 3:01:03 PM org.jline.utils.Log logr
    WARNING: Unable to create a system terminal, creating a dumb terminal (enable debug logging for more information)
    Apache Drill 1.15.0
    "Say hello to my little Drill."
    Closing: org.apache.drill.jdbc.impl.DrillConnectionImpl
    0: jdbc:drill:zk=local> 
> 

Then all 30 attempts of drill_active() fail...(here I just show one manual try :-) )

> (drill_con <- drill_connection("localhost"))
<Drill REST API Direct Connection to localhost:8047>
> drill_active(drill_con)
[1] FALSE

I am out of ideas to try to make it work...any direction is welcome.
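One guess worth checking on Docker Toolbox setups: the Docker daemon there runs inside a VirtualBox VM, so published container ports live on the VM's IP, not on localhost. A hedged workaround is to point sergeant at the machine's address:

```r
# Sketch: on Docker Toolbox, ports are published on the boot2docker VM.
# Query the VM address (machine name "default" is an assumption) and
# connect sergeant there instead of localhost.
vm_ip <- trimws(system("docker-machine ip default", intern = TRUE))
drill_con <- drill_connection(vm_ip)
drill_active(drill_con)
```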

Column data types not being parsed under v0.8.0-dev

Under sergeant master (20190106), csvh files being read in over DBI (src_drill()) are coming across as character columns, while the same file parsed under CRAN v0.5.0 sergeant is properly coerced into R data types.

Apologies for the lack of reprex here. I'm in the thick of a drill-powered project and wanted to throw this early report up in case this was something related to the rapidjsonr conversion work and the new dev branches. If previously unknown, I'll generate a reprex when my hair is a bit less on fire. ;)

build_sql moved from dplyr to dbplyr

Hi! I saw your posts about this on Twitter and it looked really cool, so I thought I'd try it out. However, that involved another bout in the battle to get rJava to compile from source (which I appear to have won this time!), which led me to a sergeant installation error:

Error : object ‘build_sql’ is not exported by 'namespace:dplyr'

which is true for devel-dplyr, as it's been moved to dbplyr. (It's actually the case for most of the database-focused dplyr imports; build_sql is just the first it runs into.) Reinstalling CRAN-dplyr allows sergeant to be installed, but it still can't be used with devel-dplyr, which is a shame, as the update contains a lot of useful features.

According to the blog post about the dplyr 0.6 release candidate, it was to be submitted to CRAN yesterday. I'm not sure if that happened or quite when it will arrive on the mirrors, but for sergeant to be used at that point sergeant/R/sergeant-package.r will need to point to dbplyr.

If you want to update it now, I'm happy to make a PR, though it's not a very complicated change, really.

port 8047

Hi, I am testing sergeant against a MapR cluster, and to my knowledge we should be using either 31010 or the ZooKeeper port; 8047 is the drillbit management page in MapR.

It looks like sergeant cannot use 31010, only 8047. Is this correct?
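For context: 8047 is Drill's HTTP/REST port, while 31010 is the JDBC user port. sergeant's REST-based functions (drill_connection(), src_drill(), the DBI driver) talk only to 8047; reaching 31010 or ZooKeeper requires the JDBC interface, which as of 0.7.0+ lives in the separate sergeant.caffeinated package. A rough sketch (the drill_jdbc() argument form may differ between versions):

```r
# REST interface: must target the web/REST port (8047 by default)
con_rest <- drill_connection(host = "drill-node", port = 8047L)

# JDBC user port (31010) needs the JDBC driver instead; hostname here
# is a placeholder and use_zk = FALSE means a direct drillbit connection
con_jdbc <- drill_jdbc("drill-node:31010", use_zk = FALSE)
```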

drill_version(ds) failed

Environment

R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_Singapore.1252  LC_CTYPE=English_Singapore.1252    LC_MONETARY=English_Singapore.1252
[4] LC_NUMERIC=C                       LC_TIME=English_Singapore.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.3.0   stringr_1.3.1   purrr_0.2.5     readr_1.1.1     tidyr_0.8.1     tibble_1.4.2   
 [7] ggplot2_3.0.0   tidyverse_1.2.1 sergeant_0.5.2  dbplyr_1.2.2    dplyr_0.7.6     DBI_1.0.0      
[13] rJava_0.9-10   

How to replicate

ds <- src_drill(drill_ip) 
ds
>src:  DrillConnection
>tbls: cp.default, dfs.default, dfs.root, dfs.tmp, INFORMATION_SCHEMA, postgres.information_schema,
postgres.pg_catalog, postgres.postgres, postgres.public, postgres, sys

db <- tbl(ds, "cp.`employee.json`")
db
store_id gender department_id birth_date supervisor_id last_name position_title hire_date          
<int> <chr>          <int> <date>             <int> <chr>     <chr>          <dttm>             
1        0 F                  1 1961-08-26             0 Nowmer    President      1994-12-01 00:00:00
2        0 M                  1 1915-07-03             1 Whelply   VP Country Ma~ 1994-12-01 00:00:00
3        0 M                  1 1969-06-20             1 Spence    VP Country Ma~ 1998-01-01 00:00:00
4        0 F                  1 1951-05-10             1 Gutierrez VP Country Ma~ 1998-01-01 00:00:00
5        0 F                  2 1942-10-08             1 Damstra   VP Informatio~ 1994-12-01 00:00:00
6        0 F                  3 1949-03-27             1 Kanagaki  VP Human Reso~ 1994-12-01 00:00:00
7        9 F                 11 1922-08-10             5 Brunner   Store Manager  1998-01-01 00:00:00
8       21 F                 11 1979-06-23             5 Blumberg  Store Manager  1998-01-01 00:00:00
9        0 M                  5 1949-08-26             1 Stanz     VP Finance     1994-12-01 00:00:00
10        1 M                 11 1967-06-20             5 Murraiin  Store Manager  1998-01-01 00:00:00
# ... with more rows, and 8 more variables: management_role <chr>, salary <dbl>, marital_status <chr>,
#   full_name <chr>, employee_id <int>, education_level <chr>, first_name <chr>, position_id <int>


drill_version(ds)
Error in is.url(url) : length(url) == 1 is not TRUE

Investigation

sergeant:::make_server
function (drill_con) 
{
  sprintf("%s://%s:%s", ifelse(drill_con$ssl[1], "https", "http"), 
          drill_con$host, drill_con$port)
}
<bytecode: 0x0000000022b58b98>
<environment: namespace:sergeant>
str(ds)  
List of 1
$ con:Formal class 'DrillConnection' [package "sergeant"] with 5 slots
.. ..@ host    : chr "172.27.141.128"
.. ..@ port    : int 8047
.. ..@ ssl     : logi FALSE
.. ..@ username: chr(0) 
.. ..@ password: chr(0) 
- attr(*, "class")= chr [1:3] "src_drill" "src_sql" "src"

Accessing the ssl field with `$` has a problem:

ds$ssl[1]
NULL

# Correct way
ds$con@ssl
FALSE

This code needs to be fixed in the make_server function.

function (drill_con) 
{
  sprintf("%s://%s:%s", ifelse(drill_con$con@ssl, "https", "http"), 
          drill_con$con@host, drill_con$con@port)
}

SQL Statement error

Hi,
In going through your code in
https://rud.is/rpubs/yelp.html#

I am unable to run this code:

tbl(db, "(SELECT b.name AS name, b.hours[5] AS sat_hrs FROM dfs.d.`/yelp/yelp_academic_dataset_business.json` b)") %>%
  filter(!is.na(sat_hrs)) %>%
  filter(grepl("Saturday", sat_hrs)) %>%
  select(name, sat_hrs) %>%
  collect() %>%
  mutate(sat_hrs = gsub("Saturday ", "", sat_hrs)) %>%
  tidyr::separate(sat_hrs, c("open", "close"), "-") %>%
  filter(open != "0:0" & close != "0:0")

tbl(db, "(SELECT b.name AS name, b.hours[5] AS sat_hrs FROM dfs.d.`/yelp/yelp_academic_dataset_business.json` b)")
returns an error
Error: is.data.frame(df) is not TRUE

tbl does not work with a SQL statement.

Herman
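A workaround that sometimes applies with dbplyr backends (untested against this exact setup) is to mark the statement explicitly as SQL with dplyr::sql() so tbl() builds a subquery rather than quoting the string as a table name:

```r
library(dplyr)

# Sketch of a possible workaround: pass the statement through sql()
# instead of relying on the parenthesized-string form.
yelp <- tbl(db, sql(
  "SELECT b.name AS name, b.hours[5] AS sat_hrs
   FROM dfs.d.`/yelp/yelp_academic_dataset_business.json` b"
))
```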

Provide an automated downloader and way to start drill-embedded?

You write

If there is sufficient desire for an automated downloader and a way to start the drill-embedded server from within R, please file an issue.

Consider the issue filed!

I think this would be great. I've been impressed by how RStudio's packages (blogdown, tinytex, keras, etc) ship with these mini-installers, and it's amazing how much time is saved by sidestepping external-to-R installation steps when I've introduced those tools to students.
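To make the request concrete, a minimal shape such a helper might take is sketched below. The mirror URL pattern, version string, and install location are all assumptions that would need to track Apache's actual release layout, and the launcher is Unix-alike only:

```r
# Hypothetical sketch of an install helper; URL pattern and paths are
# assumptions, not a tested download path.
install_drill <- function(version = "1.16.0",
                          dest = file.path("~", "drill")) {
  tarball <- sprintf("apache-drill-%s.tar.gz", version)
  url <- sprintf("https://archive.apache.org/dist/drill/drill-%s/%s",
                 version, tarball)
  tmp <- file.path(tempdir(), tarball)
  download.file(url, tmp, mode = "wb")
  untar(tmp, exdir = path.expand(dest))
}

start_drill_embedded <- function(drill_home) {
  # launch drill-embedded in the background (Unix-alike sketch)
  system2(file.path(drill_home, "bin", "drill-embedded"), wait = FALSE)
}
```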

"bad pad value" when trying to install from GitHub

Hello and thanks for the package! I can't find a way to install it on my system from GitHub, however:

> devtools::install_github("hrbrmstr/sergeant")
Downloading GitHub repo hrbrmstr/sergeant@master
Error in structure(.External(.C_dotTclObjv, objv), class = "tclObj") : 
  [tcl] bad pad value "2m": must be positive screen distance.

I can use it from CRAN, however:

> installed.packages() %>% as_tibble() %>% filter(Package == "sergeant") %>% flatten() %>% .$Version
[1] "0.5.2"
> library(sergeant)
> db <- src_drill("localhost")
> drill_connection() %>% drill_version()
[1] "1.16.0"

This is my sessionInfo:

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=pt_BR.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] nvimcom_0.9-82 colorout_1.2-1 usethis_1.5.0  devtools_2.0.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1        magrittr_1.5      pkgload_1.0.2     R6_2.4.0         
 [5] rlang_0.3.4       tcltk_3.6.0       tools_3.6.0       pkgbuild_1.0.3   
 [9] parallel_3.6.0    sessioninfo_1.1.1 cli_1.1.0         withr_2.1.2      
[13] remotes_2.0.4     assertthat_0.2.1  rprojroot_1.3-2   digest_0.6.18    
[17] crayon_1.3.4      processx_3.3.1    callr_3.2.0       fs_1.3.1         
[21] ps_1.3.0          curl_3.3          testthat_2.1.1    memoise_1.1.0    
[25] glue_1.3.1        compiler_3.6.0    desc_1.2.0        backports_1.1.4  
[29] prettyunits_1.0.2

Does Drill's REST API support session settings?

I'm running into an issue where session settings don't seem to stick when they're submitted through sergeant. The problem seems to be with the REST API itself, not with sergeant.

Here's an example:

# Packages
library(sergeant)
#> Loading required package: DBI
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: dbplyr
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

# Connect to Drill.
con <- drill_connection()

# Alternatively with drill_use.
drill_use(con, "dfs.tmp")
#> Parsed with column specification:
#> cols(
#>   summary = col_character(),
#>   ok = col_character()
#> )
#> Default schema changed to [dfs.tmp]true
drill_query(con, "CREATE TABLE employee as SELECT * FROM cp.`employee.json`")
#> Query ==> CREATE TABLE employee as SELECT * FROM cp.`employee.json`
#> VALIDATION ERROR: Root schema is immutable. Creating or dropping tables/views is not allowed in root schema.Select a schema using 'USE schema' command.
#> 
#> 
#> [Error Id: d4d341b3-07da-4562-813e-67f0ef89c028 on DESKTOP-73M5KKL.mshome.net:31010]

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.3 (2017-11-30)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  tz       America/New_York            
#>  date     2018-03-01
#> Packages -----------------------------------------------------------------
#>  package    * version date       source        
#>  assertthat   0.2.0   2017-04-11 CRAN (R 3.4.3)
#>  backports    1.1.2   2017-12-13 CRAN (R 3.4.3)
#>  base       * 3.4.3   2017-12-06 local         
#>  bindr        0.1     2016-11-13 CRAN (R 3.4.3)
#>  bindrcpp     0.2     2017-06-17 CRAN (R 3.4.3)
#>  colorspace   1.3-2   2016-12-14 CRAN (R 3.4.3)
#>  compiler     3.4.3   2017-12-06 local         
#>  curl         3.1     2017-12-12 CRAN (R 3.4.3)
#>  datasets   * 3.4.3   2017-12-06 local         
#>  DBI        * 0.7     2017-06-18 CRAN (R 3.4.3)
#>  dbplyr     * 1.2.1   2018-02-19 CRAN (R 3.4.3)
#>  devtools     1.13.5  2018-02-18 CRAN (R 3.4.3)
#>  digest       0.6.15  2018-01-28 CRAN (R 3.4.3)
#>  dplyr      * 0.7.4   2017-09-28 CRAN (R 3.4.3)
#>  evaluate     0.10.1  2017-06-24 CRAN (R 3.4.3)
#>  glue         1.2.0   2017-10-29 CRAN (R 3.4.3)
#>  graphics   * 3.4.3   2017-12-06 local         
#>  grDevices  * 3.4.3   2017-12-06 local         
#>  hms          0.4.1   2018-01-24 CRAN (R 3.4.3)
#>  htmltools    0.3.6   2017-04-28 CRAN (R 3.4.3)
#>  httr         1.3.1   2017-08-20 CRAN (R 3.4.3)
#>  jsonlite     1.5     2017-06-01 CRAN (R 3.4.3)
#>  knitr        1.20    2018-02-20 CRAN (R 3.4.3)
#>  magrittr     1.5     2014-11-22 CRAN (R 3.4.3)
#>  memoise      1.1.0   2017-04-21 CRAN (R 3.4.3)
#>  methods    * 3.4.3   2017-12-06 local         
#>  munsell      0.4.3   2016-02-13 CRAN (R 3.4.3)
#>  pillar       1.1.0   2018-01-14 CRAN (R 3.4.3)
#>  pkgconfig    2.0.1   2017-03-21 CRAN (R 3.4.3)
#>  plyr         1.8.4   2016-06-08 CRAN (R 3.4.3)
#>  purrr        0.2.4   2017-10-18 CRAN (R 3.4.3)
#>  R6           2.2.2   2017-06-17 CRAN (R 3.4.3)
#>  Rcpp         0.12.15 2018-01-20 CRAN (R 3.4.3)
#>  readr        1.1.1   2017-05-16 CRAN (R 3.4.3)
#>  rlang        0.2.0   2018-02-20 CRAN (R 3.4.3)
#>  rmarkdown    1.8     2017-11-17 CRAN (R 3.4.3)
#>  rprojroot    1.3-2   2018-01-03 CRAN (R 3.4.3)
#>  scales       0.5.0   2017-08-24 CRAN (R 3.4.3)
#>  sergeant   * 0.5.2   2017-07-17 CRAN (R 3.4.3)
#>  stats      * 3.4.3   2017-12-06 local         
#>  stringi      1.1.6   2017-11-17 CRAN (R 3.4.2)
#>  stringr      1.3.0   2018-02-19 CRAN (R 3.4.3)
#>  tibble       1.4.2   2018-01-22 CRAN (R 3.4.3)
#>  tools        3.4.3   2017-12-06 local         
#>  utils      * 3.4.3   2017-12-06 local         
#>  withr        2.1.1   2017-12-19 CRAN (R 3.4.3)
#>  yaml         2.1.16  2017-12-12 CRAN (R 3.4.3)

One workaround is to reference dfs.tmp.employee directly when creating the table, but this makes drill_use() not very useful.

Created on 2018-03-01 by the reprex package (v0.2.0).
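Stated as code, the fully-qualified workaround looks like this (just restating the workaround above; it sidesteps session state because each REST request appears to be its own session):

```r
# Workaround sketch: qualify the target schema in the statement itself
# instead of relying on a prior USE via drill_use().
drill_query(
  con,
  "CREATE TABLE dfs.tmp.`employee` AS SELECT * FROM cp.`employee.json`"
)
```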

dplyr interface does not work outside of `cp` schema

Hi,

I tried to use the dplyr interface and it seems to fail with "Error in readr::type_convert(out$rows): is.data.frame(df) is not TRUE"

The reprex is below:

#install.packages("sergeant")
library(sergeant)
#> Loading required package: DBI
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> Loading required package: dbplyr
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(tidyverse)

# drill server IP Address
drill_ip <- "xxx.xxx.xxx.xxx"
drill_jdbc_port <- "31010" 

# Connecting to drill server
# src_drill: dplyr connection
# drill_connection: REST API connection
# drill_jdbc: JDBC connection (if remote, needs zk etc)

# Connect using dplyr
ds_dp <- src_drill(drill_ip) 
tbl(ds_dp, "cp.`employee.json`")
#> # Source:   table<cp.`employee.json`> [?? x 16]
#> # Database: DrillConnection
#>    store_id gender department_id birth_date supervisor_id last_name
#>       <dbl> <chr>          <dbl> <date>             <dbl> <chr>    
#>  1        0 F                  1 1961-08-26             0 Nowmer   
#>  2        0 M                  1 1915-07-03             1 Whelply  
#>  3        0 M                  1 1969-06-20             1 Spence   
#>  4        0 F                  1 1951-05-10             1 Gutierrez
#>  5        0 F                  2 1942-10-08             1 Damstra  
#>  6        0 F                  3 1949-03-27             1 Kanagaki 
#>  7        9 F                 11 1922-08-10             5 Brunner  
#>  8       21 F                 11 1979-06-23             5 Blumberg 
#>  9        0 M                  5 1949-08-26             1 Stanz    
#> 10        1 M                 11 1967-06-20             5 Murraiin 
#> # ... with more rows, and 10 more variables: position_title <chr>,
#> #   hire_date <dttm>, management_role <chr>, salary <dbl>,
#> #   marital_status <chr>, full_name <chr>, employee_id <dbl>,
#> #   education_level <chr>, first_name <chr>, position_id <dbl>
tbl(ds_dp, "SELECT * FROM dfs.`c:/apps/apache-drill-1.14.0/sample-data/nation.parquet` LIMIT 5")
#> Error in readr::type_convert(out$rows): is.data.frame(df) is not TRUE

ds_rest <- drill_connection(drill_ip) 
drill_query(ds_rest,"SELECT * FROM dfs.`c:/apps/apache-drill-1.14.0/sample-data/nation.parquet` LIMIT 5")
#> Parsed with column specification:
#> cols(
#>   N_COMMENT = col_character(),
#>   N_NAME = col_character(),
#>   N_NATIONKEY = col_double(),
#>   N_REGIONKEY = col_double()
#> )
#> # A tibble: 5 x 4
#>   N_COMMENT            N_NAME    N_NATIONKEY N_REGIONKEY
#> * <chr>                <chr>           <dbl>       <dbl>
#> 1 haggle. carefully f  ALGERIA             0           0
#> 2 al foxes promise sly ARGENTINA           1           1
#> 3 y alongside of the p BRAZIL              2           1
#> 4 eas hang ironic, sil CANADA              3           1
#> 5 y above the carefull EGYPT               4           4

Created on 2019-01-09 by the reprex package (v0.2.1)
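Two possible workarounds worth trying here (both sketches, untested against this setup): parenthesize the statement so it reads as a derived table, or mark it explicitly as SQL with dplyr::sql():

```r
# Variant 1: parenthesized derived-table form
tbl(ds_dp,
    "(SELECT * FROM dfs.`c:/apps/apache-drill-1.14.0/sample-data/nation.parquet` LIMIT 5)")

# Variant 2: explicit SQL marker
tbl(ds_dp,
    sql("SELECT * FROM dfs.`c:/apps/apache-drill-1.14.0/sample-data/nation.parquet` LIMIT 5"))
```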

sergeant errors and warnings keeps popping even after removing it

I had installed the sergeant package but could not figure out how to use it.
Meanwhile, strange errors started appearing, mostly when I am inside the debugger and press Q. I kept getting sergeant-related errors even though I had no sergeant code in my source file, so I decided to uninstall it.

Even after uninstalling the package following warning comes on sourcing my R file.

Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘sergeant’

And to top it off, now on entering the Q command in the RStudio debugger the sergeant package is loaded - twice. Where is the package coming from? And how is it called by just the Q command?

Out of memory Issue reading single column from parquet file

Thanks for providing the sergeant package!

My use case is reading a single column of data from a fairly large parquet file. The column I want to read from the parquet file is called occurrenceId and the whole column fits into a character vector of length 70M in R where each value is a string of length 41 characters (it is a unique identifier, and I'd like to check it for uniqueness and presence in R). In R the whole column would occupy about 700M in memory when I inspect it with ls().

I can do this with sparklyr, but the drill sergeant's approach is appealing, being more lightweight. I am struggling with an out-of-memory issue, though. I have 16 GB available, which I think should suffice given the size of the vector in R, so now I'm wondering if this use case is supported or if I'm doing it wrong.

The dataset I'm using is public and can be viewed here https://www.gbif.org/dataset/38b4c89f-584c-41bb-bd8f-cd1def33e92f and it can be downloaded from here in .zip format: http://www.gbif.se/ipt/archive.do?r=artdata

I first tried using vroom and reading directly from the compressed .zip file (details here tidyverse/vroom#116) but the promise of being able to read directly from a comparatively smaller parquet file and being able to just read the columns I need made me turn to the sergeant. So in my attempt to read the parquet file I have first converted the .zip to parquet using sparklyr, like so:

library(sparklyr)
library(dplyr)

# first install spark 2.40 hadoop 2.7 with sparklyr::spark_install()

Sys.setenv("SPARK_MEM" = "12g")

config <- spark_config()
config$`sparklyr.shell.driver-memory` <- '12G'
config$`sparklyr.shell.executor-memory` <- '4G'
config$sparklyr.defaultPackages <- "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M3"
config$spark.cassandra.cassandra.host <- "localhost"
config$spark.driver.maxResultSize <- "4G"
config$spark.executor.cores <- 3

# is pushdown option TRUE?

sc <- spark_connect(master = "local", config = config)

# for this connection, load all records

system.time(
  spark_read_csv(sc, memory = FALSE,
    name = "artdata", path = "file:///home/roger/artdata/artdata.tsv", delimiter = "\t")
)

#user   system  elapsed 
#6.154    7.060 1559.874 

# generate a parquet file based on the dataframe above

system.time(
  spark_write_parquet(
    tbl(sc, "artdata"),  
    "file:///home/roger/artdata/artdata.parquet")
)

#user   system  elapsed 
#14.634   16.586 3816.375 

# the parquet-file is 3.8 GB on disk, smaller than the zip

spark_tbl_handle <- spark_read_parquet(sc, memory = FALSE,
  "artdata", "file:///home/roger/artdata/artdata.parquet")


has_valid_bor <- function() {
  
  bor <- 
    spark_tbl_handle %>%
    count(basisOfRecord) %>%
    collect() %>%
    mutate(is_ok = basisOfRecord %in% c(
        "humanobservation", 
        "machineobservation"
      )
    )

  bor %>% pull(is_ok) %>% all
}

n_rowcount <- function() {
  
  spark_tbl_handle %>%
  summarise(n = n()) %>%
  pull(n)

}


has_valid_id <- function() {
  
  ids <- 
    spark_tbl_handle %>%
    count(occurrenceID) %>%
    filter(n > 1, is.na(occurrenceID)) %>%
    collect()
  
  nrow(ids) == 0
  
}


system.time(
  has_valid_bor()
)

system.time(
  has_valid_id()
)

system.time(
  n_rowcount()
)

sort_artdata <- function() {
  spark_tbl_handle %>%
  arrange(occurrenceID) %>%
  head(10) %>%
  collect()
}

system.time(
  sort_artdata()
)

# sorting in spark takes about 5 minutes...
#user  system elapsed 
#3.182   1.370 282.698

This gives me a parquet file on disk.

I then proceed to attempt to use the sergeant to read the occurrenceId column like so:

library(sergeant)
library(tidyverse)

if (Sys.which("docker") == "")
  stop("Please install docker first - see https://docs.docker.com/install/")

# install and run official Apache Drill software
system("docker stop drill; docker rm drill; docker run -i --name drill -e DRILL_HEAP=10G -v /home/markus/tmp/artdata:/tmp -p 8047:8047 --detach drill/apache-drill:1.16.0 /bin/bash")

dc <- drill_connection("localhost") 

df <- drill_query(dc, "SELECT occurrenceId FROM dfs.`/tmp/artdata.parquet`")

The error message that I get is the following:

  |===========================================================================| 100%
Query ==> SELECT occurrenceId FROM dfs.`/tmp/artdata.parquet`
RESOURCE ERROR: There is not enough heap memory to run this query using the web interface. 

Please try a query with fewer columns or with a filter or limit condition to limit the data returned. 
You can also try an ODBC/JDBC client. 

[Error Id: f42e398d-0afb-4931-a800-2a56baaa074c ]

I tried to set the DRILL_HEAP to 10G.

Is this use case supported with the sergeant?

Any advice on how I should proceed?
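One direction, suggested by the error text itself ("a filter or limit condition"), is to page the column out in chunks over REST rather than pulling all 70M rows in one response. A rough sketch (note: LIMIT/OFFSET over unsorted data is not guaranteed stable between requests, so this is a best-effort workaround, not a definitive fix):

```r
# Sketch: fetch the single column in LIMIT/OFFSET chunks so each REST
# response stays within the web interface's heap budget.
fetch_column <- function(dc, chunk = 1e6L) {
  out <- list()
  offset <- 0L
  repeat {
    q <- sprintf(
      "SELECT occurrenceId FROM dfs.`/tmp/artdata.parquet` LIMIT %d OFFSET %d",
      chunk, offset
    )
    part <- drill_query(dc, q)
    if (nrow(part) == 0) break
    out[[length(out) + 1L]] <- part$occurrenceId
    offset <- offset + chunk
  }
  unlist(out, use.names = FALSE)
}
```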

column order not respected

The column order for my table reference does not seem to be respected.

The column order in the CSV files is

snapshot_id,altitude,heading,latitude,longitude,radar_id,speed,squawk

But I would like to see them as:

snapshot_id longitude latitude altitude heading speed radar_id squawk

I tried the following SELECT in the SQL prompt (it worked):

0: jdbc:drill:> SELECT snapshot_id, longitude, latitude, altitude, heading, speed, CAST(REGEXP_REPLACE(`squawk`, '\r', '') AS int) AS `squawk`, radar_id FROM dfs.fr24.`/*.csv` LIMIT 20;
+--------------+------------+-----------+-----------+----------+--------+---------+-----------+
| snapshot_id  | longitude  | latitude  | altitude  | heading  | speed  | squawk  | radar_id  |
+--------------+------------+-----------+-----------+----------+--------+---------+-----------+
| 1486340102   | 14.27592   | 50.10523  | 0         | 40       | 0      | 0       | 8520      |
| 1486340568   | 14.27612   | 50.10521  | 0         | 58       | 0      | 0       | 8520      |
| 1486340668   | 14.27592   | 50.10512  | 0         | 247      | 0      | 0       | 8520      |
| 1486340773   | 14.27569   | 50.10515  | 0         | 232      | 0      | 0       | 8520      |
| 1486340828   | 14.27578   | 50.10507  | 0         | 77       | 0      | 0       | 8520      |
| 1486340903   | 14.27590   | 50.10528  | 0         | 41       | 0      | 0       | 8520      |
| 1486341363   | 14.27594   | 50.10523  | 0         | 48       | 0      | 0       | 8520      |
| 1486341828   | 14.27573   | 50.10522  | 0         | 241      | 0      | 0       | 8520      |
| 1486342643   | 14.27581   | 50.10522  | 0         | 120      | 0      | 0       | 8520      |
| 1486342870   | 14.27571   | 50.10522  | 0         | 255      | 0      | 0       | 8520      |
| 1486343421   | 14.27588   | 50.10524  | 0         | 63       | 0      | 0       | 8520      |
| 1486343966   | 14.27568   | 50.10509  | 0         | 312      | 0      | 0       | 8520      |
| 1486344251   | 14.27590   | 50.10522  | 0         | 119      | 0      | 0       | 8520      |
| 1486345301   | 14.27586   | 50.10524  | 0         | 294      | 0      | 0       | 8520      |
| 1486345875   | 14.27606   | 50.10507  | 0         | 123      | 0      | 0       | 8520      |
| 1486345931   | 14.27627   | 50.10509  | 0         | 132      | 0      | 0       | 8520      |
| 1486345961   | 14.27619   | 50.10529  | 0         | 339      | 0      | 0       | 8520      |
| 1486345981   | 14.27595   | 50.10538  | 0         | 297      | 0      | 0       | 8520      |
| 1486346091   | 14.27594   | 50.10522  | 0         | 120      | 0      | 0       | 8520      |
| 1486347131   | 14.27592   | 50.10543  | 0         | 12       | 0      | 0       | 8520      |
+--------------+------------+-----------+-----------+----------+--------+---------+-----------+
20 rows selected (9.943 seconds)

But when I use the same SELECT as the definition of the table reference with sergeant/dplyr, I only get the column order of the original CSV files (even if I later try to reorder them with dplyr).

Here is the sequence of commands in R:

> db <- src_drill("localhost")
> fr24pos <- tbl(db, "(SELECT snapshot_id, longitude, latitude, altitude, heading, speed, CAST(REGEXP_REPLACE(`squawk`, '\r', '') AS int) AS `squawk`, radar_id FROM dfs.fr24.`/*.csv`)")
> fr24pos
# Source:   table<(SELECT snapshot_id, longitude, latitude, altitude, heading, speed,
', '') AS int) AS `squawk`, radar_id FROM
#   dfs.fr24.`/*.csv`)> [?? x 8]
# Database: DrillConnection
   altitude squawk snapshot_id heading latitude radar_id speed longitude
      <int>  <int>       <int>   <int>    <dbl>    <int> <int>     <dbl>
 1        0      0  1486340102      40     50.1     8520     0      14.3
 2        0      0  1486340568      58     50.1     8520     0      14.3
 3        0      0  1486340668     247     50.1     8520     0      14.3
 4        0      0  1486340773     232     50.1     8520     0      14.3
 5        0      0  1486340828      77     50.1     8520     0      14.3
 6        0      0  1486340903      41     50.1     8520     0      14.3
 7        0      0  1486341363      48     50.1     8520     0      14.3
 8        0      0  1486341828     241     50.1     8520     0      14.3
 9        0      0  1486342643     120     50.1     8520     0      14.3
10        0      0  1486342870     255     50.1     8520     0      14.3
# ... with more rows
> fr24pos %>% filter(altitude > 1500) %>% head()
# Source:   lazy query [?? x 8]
# Database: DrillConnection
  altitude squawk snapshot_id heading latitude radar_id speed longitude
     <int>  <int>       <int>   <int>    <dbl>    <int> <int>     <dbl>
1    39000  30276  1486339252     114     37.9      230   506      44.0
2    39000  30276  1486339315     114     37.8      230   509      44.2
3    38975  20836  1486339327     114     37.8      230   509      44.2
4    39025  20836  1486339361     114     37.8      230   508      44.2
5    39000  20836  1486339452     114     37.7      230   505      44.6
6    39000  20836  1486339513     131     37.6      230   503      44.7
> fr24pos %>% select(snapshot_id, longitude, latitude, heading, speed, everything()) %>% filter(altitude > 1500) %>% head()
# Source:   lazy query [?? x 8]
# Database: DrillConnection
  altitude squawk snapshot_id heading latitude radar_id speed longitude
     <int>  <int>       <int>   <int>    <dbl>    <int> <int>     <dbl>
1    39000  30276  1486339252     114     37.9      230   506      44.0
2    39000  30276  1486339315     114     37.8      230   509      44.2
3    38975  20836  1486339327     114     37.8      230   509      44.2
4    39025  20836  1486339361     114     37.8      230   508      44.2
5    39000  20836  1486339452     114     37.7      230   505      44.6
6    39000  20836  1486339513     131     37.6      230   503      44.7 

I'm filing it here, but it could also be an issue with dplyr or Apache Drill...
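One possible workaround, as a sketch only: if the REST interface is the culprit (its JSON responses carry rows as maps, whose key order is not guaranteed), then reordering on the lazy query will keep getting lost in translation, but reordering after the rows are local should stick. The `localhost` connection and the abbreviated SELECT below mirror the session above:

```r
library(sergeant)
library(dplyr)

db <- src_drill("localhost")
fr24pos <- tbl(db, "(SELECT snapshot_id, longitude, latitude, altitude,
                      heading, speed, squawk, radar_id
                      FROM dfs.fr24.`/*.csv`)")

fr24pos %>%
  filter(altitude > 1500) %>%
  collect() %>%   # pull the rows into a local tibble first...
  select(snapshot_id, longitude, latitude, altitude,
         heading, speed, radar_id, squawk)   # ...then reorder locally
```

The trade-off is that `collect()` materializes the filtered rows in R, so the filter should be pushed down (as above) before collecting.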

Example CSV file attached (renamed to .txt so the issue tracker accepts it):
20170206_207761844.csv.txt

PS: I have to explicitly remove "\r" from squawk (the last field in the CSV) to deal with Windows line endings... there is probably a config option for that...
PPS: fr24 is a workspace of dfs pointing at CSV files with headers; I followed the "Using Apache Drill with R" recipes to get up and running: THANKS!
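On the `\r` point: Drill's text format reader does expose a `lineDelimiter` option in the storage plugin configuration, so declaring CRLF endings there should remove the need for the `REGEXP_REPLACE` on squawk. An untested sketch of the relevant fragment of the `dfs` plugin config (the `csvh` format name and exact option set vary by Drill version):

```json
"formats": {
  "csvh": {
    "type": "text",
    "extensions": ["csv"],
    "delimiter": ",",
    "extractHeader": true,
    "lineDelimiter": "\r\n"
  }
}
```

With `extractHeader: true` the header row is consumed for column names, and `lineDelimiter: "\r\n"` strips the carriage return from the last field.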
