ianmcook / implyr

SQL backend to dplyr for Impala

License: Other

R 100.00%

apache dplyr dplyr-sql-backends hadoop impala jdbc odbc r sql tidyverse

implyr's People

bendettasd-zz markderry bedantaguru t-sell logicsandeep dirtyarteaga han-tun karoliskascenas harshalrepo minghao2016 pedrorbf mgirlich arafathnihar liudvikasakelis hadley

implyr's Issues

No warning on union_all() of two tbl_impala objects that are arrange()d

To reproduce:

flights_aa <- tbl(impala, "flights") %>% filter(carrier == "AA") %>% arrange(dep_delay)
flights_ua <- tbl(impala, "flights") %>% filter(carrier == "UA") %>% arrange(dep_delay)
union_all(flights_aa, flights_ua)

The resulting rows are not in order by dep_delay, but the warning (Results may not be in sorted order!) is not displayed.

copy_to limit size requirement

Hi,
Is there a specific reason why implyr limits the size of the data.frame that can be inserted into the database?

implyr/R/src_impala.R

Lines 340 to 349 in 32a3e12

if (prod(dim(df)) > 1e3L) {
  # TBD: consider whether to make this limit configurable, possibly using
  # options with the pkgconfig package
  stop(
    "Data frame ",
    name,
    " is too large. copy_to currently only supports very small data frames.",
    call. = FALSE
  )
}

I did not find any mention of such a limit in the Impala guides:
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_create_table.html

A DBI::dbWriteTable on con <- dbConnect(odbc::odbc(), "Impala_DSN") has worked. As it is not as specific as implyr:::db_create_table.impala_connection , it would be better to use implyr 📦 over dbplyr 📦

Thanks for advice.
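
For illustration, the check quoted above counts cells (rows times columns), not rows. The base R sketch below reproduces that arithmetic only; exceeds_copy_to_limit is a hypothetical helper name and is not part of implyr.

```r
# Hypothetical helper (not an implyr function) that mirrors the quoted check:
# copy_to stops when rows * columns is greater than 1000.
exceeds_copy_to_limit <- function(df, limit = 1e3L) {
  prod(dim(df)) > limit
}

exceeds_copy_to_limit(data.frame(x = 1:1000))            # 1000 cells: FALSE
exceeds_copy_to_limit(data.frame(x = 1:1001))            # 1001 cells: TRUE
exceeds_copy_to_limit(data.frame(x = 1:501, y = 501:1))  # 1002 cells: TRUE
```

Because the limit is on cells, a wide data frame can hit it with far fewer than 1,000 rows.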

Canceling a (long-running) query

I use the implyr package frequently. Sometimes I realize only after I have sent a query that it will take (too) long to return the results.

I haven't yet found the best way to cancel a query. The Stop button in RStudio doesn't do the trick, and neither does Ctrl-C. I sometimes terminate RStudio or RStudio Server if a query takes too long.

What are the mechanics that make it difficult to cancel a query? What approach would you suggest?

Greater than 1024 rows of some variable types causes an encoding error

I'm finding if a table has more than 1,024 rows that dbWriteTable has an issue with some variable types.

library(implyr)

connect_impala <- function() {
  src_impala(drv = odbc::odbc(),
             driver = 'Cloudera ODBC Driver for Impala',
             host = 'impala.xxxxxxxxxxxxx',
             port = xxxxx,
             database = 'default',
             AuthMech = 3,
             ssl = 1,
             uid = rstudioapi::askForPassword('Username'),
             pwd = rstudioapi::askForPassword('Password'))
}

cx <- connect_impala()


any_number_you_like <- 0L
set.seed(any_number_you_like)


my_test_data_1024 <- data.frame(X = sample(1024))

dbWriteTable(cx$con,
             Id(schema = 'XXXXXXX', table = 'TEST_1024'), 
             my_test_data_1024)

dbRemoveTable(cx$con, Id(schema = 'XXXXXXX', table = 'TEST_1024'))



my_test_data_1025 <- data.frame(X = sample(1025))

dbWriteTable(cx$con,
             Id(schema = 'XXXXXXX', table = 'TEST_1025'), 
             my_test_data_1025)

The first dbWriteTable is fine, but the second throws an error like this:

Error in result_insert_dataframe(rs@ptr, values, batch_rows) : 
  nanodbc/nanodbc.cpp:1617: HY000: [Cloudera][ImpalaODBC] (110) Error while executing a query in Impala: [HY000] : AnalysisException: Target table 'XXXXXXX.TEST_1025' is incompatible with source expressions.
Expression 'cast('208À€' as string)' (type: STRING) is not compatible with column 'x' (type: INT)

I've replaced the sample with versions that produce double and logical variables with similar results, but sampling from letters did not produce this error.
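
One thing that may be worth trying, assuming the failure is tied to rows beyond the first 1,024 in a single write, is to split the data frame into pieces of at most 1,024 rows and append them one at a time. This is an untested sketch: chunk_rows is a hypothetical helper, and whether appending avoids the error with a given driver version is an assumption.

```r
# Hypothetical helper: split a data frame into pieces of at most `size` rows.
chunk_rows <- function(df, size = 1024L) {
  if (nrow(df) == 0L) return(list(df))
  groups <- ceiling(seq_len(nrow(df)) / size)
  unname(split(df, groups))
}

pieces <- chunk_rows(data.frame(X = sample(1025)))
vapply(pieces, nrow, integer(1))  # 1024 1

# Assumed usage against the connection from the report above (not run here):
# for (piece in pieces) {
#   DBI::dbWriteTable(cx$con,
#                     DBI::Id(schema = 'XXXXXXX', table = 'TEST_1025'),
#                     piece,
#                     append = TRUE)
# }
```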

Test with HBase

A useR!2017 attendee asked after the implyr talk whether implyr could be used with HBase, specifically to write a table to HBase. Create automated tests to check if this is currently possible.

Error in .valueClassTest(ans, "data.frame", "dbFetch")

I'm using implyr 0.2.3, dplyr 0.7.8, DBI 0.7, dbplyr 1.2.1

The following code works

tbl(impala, sql("select * from schema.table"))

but the following code

tbl(impala, sql("select * from schema.table;"))
tbl(impala, "schema..table")

returned an error message
Error in .valueClassTest(ans, "data.frame", "dbFetch") : invalid value from generic function ‘dbFetch’, class “character”, expected “data.frame”

and the following code

tbl(impala, "schema.table")
tbl(impala, in_schema('schema', 'table'))

returned another error message
Error: is.character(vars) is not TRUE

Eliminate uses of compare_tbls() and compare_tbls2() in tests

These functions have been deprecated in dplyr (tidyverse/dplyr#4675, tidyverse/dplyr#4696)

Test with dplyr 0.6.0 release candidate

Test implyr with dplyr version 0.5.0.9002.

Check for compatibility with rlang 0.2.0

See the changes in tidyverse/dbplyr#67 and check if any similar changes are required to implyr.

Add automated tests using odbc

The automated tests currently use RJDBC. Add support for tests using odbc. Use multiple Travis CI jobs to run the same tests using either RJDBC or odbc based on an environment variable.

no existing definition for function ‘dbSendQuery’

I get the "no existing definition for function 'dbSendQuery'" error when I use src_impala like this:

impala = implyr::src_impala(odbc::odbc(), dsn = "some_dsn_name")

The DSN itself works well, for example:
odbc::dbConnect(odbc::odbc(), dsn = "some_dsn_name") %>% odbc::dbGetQuery("some_sql_query")

I don't have any trouble when using JDBC connectivity, but I'd prefer using ODBC as ODBC doesn't require the administrator privilege for Kerberos.

Thank you for your help!

My SessionInfo:
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] odbc_1.0.1 magrittr_1.5 R6_2.2.2 assertthat_0.2.0 RevoUtils_10.0.3
[6] DBI_0.7 tools_3.3.3 glue_1.1.0 dplyr_0.7.0 tibble_1.3.3
[11] implyr_0.2.0 Rcpp_0.12.11 blob_1.0.0 rlang_0.1.1 dbplyr_1.0.0

error when using dbSendQuery on impala connection

I am not sure if this is supported. I was trying to use dbSendQuery to fetch the result set in batches rather than in one shot, to reduce the memory footprint. dbQuery and other methods work fine. The code is as follows:

library(implyr) # this is the only line for library loading
library(RJDBC)

impala <- src_impala(...) #I am using JDBC
dbSendQuery(impala, "select * from tbl1")

Error in (function (classes, fdef, mtable):
unable to find an inherited method for function 'dbSendQuery' for signature '"src_impala", "character"'

session info on windows 64bit:
R 3.4.3
rJava_0.9_9 DBI_0.8 implyr_0.2.2 dplyr_0.7. RJDBC_0.2_7

I saw the other issue you fixed the loading sequence and namespace issue with DBI so I didn't load DBI explicitly.

Thanks.
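
For context, the error above says that no dbSendQuery method exists for the src_impala object itself. A minimal sketch of batch fetching is shown below, assuming the underlying DBI connection is reachable as impala$con (the same pattern as cx$con in another report on this page). fetch_in_batches and its arguments are hypothetical names, not implyr functions.

```r
# Hypothetical helper: fetch a result set in batches and pass each batch to
# `handle`. Uses only DBI generics (dbSendQuery, dbFetch, dbClearResult).
fetch_in_batches <- function(con, sql, batch_size = 10000L, handle = print) {
  res <- DBI::dbSendQuery(con, sql)
  on.exit(DBI::dbClearResult(res), add = TRUE)
  total <- 0L
  repeat {
    chunk <- DBI::dbFetch(res, n = batch_size)
    if (nrow(chunk) == 0L) break   # nothing left to fetch
    handle(chunk)
    total <- total + nrow(chunk)
  }
  invisible(total)                 # number of rows processed
}

# Assumed usage (not run here):
# fetch_in_batches(impala$con, "select * from tbl1", batch_size = 5000L)
```

The loop stops on the first empty batch rather than relying on dbHasCompleted, since drivers may differ in how they report completion.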

Add Function src_insert

Request a function that allows a tbl_impala, or tbl_lazy etc. to be able to be inserted back into an impala table.
Arguments would be
1: implyr object
2: target table (implyr object?)
3: overwrite = FALSE
4: partition spec (if it cannot be extracted automatically)

Potential Method... just a suggestion:
After running collapse on the implyr object, $ops contains the SQL for the query. We just need to add the INSERT OVERWRITE or INSERT INTO statement at the beginning to make this work.
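
The suggestion above can be sketched as plain string building. This is an illustration under assumptions, not a proposed implementation: build_insert_sql is a hypothetical name, the SELECT text is taken as an argument (for example from dbplyr::sql_render rather than from $ops), and no quoting or validation of the target name is attempted.

```r
# Hypothetical helper: prepend an Impala INSERT clause to a SELECT statement.
build_insert_sql <- function(select_sql, target, overwrite = FALSE,
                             partition = NULL) {
  verb <- if (overwrite) "INSERT OVERWRITE" else "INSERT INTO"
  partition_clause <- if (is.null(partition)) {
    ""
  } else {
    paste0(" PARTITION (", partition, ")")
  }
  paste0(verb, " ", target, partition_clause, " ", select_sql)
}

build_insert_sql("SELECT * FROM flights WHERE carrier = 'AA'", "flights_aa")
# "INSERT INTO flights_aa SELECT * FROM flights WHERE carrier = 'AA'"

# Assumed usage with a lazy table and the underlying connection (not run here):
# select_sql <- as.character(dbplyr::sql_render(my_tbl))
# DBI::dbExecute(impala$con, build_insert_sql(select_sql, "db.target"))
```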

Fix literal string arguments in paste()

The following fails because dbplyr does not quote the literal strings in the SQL:

tbl(impala, "flights") %>%
  transmute(date = paste0(
    as.character(year),
    "-",
    lpad(as.character(month), 2L, "0"),
    "-",
    lpad(as.character(day), 2L, "0")
  )
)

I believe this is a dbplyr issue, but need to check and create a reproducible example. Note that you can't use SQLite to create an easily reproducible example because SQLite uses an infix operator for concatenation.

The workaround is to wrap the literal strings in parens: ("-")

read / write parquet / CSV files

Can implyr read / write Parquet files?

Dbplyr joins fail with default suffixes

To reproduce, join two tables that have column names in common but aren't the join key. Default suffixes (".x" and ".y") are attempted, but obviously column names in Impala (most SQLs I suspect) can't contain dots, so errors are thrown.
Default suffixes should be changed to something valid like "_x" and "_y" when we know we're joining Impala-backed tables.

src_impala from existing connection object

Hello,
would it be possible to simplify the connection process and initialize src_impala from an existing connection object?

For example, I already have an odbc connection initialized like this:

conn = odbcConnect("impaladsn")

Is there a way to call src_impala on this object directly? For example, src_impala(conn)?

Thanks in advance.

Fix detection of ORDER BY in subqueries with dplyr 0.7.0

implyr issues a warning if you apply arrange() before other verbs. This is because Impala may not return the final result in sorted order if row ordering is applied in an intermediate phase of query processing. But this warning is often a false positive, especially with the new SQL optimizer in dplyr 0.7.0. Need to stop these false positive warnings.

dbDisconnect error on CentOS

When executing the code example of connecting to Impala, running a query, and disconnecting as shown in the README there is an error "type must be a single element of type 'character'" when attempting to disconnect. OS: CentOS, Implyr Version: 0.2.2

Add rJava to Suggests

On 2018-05-18, Kurt Hornik emailed the following:

These [packages including implyr] seem to have undeclared package dependencies in their unit test code (R files in tests subdirs), see below.

Can you pls fix as necessary? (Add the missing package dependencies to
Suggests, I guess.)

Please note that these issues are currently not yet detected by the CRAN
incoming (or regular) checks.

$implyr
'::' or ':::' import not declared from: ‘rJava’

Add function src_databases()

Add a function src_databases() that does what the sparklyr function of the same name does.

show query progress in R

When executing an Impala query from HUE the progress of the query is shown. It would be a nice-to-have in R as well. Similar to the progress bars from readr for reading large datasets.

Translating expression with extracted entry.

I think I found a bug, although I cannot be sure that it is still present.
I'll mention it here because it's simple, and if needed and relevant I'll work on a reproducible example.

I am calculating a range of dates by which to filter my main frame. Let's just take the whole weeks.
And using that to filter the table.

> dates_range <- tbl(conn, "events") %>%
    group_by(a_week = floor_date(a_date, "week")) %>% 
    mutate(n_days = n()) %>% 
    select(a_week, a_date) %>% distinct() %>%  collect() %>%
    filter(n_days == 7) %$% 
    range(a_date) %>% as.Date()

> events_query <- tbl(conn, "events") %>% 
    filter(a_date >= dates_range[1]) %>% show_query()

I get

SELECT *
FROM `events`
WHERE (`a_date` >= CASE 
  WHEN (1.0) THEN (('2020-05-04', '2020-06-21')) 
  END)

Thus it's interpreting dates_range[1] as a CASE WHEN clause, which does weird things.
While for my purposes I can work around this by substituting the dates_range value, I thought I would report it here and do my share.

I must say that I wasn't able to download the latest version of implyr since I'm on a work computer, so it is possible that this has already been fixed. 😮

Regards from Mexico.

use of impala as backend for rmarkdown-sql-codechunks supported?

Today I tried to use Impala as a backend-connection to execute sql-code chunks as shown in https://rmarkdown.rstudio.com/authoring_knitr_engines.html#sql

I ran into some error message, which I will post soon. Is this feature supposed to work by any chance? It would allow sql-syntax-highlighting and increase the readability over dbGetQuery(). :)

I tried

impala <- src_impala(...)
```{sql, connection = impala}
SELECT * FROM trials
```

Out of curiosity: is there a conceptual or implemented difference between DBI::dbConnect() and implyr::src_impala?

use spark with tbl() returned object

Hi Ian -

Thanks for your work on implyr! I created an impala object following these steps:

impala <- implyr::src_impala(
      drv = drv,
      dsn = "Impala",
      database = "my_db"
    )
my_impala_tbl <- tbl(impala, in_schema("my_db", "mytable"))
test_df = my_impala_tbl %>% select(colA, colB) %>% filter(colA == "a")

Is there a way to convert the returned test_df to a Spark data frame without having to collect it into R first?

Thanks!

"Unable to locate SQLGetPrivateProfileString function" when connecting

I know this is a long shot and difficult to reproduce, but still I want to point it out, because it might be solved.

When connecting to Impala I get the following error message and no connection is made:

Error: nanodbc/nanodbc.cpp:950: HY000: [unixODBC][Simba][ODBC] (11560) Unable to locate SQLGetPrivateProfileString function.

Now strangely enough, when I connect to a different data source first (SQLite) in the R session, I have no problem connecting to Impala afterwards. The following works:

library(implyr); library(odbc)

dbConnect(odbc::odbc(), dsn = "SQLite")

impala <- src_impala(
  odbc::odbc(),
  dsn = "Impala")

But it does not work if the first connection (to SQLite) is not made. I am in the dark as to why this is; maybe an environment variable is set correctly by connecting to SQLite first. Recently a colleague had the exact same error on a fresh system, and the same fix worked.

Again, I know it is a long shot, but maybe @ianmcook or @jimhester can think of a direction for resolving this.

I am on macOS 10.12.6.

My /etc/odbcinst.ini file looks like

[SQLite Driver]
Driver = /usr/local/lib/libsqlite3odbc.dylib

[Cloudera ODBC Driver for Impala]
Driver = /opt/cloudera/impalaodbc/lib/universal/libclouderaimpalaodbc.dylib

and my .Renviron file looks like

# ODBC
DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:/usr/local/lib
ODBCINI=/etc/odbc.ini
ODBCSYSINI=/etc

Let me know if you need any more information.

Cannot get unicode from impala

library(odbc)
library(dplyr)
library(implyr)
drv <- odbc::odbc()
impala <- src_impala(
drv = drv,
driver = "Cloudera ODBC Driver for Impala",
host = "impala-proxy",
port = 21050,
database = "default",
uid = "username",
pwd = "password"
)
data <- dbGetQuery(impala, "SELECT * FROM ghtk.package_logs limit 10")

Output: Th<ea>m coupon E516188e is returned instead of the Unicode string Thêm coupon E516188e

Unable to handle database name with `tbl()` function

It would be nice to handle database name with tbl() function. In current version of implyr, it returns an error as follows:

> airports <- tbl(impala, "u_ariga.airports_pq")
Error in new_result(connection@ptr, statement) : 
  nanodbc.cpp:1344: HY000: [Cloudera][ImpalaODBC] (110) Error while executing a query in Impala: [HY000] : AnalysisException: Could not resolve table reference: 'u_ariga.airports_pq'

Unable to find an inherited method

Hi,

I'm trying to use implyr with Kerberos auth, connection itself works fine,

src_tbls(impala)

is returning list of tables, but when I'm trying to use tbl function to get any table I get:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘dbSendQuery’ for signature ‘"impala_connection", "sql"’

Try connecting to Impala with thriftr package

There's a fairly new R package thriftr (GitHub, CRAN) from @marekjagielski that implements Apache Thrift in R. See if it would be possible (and if so, what additional work would be required) to use this to connect to Impala (instead of ODBC or JDBC).

Add automated tests

Add automated tests that will run on Travis using an EC2 instance running Impala.

Test and document limitations for distinct() or unique()

A useR!2017 attendee asked after the implyr talk about what Impala's limitations for using DISTINCT were. Older versions of Impala (before version 2.0) allowed only one DISTINCT clause per query. Newer versions have removed this limitation. But the current version of Impala does have the limitation that you cannot use DISTINCT in more than one aggregation function in the same query. Write tests to check these behaviors in implyr, and document the practical implications of these limitations for implyr users who are using the distinct() verb or the unique() function.

Connections pane supported?

I have been able to successfully connect to Impala with implyr using the odbc package. So far the Connections pane isn't working for me (but src_tbls(impala) did).

Is the Connections pane functionality supported through the implyr package?

Write to external table in non-default location?

I'm able to use DBI::dbWriteTable() to create external tables in Impala, but they are stored in the default database location. Is there an option to specify the location parameter of the external table?

ODBC ERROR: while contacting server - SSL connection

Good morning,

I'm trying to connect to Impala, but I received:
Error: nanodbc/nanodbc.cpp:950: HY000: [Cloudera][ThriftExtension] (5) Error occurred while contacting server: No more data to read.. This could be because you are trying to establish a non-SSL connection to a SSL-enabled server.

My connection code is:
impala <- src_impala(
  drv = drv,
  driver = "Cloudera ODBC Driver for Impala",
  host = "server.ita.it",
  port = 21050,
  database = "dwh",
  uid = userId,
  pwd = password
)
I thought the problem was the SSL setting, but my Cloudera ODBC DSN has "ENABLE SSL" checked in its settings, and if I connect to Impala with the DBI library I do not get this error. DBI connection code:
conHD <- DBI::dbConnect(
  odbc::odbc(),
  SERVER_NAME_HDP, # CLOUDERA ODBC NAME
  uid = userIdHdp,
  pwd = passwordHdp
)

Where am I going wrong?

R session info:

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=Italian_Italy.1252 LC_CTYPE=Italian_Italy.1252 LC_MONETARY=Italian_Italy.1252 LC_NUMERIC=C LC_TIME=Italian_Italy.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] odbc_1.1.6 implyr_0.3.0 dplyr_0.8.3 DBI_1.0.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 zeallot_0.1.0 dbplyr_1.4.2 crayon_1.3.4 assertthat_0.2.1 R6_2.4.0 backports_1.1.5 magrittr_1.5 pillar_1.4.2
[10] rlang_0.4.0 rstudioapi_0.10 blob_1.2.0 vctrs_0.2.0 tools_3.6.1 bit64_0.9-7 glue_1.3.1 bit_1.1-14 purrr_0.3.2
[19] hms_0.5.1 compiler_3.6.1 pkgconfig_2.0.3 tidyselect_0.2.5 tibble_2.1.3
with:
Cloudera ODBC Driver for Impala version 2.5.59

thanks in advance
have nice day
MC

Change translation of paste() for consistency with dplyr

Modify the translation of paste() for consistency with dplyr as described in tidyverse/dplyr#3168

dbplyr 2.0.0 breaks implyr

Unfortunately I cannot paste any output as I have already downgraded to 1.4.4, but the problem was that "db.tbl" was not being properly escaped when using tbl(impala, "db.tbl") or tbl(impala, in_schema("db", "tbl")).

The only way I was able to query data was with dbGetQuery(impala, "select * from db.tbl") where the query string was probably not being manipulated in any way.

I am quite inexperienced with R, but maybe a quick fix would be to cap the dbplyr version dependency on 1.4.4?

Would like to connect to specific db using src_impala

It would be nice to be able to connect to a specific database using src_impala instead of needing to use dbExecute(db, "use new_db").

connection success but query failed

Hi, I used implyr and RJDBC to connect R to Impala successfully, but when I use dbGetQuery to select some data from the database, I get an error.

the sql was below:

testdata <- dbGetQuery(
impala,
"SELECT * FROM jolly.who_orderinfo limit 10"
)

and the error was below:

Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
Unable to retrieve JDBC result set for SELECT * FROM jolly.who_orderinfo limit 10 ([Simba]ImpalaJDBCDriver ERROR processing query/statement. Error Code: 0, SQL state: TStatus(statusCode:ERROR_STATUS, sqlState:HY000, errorMessage:AuthorizationException: User '' does not have privileges to execute 'SELECT' on: jolly.who_orderinfo
), Query: SELECT * FROM jolly.who_orderinfo limit 10.)

Can anyone tell me why this happens and how I can solve this problem? Thanks!

Cannot connect to database SASL error

Whenever I try to connect to the impala database, I keep on getting this error:

Error: nanodbc/nanodbc.cpp:950: HY000: [Cloudera][ThriftExtension] (4) Error occurred while contacting server: EAGAIN (timed out). The connection has been configured to not use SASL for authentication. This error might be due to the server has been configured to use SASL for authentication.

I'm not sure if this is because often we have to connect through SSH but so far I've tried a lot of things unsuccessfully. Any idea of what might be happening here?

is there character length limit? : nchar(x) < 32700

I downloaded area JSON text from Impala with the implyr package and got broken JSON text whenever its length is more than 32700 characters.
Is there a character length limit?

p.s.
data download result with python impyla is normal.

dplyr verbs fail on functions with dots in names when no column name is assigned

The following code fails

tbl(impala, "flights") %>% transmute(as.character(year))

with the error

Error in new_result(connection@ptr, statement) : 
  nanodbc/nanodbc.cpp:1344: HY000: [Cloudera][ImpalaODBC] (110) Error while executing a query in Impala: [HY000] : AnalysisException: Syntax error in line 1:
SELECT cast(`year` as string) AS `as`.`character(year)`
                                     ^
Encountered: .
Expected: CROSS, FROM, FULL, GROUP, HAVING, INNER, JOIN, LEFT, LIMIT, OFFSET, ON, ORDER, RIGHT, STRAIGHT_JOIN, UNION, USING, WHERE, COMMA

CAUSED BY: Exception: Syntax error

This seems to be an implyr-specific problem, because the following works without error:

dbplyr::memdb_frame(year = c(2013L)) %>% transmute(as.character(year))

Resolve this by not adding the backticks around the dot in this case.

Until it's fixed, the workaround is to specify a name for the column:

tbl(impala, "flights") %>% transmute(year_string = as.character(year))

java.sql.SQLNonTransientConnectionException: [Cloudera][JDBC](10060) Connection has been closed.

We are using implyr to connect to Impala via JDBC. Usually this works fine, but about 25% of the attempts to establish the Impala connection fail, with at least three different errors that look a lot alike:

for(i in 1:100){
  impala <- CreateImpalaConnection(SSLTrustStorePwd = SSLTrustStorePwd)
}

Error in .jcall(conn@jc, "V", "close") : 
  java.sql.SQLNonTransientConnectionException: [Cloudera][JDBC](10060) Connection has been closed.
Error in .jcall(conn@jc, "Ljava/sql/Statement;", "createStatement") : 
  java.sql.SQLNonTransientConnectionException: [Cloudera][JDBC](10060) Connection has been closed.
Error in .jcall(.rJava.class.loader, "[Ljava/lang/String;", "getClassPath") : 
  java.sql.SQLNonTransientConnectionException: [Cloudera][JDBC](10060) Connection has been closed.

I noticed via traceback() that the errors are in different locations:

The '"[Ljava/lang/String;", "getClassPath")' error is in the .jinit function:
6: stop(list(message = "java.sql.SQLNonTransientConnectionException: [Cloudera][JDBC](10060) Connection has been closed.", call = .jcall(.rJava.class.loader, "[Ljava/lang/String;", "getClassPath"), jobj = new("jobjRef", jobj = <pointer: 0xac0e2b8>, jclass = "java/sql/SQLNonTransientConnectionException")))
5: .jcheck()
4: .jcall(.rJava.class.loader, "[Ljava/lang/String;", "getClassPath")
3: .jclassPath()
2: .jinit(classpath = impala_classpath, force.init = TRUE)
The "Ljava/sql/Statement;", "createStatement" is within the dbConnect function (https://github.com/ianmcook/implyr/blob/master/R/src_impala.R#L126)
13: .getClassesFromCache(Class)
12: getClassDef(classi, where = where)
11: validObject(.Object)
10: initialize(value, ...)
9: initialize(value, ...)
8: new("jobjRef", jobj = r, jclass = substr(returnSig, 2, nchar(returnSig) - 1))
7: new("jobjRef", jobj = r, jclass = substr(returnSig, 2, nchar(returnSig) - 1))
6: .jcall("java/sql/DriverManager", "Ljava/sql/Connection;", "getConnection", as.character(url)[1], as.character(user)[1], as.character(password)[1], check = FALSE)
5: .local(drv, ...)
4: dbConnect(drv, ...)
3: dbConnect(drv, ...)
2: src_impala(drv = drv, ....)
The "conn@jc, "V", "close"" error is in the db_disconnector function: https://github.com/ianmcook/implyr/blob/master/R/src_impala.R#L506
and seems to occur less often when I set auto_disconnect to FALSE.

Are these errors known, and what can we do about it? We now wrapped the CreateImpalaConnection function in a try catch loop and try multiple times, but that is not the desired way to do it.

The function CreateImpalaConnection is defined as:

CreateImpalaConnection <- function(SSLTrustStorePwd){
  #Function to create an impala connection
  
  # Impala settings
  impala_classpath <- "file_path_of_impala_driver"
  
  .jinit(classpath = impala_classpath, force.init = TRUE)
  drv <- JDBC(
    driverClass = "com.cloudera.impala.jdbc4.Driver",
    classPath = impala_classpath,
    identifier.quote = "`"
  )
  impala <- src_impala(
    drv = drv,
    paste0(
      "jdbc:impala://serveradress:21051;",
      "AuthMech=1;KrbRealm=ourdomain;KrbHostFQDN=serveradress;KrbServiceName=impala;",
      "SSL=1;SSLTrustStore=filepath_to_jssecacerts;SSLTrustStorePwd=", SSLTrustStorePwd, ";",
      "CAIssuedCertNamesMismatch=1;"
    ),
    auto_disconnect = FALSE
  )
  return(impala)
}
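
The retry workaround described above (wrapping the connection function in tryCatch and trying several times) can be sketched roughly as follows. with_retries and its parameters are hypothetical names; this only illustrates the stopgap and does not address why the connection is reported as closed.

```r
# Hypothetical helper: call `connect()` up to `max_attempts` times, waiting
# `wait_seconds` between failed attempts, and rethrow the last error.
with_retries <- function(connect, max_attempts = 3L, wait_seconds = 1) {
  last_error <- NULL
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(connect(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    if (attempt < max_attempts) Sys.sleep(wait_seconds)
  }
  stop(
    "Connection failed after ", max_attempts, " attempts: ",
    conditionMessage(last_error),
    call. = FALSE
  )
}

# Assumed usage with the function defined above (not run here):
# impala <- with_retries(
#   function() CreateImpalaConnection(SSLTrustStorePwd = SSLTrustStorePwd)
# )
```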

Implement access to complex columns (ARRAY, MAP, STRUCT)

As a beginner, it is not immediately clear to me how best to use implyr to access Impala complex types (especially maps, e.g. pulling out a couple of columns and joining them with the existing data frame).

The link in the Readme is helpful (to create dbGetQuery()-requests), but a short example - possibly showing dplyr-logic - would be really cool as well. :)

Warning: "object '.__C__impala_connection' not found" when initializing impala connection

Hey Ian,

on a new server with a newly set up impala connection we currently run into the following warning, when initializing the impala connection

library(implyr)

impala <- implyr::src_impala(odbc::odbc(), "IMPALA_DSN")
#> Warning in rm(list = what, pos = classWhere) :
#>  object '.__C__impala_connection' not found

Do you have an idea, what could fix this? It doesn't happen the second time impala <- implyr::src_impala(odbc::odbc(), "IMPALA_DSN") is called within the same session.

Explain case sensitivity of Impala SQL function names

This works:

flights %>% 
  group_by(origin) %>% 
  summarise(NDV(dest))

but only if NDV() is uppercase. The README should explain why this is, because function names in Impala SQL are case-insensitive.

compute() table to parquet?

Using the compute() method on an implyr table currently stores it as TEXT, creating big files which usually end up scattered across several worker nodes.

Is there an option to store them as Parquet instead, or at least compressed gz?

Thanks

copy_to dimension limit

Is there any reason why this limitation exists? I've rebuilt the package without it and was successfully able to write a 50000x5 table.

Drop db_sql_render method?

dbplyr now automatically drops ORDER BY from subqueries, so I don't think you need this any more, and I think you're the only user of the generic, which means I could (eventually) drop it from dbplyr.

Error when join key column name is the same in both tables

When using a join function to join two Impala tables on a key column that has the same name in both tables, one of two errors can occur:

If the join result is collected immediately, then dplyr gives the error "Each variable must have a unique name."
If the join result is not collected immediately, then Impala gives the error "Duplicated inline view column alias".

This is due to a behavior of Impala and its ODBC and JDBC drivers in which duplicate column names are not disambiguated. See IMPALA-421.

You can reproduce this error by attempting to join the flights and airlines tables from nycflights13:

flights_tbl <- tbl(impala, "flights")
airlines_tbl <- tbl(impala, "airlines")
inner_join(flights_tbl, airlines_tbl, by = "carrier")
inner_join(flights_tbl, airlines_tbl, by = "carrier") %>% collect()

A workaround is to rename the key column in one of the tables before joining:

airlines_tbl <- airlines_tbl %>% rename(airlines.carrier = carrier)
inner_join(flights_tbl, airlines_tbl, by = c("carrier" = "airlines.carrier"))

See if there is a way for implyr to work around this behavior.

Some other SQL engines work around this problem by prepending tablename. to the name of every column in a join.
