
databricks-sql-python's People

Contributors

andrefurlan-db, benc-db, benfleis, cjstuart, dbarrundia-tiger, falydoor, jackyhu-db, jacobus-herman, juliuszsompolski, kravets-levko, lu-wang-dl, martinitus, mattdeekay, mkazia-db, moderakh, mohitsingla-db, niallegan, nithinkdb, nodejsmith, pietern, sander-goos, sebbegg, shea-parkes, susodapop, timtheinattabs, unj1m, williamgentry, wyattscarpenter, xinzhao-db, yunbodeng-db


databricks-sql-python's Issues

Request for Async Support and Dependency Update

Hi all.

At work we're using databricks-sql-connector together with sqlalchemy-databricks, but it currently lacks support for async operations, which is causing performance issues for us. It would be appreciated if you could consider adding async support in a future release.

We have also noticed that one dependency of the sqlalchemy-databricks dialect (https://github.com/crflynn/sqlalchemy-databricks) is PyHive. The problem is that PyHive is no longer maintained by Dropbox, which raises security concerns. Are there any considerations about moving away from it?

I would appreciate it if you could let me know whether there is anything I can do to help with these requests.

https://github.com/dropbox/PyHive/issues

single quote not properly escaped

What

ParamEscaper's escape_string() gives incorrect behavior on Databricks SQL and in Databricks notebooks.

It replaces a single quote ' with '', but the correct way to escape ' is with a backslash, like \'.

You can verify in PySpark with:

assert spark.sql("select 'cat''s meow' as my_col").head(1)[0]['my_col'] == "cats meow"
assert spark.sql("select 'cat\\'s meow' as my_col").head(1)[0]['my_col'] == "cat's meow"

Note that because the query starts as a Python string literal, we need two backslashes (\\) so that Python first reduces \\' to \', and Spark then unescapes \' to '.

I don't know what the motivation for this implementation was, but the result seems to be concatenation instead of escaping the quote character.

Reproduction in databricks-sql-python

The following demonstrates the issue in version 2.0.5 of databricks-sql-python against a serverless SQL warehouse in Azure (v2022.30), plus an implementation without parameter substitution showing an escape treatment that does work:

from typing import List

from databricks import sql
import os

from databricks.sql import ServerOperationError

server_hostname = os.environ.get('DBT_DATABRICKS_HOST')
http_path = f'/sql/1.0/endpoints/{os.environ.get("DBT_DATABRICKS_ENDPOINT")}'
access_token = os.environ.get('DBT_DATABRICKS_TOKEN')
user_agent_entry = "dbt-databricks/1.2.2"
connection = sql.connect(
  server_hostname=server_hostname,
  http_path=http_path,
  access_token=access_token,
  _user_agent_entry=user_agent_entry
)


cursor = connection.cursor()


def get_result_using_parameter_bindings(p:List[str]):
  try:
    cursor.execute('select %s as my_col', p)
    result = list(cursor.fetchall())[0]['my_col']
  except ServerOperationError as exc:
    result = exc.message.strip()[:20] + '...'
  return result

def get_result_using_fstring(p:List[str]):
  try:
    escaped = p[0].replace('\\', '\\\\').replace("'", "\\'")
    cursor.execute(f"select '{escaped}' as my_col")
    result = list(cursor.fetchall())[0]['my_col']
  except ServerOperationError as exc:
    result = exc.message.strip()[:20] + '...'
  return result


params = [
  ["cat's meow"],
  ["cat\'s meow"],
  ["cat\\'s meow"],
  ["cat''s meow"],
]



for p in params:
  # using dbt-databricks-sql's parameter substitution
  param_binding_result = get_result_using_parameter_bindings(p)

  # using manually built and escaped query
  f_string_result = get_result_using_fstring(p)



  print('\nparameter value:', p[0], 'parameter-binding result:', param_binding_result, 'round-trip ok?', p[0]==param_binding_result)
  print('parameter value:', p[0], 'f-string result:', f_string_result, 'round-trip ok?',
        p[0] == f_string_result
        )
  assert p[0] == f_string_result


cursor.close()
connection.close()

The output is:

bash_1  | parameter value: cat's meow parameter-binding result: cats meow round-trip ok? False
bash_1  | parameter value: cat's meow f-string result: cat's meow round-trip ok? True
bash_1  | 
bash_1  | parameter value: cat's meow parameter-binding result: cats meow round-trip ok? False
bash_1  | parameter value: cat's meow f-string result: cat's meow round-trip ok? True
bash_1  | 
bash_1  | parameter value: cat\'s meow parameter-binding result: [PARSE_SYNTAX_ERROR]... round-trip ok? False
bash_1  | parameter value: cat\'s meow f-string result: cat\'s meow round-trip ok? True
bash_1  | 
bash_1  | parameter value: cat''s meow parameter-binding result: cats meow round-trip ok? False
bash_1  | parameter value: cat''s meow f-string result: cat''s meow round-trip ok? True

Expected results

String parameters with single quotes and backslashes should be properly reproduced:
"cat's meow" would be escaped as "cat\\'s meow" and the resulting SQL would return cat's meow
"cat\\'s meow" would escape to "cat\\\\\\'s meow" and the SQL would return cat\'s meow

Suggested fix

I'm not sure how this is usually implemented, but in my example just doing param.replace('\\','\\\\' ).replace("'", "\\'") at least preserves single quotes and backslashes, which are probably the most common cases. It would also leave alone escaped unicode literals like \U0001F44D.
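As a sketch of what that could look like (the function name is mine, not the connector's actual internals):

def escape_string_backslash_style(value: str) -> str:
    # Hypothetical helper, not the connector's real escape_string:
    # escape backslashes first, then single quotes, so "cat's" becomes "cat\'s".
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

assert escape_string_backslash_style("cat's meow") == "'cat\\'s meow'"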

How I encountered it

I'm using dbt with Databricks and noticed on upgrading from dbt-databricks 1.0 to 1.2.2 that single quotes started disappearing from our "seeds" (csv files loaded as Delta tables). Code had changed in dbt-databricks to use the new parameter binding functionality in this library, whereas (I assume) before it must have been injecting the values as literals into the SQL.

No COMMENT property when querying tables

When querying tables, I do not have the COMMENT property.

tables = cursor.tables().fetchall()

I see REMARKS. It might not be exactly the same as COMMENT, but it's the closest thing I found so far. But the value I get seems to always be 'UNKNOWN'.

Here is how I added a comment to a table:

(screenshot of the COMMENT being added omitted)

I want to know how to get the comment of a table using this Python client.

Using SQL isn't convenient because I need to call DESCRIBE TABLE table for each table and I am dealing with a very large number of tables.
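One possible workaround, assuming the tables live in Unity Catalog (information_schema is not available for the legacy Hive metastore), is to read the comment column from information_schema in a single query instead of a DESCRIBE TABLE per table:

cursor.execute("""
    SELECT table_catalog, table_schema, table_name, comment
    FROM system.information_schema.tables
    WHERE table_schema = 'my_schema'   -- placeholder schema name
""")
for row in cursor.fetchall():
    print(row.table_name, row.comment)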

Thank you for your help.

Add support for Python 3.10

PySQL depends on NumPy, which only added support for Python 3.10 on 22 June 2022 in version 1.23.0, but we're pinned to 1.21.1. De facto, we don't support Python 3.10.

Error in accessing the databricks data

I am trying to connect to Databricks using a custom script that accesses data in the Databricks cluster and creates a pickle file on the local system. I am getting the following error:

Traceback (most recent call last):
 File "Ekata_script.py", line 34, in <module>
 key, leads, lead_ids[0:], conn_odbc, "out.pkl", data
  File "/home/ubuntu/ekata/playground2/Ekata_functions.py", line 241, in Ekata_matcher_direct_address_ygl_file_odbc
   property_df, lead_df = match_all_ekata_to_properties_odbc(pull, lead_df, open_lead, conn)
  File "/home/ubuntu/ekata/playground2/Ekata_functions.py", line 408, in match_all_ekata_to_properties_odbc
  ''', con = conn)
   File "/usr/lib/python3/dist-packages/pandas/io/sql.py", line 400, in read_sql
   chunksize=chunksize)
  File "/usr/lib/python3/dist-packages/pandas/io/sql.py", line 1444, in read_query
   cursor = self.execute(*args)
  File "/usr/lib/python3/dist-packages/pandas/io/sql.py", line 1417, in execute
   raise_with_traceback(ex)
  File "/usr/lib/python3/dist-packages/pandas/compat/__init__.py", line 385, in raise_with_traceback
   raise exc.with_traceback(traceback)
  File "/usr/lib/python3/dist-packages/pandas/io/sql.py", line 1413, in execute
   self.con.rollback()
   File "/home/ubuntu/.local/lib/python3.6/site-packages/databricks/sql/client.py", line 320, in rollback
   raise NotSupportedError("Databricks does not have transactions")
  pandas.io.sql.DatabaseError: Execution failed on sql: 
            SELECT dp.property_id
            ,property_name
            ,lpa.activity_date
            ,dp.address
            ,dg.city
            ,dg.state
            ,dg.zip
            FROM table1 AS dp 
            JOIN table2 AS dg ON dp.dim_geography_key = dg.dim_geography_key
            JOIN table3 AS lpa ON lpa.property_id = dp.property_id
            AND lpa.lead_id = lead_id
            AND lpa.activity_type_code = 'RA'
            WHERE dp.latitude BETWEEN min_lat AND max_lat
            AND dp.longitude BETWEEN min_long AND max_long
            AND dp.date_end = '20991231'
            AND dg.date_end = '20991231'
            
   Query execution failed. State: ERROR_STATE; Error code: 0; SQLSTATE: 56038; Error message: 
   org.apache.hive.service.cli.HiveSQLException: Error running query: [UC_NOT_ENABLED] 
  org.apache.spark.sql.AnalysisException: [UC_NOT_ENABLED] Unity Catalog is not enabled on this cluster.
at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:48)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$execute$1(SparkExecuteStatementOperation.scala:609)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)

Error Creating Tables - Is Databricks Unity Catalog?

I'm trying to run the SQLAlchemy example from this repo on a new instance of Databricks.

I'm getting the following error:

databricks.sql.exc.ServerOperationError: [UC_COMMAND_NOT_SUPPORTED] Create sample tables/views is not supported in Unity Catalog.

For context, we're not currently using the Unity Catalog. Is that required to use the latest version of the Databricks SQL Connector?

Add example with conversion to pandas DataFrame

The example in the README.md and the documentation all print each row individually.

In practice many users will try to convert the data to a pandas DataFrame, so why not include that example in the getting-started docs? We currently use:

import pandas as pd

cursor.execute(query) 
result = cursor.fetchall() 
df = pd.DataFrame(result, columns=[x[0] for x in cursor.description]) 
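If the installed connector version exposes Arrow results (recent 2.x releases do, as far as I can tell), a variant that skips building Python row objects is:

cursor.execute(query)
# fetchall_arrow() returns a pyarrow.Table; to_pandas() converts it in one step.
df = cursor.fetchall_arrow().to_pandas()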

GetOperationStatus fails without knowing the reason

I have a long-running query (averaging 30 minutes) that sometimes fails at random, and the only traceback I get is this:

2022-10-24 12:36:21,674 - [INFO] - databricks.sql.thrift_backend - MainThread - Error during request to server: {"method": "GetOperationStatus", "session-id": null, "query-id": "b'i\\xfbv\\xa6\\xed+J\\x13\\xbc\\x90\\x05\\xafR\\x97\\xbf\\xb4'", "http-code": 503, "error-message": "", "original-exception": "", "no-retry-reason": "non-retryable error", "bounded-retry-delay": null, "attempt": "1/30", "elapsed-seconds": "9.045876264572144/900.0"}

Version: databricks-sql-connector==2.0.4
Platform: 20.04.1-Ubuntu

`.columns()` crashes with DatabaseError

This is the error I see when calling Cursor.columns():

databricks.sql.exc.DatabaseError: Error operating GET_COLUMNS Unexpected character ('t' (code 116)): was expecting comma to separate Object entries
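A minimal reproduction looks roughly like this (hostname/path redacted; no filters are passed, matching the TGetColumnsReq shown further down):

from databricks import sql

connection = sql.connect(
    server_hostname="<redacted>",
    http_path="<redacted>",
    access_token="<redacted>",
)
cursor = connection.cursor()
cursor.columns()   # no catalog/schema/table/column filters -> crashes as above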

Environment looks as follows:

databricks-sql-connector==2.2.2
thrift==0.16.0
pyarrow==10.0.1

I looked at the debug data and I see this response:

Received response: TGetColumnsResp(status=TStatus(statusCode=3, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error operating GET_COLUMNS Unexpected character (\'t\' (code 116)): was expecting comma to separate Object entries\n...

The stack trace looks like this:

 org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$:hiveOperatingError:HiveThriftServerErrors.scala:66
 org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$:hiveOperatingError:HiveThriftServerErrors.scala:60
 org.apache.spark.sql.hive.thriftserver.SparkAsyncOperation$$anonfun$onError$1:applyOrElse:SparkAsyncOperation.scala:196
 org.apache.spark.sql.hive.thriftserver.SparkAsyncOperation$$anonfun$onError$1:applyOrElse:SparkAsyncOperation.scala:181
 scala.runtime.AbstractPartialFunction:apply:AbstractPartialFunction.scala:38
 org.apache.spark.sql.hive.thriftserver.SparkAsyncOperation:$anonfun$wrappedExecute$1:SparkAsyncOperation.scala:169
 scala.runtime.java8.JFunction0$mcV$sp:apply:JFunction0$mcV$sp.java:23
 com.databricks.unity.EmptyHandle$:runWith:UCSHandle.scala:103
 org.apache.spark.sql.hive.thriftserver.SparkAsyncOperation:org$apache$spark$sql$hive$thriftserver$SparkAsyncOperation$$wrappedExecute:SparkAsyncOperation.scala:144
 org.apache.spark.sql.hive.thriftserver.SparkAsyncOperation:runInternal:SparkAsyncOperation.scala:79
 org.apache.spark.sql.hive.thriftserver.SparkAsyncOperation:runInternal$:SparkAsyncOperation.scala:44
 org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation:runInternal:SparkGetColumnsOperation.scala:54
 org.apache.hive.service.cli.operation.Operation:run:Operation.java:383
 org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation:org$apache$spark$sql$hive$thriftserver$SparkOperation$$super$run:SparkGetColumnsOperation.scala:54
 org.apache.spark.sql.hive.thriftserver.SparkOperation:run:SparkOperation.scala:113
 org.apache.spark.sql.hive.thriftserver.SparkOperation:run$:SparkOperation.scala:111
 org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation:run:SparkGetColumnsOperation.scala:54

For completeness, this is what the request looks like:

TGetColumnsReq(
    sessionHandle=TSessionHandle(
        sessionId=THandleIdentifier(...), serverProtocolVersion=None
    ),
    catalogName=None,
    schemaName=None,
    tableName=None,
    columnName=None,
    getDirectResults=TSparkGetDirectResults(maxRows=100000, maxBytes=10485760),
    runAsync=False,
    operationId=None,
    sessionConf=None,
)

SQLAlchemy 2.0 support?

When I try to install it with SQLAlchemy 2.0 installed, pip fails with: ERROR: Cannot install databricks-sql-connector==2.4.0 and sqlalchemy==2.0.4 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested sqlalchemy==2.0.4
databricks-sql-connector 2.4.0 depends on sqlalchemy<2.0.0 and >=1.4.44

Is there a timeline to support SQLAlchemy 2.0?

Autoregistering extensions for Spark SQL?

Our use case is the registration of the Apache Sedona SQL extension to be able to execute geospatial queries:

--conf spark.sql.extensions=org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryo.registrator=org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator

Is there any option to set these configurations with Databricks SQL Connector for Python?
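For what it's worth, sql.connect() does accept a session_configuration dict for Spark session confs (a sketch below), but spark.sql.extensions and the Kryo settings above are static, cluster-level confs, so my assumption is that they still have to be set on the cluster or warehouse itself rather than per session:

from databricks import sql

connection = sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="<http-path>",
    access_token="<token>",
    # Session-scoped Spark confs only; static confs such as spark.sql.extensions
    # are typically rejected or ignored here.
    session_configuration={"spark.sql.session.timeZone": "UTC"},
)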

Support for Python 3.11

I know it is early (Python 3.11 was just released yesterday), but in Apache Airflow we are hoping for a much faster cycle of adding new Python releases, especially since Python 3.11 brings huge performance improvements (25% is the average number claimed) thanks to a very focused effort to increase single-threaded Python performance (the specializing interpreter being the core of it, alongside many other improvements) without changing any Python code.

databricks-sql-python is one of the dependencies of Airflow that need to support the newer Python version, so this issue is here to make you aware that we are eager (and happy to help if needed) to make it happen faster, possibly by also talking to dependencies of Beam (which are often direct dependencies of Airflow as well) and helping them too. I perfectly understand that support needs to bubble up, so your dependencies have to support it first.

A nice summary of Python 3.11 support is here: https://pyreadiness.org/3.11/ - it's not very green yet, obviously, but I hope it gets greener soon.

I'd appreciate it if someone on the Databricks team attempted the migration and kept a PR running (and failing, possibly, until all prerequisites are met), so that we can track the progress and help each other solve any problems. I know PyArrow is blocking for now, but that is likely to be solved soon with apache/arrow#14499, and just opening a failing PR with it might be a good start.

I just opened such PR in Apache Airflow yesterday and plan to keep it open until it gets green :)

apache/airflow#27264

I think it would be fantastic if we could as the open source community migrate to the new Python much faster.

Looking forward to cooperation on that one :)

Support saved queries in SQL data warehouse?

Similar to a stored procedure in SQL Server, there may be many statements in one saved query (with parameters) in a SQL data warehouse.

Is this type of call supported?

how to store SQL query result data to local disk?

I am a newbie to Databricks and am trying to write results into an Excel/CSV file using the command below, but I'm getting a "'DataFrame' object has no attribute 'to_csv'" error while executing.

I am using a notebook to execute my SQL queries and now want to store the results in a CSV or Excel file.

%python

df = spark.sql("""select * from customer""")
Now I want to store the query results in an Excel/CSV file. I have tried the code below, but it's not working:

df.coalesce(1).write.option("header","true").option("sep",",").mode("overwrite").csv("file:///C:/New folder/mycsv.csv")
AND

df.write.option("header", "true").csv("file:///C:/New folder/mycsv.csv")
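Since this repo is about the Python connector rather than notebooks, one way to get a CSV onto your local machine is to run the query through databricks-sql-connector and let pandas write the file (a sketch, with placeholder environment variables):

import os
import pandas as pd
from databricks import sql

with sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_TOKEN"),
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM customer")
        rows = cursor.fetchall()
        df = pd.DataFrame(rows, columns=[c[0] for c in cursor.description])

df.to_csv("customer.csv", index=False)  # written to the local working directory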

Access to Spark UI / Ganglia from connection

It would be nice to be able to get a URL to the Spark UI / Ganglia metrics from the connection object. It would be useful to log these from running jobs for debugging performance issues.

cluster_id = "XXXX-XXXXXX-testtest" # can we get this somehow?
driver_id = "12345678901234567890"  # and this?
spark_ui = f"https://adb-dp-{databricks_workspace}.6.azuredatabricks.net/sparkui/{cluster_id}/driver-{driver_id}/"
ganglia = f"https://adb-dp-{databricks_workspace}.6.azuredatabricks.net/driver-proxy/o/{databricks_workspace}/{cluster_id}/80/?c=cluster"

BINARY column type support

I have a question regarding accessing delta lake tables by python. Similar to this SQLAlchemy example, I tried to save something in a BINARY column like:

    example_bytes = Column(BINARY)  # also tried Binary, LargeBinary, ...

but that crashes with:

sqlalchemy.exc.StatementError: (builtins.AttributeError) module 'databricks.sql' has no attribute 'Binary'

I'm wondering whether databricks-sql-python doesn't support binary columns yet? Other basic types work fine, and executing raw SQL through the connector also works fine. (I'm using the Hive metastore, if that is relevant.)

Support transactions (and pandas.to_sql / read_sql_table)

It's a great project, and truly helps me in a variety of ways. However, I am having a few issues when utilizing it with Pandas.

When reading a table I would like to do:

    def read(
        self, catalog_name: str, schema_name: str, table_name: str
    ) -> pd.DataFrame:
        with self._get_connection(catalog_name) as connection:
            return pd.read_sql_table(
                table_name=table_name,
                schema=schema_name,
                con=connection,
            )

This should return a pandas dataframe.

However, because of the missing features, I am required to do the following with your otherwise great connector.

def read(
        self, catalog_name: str, schema_name: str, table_name: str
    ) -> pd.DataFrame:
        with self._get_connection(catalog_name) as connection:
            query = f"SELECT * FROM {catalog_name}.{schema_name}.{table_name}"
            self.logger.debug("Running query '%s'", query)
            df = pd.read_sql(query, connection)
            self.logger.debug(df)
            return df

Similarly, I cannot use to_sql on a DataFrame, where I'd like to do something like the following:

    def write(
        self,
        dataframe: pd.DataFrame,
        catalog_name: str,
        schema_name: str,
        table_name: str,
    ) -> pd.DataFrame:
        with self._get_connection(catalog_name) as connection:
            dataframe.to_sql(name=table_name, con=connection, schema=schema_name)

Both fail with:

databricks.sql.exc.NotSupportedError: Transactions are not supported on Databricks

Versions:

python 3.10.6
databricks-sql-connector==2.2.1
pandas==1.5.2

SSL errors when pysql used across many threads

Databricks employee here 👋 We've had a few support cases in the last couple of months where pysql fails with unpredictable SSL errors when many connections are opened across multiple threads (this is easily reproduced using dbt with a 10k-model project running 20+ threads). The most common error message we see is Error during request to server: EOF occurred in violation of protocol (_ssl.c:992).

We've also seen unusual 200 HTTP response codes that nevertheless carry a "Connection Refused" error.

We've also seen [Errno 8] nodename nor servname provided, or not known

The fix for this is likely to implement HTTPS connection reuse in the driver. Because we are a Thrift-based client and Thrift is unsophisticated, the driver makes many round trips for each RPC. We're inclined to implement this by subclassing Thrift's HTTP transport and hooking in requests, which performs connection pooling and reuse "for free".
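Roughly, the kind of pooled transport we have in mind looks like this on the requests side (just a sketch, not the actual Thrift subclass):

import requests
from requests.adapters import HTTPAdapter

# One Session per connection, shared across all Thrift RPC round trips,
# so TCP connections and TLS handshakes are reused instead of re-created.
session = requests.Session()
session.mount("https://", HTTPAdapter(pool_connections=1, pool_maxsize=20))

# A Thrift HTTP transport subclass would then route its flush()/POST through
# session.post(url, data=payload, headers=headers) instead of opening a fresh
# http.client connection for every request.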

Error during request to server: [SSL: CERTIFICATE_VERIFY_FAILED]

Getting this error when trying to use this library to connect to Databricks cluster:

Error during request to server: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1091)

How can I disable SSL verification? Something similar to verify=False in the requests library.
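The connector appears to accept an internal, underscore-prefixed flag for this (not a stable public API, so please confirm it exists in your installed version before relying on it):

from databricks import sql

connection = sql.connect(
    server_hostname="<workspace>.cloud.databricks.com",
    http_path="<http-path>",
    access_token="<token>",
    _tls_no_verify=True,  # assumed internal flag: disables certificate verification entirely
)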

Strict numpy dependency

Hi!

I'm running into an issue where this package requires a very specific numpy version, 1.23.4. Aside from wondering why it needs numpy in the first place (I didn't dive into that), I'm wondering whether the pin needs to be this specific.

Context:
We are using Nixpkgs to provide reproducible environments. I'm trying to build the latest dbt-databricks version, and this package is a dependency. The latest stable Nixpkgs gives me numpy 1.23.3. I can override it, but building this specific numpy version myself is quite heavy, and it makes me wonder whether this could be solved differently by relaxing the constraint :-)

Thanks for thinking along!

Pandas 2.0 support

Any idea when pandas 2.0 will be supported? I just tried to install it in my virtual environment and there was a conflict with the databricks-sql-python package.

(screenshot of the dependency conflict omitted)

Issues with using http proxy

Hi there,

I am not able to use sql.connect() with an HTTP proxy:

os.environ['http_proxy'] = f"http://{user}:{pwd}@10.185.190.100:8080"
os.environ['https_proxy'] = f"http://{user}:{pwd}@10.185.190.100:8080"
self.connection = sql.connect(server_hostname, http_path, access_token)

I am getting the following error:

File "C:\Users\User\Documents\GitHub\eve_data_ingestion\3.8.10 64 bit\lib\site-packages\databricks\sql\__init__.py", line 48, in connect
    return Connection(server_hostname, http_path, access_token, **kwargs)
  File "C:\Users\User\Documents\GitHub\eve_data_ingestion\3.8.10 64 bit\lib\site-packages\databricks\sql\client.py", line 109, in __init__
    self.thrift_backend = ThriftBackend(self.host, self.port, http_path,
  File "C:\Users\User\Documents\GitHub\eve_data_ingestion\3.8.10 64 bit\lib\site-packages\databricks\sql\thrift_backend.py", line 115, in __init__
    self._transport = thrift.transport.THttpClient.THttpClient(
  File "C:\Users\User\Documents\GitHub\eve_data_ingestion\3.8.10 64 bit\lib\site-packages\thrift\transport\THttpClient.py", line 86, in __init__
    self.proxy_auth = self.basic_proxy_auth_header(parsed)
  File "C:\Users\User\Documents\GitHub\eve_data_ingestion\3.8.10 64 bit\lib\site-packages\thrift\transport\THttpClient.py", line 101, in basic_proxy_auth_header
    cr = base64.b64encode(ap).strip()
  File "C:\Users\User\AppData\Local\Programs\Python\Python38\lib\base64.py", line 58, in b64encode
    encoded = binascii.b2a_base64(s, newline=False)

TypeError: a bytes-like object is required, not 'str'

Is this a known issue? Is there another way to use proxies?


Python 3.8.10 (64bit Windows)
databricks-sql-connector 2.0.2

db.sql.Connection() catalog and schema args surprisingly do not apply to metadata calls like cursor.tables

Observation // Concern
The catalog and schema initial values can be set when creating a Connection, but these values are not applied in subsequent cursor metadata calls: catalogs(), schemas(), and tables(). Those functions are unaware of the earlier setting.

Current documentation states:

:param catalog: An optional initial catalog to use. Requires DBR version 9.0+
:param schema: An optional initial schema to use. Requires DBR version 9.0+

Both the scope and the duration of this setting are unstated. It currently applies only to statements executed in the session that was opened; metadata calls are entirely ignorant of the setting, which surprised at least one user, who expected the catalog setting to carry through in those calls.

Ask
Define and document scope of these parameters. Consider whether they should be applied, possibly as default values, to metadata calls.

Library seems to ignore Schema/Database parameter

Setting the schema parameter does not have the desired effect; I have to set the schema manually after the session is created.
This is a tad confusing, since Databricks more or less merges the two terms (schema == database in the context of Databricks).

from databricks import sql
connection = sql.connect(server_hostname="myorg.cloud.databricks.com",
                                            http_path="/sql/protocolv1/o/<redacted>",
                                            access_token="<redacted>",
                                            schema="widget_schema")
cursor = connection.cursor()
cursor.execute('select current_database();')
print(cursor.fetchall())
## Prints "default" we would expect "widget_schema"

cursor.close()
connection.close()

I observed this behavior while trying to use the sqlalchemy-databricks package: it relies primarily on this package and on setting this parameter via the DB URL or a connect kwarg (neither works). Implementations such as Flask+SQLAlchemy give no other way to set the working schema. So I came here and was able to replicate the behavior purely with this package, which leads me to believe the parameter is not working in this package.

I also saw a stipulation that it requires DBR 9.0+; I tried this on a 10.x cluster as well as a SQL warehouse, with the same behavior.

I was able to reproduce this strictly with this library.
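The workaround I've settled on for now is to set the schema explicitly right after connecting:

cursor = connection.cursor()
cursor.execute("USE widget_schema")
cursor.execute("select current_database()")
print(cursor.fetchall())  # now reports widget_schema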

databricks.sql.exc.RequestError: Error during request to server

Repro steps:

  1. pipenv --python 3.8
  2. pipenv shell
  3. pip install databricks-sql-connector
  4. export DATABRICKS_SERVER_HOSTNAME="https://<redacted>.cloud.databricks.com" && export DATABRICKS_HTTP_PATH="sql/protocolv1/o/<redacted>/<redacted>" && export DATABRICKS_TOKEN="dapi<redacted>"
  5. Code (main.py):
from databricks import sql
import os

with sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),
                 http_path       = os.getenv("DATABRICKS_HTTP_PATH"),
                 access_token    = os.getenv("DATABRICKS_TOKEN")) as connection:

  with connection.cursor() as cursor:
    cursor.execute("SELECT * FROM default.diamonds LIMIT 2")
    result = cursor.fetchall()

    for row in result:
      print(row)
  6. python main.py
  7. Traceback:
Traceback (most recent call last):
  File "main.py", line 10, in <module>
    with sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),
  File "/Users/paul.cornell/.local/share/virtualenvs/paul.cornell-<redacted>/lib/python3.8/site-packages/databricks/sql/__init__.py", line 48, in connect
    return Connection(server_hostname, http_path, access_token, **kwargs)
  File "/Users/paul.cornell/.local/share/virtualenvs/paul.cornell-<redacted>/lib/python3.8/site-packages/databricks/sql/client.py", line 112, in __init__
    self._session_handle = self.thrift_backend.open_session(session_configuration, catalog,
  File "/Users/paul.cornell/.local/share/virtualenvs/paul.cornell-<redacted>/lib/python3.8/site-packages/databricks/sql/thrift_backend.py", line 341, in open_session
    response = self.make_request(self._client.OpenSession, open_session_req)
  File "/Users/paul.cornell/.local/share/virtualenvs/paul.cornell-<redacted>/lib/python3.8/site-packages/databricks/sql/thrift_backend.py", line 287, in make_request
    self._handle_request_error(error_info, attempt, elapsed)
  File "/Users/paul.cornell/.local/share/virtualenvs/paul.cornell-<redacted>/lib/python3.8/site-packages/databricks/sql/thrift_backend.py", line 199, in _handle_request_error
    raise network_request_error
databricks.sql.exc.RequestError: Error during request to server
  8. Tried export CERT_PATH=$(python -m certifi) && export SSL_CERT_FILE=${CERT_PATH} && export REQUESTS_CA_BUNDLE=${CERT_PATH} and ran python main.py again, but got the same traceback.

Environment:

  • Python 3.8.10
  • pip list:

Package Version
certifi 2022.6.15
databricks-sql-connector 2.0.2
numpy 1.23.0
pandas 1.4.3
pip 21.0.1
pyarrow 8.0.0
python-dateutil 2.8.2
pytz 2022.1
setuptools 54.1.2
six 1.16.0
thrift 0.16.0
wheel 0.36.2

PyArrow version requirement

Hello! Is the version requirement on the pyarrow package strict?

pyarrow = "^9.0.0"

We would like to use databricks-sql-python in a project that depends on a lower version of pyarrow (8.0.0). We ran into conflicting version issues when using package management tools such as Poetry. In our tests databricks-sql-python seems to work with older versions of pyarrow such as 8.0.0. It would be great if databricks-sql-python could specify a wider range of pyarrow versions for better compatibility.

SQLAlchemy dialect native bool support

Should be

class DatabricksDialect(default.DefaultDialect):
    """This dialect implements only those methods required to pass our e2e tests"""

    supports_native_boolean: bool = True

as otherwise booleans are converted to ints, which results in datatype errors.

ENH: Add LOAD DATA LOCAL capability

Add a LOAD DATA LOCAL type capability: take files local to the client, load them to a DBFS/Volumes/bucket/storage-account staging area, then COPY INTO a Delta Lake table, and optionally remove the data from the staging area after the COPY INTO.
The benefit is to support bulk data loads from any client.

sqlalchemy.inspect.has_table ignores schema.

It works if I pass the schema to create_engine and call has_table with no schema. I think this could be fixed by changing line 285 to reflect the schema here:

def has_table(self, connection, table_name, schema=None, **kwargs) -> bool:
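A possible shape of the fix, assuming has_table can be built on top of get_table_names (a sketch, not the dialect's actual code):

def has_table(self, connection, table_name, schema=None, **kwargs) -> bool:
    # Respect the schema argument instead of silently falling back to the
    # connection's default schema.
    tables = self.get_table_names(connection, schema=schema)
    return table_name.lower() in (t.lower() for t in tables)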

Example python script:

import os
import sqlalchemy as sa

host = os.getenv("DATABRICKS_SERVER_HOSTNAME")
http_path = os.getenv("DATABRICKS_HTTP_PATH")
access_token = os.getenv("DATABRICKS_TOKEN")
catalog = os.getenv("DATABRICKS_CATALOG")

engine = sa.create_engine(f"databricks://token:{access_token}@{host}?http_path={http_path}&catalog={catalog}&schema=temp")
with engine.connect() as conn:
    conn.exec_driver_sql("drop table if exists chuck_test;")
    conn.exec_driver_sql("create table chuck_test (a int);")
    assert sa.inspect(conn).has_table("chuck_test")

engine = sa.create_engine(f"databricks://token:{access_token}@{host}?http_path={http_path}&catalog={catalog}")
with engine.connect() as conn:
    assert sa.inspect(conn).has_table("chuck_test", schema="temp")

Output:

Traceback (most recent call last):
  File "C:\temp\test_databricks.py", line 17, in <module>
    assert sa.inspect(conn).has_table("chuck_test", schema="temp")
AssertionError

[Error] numpy: cannot import name 'Connection' from partially initialized module 'databricks.sql.client'

Issue

Attempting to follow the example from the Databricks SQL Connector for Python guide
produces a circular import error.

Traceback (most recent call last):
  File "/Users/apoclyps/workspace/src/select.py", line 4, in <module>
    with sql.connect(
  File "/Users/apoclyps/.pyenv/versions/3.10.3/lib/python3.10/site-packages/databricks/sql/__init__.py", line 47, in connect
    from .client import Connection
  File "/Users/apoclyps/.pyenv/versions/3.10.3/lib/python3.10/site-packages/databricks/sql/client.py", line 8, in <module>
    import pandas
  File "/Users/apoclyps/.pyenv/versions/3.10.3/lib/python3.10/site-packages/pandas/__init__.py", line 16, in <module>
    raise ImportError(
ImportError: Unable to import required dependencies:
numpy: cannot import name 'Connection' from partially initialized module 'databricks.sql.client' (most likely due to a circular import) (/Users/apoclyps/.pyenv/versions/3.10.3/lib/python3.10/site-packages/databricks/sql/client.py)

Steps to Reproduce

pip install databricks-sql-connector==2.0.2

export DATABRICKS_SERVER_HOSTNAME="https://<redacted>.cloud.databricks.com"
export DATABRICKS_HTTP_PATH="sql/protocolv1/o/<redacted>/<redacted>"
export DATABRICKS_TOKEN="dapi<redacted>"

python main.py

contents of main.py:

import os

from databricks import sql

with sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_TOKEN"),
) as connection:

    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        result = cursor.fetchall()

        for row in result:
            print(row)

Additional Information

I believe the issue is caused by a * import that creates a circular import between databricks.sql and databricks.sql.client in databricks/sql/client.py.

TLS version handshake issue

I am facing an issue with this connector. It works 100% of the time from my desktop, but from an Azure VM it only succeeds randomly. I ran a Wireshark trace and can see that when it fails, the client attempts a TLSv1 handshake, which the Databricks service detects and answers with a reset. When it happens to try TLSv1.3, the connection works. I have tried setting the http_header tls-version: TLSv1.3, but it doesn't seem to have any effect.

In this screenshot you can see a success and a failure side by side:

(screenshot of the Wireshark capture omitted)

[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1056)

When running the code at the bottom in a Conda environment I have no problem connecting. However, when switching to a pip virtualenv I get an SSL certificate error. I need the code to run in a virtualenv in order to deploy to AWS Lambda.

My guess is that pip deals with SSL certificates differently. Conda seems to ship additional openssl and ca-certificates libraries that pip does not. How can I resolve this issue?

File "/env/venv/lib/python3.7/site-packages/databricks/sql/thrift_backend.py", line 258, in _handle_request_error raise network_request_error databricks.sql.exc.RequestError: Error during request to server: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1056)

Pip Environment

  1. pipenv on python 3.7.2
  2. Environment variables are correctly set.

Pip Packages:

certifi==2022.12.7
charset-normalizer==2.1.1
databricks==0.2
databricks-sql-connector==2.2.2
idna==3.4
lz4==4.3.2
numpy==1.21.1
oauthlib==3.2.2
pandas==1.3.5
pyarrow==10.0.1
python-dateutil==2.8.2
pytz==2022.7
requests==2.28.1
six==1.16.0
thrift==0.16.0
urllib3==1.26.13

Conda packages: (package list not shown)

The code:

from databricks import sql
import os

with sql.connect(server_hostname = os.getenv("DATABRICKS_SERVER_HOSTNAME"),
                 http_path = os.getenv("DATABRICKS_HTTP_PATH"),
                 access_token = os.getenv("DATABRICKS_TOKEN")) as connection:

    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM default.diamonds LIMIT 2")
        result = cursor.fetchall()

Any tips would be greatly appreciated.

[Question] connection to specific database

Hi guys,

Thanks for this wonderful tool.

I was wondering if there is a way to restrict the connection to a specific database. I ask because, in the example, the connection is made to the SQL warehouse and gives me access to all databases. What if I want a connection scoped to a single database, so that if someone queries a table outside of it they get some kind of connection error?

This is my code:

with sql.connect(
        server_hostname=settings.DATABRICKS_SERVER_HOSTNAME,
        http_path=settings.DATABRICKS_HTTP_PATH,
        access_token=settings.DATABRICKS_ACCESS_TOKEN,
    ) as conn:
        with conn.cursor() as cursor:
            cursor.execute(
                f"SELECT * FROM {settings.DATABRICKS_DATABASE_URL} LIMIT 5;"
            )
            
            for row in cursor.fetchall():
                print(row)

This is OK if the query string does not change, but if I want a complex query I will have to change the string passed to cursor.execute, and I will still have access to every database.

Duplicated tables and views

_tables = [i[TABLE_NAME] for i in data]
lists tables from "SHOW TABLES FROM {}", but it does not take into account that Databricks currently lists views as well as tables there. A possible correction would be to add a filter in the iteration, e.g. [i[TABLE_NAME] for i in data if i[TABLE_NAME] not in get_view_names(self, connection, schema, **kwargs)]. I am not sure whether the current behaviour is correct, but I find it annoying when using the library with Apache Superset, where table names are duplicated by this issue: they are listed both as a view and as a table.
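Spelled out as a sketch (data and TABLE_NAME are whatever the dialect already has in scope at that point):

views = set(self.get_view_names(connection, schema, **kwargs))
# SHOW TABLES also returns views; keep only real tables.
_tables = [row[TABLE_NAME] for row in data if row[TABLE_NAME] not in views]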

Add supported python versions shield badge to README

[Written by @susodapop]

Shields.io has a badge for supported python versions. It should work for this repository, but doesn't:

https://img.shields.io/pypi/pyversions/databricks-sql-connector

This probably requires a change to our pyproject.toml. It could take some trial and error, since Shields.io pulls only from live PyPI, and I don't want to push a bunch of no-op versions to PyPI just to make this badge work.

OAuth breaks if a bad .netrc file is present

While fixing databricks/dbt-databricks#337 I found that the same bug plaguing dbt-databricks affects pysql as well.

To reproduce this issue:

  1. Add an intentionally bad ~/.netrc to your workstation, like this:
machine <my-workspace>.cloud.databricks.com
login token
password <expired_token>
  2. Try to run the interactive_oauth.py example using the same host name specified in the .netrc file.
  3. You'll receive this exception:
    access_token = oauth_response["access_token"]
KeyError: 'access_token'

The fix

Straightforward: force requests not to use the .netrc file when making requests to Databricks OAuth endpoints. These requests are unauthenticated (no auth header is required). The bug is that if you include an auth header in the request to https://****.staging.cloud.databricks.com/oidc/v1/token, the Databricks runtime returns an error response. oauth.py looks for access_token in that response, doesn't find one, and so raises an Exception.

If a .netrc file is present, requests always uses it, even for these requests that are supposed to be unauthenticated. So we need to force it not to do this.
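One way to do that with requests is to pass an explicit no-op auth object, since requests only consults ~/.netrc when no auth is supplied (a sketch, not necessarily the exact fix that landed):

import requests
from requests.auth import AuthBase

class IgnoreNetrcAuth(AuthBase):
    """Attach no Authorization header, but stop requests from falling back to ~/.netrc."""
    def __call__(self, r):
        return r

token_url = "https://<workspace>/oidc/v1/token"  # placeholder endpoint
resp = requests.post(token_url, data={"grant_type": "..."}, auth=IgnoreNetrcAuth())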

Consider changing the version specification for `thrift`

Consider changing the version specification for thrift - instead of having it as

  • thrift>=0.13.0,<0.14.0

consider having

  • thrift>=0.13.0,<1.0.0

This may or may not make sense, as I am not entirely sure what kind of versioning Apache Thrift uses; it doesn't seem to follow semver.

Only cache result used even source data changed

I found a strange issue where only the cached result is returned when using the Python SQL connector. I have been testing the same query over a span of one hour and the same result keeps coming back from the cache, even though the source data changed during that time.

The exact same query works fine and returns a fresh result in the SQL warehouse UI's SQL editor (I copied the query sent from the API out of the query history).

(screenshot of the query history omitted)

I ended up running this to disable the cache manually: cursor.execute("SET use_cached_result = false")

databricks-sql-connector 2.2.0 stopped working on arm64 (Mac M1) without gcc installed

Hi,

I am trying to run the following inside an apache/airflow:2.4.1-python3.9 container:
pip install databricks-sql-connector==2.2.0

Which gives the following error message:

Requirement already satisfied: types-cryptography>=3.3.21 in /home/airflow/.local/lib/python3.9/site-packages (from pyjwt>=2.0.0->apache-airflow>=2.2.0->apache-airflow-providers-databricks==3.3.*->-r /requirements.txt (line 1)) (3.3.23)
Building wheels for collected packages: lz4, thrift
  Building wheel for lz4 (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  Γ— Building wheel for lz4 (pyproject.toml) did not run successfully.
  β”‚ exit code: 1
  ╰─> [19 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-aarch64-cpython-39
      creating build/lib.linux-aarch64-cpython-39/lz4
      copying lz4/__init__.py -> build/lib.linux-aarch64-cpython-39/lz4
      copying lz4/version.py -> build/lib.linux-aarch64-cpython-39/lz4
      creating build/lib.linux-aarch64-cpython-39/lz4/frame
      copying lz4/frame/__init__.py -> build/lib.linux-aarch64-cpython-39/lz4/frame
      creating build/lib.linux-aarch64-cpython-39/lz4/block
      copying lz4/block/__init__.py -> build/lib.linux-aarch64-cpython-39/lz4/block
      running build_ext
      building 'lz4._version' extension
      creating build/temp.linux-aarch64-cpython-39
      creating build/temp.linux-aarch64-cpython-39/lz4
      creating build/temp.linux-aarch64-cpython-39/lz4libs
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Ilz4libs -I/usr/local/include/python3.9 -c lz4/_version.c -o build/temp.linux-aarch64-cpython-39/lz4/_version.o -O3 -Wall -Wundef
      error: command 'gcc' failed: Permission denied
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for lz4
  Building wheel for thrift (setup.py) ... done
  Created wheel for thrift: filename=thrift-0.16.0-py3-none-any.whl size=155972 sha256=6a40f8a73d28441134f8e6533d0693b4f99ca5c2493ddb49ba63927e57feba37
  Stored in directory: /home/airflow/.cache/pip/wheels/78/bc/b1/791fc6ee39415f4a1843eeaf990a680472d7ff398bd64f913a
Successfully built thrift
Failed to build lz4
ERROR: Could not build wheels for lz4, which is required to install pyproject.toml-based projects

However, when I run
pip install databricks-sql-connector==2.1.0
it works

Also note that I can run pip install databricks-sql-connector==2.2.0
if I have gcc installed in the Docker container:
apt-get update && apt-get -y install gcc

My current workaround is to pin databricks-sql-connector==2.1.0 or to install gcc explicitly. Any other ideas on how this can be fixed?

cursor.execute(query, params) error while trying to set table name

I'm trying to dynamically set the table name in my queries. To avoid SQL injection I'm using the cursor.execute(query, params) option.

When I try to set the table name I get the error:

[PARSE_SYNTAX_ERROR] Syntax error at or near ''my_table_name''(line 1, pos 14)

== SQL ==
SELECT * FROM 'my_table_name'
--------------^^^

To reproduce:

with sql.connect(server_hostname=self.hostname, http_path=self.path, access_token=self.token) as connection:
    with connection.cursor() as cursor:
         cursor.execute("SELECT * FROM %(table_name)s", {"table_name": "my_table_name"})
         result = cursor.fetchall()

         for row in result:
             print(row)

It seems the table name can't have quotes. The only way I can do this is with:

cursor.execute("SELECT * FROM {}".format("my_table_name"))

Or other unsafe string substitution.
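For what it's worth, parameter binding quotes values as string literals, so identifiers (table/column names) can't be passed that way. A common mitigation, sketched below (this is not connector functionality), is to validate the identifier yourself and then interpolate it:

import re

def safe_identifier(name: str) -> str:
    # Allow only simple unqualified identifiers; reject everything else.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return f"`{name}`"  # backtick-quote for Spark SQL

cursor.execute(f"SELECT * FROM {safe_identifier('my_table_name')}")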

Am I doing something wrong?

Making py connector to raise an error for wrong SQL when asking to plan a query

Hey,

My aim is to validate a given SQL string without actually running it.

I thought I could use the EXPLAIN statement to do so.

So I tried using the connector to explain a query and thereby determine whether it's valid or not. Example Python code:

import databricks.sql

with databricks.sql.connect(...) as connection:
   with connection.cursor() as cursor:
      cursor.execute("EXPLAIN SELECT BAD-QUERY AS FOO")
      r = cursor.fetchall()

The problem with that implementation is that the driver does not throw an error but instead returns a string containing the error details, so I need to parse the string result to determine whether the explained query was valid or not.

So I was wondering whether there is some kind of setting/parameter/configuration I can use to change the result described above, or whether I should suggest such a feature for the connector.
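In the meantime, the brittle workaround is to scan the EXPLAIN output for error markers (the marker strings below are an assumption about the server's output, not a documented contract):

cursor.execute("EXPLAIN SELECT BAD-QUERY AS FOO")
plan_text = "\n".join(str(row[0]) for row in cursor.fetchall())

error_markers = ("AnalysisException", "PARSE_SYNTAX_ERROR")
is_valid = not any(marker in plan_text for marker in error_markers)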

Many thanks in Advance!
