I would like to be able to use this library with the latest

Unpin `pandas` about databricks-sql-python HOT 10 OPEN

dhirschfeld commented on June 26, 2024

Unpin `pandas`

from databricks-sql-python.

Comments (10)

dhirschfeld commented on June 26, 2024

The pin was added in:

#330

To fix the issue described in:

#326

...but that just avoids the problem whilst causing another problem; this library can't be used with the latest pandas :/

dhirschfeld commented on June 26, 2024

I'm opening this issue to track any progress towards compatibility with the latest pandas version.

dhirschfeld commented on June 26, 2024

Bump! I would like to upgrade to the latest version but am stuck on 3.0.1 because of this pin 😔

benc-db commented on June 26, 2024

Does 3.0.1 work with latest pandas? That would be an interesting data point.

dhirschfeld commented on June 26, 2024

> Does 3.0.1 work with latest pandas? That would be an interesting data point.

I've been using 3.0.1 in combination with pandas 2.2.2 with no issues:

❯ pip list | rg 'pandas|databricks'
databricks-connect              14.3.1
databricks-sdk                  0.20.0
databricks-sql-connector        3.0.1
pandas                          2.2.2

...but that's apparently only because I don't query all-integer data sources.
Running:

with engine.connect() as conn:
    res = conn.execute(sa.text("select 1")).scalar_one()

gives:

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

dhirschfeld commented on June 26, 2024

It seems like it doesn't like assigning a None into an integer array:

> /opt/python/envs/dev310/lib/python3.10/site-packages/pandas/core/internals/managers.py(1703)as_array()
   1701             pass
   1702         else:
-> 1703             arr[isna(arr)] = na_value
   1704 
   1705         return arr.transpose()

ipdb>  arr
array([[1]], dtype=int32)

ipdb>  isna(arr)
array([[False]])

ipdb>  na_value

ipdb>  na_value is None
True

If we go up the stack we can see we get type errors if we try to assign anything other than an integer:

> /opt/python/envs/dev310/lib/python3.10/site-packages/databricks/sql/client.py(1149)_convert_arrow_table()
   1147         )
   1148 
-> 1149         res = df.to_numpy(na_value=None)
   1150         return [ResultRow(*v) for v in res]
   1151 

ipdb>  df.to_numpy(na_value=None)
*** TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

ipdb>  df.to_numpy(na_value=float('NaN'))
*** ValueError: cannot convert float NaN to integer

ipdb>  df.to_numpy(na_value=-99)
array([[1]], dtype=int32)

Casting to object before assigning does seem to work:

ipdb>  df.astype(object).to_numpy(na_value=None)
array([[1]], dtype=object)
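The debugger session above can be reproduced outside the connector with a small standalone sketch. The frames here are made up for illustration (a single nullable `Int32` column, similar to what the `types_mapper` produces), and whether the direct call raises depends on the installed pandas version, so it is wrapped in a `try`:

```python
import pandas as pd

# Illustrative frames (not taken from the library): nullable Int32 columns
# like the ones the connector builds for integer Arrow columns.
df_plain = pd.DataFrame({"0": pd.array([1], dtype="Int32")})
df_with_na = pd.DataFrame({"0": pd.array([1, None], dtype="Int32")})

# The direct call may or may not raise, depending on the pandas version.
try:
    df_plain.to_numpy(na_value=None)
    direct_call_ok = True
except (TypeError, ValueError):
    direct_call_ok = False
print("direct to_numpy(na_value=None) succeeded:", direct_call_ok)

# Casting to object first gives an array that is able to hold None.
rows_plain = df_plain.astype(object).to_numpy(na_value=None).tolist()
rows_with_na = df_with_na.astype(object).to_numpy(na_value=None).tolist()
print(rows_plain)
print(rows_with_na)
```

The object cast costs an extra copy, but the result no longer depends on whether an integer array can store the `na_value`.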

dhirschfeld commented on June 26, 2024

The problematic function:

databricks-sql-python/src/databricks/sql/client.py

Lines 1130 to 1166 in a6e9b11

def _convert_arrow_table(self, table):
    column_names = [c[0] for c in self.description]
    ResultRow = Row(*column_names)

    if self.connection.disable_pandas is True:
        return [
            ResultRow(*[v.as_py() for v in r]) for r in zip(*table.itercolumns())
        ]

    # Need to use nullable types, as otherwise type can change when there are missing values.
    # See https://arrow.apache.org/docs/python/pandas.html#nullable-types
    # NOTE: This api is epxerimental https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
    dtype_mapping = {
        pyarrow.int8(): pandas.Int8Dtype(),
        pyarrow.int16(): pandas.Int16Dtype(),
        pyarrow.int32(): pandas.Int32Dtype(),
        pyarrow.int64(): pandas.Int64Dtype(),
        pyarrow.uint8(): pandas.UInt8Dtype(),
        pyarrow.uint16(): pandas.UInt16Dtype(),
        pyarrow.uint32(): pandas.UInt32Dtype(),
        pyarrow.uint64(): pandas.UInt64Dtype(),
        pyarrow.bool_(): pandas.BooleanDtype(),
        pyarrow.float32(): pandas.Float32Dtype(),
        pyarrow.float64(): pandas.Float64Dtype(),
        pyarrow.string(): pandas.StringDtype(),
    }

    # Need to rename columns, as the to_pandas function cannot handle duplicate column names
    table_renamed = table.rename_columns([str(c) for c in range(table.num_columns)])
    df = table_renamed.to_pandas(
        types_mapper=dtype_mapping.get,
        date_as_object=True,
        timestamp_as_object=True,
    )

    res = df.to_numpy(na_value=None)
    return [ResultRow(*v) for v in res]

dhirschfeld commented on June 26, 2024

I can work around the issue by disabling pandas:

with engine.connect() as conn:
    cursor = conn.connection.cursor()
    cursor.connection.disable_pandas = True
    res = cursor.execute("select 1").fetchall()

>>> res
[Row(1=1)]

...but obviously the casting to numpy needs to be fixed.

dhirschfeld commented on June 26, 2024

Probably casting to object before assigning a None value is the right fix.

diego-jd commented on June 26, 2024

I second this. I cannot use pd.read_sql_query() because of this requirement.

Also, it would be good to remove the distutils dependency.
