Comments (5)
Thanks for the detailed write-up. There are a few things happening here.
First up, you cannot use `autoincrement=True` with the Databricks dialect. This is discussed in our dialect README here. Support for it will come in a future release.
Second, it looks like you're using the old SQLAlchemy 1.x syntax for your model code. The dialect included in databricks-sql-connector>=3.0.0 is built exclusively for SQLAlchemy 2.x. It may work, but we can't guarantee it. You can see an example of the new syntax in our e2e tests here.
Third, the actual exception is a syntax error: SQLAlchemy is emitting an invalid SQL statement that omits the column names in the INSERT. I'm not sure how this is happening, as it's not something we observe in the 1000+ INSERT test cases we run during development. I wonder if you may be using an older SQLAlchemy version, below 2.0.0.
Which SQLAlchemy version do you have installed?
from databricks-sql-python.
I tried reproducing this error locally with:
- Python 3.11.4
- SQLAlchemy==2.0.22
- databricks-sql-connector==3.0.1
and am not able to make SQLAlchemy emit an INSERT statement that omits the column names.
Can you provide a reproducible example? FWIW: I don't think the Excel file has any bearing on this behaviour. In my attempts to reproduce I used both an Excel file as input and a randomly generated pandas dataframe. It worked as expected in both cases.
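For anyone following along, a sketch of that reproduction attempt looks roughly like this (shown here against in-memory SQLite so it runs without credentials; the Databricks run used the same ORM code with a `databricks://` connection string, and with `echo=True` the logged INSERT includes the column names):

```python
import pandas as pd
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class ModelIntegrated(Base):
    __tablename__ = "model_integrated"
    id = Column(Integer, primary_key=True)
    name = Column(String)


# In-memory SQLite stands in for the Databricks connection string here.
engine = create_engine("sqlite://", echo=True)
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Randomly chosen rows stand in for the Excel input.
df = pd.DataFrame({"name": ["alpha", "beta", "gamma"]})
for _, row in df.iterrows():
    model = ModelIntegrated(name=row["name"])
    session.add(model)
    session.flush()  # emits: INSERT INTO model_integrated (name) VALUES (?)
    print(model.id, model.name)
session.commit()
```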
Thanks @susodapop . I was running SQLAlchemy==1.4.49.
I upgraded to 2.0.22, but I'm still facing the same issue...
I think the issue is that the Excel file has no "id" column.
I want SQLAlchemy to add the "Model" rows to Databricks, have Databricks + SQLAlchemy issue an id (primary key), and then return that id so I can use it as a foreign key in the tables that follow.
This works for a PostgreSQL database with the exact same data.
```python
for tab_name in xls.sheet_names:
    print(f"Processing tab: {tab_name}")
    df = pd.read_excel(excel_file_path, sheet_name=tab_name)
    # Convert column names to lowercase
    df.columns = df.columns.str.lower()
    print("Data in the tab:")
    print(df)
    if tab_name == 'Model':
        print("Processing Model data")
        for _, row in df.iterrows():
            model = ModelIntegrated(name=row['name'])
            session.add(model)
            session.flush()  # Get the auto-generated ID
            model_id = model.id  # Retrieve the ID
            print(f"Inserted Model with name: {model.name}, ID: {model_id}")
```
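For comparison, here is a self-contained version of that pattern (hard-coded rows instead of the Excel file, and in-memory SQLite standing in for PostgreSQL) where `flush()` does populate the id:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class ModelIntegrated(Base):
    __tablename__ = "model_integrated"
    id = Column(Integer, primary_key=True)
    name = Column(String)


# SQLite stands in for PostgreSQL here; both dialects can return
# the auto-generated primary key when the row is flushed.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

model = ModelIntegrated(name="Model A")
session.add(model)
session.flush()      # the INSERT runs here and populates model.id
model_id = model.id  # already available before commit
session.commit()
```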
Thank you,
Brent
Just getting back to this after traveling.
Can you please provide a runnable reproduction? I have attempted to reproduce the issue using the code you provided, but it works for me. I don't have access to your Excel file (and, as noted above, I don't think the Excel file is really the problem). Without reproduction steps we're blocked on implementing a fix.
Hi @susodapop, here is a runnable, reproducible example. You'll just need to fill in your Databricks placeholders and the location of the Excel file (attached).
```python
import pandas as pd
from datetime import datetime
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey, Numeric, DateTime
from sqlalchemy.orm import declarative_base, sessionmaker, relationship


def process_excel_data(excel_file_path, host, http_path, access_token, catalog, schema):
    # Replace placeholders with actual values
    HOST = host
    HTTP_PATH = http_path
    ACCESS_TOKEN = access_token

    # SQLAlchemy setup
    Base = declarative_base()

    # Model class for "model" table
    class ModelIntegrated(Base):
        __tablename__ = 'model_integrated'
        id = Column(Integer, primary_key=True)
        name = Column(String)
        periods = relationship('PeriodIntegrated', backref='model_integrated')

    # Model class for "period" table
    class PeriodIntegrated(Base):
        __tablename__ = 'period_integrated'
        id = Column(Integer, primary_key=True)
        model_id = Column(Integer, ForeignKey('model_integrated.id'))
        name = Column(String)
        solution = Column(Numeric)
        start = Column(DateTime)
        end = Column(DateTime)
        period_order = Column(Integer)

    conn_string = (
        f"databricks://token:{ACCESS_TOKEN}@{HOST}?http_path={HTTP_PATH}&catalog={catalog}&schema={schema}"
    )
    engine = create_engine(conn_string, echo=True)
    Session = sessionmaker(bind=engine)
    session = Session()

    xls = pd.ExcelFile(excel_file_path)
    for tab_name in xls.sheet_names:
        print(f"Processing tab: {tab_name}")
        df = pd.read_excel(excel_file_path, sheet_name=tab_name)
        # Convert column names to lowercase
        df.columns = df.columns.str.lower()
        print("Data in the tab:")
        print(df)
        if tab_name == 'Model':
            print("Processing Model data")
            for _, row in df.iterrows():
                # PostgreSQL solution
                model = ModelIntegrated(name=row['name'])
                session.add(model)
                session.commit()  # Commit the transaction
                # Retrieve the generated ID using a separate query
                model_id = session.query(ModelIntegrated.id).filter_by(name=row['name']).scalar()
                print(f"Inserted Model with name: {model.name}, ID: {model_id}")
        elif tab_name == 'Period':
            print("Processing Period data")
            # Sort the DataFrame by "start" dates in ascending order
            df_sorted = df.sort_values(by='start')
            # Add a new column "period_order" with ascending integer values
            df_sorted['period_order'] = range(1, len(df_sorted) + 1)
            for _, row in df_sorted.iterrows():
                period = PeriodIntegrated(
                    model_id=model_id,
                    name=row['name'],
                    solution=row['solution'],
                    start=datetime.strptime(row['start'], '%Y-%m-%d %I:%M:%S %p'),  # Convert to datetime
                    end=datetime.strptime(row['end'], '%Y-%m-%d %I:%M:%S %p')  # Convert to datetime
                )
                # Set the "period_order" attribute with the value from the DataFrame
                period.period_order = row['period_order']
                session.add(period)
                print(f"Inserted Period with name: {period.name}, Period Order: {period.period_order}")

    # Commit the changes
    session.commit()
    session.close()


# Example usage
excel_path = "<location>/githubtest.xlsx"
process_excel_data(excel_path, "<your_host>", "<your_http_path>", "<your_access_token>", "<your_catalog>", "<your_schema>")
```