Comments (5)
Thanks for the detailed write-up. There are a few things happening here.
First up, you cannot use `autoincrement=True` with the Databricks dialect. This is discussed in our dialect README here. Support for it will come in a future release.
Second, it looks like you're using the old SQLAlchemy 1.x syntax for your model code. The dialect included in databricks-sql-connector>=3.0.0 is built exclusively for SQLAlchemy 2.x. It may work, but we can't guarantee it. You can see an example of the new syntax in our e2e tests here.
Third, the actual exception is a syntax error: SQLAlchemy is emitting an invalid SQL statement that omits the column names in the INSERT. I'm not sure how this is happening, as it's not something we observe in the 1000+ INSERT test cases we run during development. I wonder if you may be using an older SQLAlchemy version, below 2.0.0.
Which SQLAlchemy version do you have installed?
from databricks-sql-python.
I tried reproducing this error locally with:
- Python 3.11.4
- SQLAlchemy==2.0.22
- databricks-sql-connector==3.0.1
and am not able to make SQLAlchemy emit an INSERT statement that omits the column names.
Can you provide a reproducible example? FWIW: I don't think the Excel file has any bearing on this behaviour. In my attempts to reproduce I used both an Excel file as input and a randomly generated pandas dataframe. It worked as expected in both cases.
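For anyone following along, a sketch of that reproduction attempt looks roughly like this (shown here against in-memory SQLite so it runs without credentials; the Databricks run used the same ORM code with a `databricks://` connection string, and with `echo=True` the logged INSERT includes the column names):

```python
import pandas as pd
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class ModelIntegrated(Base):
    __tablename__ = "model_integrated"
    id = Column(Integer, primary_key=True)
    name = Column(String)


# In-memory SQLite stands in for the Databricks connection string here.
engine = create_engine("sqlite://", echo=True)
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Randomly chosen rows stand in for the Excel input.
df = pd.DataFrame({"name": ["alpha", "beta", "gamma"]})
for _, row in df.iterrows():
    model = ModelIntegrated(name=row["name"])
    session.add(model)
    session.flush()  # emits: INSERT INTO model_integrated (name) VALUES (?)
    print(model.id, model.name)
session.commit()
```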
Thanks @susodapop . I was running SQLAlchemy==1.4.49.
I upgraded to 2.0.22, but I'm still facing the same issue...
I think the issue is that the Excel file has no "id" column.
I want SQLAlchemy to add the "Model" rows to Databricks, have Databricks + SQLAlchemy issue an id (primary key), and then return that id so I can use it as a foreign key in the tables that follow.
This works for a PostgreSQL database with the exact same data.
```python
for tab_name in xls.sheet_names:
    print(f"Processing tab: {tab_name}")
    df = pd.read_excel(excel_file_path, sheet_name=tab_name)
    # Convert column names to lowercase
    df.columns = df.columns.str.lower()
    print("Data in the tab:")
    print(df)
    if tab_name == 'Model':
        print("Processing Model data")
        for _, row in df.iterrows():
            model = ModelIntegrated(name=row['name'])
            session.add(model)
            session.flush()  # Get the auto-generated ID
            model_id = model.id  # Retrieve the ID
            print(f"Inserted Model with name: {model.name}, ID: {model_id}")
```
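For comparison, here is a self-contained version of that pattern (hard-coded rows instead of the Excel file, and in-memory SQLite standing in for PostgreSQL) where `flush()` does populate the id:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class ModelIntegrated(Base):
    __tablename__ = "model_integrated"
    id = Column(Integer, primary_key=True)
    name = Column(String)


# SQLite stands in for PostgreSQL here; both dialects can return
# the auto-generated primary key when the row is flushed.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

model = ModelIntegrated(name="Model A")
session.add(model)
session.flush()      # the INSERT runs here and populates model.id
model_id = model.id  # already available before commit
session.commit()
```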
Thank you,
Brent
Just getting back to this after traveling.
Can you please provide a runnable reproduction? I have attempted to reproduce the issue using the code you provided, but it works for me. I don't have access to your Excel file (and, as noted above, I don't think the Excel file is really the problem). Without reproduction steps we're blocked on implementing a fix.
Hi @susodapop, here is a runnable, reproducible example. You'll just need to fill in your Databricks placeholders and the location of the Excel file (attached).
```python
import pandas as pd
from datetime import datetime
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey, Numeric, DateTime
from sqlalchemy.orm import declarative_base, sessionmaker, relationship


def process_excel_data(excel_file_path, host, http_path, access_token, catalog, schema):
    # Replace placeholders with actual values
    HOST = host
    HTTP_PATH = http_path
    ACCESS_TOKEN = access_token

    # SQLAlchemy setup
    Base = declarative_base()

    # Model class for "model" table
    class ModelIntegrated(Base):
        __tablename__ = 'model_integrated'
        id = Column(Integer, primary_key=True)
        name = Column(String)
        periods = relationship('PeriodIntegrated', backref='model_integrated')

    # Model class for "period" table
    class PeriodIntegrated(Base):
        __tablename__ = 'period_integrated'
        id = Column(Integer, primary_key=True)
        model_id = Column(Integer, ForeignKey('model_integrated.id'))
        name = Column(String)
        solution = Column(Numeric)
        start = Column(DateTime)
        end = Column(DateTime)
        period_order = Column(Integer)

    conn_string = (
        f"databricks://token:{ACCESS_TOKEN}@{HOST}?http_path={HTTP_PATH}&catalog={catalog}&schema={schema}"
    )
    engine = create_engine(conn_string, echo=True)
    Session = sessionmaker(bind=engine)
    session = Session()

    xls = pd.ExcelFile(excel_file_path)
    for tab_name in xls.sheet_names:
        print(f"Processing tab: {tab_name}")
        df = pd.read_excel(excel_file_path, sheet_name=tab_name)
        # Convert column names to lowercase
        df.columns = df.columns.str.lower()
        print("Data in the tab:")
        print(df)
        if tab_name == 'Model':
            print("Processing Model data")
            for _, row in df.iterrows():
                # PostgreSQL solution
                model = ModelIntegrated(name=row['name'])
                session.add(model)
                session.commit()  # Commit the transaction
                # Retrieve the generated ID using a separate query
                model_id = session.query(ModelIntegrated.id).filter_by(name=row['name']).scalar()
                print(f"Inserted Model with name: {model.name}, ID: {model_id}")
        elif tab_name == 'Period':
            print("Processing Period data")
            # Sort the DataFrame by "start" dates in ascending order
            df_sorted = df.sort_values(by='start')
            # Add a new column "period_order" with ascending integer values
            df_sorted['period_order'] = range(1, len(df_sorted) + 1)
            for _, row in df_sorted.iterrows():
                period = PeriodIntegrated(
                    model_id=model_id,
                    name=row['name'],
                    solution=row['solution'],
                    start=datetime.strptime(row['start'], '%Y-%m-%d %I:%M:%S %p'),  # Convert to datetime
                    end=datetime.strptime(row['end'], '%Y-%m-%d %I:%M:%S %p')  # Convert to datetime
                )
                # Set the "period_order" attribute with the value from the DataFrame
                period.period_order = row['period_order']
                session.add(period)
                print(f"Inserted Period with name: {period.name}, Period Order: {period.period_order}")

    # Commit the changes
    session.commit()
    session.close()


# Example usage
excel_path = "<location>/githubtest.xlsx"
process_excel_data(excel_path, "<your_host>", "<your_http_path>", "<your_access_token>", "<your_catalog>", "<your_schema>")
```