Git Product home page Git Product logo

Comments (5)

susodapop avatar susodapop commented on June 10, 2024

Thanks for the detailed write-up. There are a few things happening here.

First up, you cannot use autoincrement=True with the Databricks dialect. This is discussed in our dialect REAMDE here. Support for it will come in a future release.

Second, it looks like you're using the old SQLAlchemy 1.x syntax for your model code. The dialect included in databricks-sql-connector>=3.0.0 is built for SQLAlchemy 2.x exclusively. It may work but we can't guarantee it. You can see an example of the new syntax in our e2e tests here.

Third, the actual exception is a syntax error. SQLAlchemy is writing an invalid SQL statement by omitting the column names in the INSERT. I'm not clear how this is happening as it's not something we observe in the 1000+ INSERT test cases that we run during development. But I wonder if you may be using an older SQLAlchemy version below 2.0.0.

Which SQLAlchemy version do you have installed?

from databricks-sql-python.

susodapop avatar susodapop commented on June 10, 2024

I tried reproducing this error locally with:

  • Python 3.11.4
  • SQLAlchemy==2.0.22
  • databricks-sql-connector==3.0.1

and am not able to make SQLAlchemy emit an INSERT statement that omits the column names.

Can you provide a reproducible example? FWIW: I don't think the Excel file has any bearing on this behaviour. In my attempts to reproduce I used both an Excel file as input and a randomly generated pandas dataframe. It worked as expected in both cases.

from databricks-sql-python.

kopitb avatar kopitb commented on June 10, 2024

Thanks @susodapop . I was running SQLAlchemy==1.4.49.

I upgraded to 2.0.22, but I'm still facing the same issue...

I think the issue is that within the excel file there is no "id" column.

I want SQLAlchemy to add the "Model" to Databricks. Have Databricks + SQLAlchemy issue an id (Primary Key), and then return that id for all the future tables which I can use as a Foreign Key.

This works for a PostgreSQL database with the exact same data.
'''
for tab_name in xls.sheet_names:
print(f"Processing tab: {tab_name}")
df = pd.read_excel(excel_file_path, sheet_name=tab_name)

# Convert column names to lowercase
df.columns = df.columns.str.lower()
print("Data in the tab:")
print(df)

if tab_name == 'Model':
    print("Processing Model data")
    for _, row in df.iterrows():
        model = ModelIntegrated(name=row['name'])
        session.add(model)
        session.flush()  # Get the auto-generated ID
        model_id = model.id # Retrieve the ID
        print(f"Inserted Model with name: {model.name}, ID: {model_id}")

'''

Thank you,
Brent

from databricks-sql-python.

susodapop avatar susodapop commented on June 10, 2024

Just getting back to this after traveling.

Can you please provide a runnable reproduction? I have attempted to reproduce the issue going off the code you provided but the code works. I don't have access to your excel file (and as indicated above, I don't think the excel file is really the problem). Without reproduction steps we're blocked on implementing a fix.

from databricks-sql-python.

kopitb avatar kopitb commented on June 10, 2024

Hi @susodapop here is a runnable reproduceable code. You'll just need to add your Databricks placeholders, and location to the excel file (attached)

import pandas as pd
from datetime import datetime
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey, Numeric, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship

def process_excel_data(excel_file_path, host, http_path, access_token, catalog, schema):
    # Replace placeholders with actual values
    HOST = host
    HTTP_PATH = http_path
    ACCESS_TOKEN = access_token

    # SQLAlchemy setup
    Base = declarative_base()

    # Model class for "model" table
    class ModelIntegrated(Base):
        __tablename__ = 'model_integrated'

        id = Column(Integer, primary_key=True)
        name = Column(String)
        periods = relationship('PeriodIntegrated', backref=('model_integrated'))

    # Model class for "period" table
    class PeriodIntegrated(Base):
        __tablename__ = 'period_integrated'
        id = Column(Integer, primary_key=True)
        model_id = Column(Integer, ForeignKey('model_integrated.id'))
        name = Column(String)
        solution = Column(Numeric)
        start = Column(DateTime)
        end = Column(DateTime)
        period_order = Column(Integer)

    conn_string = (
        f"databricks://token:{ACCESS_TOKEN}@{HOST}?http_path={HTTP_PATH}&catalog={catalog}&schema={schema}"
    )

    engine = create_engine(conn_string, echo=True)

    Session = sessionmaker(bind=engine)
    session = Session()

    xls = pd.ExcelFile(excel_file_path)

    for tab_name in xls.sheet_names:
        print(f"Processing tab: {tab_name}")
        df = pd.read_excel(excel_file_path, sheet_name=tab_name)

        # Convert column names to lowercase
        df.columns = df.columns.str.lower()
        print("Data in the tab:")
        print(df)

        if tab_name == 'Model':
            print("Processing Model data")
            for _, row in df.iterrows():
                # PostgreSQL solution
                model = ModelIntegrated(name=row['name'])
                session.add(model)
                session.commit()  # Commit the transaction

                # Retrieve the generated ID using a separate query
                model_id = session.query(ModelIntegrated.id).filter_by(name=row['name']).scalar()

                print(f"Inserted Model with name: {model.name}, ID: {model_id}")

        elif tab_name == 'Period':
            print("Processing Period data")

            # Sort the DataFrame by "start" dates in ascending order
            df_sorted = df.sort_values(by='start')

            # Add a new column "period_order" with ascending integer values
            df_sorted['period_order'] = range(1, len(df_sorted) + 1)

            for _, row in df_sorted.iterrows():
                period = PeriodIntegrated(
                    model_id=model_id,
                    name=row['name'],
                    solution=row['solution'],
                    start=datetime.strptime(row['start'], '%Y-%m-%d %I:%M:%S %p'),  # Convert to datetime
                    end=datetime.strptime(row['end'], '%Y-%m-%d %I:%M:%S %p')  # Convert to datetime
                )

                # Set the "period_order" attribute with the value from the DataFrame
                period.period_order = row['period_order']

                session.add(period)
                print(f"Inserted Period with name: {period.name}, Period Order: {period.period_order}")

    # Commit the changes
    session.commit()
    session.close()

# Example usage
excel_path = "<location>/githubtest.xlsx"
process_excel_data(excel_path, "<your_host>", "<your_http_path>", "<your_access_token>", "<your_catalog>", "<your_schema>")

from databricks-sql-python.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.