Git Product home page Git Product logo

bi_sql_assistant's Introduction

BI Assistant

๐Ÿฆœ๐Ÿ”— LangChain SQL BI Assistant for IMDB

An LLM-based application that communicates with a DB to extract information.

The application takes a user input (question related to or inquiry about the DB) in natural language > [converts to SQL query > queries the DB > returns result to LLM] (iteratively until the LLM is satisfied) > returns final output to user

LangChain SQL Agent

Instructions:

To view this Streamlit app on a browser, run it with the following command:

streamlit run PATH\main.py [ARGUMENTS]
  • Note: This code connects to the DB using postgres. To connect to a different DB, please replace the type of DB and adapter to match your SQL engine of choice (in the pg_uri variable in the 'helper_funcs.py' module). The rest of the code should work just as fine ๐Ÿ‘๐Ÿผ.

Additional notes: Make sure to create a .env file for storing your OpenAI api key and db-connection variables.

Use the following format:

  • OPENAI_API_KEY = "your_openai_api_key"
  • DB_USER = "your_db_user"
  • DB_PASSWORD = "your_db_password"
  • DB_HOST = "your_db_host"
  • DB_PORT = int
  • DB_NAME = "your_db_name"

Description of High-Level Architecture

Frameworks: LangChain alongside OpenAI's GPT (NLP) + Streamlit (front-end)

LangChain ๐Ÿฆœ๐Ÿ”— (a framework to develop applications on LLMs):

  • Actively developing SQL agent ability
  • Seamless connection with OpenAI
  • Open-source
  • Under constant development
  • Well-documented
  • Extensive usage
  • Extensive community support
  • Reliable
  • Text-to-SQL query VS Text-to-SQL query and execution VS SQL Agent

Streamlit Logo:

  • Open-source
  • Great documentation
  • Community support
  • Customization
  • Pure Python
  • Seamless integrations
  • Deployment
  • LLM and ML support

Gradio Logo:

  • A good alternative that shares most of Streamlit's pros.
  • A suitable a good substitute.
  • I personally like Streamlit, and since it was recently acquired by SnowFlake, believe in its potential to grow much faster. Yet, both are great options.

Models & Algorithms

  • Used: model='gpt-3.5-turbo-1106'

openai_logo:

  • Seamless connection with LangChain
  • Best performing model in the market
  • Regularly updated offering continuous improvements
  • Hallucinations

Alternatives: Open-source LLMs ๐Ÿฆ™ (pros & cons):

  • A plethora of fine-tuned open-source models for sql-generation
  • Much more cost-effective
  • Docker
  • Hallucinations
  • Incorporate a combined approach of using an open-source LLM that was fine-tuned for SQL generation alongside LangChain's SQL query execution.
  • This is a valid choice, yet it means waiving the main advantages of the SQL Agent. Namely, recovering from errors + answering questions based on the databasesโ€™ schema as well as on the databasesโ€™ content.

Database: postgres_logo

  • Open-source and free
  • Good support and documentation
  • Reliable
  • Supports a wide range of extensions
  • Seamlessly integrates with Python

Preprocessing Stages - Description & Considerations

Files

  • Download the non-commercial IMDB dataset from IMDB datasets
  • See IMDB Non-commercial datasets for file description
  • Unzip files (tsv.gz files)
  • Preprocess files before uploading to the Database
    • Handling null values correctly
    • Handling mixed data types in the same column

Database

  • Choose an appropriate SQL Engine

  • Set-up a new database

  • Create tables and columns that correspond to the TSV files

  • Column data types: When deciding on dtypes take into account performance and storage issues and considerations:

    • str - check df['col'].str.len().max() to decide on n for VARCHAR(n)
    • int - INTEGER vs SMALLINT
      • SMALLINT relevant for: title.basics.tsv (startYear and endYear columns) and name.basics.tsv (birthYear and deathYear columns)
    • float - DOUBLE PRECISION
    • bool - BOOLEAN
    • List[str] - VARCHAR(n)[] unless behaves like VARCHAR(n)
      • Relevant for: title.akas.tsv (types and attribuites columns) and title.basics.tsv (genres column)
  • Import the preprocess TSV files into the newly created tables

    • Exceptionally large files should be broken into smaller files and uploaded sequentially in batches.
    • Write errors to file and inspect error_file.txt to further handle any problematic records.
    • Insert inconvertible values as null (mark the tick-box in the 'Data Import' dialogue box)

Libraries

  • Problem: LangChain-OpenAI incompatibility issues due to OpenAI's ChatCompletion endpoint change not yet integrated in LangChain.
    • Solution: Downgrade to openai==0.28 (versions 1.0.0+ will not work until LangChain is updated accordingly).
  • Issue: LangChain return_intermediate_steps param isn't yet applicable for LangChain SQL Agents.
    • Solution: Consider using a Text-to-SQL query (and Execution) instead to access intermediate steps.
    • Considerations: LangChain's SQL Agent provides a more flexible and advanced way of interacting with SQL Databases (see Part I above - Frameworks)
  • OpenAI temperature param - set to value of 0 to block model creativeness when translating free text client Qs to SQL queries.

Description of the Entire Question-Answering Flow

User prompt > LangChain employs a SQL Agent to handle the back-and-forth deliberations between the LLM and the DB (using the SQL tool by proxy) that queries and returns responses from the DB > Until the LLM is satisfied with the result > Return response to client

Process Flow

Future improvements:

  • Set primary and foreign keys (Database)
  • Improve prompt to LLM with few-show examples of natural language requests and corresponding SQL queries (LangChain)
  • Improve LLM retrieval with vectorBD storage and similarity comparison to other relevant SQL queries.
  • Export the DB to a compressed file
    • Bash command - pg_dump -U [username] -h [hostname] -d [databasename] | gzip > [full_path_of_output_file.sql.gz]
  • Docker
  • Set params in secret.py to environment variables
  • Integrate a chatbot to interact with the user, accept secret params and set them accordingly in the virtual environment to establish a live connection to the database.

example_queries.pdf

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.