zsql-postgres-dpo

This is a dataset for training machine learning models to convert natural-language English text into Postgres-dialect SQL queries.

This dataset comprises 200,000 DPO pairs curated to support the rapid development of text-to-SQL generation models. What distinguishes it is its optimization process: the "chosen" field of each pair contains a SQL query that has been canonicalized and optimized, and that was selected from the candidate set as the one that minimizes syntactic cyclomatic and asymptotic complexity against the given schema.

Direct Preference Optimization (see Rafailov et al., 2023) is an approach to learning from positive and negative samples that modifies the behavior of large-scale unsupervised language models to align with human preferences. The method simplifies the fine-tuning process, making it more stable and computationally efficient without the need for extensive hyperparameter tuning or LM sampling, and has been shown to control model outputs effectively, matching or surpassing existing methods.
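As a rough illustration of the objective described by Rafailov et al., the sketch below computes the DPO loss for a single chosen/rejected pair from sequence log-probabilities. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not part of this dataset or of any particular training library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy favors "chosen" over "rejected",
    # measured relative to the reference model.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2; the loss falls as the policy assigns relatively more probability to the chosen query.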

The source data is cleaned and filtered based on the following criteria:

  • Remove queries which are not in English.
  • Remove queries which are not valid SQL queries.
  • Remove queries which are not executable against the given schema.
  • Remove queries which are executed against tables with non-Latin characters.
  • Remove queries which use features not supported by the given database.
  • Remove long queries which contain domain-specific knowledge that causes model confusion.
  • Remove queries which do not fit within a 4096 token context window.
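Two of the criteria above (non-Latin characters and the 4096-token limit) can be sketched as a simple predicate. This is only an illustration, not the pipeline actually used to build the dataset: the helper names, the choice to concatenate `schema`, `question`, and `chosen`, and the caller-supplied `count_tokens` function are all assumptions.

```python
import unicodedata


def has_non_latin_letters(text):
    """Return True if any alphabetic character is outside the Latin script."""
    for ch in text:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return True
    return False


def keep_example(example, count_tokens, max_tokens=4096):
    """Hypothetical filter: Latin-only schema and context-window fit.

    `count_tokens` is any callable mapping a string to a token count,
    for example a wrapper around the tokenizer of the target model.
    """
    if has_non_latin_letters(example["schema"]):
        return False
    text = " ".join([example["schema"], example["question"], example["chosen"]])
    return count_tokens(text) <= max_tokens
```

Passing the token counter in as an argument keeps the check independent of any particular tokenizer; the remaining criteria (SQL validity, executability against the schema) would need a real Postgres instance and are not sketched here.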

Usage

To load the dataset using the HuggingFace datasets library:

from datasets import load_dataset

dataset = load_dataset("zerolink/zsql-postgres-dpo")

To use in model fine-tuning, apply the following chat tokenizer:

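from transformers import AutoTokenizer

# `model` is the name or local path of the base model being fine-tuned.
tokenizer = AutoTokenizer.from_pretrained(model)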

def tokenize(element):
    schema = element["schema"]
    question = element["question"]
    answer = element["chosen"]

    prompt = f"""
    Using the schema:
    {schema}
    Generate SQL for the following question:
    {question}
    """

    system = "Translate English to Postgres SQL."
    message = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": answer},
    ]
    output = tokenizer.apply_chat_template(
        message, add_generation_prompt=False, tokenize=True
    )
    return {"text": output}

Fields

The fields in this dataset are as follows:

| Field Name | Description |
|------------|-------------|
| schema | The schema of the database. |
| question | The natural language question. |
| chosen | The DPO preferred SQL query. |
| rejected | The DPO rejected SQL query. |
| weight | The weight of the query in the reward function. |
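Preference-tuning tools commonly expect each record as a prompt plus a chosen and a rejected completion. The sketch below maps one row of this dataset to that shape, reusing the prompt wording from the Usage section; the function name `to_preference_record` and the output key `prompt` are assumptions about downstream tooling, not something this dataset prescribes.

```python
def to_preference_record(element):
    """Map one dataset row to prompt/chosen/rejected strings (hypothetical)."""
    prompt = (
        "Using the schema:\n"
        f"{element['schema']}\n"
        "Generate SQL for the following question:\n"
        f"{element['question']}\n"
    )
    return {
        "prompt": prompt,
        "chosen": element["chosen"],
        "rejected": element["rejected"],
    }
```

With the HuggingFace datasets library, a function like this could be applied row by row via `dataset.map(to_preference_record)`; the `weight` field is left out here and would need separate handling if the training setup uses it.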

Sources

This dataset is derived from the following sources:

  • datetime - Use of Postgres date and time functions.
  • json - Use of Postgres JSON functions.
  • math - Use of Postgres math functions.
  • postgis - Use of Postgres GIS functions.
  • re - Use of Postgres regular expression functions.
  • rollup - Use of Postgres rollup functions.
  • set - Use of Postgres set functions.
  • string - Use of Postgres string functions.
  • vector - Use of PGVector functions.
  • window - Use of Postgres window functions.

| Source | License | External Link |
|--------|---------|---------------|
| wikisql | BSD 3-Clause | https://github.com/salesforce/WikiSQL |
| spider | CC-BY-SA-4.0 | https://huggingface.co/datasets/spider |
| sql_create_context | CC-BY-4.0 | https://huggingface.co/datasets/b-mc2/sql-create-context |
| squall | CC-BY-SA-4.0 | https://github.com/tzshi/squall |
| sede | Apache-2.0 | https://github.com/hirupert/sede |
| nvbench | MIT | https://github.com/TsinghuaDatabaseGroup/nvBench |
| imdb | Not Found | https://github.com/jkkummerfeld/text2sql-data |
| advising | CC-BY-4.0 | https://github.com/jkkummerfeld/text2sql-data |
| atis | Not Found | https://github.com/jkkummerfeld/text2sql-data |
| restaurants | Not Found | https://github.com/jkkummerfeld/text2sql-data |
| scholar | Not Found | https://github.com/jkkummerfeld/text2sql-data |
| yelp | Not Found | https://github.com/jkkummerfeld/text2sql-data |
| academic | Not Found | https://github.com/jkkummerfeld/text2sql-data |
| criteria2sql | Apache-2.0 | https://github.com/xiaojingyu92/Criteria2SQL |
| eICU | CC-BY-4.0 | https://github.com/glee4810/EHRSQL |
| mimic_iii | CC-BY-4.0 | https://github.com/glee4810/EHRSQL |
| mimicsql_data | MIT | https://github.com/wangpinggl/TREQS |
| worldsoccerdatabase | CC-BY-SA-4.0 | https://github.com/chiahsuan156/KaggleDBQA |
| whatcdhiphop | CC-BY-SA-4.0 | https://github.com/chiahsuan156/KaggleDBQA |
| studentmathscore | CC-BY-SA-4.0 | https://github.com/chiahsuan156/KaggleDBQA |
| pesticide | CC-BY-SA-4.0 | https://github.com/chiahsuan156/KaggleDBQA |
| thehistoryofbaseball | CC-BY-SA-4.0 | https://github.com/chiahsuan156/KaggleDBQA |
| uswildfires | CC-BY-SA-4.0 | https://github.com/chiahsuan156/KaggleDBQA |
| geonucleardata | CC-BY-SA-4.0 | https://github.com/chiahsuan156/KaggleDBQA |
| greatermanchestercrime | CC-BY-SA-4.0 | https://github.com/chiahsuan156/KaggleDBQA |

Composition:

[Figure: Composition]

License

This dataset is provided for academic and research purposes. Please adhere to the specified license terms and conditions for usage and distribution.

Contributors

  • sdiehl
