
gap-text2sql

mRAT-SQL-FIT - A Multilingual Translator to SQL with Database Schema Pruning to Improve Self-Attention

Code and model from the paper published in Springer Nature's International Journal of Information Technology; here is the SharedIt link.

mRAT-SQL+GAP - Multilingual version of the RAT-SQL+GAP

Code and model from our BRACIS 2021 paper published in Springer Lecture Notes in Computer Science; here is the pre-print on arXiv.

Based on RAT-SQL+GAP (GitHub); paper: AAAI 2021.

Abstract

mRAT-SQL+GAP is a multilingual version of RAT-SQL+GAP, which starts with the Portuguese language. The code, dataset, and results are available here.

Directory Structure

Go to the directory where you want to install the project:

git clone https://github.com/C4AI/gap-text2sql
cd gap-text2sql/mrat-sql-gap 

Download Spider Dataset

Go to your browser and download:

https://drive.google.com/uc?id=1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0

Put the spider.zip file into the directory: gap-text2sql/mrat-sql-gap

Conda mtext2sql Environment Setup

conda create --name mtext2sql python=3.7
conda activate mtext2sql
conda install pytorch=1.5 cudatoolkit=10.2 -c pytorch
pip install -U "huggingface_hub[cli]"
pip install hf-transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
pip install gdown
conda install -c conda-forge jsonnet
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
conda install jupyter notebook
conda install -c conda-forge jupyter_contrib_nbextensions
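After activating the environment, a quick sanity check can catch a wrong or inactive interpreter early. This is a minimal sketch, not part of the repo; the version floor just mirrors the python=3.7 pin above:

```python
import sys

# The environment above pins python=3.7; an older interpreter
# would break the pinned pytorch/jsonnet stack.
assert sys.version_info >= (3, 7), sys.version
print("python OK")
```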

Setup Script

Just run the script below; it will copy the datasets. The original version of the Spider dataset is distributed under the CC BY-SA 4.0 license. The modified versions (translated to Portuguese, Spanish, and French, plus double-size (English and Portuguese) and quad-size (English, Portuguese, Spanish, and French)) of train_spider.json, train_others.json, and dev.json are distributed under the CC BY-SA 4.0 license, respecting ShareAlike.

chmod +x setup.sh
./setup.sh
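The files copied above are JSON lists of question/SQL pairs. As a minimal sketch of what one record in train_spider.json or dev.json looks like (field names follow the public Spider release, among other fields; the values here are invented):

```python
import json

# One invented record in the Spider JSON schema.
example = {
    "db_id": "concert_singer",
    "question": "How many singers do we have?",
    "query": "SELECT count(*) FROM singer",
}

# The translated train files keep this schema; only the
# natural-language "question" changes language.
print(json.dumps(sorted(example)))  # ["db_id", "query", "question"]
```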

Specific setup

The models and checkpoints are big (GBytes), so if you have enough disk space you can run all the shell scripts. To understand how things work, run just BART_large.sh first and then run the others.

./BART_large.sh
./mBART50MtoM-large.sh
./mT5_large.sh
./BERTimbau-base.sh
./BERTimbau-large.sh

Environment Test

Now the environment is ready for training (fine-tuning) and inference. Training is very slow: more than 60 hours for BART, BERTimbau, and mBART50, and more than 28 hours for mT5. Therefore, I recommend testing the environment with inference first.

Preprocess dataset

This preprocessing step is necessary both for inference and for training, and it will take some time, maybe 40 minutes. I will use the script for BART, but you can use the others; see the directory experiments/spider-configs.

python run.py preprocess experiments/spider-configs/spider-BART-large-en-train_en-eval.jsonnet

You can see the processed files in the path: data/spider-en/nl2code-1115,output_from=true,fs=2,emb=bart,cvlink
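The directory name above encodes the run configuration as comma-separated key=value pairs. A small helper (hypothetical, not part of the repo) to read it back:

```python
# Parse a preprocessed-data directory name like
# "nl2code-1115,output_from=true,fs=2,emb=bart,cvlink"
# into a dict; bare flags (no "=") become True.
def parse_dirname(name):
    head, _, params = name.partition(",")
    out = {"model": head}
    for item in params.split(","):
        key, _, value = item.partition("=")
        out[key] = value if value else True
    return out

parsed = parse_dirname("nl2code-1115,output_from=true,fs=2,emb=bart,cvlink")
print(parsed["emb"])  # bart
```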

Inference

I will use the script for BART again. Note: we are running inference with the already-trained checkpoint (directory logdir), defined in experiments/spider-configs/spider-BART-large-en-train_en-eval.jsonnet as logdir: "logdir/BART-large-en-train", and
eval_steps: [40300],

python run.py eval experiments/spider-configs/spider-BART-large-en-train_en-eval.jsonnet

You then get the inference results and evaluation results in the paths:

ie_dirs/BART-large-en-train/bart-large-en_run_1_true_1-step41000.infer

and

ie_dirs/BART-large-en-train/bart-large-en_run_1_true_1-step41000.eval.
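The result filenames follow a run-name/step/extension pattern. A hypothetical helper to recover those pieces (the pattern is inferred from the examples above, not from the repo code):

```python
import re

# Split "<run>-step<N>.infer" / "<run>-step<N>.eval" into its parts.
def parse_result(path):
    m = re.search(r"([^/]+)-step(\d+)\.(infer|eval)$", path)
    return m.group(1), int(m.group(2)), m.group(3)

run, step, kind = parse_result(
    "ie_dirs/BART-large-en-train/bart-large-en_run_1_true_1-step41000.infer")
print(run, step, kind)  # bart-large-en_run_1_true_1 41000 infer
```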

Training

Run training only if you really need to fine-tune the model, as it will take a long time. But if you have a good machine available and want to see different checkpoints in the logdir, go ahead.

python run.py train experiments/spider-configs/spider-BART-large-en-train_en-eval.jsonnet

You then get the training checkpoints in the path: logdir/BART-large-en-train
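A long run leaves several checkpoints in the logdir; picking the newest by step number can be sketched as below. The filename pattern model_checkpoint-&lt;step&gt; is an assumption for illustration, not confirmed from the repo:

```python
import os
import re
import tempfile

# Return the highest checkpoint step found in logdir, or None.
def latest_step(logdir):
    steps = []
    for name in os.listdir(logdir):
        m = re.fullmatch(r"model_checkpoint-(\d+)", name)
        if m:
            steps.append(int(m.group(1)))
    return max(steps) if steps else None

# Demo against a throwaway directory with fake checkpoint files.
with tempfile.TemporaryDirectory() as d:
    for step in (10000, 40300):
        open(os.path.join(d, f"model_checkpoint-{step}"), "w").close()
    print(latest_step(d))  # 40300
```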

Checkpoints and Inferences

The checkpoints are available here (ESM: Exact Set Matching accuracy):

  • Paper mRAT-SQL+GAP - Multilingual version of the RAT-SQL+GAP

  • Future work of the paper mRAT-SQL+GAP

  • Paper mRAT-SQL-FIT

Other Best Results

  • T5-v1_1-large trained in English, FIT, 150K steps

  • mT5-large trained in English, Portuguese, Spanish, and French (together) + non-linear data augmentation by rules for extra questions, 3enr-3ptr-3esr-3frr, FIT, 150K steps

Results

All intermediate files of the results are in the directory inference-results.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Contributors

marchanjo, panoskorovesis, impavidity, pnpnpn, amazon-auto, giannisfourfouris
