This repository contains the official implementation of the experiments in
- TAMR: A Lightweight AMR Toolkit for Enhancing NL2SQL Solutions (ICDE 2025 Submission)
- `configs`: JSON files for the various running configurations
- `seq2seq`: Codebase for the NL2SQL experiments; the file structure is adapted from Picard
- `datasets`: Dataset-related Python files
- `metrics`: Metrics-related Python files
- `utils`: Python utility files
- `T5`: A modified, AMR-augmented T5 architecture derived from Hugging Face Transformers
- `run_seq2seq_internal.py`: Training entry point for the main experiments
For the GPT-4-based experiments, please refer to https://github.com/causalNLP/amr_llm. We modified the prompts as indicated in Tables III to V.
- Python 3.7 or above

Install the Python packages:

```shell
pip install -r requirements.txt
```
Train the AMRT5-large (LN) model:

```shell
python run_seq2seq_internal.py --config_files=configs/train_amr.json
```

Train the AMRT5-large (SC) model:

```shell
python run_seq2seq_internal.py --config_files=configs/train_amr_sc.json
```

Train the AMRT5-3B (SC) model:

```shell
python run_seq2seq_internal.py --config_files=configs/train_amr_3B.json
```
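The `--config_files` flag points to a JSON file of training arguments. As a rough, hypothetical sketch of what such a file might contain (these field names are illustrative assumptions, not copied from the repo):

```json
{
  "model_name_or_path": "t5-large",
  "dataset": "spider",
  "output_dir": "ckpts/amrt5_large_ln",
  "learning_rate": 1e-4,
  "num_train_epochs": 50
}
```

Check the files under `configs/` for the exact keys the trainer expects.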
Evaluate a trained model (here, model A):

```shell
python run_seq2seq_internal_eval.py --model_path=path/to/ckpt_A
```
We also provide preprocessed datasets in the folder `data`.
AMR (Abstract Meaning Representation) is a comprehensive semantic graph representation of a sentence. It uses a rooted, directed acyclic graph in which important concepts are nodes and semantic relationships are edges. AMR can help PLMs enrich their semantic representations and strike a better trade-off between efficiency and effectiveness. The parser we use originates from here; we retrained it on a corpus of NLQ-AMR pairs to make it suitable for Natural Language Questions (NLQs).
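As an illustration only (the actual AMR integration lives in the modified `T5` module), one simple way to expose an AMR graph to a seq2seq PLM is to linearize its PENMAN string and concatenate it with the NLQ. The helper below is a hypothetical sketch, not code from this repo:

```python
# Hypothetical sketch: linearize a PENMAN-format AMR graph and join it
# with the natural-language question (NLQ) into a single model input.

def linearize_amr(penman: str) -> str:
    """Collapse newlines and indentation so the graph fits on one line."""
    return " ".join(penman.split())

def build_input(nlq: str, penman: str, sep: str = " <amr> ") -> str:
    """Concatenate the NLQ with its linearized AMR graph."""
    return nlq + sep + linearize_amr(penman)

nlq = "How many singers do we have?"
amr = """
(h / have-03
   :ARG0 (w / we)
   :ARG1 (s / singer
            :quant (a / amr-unknown)))
"""
print(build_input(nlq, amr))
```

In practice the special separator token (here the made-up `<amr>`) would be added to the tokenizer's vocabulary before fine-tuning.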
@inproceedings{Yu&al.18c,
title = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
author = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev},
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
address = "Brussels, Belgium",
publisher = "Association for Computational Linguistics",
year = 2018
}
@inproceedings{gan-etal-2021-towards,
title = "Towards Robustness of Text-to-{SQL} Models against Synonym Substitution",
author = "Gan, Yujian and
Chen, Xinyun and
Huang, Qiuping and
Purver, Matthew and
Woodward, John R. and
Xie, Jinxia and
Huang, Pengsheng",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.195",
doi = "10.18653/v1/2021.acl-long.195",
pages = "2505--2515",
}
@misc{gan2021exploring,
title={Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization},
author={Yujian Gan and Xinyun Chen and Matthew Purver},
year={2021},
eprint={2109.05157},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@article{deng2020structure,
title={Structure-Grounded Pretraining for Text-to-SQL},
author={Deng, Xiang and Awadallah, Ahmed Hassan and Meek, Christopher and Polozov, Oleksandr and Sun, Huan and Richardson, Matthew},
journal={arXiv preprint arXiv:2010.12773},
year={2020}
}
License: MIT