Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and The Case of Information Extraction

This repository contains the PyTorch implementation of the models and experiments in the paper "Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction".

@article{josifoski2023exploiting,
  title={Exploiting Asymmetry for Synthetic Training Data Generation: {S}ynth{IE} and The Case of Information Extraction},
  author={Josifoski, Martin and Sakota, Marija and Peyrard, Maxime and West, Robert},
  journal={arXiv preprint arXiv:2303.04132},
  year={2023}
}

Please consider citing our work if you find the provided resources useful.

1. The Idea and the Repository in a Nutshell

The corresponding paper builds on the following idea. Consider a hard task of interest (with input X and output Y) for which human annotation is impractical and high-quality annotated data is unavailable. Even when an LLM cannot solve that task directly, useful data can be generated synthetically by reversing the task (from Y to X). The idea is illustrated in the following figure:

This process yields a high-quality dataset of X-Y pairs that can be used to train or fine-tune models for the original task of interest. In particular, the paper studies the idea in the context of closed information extraction (IE), where a model is tasked with extracting the exhaustive set of facts expressed in natural language text. The synthetic data generation pipeline proposed in the paper, depicted in the figure below, comprises three primary components: (i) construction of a knowledge graph (KG) containing the entities and relations of interest; (ii) sampling of coherent triplet sets from the KG with comprehensive coverage of the entities and relations; and (iii) generation of high-quality text that expresses the triplets without any supplementary information. For more details regarding the dataset construction procedure, see the paper.
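The three components can be illustrated with a toy sketch. Everything in it is hypothetical: the tiny knowledge graph, the `sample_triplet_set` and `build_prompt` helpers, the use of "shares an entity" as a stand-in for coherence, and the prompt wording are assumptions made for illustration, not the sampling procedure or prompts used in the paper.

```python
import random

# (i) A toy knowledge graph as (subject, relation, object) triplets. Illustrative only.
KG = [
    ("Marie Curie", "field of work", "physics"),
    ("Marie Curie", "spouse", "Pierre Curie"),
    ("Pierre Curie", "field of work", "physics"),
    ("Pierre Curie", "place of birth", "Paris"),
    ("Paris", "country", "France"),
]


def sample_triplet_set(kg, size, rng):
    """(ii) Sample a connected triplet set (assumed proxy for coherence):
    every new triplet shares an entity with the triplets already chosen."""
    chosen = [rng.choice(kg)]
    while len(chosen) < size:
        entities = {e for s, _, o in chosen for e in (s, o)}
        candidates = [
            t for t in kg
            if t not in chosen and (t[0] in entities or t[2] in entities)
        ]
        if not candidates:
            break  # no connected triplet left; return a smaller set
        chosen.append(rng.choice(candidates))
    return chosen


def build_prompt(triplets):
    """(iii) Build a prompt asking an LLM to verbalize exactly these triplets
    (the reversed, Y -> X direction). The wording is an assumption."""
    facts = "\n".join(f"({s}; {r}; {o})" for s, r, o in triplets)
    return (
        "Write text that expresses exactly the following facts and nothing else:\n"
        f"{facts}\nText:"
    )


rng = random.Random(0)
triplets = sample_triplet_set(KG, size=3, rng=rng)
prompt = build_prompt(triplets)
print(prompt)
```

The text returned by the LLM for such a prompt, paired with the sampled triplet set, would form one synthetic X-Y training example.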

We used this pipeline to generate two large high-quality datasets:

SynthIE-code: consisting of around 1.8M training, 10K validation, and 50K test samples generated with code-davinci-002
SynthIE-text: consisting of 10K validation and 50K test samples generated with text-davinci-003
The text for the validation and test data points in SynthIE-code and SynthIE-text corresponds to the same triplet sets.

The resulting data is then used to train SynthIE, a series of T5-based versions of GenIE -- a recently proposed autoregressive closed IE system. As a baseline, T5 versions of GenIE are trained on the same dataset, REBEL, as the original work. This repository contains the code and instructions for downloading and using the data and models, as well as reproducing all the experiments in the paper.

The codebase is built upon PyTorch Lightning and Hydra.

2. Environment Setup

With the repository cloned, we recommend creating a new conda virtual environment:

conda create -n synthie python=3.8
conda activate synthie

Install PyTorch 1.11.0, for example using pip with CUDA 11.3 support:

pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 torchmetrics==0.10.3 -f https://download.pytorch.org/whl/torch_stable.html

Finally, install the remaining packages using pip:

pip install -r pip_requirements.txt

3. Downloading the Data and Models, Usage Instructions & Examples

The demo notebook provides a full overview of the provided resources and instructions on how to use them.

4. Training, Inference & Evaluation

Training

Each of the provided models is associated with a Hydra configuration file that reproduces its training. For instance, to train the synthie_base_fe model, run:

MODEL=synthie_base_fe # synthie_base_fe, synthie_base_sc, synthie_large_fe, genie_base_fe
python run_train.py +experiment/train=$MODEL

Inference

Hydra provides a clean interface for evaluation. You just need to specify the model to be evaluated and the dataset to evaluate it on:

MODEL=synthie_base_fe # synthie_base_fe, synthie_base_sc, synthie_large_fe, genie_base_fe

# The name of the dataset
DATAMODULE=rebel_pc  # "rebel_pc", "sdg_code_davinci_002_pc", "sdg_text_davinci_003_pc"

python run_inference.py +experiment/inference=$MODEL datamodule=$DATAMODULE

# For unconstrained generation add: ++model.constraint_module=null at the end of the call

The generated prediction sequences will be logged to Weights and Biases.
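To evaluate every listed model on every listed dataset, the call above can be generated in a loop. The sketch below only builds and prints the command lines. The model and datamodule names are the ones listed in the comments above; the `build_inference_command` helper and its `constrained` flag are hypothetical names introduced here.

```python
import itertools
import shlex

# Names taken from the comments in the inference example.
MODELS = ["synthie_base_fe", "synthie_base_sc", "synthie_large_fe", "genie_base_fe"]
DATAMODULES = ["rebel_pc", "sdg_code_davinci_002_pc", "sdg_text_davinci_003_pc"]


def build_inference_command(model, datamodule, constrained=True):
    """Return the run_inference.py call as an argument list (hypothetical helper)."""
    cmd = [
        "python",
        "run_inference.py",
        f"+experiment/inference={model}",
        f"datamodule={datamodule}",
    ]
    if not constrained:
        # Override mentioned in the README for unconstrained generation.
        cmd.append("++model.constraint_module=null")
    return cmd


commands = [
    build_inference_command(model, datamodule)
    for model, datamodule in itertools.product(MODELS, DATAMODULES)
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the runs one after another.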

Evaluation

To compute the micro and macro performance, as well as the performance bucketed by relation frequency and by number of target triplets, set WANDB_PATH to the WandB path of the inference run and execute:

python run_process_predictions.py +experiment/process_predictions=complete_rebel wandb_run_path=$WANDB_PATH
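For intuition about what the micro and macro numbers measure, here is a minimal sketch over sets of (subject, relation, object) triplets. It is not the repository's metric code: the function names are hypothetical, and it assumes that the micro score pools counts over all examples while the macro score averages the per-relation F1.

```python
from collections import defaultdict


def prf(tp, n_pred, n_gold):
    """Precision, recall and F1 from raw counts (0.0 when a denominator is zero)."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def micro_macro_f1(predictions, targets):
    """predictions/targets: one set of (subject, relation, object) triplets per example."""
    total = [0, 0, 0]  # correct, predicted, gold
    per_rel = defaultdict(lambda: [0, 0, 0])  # same counts, keyed by relation
    for pred, gold in zip(predictions, targets):
        for t in pred & gold:
            total[0] += 1
            per_rel[t[1]][0] += 1
        for t in pred:
            total[1] += 1
            per_rel[t[1]][1] += 1
        for t in gold:
            total[2] += 1
            per_rel[t[1]][2] += 1
    micro = prf(*total)[2]
    # Assumption: macro F1 is the unweighted mean of per-relation F1.
    macro = sum(prf(*c)[2] for c in per_rel.values()) / len(per_rel) if per_rel else 0.0
    return micro, macro


preds = [{("A", "born in", "B"), ("A", "spouse", "C")}, {("D", "born in", "E")}]
golds = [{("A", "born in", "B")}, {("D", "born in", "F"), ("D", "spouse", "G")}]
micro, macro = micro_macro_f1(preds, golds)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy example one of three predicted triplets is correct, so the micro F1 is 1/3, while the macro F1 is 0.25 because the "spouse" relation is never predicted correctly and weighs as much as "born in".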

License

This project is licensed under the terms of the MIT license.

Issue: running inference fails with a TypeError

Reproduction:

Following the example given in the Inference section of the README.md:

MODEL=synthie_base_fe # synthie_base_fe, synthie_base_sc, synthie_large_fe, genie_base_fe

# The name of the dataset
DATAMODULE=rebel_pc  # "rebel_pc", "sdg_code_davinci_002_pc", "sdg_text_davinci_003_pc"

python run_inference.py +experiment/inference=$MODEL datamodule=$DATAMODULE

Error message:

Traceback (most recent call last):
  File "/home/saibo/.conda/envs/synthie/lib/python3.8/site-packages/hydra/_internal/utils.py", line 639, in _locate
    obj = getattr(obj, part)
AttributeError: module 'src' has no attribute 'models'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/saibo/.conda/envs/synthie/lib/python3.8/site-packages/hydra/_internal/utils.py", line 645, in _locate
    obj = import_module(mod)
  File "/home/saibo/.conda/envs/synthie/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/saibo/Research/Rotation2/SynthIE/src/models/__init__.py", line 1, in <module>
    from .genie_flan_t5 import GenIEFlanT5PL
  File "/home/saibo/Research/Rotation2/SynthIE/src/models/genie_flan_t5.py", line 15, in <module>
    from src.metrics import TSF1, TSPrecision, TSRecall
  File "/home/saibo/Research/Rotation2/SynthIE/src/metrics/__init__.py", line 1, in <module>
    from .triplet_set_f1 import TSF1
  File "/home/saibo/Research/Rotation2/SynthIE/src/metrics/triplet_set_f1.py", line 4, in <module>
    from src.metrics.abstract import IEAbstractTorchMetric
  File "/home/saibo/Research/Rotation2/SynthIE/src/metrics/abstract.py", line 9, in <module>
    class IEAbstractTorchMetric(Metric, ABC):
  File "/home/saibo/Research/Rotation2/SynthIE/src/metrics/abstract.py", line 72, in IEAbstractTorchMetric
    ) -> Union[float, dict[str, float]]:
TypeError: 'type' object is not subscriptable

Fix

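In SynthIE/src/metrics/abstract.py, line 72, in the class IEAbstractTorchMetric,
replace Union[float, dict[str, float]] with Union[float, Dict[str, float]], importing Dict from typing if it is not already imported.

The traceback comes from a Python 3.8 environment, where the built-in dict cannot be subscripted; support for dict[str, float] was only added in Python 3.9, so typing.Dict is needed instead.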
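The change can be sketched as follows. The `compute_score` function is a hypothetical stand-in for the annotated method in `abstract.py`; only the return annotation matters here.

```python
from typing import Dict, Union


def compute_score(per_relation: bool = False) -> Union[float, Dict[str, float]]:
    """Hypothetical stand-in: return one score, or one score per relation."""
    scores = {"spouse": 0.5, "place of birth": 1.0}
    if per_relation:
        return scores
    return sum(scores.values()) / len(scores)


print(compute_score())
print(compute_score(per_relation=True))
```

On Python 3.8, another option is to add `from __future__ import annotations` as the first statement of the module, which defers evaluation of annotations so that `dict[str, float]` no longer raises at definition time.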
