synthie's Introduction

Python 3.9 MIT License arXiv

Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and The Case of Information Extraction

This repository contains the PyTorch implementation of the models and experiments in Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction.

@article{josifoski2023exploiting,
  title={Exploiting Asymmetry for Synthetic Training Data Generation: {S}ynth{IE} and The Case of Information Extraction},
  author={Josifoski, Martin and Sakota, Marija and Peyrard, Maxime and West, Robert},
  journal={arXiv preprint arXiv:2303.04132},
  year={2023}
}

Please consider citing our work if you find the provided resources useful.


1. The Idea and the Repository in a Nutshell

The corresponding paper builds on the idea that even for hard tasks of interest (with input X and output Y), for which human annotation is not practical and high-quality annotated data is not available, useful data can be synthetically generated by reversing the task (from Y to X), even when the original task cannot be solved directly by the LLM. The idea is illustrated in the following figure:

This process yields a high-quality dataset of X-Y pairs that can be used to train or fine-tune models for the original task of interest. In particular, the paper studies the idea in the context of closed information extraction (IE), where a model is tasked with extracting the exhaustive set of facts expressed in natural language text. The synthetic data generation pipeline proposed in the paper, depicted in the figure below, comprises three primary components: (i) construction of a knowledge graph (KG) containing the entities and relations of interest; (ii) sampling of coherent triplet sets from the KG with comprehensive coverage of the entities and relations; and (iii) generation of high-quality text expressing the triplets without any supplementary information. For more details regarding the dataset construction procedure, see the paper.
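
As a rough illustration of the reversed direction (from Y to X), the sketch below turns a triplet set into a text-generation prompt. The prompt wording, the linearization format, and the function names are assumptions made for this example; they are not the prompts or code used in the paper.

```python
from typing import List, Tuple

# A fact is a (subject, relation, object) triplet.
Triplet = Tuple[str, str, str]


def linearize(triplets: List[Triplet]) -> str:
    """Render a triplet set as one fact per line (hypothetical format)."""
    return "\n".join(f"({s}; {r}; {o})" for s, r, o in triplets)


def build_generation_prompt(triplets: List[Triplet]) -> str:
    """Build a Y -> X prompt asking an LLM for text that expresses
    exactly the given facts (hypothetical wording)."""
    return (
        "Write text that expresses all of the following facts "
        "and no additional ones.\n"
        f"Facts:\n{linearize(triplets)}\n"
        "Text:"
    )


# Illustrative triplet set; the generated text would become the input X.
triplets = [
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Marie Curie", "award received", "Nobel Prize in Physics"),
]
prompt = build_generation_prompt(triplets)
print(prompt)
```

The text returned by the LLM for such a prompt, paired with the triplet set that produced it, forms one synthetic X-Y training example.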

We used this pipeline to generate two large high-quality datasets:

  • SynthIE-code: around 1.8M training, 10K validation, and 50K test samples generated with code-davinci-002
  • SynthIE-text: 10K validation and 50K test samples generated with text-davinci-003

The text for the validation and test data points in SynthIE-code and SynthIE-text corresponds to the same triplet sets.

The resulting data is then used to train SynthIE, a series of T5-based versions of GenIE -- a recently proposed autoregressive closed IE system. As a baseline, T5 versions of GenIE are trained on the same dataset, REBEL, as the original work. This repository contains the code and instructions for downloading and using the data and models, as well as reproducing all the experiments in the paper.

The codebase is built upon PyTorch Lightning and Hydra.

2. Environment Setup

With the repository cloned, we recommend creating a new conda virtual environment:

conda env create -n synthie
conda activate synthie

Install PyTorch 1.13.0, for example using pip with CUDA 11.3 support:

pip install torch==1.13.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 torchmetrics==0.10.3 -f https://download.pytorch.org/whl/torch_stable.html

Finally, install the remaining packages using pip:

pip install -r pip_requirements.txt

3. Downloading the Data and Models, Usage Instructions & Examples

The demo notebook provides a full overview of the provided resources and instructions on how to use them.

4. Training, Inference & Evaluation

Training

Each of the provided models is associated with a Hydra configuration file that reproduces its training. For instance, to train the synthie_base_fe model, run:

MODEL=synthie_base_fe # synthie_base_fe, synthie_base_sc, synthie_large_fe, genie_base_fe
python run_train.py +experiment/train=$MODEL

Inference

Hydra provides a clean interface for evaluation. You just need to specify the model to be evaluated and the dataset to evaluate it on:

MODEL=synthie_base_fe # synthie_base_fe, synthie_base_sc, synthie_large_fe, genie_base_fe

# The name of the dataset
DATAMODULE=rebel_pc  # "rebel_pc", "sdg_code_davinci_002_pc", "sdg_text_davinci_003_pc"

python run_inference.py +experiment/inference=$MODEL datamodule=$DATAMODULE

# For unconstrained generation add: ++model.constraint_module=null at the end of the call

The generated prediction sequences will be logged to Weights and Biases.

Evaluation

To compute the micro and macro performance, as well as the performance bucketed by relation frequency and number of target triplets, you only need the run's WandB path and to execute:

python run_process_predictions.py +experiment/process_predictions=complete_rebel wandb_run_path=$WANDB_PATH
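
For intuition about the two aggregate scores, here is a minimal sketch of micro and macro F1 over predicted and target triplet sets. It assumes exact triplet matching and a macro average taken per relation; it is not the repository's metric implementation (that lives under src/metrics), and all names are illustrative.

```python
from collections import defaultdict
from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def f1(tp: int, n_pred: int, n_gold: int) -> float:
    """F1 from true positives, number predicted, and number of targets."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_f1(preds: List[Set[Triplet]], golds: List[Set[Triplet]]) -> float:
    """Pool counts over all samples, then compute a single F1."""
    tp = sum(len(p & g) for p, g in zip(preds, golds))
    return f1(tp, sum(map(len, preds)), sum(map(len, golds)))


def macro_f1(preds: List[Set[Triplet]], golds: List[Set[Triplet]]) -> float:
    """Compute F1 per relation, then average so rare relations weigh equally."""
    counts = defaultdict(lambda: [0, 0, 0])  # relation -> [tp, n_pred, n_gold]
    for p, g in zip(preds, golds):
        for t in p:
            counts[t[1]][1] += 1
            counts[t[1]][0] += t in g
        for t in g:
            counts[t[1]][2] += 1
    scores = [f1(*c) for c in counts.values()]
    return sum(scores) / len(scores) if scores else 0.0


golds = [{("a", "r1", "b"), ("a", "r2", "c")}, {("d", "r1", "e")}]
preds = [{("a", "r1", "b")}, {("d", "r1", "e"), ("d", "r2", "f")}]
micro = micro_f1(preds, golds)
macro = macro_f1(preds, golds)
print(micro, macro)
```

In this toy example the frequent relation r1 is always predicted correctly and r2 never is, so the macro score falls below the micro score. Bucketing by relation frequency exposes the same kind of gap.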

License

This project is licensed under the terms of the MIT license.

synthie's Issues

Code for sampling triplet sets

Hello, thanks for your amazing work! I would like to know how to sample triplet sets with your method from scratch. Do you provide any code for this?

Rebel clean questions

Hello! Thank you for your work. Really good job on the higher-quality datasets.

However, I have a question about the REBEL-clean dataset. In the paper you mention that you manually created a dataset of 360 samples whose input texts meet two specified criteria. But the REBEL dataset on Hugging Face contains many more data points, several of which, from my manual exploration, still have the problems you tried to address by filtering with those criteria. Can you please clarify whether the REBEL version on Hugging Face is supposed to be the rebel_clean mentioned in the paper? If not, could you please provide the 360 rebel_clean data points so that your work can be reproduced?

run inference failed

Reproduction:

Following the example given in the inference section in the README.md

MODEL=synthie_base_fe # synthie_base_fe, synthie_base_sc, synthie_large_fe, genie_base_fe

# The name of the dataset
DATAMODULE=rebel_pc  # "rebel_pc", "sdg_code_davinci_002_pc", "sdg_text_davinci_003_pc"

python run_inference.py +experiment/inference=$MODEL datamodule=$DATAMODULE

Error message:

Traceback (most recent call last):
  File "/home/saibo/.conda/envs/synthie/lib/python3.8/site-packages/hydra/_internal/utils.py", line 639, in _locate
    obj = getattr(obj, part)
AttributeError: module 'src' has no attribute 'models'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/saibo/.conda/envs/synthie/lib/python3.8/site-packages/hydra/_internal/utils.py", line 645, in _locate
    obj = import_module(mod)
  File "/home/saibo/.conda/envs/synthie/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/saibo/Research/Rotation2/SynthIE/src/models/__init__.py", line 1, in <module>
    from .genie_flan_t5 import GenIEFlanT5PL
  File "/home/saibo/Research/Rotation2/SynthIE/src/models/genie_flan_t5.py", line 15, in <module>
    from src.metrics import TSF1, TSPrecision, TSRecall
  File "/home/saibo/Research/Rotation2/SynthIE/src/metrics/__init__.py", line 1, in <module>
    from .triplet_set_f1 import TSF1
  File "/home/saibo/Research/Rotation2/SynthIE/src/metrics/triplet_set_f1.py", line 4, in <module>
    from src.metrics.abstract import IEAbstractTorchMetric
  File "/home/saibo/Research/Rotation2/SynthIE/src/metrics/abstract.py", line 9, in <module>
    class IEAbstractTorchMetric(Metric, ABC):
  File "/home/saibo/Research/Rotation2/SynthIE/src/metrics/abstract.py", line 72, in IEAbstractTorchMetric
    ) -> Union[float, dict[str, float]]:
TypeError: 'type' object is not subscriptable

Fix

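In SynthIE/src/metrics/abstract.py, line 72 (inside IEAbstractTorchMetric), replace Union[float, dict[str, float]] with Union[float, Dict[str, float]], and make sure Dict is imported from typing.

The built-in dict cannot be subscripted in type annotations before Python 3.9 (the traceback above comes from Python 3.8), whereas typing.Dict can.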
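
A minimal sketch of the corrected annotation style, which runs on Python 3.8 as well as newer versions. The function below is a hypothetical stand-in for the annotated method in abstract.py, not the repository's code:

```python
from typing import Dict, Union, get_type_hints


def compute(per_relation: bool = False) -> Union[float, Dict[str, float]]:
    """Hypothetical stand-in: return either one score or a score per relation."""
    scores = {"r1": 1.0, "r2": 0.0}
    if per_relation:
        return scores
    return sum(scores.values()) / len(scores)


# typing.Dict is subscriptable on Python 3.8, so this definition imports cleanly.
hints = get_type_hints(compute)
print(hints["return"])
```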
