This is a companion repository to seq2rel (https://github.com/JohnGiorgi/seq2rel) which aims to make it easy to generate training data.

seq2rel-ds's Introduction

seq2rel: Datasets

This is a companion repository to seq2rel, which makes it easy to preprocess training data.

Installation

This repository requires Python 3.8 or later.

Setting up a virtual environment

Before installing, you should create and activate a Python virtual environment. If you need pointers on setting up a virtual environment, please see the AllenNLP install instructions.

Installing the library and dependencies

If you do not plan on modifying the source code, install from git using pip:

pip install git+https://github.com/JohnGiorgi/seq2rel-ds.git

Otherwise, clone the repository and install from source using Poetry:

# Install poetry for your system: https://python-poetry.org/docs/#installation
curl -sSL https://install.python-poetry.org | python3 -

# Clone and move into the repo
git clone https://github.com/JohnGiorgi/seq2rel-ds
cd seq2rel-ds

# Install the package with poetry
poetry install

Usage

Installing this package gives you access to a simple command-line tool, seq2rel-ds. To see the list of available commands, run:

seq2rel-ds --help

Note that you can also call the underlying Python files directly, e.g. python path/to/seq2rel_ds/main.py --help.

To preprocess a dataset (and in most cases, download it), call one of the commands, e.g.

seq2rel-ds cdr main "path/to/cdr"

Note that you have to include main because Typer does not support default commands.

This will create the preprocessed tsv files under the specified output directory, e.g.

cdr
 ┣ train.tsv
 ┣ valid.tsv
 ┗ test.tsv

which can then be used to train a seq2rel model.
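
To sanity-check the output before training, it can help to load one of the files. The sketch below assumes each row holds two tab-separated columns (source text, then target string). That layout is an assumption made for illustration and is not specified in this README, and read_tsv is a hypothetical helper, not part of seq2rel-ds.

```python
import csv
from pathlib import Path
from typing import List, Tuple


def read_tsv(path: Path) -> List[Tuple[str, str]]:
    """Read a two-column TSV file into (source, target) pairs.

    ASSUMPTION: every non-empty row is `source<TAB>target`.
    """
    pairs: List[Tuple[str, str]] = []
    with path.open(newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quote characters inside the text verbatim.
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if not row:
                continue  # skip blank lines
            if len(row) != 2:
                raise ValueError(f"Expected 2 columns, got {len(row)}: {row!r}")
            pairs.append((row[0], row[1]))
    return pairs
```

For example, read_tsv(Path("path/to/cdr/train.tsv")) would return the training pairs if the files follow the assumed layout, and raise a ValueError on the first row that does not.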

seq2rel-ds's Issues

Gold vs. pipeline-based entity hinting is untested.

Add tests for parse_pubtator

There are a couple of strange formatting choices used in certain corpora that are in the PubTator format (or a PubTator-like format). Examples include:

In BC5CDR, there are compound entities, separated by "|"
In GDA, some abstracts are actually titles only.
Test sort_ents

It would be great to add individual unit tests for each of these cases (and any other we discover) to ensure that we are handling them properly.
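
To make those two cases concrete, here is a minimal, self-contained sketch of a parser and the kind of input a unit test could feed it. It is not the repository's parse_pubtator: the function name, the dataclasses, and the assumed line layout (PMID|t|title, PMID|a|abstract, then tab-separated PMID, start, end, text, type, identifier) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Mention:
    start: int
    end: int
    text: str
    label: str
    ids: List[str]  # more than one ID for a compound entity


@dataclass
class Document:
    pmid: str
    title: str = ""
    abstract: str = ""  # stays empty for title-only records
    mentions: List[Mention] = field(default_factory=list)


def parse_pubtator_sketch(raw: str) -> Dict[str, Document]:
    """Hypothetical parser covering compound IDs and empty abstracts."""
    docs: Dict[str, Document] = {}
    for line in raw.splitlines():
        if not line.strip():
            continue
        head = line.split("|", 2)
        # Title/abstract lines: `PMID|t|text` or `PMID|a|text`.
        if len(head) == 3 and head[1] in ("t", "a") and "\t" not in head[0]:
            doc = docs.setdefault(head[0], Document(pmid=head[0]))
            if head[1] == "t":
                doc.title = head[2]
            else:
                doc.abstract = head[2]
            continue
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # e.g. relation lines, ignored in this sketch
        pmid, start, end, text, label, identifier = fields[:6]
        doc = docs.setdefault(pmid, Document(pmid=pmid))
        # Compound entities carry several IDs separated by "|".
        ids = identifier.split("|")
        doc.mentions.append(Mention(int(start), int(end), text, label, ids))
    return docs
```

A test for the compound-entity case could assert that a mention whose identifier column is "D001241|D007052" comes back with two IDs, and a test for the title-only case could assert that a record with an empty PMID|a| line yields an empty abstract rather than an error.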

Add tests for pubtator_to_seq2rel

pubtator_to_seq2rel is currently untested. Opening this issue so I don't forget to test it.

Once things are a little more stable, add integration tests for all the CLI commands. This will effectively just call the command within a unit test. The test won't assert anything but will fail if the CLI command throws an error.
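
A smoke test of this kind can be sketched with the standard library alone. The helper below runs a command and fails only if it exits with a non-zero status. The helper name is hypothetical, and the command in the example test is a stand-in so the sketch runs anywhere.

```python
import subprocess
import sys
from typing import Sequence


def run_cli_smoke_test(args: Sequence[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    result = subprocess.run(list(args), capture_output=True, text=True)
    if result.returncode != 0:
        raise AssertionError(
            f"Command {list(args)!r} failed with exit code "
            f"{result.returncode}:\n{result.stderr}"
        )


def test_command_runs() -> None:
    # Stand-in command. For this repository it would be replaced with
    # something like ["seq2rel-ds", "--help"].
    run_cli_smoke_test([sys.executable, "-c", "print('ok')"])
```

For the preprocessing commands themselves, the argument list would be along the lines of ["seq2rel-ds", "cdr", "main", str(tmp_path)] with pytest's tmp_path fixture as the output directory. Note that such a test may download the corpus, so it is likely better suited to a separate, slower test suite.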

Preprocessing an existing dataset should also output its labels

Because seq2rel needs the "special" tokens and the relation classes used in serialization, the preprocess command in this repo should save this information somewhere along with the preprocessed data. Perhaps in a JSON file in the same directory as the preprocessed data.
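
One possible shape for this, as a rough sketch: write a small JSON file next to the TSV files. The function name, the file name labels.json, and the two keys are all hypothetical choices for illustration, not an existing seq2rel or seq2rel-ds convention.

```python
import json
from pathlib import Path
from typing import Iterable


def save_labels(
    output_dir: Path,
    special_tokens: Iterable[str],
    relation_labels: Iterable[str],
    filename: str = "labels.json",  # hypothetical file name
) -> Path:
    """Write special tokens and relation labels next to the preprocessed data."""
    output_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        # Sorted and de-duplicated so the file is stable across runs.
        "special_tokens": sorted(set(special_tokens)),
        "relation_labels": sorted(set(relation_labels)),
    }
    path = output_dir / filename
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```

Sorting the values keeps the file deterministic, which makes it easier to diff between preprocessing runs.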

Standardize corpus preprocessing

All the preprocess commands should first convert their datasets to the PubTator format, so that subsequent processing can all use the same functions and methods we have written. This will simplify things like adding entity hints, computing corpus statistics, etc. The general steps are:

Rename the PubtatorAnnotation schema to something more general.
For each of the preprocess commands, first convert the corpus to PubTator format. Then use the existing parse_pubtator function to convert it to the soon-to-be-renamed PubtatorAnnotation schema.
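
The conversion step could look roughly like the following sketch, which serializes one document into PubTator-style text. The function name and the Annotation tuple are hypothetical. The sketch also assumes that character offsets index into the title and abstract joined by a single space, which should be checked against each corpus.

```python
from typing import Iterable, Tuple

# (start, end, text, type, identifier); a hypothetical intermediate form.
Annotation = Tuple[int, int, str, str, str]


def to_pubtator(
    doc_id: str, title: str, abstract: str, annotations: Iterable[Annotation]
) -> str:
    """Serialize one document and its entity annotations as PubTator text."""
    # ASSUMPTION: offsets index into title + " " + abstract.
    full_text = f"{title} {abstract}"
    lines = [f"{doc_id}|t|{title}", f"{doc_id}|a|{abstract}"]
    for start, end, text, label, identifier in sorted(annotations):
        if full_text[start:end] != text:
            raise ValueError(
                f"Offsets {start}-{end} do not match {text!r} in document {doc_id}"
            )
        lines.append(
            "\t".join([doc_id, str(start), str(end), text, label, identifier])
        )
    return "\n".join(lines) + "\n"
```

Checking each mention against the text at serialization time surfaces offset bugs in the corpus-specific code early, before the shared downstream functions see the data.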

Commands to update

BC5CDR
GDA
ADE
DocRED

Rename clusters to ents

Rename clusters attribute of PubTatorAnnotation object to ents.

Mismatch between organisms in BioGRID and PubTator

I am finding a number of mismatches in the organism ID for the same PMID between BioGRID and PubTator. Take, for instance, https://pubmed.ncbi.nlm.nih.gov/10924150/, where PubTator correctly identifies the proteins as belonging to mice, while BioGRID uses the human Entrez gene IDs. This causes us to miss the alignment.

There are a couple of solutions that I see:

(1) If there is a way to automatically detect these examples, I might be able to compile a list and (assuming there is a mechanism for doing so) report the errors to PubTator and BioGRID respectively.
(2) Leverage the mygene API to resolve discrepancies. Basically, we would query mygene with the Entrez ID in BioGRID, then check its homologous genes for the organism identified by PubTator, and try to match the homolog's Entrez ID to one of the proteins identified by PubTator.

Option (2) has a much higher chance of success because it can be added to the existing pipeline and automated.
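
The matching logic of the second option can be sketched without any network calls by passing the homolog lookup in as plain data. Everything here is hypothetical: the function name, the Homologs mapping shape (Entrez ID to a list of (taxon ID, Entrez ID) pairs), and the gene IDs in the usage note. Populating the mapping, for example from mygene query results, is deliberately left out.

```python
from typing import Dict, List, Optional, Set, Tuple

# ASSUMPTION: homolog lookups are supplied as
# {entrez_id: [(taxon_id, homolog_entrez_id), ...]}.
Homologs = Dict[str, List[Tuple[int, str]]]


def resolve_gene_id(
    biogrid_id: str,
    pubtator_ids: Set[str],
    pubtator_taxon: int,
    homologs: Homologs,
) -> Optional[str]:
    """Map a BioGRID Entrez ID onto an ID that PubTator annotated.

    Returns the ID itself if PubTator already has it, otherwise the first
    homolog in PubTator's organism that PubTator annotated, otherwise None.
    """
    if biogrid_id in pubtator_ids:
        return biogrid_id
    for taxon, gene_id in homologs.get(biogrid_id, []):
        if taxon == pubtator_taxon and gene_id in pubtator_ids:
            return gene_id
    return None
```

With made-up IDs, resolve_gene_id("1111", {"2222"}, 10090, {"1111": [(10090, "2222")]}) returns "2222": the human gene used by BioGRID is mapped to its mouse homolog (taxon 10090) that PubTator annotated. A None result means the alignment is still missed and the example can be logged for manual review.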

How to set a default typer command?

Because of the nesting of our Typer apps, we end up with a weird situation where one has to specify a final command, when intuitively we would like a default. E.g. seq2rel-ds preprocess <command> main. Ideally, we would like to be able to drop main, with the assumption that this is the subcommand we want if no other subcommand is specified. I have tried to figure out how to do this with Typer, with no luck. Keep an eye on this issue in case it is ever resolved.

Add back scispacy support

Sometimes, PubTator misses annotations, which causes an alignment to fail. Ideally, we could extend PubTator's annotations using another service, like scispacy. Assuming that scispacy at least occasionally catches annotations that are missed by PubTator, this would lead to fewer missed alignments and therefore more training data. It would also likely improve the quality of the alignments, as it should lead to fewer missed interactions and coreferent mentions.

Number of examples produced by BC5CDR script doesn't add up

For some reason, the number of examples produced by the bc5cdr command is off by 1 in both the valid and test sets compared to the gold standard. Investigate.

Alignment scoring misses coreferent mentions

The simple alignment scoring we introduced in #6 does not catch missed coreferent mentions. This will lead to some less-than-perfect alignments with scores of 1.0. I don't currently have a proposal to solve this, but it is the last kink to work out before we can really trust the scores computed on the validation and test sets.

Compute corpus statistics

I created a branch, compute-corpus-statistics, that has code to compute corpus statistics, mainly in order to compute the fraction of inter-sentence relations which we report in the paper. This code isn't particularly pretty and I don't really have the time to clean it up to merge into main, so I am just going to leave it on its own branch in case it's needed in the future. Opening this issue so that I don't forget.
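
For reference, the core of such a statistic can be sketched in a few lines. This is not the code on the compute-corpus-statistics branch. It assumes a relation counts as inter-sentence when no single sentence contains a mention of both entities, and that sentence boundaries are already available as character spans. The exact definition used in the paper is not given here, and all names below are hypothetical.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive
# A relation is a pair of entities, each given as its list of mention spans.
Relation = Tuple[List[Span], List[Span]]


def sentence_index(offset: int, sentences: Sequence[Span]) -> int:
    """Return the index of the sentence containing a character offset."""
    for i, (start, end) in enumerate(sentences):
        if start <= offset < end:
            return i
    raise ValueError(f"Offset {offset} is outside every sentence")


def is_inter_sentence(relation: Relation, sentences: Sequence[Span]) -> bool:
    """True if no sentence contains a mention of both entities."""
    head, tail = relation
    head_sents = {sentence_index(start, sentences) for start, _ in head}
    tail_sents = {sentence_index(start, sentences) for start, _ in tail}
    return head_sents.isdisjoint(tail_sents)


def inter_sentence_fraction(
    relations: Sequence[Relation], sentences: Sequence[Span]
) -> float:
    """Fraction of relations in one document that cross sentence boundaries."""
    if not relations:
        return 0.0
    count = sum(is_inter_sentence(r, sentences) for r in relations)
    return count / len(relations)
```

Because every mention of an entity is considered, a relation whose entities co-occur in at least one sentence through coreferent mentions is counted as intra-sentence under this definition.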

Re-enable MyPy type checking in CI

I disabled mypy type checking in the CI for now. There are tens of type-check errors, and I don't want them to slow down development. Will re-enable later.