seq2rel: Datasets

This is a companion repository to seq2rel, which makes it easy to preprocess training data.

Installation

This repository requires Python 3.8 or later.

Setting up a virtual environment

Before installing, you should create and activate a Python virtual environment. If you need pointers on setting up a virtual environment, please see the AllenNLP install instructions.

Installing the library and dependencies

If you do not plan on modifying the source code, install from git using pip:

pip install git+https://github.com/JohnGiorgi/seq2rel-ds.git

Otherwise, clone the repository and install from source using Poetry:

# Install poetry for your system: https://python-poetry.org/docs/#installation
curl -sSL https://install.python-poetry.org | python3 -

# Clone and move into the repo
git clone https://github.com/JohnGiorgi/seq2rel-ds
cd seq2rel-ds

# Install the package with poetry
poetry install

Usage

Installing this package gives you access to a simple command-line tool, seq2rel-ds. To see the list of available commands, run:

seq2rel-ds --help

Note that you can also call the underlying Python files directly, e.g. python path/to/seq2rel_ds/main.py --help.

To preprocess a dataset (and in most cases, download it), call one of the commands, e.g.

seq2rel-ds cdr main "path/to/cdr"

Note that you have to include main because typer does not support default commands.

This will create the preprocessed tsv files under the specified output directory, e.g.

cdr
 ┣ train.tsv
 ┣ valid.tsv
 ┗ test.tsv

which can then be used to train a seq2rel model.
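
Before training, it can be useful to confirm that preprocessing produced all three splits. The following is a minimal sketch, not part of seq2rel-ds: the helper name missing_splits is hypothetical, and the file names are taken from the directory listing above.

```python
from pathlib import Path
from typing import List

# File names taken from the example output directory shown above.
EXPECTED_SPLITS = ("train.tsv", "valid.tsv", "test.tsv")


def missing_splits(output_dir: str) -> List[str]:
    """Return the expected split files that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_SPLITS if not (root / name).is_file()]
```

An empty list means every split is present, e.g. missing_splits("path/to/cdr").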

seq2rel-ds's Issues

Add tests for parse_pubtator

Certain corpora in the PubTator format (or a PubTator-like format) use a couple of strange formatting choices. Examples include:

  • In BC5CDR, there are compound entities, separated by "|"
  • In GDA, some abstracts are actually titles only.
  • Test sort_ents

It would be great to add individual unit tests for each of these cases (and any other we discover) to ensure that we are handling them properly.
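
As a rough illustration of what such tests might look like, here is a sketch for the first two cases. It does not call the real parse_pubtator, whose signature is not shown here. The helpers split_compound_ids and parse_title_and_abstract are hypothetical stand-ins, and the identifiers are placeholders.

```python
from typing import List, Tuple


def split_compound_ids(identifier: str, sep: str = "|") -> List[str]:
    """Hypothetical helper: split a compound identifier such as "ID1|ID2"."""
    return [part for part in identifier.split(sep) if part]


def parse_title_and_abstract(lines: List[str]) -> Tuple[str, str, str]:
    """Hypothetical helper: parse the "PMID|t|..." and "PMID|a|..." lines."""
    # maxsplit=2 keeps any "|" characters that occur inside the text itself.
    pmid, _, title = lines[0].split("|", 2)
    _, _, abstract = lines[1].split("|", 2)
    return pmid, title, abstract


def test_compound_ids_are_split() -> None:
    assert split_compound_ids("ID1|ID2") == ["ID1", "ID2"]
    assert split_compound_ids("ID1") == ["ID1"]


def test_title_only_abstract_is_empty() -> None:
    # Title-only documents have an empty abstract line.
    lines = ["123|t|A title", "123|a|"]
    assert parse_title_and_abstract(lines) == ("123", "A title", "")
```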

Add integration tests

Once things are a little more stable, add integration tests for all the CLI commands. This will effectively just call the command within a unit test. The test won't assert anything but will fail if the CLI command throws an error.
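
One possible shape for such a smoke test, using only the standard library, is sketched below. The helper run_cli is hypothetical, and the command it runs is a stand-in: a real test would pass something like ["seq2rel-ds", "--help"]. Typer also provides typer.testing.CliRunner for invoking an app in-process, which may be a better fit.

```python
import subprocess
import sys
from typing import List


def run_cli(args: List[str]) -> "subprocess.CompletedProcess[str]":
    """Run a command; raise CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(args, check=True, capture_output=True, text=True)


def test_cli_smoke() -> None:
    # No assertions: the test fails only if the command itself fails.
    # Stand-in command; replace with the CLI invocation under test.
    run_cli([sys.executable, "-c", "print('ok')"])
```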

Preprocessing an existing dataset should also output its labels

Because seq2rel needs the "special" tokens and the relation classes used in serialization, the preprocess command in this repo should save this information somewhere along with the preprocessed data. Perhaps in a JSON file in the same directory as the preprocessed data.
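
A minimal sketch of the JSON idea follows. The function name save_labels, the file name labels.json, and the two keys are all assumptions made for illustration. Nothing here reflects an existing seq2rel or seq2rel-ds interface.

```python
import json
from pathlib import Path
from typing import List


def save_labels(
    output_dir: str, special_tokens: List[str], relation_labels: List[str]
) -> Path:
    """Write the special tokens and relation labels next to the preprocessed data."""
    path = Path(output_dir) / "labels.json"  # hypothetical file name
    payload = {
        # Deduplicate and sort so the file is stable across runs.
        "special_tokens": sorted(set(special_tokens)),
        "relation_labels": sorted(set(relation_labels)),
    }
    path.write_text(json.dumps(payload, indent=2))
    return path
```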

Standardize corpus preprocessing

All the preprocess commands should first convert their datasets to the PubTator format, so that subsequent processing can all use the same functions and methods we have written. This will simplify things like adding entity hints, computing corpus statistics, etc. The general steps are:

  1. Rename the PubtatorAnnotation schema to something more general.
  2. For each of the preprocess commands, first convert the corpus to PubTator format. Then use the existing parse_pubtator function to convert it to the soon to be renamed PubtatorAnnotation schema.
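
The conversion step could be sketched roughly as below. The Mention and Document classes are hypothetical stand-ins for the schema that is to be renamed. The output follows the usual PubTator layout: a "PMID|t|title" line, a "PMID|a|abstract" line, then one tab-separated line per mention. Relation lines are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Mention:
    start: int  # character offset into the title and abstract text
    end: int
    text: str
    label: str
    identifier: str


@dataclass
class Document:
    pmid: str
    title: str
    abstract: str
    mentions: List[Mention] = field(default_factory=list)


def to_pubtator(doc: Document) -> str:
    """Serialize a document and its mentions to PubTator-formatted text."""
    lines = [f"{doc.pmid}|t|{doc.title}", f"{doc.pmid}|a|{doc.abstract}"]
    # Emit mentions in order of appearance.
    for m in sorted(doc.mentions, key=lambda m: (m.start, m.end)):
        fields = [doc.pmid, str(m.start), str(m.end), m.text, m.label, m.identifier]
        lines.append("\t".join(fields))
    return "\n".join(lines)
```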

Commands to update

  • BC5CDR
  • GDA
  • ADE
  • DocRED

Mismatch between organisms in BioGRID and PubTator

I am finding a number of mismatches in the organism ID for the same PMID between BioGRID and PubTator. Take for instance, https://pubmed.ncbi.nlm.nih.gov/10924150/, where PubTator correctly identifies the proteins as belonging to mice, while BioGRID uses the human entrez gene IDs. This causes us to miss the alignment.

There are a couple of solutions that I see:

  1. If there is a way to automatically detect these examples, I might be able to compile a list and (assuming there is a mechanism for doing so) report the errors to PubTator and BioGRID respectively.
  2. Leverage the mygene API to resolve discrepancies. Basically, we would query mygene with the entrez ID in BioGRID, then check its homologous genes for the organism identified by PubTator, and try to match this entrez ID to one of the proteins identified by PubTator.

Option (2) has a much higher chance of success because it can be added to the existing pipeline and automated.
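
The matching step in option (2) might look like the sketch below. It works on a record that has already been fetched, for example with mygene.MyGeneInfo().getgene(entrez_id, fields="homologene"). It assumes the record carries a homologene.genes list of [taxon ID, Entrez ID] pairs, which should be checked against the mygene documentation. Both function names are hypothetical.

```python
from typing import Any, Dict, Optional, Set


def homolog_for_taxon(record: Dict[str, Any], taxon_id: int) -> Optional[str]:
    """Return the Entrez ID of the homolog in the given organism, if any.

    Assumes record["homologene"]["genes"] is a list of [taxon_id, entrez_id] pairs.
    """
    genes = record.get("homologene", {}).get("genes", [])
    for taxid, gene_id in genes:
        if int(taxid) == taxon_id:
            return str(gene_id)
    return None


def resolve_to_pubtator(
    record: Dict[str, Any], taxon_id: int, pubtator_ids: Set[str]
) -> Optional[str]:
    """Map a BioGRID gene to a PubTator-annotated gene via its homolog.

    Returns the homolog's Entrez ID only if PubTator annotated it in the document.
    """
    homolog = homolog_for_taxon(record, taxon_id)
    return homolog if homolog in pubtator_ids else None
```

Here taxon_id is the organism identified by PubTator and pubtator_ids is the set of Entrez IDs PubTator found in the document. A None result means the discrepancy could not be resolved this way.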

How to set a default typer command?

Because of the nesting of our typer apps, we end up with a weird situation where one has to specify a final command, when intuitively we would like a default, e.g. seq2rel-ds preprocess <command> main. Ideally, we would like to be able to drop main, with the assumption that this is the subcommand we want if no other subcommand is specified. I have tried to figure out how to do this with typer with no luck. Keep an eye on this issue in case it is ever resolved.

Add back scispacy support

Sometimes, PubTator misses annotations, which causes an alignment to fail. Ideally, we could extend PubTator's annotations using another service, like scispacy. Assuming that scispacy at least occasionally catches annotations that are missed by PubTator, this would lead to fewer missed alignments and therefore more training data. It would also likely improve the quality of the alignments, as it should lead to fewer missed interactions and coreferent mentions.

Alignment scoring misses coreferent mentions

The simple alignment scoring we introduced in #6 does not catch missed coreferent mentions. This will lead to some less-than-perfect alignments with scores of 1.0. I don't currently have a proposal to solve this, but it is the last kink to work out before we can really trust the scores computed on the validation and test sets.

Compute corpus statistics

I created a branch, compute-corpus-statistics, that has code to compute corpus statistics, mainly in order to compute the fraction of inter-sentence relations which we report in the paper. This code isn't particularly pretty and I don't really have the time to clean it up to merge into main, so I am just going to leave it on its own branch in case it's needed in the future. Opening this issue so that I don't forget.

Re-enable MyPy type checking in CI

I disabled mypy type checking in the CI for now. There are tens of type check errors, and I don't want them to slow down development. Will re-enable later.
