obss / trapper
State-of-the-art NLP through transformer models in a modular design and consistent APIs.
License: MIT License
When I run the following command from the project root directory,
python tests/run_code_style.py check
I get the error below.
ERROR: SOME_FOLDERS/trapper/trapper/data/dataset_readers/dataset_reader.py Imports are incorrectly sorted and/or formatted.
Skipped 1 files
Traceback (most recent call last):
File "tests/run_code_style.py", line 10, in <module>
assert_shell("isort . --check --settings setup.cfg")
File "SOME_FOLDERS/trapper/tests/utils.py", line 19, in assert_shell
), f"Unexpected exit code {str(actual_exit_status)}"
AssertionError: Unexpected exit code 256
Can't we just print a message for each incorrectly formatted file instead of raising an exception?
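For illustration, a minimal sketch of a non-raising variant (check_shell is a hypothetical replacement for the assert_shell helper in tests/utils.py):

import subprocess

def check_shell(command: str) -> bool:
    # Print a message for each failed check instead of raising an AssertionError.
    exit_status = subprocess.run(command, shell=True).returncode
    if exit_status != 0:
        print(f"Check failed (exit code {exit_status}): {command}")
    return exit_status == 0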
HuggingFace transformers' pipelines underwent a major refactor in v4.11.0. Thanks to the new design, we can refactor our pipeline factory and existing pipelines, and possibly wrap the pipeline as a Registrable class. However, allennlp's latest release does not include this yet, although their master branch does. Therefore, we may need to wait for their next release (I expect it won't take long) or pin to that commit to prevent a dependency mismatch.
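For illustration, a rough sketch of what wrapping a pipeline behind allennlp's Registrable could look like (all class names here are hypothetical):

from allennlp.common import Registrable
from transformers import pipeline

class TrapperPipeline(Registrable):
    # Thin Registrable wrapper around a transformers pipeline.
    def __init__(self, task: str, model: str):
        self._pipeline = pipeline(task=task, model=model)

    def __call__(self, *args, **kwargs):
        return self._pipeline(*args, **kwargs)

@TrapperPipeline.register("question-answering")
class QuestionAnsweringPipeline(TrapperPipeline):
    def __init__(self, model: str):
        super().__init__(task="question-answering", model=model)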
Currently, the commands do not work because the commands.py module imports the overrides package, which is not listed in the requirements and hence not installed. The only place it is used is to decorate the add_subparser method of the Run class.
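To illustrate the pattern that pulls in the dependency (the base class below is a stand-in, not the actual one in commands.py); the fix is either adding overrides to the requirements or dropping the decorator, which Python does not need for overriding:

from overrides import overrides  # fails at import time when `overrides` is not installed

class Subcommand:  # stand-in for the actual base class
    def add_subparser(self, parser):
        raise NotImplementedError

class Run(Subcommand):
    @overrides  # the package's only use: verifying the override
    def add_subparser(self, parser):
        return parser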
Folders like examples/, scripts/, and tests/ are not formatted by black due to the current pyproject.toml config.
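A hedged sketch of the relevant pyproject.toml section (the current config is not shown here, so this is illustrative only):

[tool.black]
# black formats every file matching `include` (default '\.pyi?$') that is
# not excluded; if the current config narrows `include` to the package
# directory or lists these folders under `exclude`, relaxing that would
# bring examples/, scripts/ and tests/ under black as well.
include = '\.pyi?$'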
It should accept a data_adapter, and we also need to move the _LABELS property into data_adapter.
The test_data directory would be better placed under the project root, since the current location might cause confusion with the tests of the data package.
It would be really nice if we provided automatic evaluation metrics from the datasets/metrics module, preferably through the jury library.
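For example, following the usage shown in jury's README (the inputs here are illustrative):

from jury import Jury

scorer = Jury()  # defaults to a standard set of NLG metrics
predictions = [["the cat is on the mat"], ["Look! a wonderful day."]]
references = [["the cat is playing on the mat."], ["Today is a wonderful day"]]
scores = scorer(predictions=predictions, references=references)
print(scores)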
Instead of doing the tensor conversion manually, we can do something like
dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
as shown in the datasets docs. This would also conveniently handle removing the unused columns (the columns not required by the model).
There is an issue occurring when there are invalid instances.
The _chop_excess_context_tokens method in SquadDataProcessor does not take the special tokens into account. Typically, we use N + 1 special tokens (such as BOS, SEP, EOS, etc.) when there are N sequences (e.g. the context and the answer in a question answering task).
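For illustration, a sketch (hypothetical helper, not trapper's actual code) of the budget the chopping logic should respect:

from typing import List

def max_context_length(model_max_length: int, other_sequence_lengths: List[int]) -> int:
    # With N sequences in total there are N + 1 special tokens
    # (e.g. BOS, one SEP per sequence boundary, and EOS).
    num_sequences = len(other_sequence_lengths) + 1  # +1 for the context itself
    num_special_tokens = num_sequences + 1
    return model_max_length - sum(other_sequence_lengths) - num_special_tokens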
When trying to run the POS Tagging Example experiment, I get the following exception:
Traceback (most recent call last):
File "/home/sophylax/anaconda3/envs/trapper/lib/python3.7/site-packages/allennlp/common/params.py", line 211, in pop
value = self.params.pop(key)
KeyError: 'metric_input_handler'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/sophylax/anaconda3/envs/trapper/bin/trapper", line 8, in <module>
sys.exit(run())
File "/home/sophylax/Documents/Git/Github/trapper/trapper/__main__.py", line 12, in run
main(prog="trapper")
File "/home/sophylax/Documents/Git/Github/trapper/trapper/commands.py", line 178, in main
args.func(args)
File "/home/sophylax/Documents/Git/Github/trapper/trapper/commands.py", line 101, in run_experiment_from_args
run_experiment(args.config_path, args.overrides)
File "/home/sophylax/Documents/Git/Github/trapper/trapper/training/train.py", line 41, in run_experiment
return _run_experiment_from_params(params)
File "/home/sophylax/Documents/Git/Github/trapper/trapper/training/train.py", line 64, in _run_experiment_from_params
trainer = TransformerTrainer.from_params(params)
File "/home/sophylax/anaconda3/envs/trapper/lib/python3.7/site-packages/allennlp/common/from_params.py", line 608, in from_params
**extras,
File "/home/sophylax/anaconda3/envs/trapper/lib/python3.7/site-packages/allennlp/common/from_params.py", line 636, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/sophylax/anaconda3/envs/trapper/lib/python3.7/site-packages/allennlp/common/from_params.py", line 207, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/home/sophylax/anaconda3/envs/trapper/lib/python3.7/site-packages/allennlp/common/from_params.py", line 310, in pop_and_construct_arg
popped_params = params.pop(name, default) if default != _NO_DEFAULT else params.pop(name)
File "/home/sophylax/anaconda3/envs/trapper/lib/python3.7/site-packages/allennlp/common/params.py", line 216, in pop
raise ConfigurationError(msg)
allennlp.common.checks.ConfigurationError: key "metric_input_handler" is required
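It looks like the experiment config is missing the required metric_input_handler entry; a hypothetical fragment of the fix (the type name is a placeholder, not a verified registered name):

"metric_input_handler": {"type": "token-classification"},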
test_fixtures and scripts should have been excluded, but they are not.
This will use the example project added previously.
The default trainer is not suitable for training conditional text generation models.
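For reference, transformers itself ships a generation-aware trainer; a sketch of the relevant knob (whether this fits trapper's design is an open question):

from transformers import Seq2SeqTrainingArguments

# Seq2SeqTrainer/Seq2SeqTrainingArguments extend the default trainer with
# generation-based evaluation, which conditional generation models need.
args = Seq2SeqTrainingArguments(
    output_dir="outputs",
    predict_with_generate=True,  # evaluate with model.generate()
)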
We can do the pre-processing by passing the data processors as the callable argument to dataset.map().
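A minimal sketch (the processor below is a trivial stand-in for a real data processor):

from datasets import load_dataset

dataset = load_dataset("squad", split="validation[:100]")

def process(example):  # stand-in for a data processor's __call__
    example["question"] = example["question"].lstrip()
    return example

dataset = dataset.map(process)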
Right now we are pinned to exact versions of these two packages, but we might be able to support a range of versions, increasing compatibility with other projects.
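For example, range bounds in the requirements instead of exact pins (package names and bounds are illustrative, not tested):

transformers>=4.11,<4.13
datasets>=1.12,<2.0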
This will be used to show how trapper can be used as a library in downstream tasks.
Add support for callbacks; transformers' callbacks support many third-party tools as well.
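For instance, a sketch of the hook API that such support could expose (the callback below is a toy example):

from transformers import TrainerCallback

class PrintingCallback(TrainerCallback):
    # transformers' third-party integrations (TensorBoard, W&B, etc.)
    # are built on this same hook API.
    def on_log(self, args, state, control, logs=None, **kwargs):
        print(logs)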
Currently, the tests are too complicated and lack structure and reuse. This makes it especially hard to write tests for custom classes in new tasks.
An argument handler class can be created to determine whether the datasets library is used and to store the other arguments, e.g. the path, dataset name, split name, etc. Then, the dataset reader could be given the handler instead of just the dataset path, as is currently done. The dataset reader should know how to read the dataset in both cases.
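A minimal sketch of such a handler (all names here are hypothetical):

from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetArgs:
    path: str                      # local path or HF dataset identifier
    use_hf_datasets: bool = False  # whether to load via the datasets library
    name: Optional[str] = None     # HF dataset config name
    split: Optional[str] = None    # e.g. "train" or "validation"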
Renew the fixture dataset caches automatically whenever cache_hf_dataset_fixtures.py is called
In _run_experiment_from_params, the creation of the serialization directory and the saving of the experiment config should only be done by the main process in the case of distributed training. Currently, distributed training crashes there, as the other processes try to create a non-empty directory after the quickest process has already created it.
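A sketch of the kind of guard needed (the path and helper are illustrative):

import os
import torch.distributed as dist

def is_main_process() -> bool:
    # Rank 0 (or a non-distributed run) should own directory creation.
    return not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0

serialization_dir = "outputs/experiment"
if is_main_process():
    os.makedirs(serialization_dir, exist_ok=True)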
Support a single plugin file, e.g. ~/.trapper_plugins, containing the modules that can be used across multiple projects. allennlp already supports this, so it should be easy to add.
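Mirroring allennlp's ~/.allennlp_plugins, the file would simply list one importable module per line, e.g.:

my_project.custom_data_processors
another_project.custom_metric_handlers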
We missed that the commands were broken for a while because we do not exercise them in our test suite.
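Even a smoke test would have caught this; a sketch:

import subprocess

def test_trapper_cli_help():
    # The entry point should at least be importable and print its help.
    result = subprocess.run(["trapper", "--help"], capture_output=True)
    assert result.returncode == 0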