obss / trapper
State-of-the-art NLP through transformer models in a modular design and consistent APIs.
License: MIT License
When I run the following command from the project root directory,
python tests/run_code_style.py check
I get the error below.
ERROR: SOME_FOLDERS/trapper/trapper/data/dataset_readers/dataset_reader.py Imports are incorrectly sorted and/or formatted.
Skipped 1 files
Traceback (most recent call last):
File "tests/run_code_style.py", line 10, in <module>
assert_shell("isort . --check --settings setup.cfg")
File "SOME_FOLDERS/trapper/tests/utils.py", line 19, in assert_shell
), f"Unexpected exit code {str(actual_exit_status)}"
AssertionError: Unexpected exit code 256
Can't we just print a message for each incorrectly formatted file instead of raising an exception?
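For illustration, a minimal sketch of a non-raising variant (check_shell is a hypothetical replacement for the assert_shell helper in tests/utils.py):

import subprocess

def check_shell(command: str) -> bool:
    # Print a message for each failed check instead of raising an AssertionError.
    exit_status = subprocess.run(command, shell=True).returncode
    if exit_status != 0:
        print(f"Check failed (exit code {exit_status}): {command}")
    return exit_status == 0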
HuggingFace transformers' pipelines underwent a major refactor in v4.11.0. Thanks to the new design, we can refactor our pipeline factory and existing pipelines, and possibly wrap the pipeline as a Registrable class. However, allennlp's latest release does not include this yet, although their master branch does. Therefore, we may need to wait for their next release (I expect it won't take long) or pin to that commit to prevent a dependency mismatch.
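For illustration, a rough sketch of what wrapping a pipeline behind allennlp's Registrable could look like (all class names here are hypothetical):

from allennlp.common import Registrable
from transformers import pipeline

class TrapperPipeline(Registrable):
    # Thin Registrable wrapper around a transformers pipeline.
    def __init__(self, task: str, model: str):
        self._pipeline = pipeline(task=task, model=model)

    def __call__(self, *args, **kwargs):
        return self._pipeline(*args, **kwargs)

@TrapperPipeline.register("question-answering")
class QuestionAnsweringPipeline(TrapperPipeline):
    def __init__(self, model: str):
        super().__init__(task="question-answering", model=model)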
Currently, the commands do not work because the commands.py module imports the overrides package, which is not listed in the requirements and hence not installed. The only place it is used is to decorate the add_subparser method of the Run class.
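To illustrate the pattern that pulls in the dependency (the base class below is a stand-in, not the actual one in commands.py); the fix is either adding overrides to the requirements or dropping the decorator, which Python does not need for overriding:

from overrides import overrides  # fails at import time when `overrides` is not installed

class Subcommand:  # stand-in for the actual base class
    def add_subparser(self, parser):
        raise NotImplementedError

class Run(Subcommand):
    @overrides  # the package's only use: verifying the override
    def add_subparser(self, parser):
        return parser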
Folders like examples/, scripts/, and tests/ are not formatted by black due to the current pyproject.toml config.
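A hedged sketch of the relevant pyproject.toml section (the current config is not shown here, so this is illustrative only):

[tool.black]
# black formats every file matching `include` (default '\.pyi?$') that is
# not excluded; if the current config narrows `include` to the package
# directory or lists these folders under `exclude`, relaxing that would
# bring examples/, scripts/ and tests/ under black as well.
include = '\.pyi?$'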
It should accept a data_adapter, and we also need to move the _LABELS property into data_adapter.
The test_data directory would be better placed under the project root, since the current location might cause confusion with the tests of the data package.
It would be really nice if we provided automatic evaluation metrics from the datasets/metrics module, preferably through the jury library.
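For example, following the usage shown in jury's README (the inputs here are illustrative):

from jury import Jury

scorer = Jury()  # defaults to a standard set of NLG metrics
predictions = [["the cat is on the mat"], ["Look! a wonderful day."]]
references = [["the cat is playing on the mat."], ["Today is a wonderful day"]]
scores = scorer(predictions=predictions, references=references)
print(scores)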
Instead of doing the tensor conversion manually, we can do something like
dataset.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
as shown in the datasets docs. This would also conveniently handle removing the unused columns (the columns not required by the model).
There is an issue occurring when there are invalid instances.
The _chop_excess_context_tokens method in SquadDataProcessor does not take the special tokens into account. Typically, we use N + 1 special tokens (such as BOS, SEP, EOS, etc.) when there are N sequences (e.g. the context and the answer in a question answering task).
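For illustration, a sketch (hypothetical helper, not trapper's actual code) of the budget the chopping logic should respect:

from typing import List

def max_context_length(model_max_length: int, other_sequence_lengths: List[int]) -> int:
    # With N sequences in total there are N + 1 special tokens
    # (e.g. BOS, one SEP per sequence boundary, and EOS).
    num_sequences = len(other_sequence_lengths) + 1  # +1 for the context itself
    num_special_tokens = num_sequences + 1
    return model_max_length - sum(other_sequence_lengths) - num_special_tokens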
When trying to run the POS Tagging Example experiment, I get the following exception:
Traceback (most recent call last):
File "/home/sophylax/anaconda3/envs/trapper/lib/python3.7/site-packages/allennlp/common/params.py", line 211, in pop
value = self.params.pop(key)
KeyError: 'metric_input_handler'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/sophylax/anaconda3/envs/trapper/bin/trapper", line 8, in <module>
sys.exit(run())
File "/home/sophylax/Documents/Git/Github/trapper/trapper/__main__.py", line 12, in run
main(prog="trapper")
File "/home/sophylax/Documents/Git/Github/trapper/trapper/commands.py", line 178, in main
args.func(args)
File "/home/sophylax/Documents/Git/Github/trapper/trapper/commands.py", line 101, in run_experiment_from_args
run_experiment(args.config_path, args.overrides)
File "/home/sophylax/Documents/Git/Github/trapper/trapper/training/train.py", line 41, in run_experiment
return _run_experiment_from_params(params)
File "/home/sophylax/Documents/Git/Github/trapper/trapper/training/train.py", line 64, in _run_experiment_from_params
trainer = TransformerTrainer.from_params(params)
File "/home/sophylax/anaconda3/envs/trapper/lib/python3.7/site-packages/allennlp/common/from_params.py", line 608, in from_params
**extras,
File "/home/sophylax/anaconda3/envs/trapper/lib/python3.7/site-packages/allennlp/common/from_params.py", line 636, in from_params
kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
File "/home/sophylax/anaconda3/envs/trapper/lib/python3.7/site-packages/allennlp/common/from_params.py", line 207, in create_kwargs
cls.__name__, param_name, annotation, param.default, params, **extras
File "/home/sophylax/anaconda3/envs/trapper/lib/python3.7/site-packages/allennlp/common/from_params.py", line 310, in pop_and_construct_arg
popped_params = params.pop(name, default) if default != _NO_DEFAULT else params.pop(name)
File "/home/sophylax/anaconda3/envs/trapper/lib/python3.7/site-packages/allennlp/common/params.py", line 216, in pop
raise ConfigurationError(msg)
allennlp.common.checks.ConfigurationError: key "metric_input_handler" is required
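It looks like the experiment config is missing the required metric_input_handler entry; a hypothetical fragment of the fix (the type name is a placeholder, not a verified registered name):

"metric_input_handler": {"type": "token-classification"},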
test_fixtures and scripts should have been excluded, but they are not.
This will use the example project added previously.
The default trainer is not suitable for training conditional text generation models.
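For reference, transformers itself ships a generation-aware trainer; a sketch of the relevant knob (whether this fits trapper's design is an open question):

from transformers import Seq2SeqTrainingArguments

# Seq2SeqTrainer/Seq2SeqTrainingArguments extend the default trainer with
# generation-based evaluation, which conditional generation models need.
args = Seq2SeqTrainingArguments(
    output_dir="outputs",
    predict_with_generate=True,  # evaluate with model.generate()
)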
We can do the pre-processing by passing the data processors as the callable argument to dataset.map().
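A minimal sketch (the processor below is a trivial stand-in for a real data processor):

from datasets import load_dataset

dataset = load_dataset("squad", split="validation[:100]")

def process(example):  # stand-in for a data processor's __call__
    example["question"] = example["question"].lstrip()
    return example

dataset = dataset.map(process)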
Right now we are pinned to exact versions of these two packages, but we might be able to support a range of versions, increasing compatibility with other projects.
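For example, range bounds in the requirements instead of exact pins (package names and bounds are illustrative, not tested):

transformers>=4.11,<4.13
datasets>=1.12,<2.0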
This will be used to show how trapper can be used as a library in downstream tasks.
Add support for callbacks; transformers' callbacks support many third-party tools as well.
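For instance, a sketch of the hook API that such support could expose (the callback below is a toy example):

from transformers import TrainerCallback

class PrintingCallback(TrainerCallback):
    # transformers' third-party integrations (TensorBoard, W&B, etc.)
    # are built on this same hook API.
    def on_log(self, args, state, control, logs=None, **kwargs):
        print(logs)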
Currently, the tests are too complicated and lack structure and reuse. This makes it especially hard to write tests for custom classes in new tasks.
An argument handler class can be created to determine whether the datasets library is used and to store the other arguments, e.g. the path, dataset name, split name, etc. Then, the dataset reader could be given the handler instead of just the dataset path, as is currently done. The dataset reader should know how to read the dataset in both cases.
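A minimal sketch of such a handler (all names here are hypothetical):

from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetArgs:
    path: str                      # local path or HF dataset identifier
    use_hf_datasets: bool = False  # whether to load via the datasets library
    name: Optional[str] = None     # HF dataset config name
    split: Optional[str] = None    # e.g. "train" or "validation"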
Renew the fixture dataset caches automatically whenever cache_hf_dataset_fixtures.py is called
In _run_experiment_from_params, the creation of the serialization directory and the saving of the experiment config should only be done by the main process in the case of distributed training. Currently, distributed training crashes there, as the other processes try to create a non-empty directory after the quickest process has already created it.
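A sketch of the kind of guard needed (the path and helper are illustrative):

import os
import torch.distributed as dist

def is_main_process() -> bool:
    # Rank 0 (or a non-distributed run) should own directory creation.
    return not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0

serialization_dir = "outputs/experiment"
if is_main_process():
    os.makedirs(serialization_dir, exist_ok=True)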
Support a single plugin file, e.g. ~/.trapper_plugins, containing the modules that can be used across multiple projects. allennlp already supports this, so it should be easy to add.
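Mirroring allennlp's ~/.allennlp_plugins, the file would simply list one importable module per line, e.g.:

my_project.custom_data_processors
another_project.custom_metric_handlers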
We missed that the commands were broken for a while because we do not exercise them in our test suite.
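Even a smoke test would have caught this; a sketch:

import subprocess

def test_trapper_cli_help():
    # The entry point should at least be importable and print its help.
    result = subprocess.run(["trapper", "--help"], capture_output=True)
    assert result.returncode == 0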