
allennlp's Introduction


An Apache 2.0 NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks.



⚠️ NOTICE: The AllenNLP library is now in maintenance mode. That means we are no longer adding new features or upgrading dependencies. We will still respond to questions and address bugs as they arise up until December 16th, 2022. If you have any concerns or are interested in maintaining AllenNLP going forward, please open an issue on this repository.

AllenNLP has been a big success, but as the field is advancing quickly it's time to focus on new initiatives. We're working hard to make AI2 Tango the best way to organize research codebases. If you are an active user of AllenNLP, here are some suggested alternatives:

  • If you like the trainer, the configuration language, or are simply looking for a better way to manage your experiments, check out AI2 Tango.
  • If you like AllenNLP's modules and nn packages, check out delmaksym/allennlp-light. It's even compatible with AI2 Tango!
  • If you like the framework aspect of AllenNLP, check out flair. It has multiple state-of-the-art NLP models and allows you to easily use pretrained embeddings such as those from transformers.
  • If you like the AllenNLP metrics package, check out torchmetrics. It has the same API as AllenNLP, so it should be a quick learning curve to make the switch.
  • If you want to vectorize text, try the transformers library.
  • If you want to maintain the AllenNLP Fairness or Interpret components, please get in touch. There is no alternative to them, so we are looking for a dedicated maintainer.
  • If you are concerned about other AllenNLP functionality, please create an issue. Maybe we can find another way to continue supporting your use case.


Getting Started Using the Library

If you're interested in using AllenNLP for model development, we recommend you check out the AllenNLP Guide for a thorough introduction to the library, followed by our more advanced guides on GitHub Discussions.

When you're ready to start your project, we've created a couple of template repositories that you can use as a starting place:

  • If you want to use allennlp train and config files to specify experiments, use this template. We recommend this approach.
  • If you'd prefer to use python code to configure your experiments and run your training loop, use this template. There are a few things that are currently a little harder in this setup (loading a saved model, and using distributed training), but otherwise it's functionally equivalent to the config files setup.

In addition, there are external tutorials, including those on the AI2 AllenNLP blog.

Plugins

AllenNLP supports loading "plugins" dynamically. A plugin is just a Python package that provides custom registered classes or additional allennlp subcommands.

There is an ecosystem of open source plugins, some of which are maintained by the AllenNLP team here at AI2, and some of which are maintained by the broader community.

Plugin              Maintainer         CLI   Description
allennlp-models     AI2                No    A collection of state-of-the-art models
allennlp-semparse   AI2                No    A framework for building semantic parsers
allennlp-server     AI2                Yes   A simple demo server for serving models
allennlp-optuna     Makoto Hiramatsu   Yes   Optuna integration for hyperparameter optimization

AllenNLP will automatically find any official AI2-maintained plugins that you have installed, but for AllenNLP to find personal or third-party plugins you've installed, you also have to create either a local plugins file named .allennlp_plugins in the directory where you run the allennlp command, or a global plugins file at ~/.allennlp/plugins. The file should list the plugin modules that you want to be loaded, one per line.
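
As a minimal sketch of the local plugins file described above, the following Python snippet writes a `.allennlp_plugins` file with one module per line. The helper `write_plugins_file` is illustrative and not part of AllenNLP, and any module names you pass to it are your own.

```python
from pathlib import Path
from typing import Iterable


def write_plugins_file(directory: Path, modules: Iterable[str]) -> Path:
    """Write a local plugins file listing one plugin module per line."""
    plugins_file = directory / ".allennlp_plugins"
    plugins_file.write_text("".join(f"{name}\n" for name in modules))
    return plugins_file
```

For example, `write_plugins_file(Path.cwd(), ["my_project.models"])` would create the file in the directory where you run the allennlp command; `my_project.models` is a hypothetical module name.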

To test that your plugins can be found and imported by AllenNLP, you can run the allennlp test-install command. Each discovered plugin will be logged to the terminal.

For more information about plugins, see the plugins API docs. And for information on how to create a custom subcommand to distribute as a plugin, see the subcommand API docs.

Package Overview

allennlp            An open-source NLP research library, built on PyTorch
allennlp.commands   Functionality for the CLI
allennlp.common     Utility modules that are used across the library
allennlp.data       A data processing module for loading datasets and encoding strings as integers for representation in matrices
allennlp.fairness   A module for bias mitigation and fairness algorithms and metrics
allennlp.modules    A collection of PyTorch modules for use with text
allennlp.nn         Tensor utility functions, such as initializers and activation functions
allennlp.training   Functionality for training models

Installation

AllenNLP requires Python 3.6.1 or later and PyTorch.

We support AllenNLP on Mac and Linux environments. We presently do not support Windows but are open to contributions.

Installing via conda-forge

The simplest way to install AllenNLP is using conda (you can choose a different python version):

conda install -c conda-forge python=3.8 allennlp

To install optional packages, such as checklist, use

conda install -c conda-forge allennlp-checklist

or simply install allennlp-all directly. The plugins mentioned above are similarly installable, e.g.

conda install -c conda-forge allennlp-models allennlp-semparse allennlp-server allennlp-optuna

Installing via pip

It's recommended that you install the PyTorch ecosystem before installing AllenNLP by following the instructions on pytorch.org.

After that, just run pip install allennlp.

⚠️ If you're using Python 3.7 or greater, you should ensure that you don't have the PyPI version of dataclasses installed after running the above command, as this could cause issues on certain platforms. You can quickly check this by running pip freeze | grep dataclasses. If you see something like dataclasses==0.6 in the output, then just run pip uninstall -y dataclasses.
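
The same check can be sketched from Python. This assumes Python 3.8 or later, where `importlib.metadata` is in the standard library; `installed_version` is an illustrative helper name.

```python
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# The standard-library dataclasses module is not a pip distribution, so a
# non-None result here means the PyPI backport is installed.
backport = installed_version("dataclasses")
if backport is not None:
    print(f"PyPI dataclasses {backport} found; run: pip uninstall -y dataclasses")
else:
    print("No PyPI dataclasses backport installed.")
```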

If you need pointers on setting up an appropriate Python environment or would like to install AllenNLP using a different method, see below.

Setting up a virtual environment

Conda can be used to set up a virtual environment with the version of Python required for AllenNLP. If you already have a Python 3 environment you want to use, you can skip to the 'installing via pip' section.

  1. Download and install Conda.

  2. Create a Conda environment with Python 3.8 (3.7 or 3.9 would work as well):

    conda create -n allennlp_env python=3.8
    
  3. Activate the Conda environment. You will need to activate the Conda environment in each terminal in which you want to use AllenNLP:

    conda activate allennlp_env
    

Installing the library and dependencies

Installing the library and dependencies is simple using pip.

pip install allennlp

To install the optional dependencies, such as checklist, run

pip install allennlp[checklist]

Or you can just install all optional dependencies with pip install allennlp[all].

Looking for bleeding-edge features? You can install nightly releases directly from PyPI.

AllenNLP installs a script when you install the python package, so you can run allennlp commands just by typing allennlp into a terminal. For example, you can now test your installation with allennlp test-install.

You may also want to install allennlp-models, which contains the NLP constructs to train and run our officially supported models, many of which are hosted at https://demo.allennlp.org.

pip install allennlp-models

Installing using Docker

Docker provides a container with everything set up to run AllenNLP, whether you use a GPU or just run on a CPU. Docker provides more isolation and consistency, and also makes it easy to distribute your environment to a compute cluster.

AllenNLP provides official Docker images with the library and all of its dependencies installed.

Once you have installed Docker, you should also install the NVIDIA Container Toolkit if you have GPUs available.

Then run the following command to get an environment that will run on GPU:

mkdir -p $HOME/.allennlp/
docker run --rm --gpus all -v $HOME/.allennlp:/root/.allennlp allennlp/allennlp:latest

You can test the Docker environment with

docker run --rm --gpus all -v $HOME/.allennlp:/root/.allennlp allennlp/allennlp:latest test-install 

If you don't have GPUs available, just omit the --gpus all flag.

Building your own Docker image

For various reasons you may need to create your own AllenNLP Docker image, such as if you need a different version of PyTorch. To do so, just run make docker-image from the root of your local clone of AllenNLP.

By default this builds an image with the tag allennlp/allennlp, but you can change this to anything you want by setting the DOCKER_IMAGE_NAME flag when you call make. For example, make docker-image DOCKER_IMAGE_NAME=my-allennlp.

If you want to use a different version of Python or PyTorch, set the flags DOCKER_PYTHON_VERSION and DOCKER_TORCH_VERSION to something like 3.9 and 1.9.0-cuda10.2, respectively. These flags together determine the base image that is used. You can see the list of valid combinations in this GitHub Container Registry: github.com/allenai/docker-images/pkgs/container/pytorch.

After building the image you should be able to see it listed by running docker images allennlp.

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
allennlp/allennlp   latest              b66aee6cb593        5 minutes ago       2.38GB

Installing from source

You can also install AllenNLP by cloning our git repository:

git clone https://github.com/allenai/allennlp.git

Create a Python 3.7 or 3.8 virtual environment, and install AllenNLP in editable mode by running:

pip install -U pip setuptools wheel
pip install --editable .[dev,all]

This will make allennlp available on your system but it will use the sources from the local clone you made of the source repository.

You can test your installation with allennlp test-install. See https://github.com/allenai/allennlp-models for instructions on installing allennlp-models from source.

Running AllenNLP

Once you've installed AllenNLP, you can run the command-line interface with the allennlp command (whether you installed from pip or from source). allennlp has various subcommands such as train, evaluate, and predict. To see the full usage information, run allennlp --help.

You can test your installation by running allennlp test-install.

Issues

Everyone is welcome to file issues with either feature requests, bug reports, or general questions. As a small team with our own internal goals, we may ask for contributions if a prompt fix doesn't fit into our roadmap. To keep things tidy we will often close issues we think are answered, but don't hesitate to follow up if further discussion is needed.

Contributions

The AllenNLP team at AI2 (@allenai) welcomes contributions from the community. If you're a first-time contributor, we recommend you start by reading our CONTRIBUTING.md guide. Then have a look at our issues with the tag Good First Issue.

If you would like to contribute a larger feature, we recommend first creating an issue with a proposed design for discussion. This will prevent you from spending significant time on an implementation which has a technical limitation someone could have pointed out early on. Small contributions can be made directly in a pull request.

Pull requests (PRs) must have one approving review and no requested changes before they are merged. As AllenNLP is primarily driven by AI2 we reserve the right to reject or revert contributions that we don't think are good additions.

Citing

If you use AllenNLP in your research, please cite AllenNLP: A Deep Semantic Natural Language Processing Platform.

@inproceedings{Gardner2017AllenNLP,
  title={AllenNLP: A Deep Semantic Natural Language Processing Platform},
  author={Matt Gardner and Joel Grus and Mark Neumann and Oyvind Tafjord
    and Pradeep Dasigi and Nelson F. Liu and Matthew Peters and
    Michael Schmitz and Luke S. Zettlemoyer},
  year={2017},
  Eprint = {arXiv:1803.07640},
}

Team

AllenNLP is an open-source project backed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering. To learn more about who specifically contributed to this codebase, see our contributors page.

allennlp's People

Contributors

akshitab, arjunsubramonian, bratao, brendan-ai2, bryant1410, danieldeutsch, deneutoy, dependabot-preview[bot], dependabot[bot], dirkgr, eladsegal, epwalsh, eric-wallace, harshtrivedi, joelgrus, johngiorgi, kl2806, maksymdel, matt-gardner, matt-peters, nafitzgerald, nelson-liu, nicola-decao, oyvindtafjord, pdasigi, sai-prasanna, scarecrow1123, schmmd, wrran, zhaofengwu


allennlp's Issues

Add a `Model.load_from_file` method (or similar)

Currently, if you want to load a model, you first need to load the vocab, then construct the model from_params, then load the state dict, etc. We should just have a method that does this, given the base serialization directory.

Parse the log file for _actual_ parameters used to save in the model archive

Instead of copying the input parameter file when archiving a model, we should parse the log file to get the actual parameters that were used (including defaults). This will make model archiving more robust to changes in default parameters (as recently happened with the tokenizer). Seems like quite a bit of work to be sure the parameters are parsed out correctly, though, and it's not super high priority.

Error when installing requirements in a Conda environment.

This issue was brought up to me by Nikket.

Hi Michael,

I followed the steps in the readme, and got stuck in step 4.

  1. Download and install Conda.

  2. Create a Conda environment with Python 3.

conda create -n allennlp python=3.5

  3. Now activate the Conda environment.

source activate allennlp

  4. Install the required dependencies.

INSTALL_TEST_REQUIREMENTS="true" ./scripts/install_requirements.sh

  5. Visit http://pytorch.org/ and install the relevant pytorch package.

  6. Set the PYTHONHASHSEED for repeatable experiments.

export PYTHONHASHSEED=2157

I get stuck on step 4, whose remedy seems to be https://stackoverflow.com/questions/1449396/how-to-install-setuptools, but I just wanted to be sure that I am not doing anything wrong.

(allennlp) nikett:allennlp nikett$ INSTALL_TEST_REQUIREMENTS="true" ./scripts/install_requirements.sh

 Collecting git+git://github.com/mkorpela/overrides.git@40f8bd1fae7a3364a1 (from -r requirements.txt (line 23))
  Cloning git://github.com/mkorpela/overrides.git (to 40f8bd1fae7a3364a1) to /private/var/folders/hj/kby3swx56l9bf_1v93z87b840000gp/T/pip-rc43o34j-build
  Could not find a tag or branch '40f8bd1fae7a3364a1', assuming commit.
Could not import setuptools which is required to install from a source distribution.


Please install setuptools.
/Users/nikett/anaconda/envs/allennlp/bin/python: Error while finding module specification for 'nltk.downloader' (ImportError: No module named 'nltk')
/Users/nikett/anaconda/envs/allennlp/bin/python: Error while finding module specification for 'spacy.en.download' (ImportError: No module named 'spacy')


Collecting git+git://github.com/PyCQA/pylint.git@2561f539d60a3563d6507e7a22e226fb10b58210 (from -r requirements_test.txt (line 6))
  Cloning git://github.com/PyCQA/pylint.git (to 2561f539d60a3563d6507e7a22e226fb10b58210) to /private/var/folders/hj/kby3swx56l9bf_1v93z87b840000gp/T/pip-gd8xgqm9-build
  Could not find a tag or branch '2561f539d60a3563d6507e7a22e226fb10b58210', assuming commit.
Could not import setuptools which is required to install from a source distribution.
Please install setuptools.

Have Tokenizers return a Token object

This will let us get rid of the nasty offset return value, because it will just be a field on the Token, and it will let us include POS tags, for POS tag embeddings.

It's probably easiest to just return spacy's token representation directly, rather than trying to roll our own, and have other word splitters mimic spacy's API. Or we could just have them crash; not sure we really need the other word splitters at this point - we could just simplify things a lot by putting spacy directly into WordTokenizer. Anybody have any thoughts on that?

Simplify / centralize `TokenIndexer.from_params()`

We have blocks like this in several places:

token_indexers = {}
token_indexer_params = params.pop('token_indexers', Params({}))
for name, indexer_params in token_indexer_params.items():
    token_indexers[name] = TokenIndexer.from_params(indexer_params)
# The default parameters are contained within the class,
# so if no parameters are given we must pass None.
if token_indexers == {}:
    token_indexers = None

These should all be put in one spot, probably something like TokenIndexer.dict_from_params (not thrilled with that name, but something similar).
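
One way the centralized method could look is sketched below. This is only an illustration: `TokenIndexer` here is a toy stand-in, plain dicts replace `Params`, and the `namespace` argument is an assumption made so the example is self-contained.

```python
from typing import Any, Dict, Optional


class TokenIndexer:
    """Toy stand-in for the library's TokenIndexer (illustration only)."""

    def __init__(self, namespace: str = "tokens") -> None:
        self.namespace = namespace

    @classmethod
    def from_params(cls, params: Dict[str, Any]) -> "TokenIndexer":
        return cls(**params)

    @classmethod
    def dict_from_params(
        cls, params: Dict[str, Dict[str, Any]]
    ) -> Optional[Dict[str, "TokenIndexer"]]:
        """Build all named indexers in one place.

        Returns None when no indexers are configured, so callers can fall
        back to their own class-level defaults.
        """
        token_indexers = {
            name: cls.from_params(indexer_params)
            for name, indexer_params in params.items()
        }
        return token_indexers or None


# A reader would then pop the block once and delegate to the class method.
config = {"token_indexers": {"tokens": {"namespace": "words"}}}
indexers = TokenIndexer.dict_from_params(config.pop("token_indexers", {}))
```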

Figure out non-determinism due to PYTHONHASHSEED

Reported by @schmmd. I'm not really sure what could be causing this, because I didn't think there was any randomness in model.forward() after model.eval() has been called. But here are steps to reproduce:

$ git checkout schmmd/weird-bug
$ set -x PYTHONHASHSEED 2157
$ allennlp/run serve

1.  Navigate to http://localhost:8000.
2.  Click the MC Model tab.
3.  Submit the last example (The Millennium Falcon…)

> “spaceship”

$ git checkout schmmd/weird-bug
$ set -x PYTHONHASHSEED 4563123
$ allennlp/run serve

1.  Navigate to http://localhost:8000.
2.  Click the MC Model tab.
3.  Submit the last example (The Millennium Falcon…)

> “variety of Star Wars expanded …”
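
This is not a diagnosis of the bug above, but one mechanism by which PYTHONHASHSEED can change behavior is the iteration order of sets of strings, for instance if a vocabulary is ever built by iterating over a set. The sketch below runs a fresh interpreter per seed and prints the order it sees; the word list is made up for illustration, and the two orders may or may not differ for any particular pair of seeds.

```python
import os
import subprocess
import sys

# Child program: print the iteration order of a small set of strings.
CHILD = "print(list({'spaceship', 'falcon', 'star', 'wars'}))"


def set_order_with_seed(seed: int) -> str:
    """Run a fresh interpreter with PYTHONHASHSEED set and capture its output."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


for seed in (2157, 4563123):
    print(seed, set_order_with_seed(seed))
```

Running with the same seed twice should always give the same order, which is why fixing the seed makes experiments repeatable.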

Allow for different handling of OOV words

We currently give all OOV tokens the same embedding at both training and test time. It'd be nice to be able to have some different options here:

  • At test time, see if the OOV token is in glove, and use that embedding instead
  • At both training and test, use random vectors for each unique OOV token, as suggested here

There are probably some other options I'm forgetting right now. These would be pretty tricky to implement in our current data pipeline, though.
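
As a rough sketch of the second option (a random vector per unique OOV token), independent of the current data pipeline: the class name `OovAwareEmbeddings` and its interface are hypothetical, and plain Python lists stand in for real embedding tensors.

```python
import random
from typing import Dict, List


class OovAwareEmbeddings:
    """Look up known tokens; give each unique OOV token its own fixed random vector."""

    def __init__(self, known: Dict[str, List[float]], dim: int, seed: int = 0) -> None:
        self.known = known
        self.dim = dim
        self._rng = random.Random(seed)
        self._oov_cache: Dict[str, List[float]] = {}

    def lookup(self, token: str) -> List[float]:
        if token in self.known:
            return self.known[token]
        if token not in self._oov_cache:
            # Draw once per unique token, then reuse it on later lookups.
            self._oov_cache[token] = [self._rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        return self._oov_cache[token]


embeddings = OovAwareEmbeddings({"the": [0.1, 0.2]}, dim=2)
```

Repeated calls such as `embeddings.lookup("zyzzyva")` return the same vector each time, while known tokens keep their trained embedding.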

Notebook Checklist

  • Vocabulary

  • Data API - Fields, Instances, Dataset.

  • Iterators and Training a model.

  • Tokens -> Tokenizers -> TokenIndexers Abstraction.

  • Writing a DatasetReader example.

  • Writing a Model, and how it differs from torch.nn.Module.

  • Why have Params? How to build things which run from JSON.

  • TokenEmbedders -> TextFields -> representation Abstraction.

  • Seq2SeqEncoders and Seq2VecEncoders and how to use them.

  • How to make your model Servable and deploy a server via Docker.

torch.Tensor type annotations

We should work out whether these are actually the correct type annotations for many of our functions. For many functions, we are actually only ever passing torch.autograd.Variables, including some which actually require this, e.g.:

# raises
torch.nn.functional.softmax(torch.rand([3,4]))

# fine
torch.nn.functional.softmax(torch.autograd.Variable(torch.rand([3,4])))

I think we can still keep the tensor types by doing something like:
torch.autograd.Variable[torch.FloatTensor] etc.

Figure out how to get spacy to tokenize wiki text correctly

SQuAD has plenty of paragraphs that have wiki notes, formatting like "This was a protest.[note 4]". Spacy for some reason does not tokenize these strings correctly, giving "protest.[note" as a single token. We should be able to improve performance on SQuAD at least a little bit by fixing these issues, as it affects a fair number of our training examples, and some of the dev set.

A test that currently fails, but should pass (goes in word_splitter_test.py):

 def test_tokenize_handles_wiki_notes(self):
     passage = "McWhorter writes of Lee, \"for a white person from the South to write a " +\
             "book like this in the late 1950s is really unusual\u2014by its very existence " +\
             "an act of protest.\"[note 4] Author James McBride calls Lee brilliant but " +\
             "stops short of calling her brave: \"I think by calling Harper Lee brave you " +\
             "kind of absolve yourself of your own racism.\""
     tokens, offsets = self.word_splitter.split_words(passage)
     assert "protest" in tokens
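
One possible workaround, sketched here as a preprocessing step rather than a change to spaCy itself, is to put spaces around footnote markers before the text reaches the word splitter. The function name and the marker pattern are assumptions; real SQuAD passages may contain other marker styles.

```python
import re

# Matches bracketed footnote markers such as "[note 4]" or "[12]".
WIKI_NOTE = re.compile(r"\[(?:note )?\d+\]")


def separate_wiki_notes(text: str) -> str:
    """Put spaces around footnote markers so they no longer attach to neighbouring words."""
    spaced = WIKI_NOTE.sub(lambda match: f" {match.group(0)} ", text)
    # Collapse any doubled spaces introduced by the substitution.
    return re.sub(r" {2,}", " ", spaced).strip()


print(separate_wiki_notes("This was a protest.[note 4] Author James McBride"))
# This was a protest. [note 4] Author James McBride
```

Note that this changes character offsets, so any answer spans would need to be adjusted accordingly.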

Web demo does not work on Firefox

ReferenceError: event is not defined[Learn More] demo.allennlp.org:686:13
    onClick http://demo.allennlp.org/:686:13
    onClick self-hosted:987:17
    [55]</ReactErrorUtils.invokeGuardedCallback http://demo.allennlp.org/lib/react-dom.js:9036:7
    executeDispatch http://demo.allennlp.org/lib/react-dom.js:2996:5
    executeDispatchesInOrder http://demo.allennlp.org/lib/react-dom.js:3019:5
    executeDispatchesAndRelease http://demo.allennlp.org/lib/react-dom.js:2427:5
    executeDispatchesAndReleaseTopLevel http://demo.allennlp.org/lib/react-dom.js:2438:10
    forEach self-hosted:267:13
    forEachAccumulated http://demo.allennlp.org/lib/react-dom.js:15456:5
    processEventQueue http://demo.allennlp.org/lib/react-dom.js:2638:7
    runEventQueueInBatch http://demo.allennlp.org/lib/react-dom.js:9060:3
    handleTopLevel http://demo.allennlp.org/lib/react-dom.js:9070:5
    handleTopLevelImpl http://demo.allennlp.org/lib/react-dom.js:9147:5
    perform http://demo.allennlp.org/lib/react-dom.js:14760:13
    batchedUpdates http://demo.allennlp.org/lib/react-dom.js:8825:14
    batchedUpdates http://demo.allennlp.org/lib/react-dom.js:12895:10
    dispatchEvent http://demo.allennlp.org/lib/react-dom.js:9222:7
    dispatchEvent self-hosted:987:17

pytest -v Issues

I installed python 3.6 into an Anaconda environment and installed all the requirements.txt and requirements_test.txt packages, pytorch, etc.

When I run pytest -v, the tests are failing.
It is accessing the python 2.7 packages (not the python 3.6 ones):

(py36) David-Laxers-MacBook-Pro:allennlp davidlaxer$ pytest -v
============================= test session starts ==============================
platform darwin -- Python 2.7.13, pytest-3.1.1, py-1.4.33, pluggy-0.4.0 -- /Users/davidlaxer/anaconda/bin/python
cachedir: .cache
rootdir: /Users/davidlaxer/allennlp, inifile: pytest.ini
collected 0 items / 68 errors 

==================================== ERRORS ====================================
___________________ ERROR collecting tests/notebooks_test.py ___________________
../anaconda/lib/python2.7/site-packages/_pytest/python.py:408: in _importtestmodule
    mod = self.fspath.pyimport(ensuresyspath=importmode)
../anaconda/lib/python2.7/site-packages/py/_path/local.py:662: in pyimport
    __import__(modname)
E     File "/Users/davidlaxer/allennlp/tests/notebooks_test.py", line 17
E       def execute_notebook(notebook_path: str):
E                                         ^
E   SyntaxError: invalid syntax
[...]
ImportError while importing test module '/Users/davidlaxer/allennlp/tests/training/metrics/span_based_f1_measure_test.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/training/metrics/span_based_f1_measure_test.py:2: in <module>
    import torch
E   ImportError: No module named torch
!!!!!!!!!!!!!!!!!!! Interrupted: 68 errors during collection !!!!!!!!!!!!!!!!!!!
=========================== 68 error in 5.49 seconds ===========================
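
The log above shows pytest running under /Users/davidlaxer/anaconda/bin/python (Python 2.7.13) even though the py36 environment is active, which suggests the pytest found on the PATH belongs to the base install. A small check like the following, run with the environment's python, shows which interpreter is actually in use; `interpreter_info` is just an illustrative helper.

```python
import sys


def interpreter_info():
    """Return the path and (major, minor) version of the running interpreter."""
    return sys.executable, tuple(sys.version_info[:2])


path, version = interpreter_info()
print("Running {} (Python {}.{})".format(path, version[0], version[1]))
if version < (3, 6):
    print("This interpreter is older than Python 3.6; check which pytest is on your PATH.")
```

Running the tests as `python -m pytest -v` from inside the environment is a common way to make sure pytest uses that environment's interpreter.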

Add ability to append or prepend tokens to the `Tokenizers`

In three places now we've implemented appending and/or prepending tokens to what goes into a TextField (a null token for the SNLI reader, a stop token for SQuAD, and sentence boundary tokens for the language model reader). This should just be a basic functionality of the tokenizer.
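
A minimal sketch of what this could look like, assuming a wrapper around any tokenizing function; the class name `PaddedTokenizer` and the `start_tokens` / `end_tokens` parameter names are hypothetical.

```python
from typing import Callable, List, Optional


class PaddedTokenizer:
    """Wrap any tokenizer function and add fixed tokens before and after its output."""

    def __init__(
        self,
        tokenize: Callable[[str], List[str]],
        start_tokens: Optional[List[str]] = None,
        end_tokens: Optional[List[str]] = None,
    ) -> None:
        self._tokenize = tokenize
        self._start_tokens = start_tokens or []
        self._end_tokens = end_tokens or []

    def tokenize(self, text: str) -> List[str]:
        return self._start_tokens + self._tokenize(text) + self._end_tokens


# Whitespace splitting stands in for a real word splitter.
tokenizer = PaddedTokenizer(str.split, start_tokens=["<s>"], end_tokens=["</s>"])
print(tokenizer.tokenize("the cat sat"))  # ['<s>', 'the', 'cat', 'sat', '</s>']
```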

Figure out cause of slow imports

And fix it, if possible.

If you just import something from the library, like from allennlp.data import Vocabulary, there's a several second delay. Not sure what the cause is, but it seems like some __init__.py somewhere is doing more than it should, or something is getting run on import when it shouldn't be.
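
To narrow this down, a small timing helper can be used on individual modules. This is a sketch; `time_import` is an illustrative name, and it must be run in a fresh interpreter because already-imported modules are cached.

```python
import importlib
import sys
import time


def time_import(module_name: str) -> float:
    """Return the seconds taken to import a module that is not yet loaded."""
    if module_name in sys.modules:
        raise ValueError(f"{module_name} is already imported; use a fresh interpreter")
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start
```

For a per-module breakdown, Python 3.7 and later also provide `python -X importtime -c "from allennlp.data import Vocabulary"`, which prints the cumulative import time of every module loaded along the way.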

Allow config file overrides in `evaluate`

And other places where we load models. Actually, because this is going through to load_archive, that method itself has to allow overrides, and any entry point that reaches load_archive needs a way to pass in overrides.
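
The merging step itself could be sketched as below, assuming overrides arrive as a JSON string and the config is a nested dict. The function name `with_overrides` and the example keys are illustrative, not the library's actual interface.

```python
import json
from typing import Any, Dict


def with_overrides(config: Dict[str, Any], overrides_json: str) -> Dict[str, Any]:
    """Return a new config dict with a JSON string of overrides merged in recursively."""
    def merge(base: Dict[str, Any], updates: Dict[str, Any]) -> Dict[str, Any]:
        merged = dict(base)
        for key, value in updates.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge(merged[key], value)
            else:
                merged[key] = value
        return merged

    return merge(config, json.loads(overrides_json) if overrides_json else {})


config = {"model": {"type": "bidaf", "dropout": 0.2}, "trainer": {"cuda_device": 0}}
print(with_overrides(config, '{"trainer": {"cuda_device": -1}}'))
```

Any entry point that reaches load_archive would then only need to accept the overrides string and pass it through.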

Decompose `Trainer.train()` into smaller methods

It's grown to be a ~150 line method that's pretty hard to reason about. It's a bit tricky to decompose, though, because of all of the dependencies between different parts of the method, but we should be able to pull out a bunch of it into separate methods.

Move the call to `model.cuda()` to before optimizer creation

Because of this issue. The optimizer might have state that's initialized from the model parameters, and needs to be on the right device.

This means we should either:

  1. Have cuda_device be a top-level key in the experiment config, so we can move the model over in commands.train() before constructing the optimizer.
  2. Move the optimizer creation and the call to model.cuda() into Trainer.from_params().

I could go either way. Calling model.cuda() inside of from_params() in the second option is a little bit more logic than we like to have in those methods, but not much. The optimizer conceptually seems like it's part of the trainer, so having the optimizer params inside of the trainer params makes sense.

Change `Instance.metadata` into a `MetadataField`

This would remove the need for all of the special casing and reflection that I did to pass the metadata through correctly. Basically we take all of the metadata-related code from this PR and replace it with a MetadataField. There will still be a little bit of special casing, unless we also move the array creation code into a class method on the Field objects (probably on Field itself, overridden by MetadataField). In particular, I mean this code:

if field_name == 'metadata':
    continue
if isinstance(field_array_list[0], dict):
    # This is creating a dict of {token_indexer_key: batch_array} for each
    # token indexer used to index this field. This is mostly utilised by TextFields.
    token_indexer_key_to_batch_dict = defaultdict(list)  # type: Dict[str, List[numpy.ndarray]]
    for namespace_dict in field_array_list:
        for indexer_name, array in namespace_dict.items():
            token_indexer_key_to_batch_dict[indexer_name].append(array)
    field_arrays[field_name] = {indexer_name: numpy.asarray(array_list) for  # type: ignore
                                indexer_name, array_list in token_indexer_key_to_batch_dict.items()}
else:
    field_arrays[field_name] = numpy.asarray(field_array_list)

Add a `text_to_instance` method on `DatasetReader`

The Predictors currently have to pull out the tokenizer and the token indexers from the DatasetReader and recreate what the DatasetReader does internally for every instance. This means that if we want to change (or add parameters to) what the DatasetReader does, we have to change the Predictors to match. Instead, we should just add a text_to_instance method on the DatasetReader itself, so that the Predictor can just keep around the DatasetReader, and pass off all processing to it.
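
The shape of this change can be sketched with toy classes. Everything below is a stand-in for illustration: instances are plain dicts, the tokenizer is a bare function, and the "prediction" is a placeholder for a real model call.

```python
from typing import Callable, Dict, List


class DatasetReader:
    """Toy reader that owns all text processing (illustration only)."""

    def __init__(self, tokenize: Callable[[str], List[str]]) -> None:
        self._tokenize = tokenize

    def text_to_instance(self, text: str) -> Dict[str, List[str]]:
        # Single place where raw text becomes model input.
        return {"tokens": self._tokenize(text)}

    def read(self, lines: List[str]) -> List[Dict[str, List[str]]]:
        return [self.text_to_instance(line) for line in lines]


class Predictor:
    """Keeps the reader around and delegates processing to it."""

    def __init__(self, reader: DatasetReader) -> None:
        self._reader = reader

    def predict(self, text: str) -> int:
        instance = self._reader.text_to_instance(text)
        return len(instance["tokens"])  # placeholder for a real model call


predictor = Predictor(DatasetReader(str.split))
```

With this layout, a change to how the reader tokenizes or indexes text is picked up by the predictor without touching the predictor's code.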

Have some parameter versioning, or something

It'd be nice to have some way of managing config file changes, so that, e.g., if we add a new required parameter, or change the name of a flag, config files don't break mysteriously for an end user. Even better if we can make things backwards compatible when they change. Not sure at all how to make this happen in a reasonable way, though.
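
One possible approach, sketched under the assumption that config files carry a `version` key and that each breaking change ships with a small migration function. The `use_gpu` to `cuda_device` rename below is a made-up example, not an actual change in the library.

```python
from typing import Any, Callable, Dict

Config = Dict[str, Any]


def _v1_to_v2(config: Config) -> Config:
    # Hypothetical change: the flag "use_gpu" was renamed to "cuda_device".
    migrated = dict(config)
    migrated["cuda_device"] = 0 if migrated.pop("use_gpu", False) else -1
    return migrated


# Maps a config version to the function that upgrades it by one version.
MIGRATIONS: Dict[int, Callable[[Config], Config]] = {1: _v1_to_v2}
CURRENT_VERSION = 2


def upgrade(config: Config) -> Config:
    """Apply migrations in order until the config reaches CURRENT_VERSION."""
    version = config.get("version", 1)  # files without a version are treated as v1
    while version < CURRENT_VERSION:
        config = MIGRATIONS[version](config)
        version += 1
    return dict(config, version=version)


print(upgrade({"use_gpu": True}))  # {'cuda_device': 0, 'version': 2}
```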

ImportError: dlopen: cannot load any more object with static TLS

I installed AllenNLP from source, and when I followed the steps on the Getting Started page and ran the command python -m allennlp.run, the following error occurred:

Traceback (most recent call last):
  File "/data/bo718.wang/anaconda3/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/data/bo718.wang/anaconda3/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/run.py", line 10, in <module>
    from allennlp.commands import main  # pylint: disable=wrong-import-position
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/commands/__init__.py", line 3, in <module>
    from allennlp.commands.serve import add_subparser as add_serve_subparser
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/commands/serve.py", line 27, in <module>
    from allennlp.service import server_sanic
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/service/server_sanic.py", line 19, in <module>
    from allennlp.models.archival import load_archive
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/models/__init__.py", line 6, in <module>
    from allennlp.models.archival import archive_model, load_archive
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/models/archival.py", line 10, in <module>
    from allennlp.models.model import Model, _DEFAULT_WEIGHTS
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/models/model.py", line 12, in <module>
    from allennlp.data import Vocabulary
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/__init__.py", line 1, in <module>
    from allennlp.data.dataset import Dataset
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/dataset.py", line 13, in <module>
    from allennlp.data.instance import Instance
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/instance.py", line 3, in <module>
    from allennlp.data.fields.field import DataArray, Field
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/fields/__init__.py", line 12, in <module>
    from allennlp.data.fields.text_field import TextField
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/fields/text_field.py", line 11, in <module>
    from allennlp.data.token_indexers.token_indexer import TokenIndexer, TokenType
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/token_indexers/__init__.py", line 6, in <module>
    from allennlp.data.token_indexers.token_characters_indexer import TokenCharactersIndexer
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/token_indexers/token_characters_indexer.py", line 10, in <module>
    from allennlp.data.tokenizers.character_tokenizer import CharacterTokenizer
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/__init__.py", line 7, in <module>
    from allennlp.data.tokenizers.word_tokenizer import WordTokenizer
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/word_tokenizer.py", line 13, in <module>
    class WordTokenizer(Tokenizer):
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/word_tokenizer.py", line 39, in WordTokenizer
    word_splitter: WordSplitter = SpacyWordSplitter(),
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/word_splitter.py", line 144, in __init__
    import spacy
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/__init__.py", line 5, in <module>
    from .deprecated import resolve_model_name
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/deprecated.py", line 8, in <module>
    from .cli import download
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/cli/__init__.py", line 5, in <module>
    from .train import train, train_config
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/cli/train.py", line 8, in <module>
    from ..scorer import Scorer
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/scorer.py", line 4, in <module>
    from .gold import tags_to_entities
ImportError: dlopen: cannot load any more object with static TLS

I googled the error message `dlopen: cannot load any more object with static TLS` and found that this issue seems to be related to importing the spacy package, and that changing the import order might help. But when I looked into the source code, I found nothing I could fix. So far I haven't been able to run any demo on my machine. Has anybody else encountered the same problem?
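
One way to try the import-order idea without editing the library is to import the suspect packages first, from a small launcher script, before anything else is loaded. The sketch below is only an illustration: `import_in_order` is a hypothetical helper, not part of AllenNLP or spacy, and whether reordering imports actually avoids the static TLS error depends on the machine and the installed binaries.

```python
import importlib


def import_in_order(module_names):
    """Import modules in the given order and return the ones that loaded.

    Hypothetical helper: the idea is to load a library that needs static
    TLS early, before other native extensions use up the available slots.
    """
    loaded = {}
    for name in module_names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError as err:
            # Report the failure but keep going, so one bad import
            # does not hide the others.
            print("could not import {}: {}".format(name, err))
    return loaded


# Demonstrated with standard-library modules so the sketch runs anywhere.
modules = import_in_order(["json", "math"])
```

In a launcher script, the call might instead be `import_in_order(["spacy", "torch"])`, placed before `allennlp` is imported. The module list and its order are assumptions to experiment with, not a confirmed fix.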

Figure out the right way to instantiate Vocabulary `from_params`

We need this so we can configure the vocabulary properly from an experiment config. We need a method that takes a Dataset and a Params and instantiates the object. I'm not sure what the sanest way to do this is; should we only allow certain ways of constructing the vocab?
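
As a starting point for discussion, here is a minimal sketch of one possible shape for such a method. Everything in it is an assumption made for illustration: the dataset is simplified to a list of token lists, the params are a plain dict, and the `min_count` and `max_vocab_size` keys and the padding/unknown tokens are hypothetical, not the library's actual API.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Optional


class Vocabulary:
    """Toy vocabulary; a stand-in for the real class, for illustration only."""

    PADDING = "<pad>"   # hypothetical special tokens
    UNKNOWN = "<unk>"

    def __init__(self, counter: Counter, min_count: int = 1,
                 max_vocab_size: Optional[int] = None) -> None:
        self.token_to_index = {self.PADDING: 0, self.UNKNOWN: 1}
        # Keep tokens that are frequent enough, most frequent first.
        kept = [tok for tok, n in counter.most_common() if n >= min_count]
        for token in kept[:max_vocab_size]:
            self.token_to_index[token] = len(self.token_to_index)

    @classmethod
    def from_params(cls, params: Dict[str, Any],
                    dataset: Iterable[List[str]]) -> "Vocabulary":
        params = dict(params)  # do not mutate the caller's config
        min_count = params.pop("min_count", 1)
        max_vocab_size = params.pop("max_vocab_size", None)
        if params:
            # Only a fixed set of keys is allowed, so config typos fail loudly.
            raise ValueError("unknown vocabulary params: {}".format(sorted(params)))
        counter = Counter(token for instance in dataset for token in instance)
        return cls(counter, min_count, max_vocab_size)


dataset = [["a", "b", "a"], ["a", "c"]]
vocab = Vocabulary.from_params({"min_count": 2}, dataset)
```

Rejecting unknown keys is one answer to the "only allow certain ways of constructing the vocab" question: the config can only reach constructor arguments that `from_params` explicitly pops.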

Upgrade to python 3.6

This will let us use variable type annotations, and remove all of the unused-imports in the code.

I don't think there are any big issues with just changing the python version in our images and build settings, so I'm labeling this as easy, but it's possible there is some library we're using that's not compatible and it will end up being hard.
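
For reference, a small sketch of what variable type annotations look like. The function is hypothetical and only illustrates the syntax; the comment shows the Python 3.5 spelling it replaces, where the `typing` names appear only inside a comment and can therefore look unused to a linter.

```python
from typing import Dict, List


def count_tokens(tokens: List[str]) -> Dict[str, int]:
    # Python 3.5 would need a type comment here:
    #     counts = {}  # type: Dict[str, int]
    # Python 3.6 lets the annotation sit on the variable itself.
    counts: Dict[str, int] = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


result = count_tokens(["a", "b", "a"])
```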

Make explicit wrappers for LearningRateSchedulers

Like our Init wrapper. This will allow us to remove the type checks that were necessary in Trainer by handling the API differences between pytorch's LRSchedulers with separate wrappers (or just a single wrapper that does some inspection of the wrapped object...).
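
A minimal sketch of the single-wrapper variant follows. It assumes the API difference in question is that most PyTorch schedulers expose `step(epoch=None)` while `ReduceLROnPlateau` exposes `step(metrics, epoch=None)`. The two scheduler classes here are pure-Python stand-ins so the example runs without PyTorch, and `LearningRateSchedulerWrapper` is a hypothetical name.

```python
import inspect


class EpochScheduler:
    """Stand-in for a scheduler whose step() takes only an epoch."""

    def __init__(self):
        self.calls = []

    def step(self, epoch=None):
        self.calls.append(("epoch", epoch))


class PlateauScheduler:
    """Stand-in for a scheduler whose step() also needs a validation metric."""

    def __init__(self):
        self.calls = []

    def step(self, metrics, epoch=None):
        self.calls.append(("metrics", metrics, epoch))


class LearningRateSchedulerWrapper:
    """Gives both kinds of scheduler the same step(metric, epoch) interface."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        # Inspect the wrapped object once, instead of type-checking in Trainer.
        parameters = inspect.signature(scheduler.step).parameters
        self._needs_metric = "metrics" in parameters

    def step(self, metric=None, epoch=None):
        if self._needs_metric:
            if metric is None:
                raise ValueError("this scheduler requires a validation metric")
            self.scheduler.step(metric, epoch)
        else:
            self.scheduler.step(epoch)


plain = LearningRateSchedulerWrapper(EpochScheduler())
plateau = LearningRateSchedulerWrapper(PlateauScheduler())
plain.step(metric=0.5, epoch=3)    # metric is ignored
plateau.step(metric=0.5, epoch=3)  # metric is forwarded
```

With this shape, the training loop can call `step(metric, epoch)` unconditionally. Separate wrappers per scheduler type would avoid the signature inspection at the cost of more classes.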

Move `replace_none` into `Params`

We only use it when passing something into the constructor of Params - we should just put it inside the constructor and not make the caller have to call this method every time.
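
A minimal sketch of the proposed change, assuming `replace_none` turns the string `"None"` into a real `None` throughout a nested config. Both the function body and the `Params` class below are simplified stand-ins rather than the library's code.

```python
from typing import Any, Dict


def replace_none(value: Any) -> Any:
    """Recursively replace the string "None" with the None object."""
    if isinstance(value, dict):
        return {key: replace_none(item) for key, item in value.items()}
    if isinstance(value, list):
        return [replace_none(item) for item in value]
    if value == "None":
        return None
    return value


class Params:
    """Simplified stand-in: the constructor normalizes its own input."""

    def __init__(self, params: Dict[str, Any]) -> None:
        # Callers no longer need to call replace_none themselves.
        self.params = replace_none(params)


config = Params({"dropout": "None", "encoder": {"bias": "None", "layers": 2}})
```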
