bilm-tf's Introduction

bilm-tf

Tensorflow implementation of the pretrained biLM used to compute ELMo representations from "Deep contextualized word representations".

This repository supports both training biLMs and using pre-trained models for prediction.

We also have a pytorch implementation available in AllenNLP.

You may also find it easier to use the version provided in TensorFlow Hub if you just want to make predictions.

Citation:

@inproceedings{Peters:2018,
  author={Peters, Matthew E. and  Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke},
  title={Deep contextualized word representations},
  booktitle={Proc. of NAACL},
  year={2018}
}

Installing

Install Python 3.5 or later, TensorFlow 1.2, and h5py:

pip install tensorflow-gpu==1.2 h5py
python setup.py install

Ensure the tests pass in your environment by running:

python -m unittest discover tests/

Installing with Docker

To run the image, you must use nvidia-docker, because this repository requires GPUs.

sudo nvidia-docker run -t allennlp/bilm-tf:training-gpu

Using pre-trained models

We have several different English language pre-trained biLMs available for use. Each model is specified with two separate files, a JSON formatted "options" file with hyperparameters and a hdf5 formatted file with the model weights. Links to the pre-trained models are available here.

There are three ways to integrate ELMo representations into a downstream task, depending on your use case.

  1. Compute representations on the fly from raw text using character input. This is the most general method and will handle any input text. It is also the most computationally expensive.
  2. Precompute and cache the context independent token representations, then compute context dependent representations using the biLSTMs for input data. This method is less computationally expensive than #1, but is only applicable with a fixed, prescribed vocabulary.
  3. Precompute the representations for your entire dataset and save to a file.

We have used all of these methods in the past for various use cases. #1 is necessary for evaluating at test time on unseen data (e.g. public SQuAD leaderboard). #2 is a good compromise for large datasets where the size of the file in #3 is unfeasible (SNLI, SQuAD). #3 is a good choice for smaller datasets or in cases where you'd like to use ELMo in other frameworks.

In all cases, the process roughly follows the same steps. First, create a Batcher (or TokenBatcher for #2) to translate tokenized strings to numpy arrays of character (or token) ids. Then, load the pretrained ELMo model (class BidirectionalLanguageModel). Finally, for steps #1 and #2 use weight_layers to compute the final ELMo representations. For #3, use BidirectionalLanguageModel to write all the intermediate layers to a file.
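
For example, here is a minimal sketch of method #1 (character inputs) in the spirit of usage_character.py; the file paths and sentences are placeholders:

import tensorflow as tf
from bilm import Batcher, BidirectionalLanguageModel, weight_layers

# Placeholders for the pre-trained model files and a vocabulary file.
vocab_file = '/path/to/vocab.txt'
options_file = '/path/to/options.json'
weight_file = '/path/to/weights.hdf5'

# Map tokenized strings to character ids (50 = max characters per token).
batcher = Batcher(vocab_file, 50)
character_ids = tf.placeholder('int32', shape=(None, None, 50))

# Build the biLM graph and collapse its layers into a single ELMo representation.
bilm = BidirectionalLanguageModel(options_file, weight_file)
context_embeddings_op = bilm(character_ids)
elmo_context = weight_layers('elmo', context_embeddings_op, l2_coef=0.0)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sentences = [['Pretrained', 'biLMs', 'compute', 'representations', '.'],
                 ['They', 'are', 'contextual', '.']]
    ids = batcher.batch_sentences(sentences)
    elmo_vecs = sess.run(elmo_context['weighted_op'],
                         feed_dict={character_ids: ids})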

Shape conventions

Each tokenized sentence is a list of str, with a batch of sentences a list of tokenized sentences (List[List[str]]).

The Batcher packs these into a shape (n_sentences, max_sentence_length + 2, 50) numpy array of character ids, padding on the right with 0 ids for sentences shorter than the maximum length. The first and last tokens for each sentence are special begin and end of sentence ids added by the Batcher.

The input character id placeholder can be dimensioned (None, None, 50), with both the batch dimension (axis=0) and time dimension (axis=1) determined for each batch, up to the maximum batch size specified in the BidirectionalLanguageModel constructor.

After running inference with the batch, the returned biLM embeddings are a numpy array with shape (n_sentences, 3, max_sentence_length, 1024), after removing the special begin/end tokens.
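
As a concrete illustration (a sketch, reusing the Batcher from the example above):

sentences = [['The', 'cat', 'sat', 'on', 'the', 'mat'], ['Dogs', 'bark']]
char_ids = batcher.batch_sentences(sentences)
# char_ids.shape == (2, 8, 50): 6 tokens plus <S> and </S>, second row right-padded with 0s
# biLM embeddings for this batch: (2, 3, 6, 1024) once the special tokens are removed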

Vocabulary file

The Batcher takes a vocabulary file as input for efficiency. This is a text file with one token per line, separated by newlines (\n). Each token in the vocabulary is cached as the appropriate 50 character id sequence once. Since the model is completely character based, tokens not in the vocabulary file are still handled appropriately at run time, at a small cost in speed. It is recommended to always include the special <S> and </S> tokens (case sensitive) in the vocabulary file.

ELMo with character input

See usage_character.py for a detailed usage example.

ELMo with pre-computed and cached context independent token representations

To speed up model inference with a fixed, specified vocabulary, it is possible to pre-compute the context independent token representations, write them to a file, and re-use them for inference. Note that we don't support falling back to character inputs for out-of-vocabulary words, so this should only be used when the biLM is used to compute embeddings for input with a fixed, defined vocabulary.

To use this option:

  1. First create a vocabulary file with all of the unique tokens in your dataset and add the special <S> and </S> tokens.
  2. Run dump_token_embeddings with the full model to write the token embeddings to a hdf5 file.
  3. Use TokenBatcher (instead of Batcher) with your vocabulary file, and pass use_character_inputs=False and the output file from step 2 (as embedding_weight_file) to the BidirectionalLanguageModel constructor, as in the sketch below.

See usage_token.py for a detailed usage example.
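
A minimal sketch of this workflow, modeled on usage_token.py (file paths are placeholders):

import tensorflow as tf
from bilm import TokenBatcher, BidirectionalLanguageModel, weight_layers, \
    dump_token_embeddings

vocab_file = '/path/to/vocab.txt'            # includes <S> and </S>
options_file = '/path/to/options.json'
weight_file = '/path/to/weights.hdf5'
token_embedding_file = '/path/to/elmo_token_embeddings.hdf5'

# Step 2: dump the context independent token embeddings for the vocabulary.
dump_token_embeddings(vocab_file, options_file, weight_file, token_embedding_file)
tf.reset_default_graph()

# Step 3: build the graph with token inputs instead of character inputs.
batcher = TokenBatcher(vocab_file)
token_ids = tf.placeholder('int32', shape=(None, None))
bilm = BidirectionalLanguageModel(
    options_file, weight_file,
    use_character_inputs=False,
    embedding_weight_file=token_embedding_file)
elmo = weight_layers('elmo', bilm(token_ids), l2_coef=0.0)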

Dumping biLM embeddings for an entire dataset to a single file.

To use this option, create a text file with your tokenized dataset. Each line is one tokenized sentence (whitespace separated). Then use dump_bilm_embeddings.

The output file is hdf5 format. Each sentence in the input data is stored as a dataset with key str(sentence_id) where sentence_id is the line number in the dataset file (indexed from 0). The embeddings for each sentence are a shape (3, n_tokens, 1024) array.

See usage_cached.py for a detailed example.
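
A minimal sketch (file paths are placeholders), following usage_cached.py; the hdf5 keys match the layout described above:

import h5py
from bilm import dump_bilm_embeddings

dump_bilm_embeddings(
    '/path/to/vocab.txt', '/path/to/dataset.txt',
    '/path/to/options.json', '/path/to/weights.hdf5',
    '/path/to/elmo_embeddings.hdf5')

with h5py.File('/path/to/elmo_embeddings.hdf5', 'r') as fin:
    first_sentence = fin['0'][...]   # shape (3, n_tokens, 1024) for line 0 of the dataset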

Training a biLM on a new corpus

Broadly speaking, the process to train and use a new biLM is:

  1. Prepare input data and a vocabulary file.
  2. Train the biLM.
  3. Test (compute the perplexity of) the biLM on heldout data.
  4. Write out the weights from the trained biLM to a hdf5 file.
  5. See the instructions above for using the output from Step #4 in downstream models.

1. Prepare input data and a vocabulary file.

To train and evaluate a biLM, you need to provide:

  • a vocabulary file
  • a set of training files
  • a set of heldout files

The vocabulary file is a text file with one token per line. It must also include the special tokens <S>, </S> and <UNK> (case sensitive).

IMPORTANT: the vocabulary file should be sorted in descending order by token count in your training data. The first three lines should be the special tokens (<S>, </S> and <UNK>), then the most common token in the training data, ending with the least common token.
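
For example, the first few lines of a (hypothetical) vocabulary file might look like:

<S>
</S>
<UNK>
the
,
.
of
and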

NOTE: the vocabulary file used in training may differ from the one used for prediction.

The training data should be randomly split into many training files, each containing one slice of the data. Each file contains pre-tokenized and white space separated text, one sentence per line. Don't include the <S> or </S> tokens in your training data.

All tokenization/normalization is done before training a model, so both the vocabulary file and training files should include normalized tokens. As the default settings use a fully character based token representation, in general we do not recommend any normalization other than tokenization.

Finally, reserve a small amount of the training data as heldout data for evaluating the trained biLM.

2. Train the biLM.

The hyperparameters used to train the ELMo model can be found in bin/train_elmo.py.

The ELMo model was trained on 3 GPUs. To train a new model with the same hyperparameters, first download the training data from the 1 Billion Word Benchmark. Then download the vocabulary file. Finally, run:

export CUDA_VISIBLE_DEVICES=0,1,2
python bin/train_elmo.py \
    --train_prefix='/path/to/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/*' \
    --vocab_file /path/to/vocab-2016-09-10.txt \
    --save_dir /output_path/to/checkpoint

3. Evaluate the trained model.

Use bin/run_test.py to evaluate a trained model, e.g.

export CUDA_VISIBLE_DEVICES=0
python bin/run_test.py \
    --test_prefix='/path/to/1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-000*' \
    --vocab_file /path/to/vocab-2016-09-10.txt \
    --save_dir /output_path/to/checkpoint

4. Convert the tensorflow checkpoint to hdf5 for prediction with bilm or allennlp.

First, create an options.json file for the newly trained model. To do so, follow the template in an existing file (e.g. the original options.json) and modify it for your hyperparameters.

Important: always set n_characters to 262 after training (see below).

Then run:

python bin/dump_weights.py \
    --save_dir /output_path/to/checkpoint \
    --outfile /output_path/to/weights.hdf5

Frequently asked questions and other warnings

Can you provide the tensorflow checkpoint from training?

The tensorflow checkpoint is available by downloading these files:

How do I fine tune a model on additional unlabeled data?

First download the checkpoint files above. Then prepare the dataset as described in the section "Training a biLM on a new corpus", with the exception that we will use the existing vocabulary file instead of creating a new one. Finally, use the script bin/restart.py to restart training with the existing checkpoint on the new dataset. For small datasets (e.g. < 10 million tokens) we only recommend tuning for a small number of epochs and monitoring the perplexity on a heldout set, otherwise the model will overfit the small dataset.

Are the softmax weights available?

They are available in the training checkpoint above.

Can you provide some more details about how the model was trained?

The script bin/train_elmo.py has hyperparameters for training the model. The original model was trained on 3 GTX 1080 GPUs for 10 epochs, taking about two weeks.

For input processing, we used the raw 1 Billion Word Benchmark dataset here, and the existing vocabulary of 793471 tokens, including <S>, </S> and <UNK>. You can find our vocabulary file here. At the model input, all text used the full character based representation, including tokens outside the vocab. For the softmax output we replaced OOV tokens with <UNK>.

The model was trained with a fixed size window of 20 tokens. The batches were constructed by padding sentences with <S> and </S>, then packing tokens from one or more sentences into each row to completely fill each batch. Partial sentences and the LSTM states were carried over from batch to batch so that the language model could use information across batches for context, but backpropagation was broken at each batch boundary.

Why do I get slightly different embeddings if I run the same text through the pre-trained model twice?

As a result of the training method (see above), the LSTMs are stateful and carry their state forward from batch to batch. This introduces a small amount of non-determinism, especially for the first two batches.

Why does training seem to take forever even with my small dataset?

The number of gradient updates during training is determined by:

  • the number of tokens in the training data (n_train_tokens)
  • the batch size (batch_size)
  • the number of epochs (n_epochs)

Be sure to set these values for your particular dataset in bin/train_elmo.py.
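
These live near the top of bin/train_elmo.py and in its options dict; a sketch (the values shown here are illustrative, so check your copy of the script):

batch_size = 128                 # batch size for each GPU
n_train_tokens = 768648884       # replace with the token count of your own data

options = {
    # ... other hyperparameters ...
    'n_epochs': 10,
    'n_train_tokens': n_train_tokens,
    'batch_size': batch_size,
}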

What's the deal with n_characters and padding?

During training, we fill each batch to exactly 20 tokens by adding <S> and </S> to each sentence, then packing tokens from one or more sentences into each row to completely fill each batch. As a result, we do not allocate space for a special padding token. The UnicodeCharsVocabulary that converts token strings to lists of character ids always uses a fixed number of character embeddings of n_characters=261, so always set n_characters=261 during training.

However, for prediction, we ensure each sentence is fully contained in a single batch, and as a result pad sentences of different lengths with a special padding id. This occurs in the Batcher. As a result, set n_characters=262 during prediction in the options.json.

How can I use ELMo to compute sentence representations?

Simple methods like average and max pooling of the word level ELMo representations across sentences work well, often outperforming supervised methods on benchmark datasets. See "Evaluation of sentence embeddings in downstream and linguistic probing tasks", Perone et al., 2018 (arXiv).
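
For instance, a mean-pooled sentence vector can be computed from the dumped biLM embeddings; this helper is just a sketch and not part of the library:

import numpy as np

def sentence_embedding(elmo_layers, n_tokens):
    # elmo_layers: (3, max_sentence_length, 1024) biLM output for one sentence
    # average the three layers, then mean-pool over the real (unpadded) tokens
    return elmo_layers[:, :n_tokens, :].mean(axis=(0, 1))   # shape (1024,)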

I'm seeing a WARNING when serializing models, is it a problem?

The below warning can be safely ignored:

2018-08-24 13:04:08,779 : WARNING : Error encountered when serializing lstm_output_embeddings.
Type is unsupported, or the types of the items don't match field type in CollectionDef.
'list' object has no attribute 'name'

bilm-tf's People

Contributors

deneutoy, matt-peters, philipmay, stephanheijl


bilm-tf's Issues

How to fine tune

In the paper "Deep contextualized word representations" there was a supplemental section about fine tuning biLM.

I would like to know how to do it, specifically:

  • How to load pre-trained model on a data set and train further, for example 1 epoch, on a different data set?

I guess the restart_ckpt_file argument can be used, but I don't know how to use it.
Thanks in advance!

The perplexity for bi-direction LM

In the paper, you said that the perplexity of the forward and backward LM is 39.6. Did you use gold words when validating the bi-LM?

question about cnn embedding dim and lstm dim

In the code, both the CNN embedding dim and the individual LSTM output dims are 512.
The paper says it computes a task specific weighting of all biLM layers.
The biLM layer embedding is the concatenation of [forward-lstm, backward-lstm], so its dim should be 1024.
So how do you compute a weighting between the biLM layer embeddings (1024) and the CNN embedding (512)? How can they be added when the dims differ?

Can you please release the training code?

Hi,
I would like to train ELMo on my own dataset. Can you please release the training code so that I can use the weights it generates with the bilm-tf application? I would be thankful to get some insight from your training code even if it is not ready for GitHub.

Thank you.

Training does not end

I have been training ELMo on a 30k sentence dataset for the last 24 hours and it is still not finished. The training perplexity has been 2.12 for a while and is not changing. I am also not sure what the output log means. I am getting something like this:

Batch 85900, train_perplexity=2.1179967
Total time: 87088.18443918228
Loading data from: ./data/elmo/small/data/part2.txt
Loaded 1014 sentences.
Finished loading
Loading data from: ./data/elmo/small/data/part1.txt
Loaded 29000 sentences.
Finished loading
Loading data from: ./data/elmo/small/data/part1.txt
Loaded 29000 sentences.
Finished loading

When is the training going to stop? Do I need to terminate training on my own?

Also, I have 2 files, part1.txt (for training) and part2.txt (for validation). I am not sure if Elmo is actually using part1 for training and part2 for validation. How can I ensure that?

Invalid shape initializing char_embed.

Hi!
After saving a checkpoint I tried to load weights.hdf5 and got this error:
...
shape.as_list(), dtype=dtype, partition_info=partition_info)
File "/some_path/bilm-tf/bilm/model.py", line 238, in ret
varname_in_file, shape, weights.shape)
ValueError: Invalid shape initializing char_embed, got [261, 16], expected (262, 16)

Can anybody help me?

Generate ELMO from Glove

I have a model that performs a sentiment analysis task and uses GloVe as its word embedding; at the start, I load the GloVe file glove.xxxB.yyyd.txt (xxx = tokens, yyy = dimension). Now I need to load, instead of that, an ELMo file equivalent to this GloVe file. In other words, I need a one to one mapping between GloVe and ELMo; is that possible? And if it is possible, what is the exported dimension of ELMo?

Using ELMo in SQuAD with spaCy tokenizing way

Hi, I have run into a problem using ELMo with spaCy.

I use spaCy to preprocess the text data, and without ELMo the result looks fine. However, when I use the model with both spaCy and ELMo, I get a very bad result, 0.08. Many NaN and inf values appear when I look at the TensorFlow debugger. If I use NLTK and ELMo, the result is what I expect.

I think maybe there is something wrong with how I am using ELMo. However, when I looked at the ELMo source code, I didn't see any dependence on the tokenizer (NLTK vs. spaCy). And I used the pre-trained ELMo data for SQuAD. I've been plagued by this problem for a long time, and I really want to know if I missed something. Is it necessary to train new ELMo data when I switch to spaCy?

Problem with handling encoding failure

I noticed that the method _convert_word_to_char_ids found in bilm/data.py can't handle encoding errors under certain conditions. The problem is in the code chunk below:

        word_encoded = word.encode('utf-8', 'ignore')[:(self.max_word_length-2)]
        code[0] = self.bow_char
        for k, chr_id in enumerate(word_encoded, start=1):
            code[k] = chr_id
        code[k + 1] = self.eow_char

As you can see, if a token consists of a single character that failed to encode, then the word_encoded variable is going to be an empty string. When this goes into the enumerate for-loop, the loop body never runs, so the k variable is never assigned and the last line fails with the following error:

UnboundLocalError: local variable 'k' referenced before assignment

This can be handled with an exception, which could flag the failed token and print a warning. Since I haven't gone deep into the specifics of the library, I am not sure if this is a proper solution, so I thought I might as well bring this to your attention.
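
One possible workaround (not an official fix) is to initialize k before the loop, so that an empty encoding still writes the end-of-word character right after the begin-of-word character:

        word_encoded = word.encode('utf-8', 'ignore')[:(self.max_word_length-2)]
        code[0] = self.bow_char
        k = 0  # stays 0 if word_encoded is empty
        for k, chr_id in enumerate(word_encoded, start=1):
            code[k] = chr_id
        code[k + 1] = self.eow_char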

EDIT:

Another thing I have noticed is that empty files in the training data folder cause the training to fail once they are processed; meaning the training could go on for days, only to fail on an empty file. So, just to save users the trouble, it would be very kind of you to warn them that empty files will cause a problem, or maybe add some logic to safely skip such files.

Code loads only two of the data shards while training.

I have a gigantic dataset to train ELMo on, so I split the training set into 1000 separate files. While loading data for training I see that only two of the files are loaded (reverse=True and reverse=False). Why is that? Or am I missing something?

And btw Congratulations on winning the best paper award at NAACL!

Thanks,

OOM when allocating tensor for large vocabulary

Hi!
When I try to run bin/run_test.py on GPUs, I get:

....
2018-06-18 12:27:48.890298: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                 15922230068
InUse:                 15585278208
MaxInUse:              15922230016
NumAllocs:                     376
MaxAllocSize:          15173454848

2018-06-18 12:27:48.890313: W tensorflow/core/common_runtime/bfc_allocator.cc:277] *****************************************************************************xxxxxxxxxxxxxxxxxxxxxxx
2018-06-18 12:27:48.890342: W tensorflow/core/framework/op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[512,5555540]
Traceback (most recent call last):
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1139, in _do_call
    return fn(*args)
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[512,5555540]
         [[Node: lm/transpose = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](lm/softmax/W/read/_111, lm/transpose/sub_1)]]
         [[Node: lm/mul_8/_141 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_nam
e="edge_565_lm/mul_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bin/run_test.py", line 42, in <module>
    main(args)
  File "bin/run_test.py", line 29, in main
    test(options, ckpt_file, data, batch_size=args.batch_size)
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/bilm-0.1-py3.6.egg/bilm/training.py", line 1024, in test
    feed_dict=feed_dict
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[512,5555540]
         [[Node: lm/transpose = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](lm/softmax/W/read/_111, lm/transpose/sub_1)]]
         [[Node: lm/mul_8/_141 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_nam
e="edge_565_lm/mul_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

Caused by op 'lm/transpose', defined at:
  File "bin/run_test.py", line 42, in <module>
    main(args)
  File "bin/run_test.py", line 29, in main
    test(options, ckpt_file, data, batch_size=args.batch_size)
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/bilm-0.1-py3.6.egg/bilm/training.py", line 970, in test
    model = LanguageModel(test_options, False)
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/bilm-0.1-py3.6.egg/bilm/training.py", line 71, in __init__
    self._build()
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/bilm-0.1-py3.6.egg/bilm/training.py", line 425, in _build
    self._build_loss(lstm_outputs)
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/bilm-0.1-py3.6.egg/bilm/training.py", line 507, in _build_loss
    tf.transpose(self.softmax_W)
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1278, in transpose
    ret = gen_array_ops.transpose(a, perm, name=name)
  File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3658, in transpose
  result = _op_def_lib.apply_op("Transpose", x=x, perm=perm, name=name)
File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
  op_def=op_def)
File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
  original_op=self._default_original_op, op_def=op_def)
File "<some_path>/env/allen_elmo/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
  self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[512,5555540]
       [[Node: lm/transpose = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](lm/softmax/W/read/_111, lm/transpose/sub_1)]]
       [[Node: lm/mul_8/_141 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_nam
e="edge_565_lm/mul_8", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

I have 5555540 tokens in my vocabulary.
It runs on the CPU (export CUDA_VISIBLE_DEVICES=""), but that is too slow. I cannot change the size of the vocabulary.

Tensorflow version down-compatibility

Constrained by the computational resources available, we have to work with TF v1.0 (the manager of the supercomputer cluster wants to keep it that way for the benefit of other users). I was wondering if there's any way we could still use ELMo with v1.0.

Thanks!

Why are embeddings of feature length 1024?

(More of a question than an issue)
The embeddings are of shape (None, 3, None, 1024). Is there any specific reason why the embeddings have a size of 1024? Which hyperparameter should I change if I want to reduce the embedding size?

NER performance with Ontonotes and number-related ELMo embeddings

Thanks a lot for this work and making it available!

I used ELMo contextualized embeddings in my Keras framework (DeLFT) and I could reproduce the excellent results for the CoNLL 2003 NER task - actually slightly better than what you reported in your NAACL 2018 paper (92.47 averaged over 10 training runs, using the 5.5B ELMo model, warm-up, concatenation with GloVe embeddings, and a Lample 2016 BiLSTM-CRF architecture).

However, when using ELMo embeddings with the NER OntoNotes CoNLL-2012 dataset, I see a large drop of 5.0 points in f-score compared to GloVe only. The drop is the same whether I use ELMo only or ELMo embeddings concatenated with GloVe.

Here is the evaluation with Glove without ELMo:

Evaluation on test set:
        f1 (micro): 86.17
                 precision    recall  f1-score   support

       QUANTITY     0.7321    0.7810    0.7558       105
          EVENT     0.6275    0.5079    0.5614        63
           NORP     0.9193    0.9215    0.9204       841
       CARDINAL     0.8294    0.7487    0.7870       935
        ORDINAL     0.7982    0.9128    0.8517       195
            ORG     0.8451    0.8635    0.8542      1795
       LANGUAGE     0.7059    0.5455    0.6154        22
           TIME     0.6000    0.5943    0.5972       212
        PRODUCT     0.7333    0.5789    0.6471        76
            FAC     0.6630    0.4519    0.5374       135
           DATE     0.8015    0.8571    0.8284      1602
          MONEY     0.8714    0.8631    0.8672       314
            LAW     0.6786    0.4750    0.5588        40
        PERCENT     0.8808    0.8682    0.8745       349
    WORK_OF_ART     0.6480    0.4880    0.5567       166
            LOC     0.7500    0.7709    0.7603       179
            GPE     0.9494    0.9388    0.9441      2240
         PERSON     0.9038    0.9306    0.9170      1988

    avg / total     0.8618    0.8615    0.8617     11257

And here are the results with ELMo:

Evaluation on test set:
	f1 (micro): 79.62
             precision    recall  f1-score   support

WORK_OF_ART     0.5510    0.6506    0.5967       166
    PRODUCT     0.6582    0.6842    0.6710        76
      MONEY     0.8116    0.8503    0.8305       314
        FAC     0.7130    0.5704    0.6337       135
   LANGUAGE     0.7778    0.6364    0.7000        22
   QUANTITY     0.1361    0.8000    0.2327       105
       TIME     0.6370    0.4387    0.5196       212
        GPE     0.9535    0.9437    0.9486      2240
      EVENT     0.6316    0.7619    0.6906        63
    PERCENT     0.8499    0.8596    0.8547       349
        ORG     0.9003    0.8758    0.8879      1795
        LOC     0.7611    0.7654    0.7632       179
     PERSON     0.9297    0.9452    0.9374      1988
    ORDINAL     0.8148    0.1128    0.1982       195
        LAW     0.5405    0.5000    0.5195        40
       NORP     0.9191    0.9322    0.9256       841
   CARDINAL     0.8512    0.1102    0.1951       935
       DATE     0.8537    0.5137    0.6415      1602

avg / total     0.8423    0.7548    0.7962     11257

I see that the drop is always for named entity classes related somehow to numbers (ORDINAL -65, CARDINAL -58, QUANTITY -53, DATE -18, etc.), while recognition of all the other classes actually improves with ELMo.

I am wondering what could cause this behavior (apart from an implementation error on my side); did you observe something similar?
Are you using special normalization of numbers on the corpus before training the biLM?
I am using the default tokenization of OntoNotes/CoNLL-2012; should I maybe use another particular tokenization?

LSTM final states as the initial states of next batch

Hi!

It seems to me from the code provided that the final states of each batch are fed as the initial states of the next batch. However, in data.py the examples in a batch seem to be the continuation of the previous example in the same batch (when the sentence is longer than the BPTT rollout steps). If what I'm saying is correct, we are feeding the final states to a new batch that is not the continuation of the sentences in the previous batch.

Why is that so? What am I missing?

Value of "n_characters" in char embedding

In train_model.py, "n_characters" is defined as 261. However, in the pretrained models' configs, n_characters is set to 262. Any particular reason?

Test model : https://raw.githubusercontent.com/allenai/bilm-tf/master/tests/fixtures/model/options.json
Pretrained model : https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json

Both models have n_characters=262

Moreover, while reading a pre-trained model, we increase the size by one to add padding

# Have added a special 0 index for padding not present

But we already have a special char for padding
self.pad_char = 260 # <padding>

(n_sentences, 3, max_sentence_length, 1024)

After running inference with the batch, the return biLM embeddings are a numpy array with shape (n_sentences, 3, max_sentence_length, 1024), after removing the special begin/end tokens.
I assume that "(n_sentences, 3, max_sentence_length, 1024)" should be "(n_sentences, max_sentence_length, 1024)"?

Hub version to train new embedding

May I use the TensorFlow Hub version to train my own ELMo embeddings? My corpus is Chinese.
If that is OK, can you give me a simple example?
Thank you so much.

L2 Norm

Sorry to bother you again!
In the paper, you said that an L2 norm penalty is added while training the model, but I could not find it in the training code (training.py). Could you tell me where the L2 norm is added in your training code?

How to concatenate an ELMo vector with the corresponding context-independent token representation?

As described in the paper "Deep contextualized word representations", before being fed into NLP tasks, ELMo vectors are concatenated with the context-independent token representations X like this: [X; ELMo].

But how exactly are they concatenated? Is it element-wise, or do we just join the two vectors end-to-end?

I saw from the source code that the LSTM layers' outputs in the biLM are concatenated element-wise with tf.concat([lstm_output1, lstm_output2], axis=-1), so I feel like the concatenation between ELMo and X should also be element-wise.
But if they are combined element-wise, does X always have to follow the dimension of ELMo's internal LSTM layers?
For example, I see that given 2 sentences and a max sentence length of 10, the vectors created by weight_layers have shape (2, 10, 32), with 32 being the concatenation of the two LSTM layers (forward and backward) whose dimension is 16 (16x2 = 32). However, if we were to combine ELMo with X element-wise as introduced in the paper, X would also need to have shape (num_sentences, max_sentence_length, 32), which rules out X having an embedding dimension other than 32.

If I understand the options.json file correctly, the "projection_dim" hyperparameter determines the internal LSTM layer dimension.
Then, is there any way to manipulate the LSTM layer dimension in the bilm (possibly through lstm { ... projection_dim = ? ... } in the options.json file)? Or am I missing something?
(I ask this because when I tried to change projection_dim and ran the code, I came across the following error.)

======================================================================
ERROR: test_weighted_layers (main.TestWeightedLayers)

Traceback (most recent call last):
File "elmo.py", line 136, in test_weighted_layers
self._check_weighted_layer(1.0, do_layer_norm=True, use_top_only=False)
File "elmo.py", line 36, in _check_weighted_layer
bilm_ops = model(character_ids)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/bilm-0.1-py3.5.egg/bilm/model.py", line 97, in call
max_batch_size=self._max_batch_size)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/bilm-0.1-py3.5.egg/bilm/model.py", line 286, in init
self._build()
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/bilm-0.1-py3.5.egg/bilm/model.py", line 290, in _build
self._build_word_char_embeddings()
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/bilm-0.1-py3.5.egg/bilm/model.py", line 415, in _build_word_char_embeddings
dtype=DTYPE)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 1317, in get_variable
constraint=constraint)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 1079, in get_variable
constraint=constraint)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 417, in get_variable
return custom_getter(**custom_getter_kwargs)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/bilm-0.1-py3.5.egg/bilm/model.py", line 275, in custom_getter
return getter(name, *args, **kwargs)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 394, in _true_getter
use_resource=use_resource, constraint=constraint)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 786, in _get_single_variable
use_resource=use_resource)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 2220, in variable
use_resource=use_resource)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 2210, in
previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 2193, in default_variable_creator
constraint=constraint)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 235, in init
constraint=constraint)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 343, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 770, in
shape.as_list(), dtype=dtype, partition_info=partition_info)
File "/home/youngmokcho/note_recognition/venv/lib/python3.5/site-packages/bilm-0.1-py3.5.egg/bilm/model.py", line 246, in ret
varname_in_file, shape, weights.shape)
ValueError: Invalid shape initializing CNN_proj/W_proj, got [124, 8], expected (124, 16)


Ran 1 test in 0.099s

FAILED (errors=1)

I'm currently studying CNN so it was kinda hard for me to trace back through this error, but it looks like projection_dim depends on some other value.

To sum up, all I want to know is how to manipulate elmo's embedding dimension in order to match the size of ELMo with that of context-independent token representations.

Please correct or ask me if any of my questions is unclear or mistaken.
Thank you for any help you may provide!!

Details about the training process.

I'm really sorry to bother you again. There are two ways to get the perplexity of your language model: (1) you feed the real words in the sentence to the model and the model predicts the next word, which is called training; (2) you feed the word the model just predicted back into the model and it predicts the next one, which is called inference.
So could you tell me which way you used to get the perplexity of 39.4?
I'm not good at expressing my view in English, and thanks for your patience!
By the way, could you tell me the learning rate you used when training ELMo?

InvalidArgumentError (see above for traceback): Sampler's range is too small.

Hello, I encountered a problem: "tensorflow.python.framework.errors_impl.InvalidArgumentError: Sampler's range is too small.
[[Node: lm/sampled_softmax_loss_1/LogUniformCandidateSampler = LogUniformCandidateSampler[num_sampled=8192, num_true=1, range_max=6603, seed=0, seed2=0, unique=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](lm/Reshape_6, ^lm/dropout_1/mul)]]
"
I want to train based on words without chars, so I changed the code to use load_vocab(args.vocab_file) and removed the char CNN from the options dict. I don't know why this error happens.

Question about the weight_layers

Hi, thanks for the great paper and nice implementation.

After reading through the paper and the code, I still feel confused about the weight_layers.

I saw that dump_token_embeddings only returns the intermediate LSTM states, not the final weighted ELMo vector. As mentioned in the paper, the weight_layers should be trained with the downstream task. However, in usage_token.py, the code directly creates the weight_layers and uses them without training. So I am confused by this usage of the weight layers. Could you please explain it a little bit?

Also, in test_elmo.py, why is expected_elmo calculated this way? I don't understand why actual_elmo should be close to the values from these calculations. Could you please also explain that?

Thanks a lot.

max_batch_size issue

Hi,

While trying to create embeddings for questions as shown in usage_token.py, I am getting an error because the tensor has a different size than max_batch_size.

How can I handle that case? Have you encountered such a problem in the project?

Thank you

How long does it take to train the model?

How long does it take to train the model from the ELMo paper? I read that you used 3 GPUs. Which ones?
I want to get a rough idea before I can train my own.
This is not an issue per se, so if there's a different forum to discuss these things please let me know.

And congratulations on winning the best paper award at NAACL!

Release

Looks like the project hasn't been released. Is that correct?

Thank you

a strange error occurred

'ResourceExhaustedError (see above for traceback): OOM when allocating tensor of shape [16384] and type float'

I modified the code of train_elmo.py like this:

options = {
    'bidirectional': True,

    # 'char_cnn': {'activation': 'relu',
    #  'embedding': {'dim': 16},
    #  'filters': [[1, 32],
    #   [2, 32],
    #   [3, 64],
    #   [4, 128],
    #   [5, 256],
    #   [6, 512],
    #   [7, 1024]],
    #  'max_characters_per_token': 50,
    #  'n_characters': 6707,
    #  'n_highway': 2},

I want to run this code based only on word embeddings, without char embeddings.
Then this 'ResourceExhaustedError' occurred.
Could you tell me how to fix it?
Thanks!

How to fine tune the existing weights on new data ?

I converted the hdf5 file back to a ckpt file (using the custom_getter method in bilm/model.py) and tried to use it with the architecture in bilm/training.py, but the loaded weights give very bad perplexity on heldout data when I run run_test.py. Are the architectures in bilm/model.py and bilm/training.py compatible? If you feel I'm doing something wrong, would it be possible for you to share the ckpt file for the given hdf5 file?

Thanks

Trainable parameters in TF Hub release

I had a few questions about the set of trainable parameters in the TF Hub releases of ELMo. The initial release mentions that the LSTM cell parameters are trainable (and this is what I expected, fine-tuning on the downstream task's supervised labels). However, I recently came across this paper which mentioned that the LSTM parameters in ELMo are fixed, and it also seems to be the case in the current release of ELMo on TF Hub.

  1. Were the LSTM parameters kept fixed during the experiments described in the paper? (and the only fine-tuning done ignored the supervised labels, and used the training set of the downstream task for language modelling?)
  2. Did you notice significant performance drops keeping the LSTM cell parameters trainable during the downstream task?

Converging to 25 ppl after 7 days?

Hi,

I've been training the model on the 1 Billion Word Benchmark for 7 days now on 4 Tesla K80 GPUs and it seems to be converging to a perplexity of around 25 (it has not improved for 24h now). See the tail of the log below.
Is this expected behaviour? Has it converged?

Batch 142200, train_perplexity=24.746037
Total time: 585740.8309390545
Batch 142300, train_perplexity=25.46843
Total time: 586129.8147296906
Batch 142400, train_perplexity=25.55357
Total time: 586523.1840500832
Loading data from: ../../data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00026-of-00100
Loaded 306324 sentences.
Finished loading
Loading data from: ../../data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00028-of-00100
Loaded 305485 sentences.
Finished loading
Batch 142500, train_perplexity=26.139242
Total time: 586992.946965456
WARNING:tensorflow:Error encountered when serializing lstm_output_embeddings.
Type is unsupported, or the types of the items don't match field type in CollectionDef.
'list' object has no attribute 'name'
Batch 142600, train_perplexity=24.84199
Total time: 587395.0743260384
Batch 142700, train_perplexity=25.43104
Total time: 587794.4523823261
Batch 142800, train_perplexity=25.182297
Total time: 588190.2893879414
Batch 142900, train_perplexity=24.556465
Total time: 588584.6505479813
Batch 143000, train_perplexity=25.966608
Total time: 588982.1930603981
Batch 143100, train_perplexity=25.03588
Total time: 589376.4338204861
Batch 143200, train_perplexity=25.981043
Total time: 589773.4447641373
Loading data from: ../../data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00065-of-00100
Loaded 305213 sentences.
Finished loading
Batch 143300, train_perplexity=25.373167
Total time: 590195.4948370457
Loading data from: ../../data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/news.en-00041-of-00100
Loaded 306092 sentences.

use before assign error

in data.py

j = 0
for k, chr_id in enumerate(word_encoded, start=1):
    code[k] = chr_id
    j = k
k = j
