
s2v's Introduction

Quick-Thought Vectors

This is a TensorFlow implementation accompanying our paper

Lajanugen Logeswaran, Honglak Lee, An efficient framework for learning sentence representations. In ICLR, 2018.

This codebase is based on Chris Shallue's TensorFlow implementation of the SkipThought model. The data preparation, vocabulary expansion and evaluation scripts have been adapted with minor changes. Other code files have been modified and restructured with changes specific to our model.

Contents

  • Model configuration files
  • Pretrained Models
  • Training a Model
  • Evaluating a Model
  • Reference

Model configuration files

We use JSON configuration files to describe models. A configuration file provides a concise description of a model and also makes it easy to concatenate representations from different models, or different types of models, at evaluation time.

The description of a sentence encoder has the following format.

{
        "encoder": "gru",                            # Type of encoder
        "encoder_dim": 1200,                         # Dimensionality of encoder
        "bidir": true,                               # Uni/bi directional
        "checkpoint_path": "",                       # Path to checkpoint
        "vocab_configs": [                           # Configuration of vocabulary/word embeddings
        {
                "mode": "trained",                   # Vocabulary mode: fixed/trained/expand
                "name": "word_embedding",
                "dim": 620,                          # Word embedding size
                "size": 50001,                       # Size of vocabulary
                "vocab_file": "BC_dictionary.txt",   # Dictionary file
                "embs_file": ""                      # Provide external embeddings file
        }
        ]
}

Vocabulary mode can be one of fixed, trained or expand. These modes represent the following cases.

  • fixed - Use fixed, pre-trained embeddings.
  • trained - Train word embeddings from scratch.
  • expand - Use an expanded vocabulary. This mode is only used during evaluation on downstream tasks.

checkpoint_path and vocab_file have to be specified only for evaluation.

For concatenating representations from multiple sentence encoders at evaluation time, the json file can be a list of multiple encoder specifications. See model_configs/BC/eval.json for an example.
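
Conceptually, concatenation simply stacks each encoder's sentence representation side by side. The short NumPy illustration below is not part of the repository; the array shapes are made up for the example.

import numpy as np

# Stand-ins for sentence embeddings produced by two separately configured encoders.
enc_a = np.random.rand(8, 1200)   # e.g. a uni-directional GRU encoder
enc_b = np.random.rand(8, 2400)   # e.g. a bidirectional GRU encoder (forward + backward states)
combined = np.concatenate([enc_a, enc_b], axis=1)
print(combined.shape)             # (8, 3600)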

Pretrained Models

Models trained on the BookCorpus and UMBC datasets can be downloaded from https://bit.ly/2uttm2j. These models are the multi-channel variations (MC-QT) discussed in the paper. If you are interested in evaluating these models or using them in your tasks, jump to Evaluation on downstream tasks.

Training a Model

Prepare the Training Data

The training script requires data to be in (sharded) TFRecord format. scripts/data_prep.sh can be used to generate these files. The script requires a dictionary file and comma-separated paths to files containing tokenized sentences.

  • The dictionary file should contain one word per line. We assume that the first token ("<unk>") represents out-of-vocabulary (OOV) words (see the sketch below for this format).
  • The data files are expected to contain one tokenized sentence per line, in the same order as the source document.
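
For concreteness, here is a minimal sketch (not part of the repository; file names and the vocabulary cutoff are placeholders) of building a dictionary file in the expected format from whitespace-tokenized sentence files:

from collections import Counter

counts = Counter()
for path in ["tokenized_sentences.txt"]:          # one tokenized sentence per line
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())

# "<unk>" first, one word per line; 50001 entries matches the example config above.
vocab = ["<unk>"] + [w for w, _ in counts.most_common(50000)]
with open("dictionary.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")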

Our models were trained on the BookCorpus and UMBC datasets.

The dictionary files we used for training our models are available at https://bit.ly/2G6E14q.

Run the Training Script

Use the run.sh script to train a model. The following variables have to be specified.

* DATA_DIR      # Path to TFRecord files
* RESULTS_HOME  # Directory to store results
* CFG           # Name of model configuration 
* MDL_CFGS      # Path to model configuration files
* GLOVE_PATH    # Path to GloVe dictionary and embeddings

Example configuration files are provided in the model_configs folder. During training, model files will be stored under a directory named $RESULTS_HOME/$CFG.

Training using pre-trained word embeddings

The implementation supports using fixed, pre-trained GloVe word embeddings. The code expects a NumPy array file named glove.840B.300d.npy, containing the GloVe word embeddings, in the $GLOVE_PATH folder.
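
If you are starting from the standard GloVe text release, a conversion along the following lines produces the .npy file. This is a hedged sketch, not the repository's own tooling: it assumes the usual glove.840B.300d.txt format, and the dictionary output file name is a placeholder to be matched to your configuration's vocab_file.

import numpy as np

words, vectors = [], []
with open("glove.840B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        # The last 300 fields are the vector; the rest is the (possibly multi-token) word.
        words.append(" ".join(parts[:-300]))
        vectors.append([float(x) for x in parts[-300:]])

np.save("glove.840B.300d.npy", np.asarray(vectors, dtype=np.float32))
with open("glove_dictionary.txt", "w", encoding="utf-8") as f:  # placeholder dictionary file name
    f.write("\n".join(words) + "\n")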

Evaluating a Model

Expanding the Vocabulary

Once a model is trained, the training vocabulary can optionally be expanded to a larger vocabulary using the technique proposed in the SkipThought paper. The voc_exp.sh script performs this expansion. Since Word2Vec embeddings are used for expansion, you will have to download a pre-trained Word2Vec model; you will also need the gensim library to run the script.
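
The expansion itself is handled by voc_exp.sh and the scripts it calls; the snippet below only illustrates loading a Word2Vec model with gensim so you can verify your download. The GoogleNews file name is an assumption; point it at whichever Word2Vec binary you use.

from gensim.models import KeyedVectors

# Load a pre-trained Word2Vec model in binary word2vec format (file name is an assumption).
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
print(w2v.vector_size)        # embedding dimensionality, e.g. 300
print(w2v["sentence"][:5])    # first few dimensions of the vector for "sentence"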

Evaluation on downstream tasks

Use the eval.sh script for evaluation. The following variables need to be set.

* SKIPTHOUGHTS  # Path to SkipThoughts implementation
* DATA          # Data directory for downstream tasks
* TASK          # Name of the task
* MDLS_PATH     # Path to model files
* MDL_CFGS      # Path to model configuration files
* CFG           # Name of model configuration 
* GLOVE_PATH    # Path to GloVe dictionary and embeddings

We use the downstream-task evaluation scripts released by the authors of the SkipThought model. These scripts train a linear layer on top of the sentence embeddings for each task (see the sketch below). You will need to clone or download the skip-thoughts GitHub repository by ryankiros and point the SKIPTHOUGHTS variable to it. Set the DATA variable to the directory containing the data for the downstream tasks; see that repository for further details on downloading and setting up the data.
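
The snippet below is not the eval.sh pipeline; it is only a hedged illustration of what "training a linear layer on top of the sentence embeddings" means for a classification task, using scikit-learn and hypothetical pre-computed embedding/label files.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical files holding pre-computed sentence embeddings and task labels.
train_emb, train_lbl = np.load("train_embeddings.npy"), np.load("train_labels.npy")
test_emb, test_lbl = np.load("test_embeddings.npy"), np.load("test_labels.npy")

clf = LogisticRegression(max_iter=1000)   # linear classifier on top of frozen embeddings
clf.fit(train_emb, train_lbl)
print("test accuracy:", clf.score(test_emb, test_lbl))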

To evaluate the pre-trained models, set the directory variables appropriately. Set MDLS_PATH to the directory of downloaded models. Set the configuration variable CFG to one of

  • MC-BC (Multi-channel BookCorpus model) or
  • MC-UMBC (Multi-channel BookCorpus + UMBC model)

Set the TASK variable to the task of interest.

Reference

If you find our code useful, please cite us:

@inproceedings{logeswaran2018an,
  title={An efficient framework for learning sentence representations},
  author={Lajanugen Logeswaran and Honglak Lee},
  booktitle={International Conference on Learning Representations},
  year={2018},
  url={https://openreview.net/forum?id=rJvJXZb0W},
}

Contact: [email protected]


s2v's Issues

Input/Output Nodes for Freezing the Model

I'm trying to freeze the Quick-thoughts model but am having trouble identifying the input and output nodes. I believe the output node should be "word_embeddings" but am not 100% sure.

My end goal is to be able to use the encoder from a frozen version of the model.

Easiest way to use the model as a "library"?

Hi,

I am interested in comparing to your QuickThoughts method by evaluating it on the full SentEval benchmark. To do that I need to write something like the following:

import numpy as np

def batcher(params, batch):
    # SentEval batcher: replace empty sentences with a period token, encode, and stack.
    batch = [sent if sent != [] else ["."] for sent in batch]
    embeddings = params.model.encode(batch)
    return np.vstack(embeddings)

Is there a way for me to import your pre-trained model, and use it similar to the above code? E.g. is there a workflow for embedding lists of strings?

The results using pre-trained word embeddings are worse than those using random embeddings?

It is quite surprising that the version using pre-trained word embeddings is worse than the one using random word embeddings. I don't know if my configuration is wrong; it is as follows:
train.json:
{
        "encoder": "gru",
        "encoder_dim": 1200,
        "bidir": true,
        "case_sensitive": true,
        "checkpoint_path": "",
        "vocab_configs": [
        {
                "mode": "fixed",
                "name": "word_embedding",
                "cap": false,
                "dim": 200,
                "size": 1133884,
                "vocab_file": "/nfs/private/FST/models/embeddings/glove.840B.300d.txt",
                "embs_file": ""
        }
        ]
}

run.sh:
RESULTS_HOME="results"
MDL_CFGS="model_configs"
GLOVE_PATH="/nfs/private/FST/models/embeddings/"

DATA_DIR="data/CS_10M_pretrained/TFRecords"
NUM_INST=10000000 # Number of sentences

CFG="CS_10M_pretrained"

BS=400
SEQ_LEN=30

export CUDA_VISIBLE_DEVICES=0
python src/train.py \
  --input_file_pattern="$DATA_DIR/train-?????-of-00100" \
  --train_dir="$RESULTS_HOME/$CFG/train" \
  --learning_rate_decay_factor=0 \
  --batch_size=$BS \
  --sequence_length=$SEQ_LEN \
  --nepochs=1 \
  --num_train_inst=$NUM_INST \
  --save_model_secs=1800 \
  --Glove_path=$GLOVE_PATH \
  --model_config="$MDL_CFGS/$CFG/train.json" &

export CUDA_VISIBLE_DEVICES=1
python src/eval.py \
  --input_file_pattern="$DATA_DIR/validation-?????-of-00001" \
  --checkpoint_dir="$RESULTS_HOME/$CFG/train" \
  --eval_dir="$RESULTS_HOME/$CFG/eval" \
  --batch_size=$BS \
  --model_config="$MDL_CFGS/$CFG/train.json" \
  --eval_interval_secs=1800 \
  --sequence_length=$SEQ_LEN &

Unclear Instructions

The instructions are very vague. It's not even clear whether this project is self-contained. In eval.sh, there is a parameter pointing to the directory of a SkipThoughts implementation. Does that mean SkipThoughts needs to be installed separately, or is it included somewhere in this repo? If it's the former, is it still compatible with the rest of your codebase?

Can you provide more details about the installation process and an example of using the pre-trained models?

Cannot load ckpt model file

I trained the model using the existing code, but when I try to load it using the Skip-thought code:

encoder.load_model(configuration.model_config(bidirectional_encoder=False),
                   vocabulary_file=VOCAB_FILE,
                   embedding_matrix_file=EMBEDDING_MATRIX_FILE,
                   checkpoint_path=CHECKPOINT_PATH)

it throws the following exception:

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:


INFO:tensorflow:Reading vocabulary from C:\Users\DATA_DIR\QTV\Ver_1\exp_vocab\word_embedding.txt
INFO:tensorflow:Loaded vocabulary with 929088 words.
INFO:tensorflow:Loading embedding matrix from C:\Users\DATA_DIR\QTV\Ver_1\exp_vocab\word_embedding.npy
INFO:tensorflow:Loaded embedding matrix with shape (929088, 300)
INFO:tensorflow:Building model.
INFO:tensorflow:Loading model from checkpoint: C:\Users\DATA_DIR\QTV\Ver_1\train_dir\model.ckpt-10000
INFO:tensorflow:Restoring parameters from C:\Users\DATA_DIR\QTV\Ver_1\train_dir\model.ckpt-10000


NotFoundError Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\envs\skipthoughtenv\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1333 try:
-> 1334 return fn(*args)
1335 except errors.OpError as e:

~\AppData\Local\Continuum\anaconda3\envs\skipthoughtenv\lib\site-packages\tensorflow\python\client\session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1318 return self._call_tf_sessionrun(
-> 1319 options, feed_dict, fetch_list, target_list, run_metadata)
1320

~\AppData\Local\Continuum\anaconda3\envs\skipthoughtenv\lib\site-packages\tensorflow\python\client\session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1406 self._session, options, feed_dict, fetch_list, target_list,
-> 1407 run_metadata)
1408

NotFoundError: Key encoder/gru_cell/candidate/layer_norm/u/beta not found in checkpoint
[[{{node save/RestoreV2}} = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Any hint would be appreciated...

Loss values during training

I would like to know what your loss values during training are. For me, the loss starts at around 4-5 and then decreases to around 2 after a few minutes. After that it doesn't seem to improve, and fluctuates between 1 and 3. It seems that the loss stops decreasing after a short time of training. What are your loss values during training and at the end of training?

Can run on GPU but not CPU

I can successfully run the model on a virtual machine with GPUs but am unable to run the same code on my local computer using a CPU.

I am getting the following error:
InvalidArgumentError (see above for traceback): /Users/angela.zhao/Documents/Projects/text-similarity/S2V/s2v_models/BS400-W300-S1200-Glove-BC-bidir/train/model.ckpt-114466.data-00000-of-00001; Invalid argument
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Pre-trained word embeddings

Can you please provide us with the glove.840B.300d.npy file that you used with the pre-trained GloVe word embeddings?
