
hmtl's Introduction

HMTL (Hierarchical Multi-Task Learning model)

***** New November 20th, 2018: Online web demo is available *****

We released an online demo (along with pre-trained weights) so that you can play with the model yourself. The code for the web interface is also available in the demo folder.

To download the pre-trained models, please install Git LFS and run git lfs pull. The weights of the model will be saved in the model_dumps folder.
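
For example, from the root of the repository (assuming Git LFS is already installed on your machine):

git lfs install
git lfs pull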

A Hierarchical Multi-Task Approach for Learning Embeddings from Semantic Tasks
Victor SANH, Thomas WOLF, Sebastian RUDER
Accepted at AAAI 2019

HMTL Architecture

About

HMTL is a Hierarchical Multi-Task Learning model which combines a set of four carefully selected semantic tasks (namely Named Entity Recognition, Entity Mention Detection, Relation Extraction and Coreference Resolution). The model achieves state-of-the-art results on Named Entity Recognition, Entity Mention Detection and Relation Extraction. Using SentEval, we show that as we move from the bottom to the top layers of the model, the model tends to learn more complex semantic representations.

For further details on the results, please refer to our paper.

We released the code for training, fine-tuning and evaluating HMTL. We hope that this code will be useful for building your own multi-task models (hierarchical or not). The code is written in Python and powered by PyTorch.

Dependencies and installation

The main dependencies are PyTorch, AllenNLP and SentEval (used to evaluate the learned embeddings).

The code works with Python 3.6. A stable version of the dependencies is listed in requirements.txt.

You can quickly set up a working environment by calling the script ./script/machine_setup.sh. It installs Python 3.6, creates a clean virtual environment, and installs all the required dependencies (listed in requirements.txt). Please adapt the script to your needs.

Example usage

We based our implementation on the AllenNLP library. For an introduction to this library, you should check the AllenNLP tutorials.

An experiment is defined in a JSON configuration file (see configs/*.json for examples). The configuration file mainly describes the datasets to load and the model to create, along with all of the model's hyper-parameters.
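
If you want to inspect a configuration programmatically, a minimal sketch using AllenNLP's Params helper (assuming the 0.x API this codebase is built on):

from allennlp.common.params import Params

# Load an experiment configuration and list its top-level sections
# (datasets to read, model definition, training settings, etc.).
params = Params.from_file("configs/hmtl_coref_conll.json")
print(list(params.as_dict().keys()))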

Once you have set up your configuration file (and defined custom classes such as DatasetReaders if needed), you can simply launch a training run with the following command and arguments:

python train.py --config_file_path configs/hmtl_coref_conll.json --serialization_dir my_first_training

Once the training has started, you can simply follow it in the terminal or open TensorBoard (please make sure you have installed TensorBoard and its TensorFlow dependency beforehand):

tensorboard --logdir my_first_training/log

Evaluating the embeddings with SentEval

We used SentEval to assess the linguistic properties learned by the model. hmtl_senteval.py gives an example of how we can create an interface between SentEval and HMTL. It evaluates the linguistic properties learned by every layer of the hierarchy (shared word embeddings and encoders).
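
For reference, SentEval drives the evaluation through two callbacks, prepare and batcher. A minimal sketch of that interface is below; the hmtl_encode placeholder stands in for a forward pass through one layer of the HMTL hierarchy and is purely hypothetical (see hmtl_senteval.py for the actual wiring):

import numpy as np
import senteval

def hmtl_encode(sentence):
    # Hypothetical placeholder: run the tokenized sentence through a chosen
    # HMTL layer and return a fixed-size sentence embedding.
    return np.zeros(1024)

def prepare(params, samples):
    # Called once with the full dataset; nothing to precompute in this sketch.
    return

def batcher(params, batch):
    # batch is a list of tokenized sentences; return one embedding per sentence.
    return np.vstack([hmtl_encode(sentence) for sentence in batch])

params = {"task_path": "path/to/senteval/data", "usepytorch": True, "kfold": 10,
          "classifier": {"nhid": 0, "optim": "adam", "batch_size": 64,
                         "tenacity": 5, "epoch_size": 4}}
se = senteval.engine.SE(params, batcher, prepare)
results = se.eval(["Length", "WordContent", "Depth"])  # SentEval probing tasks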

Data

To download the pre-trained embeddings we used in HMTL, you can simply launch the script ./script/data_setup.sh.

For licensing reasons, we did not attach the datasets used to train HMTL, but we invite you to collect them yourself: OntoNotes 5.0, CoNLL2003, and ACE2005. The configuration files expect the datasets to be placed in the data/ folder.

References

Please consider citing the following paper if you find this repository useful.

@article{sanh2018hmtl,
  title={A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks},
  author={Sanh, Victor and Wolf, Thomas and Ruder, Sebastian},
  journal={arXiv preprint arXiv:1811.06031},
  year={2018}
}

hmtl's People

Contributors

dasguptar, dependabot[bot], julien-c, narsil, pierrci, victorsanh

hmtl's Issues

AllenNLP library version used in this code

Hi,

I have been running into some Import Errors when I run the HMTL code. Specifically,
from allennlp.nn.util import last_dim_softmax, weighted_sum
ImportError: cannot import name 'last_dim_softmax'

I can't find this function in the AllenNLP docs either, so I am assuming that the AllenNLP version used for this code is not the latest. Could you please let us know which version of AllenNLP was used?
Thanks!
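
For reference, last_dim_softmax was removed in more recent AllenNLP releases (the version pinned in requirements.txt should still work); a minimal sketch of the usual substitution, assuming tensors keep a (batch, ..., num_classes) layout:

import torch
from allennlp.nn.util import masked_softmax, weighted_sum

similarity = torch.randn(2, 3, 4)  # (batch, queries, keys)
mask = torch.ones(2, 3, 4)         # 1 = keep, 0 = padding
values = torch.randn(2, 4, 8)      # (batch, keys, dim)

# masked_softmax normalizes over the last dimension by default,
# which is what last_dim_softmax used to do.
attention = masked_softmax(similarity, mask)
context = weighted_sum(values, attention)  # (batch, queries, dim)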

conll2012 setup issue

Hello, thanks for raising this question.

We used pre-trained word embeddings (GloVe and ELMo). You can use the script scripts/data_setup.sh to download them and place them in a data folder.

Other datasets are also expected to be in the data folder (see the paths in the configuration files configs/*.json).
For instance, we compile the CoNLL2012 coreference data using this script from AllenNLP: https://github.com/allenai/allennlp/blob/master/scripts/compile_coref_data.sh
It compiles the CoNLL2012 data and dumps the coreference annotations into a single file.
For CoNLL NER, it is basically the same data as for coreference, just not dumped into that single file (we can probably do something quick to avoid this data duplication).
Concerning the ACE data, we pre-process them so that the Mention Detection data match a CoNLL-NER format and the Relation Extraction data match a CoNLL-SRL format. Both are saved in a data/ace2005 folder.

If you want to use other datasets, it makes sense to place them in the data folder and use (or modify, if needed) the dataset_readers classes.

Victor
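
For reference, the AllenNLP compilation script mentioned above is typically invoked with the path to the OntoNotes 5.0 release (the path below is just a placeholder):

./compile_coref_data.sh /path/to/ontonotes-release-5.0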

Hi,
I want to reproduce your NER results. However, I ran into a problem when setting up the CoNLL-2012 data.

I used this script https://github.com/allenai/allennlp/blob/master/scripts/compile_coref_data.sh, but it warned that there are no .parse files in the folder.

could not find the gold parse [.//data/files/data/english/annotations/bc/cctv/00/cctv_0001.parse] in the ontonotes distribution ... exiting ...

cat: 'conll-2012/v4/data/development/data/english/annotations/*/*/*/*.v4_gold_conll': No such file or directory
cat: 'conll-2012/v4/data/train/data/english/annotations/*/*/*/*.v4_gold_conll': No such file or directory
cat: 'conll-2012/v4/data/test/data/english/annotations/*/*/*/*.v4_gold_conll': No such file or directory

Originally posted by @djshowtime in #2 (comment)

How to finetune on customized dataset?

Hi, thanks for this beautiful work!

I have a coreference dataset with the following format:
Text, Pronoun, Pronoun-offset, A, A-offset, A-coref, B, B-offset, B-coref

Example:

Tom went out and bought a book. He found it very interesting.

Then, 'Tom' is the Pronoun, A is 'He', A-coref is True, B is 'it', B-coref is False, and the offsets indicate each token's position.

Can I use HMTL to fine-tune on such a dataset?
If so, how should I modify the JSON file and the dataset reader?
If not, how can I use HMTL to predict the coreference pair of the Pronoun in each text, i.e., predict which word in the text corefers with the Pronoun?

I appreciate your help!

Negation detection as a Multi-task learning (MTL) layer

I have been going through this paper - Joint Entity Extraction and Assertion Detection for Clinical Text - which proposes an MTL approach to negation detection that leverages overlapping representations across sub-tasks, i.e., jointly modeling named entities and negation in an end-to-end system.

I already have an NER model in place and was thinking about how I would implement MTL using HMTL, but I find it difficult to mold the examples given in this repo into a negation multi-task setup.

@VictorSanh: I would like to know your take on how to go about it.

A-RS-GM model

Regarding your paper, specifically Tables 4 and 5: what is the meaning of the "RS" in A-RS-GM? It is not referenced anywhere else in the paper.

Thanks, and congrats on the paper!

Sample data

Can I get sample input and output data?
I want to know their format.

alternative to ACE 2005 corpus

Is there any alternative dataset we can use instead of ACE 2005? We are not able to afford the license fee.
I'm thinking of the options below:
SemEval 2010 Task 8
TACRED

Please suggest whether we can use the above datasets for training.

Training time for HMTL

Hi Victor

Thank you for this awesome yet totally under-appreciated project! I would like to use it as a baseline and expand its RE capability. I guess the first step is to try to replicate your results. Could you advise how much training time your results took, and on what kind of hardware configuration?

Thank you for your input!
SH

NER and RE only

Hi,

I want to experiment with this architecture for NER and RE only. What changes do I need to make?

Chinese is not supported?

I tried a Chinese sentence, but it seems Chinese is not supported yet?
My test sentence: "**的首都是北京" ("The capital of ** is Beijing")

Any details to setup the ACE2005 dataset?

Hi, I am confused about the setup of the ACE2005 dataset.

I got the dataset called ace_2005_td_v7_LDC2006T06, and I'm aware of issue #2 about dataset setup. I tried the preprocess.py script you shared in this link:

#2

I set the parameters:

path_to_ace2005 = "ace_2005_td_v7/"
saving_path = "ace2005/"

However, I don't know how to split the output into train/val/test sets.

The script produces a lot of files with ‘.sgm.coref’ and ‘.sgm.like_conll’ extensions under ace2005/.

In addition, the script has been running for more than 12 hours on a decent server. Right now there are 800 files under ace2005/. Is that normal?

Thanks in advance.

RE metric

Hello Victor,

Concerning the RE metric, do you consider a relation correct when the full boundaries of both heads are correctly detected, or only the last token of each head? Do the predicted entity types of both heads need to be correct for the relation to count as correct?

Issue with only train data being used for vocab creation

Hi team,
Thanks for this wonderful repo. The code in the repo is generic and can easily be reused. I wanted to ask about vocabulary creation: in all the models, only training tokens are used.

"datasets_for_vocab_creation": ["train"]

So when we use the multi-task model, we have large token coverage, since the vocab consists of tokens from all the datasets, and there is a high probability that a test token is found in that vocab. In contrast, when using a single-task model, the vocab is smaller and there is a larger chance of a token being OOV (out of vocabulary).
So how do we make sure that the improvements are due to multi-task learning rather than due to the larger vocabulary coverage in the multi-task case?

The other point is that if we only build the vocab from the training data, the model works well only on tokens present in the training data, and we lose the information contained in the word embeddings of tokens that do not appear in the training data.

It would be great to hear your thoughts on it.

Want to reproduce results for my thesis on another language

Hello,

I am having trouble finding the datasets since they are not free. I am trying to reproduce your results on different languages for my thesis. Can you at least provide a sample of the datasets you used, so that I can format my own datasets accordingly? Thank you.

Kind Regards
