amazon-science / tanl
Structured Prediction as Translation between Augmented Natural Languages
License: Apache License 2.0
datasets.py expects .json files for the CoNLL2012 dataset. However, after searching online, I cannot find any preprocessing tools that yield .json files for CoNLL2012.
Would the authors be able to provide a way to preprocess the CoNLL2012 dataset so that it can be used for training?
Thanks,
Since you mentioned "For other datasets, we provide sample processing code which does not necessarily match the format of publicly available versions (we do not plan to adapt the code to load datasets in other formats)", I'd like to know how I can reproduce the results on the other datasets in the paper.
Hi,
could you kindly share the ACE2005 dataset for event extraction (the event trigger and event argument datasets), or provide guidance on how I might obtain access to it?
Thanks!
Hi, there seems to be a small bug in the augment_sentence function in utils.py. When the root of the entity tree is an entity with tags, those tags are not added to the output. For example, when I run the code below:
from utils import augment_sentence

# first example (overwritten by the CoNLL03 example below)
tokens = ['Tolkien', 'was', 'born', 'here']
augmentations = [
    ([('person',), ('born in', 'here')], 0, 1),
    ([('location',)], 3, 4),
]

# example from the test set of CoNLL03 NER
tokens = ['Premier', 'league']
augmentations = [([('miscellaneous',)], 0, 2)]

begin_entity_token = "["
sep_token = "|"
relation_sep_token = "="
end_entity_token = "]"

augmented_output = augment_sentence(tokens, augmentations, begin_entity_token,
                                    sep_token, relation_sep_token, end_entity_token)
print(augmented_output)
It prints out Premier league instead of [ Premier league | miscellaneous ]. This happens because in line 124 of utils.py, the value of the root of the entity tree is reset to an empty list. My quick fix is to initialize the start index of the root as -1, i.e. changing line 103 in utils.py to
root = (None, -1, len(tokens)) # this node represents the entire sentence
It would be great if someone could let me know if I am correct on this. Thanks!
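For reference, the output format the failing example should produce can be illustrated with a minimal standalone sketch. This is not the repository's augment_sentence; render_flat_span is a hypothetical helper that only shows the expected shape of a single flat entity span:

```python
# Minimal sketch of the augmented-language format for one flat entity span;
# render_flat_span is a hypothetical helper, not TANL code.
def render_flat_span(tokens, tags, start, end,
                     begin_entity_token='[', sep_token='|',
                     end_entity_token=']'):
    span = ' '.join(tokens[start:end])
    # join the surface span and its tags with the separator token
    inner = f' {sep_token} '.join([span] + list(tags))
    return f'{begin_entity_token} {inner} {end_entity_token}'

print(render_flat_span(['Premier', 'league'], ['miscellaneous'], 0, 2))
# [ Premier league | miscellaneous ]
```

This matches the output the bug report expects for the CoNLL03 example, where the entity span covers the entire sentence and therefore becomes the root of the entity tree.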
Hi!
Thanks a lot for sharing the code! I'm trying to reproduce the low-resource experiments on the CoNLL04 dataset. Could you please provide the training batch size (per GPU) and the number of GPUs used? A full training config would also be much appreciated!
Hi,
Thank you for sharing the code! I'm trying to reproduce the results on FewRel 1.0, and I'm wondering how many episodes and how many queries are used in the 1-shot and 5-shot cases, respectively.
Thanks.
Hi Giovanni,
Nice work, and thanks for sharing. I am reproducing the results of the DST task. However, I found that the processed data format of the MultiWOZ 2.1 dataset produced by the script from https://github.com/jasonwu0731/trade-dst does not match your code. May I ask whether you apply additional preprocessing? If so, would you mind sharing the script?
Sincerely,
Yan
Hi,
I've followed the instructions in section A.5 of the paper using this GitHub repo: https://github.com/nlpcl-lab/ace2005-preprocessing/tree/96c8fd4b5a8c87dd6a265d5c14f4d8b8eb9b7fbe
which gives me train/dev/test.json files for ACE2005.
However, inside tanl/datasets.py, https://github.com/amazon-research/tanl/blob/2bd8052f0ff6df3b8fd04d7da1469d73f8639099/datasets.py#L1165, I cannot find a way to run ACE2005. I am currently receiving the following error when attempting to train with ace2005:
FileNotFoundError: [Errno 2] No such file or directory: 'data/ace2005event/ace2005event_types.json'
Does anyone have advice on how to obtain the necessary files, beyond the train/dev/test.json files, to train ACE2005 event extraction?
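As a quick triage step, one can check which of the expected data files are actually present. This is a sketch only: the file list is inferred from the error message and this thread, and the data directory may differ in your setup:

```python
# List the data files the error message and this thread suggest are needed;
# the exact file list and directory are assumptions, adjust as needed.
from pathlib import Path

data_dir = Path('data/ace2005event')
expected = ['train.json', 'dev.json', 'test.json', 'ace2005event_types.json']
missing = [name for name in expected if not (data_dir / name).exists()]
print('missing files:', missing)
```

If ace2005event_types.json is the only file missing, the remaining question is how that types file is produced, since the preprocessing repo above only emits the train/dev/test splits.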
Thanks,
Hi! I'm wondering how to prepare the data files for the FewRel dataset.
Do we use the full train_wiki.json from https://github.com/thunlp/FewRel/tree/master/data as the training split for meta-training, and the full val_wiki.json for evaluation (support & query)? I'm confused because I noticed that the fewrel_meta config also specifies do_eval=True. In that case, which dev split would the code use?
Would appreciate any guidance on this!
Hello!
I would like to ask how you automatically save the tokenizer of the newly fine-tuned models.
Hi,
Thanks for sharing the code. I tried to reproduce the results on TACRED; however, the F1 score on the test set is only 67.67.
The config I used is listed below.
[tacred]
datasets = tacred
multitask = False
model_name_or_path = t5-base
num_train_epochs = 10
max_seq_length = 256
train_split = train
per_device_train_batch_size = 16
do_train = True
do_eval = True
do_predict = True
I run the code with
CUDA_VISIBLE_DEVICES=0,1 nohup python3 -m torch.distributed.launch --nproc_per_node=2 run.py tacred > result.log 2>&1 &
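One thing worth double-checking with this config and launch command is the effective batch size. The arithmetic below is a sketch, not a statement about the paper's setting (which this thread does not give), and gradient_accumulation_steps = 1 is an assumed default:

```python
# Effective batch size implied by the config and launch command above;
# gradient_accumulation_steps = 1 is an assumed default, not from the config.
per_device_train_batch_size = 16   # from the [tacred] config
num_processes = 2                  # --nproc_per_node=2
gradient_accumulation_steps = 1    # assumed default
effective_batch_size = (per_device_train_batch_size * num_processes
                        * gradient_accumulation_steps)
print(effective_batch_size)  # 32
```

If the paper used a different effective batch size or learning-rate schedule, that alone could account for a gap of a few F1 points.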
May I ask which part went wrong? Thank you.
Regards,
Yiming
Hi,
Would you mind leaving some instructions on where you found, and how you preprocessed, the ATIS and SNIPS datasets?
I found some .tsv files here for train/dev/test, but the format is not exactly what tanl/datasets.py expects.
Thanks,