amazon-science / tanl
Structured Prediction as Translation between Augmented Natural Languages
License: Apache License 2.0
datasets.py expects .json files for the CoNLL2012 dataset. However, after searching online, I cannot find any preprocessing tools that yield .json files for CoNLL2012.
Would the authors be able to provide a way to preprocess the CoNLL2012 dataset so that it can be used for training?
Thanks,
Since you mentioned "For other datasets, we provide sample processing code which does not necessarily match the format of publicly available versions (we do not plan to adapt the code to load datasets in other formats)", I'd like to know how I can reproduce the results on the other datasets in the paper.
Hi,
could you kindly share the ACE2005 dataset for event extraction (the event trigger and event argument datasets), or provide guidance on how I might obtain access to it?
Thanks!
Hi, there seems to be a small bug in the augment_sentence function in utils.py. When the root of the entity tree is an entity with tags, those tags are not added to the output. For example, when I run the code below:
from utils import augment_sentence

# first example (overwritten by the CoNLL03 example below)
tokens = ['Tolkien', 'was', 'born', 'here']
augmentations = [
    ([('person',), ('born in', 'here')], 0, 1),
    ([('location',)], 3, 4),
]

# example from the test set of CoNLL03 NER
tokens = ['Premier', 'league']
augmentations = [([('miscellaneous',)], 0, 2)]

begin_entity_token = "["
sep_token = "|"
relation_sep_token = "="
end_entity_token = "]"

augmented_output = augment_sentence(tokens, augmentations, begin_entity_token,
                                    sep_token, relation_sep_token, end_entity_token)
print(augmented_output)
It prints out Premier league instead of [ Premier league | miscellaneous ]. This happens because in line 124 of utils.py, the value of the root of the entity tree is reset to an empty list. My quick fix is to initialize the start index of the root as -1, i.e. changing line 103 in utils.py to
root = (None, -1, len(tokens)) # this node represents the entire sentence
It would be great if someone could let me know if I am correct on this. Thanks!
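For reference, the output format the failing example should produce can be illustrated with a minimal standalone sketch. This is not the repository's augment_sentence; render_flat_span is a hypothetical helper that only shows the expected shape of a single flat entity span:

```python
# Minimal sketch of the augmented-language format for one flat entity span;
# render_flat_span is a hypothetical helper, not TANL code.
def render_flat_span(tokens, tags, start, end,
                     begin_entity_token='[', sep_token='|',
                     end_entity_token=']'):
    span = ' '.join(tokens[start:end])
    # join the surface span and its tags with the separator token
    inner = f' {sep_token} '.join([span] + list(tags))
    return f'{begin_entity_token} {inner} {end_entity_token}'

print(render_flat_span(['Premier', 'league'], ['miscellaneous'], 0, 2))
# [ Premier league | miscellaneous ]
```

This matches the output the bug report expects for the CoNLL03 example, where the entity span covers the entire sentence and therefore becomes the root of the entity tree.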
Hi!
Thanks a lot for sharing the code! I'm trying to reproduce the low-resource experiments on the CoNLL04 dataset. Could you please provide the training batch size (per GPU) and the number of GPUs used? A full training config would also be much appreciated!
Hi,
Thank you for sharing the code! I'm trying to reproduce the results on FewRel 1.0, and I'm wondering how many episodes and how many queries are used in the 1-shot and 5-shot cases, respectively.
Thanks.
Hi Giovanni,
Nice work, and thanks for sharing. I am reproducing the results of the DST task. However, I found that the processed data format of the MultiWOZ 2.1 dataset produced by the script from https://github.com/jasonwu0731/trade-dst does not match your code. May I ask whether you apply additional preprocessing? If so, would you mind sharing the script?
Sincerely,
Yan
Hi,
I've followed the instructions in section A.5 of the paper using this GitHub repo: https://github.com/nlpcl-lab/ace2005-preprocessing/tree/96c8fd4b5a8c87dd6a265d5c14f4d8b8eb9b7fbe
which gives me train/dev/test.json files for ACE2005.
However, inside tanl/datasets.py, https://github.com/amazon-research/tanl/blob/2bd8052f0ff6df3b8fd04d7da1469d73f8639099/datasets.py#L1165, I cannot find a way to run ACE2005. I am currently receiving the following error when attempting to train with ace2005:
FileNotFoundError: [Errno 2] No such file or directory: 'data/ace2005event/ace2005event_types.json'
Does anyone have advice on how to obtain the necessary files, beyond the train/dev/test.json files, to train ACE2005 event extraction?
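As a quick triage step, one can check which of the expected data files are actually present. This is a sketch only: the file list is inferred from the error message and this thread, and the data directory may differ in your setup:

```python
# List the data files the error message and this thread suggest are needed;
# the exact file list and directory are assumptions, adjust as needed.
from pathlib import Path

data_dir = Path('data/ace2005event')
expected = ['train.json', 'dev.json', 'test.json', 'ace2005event_types.json']
missing = [name for name in expected if not (data_dir / name).exists()]
print('missing files:', missing)
```

If ace2005event_types.json is the only file missing, the remaining question is how that types file is produced, since the preprocessing repo above only emits the train/dev/test splits.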
Thanks,
Hi! I'm wondering how to prepare the data files for the FewRel dataset.
Do we use the full train_wiki.json from https://github.com/thunlp/FewRel/tree/master/data as the training split for meta-training, and the full val_wiki.json for evaluation (support & query)? I'm confused because I noticed that the fewrel_meta config also specifies do_eval=True. In that case, which dev split would the code use?
Would appreciate any guidance on this!
Hello!
I would like to ask how you automatically save the tokenizer of the newly fine-tuned models.
Hi,
Thanks for sharing the code. I tried to reproduce the results on TACRED; however, the F1 score on the test set is only 67.67.
The config I used is listed below.
[tacred]
datasets = tacred
multitask = False
model_name_or_path = t5-base
num_train_epochs = 10
max_seq_length = 256
train_split = train
per_device_train_batch_size = 16
do_train = True
do_eval = True
do_predict = True
I run the code with
CUDA_VISIBLE_DEVICES=0,1 nohup python3 -m torch.distributed.launch --nproc_per_node=2 run.py tacred > result.log 2>&1 &
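One thing worth double-checking with this config and launch command is the effective batch size. The arithmetic below is a sketch, not a statement about the paper's setting (which this thread does not give), and gradient_accumulation_steps = 1 is an assumed default:

```python
# Effective batch size implied by the config and launch command above;
# gradient_accumulation_steps = 1 is an assumed default, not from the config.
per_device_train_batch_size = 16   # from the [tacred] config
num_processes = 2                  # --nproc_per_node=2
gradient_accumulation_steps = 1    # assumed default
effective_batch_size = (per_device_train_batch_size * num_processes
                        * gradient_accumulation_steps)
print(effective_batch_size)  # 32
```

If the paper used a different effective batch size or learning-rate schedule, that alone could account for a gap of a few F1 points.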
May I ask which part went wrong? Thank you.
Regards,
Yiming
Hi,
Would you mind leaving some instructions on where you found, and how you preprocessed, the ATIS and SNIPS datasets?
I found some .tsv files here for train/dev/test, but the format is not exactly what tanl/datasets.py expects.
Thanks,