megagonlabs / ditto
Code for the paper "Deep Entity Matching with Pre-trained Language Models"
License: Apache License 2.0
When I run the code, I get three F1 values at each epoch. Which one should be reported as the final F1 score, following the paper?
Here is an example of the output: epoch 5: dev_f1=0.8317046688382194, f1=0.818146568437379, best_f1=0.8185719859539602
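One common convention, sketched below, is to report the test F1 from the epoch with the highest validation F1. The sketch assumes that dev_f1 is the validation score and f1 is the test score for the same epoch; that reading is inferred from the field names and from how best_f1 moves in the logs in this thread, not confirmed by the authors. The function name select_test_f1 is hypothetical.

```python
import re

# Matches lines such as:
# epoch 9: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129
LOG_LINE = re.compile(
    r"epoch (\d+): dev_f1=([\d.]+), f1=([\d.]+), best_f1=([\d.]+)"
)


def select_test_f1(log_lines):
    """Return (epoch, dev_f1, test_f1) for the epoch with the highest dev_f1.

    Ties keep the earliest epoch, because only a strictly higher dev_f1
    replaces the current best.
    """
    best = None
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip "step: ..., loss: ..." lines
        epoch = int(match.group(1))
        dev_f1 = float(match.group(2))
        test_f1 = float(match.group(3))
        if best is None or dev_f1 > best[1]:
            best = (epoch, dev_f1, test_f1)
    return best


log = [
    "step: 0, loss: 0.27009156346321106",
    "epoch 8: dev_f1=0.8387096774193549, f1=0.9333333333333333, best_f1=0.9333333333333333",
    "epoch 9: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129",
    "epoch 11: dev_f1=0.896551724137931, f1=0.9032258064516129, best_f1=0.9032258064516129",
    "epoch 14: dev_f1=0.896551724137931, f1=0.8666666666666666, best_f1=0.9032258064516129",
]
print(select_test_f1(log))
```

Under this assumption, the value selected this way agrees with the best_f1 column of the last log line, even though epoch 8 had a higher test F1.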
As the title asks, I am wondering whether you also added special tokens like [COL] and [VAL], and the attribute names, to the BERT vocabulary.
Hello,
I ran the code on Windows without a GPU, with --fp16 turned off:
python.exe .\train_ditto.py '--task' 'Structured/Beer' '--batch_size' '32' '--max_len' '256' '--lr' '3e-5' '--n_epochs' '40' '--lm' 'roberta' '--da' 'del' '--dk' 'product' '--summarize'
RoBERTa is from https://huggingface.co/roberta-base
The paper says "We use the base uncased variant of each model in all our experiments". RoBERTa is case-sensitive.
So which uncased variant of RoBERTa was used?
Also which uncased variant of XLNet was used?
The paper reports 94.34%, but the best I got is:
epoch 14: dev_f1=0.896551724137931, f1=0.8666666666666666, best_f1=0.9032258064516129
Any suggestions on what may have caused this low performance?
Thank you.
Here is the output
step: 0, loss: 0.5871710777282715
epoch 1: dev_f1=0.37931034482758624, f1=0.36666666666666664, best_f1=0.36666666666666664
step: 0, loss: 0.2969485819339752
epoch 2: dev_f1=0.2745098039215686, f1=0.2692307692307693, best_f1=0.36666666666666664
step: 0, loss: 0.2463674694299698
epoch 3: dev_f1=0.32558139534883723, f1=0.32499999999999996, best_f1=0.36666666666666664
step: 0, loss: 0.5062930583953857
epoch 4: dev_f1=0.32558139534883723, f1=0.32499999999999996, best_f1=0.36666666666666664
step: 0, loss: 0.2536587119102478
epoch 5: dev_f1=0.4117647058823529, f1=0.36923076923076925, best_f1=0.36923076923076925
step: 0, loss: 0.3347562551498413
epoch 6: dev_f1=0.6923076923076924, f1=0.6470588235294117, best_f1=0.6470588235294117
step: 0, loss: 0.3830795884132385
epoch 7: dev_f1=0.8275862068965518, f1=0.6666666666666665, best_f1=0.6666666666666665
step: 0, loss: 0.27009156346321106
epoch 8: dev_f1=0.8387096774193549, f1=0.9333333333333333, best_f1=0.9333333333333333
step: 0, loss: 0.13321542739868164
epoch 9: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.024025270715355873
epoch 10: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.0391874834895134
epoch 11: dev_f1=0.896551724137931, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.00302126444876194
epoch 12: dev_f1=0.8387096774193549, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.06331554800271988
epoch 13: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.026920529082417488
epoch 14: dev_f1=0.896551724137931, f1=0.8666666666666666, best_f1=0.9032258064516129
step: 0, loss: 0.023745562881231308
epoch 15: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.012241823598742485
epoch 16: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.0017187324119731784
epoch 17: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.0006802910938858986
epoch 18: dev_f1=0.8484848484848484, f1=0.8484848484848484, best_f1=0.9032258064516129
step: 0, loss: 0.0009096315479837358
epoch 19: dev_f1=0.8387096774193549, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.0005167351919226348
epoch 20: dev_f1=0.8387096774193549, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.0003216741606593132
epoch 21: dev_f1=0.8387096774193549, f1=0.9032258064516129, best_f1=0.9032258064516129
I was trying to run the training code on a CPU with the following hyperparameters:
python train_ditto.py
--task Structured/Beer
--batch_size 64
--max_len 64
--lr 3e-5
--n_epochs 5
--finetuning
--lm distilbert
--da del
--dk product
--save_model
--summarize
It seems that the dev_f1 score is zero, the accuracy is stuck at 0.84, and the epoch counter never increases, so training keeps looping because the while loop in mixda runs while epoch <= hp.n_epochs.
Am I missing something, or is it stuck in an infinite loop?
I had an issue running the code after installing the required transformers==2.8 and sentencepiece==0.1.85. I got the error:
Traceback (most recent call last):
File "train_ditto.py", line 90, in <module>
from snippext.mixda import initialize_and_train
File "Snippext_public/snippext/mixda.py", line 13, in <module>
from .model import MultiTaskNet
File "Snippext_public/snippext/model.py", line 3, in <module>
from transformers import BertModel, AlbertModel, DistilBertModel, RobertaModel, XLNetModel, LongformerModel
ImportError: cannot import name 'LongformerModel' from 'transformers' (/home/youcef/.conda/envs/py37/lib/python3.7/site-packages/transformers/__init__.py)
I fixed it by installing transformers==3.1 instead, which includes LongformerModel.
I recommend updating the requirements accordingly.
Line 11 in 6dcf10a
wget https://ditto-em.s3.us-east-2.amazonaws.com/Company.zip
Can I get the file from another link? Thanks!
Can we add the --save_model flag to the "train the matching model" example? This would show users how to produce the pt checkpoints needed to run the matching models in the second example.
An alternative is to add a link to the "Running the matcher" Google Colab example, so users know how to obtain the pt files to run the matcher.
I ran the suggested example from the README to try to reproduce the results shown in the paper:
CUDA_VISIBLE_DEVICES=0 python train_ditto.py \
--task Structured/Beer \
--batch_size 64 \
--max_len 64 \
--lr 3e-5 \
--n_epochs 40 \
--finetuning \
--lm distilbert \
--fp16 \
--da del \
--dk product \
--summarize
If I'm right, based on the paper I should get an F1 score of around 94.7.
You can also see the training logs here
What am I doing wrong?
When an attribute name also appears elsewhere in the token sequence, it may be removed by the summarization component.
For example, in line 719 of data/er_magellan/Structured/Amazon-Google/test.txt.su, the manufacturer between [COL] and [VAL] is removed because manufacturer also appears in the title.
Since a value without an attribute name seems a bit unnatural, I'm not sure whether this is a bug.
Hi,
In your README it says that the --summarize flag needs to be specified for matcher.py if it was also specified at training time.
When I do so I get the following error:
Defaults for this optimization level are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O2
cast_model_type : torch.float16
patch_torch_functions : False
keep_batchnorm_fp32 : True
master_weights : True
loss_scale : dynamic
0it [00:00, ?it/s]
Traceback (most recent call last):
File "matcher.py", line 242, in <module>
dk_injector=dk_injector)
File "matcher.py", line 149, in predict
pairs.append((to_str(row[0], summarizer, max_len, dk_injector),
File "matcher.py", line 49, in to_str
content = summarizer.transform(content, max_len=max_len)
File "/content/drive/My Drive/Master Thesis/ditto/repo/ditto/ditto/summarize.py", line 75, in transform
sentA, sentB, label = row.strip().split('\t')
ValueError: not enough values to unpack (expected 3, got 1)
Without the --summarize flag matcher.py is running fine.
Is there any workaround to use matcher.py with summarization?
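The failing line in the traceback, sentA, sentB, label = row.strip().split('\t'), needs exactly three tab-separated fields. The "expected 3, got 1" message suggests that the string passed in had no tabs at all, for example a single serialized entry rather than a labeled pair; that interpretation is an assumption based only on the traceback. A minimal sketch of a tolerant split, with hypothetical helper names that are not part of the repository:

```python
def split_pair_line(line):
    """Split 'left<TAB>right<TAB>label' into a 3-tuple, or return None.

    Returning None instead of raising makes it easy to spot inputs that
    do not have the three-field layout the traceback expects.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None
    return tuple(fields)


def find_malformed(lines):
    """Return 1-based numbers of lines without exactly three fields."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if split_pair_line(line) is None
    ]


lines = [
    "COL title VAL item a\tCOL title VAL item b\t1",
    "COL title VAL a single entry without tabs",
]
print(find_malformed(lines))
```

A check like this only locates the mismatch between the input and the expected layout; it does not change how matcher.py calls the summarizer.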
!CUDA_VISIBLE_DEVICES=0 python train_ditto.py \
--task Textual/Company \
--batch_size 32 \
--max_len 128 \
--lr 3e-5 \
--n_epochs 20 \
--finetuning \
--lm roberta \
--fp16 \
--da drop_col
step: 0, loss: 0.609293520450592
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Traceback (most recent call last):
File "train_ditto.py", line 92, in <module>
run_tag, hp)
File "/home/ec2-user/SageMaker/vendor_matching/ditto/ditto_light/ditto.py", line 201, in train
train_step(train_iter, model, optimizer, scheduler, hp)
File "/home/ec2-user/SageMaker/vendor_matching/ditto/ditto_light/ditto.py", line 123, in train_step
for i, batch in enumerate(train_iter):
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/ec2-user/SageMaker/vendor_matching/ditto/ditto_light/dataset.py", line 80, in __getitem__
left, right = combined.split(' [SEP] ')
ValueError: not enough values to unpack (expected 2, got 1)
How many changes are required to support the Spanish language?
Hey guys !
I had fun reading the paper and thanks for open-sourcing the model.
In the paper, you guys mention that [COL] and [VAL] are special tokens for indicating the start of attribute names and values respectively.
That would mean [COL] and [VAL] are special tokens to be added to the tokenizer. In the repo (https://github.com/megagonlabs/ditto/blob/master/ditto_light/dataset.py#L12), you guys are not adding them as special tokens to the vocabulary of the pre-trained tokenizer.
Any reason why?
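For readers following the question, the serialization scheme being discussed can be sketched as below. This is only an illustration of the [COL] attribute [VAL] value layout described above, with hypothetical function names; it is not the repository's code, and it does not show whether or how the tokens are registered with a tokenizer.

```python
def serialize(record):
    """Serialize a dict of attribute -> value as '[COL] attr [VAL] value ...'."""
    parts = []
    for attr, value in record.items():
        parts.append(f"[COL] {attr} [VAL] {value}")
    return " ".join(parts)


def serialize_pair(left, right, sep="[SEP]"):
    """Join two serialized records with a separator token."""
    return f"{serialize(left)} {sep} {serialize(right)}"


left = {"title": "instant immersion spanish", "price": "36.11"}
right = {"title": "instant immers spanish 2", "price": "17.1"}
print(serialize_pair(left, right))
```

If [COL] and [VAL] are not registered as special tokens, a subword tokenizer will typically split them into several pieces, which is what the question above is getting at.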
Hello,
I am trying to run the training code but I run into this error:
2020-11-02 07:36:08.658676: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Downloading: 100% 232k/232k [00:00<00:00, 318kB/s]
Downloading: 100% 442/442 [00:00<00:00, 284kB/s]
Downloading: 100% 268M/268M [00:04<00:00, 62.7MB/s]
Traceback (most recent call last):
File "train_ditto.py", line 103, in <module>
run_tag)
File "Snippext_public/snippext/mixda.py", line 253, in initialize_and_train
alpha_aug=hp.alpha_aug)
File "Snippext_public/snippext/mixda.py", line 152, in train
with amp.scale_loss(loss, optimizer) as scaled_loss:
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.6/dist-packages/apex/amp/handle.py", line 82, in scale_loss
raise RuntimeError("Invoked 'with amp.scale_loss`, but internal Amp state has not been initialized. "
RuntimeError: Invoked 'with amp.scale_loss`, but internal Amp state has not been initialized. model, optimizer = amp.initialize(model, optimizer, opt_level=...) must be called before `with amp.scale_loss`.
Please, can you guide me to solve this issue?
Can we get the embedding of an entity? We need to develop a second model that computes the "similarity" of entities, and embeddings would be very helpful.
Hi ditto creators,
I am currently working on a short Masters project, and I discovered your library. We are performing de-duplication on the Cora citations dataset using py_entitymatching and deepmatcher. I was keen to evaluate your package as well, but the documentation seems to suggest I can only interact with the functions via the command line.
Speaking as someone who does not have a lot of software engineering experience, is it possible to use your package in a notebook environment? I am currently using Google Colab for the previously-mentioned packages.
Thanks in advance.
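Other reports in this thread launch the scripts from notebook cells with a leading "!", which suggests the command-line interface can be driven from Colab. As a rough sketch, the same command can also be assembled and launched from Python with the standard subprocess module. The helper names below are hypothetical; the flags are the ones that appear in the commands quoted in this thread, and the script is assumed to be in the current working directory.

```python
import subprocess
import sys


def build_train_command(task, lm="distilbert", flags=(), **options):
    """Build the argument list for train_ditto.py.

    options become '--name value' pairs; flags become bare '--flag' switches.
    """
    cmd = [sys.executable, "train_ditto.py", "--task", task, "--lm", lm]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    cmd += [f"--{flag}" for flag in flags]
    return cmd


def run(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


cmd = build_train_command(
    "Structured/Beer",
    batch_size=64,
    max_len=64,
    lr="3e-5",
    n_epochs=40,
    flags=("finetuning", "save_model"),
)
print(" ".join(cmd))
# From the repository root in a notebook cell, one could then call run(cmd).
```

Passing the arguments as a list avoids shell quoting problems, which matters for task names or paths that contain spaces.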
How do I run inference on unseen data after training?
!CUDA_VISIBLE_DEVICES=0 python train_ditto.py
--task Structured/DBLP-ACM
--batch_size 32
--max_len 128
--lr 3e-5
--n_epochs 20
--finetuning
--lm roberta
--fp16
--da drop_col
Traceback (most recent call last):
File "train_ditto.py", line 13, in <module>
from ditto_light.knowledge import *
File "/home/ec2-user/SageMaker/vendor_matching/ditto/ditto_light/knowledge.py", line 5, in <module>
import spacy
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/spacy/__init__.py", line 14, in <module>
from .cli.info import info # noqa: F401
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/spacy/cli/__init__.py", line 3, in <module>
from ._util import app, setup_cli # noqa: F401
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/spacy/cli/_util.py", line 8, in <module>
import typer
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/typer/__init__.py", line 31, in <module>
from .main import Typer, run
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/typer/main.py", line 11, in <module>
from .completion import get_completion_inspect_parameters
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/typer/completion.py", line 10, in <module>
import click._bashcomplete
ModuleNotFoundError: No module named 'click._bashcomplete'
Hi,
I fine-tuned this model and ran into a curious issue when running inference on two similar strings, for example:
[["PROTEXASE THERAPEUTICS", "PROTEXASE THERAPEUTICS, INC."]]
I ran it 10 times and got 10 different prediction scores:
[[0.6435145641728891, 0.35648543582711095],
[0.44686401791372865, 0.5531359820862713],
[0.4768868842385305, 0.5231131157614696],
[0.5165862730694053, 0.48341372693059464],
[0.45327667255908927, 0.5467233274409107],
[0.5023805971114581, 0.497619402888542],
[0.7547820757051401, 0.24521792429486],
[0.02058591123972741, 0.9794140887602728],
[0.7257732298308167, 0.27422677016918334],
[0.30100721825239946, 0.6989927817476006]]
These results are quite unstable.
I checked from snippext.model import MultiTaskNet and changed dropout=0.1 to 0, but the results are still random. In any case, the dropout layer should not be active during inference.
I find this hard to believe, so I might be missing something important. What could the problem be?
I'm testing ditto to match my own dataset so I'm running the following:
CUDA_VISIBLE_DEVICES=0 python train_ditto.py --task catalogo --batch_size 64 --max_len 64 --lr 3e-5 --n_epochs 10 --finetuning --lm distilbert --fp16
The metrics in the training output are really good, around accuracy=0.990.
As a sanity check and for error analysis, I ran predictions over test.txt (in JSON Lines format):
CUDA_VISIBLE_DEVICES=0 python matcher.py --task catalogo --input_path input/to_be_evaluate.jsonl --output_path output/output_catalogo.jsonl --lm distilbert --use_gpu --fp16 --checkpoint_path checkpoints/
My dataset is balanced (around 50% positives and 50% negatives), so based on the 0.99 accuracy I expect almost the same number of positives and negatives in the output file. I ran the following commands:
$ ditto# cat output/output_catalogo.jsonl | grep '"match": "1"' | wc -l
139
$ ditto# cat output/output_catalogo.jsonl | grep '"match": "0"' | wc -l
5149
$ ditto# cat data/cboujon/test.txt | grep -P "\t0" | wc -l
2675
$ ditto# cat data/cboujon/test.txt | grep -P "\t1" | wc -l
2613
These numbers suggest that either the accuracy is not 0.990 or I am making a mistake I can't see.
Here are the datasets and outputs
The task config is defined as:
{
"name": "catalogo",
"task_type": "classification",
"vocab": ["0", "1"],
"trainset": "data/cboujon/train.txt",
"validset": "data/cboujon/valid.txt",
"testset": "data/cboujon/test.txt"
}
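The grep-based check above only compares totals. A more direct check, sketched below, compares each prediction with the gold label on the corresponding test line. It assumes that every output line is a JSON object with a "match" field (as in the grep patterns above), that test.txt keeps the label as the last tab-separated field, and that the two files are aligned line by line; none of these is confirmed by the repository. All function names are hypothetical.

```python
import json
from collections import Counter


def predicted_labels(jsonl_lines, key="match"):
    """Read the predicted label from each non-empty JSON Lines row."""
    return [str(json.loads(line)[key]) for line in jsonl_lines if line.strip()]


def gold_labels(pair_lines):
    """Read the label (last tab-separated field) from each non-empty line."""
    return [
        line.rstrip("\n").split("\t")[-1]
        for line in pair_lines
        if line.strip()
    ]


def accuracy(predicted, gold):
    """Fraction of positions where the prediction equals the gold label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predictions and gold labels must align and be non-empty")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


output_lines = ['{"match": "1"}', '{"match": "0"}', '{"match": "0"}']
test_lines = ["left a\tright a\t1", "left b\tright b\t1", "left c\tright c\t0"]

preds = predicted_labels(output_lines)
gold = gold_labels(test_lines)
print(Counter(preds), Counter(gold), accuracy(preds, gold))
```

A large gap between this accuracy and the one printed during training would point to a difference between the training-time and inference-time pipelines, such as different flags or preprocessing.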
Hi,
In this line, the evaluation computes F1 values for thresholds generated with start 0.0, stop 1.0, and step 0.05:
for th in np.arange(0.0, 1.0, 0.05):
I found that the F1 value at the initial best_th=0.5 is not computed first, so the evaluation assigns a new F1 value without comparing it against the F1 value at best_th=0.5.
Cheers,
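One way to make the comparison explicit is to score the default threshold first and only replace it when a swept threshold is strictly better. The sketch below is in plain Python and is not the repository's evaluation code; the function names, the "probability > threshold" decision rule, and the integer-based threshold grid are all assumptions made for illustration.

```python
def f1_score(labels, preds):
    """Binary F1 for 0/1 labels and predictions (0.0 when undefined)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_at(probs, labels, th):
    """F1 when predicting 1 for every probability strictly above th."""
    preds = [1 if p > th else 0 for p in probs]
    return f1_score(labels, preds)


def tune_threshold(probs, labels, default_th=0.5, step=0.05):
    """Sweep thresholds in [0, 1) but start from the default as baseline."""
    best_th = default_th
    best_f1 = f1_at(probs, labels, default_th)
    n_steps = int(round(1.0 / step))
    for i in range(n_steps):
        th = i * step  # 0.0, 0.05, ..., 0.95
        f1 = f1_at(probs, labels, th)
        if f1 > best_f1:  # the default wins ties
            best_th, best_f1 = th, f1
    return best_th, best_f1


probs = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(tune_threshold(probs, labels))
```

With this structure the returned F1 can never be lower than the F1 at the default threshold, whatever grid is swept.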