megagonlabs / ditto

Code for the paper "Deep Entity Matching with Pre-trained Language Models"

License: Apache License 2.0

Python 100.00%

ditto's Issues

Which F1 should we report?

When I run the code, I get three F1 values for each epoch. Which one should we report as the final F1 score, following the paper?
Here is an example of the output: epoch 5: dev_f1=0.8317046688382194, f1=0.818146568437379, best_f1=0.8185719859539602
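
One way to pick the number to report is sketched below. It assumes (the thread does not confirm this) that dev_f1 is the validation F1, f1 is the test F1 of the same epoch, and best_f1 is the test F1 at the epoch with the highest dev_f1 so far. The helper names parse_epochs and f1_at_best_dev are hypothetical and not part of the ditto code base.

```python
import re

# Matches log lines such as:
#   epoch 5: dev_f1=0.83, f1=0.81, best_f1=0.82
LOG_LINE = re.compile(
    r"epoch (\d+): dev_f1=([0-9.eE+-]+), f1=([0-9.eE+-]+), best_f1=([0-9.eE+-]+)"
)


def parse_epochs(log_text):
    """Return a list of (epoch, dev_f1, test_f1, best_f1) tuples."""
    rows = []
    for m in LOG_LINE.finditer(log_text):
        rows.append(
            (int(m.group(1)), float(m.group(2)), float(m.group(3)), float(m.group(4)))
        )
    return rows


def f1_at_best_dev(rows):
    """Return (epoch, test_f1) for the first epoch reaching the highest dev F1."""
    best_dev, selected = -1.0, None
    for epoch, dev_f1, test_f1, _ in rows:
        if dev_f1 > best_dev:  # strict comparison: ties keep the earlier epoch
            best_dev, selected = dev_f1, (epoch, test_f1)
    return selected


# Illustrative values only, not taken from a real run.
sample = (
    "epoch 1: dev_f1=0.70, f1=0.65, best_f1=0.65\n"
    "epoch 2: dev_f1=0.80, f1=0.90, best_f1=0.90\n"
    "epoch 3: dev_f1=0.80, f1=0.95, best_f1=0.90\n"
)
print(f1_at_best_dev(parse_epochs(sample)))
```

Under these assumptions, the value to report is the test F1 selected by the dev set, not the highest f1 seen in any epoch, because picking the maximum test F1 directly would tune on the test set.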

Are special tokens like [COL] and [VAL] and attribute names added to the vocabulary?

As the title asks: did you also add special tokens like [COL] and [VAL], and the attribute names, to the BERT vocabulary?

drop_col gives error?

When I try to use data augmentation with drop_col, I get the error below. I did not change anything about the model or the data. Is there something I'm missing?

F1 score reported in the paper for Structured/Beer can't be reproduced

Hello,
I ran the code on Windows without a GPU and with --fp16 turned off:
python.exe .\train_ditto.py '--task' 'Structured/Beer' '--batch_size' '32' '--max_len' '256' '--lr' '3e-5' '--n_epochs' '40' '--lm' 'roberta' '--da' 'del' '--dk' 'product' '--summarize'

RoBERTa is from https://huggingface.co/roberta-base

The paper says "We use the base uncased variant of each model in all our experiments". RoBERTa is case-sensitive.
So which uncased variant of RoBERTa was used?
Also which uncased variant of XLNet was used?

The paper reports 94.34%; the best I got is
epoch 14: dev_f1=0.896551724137931, f1=0.8666666666666666, best_f1=0.9032258064516129
Any suggestions on what may have caused this low performance?

Thank you.

Here is the output:
step: 0, loss: 0.5871710777282715
epoch 1: dev_f1=0.37931034482758624, f1=0.36666666666666664, best_f1=0.36666666666666664
step: 0, loss: 0.2969485819339752
epoch 2: dev_f1=0.2745098039215686, f1=0.2692307692307693, best_f1=0.36666666666666664
step: 0, loss: 0.2463674694299698
epoch 3: dev_f1=0.32558139534883723, f1=0.32499999999999996, best_f1=0.36666666666666664
step: 0, loss: 0.5062930583953857
epoch 4: dev_f1=0.32558139534883723, f1=0.32499999999999996, best_f1=0.36666666666666664
step: 0, loss: 0.2536587119102478
epoch 5: dev_f1=0.4117647058823529, f1=0.36923076923076925, best_f1=0.36923076923076925
step: 0, loss: 0.3347562551498413
epoch 6: dev_f1=0.6923076923076924, f1=0.6470588235294117, best_f1=0.6470588235294117
step: 0, loss: 0.3830795884132385
epoch 7: dev_f1=0.8275862068965518, f1=0.6666666666666665, best_f1=0.6666666666666665
step: 0, loss: 0.27009156346321106
epoch 8: dev_f1=0.8387096774193549, f1=0.9333333333333333, best_f1=0.9333333333333333
step: 0, loss: 0.13321542739868164
epoch 9: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.024025270715355873
epoch 10: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.0391874834895134
epoch 11: dev_f1=0.896551724137931, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.00302126444876194
epoch 12: dev_f1=0.8387096774193549, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.06331554800271988
epoch 13: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.026920529082417488
epoch 14: dev_f1=0.896551724137931, f1=0.8666666666666666, best_f1=0.9032258064516129
step: 0, loss: 0.023745562881231308
epoch 15: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.012241823598742485
epoch 16: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.0017187324119731784
epoch 17: dev_f1=0.8666666666666666, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.0006802910938858986
epoch 18: dev_f1=0.8484848484848484, f1=0.8484848484848484, best_f1=0.9032258064516129
step: 0, loss: 0.0009096315479837358
epoch 19: dev_f1=0.8387096774193549, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.0005167351919226348
epoch 20: dev_f1=0.8387096774193549, f1=0.9032258064516129, best_f1=0.9032258064516129
step: 0, loss: 0.0003216741606593132
epoch 21: dev_f1=0.8387096774193549, f1=0.9032258064516129, best_f1=0.9032258064516129

training

I was trying to run the training code on a CPU with the following hyperparameters:
python train_ditto.py \
  --task Structured/Beer \
  --batch_size 64 \
  --max_len 64 \
  --lr 3e-5 \
  --n_epochs 5 \
  --finetuning \
  --lm distilbert \
  --da del \
  --dk product \
  --save_model \
  --summarize

It seems that the dev_f1 score is zero, the accuracy is stuck at 0.84, and the epoch counter never increases, so training keeps looping because the while loop in mixda runs as long as epoch <= hp.n_epochs.

Is there something I am missing, or is it stuck in an infinite loop?

ImportError: cannot import name 'LongformerModel' from 'transformers' (transformers=2.8)

I had an issue running the code after installing the required transformers==2.8 and sentencepiece==0.1.85. I got the error:

Traceback (most recent call last):
  File "train_ditto.py", line 90, in <module>
    from snippext.mixda import initialize_and_train
  File "Snippext_public/snippext/mixda.py", line 13, in <module>
    from .model import MultiTaskNet
  File "Snippext_public/snippext/model.py", line 3, in <module>
    from transformers import BertModel, AlbertModel, DistilBertModel, RobertaModel, XLNetModel, LongformerModel
ImportError: cannot import name 'LongformerModel' from 'transformers' (/home/youcef/.conda/envs/py37/lib/python3.7/site-packages/transformers/__init__.py)

I fixed it by installing transformers==3.1 instead, which includes LongformerModel.

I recommend updating the requirements accordingly:

ditto/requirements.txt, line 11 in 6dcf10a:

transformers==2.8.0

The link for the Company.zip file seems to be invalid.

wget https://ditto-em.s3.us-east-2.amazonaws.com/Company.zip

Resolving ditto-em.s3.us-east-2.amazonaws.com (ditto-em.s3.us-east-2.amazonaws.com)... 3.5.130.135, 3.5.132.184, 52.219.178.82, ...
Connecting to ditto-em.s3.us-east-2.amazonaws.com (ditto-em.s3.us-east-2.amazonaws.com)|3.5.130.135|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-08-18 23:17:26 ERROR 404: Not Found.

Can I get the file from another link? Thanks!

Add --save_model flag to the training example

Can we add the --save_model flag to the "train the matching model" example? This would show users how to produce the .pt checkpoints needed to run the matching models in the second example.

An alternative is to add a link to the "Running the matcher" Google Colab example, so users know how to obtain the .pt files needed to run the matcher.

How to reproduce the paper metrics?

I ran the suggested example from the README to try to reproduce the results shown in the paper:

CUDA_VISIBLE_DEVICES=0 python train_ditto.py \
  --task Structured/Beer \
  --batch_size 64 \
  --max_len 64 \
  --lr 3e-5 \
  --n_epochs 40 \
  --finetuning \
  --lm distilbert \
  --fp16 \
  --da del \
  --dk product \
  --summarize

If I read the paper correctly, I should get an F1 score of around 94.7.

But I just get:

You can also see the training logs here

What am I doing wrong?

Summarization sometimes removes attribute names between [COL] and [VAL]

When an attribute name also appears as a token elsewhere in the sequence, the summarization component may remove it.

For example, on line 719 of data/er_magellan/Structured/Amazon-Google/test.txt.su, the attribute name manufacturer between [COL] and [VAL] is removed because manufacturer also appears in the title.

Since a value without an attribute name seems a bit unnatural, I'm not sure whether it's a bug.

How to serialize the inputs (code)?
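
A minimal sketch of one possible serialization follows. It is based only on what other issues on this page show: each entity is flattened into a sequence of attribute-name and value markers, and a training row has three tab-separated fields (left entity, right entity, label). The function names are hypothetical, and the marker spelling (COL and VAL here, written [COL] and [VAL] in the paper's notation) is an assumption to check against the data files shipped with the repo.

```python
def serialize_entity(record, col_token="COL", val_token="VAL"):
    """Flatten one record (dict of attribute -> value) into a token sequence.

    The marker spellings are assumptions; pass the ones your data files use.
    """
    parts = []
    for attr, value in record.items():
        # Tabs separate the fields of a row, so they must not appear in values.
        clean = str(value).replace("\t", " ").strip()
        parts.append(f"{col_token} {attr} {val_token} {clean}")
    return " ".join(parts)


def serialize_pair(left, right, label):
    """Build one row: left entity, right entity and 0/1 label, tab-separated."""
    return "\t".join([serialize_entity(left), serialize_entity(right), str(label)])


# Illustrative records, not taken from any dataset.
left = {"title": "instant immersion spanish", "price": "36.11"}
right = {"title": "instant immersion spanish 2.0", "price": "17.24"}
print(serialize_pair(left, right, 1))
```

For unlabeled pairs at prediction time, the label field would simply be left out or ignored, depending on what the consuming script expects.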

Error when using --summarize with matcher.py

Hi,

In your README it says that the --summarize flag must be passed to matcher.py if it was also used at training time.
When I do so, I get the following error:

Defaults for this optimization level are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O2
cast_model_type        : torch.float16
patch_torch_functions  : False
keep_batchnorm_fp32    : True
master_weights         : True
loss_scale             : dynamic
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "matcher.py", line 242, in <module>
    dk_injector=dk_injector)
  File "matcher.py", line 149, in predict
    pairs.append((to_str(row[0], summarizer, max_len, dk_injector),
  File "matcher.py", line 49, in to_str
    content = summarizer.transform(content, max_len=max_len)
  File "/content/drive/My Drive/Master Thesis/ditto/repo/ditto/ditto/summarize.py", line 75, in transform
    sentA, sentB, label = row.strip().split('\t')
ValueError: not enough values to unpack (expected 3, got 1)

Without the --summarize flag, matcher.py runs fine.

Is there any workaround to use matcher.py with summarization?
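
The traceback suggests a format mismatch: line 75 of summarize.py unpacks a row into three tab-separated fields (sentA, sentB, label), while to_str in matcher.py passes a single serialized entry. The sketch below only reproduces that mismatch with a clearer error message. split_row is a hypothetical helper, not a function from the repo, and it is not a fix for matcher.py.

```python
def split_row(row):
    """Split a 'sentA<TAB>sentB<TAB>label' row into its three fields.

    Raises ValueError with a descriptive message instead of the bare
    tuple-unpacking error seen in the traceback.
    """
    fields = row.strip().split("\t")
    if len(fields) != 3:
        raise ValueError(
            f"expected 3 tab-separated fields, got {len(fields)}: {row[:60]!r}"
        )
    return fields[0], fields[1], fields[2]


# A full training row splits cleanly.
print(split_row("COL title VAL a\tCOL title VAL b\t1\n"))

# A single serialized entry, like the one matcher.py passes, does not.
try:
    split_row("COL title VAL a")
except ValueError as err:
    print(err)
```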

ValueError: not enough values to unpack (expected 2, got 1) - Textual/Company

!CUDA_VISIBLE_DEVICES=0 python train_ditto.py \
  --task Textual/Company \
  --batch_size 32 \
  --max_len 128 \
  --lr 3e-5 \
  --n_epochs 20 \
  --finetuning \
  --lm roberta \
  --fp16 \
  --da drop_col

step: 0, loss: 0.609293520450592
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Traceback (most recent call last):
  File "train_ditto.py", line 92, in <module>
    run_tag, hp)
  File "/home/ec2-user/SageMaker/vendor_matching/ditto/ditto_light/ditto.py", line 201, in train
    train_step(train_iter, model, optimizer, scheduler, hp)
  File "/home/ec2-user/SageMaker/vendor_matching/ditto/ditto_light/ditto.py", line 123, in train_step
    for i, batch in enumerate(train_iter):
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index) # may raise StopIteration
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ec2-user/SageMaker/vendor_matching/ditto/ditto_light/dataset.py", line 80, in __getitem__
    left, right = combined.split(' [SEP] ')
ValueError: not enough values to unpack (expected 2, got 1)

ditto for Spanish

How many changes are required to make ditto work for the Spanish language?

Adding custom tokens

Hey guys!
I had fun reading the paper, and thanks for open-sourcing the model.

In the paper, you mention that [COL] and [VAL] are special tokens for indicating the start of attribute names and values, respectively. I take this to mean that [COL] and [VAL] are to be added to the tokenizer as special tokens. However, in the repo (https://github.com/megagonlabs/ditto/blob/master/ditto_light/dataset.py#L12), you are not adding them as special tokens to the vocabulary of the pre-trained tokenizer.

Any reason why?

Code implementation - Training

Hello,

I am trying to run the training code, but I run into this error:

2020-11-02 07:36:08.658676: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Downloading: 100% 232k/232k [00:00<00:00, 318kB/s]
Downloading: 100% 442/442 [00:00<00:00, 284kB/s]
Downloading: 100% 268M/268M [00:04<00:00, 62.7MB/s]
Traceback (most recent call last):
  File "train_ditto.py", line 103, in <module>
    run_tag)
  File "Snippext_public/snippext/mixda.py", line 253, in initialize_and_train
    alpha_aug=hp.alpha_aug)
  File "Snippext_public/snippext/mixda.py", line 152, in train
    with amp.scale_loss(loss, optimizer) as scaled_loss:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/apex/amp/handle.py", line 82, in scale_loss
    raise RuntimeError("Invoked 'with amp.scale_loss`, but internal Amp state has not been initialized.  "
RuntimeError: Invoked 'with amp.scale_loss`, but internal Amp state has not been initialized.  model, optimizer = amp.initialize(model, optimizer, opt_level=...) must be called before `with amp.scale_loss`.

Could you please help me solve this issue?

How to get entity embeddings

Can we get the embedding of an entity? We need to develop a second model that computes the "similarity" of entities, and embeddings would be very helpful.

[Question] Can I use this package in a notebook environment?

Hi ditto creators,

I am currently working on a short Masters project, and I discovered your library. We are performing de-duplication on the Cora citations dataset using py_entitymatching and deepmatcher. I was keen to evaluate your package as well, but the documentation seems to suggest I can only interact with the functions via the command line.

Speaking as someone who does not have a lot of software engineering experience, is it possible to use your package in a notebook environment? I am currently using Google Colab for the previously-mentioned packages.

Thanks in advance.

How is your F1 score calculated: weighted, macro, or micro?
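
For reference, the sketch below shows how the common F1 variants differ on a binary task. It does not claim which variant train_ditto.py reports, since nothing on this page states that. All function names are hypothetical, and the labels are made up for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def class_counts(y_true, y_pred, positive):
    """Count tp, fp, fn when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn


def binary_f1(y_true, y_pred):
    """F1 of the match class (label 1) only."""
    return f1_from_counts(*class_counts(y_true, y_pred, 1))


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = [f1_from_counts(*class_counts(y_true, y_pred, c)) for c in (0, 1)]
    return sum(scores) / len(scores)


def micro_f1(y_true, y_pred):
    """F1 over counts pooled across classes (equals accuracy here)."""
    totals = [0, 0, 0]
    for c in (0, 1):
        for i, n in enumerate(class_counts(y_true, y_pred, c)):
            totals[i] += n
    return f1_from_counts(*totals)


y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
print(binary_f1(y_true, y_pred), macro_f1(y_true, y_pred), micro_f1(y_true, y_pred))
```

On imbalanced matching data the three numbers can differ a lot, so it matters which one a paper reports.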

Inferencing

How do I run inference on unseen data after training?

ModuleNotFoundError: No module named 'click._bashcomplete'

!CUDA_VISIBLE_DEVICES=0 python train_ditto.py \
  --task Structured/DBLP-ACM \
  --batch_size 32 \
  --max_len 128 \
  --lr 3e-5 \
  --n_epochs 20 \
  --finetuning \
  --lm roberta \
  --fp16 \
  --da drop_col

Traceback (most recent call last):
  File "train_ditto.py", line 13, in <module>
    from ditto_light.knowledge import *
  File "/home/ec2-user/SageMaker/vendor_matching/ditto/ditto_light/knowledge.py", line 5, in <module>
    import spacy
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/spacy/__init__.py", line 14, in <module>
    from .cli.info import info # noqa: F401
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/spacy/cli/__init__.py", line 3, in <module>
    from ._util import app, setup_cli # noqa: F401
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/spacy/cli/_util.py", line 8, in <module>
    import typer
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/typer/__init__.py", line 31, in <module>
    from .main import Typer, run
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/typer/main.py", line 11, in <module>
    from .completion import get_completion_inspect_parameters
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/typer/completion.py", line 10, in <module>
    import click._bashcomplete
ModuleNotFoundError: No module named 'click._bashcomplete'

Random results when running inference on two similar text strings

Hi,

I fine-tuned this model and ran into a curious issue when running inference on two similar text strings, for example:

[["PROTEXASE THERAPEUTICS", "PROTEXASE THERAPEUTICS, INC."]]

I ran it 10 times and got 10 different prediction scores:
[[0.6435145641728891, 0.35648543582711095],
[0.44686401791372865, 0.5531359820862713],
[0.4768868842385305, 0.5231131157614696],
[0.5165862730694053, 0.48341372693059464],
[0.45327667255908927, 0.5467233274409107],
[0.5023805971114581, 0.497619402888542],
[0.7547820757051401, 0.24521792429486],
[0.02058591123972741, 0.9794140887602728],
[0.7257732298308167, 0.27422677016918334],
[0.30100721825239946, 0.6989927817476006]]

These results are quite unstable.

I checked MultiTaskNet (from snippext.model import MultiTaskNet) and changed dropout=0.1 to 0, but the results are still random. In any case, the dropout layer should not be active during inference.

I find this hard to believe, so I may be missing something important. What could the problem be?

Inconsistent metrics

I'm testing ditto to match my own dataset so I'm running the following:
CUDA_VISIBLE_DEVICES=0 python train_ditto.py --task catalogo --batch_size 64 --max_len 64 --lr 3e-5 --n_epochs 10 --finetuning --lm distilbert --fp16

The metrics printed during training are really good, with accuracy around 0.990.

As a sanity check and for error analysis, I ran predictions over the test set (in JSON Lines format):
CUDA_VISIBLE_DEVICES=0 python matcher.py --task catalogo --input_path input/to_be_evaluate.jsonl --output_path output/output_catalogo.jsonl --lm distilbert --use_gpu --fp16 --checkpoint_path checkpoints/

My dataset is balanced (around 50% positives and 50% negatives), so with 0.99 accuracy I expect almost the same number of positives and negatives in the output file. I checked with the following commands:

$ ditto# cat output/output_catalogo.jsonl | grep '"match": "1"' | wc -l 
139
$ ditto# cat output/output_catalogo.jsonl | grep '"match": "0"' | wc -l 
5149
$ ditto# cat data/cboujon/test.txt | grep -P "\t0" | wc -l 
2675
$ ditto# cat data/cboujon/test.txt | grep -P "\t1" | wc -l 
2613

These numbers suggest that the accuracy is not 0.990, or else I can't see where my error is.
Here are the datasets and outputs

The task config is defined as:

{
  "name": "catalogo",
  "task_type": "classification",
  "vocab": ["0", "1"],
  "trainset": "data/cboujon/train.txt",
  "validset": "data/cboujon/valid.txt",
  "testset": "data/cboujon/test.txt"
}
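
The sanity check described in this issue (comparing the matcher output with the labels in test.txt) can be sketched as follows. The sketch assumes that the output file has one JSON object per line with a "match" field, as the grep commands suggest, that the label is the last tab-separated field of each test row, and that both files list the pairs in the same order. The function names are hypothetical.

```python
import json
from collections import Counter


def confusion(pred_lines, test_lines, match_key="match"):
    """Count (gold, predicted) label pairs, aligning rows by position."""
    if len(pred_lines) != len(test_lines):
        raise ValueError("prediction and test files have different lengths")
    counts = Counter()
    for pred_line, test_line in zip(pred_lines, test_lines):
        predicted = str(json.loads(pred_line)[match_key])
        gold = test_line.rstrip("\n").split("\t")[-1].strip()
        counts[(gold, predicted)] += 1
    return counts


def accuracy(counts):
    """Fraction of rows whose predicted label equals the gold label."""
    total = sum(counts.values())
    correct = sum(n for (gold, pred), n in counts.items() if gold == pred)
    return correct / total if total else 0.0


# Tiny made-up example: two gold positives, two gold negatives.
preds = ['{"match": "1"}', '{"match": "0"}', '{"match": "0"}', '{"match": "0"}']
tests = ["a\tb\t1\n", "c\td\t1\n", "e\tf\t0\n", "g\th\t0\n"]
table = confusion(preds, tests)
print(dict(table), accuracy(table))
```

If the grep counts above reflect the labels, at most 139 of the 2613 gold positives can be predicted correctly, so accuracy on this file cannot exceed (139 + 2675) / 5288, which is about 0.53.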

Evaluation method seems to assign a new F1 value as the best score without computing the F1 value at best_th=0.5

Hi,

In this line, the evaluation computes F1 for thresholds from 0.0 up to (but not including) 1.0 in steps of 0.05:
for th in np.arange(0.0, 1.0, 0.05):

I found that the F1 value at best_th=0.5 is not computed, yet the evaluation method assigns a new best F1 value without comparing it against the F1 value at best_th=0.5.

Cheers,
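
A small sketch of the threshold sweep under discussion follows. It is not the repo's evaluation code: the function names and the sample scores are hypothetical, and the grid is written in plain Python as i * 0.05 for i from 0 to 19, which gives the same 20 values as np.arange(0.0, 1.0, 0.05). The sketch starts from the F1 at 0.5 as a baseline, so another threshold is adopted only if it strictly improves on it.

```python
def threshold_grid(step=0.05, n=20):
    """Thresholds 0.0, 0.05, ..., 0.95, computed as i * step."""
    return [i * step for i in range(n)]


def f1_at(probs, labels, th):
    """F1 of the positive class when predicting 1 for probabilities above th."""
    preds = [1 if p > th else 0 for p in probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probs, labels, default_th=0.5):
    """Sweep the grid, keeping default_th unless another value is strictly better."""
    best_th, best_f1 = default_th, f1_at(probs, labels, default_th)
    for th in threshold_grid():
        score = f1_at(probs, labels, th)
        if score > best_f1:
            best_th, best_f1 = th, score
    return best_th, best_f1


# Made-up validation scores and labels.
probs = [0.9, 0.8, 0.33, 0.22, 0.11]
labels = [1, 1, 1, 0, 0]
print(best_threshold(probs, labels))
```

Note that 0.5 is itself one of the grid values, so a sweep over this grid does evaluate the F1 at 0.5 along the way.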