Comments (3)
Hello, the proper way to use Vector Based Blocking is presented here:
https://pyjedai.readthedocs.io/en/latest/tutorials/pyTorchWorkflow.html
Vector Based Blocking generates a dictionary of ids that correspond to candidate matches. So at the end of vector-based blocking you get either this dictionary or a graph like the one entity matching produces. FAISS also returns distance/similarity scores, so a separate entity matching step is not needed. Check out the tutorial, and if you have any questions, I'm happy to help.
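To make the idea of a candidate dictionary concrete, here is a minimal sketch in plain Python. It is not pyJedAI's or FAISS's implementation: it uses a brute-force Euclidean search, and the function name `top_k_candidates` and the toy vectors are made up for illustration.

```python
import math

def top_k_candidates(queries, indexed, k):
    """Map each query id to its k nearest indexed ids, nearest first.

    Brute-force stand-in for a similarity-search index (hypothetical helper,
    not part of pyJedAI). Each candidate is an (indexed_id, distance) pair.
    """
    candidates = {}
    for q_id, q_vec in enumerate(queries):
        # Score every indexed vector by Euclidean distance to the query.
        scored = sorted(
            (math.dist(q_vec, vec), i_id) for i_id, vec in enumerate(indexed)
        )
        candidates[q_id] = [(i_id, dist) for dist, i_id in scored[:k]]
    return candidates

# Toy 2-dimensional "embeddings" (real sentence embeddings are much larger).
indexed = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
queries = [(0.9, 0.1), (4.0, 5.0)]

candidates = top_k_candidates(queries, indexed, k=2)
for q_id, pairs in candidates.items():
    print(q_id, pairs)
```

Each query id maps to its k closest ids together with a distance, which is the kind of scored candidate list a clustering step can then consume.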
Hi Nikoletos,
I used the exact code for sminilm and faiss, then I applied Unique Mapping Clustering.
I got low scores: Precision: 3.24%, Recall: 2.23%, F1-score: 2.64%.
How do I reach the scores of Precision: 83.18%, Recall: 67.10%, F1-score: 74.28%?
Code:
from pyjedai.vector_based_blocking import EmbeddingsNNBlockBuilding
from pyjedai.clustering import ConnectedComponentsClustering, UniqueMappingClustering

# `data` is the dataset object loaded earlier (loading code not shown in this comment).
emb = EmbeddingsNNBlockBuilding(vectorizer='sminilm',
                                similarity_search='faiss')
blocks, g = emb.build_blocks(data,
                             top_k=5,
                             similarity_distance='euclidean',
                             load_embeddings_if_exist=False,
                             save_embeddings=False,
                             with_entity_matching=True)

# Cluster the scored candidate graph `g` and evaluate against the ground truth.
ccc = UniqueMappingClustering()
clusters = ccc.process(g, data, similarity_threshold=0.40)
_ = ccc.evaluate(clusters, with_classification_report=True)
Results:
Building blocks via Embeddings-NN Block Building [sminilm, faiss]
Embeddings-NN Block Building [sminilm, faiss]: 100% 2152/2152 [00:20<00:00, 117.82it/s]
Device selected: cuda
Μethod: Embeddings-NN Block Building
Method name: Embeddings-NN Block Building
Parameters:
Vectorizer: sminilm
Similarity-Search: faiss
Top-K: 5
Vector size: 384
Runtime: 20.2259 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
Precision: 9.38%
Recall: 93.77%
F1-score: 17.05%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
True positives: 1009
False positives: 9751
True negatives: 1156633
False negatives: 67
Total comparisons: 10760
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Statistics:
FAISS:
Indices shape returned after search: (1076, 5)
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
{'Precision %': 9.37732342007435,
'Recall %': 93.77323420074349,
'F1 %': 17.049678945589726,
'True Positives': 1009,
'False Positives': 9751,
'True Negatives': 1156633,
'False Negatives': 67}
Μethod: Unique Mapping Clustering
Method name: Unique Mapping Clustering
Parameters:
Runtime: 0.0187 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
Precision: 0.57%
Recall: 0.28%
F1-score: 0.37%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
True positives: 3
False positives: 527
True negatives: 1155627
False negatives: 1073
Total comparisons: 530
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
{'Precision %': 0.5660377358490566,
'Recall %': 0.2788104089219331,
'F1 %': 0.37359900373599003,
'True Positives': 3,
'False Positives': 527,
'True Negatives': 1155627,
'False Negatives': 1073}
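As a sanity check on the two reports above, the percentages can be recomputed from the raw counts using the standard precision/recall/F1 definitions. This is a small sketch; the helper name `prf` is made up and is not a pyJedAI function.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1) as fractions from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts copied from the logs above.
blocking = prf(tp=1009, fp=9751, fn=67)
clustering = prf(tp=3, fp=527, fn=1073)

for name, (p, r, f1) in [("blocking", blocking), ("clustering", clustering)]:
    print(f"{name}: P={p:.2%} R={r:.2%} F1={f1:.2%}")
```

In these logs, blocking keeps 1009 of the 1076 true matches, while clustering keeps only 3, so the loss happens in the clustering step rather than in blocking.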
I suggest you start by experimenting with the blocking parameters:
- top_k: you used 5; try values from 5 to 20
- similarity_distance: you used 'euclidean'; try 'cosine'
and then with the clustering parameter:
- similarity_threshold: you used 0.4; try values from 0 to 1
You can also check the Optuna tutorial here: https://pyjedai.readthedocs.io/en/latest/tutorials/Optuna.html
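The experimentation suggested above can be organised as a simple grid search. The sketch below is generic Python, not pyJedAI or Optuna code. `placeholder_score` is a made-up stand-in that says nothing about how real data behaves; in practice you would replace it with a function of your own that runs blocking and clustering with the given parameters and returns the F1 score.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Try every parameter combination and return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def placeholder_score(top_k, similarity_distance, similarity_threshold):
    # Arbitrary toy formula so the example runs on its own. Replace with a
    # function that runs your real workflow and returns its F1 score.
    bonus = 0.1 if similarity_distance == "cosine" else 0.0
    return 1.0 - abs(similarity_threshold - 0.6) - abs(top_k - 10) / 100 + bonus

# Ranges taken from the suggestions above; the exact grid points are a choice.
grid = {
    "top_k": [5, 10, 20],
    "similarity_distance": ["euclidean", "cosine"],
    "similarity_threshold": [0.2, 0.4, 0.6, 0.8],
}

best_params, best_score = grid_search(placeholder_score, grid)
print(best_params, round(best_score, 3))
```

A grid like this has 24 combinations, so with a blocking run of about 20 seconds it stays manageable. For larger search spaces, a tuner such as Optuna samples combinations instead of enumerating them all.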