ai-team-uoa / pyjedai

An open-source library that leverages Python’s data science ecosystem to build powerful end-to-end Entity Resolution workflows.

Home Page: https://pyjedai.readthedocs.io

License: Apache License 2.0

Languages: Python 99.98%, Dockerfile 0.02%
Topics: data-disambigation, data-matching, deduplication, duplicate-detection, entity-matching, entity-resolution, fuzzy-matching, link-discovery, machine-learning, python

pyjedai's Issues

ValueError in datamodel.Data

Describe the bug
Hey, great job on this package. Is it still being maintained? I'd like to know before using it in prod.
Also, I believe there is a bug in datamodel.Data: if you pass a dataset_2 that does not contain id_column_name_1 (which seems like a valid case), you get a ValueError.

To Reproduce
Steps to reproduce the behavior:

import pandas as pd
from pyjedai.datamodel import Data

df1 = pd.DataFrame({'id': [1, 2], 'data': ['a', 'b']})
# dataset_2 identifies records by 'other_id', not by 'id'
df2 = pd.DataFrame({'other_id': [1, 2], 'data': ['a', 'b']})
Data(
    dataset_1=df1,
    id_column_name_1="id",
    dataset_2=df2,
    id_column_name_2="other_id",
)

--> ValueError

Expected behavior
No ValueError

Additional context
I believe that here it should be self.attributes_2.remove(self.id_column_name_2) instead of self.attributes_2.remove(self.id_column_name_1).
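For illustration, a minimal sketch of the corrected branch (the surrounding lines are paraphrased, not the library's exact code; only the .remove() argument is the actual suggested change):

# paraphrased from datamodel.Data for illustration
if not self.is_dirty_er:
    self.attributes_2 = list(self.dataset_2.columns)
    self.attributes_2.remove(self.id_column_name_2)  # was: self.id_column_name_1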

Precision over 100% reported if ground truth contains pairs of identical ids

We have a dirty ER workflow, where the EntityMatching graph is generated with similarity_threshold=0.0 (to get all compared edges) and we then optimize the clustering for the optimal similarity_threshold using Optuna (a sketch of the search loop follows below). We encountered this:
[Figure_1: evaluation metrics over the similarity-threshold sweep, with precision exceeding 100% as the threshold approaches 1.0]

At the top end, where the threshold approaches 1.0 and the clustering therefore produces very few matches, the reported precision goes beyond 100%. I would have to dig deeper into what exactly causes this, but maybe you have an idea; possibly it is only a bug in edge cases where the number of matches is low.
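For reference, a rough sketch of the threshold search we run (the return shape of ccc.evaluate() is an assumption here, so adapt the objective to what your pyjedai version actually returns):

import optuna
from pyjedai.clustering import UniqueMappingClustering

def objective(trial):
    # the pairs graph was built once with EntityMatching(similarity_threshold=0.0)
    t = trial.suggest_float("similarity_threshold", 0.0, 1.0)
    ccc = UniqueMappingClustering()
    clusters = ccc.process(pairs_graph, data, similarity_threshold=t)
    scores = ccc.evaluate(clusters)  # assumed to report precision/recall/F1
    return scores["F1 %"]            # key name is an assumption

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)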

best

Block Filtering and Block Purging after Vector Based Blocking

Hi, I have tried vector-based blocking with sentence transformers and FAISS, and these blocks come back as a dict of indices mapped to sets of indices.
I can't proceed with Block Filtering and Block Purging afterwards.
They fail on the cardinality attribute; I presume this is because other methods like QGramsBlocking return a dict of {'key': datamodel.Block} items instead.


Bug in similarity calculation in EntityMatching and incorrect documentation for dirtyER

Incorrect Docs

At the top of https://pyjedai.readthedocs.io/en/latest/tutorials/DirtyER.html
an attribute list attr = ['Entity Id', 'author', 'title'] is used for the data (by the way, IMHO it does not make sense to include the Entity Id: it will always differ between entities, so it only lowers the similarity score of otherwise identical entities; I would suggest removing 'Entity Id' from the attr list).
Later, entity matching is instantiated without specifying an attribute list:

em = EntityMatching(
    metric='jaccard',
    similarity_threshold=0.0
)

This, however, results in all attributes of the entities being compared, as EntityMatching does not fall back to the attributes specified in the Data; see:

self.attributes: list = attributes

The constructor uses the provided attributes or None. I would suggest either updating the tutorial:

em = EntityMatching(
    metric='jaccard',
    similarity_threshold=0.0,
    attributes=attr
)

or, even better, falling back to data.attributes in the em.predict method when self.attributes is None.

Issues regarding similarity calculation

As I understand the _similarity method, attributes can be either a dict, a list, or None. To reflect the dict use case, self.attributes should be allowed to be a dict, e.g. by changing its type annotation to Any here:

self.attributes: list = attributes

More severe is that the similarity calculation is currently only correct if no attributes are specified at all.
For the dict case, the if should be an elif here:

if isinstance(self.attributes, list):

Currently, the last else branch overwrites the similarity already calculated for the dict case.

For the list case, the division by the number of attributes should happen outside the loop, not inside it. So this line:

similarity /= len(self.attributes)

should be dedented one step; otherwise the sum ends up divided by len(self.attributes)^2.
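Putting the two fixes together, here is a minimal standalone sketch of the intended per-pair aggregation (metric_fn and the dict-like entity representation are illustrative assumptions, not pyjedai's internals):

def aggregate_similarity(attributes, entity_1, entity_2, metric_fn):
    similarity = 0.0
    if isinstance(attributes, dict):
        # dict case: attribute -> weight
        for attr, weight in attributes.items():
            similarity += weight * metric_fn(entity_1[attr], entity_2[attr])
    elif isinstance(attributes, list):  # elif, so the else branch cannot overwrite it
        for attr in attributes:
            similarity += metric_fn(entity_1[attr], entity_2[attr])
        similarity /= len(attributes)  # divide once, outside the loop
    else:
        # no attributes specified: compare the whole serialized entities
        similarity = metric_fn(" ".join(map(str, entity_1.values())),
                               " ".join(map(str, entity_2.values())))
    return similarity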

best

Normalization of NaN is not working as intended

The data class tries to normalize na / nan values into empty strings.
This is done here:

self.dataset_1 = self.dataset_1.astype(str)
self.dataset_1.fillna("", inplace=True)
if not self.is_dirty_er:
    self.dataset_2 = self.dataset_2.astype(str)
    self.dataset_2.fillna("", inplace=True)

but it does not work as intended.
Casting the DataFrame to str replaces every NaN with the literal string "nan", so the subsequent fillna no longer has anything to fill.
see:

>>> pandas.DataFrame.isnull(pandas.DataFrame([numpy.nan]).astype(float))
      0
0  True
>>> pandas.DataFrame.isnull(pandas.DataFrame([numpy.nan]).astype(str))
       0
0  False
>>> 

I do not know the best way to handle the intended conversion, though. One option would be to change the order: first do fillna, then cast to string. But I don't know what happens if fillna('', inplace=True) is applied to dtypes incompatible with / other than strings.
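For what it's worth, a small sketch of the reordered approach on an object-dtype column (how fillna('') behaves on non-string dtypes is still the open question above):

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", np.nan]})
df = df.fillna("").astype(str)  # fill missing values first, then cast
print(df["name"].tolist())      # ['a', ''] instead of ['a', 'nan']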

best

Entity Matching metrics get sim score error

Hi, there are issues with the entity matching portion, as seen when following the tutorials.

As far as I understand, EntityMatching (via "from pyjedai.matching import EntityMatching")
has a keyword argument metric.
It accepts ['jaccard', 'jaro', 'edit_distance', 'Frequency', 'BM25F', 'cosine', 'TF-IDF', 'overlap_coefficient', 'generalized_jaccard', 'dice', 'PL2'], which are string-matching algorithms.

The metrics that have issues are ['PL2', 'TF-IDF', 'BM25F', 'Frequency'].
For example, an error is raised when metric='PL2' is passed; a minimal reproduction is sketched below.
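A reproduction sketch, assuming blocks and data have already been built as in the tutorials (where exactly the failure surfaces is what this issue is about):

from pyjedai.matching import EntityMatching

em = EntityMatching(metric='PL2', similarity_threshold=0.5)
pairs_graph = em.predict(blocks, data)  # errors out for 'PL2', 'TF-IDF', 'BM25F', 'Frequency'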


'export_pairs' is not working due to missing 'write' function

Describe the bug
'export_pairs' is not working due to missing 'write' function

To Reproduce

add w.export_pairs() to WorkFlow.ipynb

Expected behavior

export pairs as a dataframe

Screenshots

NameError Traceback (most recent call last)
Cell In[18], line 1
----> 1 w.export_pairs()

File site-packages\pyjedai\workflow.py:195, in PYJEDAIWorkFlow.export_pairs(self)
    189 def export_pairs(self) -> pd.DataFrame:
    190     """Export pairs to file.
    191
    192     Returns:
    193         pd.DataFrame: pairs as a DataFrame
    194     """
--> 195     return write(self.final_pairs, self.data)

NameError: name 'write' is not defined

Desktop (please complete the following information):

  • WIN 11, Chrome, pyjedai 0.1.7

Additional context
In the past, 'write' was defined in evaluation.py.
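A hedged guess at the fix, assuming write still lives in pyjedai.evaluation (which is exactly what needs verifying), would be to restore the import at the top of workflow.py:

from pyjedai.evaluation import write  # assumed location, per the note above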

Entity Resolution Results Inconsistent Between Individual Steps and Workflow Method

Issue: Entity Resolution Results Vary Between Individual Steps and Workflow

Description:

I encountered an issue while performing entity resolution on the dataset located at '../data/test/ccer', which consists of the files abt_100.csv, gt_100.csv, and buy_100.csv.

Despite using identical parameters and the same datasets, the outcomes vary drastically between the two approaches: the Individual Steps Method and the Workflow Method.

Results:

Individual Steps Results:

  • Precision: 55.17%
  • Recall: 65.31%
  • F1-score: 59.81%

Workflow Results:

  • Precision: 50.0%
  • Recall: 2.04%
  • F1-score: 3.92%

Code Used:

Individual Steps Code:

from pyjedai.block_building import ExtendedSuffixArraysBlocking
from pyjedai.block_cleaning import BlockFiltering, BlockPurging
from pyjedai.comparison_cleaning import BLAST
from pyjedai.matching import EntityMatching
from pyjedai.clustering import UniqueMappingClustering

bb = ExtendedSuffixArraysBlocking(4)
blocks = bb.build_blocks(data)

bc = BlockFiltering(ratio=0.76)
blocks = bc.process(blocks, data)

bp = BlockPurging(smoothing_factor=1.025)
blocks = bp.process(blocks, data)

mb = BLAST(weighting_scheme='X2')
blocks = mb.process(blocks, data)

EM = EntityMatching(
    metric='jaro',
    similarity_threshold=0.5
)
pairs_graph = EM.predict(blocks, data)

ccc = UniqueMappingClustering()
clusters = ccc.process(pairs_graph, data, similarity_threshold=0.55)
_ = ccc.evaluate(clusters, with_classification_report=True)

Workflow Code:

from pyjedai.workflow import BlockingBasedWorkFlow

w = BlockingBasedWorkFlow(
    block_building=dict(
        method=ExtendedSuffixArraysBlocking,
        params=dict(suffix_length=4),
        attributes_1=['name', 'description', 'price'],
        attributes_2=['name', 'description', 'price']
    ),
    block_cleaning=[
        dict(
            method=BlockFiltering,
            params=dict(ratio=0.76)
        ),
        dict(
            method=BlockPurging,
            params=dict(smoothing_factor=1.025)
        ),
    ],
    comparison_cleaning=dict(method=BLAST),
    entity_matching=dict(
        method=EntityMatching,
        metric='jaro',
        similarity_threshold=0.5,
        attributes=['name', 'description', 'price']
    ),
    clustering=dict(method=UniqueMappingClustering, similarity_threshold=0.55),
    name="Workflow-Test"
)

w.run(data, workflow_step_tqdm_disable=True, verbose=False)
f1, precision, recall = w.get_final_scores()
print(f'Precision : {precision}\nRecall : {recall}\nF-1 Score : {f1}')

Executing BlockPurging -> stats results in AttributeError

Consider the following example code:

# imports assumed for recent pyjedai versions
from pyjedai.datamodel import Data
from pyjedai.block_building import QGramsBlocking
from pyjedai.block_cleaning import BlockPurging

data = Data(
    dataset_1=d1,
    id_column_name_1='id',
    ground_truth=gt,
    attributes_1=attr
)
bb = QGramsBlocking()
blocks = bb.build_blocks(data)
bp = BlockPurging()
blocks = bp.process(blocks, data, tqdm_disable=False)
bp.stats(blocks)

This results in:

Traceback (most recent call last):
  File "[REDACTED]/jedi_er.py", line 43, in <module>
    bp.stats(blocks)
  File "[REDACTED]/.local/lib/python3.10/site-packages/pyjedai/block_building.py", line 181, in stats
    "\n\tNumber of blocks dropped: " + str(self.num_of_blocks_dropped) +
AttributeError: 'BlockPurging' object has no attribute 'num_of_blocks_dropped'

best

Hello! Collaborate and cross-inspire?

Hi there! Nice to meet you. I'm a software engineer working on entity resolution. I've been reading a lot of your papers on the topic and have found them extremely helpful, so thank you for that.

I just found this repo; it looks quite useful. (PS: you might want to add topics to this repo so it is easier to discover; I missed it until now because of this.) I have been working on an entity resolution framework called mismo that looks quite similar. I am looking forward to reading this repo more and using it as inspiration for mismo. I invite you to take a look at mismo as well, and I hope that we can both learn from each other and ask a lot of questions :)

Some things I hope to port to mismo:

  • Many of the algorithms you have implemented.
  • Pipeline mechanics
  • Schema unification
  • Plotting and evaluation methods

Some things you might find interesting in mismo:

  • I use ibis with duckdb, not pandas. This gives MUCH better performance and scalability than in-memory, greedily-evaluated pandas.
  • Using altair for interactive, rich plots instead of matplotlib
  • Using reacton and solara for rich, interactive Jupyter widgets

Cheers,
Nick
