- The code performs better on multi-core machines because it uses Python's multiprocessing.
- The code scales to more than 2 profiles.
- The fields are extensible: the `Profile` class supports any number of fields.
- Clone the repo and install the packages from `requirements.txt` in your container/virtualenv.
- Run: `python3 Duplicate_Finder.py`
def main():
    o1 = Profile(id=1, first_name="Kanhai", last_name="Shah", email_field="[email protected]", random_field=1)
    o2 = Profile(first_name="Kanhai", last_name="Shah", email_field="[email protected]")
    df = Duplication([o1.get_profile(), o2.get_profile()])
    df.findDuplicates(['email_field', 'first_name', 'last_name', 'random_field'])
    print(df.get_result())
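To illustrate what "any number of fields" could mean in practice, here is a minimal sketch of a profile class that stores the three mandatory fields plus arbitrary extras. `ProfileSketch` is a hypothetical stand-in written for this README, not the repo's actual `Profile` implementation, and the email address is a placeholder.

```python
class ProfileSketch:
    """Hypothetical stand-in for the repo's Profile class (assumption, not the real code)."""

    def __init__(self, first_name, last_name, email_field, **extra_fields):
        # The three mandatory fields are always stored.
        self._fields = {
            "first_name": first_name,
            "last_name": last_name,
            "email_field": email_field,
        }
        # Any number of additional fields (id, random_field, ...) is accepted.
        self._fields.update(extra_fields)

    def get_profile(self):
        # Return a copy so callers cannot mutate the internal state.
        return dict(self._fields)


# Placeholder values, for illustration only.
p = ProfileSketch(first_name="Kanhai", last_name="Shah",
                  email_field="kanhai@example.com", id=1, random_field=1)
print(p.get_profile())
```

Collecting the extras through `**extra_fields` is one simple way to keep the mandatory fields explicit while leaving the rest open-ended.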
- The question states:
  - If the first_name + last_name + email match between two profiles is greater than 80% (you can try using a library like https://pypi.org/project/fuzzywuzzy/), increase the match score to 1.
  - However, in find_duplicates these 3 fields are sometimes not all passed.
- To resolve this ambiguity, I did the following:
  - If any of `first_name`, `last_name`, or `email_field` is passed, the fuzzy logic is performed, since these fields are mandatory in every Profile and will always be present.
  - If the fuzzy logic gives a match above 80%, total_match_score is incremented by 1.
- The code is currently extensible to more than 2 profiles, but the fuzzywuzzy comparison handles only 2 strings at a time.
- To resolve this limitation, I did the following:
  - The first profile (index 0) is used as an anchor profile: it is compared with every other profile on the fuzzy fields (first_name, last_name, email_field), and the minimum match % is used to decide whether to increment the total match score.
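The anchor-profile rule described above can be sketched roughly as follows. This is not the code from `Duplicate_Finder.py`: the function names are hypothetical, the three fields are assumed to be concatenated into one string before comparison, and the standard library's `difflib.SequenceMatcher` is swapped in for fuzzywuzzy so the snippet runs without extra packages.

```python
from difflib import SequenceMatcher

FUZZY_FIELDS = ("first_name", "last_name", "email_field")
THRESHOLD = 80  # percent, per the task description


def similarity(a, b):
    """Return a 0-100 similarity score (difflib stands in for fuzzywuzzy here)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def fuzzy_key(profile):
    # Assumption: the mandatory fields are concatenated into one comparison string.
    return "".join(str(profile[f]) for f in FUZZY_FIELDS)


def anchor_match_increment(profiles):
    """Compare profile 0 with every other profile.

    Returns 1 if even the lowest similarity exceeds the threshold, else 0.
    """
    if len(profiles) < 2:
        return 0  # nothing to compare against
    anchor = fuzzy_key(profiles[0])
    lowest = min(similarity(anchor, fuzzy_key(p)) for p in profiles[1:])
    return 1 if lowest > THRESHOLD else 0
```

Taking the minimum means a single dissimilar profile is enough to prevent the increment, which matches the conservative reading of "all profiles are duplicates of the anchor".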