Git Product home page Git Product logo

schuylkill's Introduction

schuylkill

Build Status Coverage Status t PyPi version

Fixing human errors by matching those hard-to-spell words.

This Python utility merges pandas DataFrames based on two string columns using a variety of matching techniques, including fuzzy merging (using the fuzzywuzzy package).

Installation

Via PyPi:

pip install schuylkill

Examples

Exact Matching

You can merge data based on string functions: "exact", "startswith", "contains":

>>> import schuylkill as skool
>>> import pandas as pd

# Create the data
>>> left = pd.DataFrame({"street": ["Wash", "road", "Test"], "x": [1, 2, 3]})
>>> right = pd.DataFrame({"street": ["Washington", "Market", "Broad"], "y": [1, 2, 3]})

# Merge based on "contains"
>>> merged = skool.exact_merge(left, right, on="street", how="contains")
  street_x  x  right_index    street_y    y
0     Wash  1          0.0  Washington  1.0
1     road  2          2.0       Broad  3.0
2     Test  3          NaN         NaN  NaN

You can also use the pandas pipe() function to chain multiple merges together:

# Create the data
>>> left = pd.DataFrame({"street": ["Washington", "Mark", "road"], "x": [1, 2, 3]})
>>> right = pd.DataFrame({"street": ["Washington", "Market", "Broad"], "y": [4, 5, 6]})

# Combine multiple merges
>>> merged = (
    skool.exact_merge(left, right, on="street", how="exact")
    .pipe(skool.exact_merge, left, right, on="street", how="startswith")
    .pipe(skool.exact_merge, left, right, on="street", how="contains")
)
     street_x  x  right_index    street_y    y
0  Washington  1          0.0  Washington  4.0
1        Mark  2          1.0      Market  5.0
2        road  3          2.0       Broad  6.0

In the above example, each merge performed matches one row, and the final merged data frame has three matches.

Fuzzy Matching

Fuzzy matching based on a score threshold is also available:

# Create the data
>>> left = pd.DataFrame({"street": ["Washington", "Market", "Broad"], "x": [1, 2, 3]})
>>> right = pd.DataFrame({"street": ["Washington", "Mrkt", "Brd"], "y": [4, 5, 6]})

# Perform a fuzzy merge
>>> merged = skool.fuzzy_merge(left, right, on="street", score_cutoff=70)
     street_x  x  match_probability  right_index    street_y    y
0  Washington  1               1.00          0.0  Washington  4.0
1      Market  2               0.80          1.0        Mrkt  5.0
2       Broad  3               0.75          2.0         Brd  6.0

schuylkill's People

Stargazers

 avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.