hlink's Introduction

hlink: hierarchical record linkage at scale

hlink is a Python package that provides a flexible, configuration-driven solution to probabilistic record linkage at scale. It offers a high-level Python API as well as a standalone command line interface for running linking jobs with little to no programming. hlink supports the linking process from beginning to end, including preprocessing, filtering, training, model exploration, blocking, feature generation, and scoring.

It is used at IPUMS to link U.S. historical census data, but can be applied to any record linkage job. A paper on the creation and applications of this program on historical census data can be found at https://www.tandfonline.com/doi/full/10.1080/01615440.2021.1985027.

Suggested Citation

Wellington, J., R. Harper, and K.J. Thompson. 2022. "hlink." https://github.com/ipums/hlink: Institute for Social Research and Data Innovation, University of Minnesota.

Installation

hlink requires

  • Python 3.10, 3.11, or 3.12
  • Java 8 or greater for integration with PySpark

You can install the newest version of the Python package directly from PyPI with pip:

pip install hlink

Docs

The documentation site can be found at hlink.docs.ipums.org. This includes information about installation and setting up your configuration files.

An example script and configuration file can be found in the examples directory.

Quick Start

The main class in the library is LinkRun, which represents a complete linking job. It provides access to each of the link tasks and their steps. Here is an example script that uses LinkRun to do some linking.

from hlink.linking.link_run import LinkRun
from hlink.spark.factory import SparkFactory
from hlink.configs.load_config import load_conf_file

# First we create a SparkSession with all default configuration settings.
factory = SparkFactory()
spark = factory.create()

# Now let's load in our config file. See the example config below.
# This config file is in TOML format, but we also allow JSON format.
# Alternatively, you can create a Python dictionary directly with the same
# keys and values as in the config file.
config = load_conf_file("./my_conf.toml")

lr = LinkRun(spark, config)

# Get some information about each of the steps in the
# preprocessing task.
prep_steps = lr.preprocessing.get_steps()
for (i, step) in enumerate(prep_steps):
    print(f"Step {i}:", step)
    print("Required input tables:", step.input_table_names)
    print("Generated output tables:", step.output_table_names)

# Run all of the steps in the preprocessing task.
lr.preprocessing.run_all_steps()

# Run the first two steps in the matching task.
lr.matching.run_step(0)
lr.matching.run_step(1)

# Get the potential_matches table.
matches = lr.get_table("potential_matches")

assert matches.exists()

# Get the Spark DataFrame for the potential_matches table.
matches_df = matches.df()

An example configuration file:

### hlink config file ###
# This is a sample config file for the hlink program in toml format.

# The name of the unique identifier in the datasets
id_column = "id" 

### INPUT ###

# The input datasets
[datasource_a]
alias = "a"
file = "data/A.csv"

[datasource_b]
alias = "b"
file = "data/B.csv"

### PREPROCESSING ###

# The columns to extract from the sources and the preprocessing to be done on them.
[[column_mappings]]
column_name = "NAMEFRST"
transforms = [
    {type = "lowercase_strip"}
]

[[column_mappings]]
column_name = "NAMELAST"
transforms = [
    {type = "lowercase_strip"}
]

[[column_mappings]]
column_name = "AGE"
transforms = [
    {type = "add_to_a", value = 10}
]

[[column_mappings]]
column_name = "SEX"


### BLOCKING ###

# Blocking parameters
# Here we are blocking on sex and +/- age.
# This means that comparisons will only be made between records
# where the SEX fields match exactly and the AGE
# fields are within a distance of 2.
[[blocking]]
column_name = "SEX"

[[blocking]]
column_name = "AGE_2"
dataset = "a"
derived_from = "AGE"
expand_length = 2
explode = true

### COMPARISON FEATURES ###

# Here we detail the comparison features that are
# created between the two records. In this case
# we are comparing first and last names using 
# the jaro-winkler metric.

[[comparison_features]]
alias = "NAMEFRST_JW"
column_name = "NAMEFRST"
comparison_type = "jaro_winkler"

[[comparison_features]]
alias = "NAMELAST_JW"
column_name = "NAMELAST"
comparison_type = "jaro_winkler"

# Here we detail the thresholds at which we would
# like to keep potential matches. In this case
# we will keep only matches where the first name
# jaro winkler score is greater than 0.79 and
# the last name jaro winkler score is greater than 0.84.

[comparisons]
operator = "AND"

[comparisons.comp_a]
comparison_type = "threshold"
feature_name = "NAMEFRST_JW"
threshold = 0.79

[comparisons.comp_b]
comparison_type = "threshold"
feature_name = "NAMELAST_JW"
threshold = 0.84

hlink's People

Contributors

anpumn, bollwyvl, franfabrizio, jakew-umn, mpcit, riley-harper

hlink's Issues

Document Model Exploration step 3

There are only 3 Model Exploration steps listed in the documentation, but the Model Exploration link task has 4 steps. We should document the last step, which is LinkStepGetFeatureImportances. Note that right now the documentation is the same for Model Exploration and Household Model Exploration, but Household Model Exploration doesn't have this last link step, so this is one place where they differ.

Check out https://hlink.docs.ipums.org/link_tasks.html#model-exploration-and-household-model-exploration.

Spark `AnalysisException` in Model Exploration step 3

Running Model Exploration's step 3 "get feature importances" causes Spark to raise a pyspark.sql.utils.AnalysisException with the message "Table or view not found: model_eval_features_list". Somehow this step is expecting this table to exist when it doesn't. Maybe we should add this table to the step's required input tables and look at preceding steps to figure out where the disconnect is occurring.

Loosen dependency versions

Instead of pinning to a particular version of packages like pandas and colorama, let's make use of loose specifications like colorama>=0.4. It may make sense to keep some packages pinned more tightly. Maybe we want pyspark>=3.3,<3.4 or just want to keep pyspark exactly pinned.

This also applies to the version of Python used for testing in the Dockerfile. We can probably pin this to just Python 3.10 instead of Python 3.10.4. There are several more bugfix releases out now.
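
For the Python dependencies, looser specifiers in setup.py might look something like the following. The exact version bounds here are illustrative, not decided.

install_requires = [
    "colorama>=0.4",
    "pandas>=1.4",        # hypothetical lower bound
    "pyspark>=3.3,<3.4",  # or keep pyspark exactly pinned
]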

Log user commands

We should add some logging so that we can see which commands the user is running. This should help give some context to the other logging and errors that are written to the logs.

Maybe there's a pre-command hook that we can make use of to write each command to the logs just before it's run.
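
If the interactive shell is built on Python's cmd.Cmd (an assumption here, not something this issue confirms), a pre-command logging hook could look roughly like this:

import logging
from cmd import Cmd

class Main(Cmd):  # hypothetical stand-in for hlink's interactive shell class
    def precmd(self, line: str) -> str:
        # Write the raw user command to the log just before it runs.
        logging.info("User command: %s", line)
        return line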

Use a different TOML package

The TOML parser that we're using appears to be unmaintained, or at least not regularly updated. It has some known bugs that keep it from matching the TOML standard, and some of its error messages are not particularly helpful. tomli is a newer package that is being added to the standard library in Python 3.11 as tomllib. Let's move to using tomli instead of toml. I believe some of our dependencies are already using tomli, so this change may actually reduce the number of packages installed in the environment.

This may change how hlink handles some configuration files, especially ones that do not meet the TOML standards.
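
A minimal sketch of the swap (the helper name is illustrative; the real load_conf_file() may do more than this):

import tomli  # becomes the standard library tomllib once Python 3.11 is the minimum

def load_toml_config(conf_path: str) -> dict:
    # tomli (and tomllib) require the file to be opened in binary mode.
    with open(conf_path, "rb") as f:
        return tomli.load(f)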

Add type hints to several of the central interface classes

We'd like to add more type hints in hlink. This is a big task, so we can take advantage of Python's gradual typing model to do this in smaller chunks. Here are some modules that might benefit from having type hints added and/or improved:

  • hlink.linking.link_run
  • hlink.linking.link_task
  • hlink.linking.link_step
  • hlink.linking.table
  • hlink.linking.util
  • hlink.configs.load_config

We can update documentation as we go as needed. See also #52.
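
As a tiny, hypothetical example of what a first pass at adding hints could look like (this helper and its signature are made up for illustration):

from pyspark.sql import DataFrame, SparkSession

def get_table_df(spark: SparkSession, table_name: str) -> DataFrame:
    # Annotating small helpers like this one is cheap and doesn't change behavior.
    return spark.table(table_name)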

Evaluate and work on skipped tests

There are 4 skipped tests at the time of writing. Let's evaluate why they're skipped and see if we can fix them up and get them to run.

Pin the Sphinx version in setup.py

Generating the documentation with different versions of Sphinx can cause lots of changes to the html pages. Let's pin the version of Sphinx that's installed with hlink to reduce the number of issues with developers generating the documentation with differing versions of Sphinx. It looks like the most recent Sphinx version is 5.1.1.

Use colorama more simply

Recently the colorama package has changed how it handles initialization. Instead of init() and deinit(), there's now an option to use the just_fix_windows_console() function. This function is safe to call multiple times, and we can just call it when the hlink script starts up. So let's use just_fix_windows_console() once and then not worry about Windows ANSI issues for the rest of the program. Where to call it I'm not quite sure, but it will be nice to be able to get rid of the calls to init() and deinit(). This will require pinning colorama>=0.4.6.

We can still use the colorama Fore and Style classes to print colored text. There are other packages that also do this, but what we're doing is simple enough that colorama should be fine for now.
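
A sketch of what the startup call could look like (exactly where it belongs is still the open question mentioned above):

from colorama import Fore, Style, just_fix_windows_console

def main() -> None:
    # Safe to call more than once; a no-op everywhere except old Windows consoles.
    just_fix_windows_console()
    print(Fore.GREEN + "Welcome to hlink" + Style.RESET_ALL)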

Remove mutable default arguments in `LinkStep`

There's a discussion of mutable default arguments here: https://stackoverflow.com/questions/1132941/least-astonishment-and-the-mutable-default-argument. I didn't know that before! There are some empty lists passed as default arguments in the constructor for LinkStep. This isn't causing any issues right now, but it could cause issues in the future, as it gives many different places in the code unwitting write access to the same lists.

Let's replace the lists with tuples or do something similar to prevent the issue.
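
A sketch of the pattern, using a hypothetical constructor signature rather than LinkStep's real one:

class LinkStep:
    # Hypothetical signature: an immutable tuple default instead of a mutable list.
    def __init__(self, name: str, input_table_names: tuple[str, ...] = ()) -> None:
        self.name = name
        # Callers can still pass lists; normalize to a tuple internally.
        self.input_table_names = tuple(input_table_names)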

Make a `types` or `typing` module with aliases useful for type hinting

We'd like to gradually work on adding more type hinting into hlink. I suspect that there are some rather complex types that we'll use often. We could create a types or typing module that could be used to simplify these types for type hinting. The usage might look something like

from hlink.types import Spark, SparkDF, HlinkConfig

def fn(config: HlinkConfig, spark: Spark) -> SparkDF:
    ...

And the actual module contents would be pretty simple. Maybe something like

from typing import Any

import pyspark.sql

Spark = pyspark.sql.session.SparkSession
SparkDF = pyspark.sql.dataframe.DataFrame
HlinkConfig = dict[str, Any]

Make configuration more reliable with stronger typing

We've thought about making use of Rust for this, because the way it handles this kind of strongly typed parsing is absurdly cool. The looser parsing on the Python side may make this change hard to do in a completely backwards-compatible manner. Some TOML formatting issues that the Python parser tolerates might cause errors when parsed with Rust.

The general idea is to use maturin to build the Rust crate as a Python module, then call that module from existing Python to parse the hlink config file. We could use the Rust toml crate, which has support for defining the configuration as a Rust struct with derive macros. With some magic from the Rust serde crate, we can parse an enumeration like comparison types without any changes to the current configuration format.

Detect comparison_features with the same output column in config validation

If two [[comparison_features]] sections share an "alias" or output column, Spark will raise an error due to a duplicate column when it's computing tmp_potential_matches_prepped in matching step 2. We can check for this when we're validating the config file to raise an error much earlier in the process.

A similar check may also apply to other sections like [[feature_selections]] and [[column_mappings]].
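
A sketch of the check, assuming the validation code can see the parsed config as a dict (the helper name and the fallback to column_name are assumptions):

from collections import Counter

def check_duplicate_comparison_features(config: dict) -> None:
    aliases = [
        feature.get("alias", feature.get("column_name"))
        for feature in config.get("comparison_features", [])
    ]
    duplicates = [alias for alias, count in Counter(aliases).items() if count > 1]
    if duplicates:
        raise ValueError(
            f"comparison_features have duplicate output columns: {duplicates}"
        )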

In Model Exploration, models with too many parameters can overflow the table name buffer

In Model Exploration, some of the names of the tables generated depend on the parameters to the models. Non-alphanumeric characters are replaced with underscores _. If the model has many complicated parameters or even a lot of whitespace in its parameter list, this can cause the table name to be longer than 128 characters, which causes Spark to throw an error.

Here's an example table name: model_eval_precision_recall_curve_random_forest__maxdepth___7___numtrees___100___threshold_ratio___1_2_. This name fits in the buffer, but adding some more parameters would likely cause an issue.
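
One possible mitigation (a sketch, not a decided design) is to replace an over-long parameter suffix with a short digest so the total name stays under the limit:

import hashlib

def safe_table_name(prefix: str, params: str, max_length: int = 128) -> str:
    # Hypothetical helper: keep readable names when they fit, hash them when they don't.
    name = f"{prefix}_{params}"
    if len(name) <= max_length:
        return name
    digest = hashlib.sha1(params.encode()).hexdigest()[:12]
    return f"{prefix}_{digest}"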

Don't use `spark.sql.legacy.allowUntypedScalaUDF`

We set this Spark configuration option to true when we were migrating from Spark 2 to Spark 3. This makes our Scala code compatible with Spark 3, but the long term solution is to modify our Scala code to satisfy the new requirements for Spark 3 and avoid setting this configuration option. The configuration option is set in hlink.spark.session.SparkConnection.

From the Spark migration guide at https://spark.apache.org/docs/3.1.1/sql-migration-guide.html:

"In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default. Remove the return type parameter to automatically switch to typed Scala udf is recommended, or set spark.sql.legacy.allowUntypedScalaUDF to true to keep using it. In Spark version 2.4 and below, if org.apache.spark.sql.functions.udf(AnyRef, DataType) gets a Scala closure with primitive-type argument, the returned UDF returns null if the input values is null. However, in Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, val f = udf((x: Int) => x, IntegerType), f($"x") returns null in Spark 2.4 and below if column x is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default."

Support Python 3.12

Python 3.12 was just recently released. With the work in #92 and #90, adding support for Python 3.12 should be fairly easy. We will need to wait for our dependencies, especially pyspark, pandas, and numpy, to support Python 3.12 first though. There may be some API changes or bugs to fix during the upgrade.

Document some undocumented comparison features

This is a checklist issue for documenting some comparison types and feature selection transforms that I've found are undocumented.

Comparison Types to Document

  • present_and_equal_categorical_in_universe
  • caution_comp_3_012
  • caution_comp_4_012
  • present_and_matching_categorical
  • not_zero_and_not_equals

All of these comparison features are implemented in hlink/linking/core/comparison_feature.py.

caution_comp_3_012 and caution_comp_4_012 are variants of caution_comp_3 and caution_comp_4, respectively. Instead of returning just 0 or 1, they return 0, 1, or 2. The 2 is returned as a special case, and if caution_comp_3_012 returns 0 or 1, then it agrees with caution_comp_3, and similarly for caution_comp_4_012 and caution_comp_4.

Log to a separate file for each hlink run

Right now all of the logs for hlink go to the same file. This is fine when only one instance of hlink is running at a time, but it causes issues when multiple instances are running and trying to write to the same file. To fix this, let's give each run of hlink its own log file. They can all go in the same directory.

The name of the log file should be <config_name>-<hexadecimal_uuid>.log. We are already generating a hexadecimal uuid for each run in main._setup_logging().

Most of these changes will need to happen in main.load_conf() and main._setup_logging().
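
A sketch of the setup, following the naming scheme above (the helper name is illustrative; the real change would land in main._setup_logging()):

import logging
import uuid
from pathlib import Path

def setup_run_logging(config_name: str, log_dir: str = "logs") -> Path:
    # One log file per run: <config_name>-<hexadecimal_uuid>.log
    run_id = uuid.uuid4().hex
    log_file = Path(log_dir) / f"{config_name}-{run_id}.log"
    log_file.parent.mkdir(parents=True, exist_ok=True)
    logging.basicConfig(filename=log_file, level=logging.INFO)
    return log_file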

Add support for aliasing columns in only one input dataset

Right now any [[column_mappings]] in the configuration file that alias columns do so for both input datasets at once. It would be nice if you could optionally alias a column from a single dataset. This would allow linking two datasets with similar data whose column names are slightly different.

One workaround for this issue is to rename the columns in the input datasets so that they match, but this is inconvenient.

Drop the "comment" column from the results of the desc command

The DataFrame returned by DESCRIBE table includes three columns: "col_name", "data_type", and "comment". Since we don't make use of comments in hlink, the comment column is always full of nulls. Let's drop this column since it's not providing any useful information.

This should be a simple change in hlink.scripts.lib.table_ops.show_table_columns(). We can call .drop("comment") on the DataFrame returned by DESCRIBE.
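
A sketch of the change, assuming the command runs DESCRIBE through spark.sql():

def show_table_columns(spark, table_name: str) -> None:
    # DESCRIBE returns col_name, data_type, and comment; comment is always null for us.
    spark.sql(f"DESCRIBE {table_name}").drop("comment").show()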

Don't pin CI/CD to Java 11

Pinning to this older Java version prevents us from pulling some Docker images that are on newer versions of Ubuntu. From the Spark docs, pyspark just needs Java>=8. So let's pull the default JDK for the OS instead of always pulling JDK 11. This will make our CI/CD more flexible and let us unpin from Debian bullseye in the Dockerfile.

It feels safe to assume that the default Java on these images will be at least version 8, which is very old at this point.

Generate API documentation

We ought to generate some API documentation for LinkRun and all the related classes and functions that are part of hlink's API. These docs could be linked to from the current hlink.docs.ipums.org website, or they could be a subsection of that site.

I'm pretty sure that Sphinx is capable of generating this sort of documentation, so let's try that out sometime since we're already using Sphinx.

`main_loop.Main` doesn't consistently reload its autocomplete cache

main_loop.Main keeps an autocomplete cache of existing tables so that it knows how to autocomplete arguments to some commands. It doesn't update the cache after each operation that can create or drop tables though, so the autocompletion can get out of sync with the actual tables that exist.

There are at least a couple of different ways to approach this:

  • Update main_loop.Main to always update the cache after each operation that could create or drop tables.
  • Remove the caching and compute which tables exist "on the fly" whenever we need to autocomplete things. I think this is the nicer solution, but it might make tab completion too slow.

comparison_features marked as categorical = false are still treated as categorical

Comparison features may be either numerical/continuous or categorical. Adding the categorical = true attribute to a comparison_features block makes it categorical. But there's a bug where the comparison feature is treated as categorical no matter what value categorical is set to. categorical = false does not make the comparison feature numerical.

The responsible code is in hlink/linking/core/pipeline.py, in _calc_categorical_features():

# Check for categorical features in all comparison features
for x in comparison_features:
    if x["alias"] in cols:
        if "categorical" in x.keys():
            categorical_comparison_features.append(x["alias"])
    else:
        continue

Unlike the code for pipeline features below it, this code just checks that the categorical key is present. We should check x.get("categorical", False) instead of "categorical" in x.keys().

While we're working on this bit of code, I see a few more improvements we could make (a combined sketch follows this list).

  • Remove the else clause, since it's not doing anything here.
  • Come up with a more descriptive name for the variable x.
  • Add some logging debug or info statements listing the categorical features sometime after they've been computed.
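
Putting the fix and these cleanups together might look something like the following. The function signature here is illustrative and may not match the real _calc_categorical_features() exactly.

import logging

def _calc_categorical_features(comparison_features, cols):
    # Only treat a comparison feature as categorical when the config
    # explicitly sets categorical = true.
    categorical_comparison_features = []
    for comparison_feature in comparison_features:
        if comparison_feature["alias"] not in cols:
            continue
        if comparison_feature.get("categorical", False):
            categorical_comparison_features.append(comparison_feature["alias"])
    logging.debug(
        "Categorical comparison features: %s", categorical_comparison_features
    )
    return categorical_comparison_features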

For now this bug can be worked around by just not providing the categorical key at all in comparison features that are not intended to be categorical.

Use importlib.metadata instead of pkg_resources

We're getting these warnings from the tests:

/home/rileyh/open_source/hlink/hlink/scripts/main.py:12: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

Instead of pkg_resources, we can use importlib.metadata, which offers a similar API in a more efficient and reliable way.
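
Assuming the current pkg_resources call is only used to look up hlink's installed version, the replacement is small:

from importlib.metadata import version

# Replaces something like pkg_resources.get_distribution("hlink").version
hlink_version = version("hlink")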

Rename some `matching_scoring_test` tests

The tests test_step_3_skip_on_no_conf() and test_step_3_alpha_beta_thresholds() in matching_scoring_test.py both mention Matching step 3, which doesn't exist. I believe they're both testing Matching step 2, so we should rename them to line up with that. They both have TODO comments above them as reminders.

Model Exploration step 3 seems to be broken

This step has been breaking for me recently, so I took a look at it. It seems to me that model_exploration.link_step_get_feature_importances is not reading in its chosen model correctly. It expects step 2 (link_step_train_test_models) to serialize the model to a particular path. But I believe step 2 instead saves the model to a dictionary on the LinkRun.

This causes step 3 to error out each time it looks for the chosen model because it can't find it. We don't have any tests that cover this module, so adding some tests that run all of the steps for Model Exploration in a row might be very helpful.

Don't install `hlink.tests` as an importable module

Right now (before and after the changes in #71), users can import hlink.tests and its submodules. This doesn't seem like a big issue to me, but it is weird. It would be nice if the tests were not a submodule of hlink. Maybe we can move them to the top-level in the directory structure so that they don't get included in the distribution and wheel? See also #67, which may go well with this issue.

Enable running tests with just the command `pytest`

Right now you need to type pytest hlink/tests to run hlink's tests. If you type pytest without any arguments, this error is printed:

====================================================== test session starts =======================================================
platform linux -- Python 3.10.4, pytest-7.1.2, pluggy-1.0.0
rootdir: /hlink, configfile: pytest.ini
plugins: cov-4.0.0
collected 0 items / 1 error                                                                                                      

============================================================= ERRORS =============================================================
_________________________________________________ ERROR collecting test session __________________________________________________
Defining 'pytest_plugins' in a non-top-level conftest is no longer supported:
It affects the entire test suite instead of just below the conftest as expected.
  /hlink/hlink/tests/conftest.py
Please move it to a top level conftest file at the rootdir:
 /hlink
For more information, visit:
  https://docs.pytest.org/en/stable/deprecations.html#pytest-plugins-in-non-top-level-conftest-files
==================================================== short test summary info =====================================================
ERROR 
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
======================================================== 1 error in 7.66s ========================================================

It would be nice to look into changing this so that the tests can be run as just 'pytest'. This may involve some changes to CI/CD as well to change the invocation of pytest there.

Create a working tutorial

We have a tutorial in examples/tutorial, but it is missing data to run on. Let's make some example datasets so that the tutorial can be run without having to do any extra work. This should be really helpful for people working with hlink for the first time who are wanting to check out how it works and get it set up. I don't think we need very many columns in the example datasets, just NAMEFRST, NAMELAST, AGE, and SEX. The two datasets are expected to be samples taken 10 years apart (so AGE will increase by about 10).
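
A sketch of how tiny example datasets could be generated (all names and values below are made up; pandas is assumed to be available since it's already a dependency):

from pathlib import Path

import pandas as pd

Path("data").mkdir(exist_ok=True)

a = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "NAMEFRST": ["jane", "john", "mary"],
        "NAMELAST": ["doe", "smith", "jones"],
        "AGE": [23, 30, 41],
        "SEX": [2, 1, 2],
    }
)
# The second sample is taken 10 years later, so ages increase by about 10.
b = a.assign(AGE=a["AGE"] + 10)

a.to_csv("data/A.csv", index=False)
b.to_csv("data/B.csv", index=False)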

Check for id column in config validation should be case-insensitive

The case of column names doesn't matter to hlink. So the check in hlink.scripts.lib.conf_validations.check_datasource should be case-insensitive so that it doesn't throw spurious errors when the id column name is capitalized differently in the config file and the datasources.
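
A sketch of the comparison, with illustrative names (the real check lives in check_datasource):

def check_id_column(id_column: str, datasource_columns: list[str]) -> None:
    # Compare case-insensitively, since column name case doesn't matter to hlink.
    if id_column.lower() not in {column.lower() for column in datasource_columns}:
        raise ValueError(f"id column '{id_column}' not found in datasource")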

Improve tests for `conf_validations`

The conf_validations module is not particularly well tested. We should at least have one test that has a valid config file so that we can confirm that all of the steps run without errors on a valid config file.

There are lots of edge cases and error conditions to be triggered during config validation. Trying to hit all of them is probably not worth it, at least for this issue, but hitting some of the more important or fragile ones may be helpful. It is helpful that the module is divided into several smallish functions, which should make it easier to test.

Refactor a few core modules

There are a few small refactoring changes that would be good to make in some of the core modules. These are

  • The id_col argument to core.comparison_feature.create_feature_tables() is not used. Let's get rid of it.
  • The threshold_ratio argument to core.threshold._apply_alpha_threshold() is also unused, so we can get rid of it too.
  • core.transform has two * imports that make it hard to check for undefined variables. The pyspark.sql.types import may be able to go away entirely.

`JaroWinklerSimilarity` returns 1 for two empty strings

When we updated Apache Commons Text (used from our Scala code) from 1.4 to 1.9 in version 3.2.0 of hlink, we started using JaroWinklerSimilarity instead of JaroWinklerDistance. Commons Text renamed JaroWinklerDistance to JaroWinklerSimilarity because it was actually computing the similarity between two strings. But the behavior on empty strings also changed, and we didn't notice that until now.

With the new JaroWinklerSimilarity, two empty strings "" and "" have computed similarity 1.0. This used to be 0.0 with the old JaroWinklerDistance. We may need to have a special case to handle this in hlink, as we would like two empty strings to have similarity 0.0.
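
The real fix belongs in the Scala UDF, but the desired behavior is simple to state. Here is a Python sketch of the special case, where the inner similarity function is just a placeholder:

from collections.abc import Callable

def jaro_winkler_with_empty_case(
    a: str, b: str, similarity: Callable[[str, str], float]
) -> float:
    # Two empty strings should score 0.0, matching the old JaroWinklerDistance behavior.
    if a == "" and b == "":
        return 0.0
    return similarity(a, b)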

Handle warnings output by tests

There are four unique warnings output when the tests run (they are repeated many times though). Three of the warnings are deprecation warnings coming from PySpark, which we don't have much control over. Hopefully they will be fixed in future versions of PySpark; we'll see.

The last warning is

/usr/local/lib/python3.10/site-packages/sklearn/metrics/_ranking.py:874: UserWarning: No positive class found in y_true, recall is set to one for all thresholds.

This is caused by the fact that some of our test sets are very small, and so might not have a good mix of positive and negative classifications. This isn't important for this test, so let's try to ignore or silence the error in pytest. It would be good if we could scope this to the single test instead of doing it globally for all tests.
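
pytest can scope the filter to a single test with the filterwarnings marker; a sketch (the test name here is hypothetical):

import pytest

@pytest.mark.filterwarnings("ignore:No positive class found in y_true")
def test_model_exploration_on_tiny_dataset():
    ...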

Use single preceding underscores instead of double underscores for private methods and functions

This is a style choice that we've slowly been moving to as we make other changes. Instead of something like __method(), we use _method(). __method() does some mangling of the name of the method, but _method() is mostly just a documentation indication that the method is considered private and not part of the public API.

Similarly, let's move from __function() to _function() for consistency.

Support Python 3.11

Let's do some testing and fix any issues that come up when upgrading to Python 3.11. We should support both Python 3.10 and Python 3.11. I would like to have GitHub Actions run on both of these Python versions for each commit. That way we can make sure that we're supporting both as we move forward.

Python 3.11 is also quite a bit faster than previous versions and gives us access to the tomllib standard library module. We won't be able to directly use tomllib as long as we support Python 3.10, though.

Cover the `pip install .` use case in GitHub Actions

Currently our GitHub Actions install hlink in editable mode with the -e flag and with additional development dependencies. We need to install the development dependencies for linting and testing, but we could probably get away with not passing the -e flag to pip install. This might help prevent some installation bugs where pip install .[dev] doesn't copy over needed resource files. We ran into one of these bugs recently with Jinja.

We may also want to set up some GitHub Actions that just run pip install . without installing development dependencies and then confirm that some of the basics of hlink are working as expected. pytest won't be installed in this situation since it's a development dependency.

Move from setup.py to pyproject.toml

The declarative pyproject.toml format is a better choice for dependency specification than setup.py is for hlink. We should move as much as we can out of setup.py and into pyproject.toml. We already have a pyproject.toml file, but it's just got some settings for black in it at the moment.

We'll need to be careful that we continue to provide all of the metadata that we need for PyPI and that twine and the upload process support pyproject.toml as well as setup.py.

See also #38, which may help us move logic out of setup.py and simplify it.

Improve error handling for when two instances of hlink try to point to the same database

When two instances of hlink try to load data from the same database, a huge error is printed by Spark. One of the important lines printed looks like

ERROR XSDB6: Another instance of Derby may have already booted the database <database_path>

With Spark 2, I was able to catch the Python exception that's thrown and print a more useful error message, but the huge JVM stack trace was still printed. In Spark 3, I think that we can set the spark.sql.pyspark.jvmStacktrace.enabled configuration option to prevent Spark from printing these stack traces. This may have an effect on other places where we get large JVM stack traces too.
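
A sketch of how the option could be set when building the session; the real change would go in hlink.spark.session.SparkConnection:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Keep JVM stack traces out of the Python-facing exception messages.
    .config("spark.sql.pyspark.jvmStacktrace.enabled", "false")
    .getOrCreate()
)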

Use something like a MANIFEST.in file to define package data

Right now there's some gnarly logic in setup.py that adds the Jinja template files as package data so that they get included with the Python installation. We may be able to use something like a MANIFEST.in file that just lists the directories to include instead of having this logic. This would make the setup.py file simpler.

Here's some documentation for how to use MANIFEST.in: https://packaging.python.org/en/latest/guides/using-manifest-in/. It could be there's a better way than using MANIFEST.in, so we could look into that as well.

Use pyspark's `Interaction` class

In hlink/linking/transformers/interaction_transformer.py, we have a backported copy of the Interaction class from pyspark. Now that we're on pyspark 3, this class is available to us, so we can go ahead and get rid of the backport and use pyspark's version instead.

Looking at the documentation and source code for pyspark (https://spark.apache.org/docs/3.3.0/api/python/reference/api/pyspark.ml.feature.Interaction.html), this should hopefully be a very easy switch over without any other changes needed.
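
A sketch of the pyspark class in use (the column names are placeholders, not hlink's real pipeline columns):

from pyspark.ml.feature import Interaction, VectorAssembler

assembler = VectorAssembler(
    inputCols=["NAMEFRST_JW", "NAMELAST_JW"], outputCol="jw_features"
)
interaction = Interaction(
    inputCols=["jw_features", "AGE"], outputCol="interacted_features"
)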
