HoloClean over Spark and PyTorch
v0.1.0
Noisy and erroneous data is a major bottleneck in analytics. Data cleaning and repairing account for about 60% of the work of data scientists. To address this bottleneck, we recently introduced HoloClean, a semi-automated data repairing framework that relies on statistical learning and inference to repair errors in structured data. In HoloClean, we build upon the paradigm of weak supervision and demonstrate how to leverage diverse signals, including user-defined heuristic rules (such as generalized data integrity constraints) and external dictionaries, to repair erroneous data.
HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and multiple other signals to build a probabilistic model that accurately captures the data generation process, and uses the model in a variety of data curation tasks. HoloClean allows data practitioners and scientists to save the enormous time they spend building piecemeal cleaning solutions and instead communicate their domain knowledge in a declarative way to enable accurate analytics, predictions, and insights from noisy, incomplete, and erroneous data.
HoloClean has three key properties:
It is the first holistic data cleaning framework that combines a variety of heterogeneous signals, such as integrity constraints, external knowledge, and quantitative statistics, in a unified framework.
It is the first data cleaning framework driven by probabilistic inference. Users only need to provide a dataset to be cleaned and describe high-level, domain-specific signals.
It can scale to large real-world dirty datasets and perform automatic repairs that are two times more accurate than state-of-the-art methods.
For more information, read our blog post.
- HoloClean: Holistic Data Repairs with Probabilistic Inference (VLDB 2017)
This file goes through the steps needed to install the packages and software required to run HoloClean. For a more detailed installation guide, check out the Holoclean_Installation_v3.pdf file in the git repo.
1.1 Ubuntu: For 32-bit machines, run:
wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda-2.3.0-Linux-x86.sh
bash Anaconda-2.3.0-Linux-x86.sh
For 64-bit machines, run:
wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda-2.3.0-Linux-x86_64.sh
bash Anaconda-2.3.0-Linux-x86_64.sh
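If it is unclear which of the two installers applies, the machine architecture can be checked first. The sketch below is an illustration and not part of the HoloClean repo: anaconda_installer is a hypothetical helper, and it assumes that uname -m reports x86_64 on 64-bit machines and a value from i386 to i686 on 32-bit ones.

```shell
# Hypothetical helper: map a `uname -m` value to the installer file name used above.
anaconda_installer() {
  case "$1" in
    x86_64) echo "Anaconda-2.3.0-Linux-x86_64.sh" ;;
    i386|i486|i586|i686) echo "Anaconda-2.3.0-Linux-x86.sh" ;;
    *) echo "no installer listed for architecture: $1" >&2; return 1 ;;
  esac
}

# Print the installer that matches this machine, or a note if none does.
anaconda_installer "$(uname -m)" || echo "Pick an installer manually."
```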
Create a Python 2.7 conda environment named py27Env by running:
conda create -n py27Env python=2.7 anaconda
Then the environment can be activated by running:
source activate py27Env
Make sure to keep the environment activated for the rest of the installation process.
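One way to confirm that the environment is active is to look at the version reported by the python on your PATH. This is a minimal sketch: is_python27 is a hypothetical helper, and it assumes the version string starts with "Python 2.7." as Python 2.7 interpreters normally report.

```shell
# Hypothetical helper: succeed if a `python --version` style string reports 2.7.x.
is_python27() {
  case "$1" in
    "Python 2.7."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Python 2.7 prints its version to stderr, so capture stderr as well.
if is_python27 "$(python --version 2>&1)"; then
  echo "python 2.7 environment is active"
else
  echo "python on PATH is not 2.7; run: source activate py27Env"
fi
```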
Note: before you install Spark, you may need to install Scala on your system.
Download the spark-2.2.0-bin-hadoop2.7.tgz file from the Spark website.
Go to the directory where you downloaded the file and run:
tar -xzf spark-2.2.0-bin-hadoop2.7.tgz
pip install pyspark
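To confirm that the archive was extracted as expected, one can look for the spark-submit launcher that Spark distributions ship under bin/. This is only a sketch: spark_dir_ok is a hypothetical helper, and the path assumes the check is run from the directory where the tar command above was executed.

```shell
# Hypothetical helper: succeed if the given directory looks like an extracted
# Spark distribution, i.e. it contains an executable bin/spark-submit.
spark_dir_ok() {
  [ -x "$1/bin/spark-submit" ]
}

# Assumed location: the directory created by the tar command, relative to the current directory.
spark_dir="./spark-2.2.0-bin-hadoop2.7"
if spark_dir_ok "$spark_dir"; then
  echo "Spark found in $spark_dir"
else
  echo "Spark not found in $spark_dir; check where the archive was extracted"
fi
```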
3.1 For Ubuntu: update and upgrade your packages with apt-get:
sudo apt-get update
sudo apt-get upgrade
Install MySQL by running:
sudo apt-get install mysql-server
3.2 For MacOS
Install and run the MySQL .dmg file for MacOS from https://dev.mysql.com/downloads/mysql/
After the installation is finished, open System Preferences, click on the MySQL icon, and make sure the MySQL Server Instance is running.
Next, run:
sudo /usr/local/mysql/bin/mysql_secure_installation
Set a new root password and use the default options for the other prompts.
3.3 Create MySQL User and Database
Go to the repo's root directory and run the script:
./mysql_script.sh
Again, go to the repo's root directory and run:
pip install -r python-package-requirement.txt
To install PyTorch, follow the instructions for your OS at:
http://pytorch.org/
Make sure to select Python 2.7 for the installation (the other settings can be left as default).
Make sure to install version 0.3.0 or later.
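The installed PyTorch version can be read from torch.__version__ and compared against the 0.3.0 minimum. The sketch below is illustrative only: torch_version_ok is a hypothetical helper that compares just the major and minor components, which is enough for a 0.3.0 threshold, and it treats an empty or non-numeric version string as a failure.

```shell
# Hypothetical helper: succeed if a dotted version string is 0.3 or later.
torch_version_ok() {
  tv_major=$(echo "$1" | cut -d. -f1)
  tv_minor=$(echo "$1" | cut -d. -f2)
  # Reject empty or non-numeric components (e.g. when torch is not installed).
  for tv_part in "$tv_major" "$tv_minor"; do
    case "$tv_part" in
      ''|*[!0-9]*) return 1 ;;
    esac
  done
  [ "$tv_major" -gt 0 ] || [ "$tv_minor" -ge 3 ]
}

# Ask the active python for the torch version; stderr is dropped if torch is missing.
if torch_version_ok "$(python -c 'import torch; print(torch.__version__)' 2>/dev/null)"; then
  echo "PyTorch 0.3.0 or later is installed"
else
  echo "PyTorch 0.3.0 or later was not found in the active environment"
fi
```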
6.1 For Ubuntu:
Check if you have JDK 8 installed by running:
java -version
If you do not have JDK 8, run the following command:
sudo apt-get install openjdk-8-jre
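Note that Java 8 reports its version as 1.8.x, so the output of java -version will not contain a bare "8". A minimal sketch of that check follows; is_java8 is a hypothetical helper, and it assumes the output contains the usual version "1.8.0_..." text.

```shell
# Hypothetical helper: succeed if `java -version` style output reports Java 8 (1.8.x).
is_java8() {
  case "$1" in
    *'version "1.8.'*) return 0 ;;
    *) return 1 ;;
  esac
}

# java -version writes to stderr, so redirect it to capture the text.
if is_java8 "$(java -version 2>&1)"; then
  echo "Java 8 detected"
else
  echo "Java 8 not detected"
fi
```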
6.2 For MacOS
Check if you have JDK 8 by running:
/usr/libexec/java_home -V
If you do not have JDK 8, download and install JDK 8 for MacOS from the Oracle website: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
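Once all of the steps above are done, a quick pass over the command-line tools they install can catch anything that was skipped. This is a sketch and not a script from the repo: have_cmd is a hypothetical helper, and the list of commands is an assumption based on the tools named in this guide.

```shell
# Hypothetical helper: succeed if the named command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Report which of the tools used in this guide can be found.
for cmd in conda python pip mysql java; do
  if have_cmd "$cmd"; then
    echo "found:   $cmd"
  else
    echo "missing: $cmd"
  fi
done
```

On MacOS, the MySQL binaries installed from the .dmg live under /usr/local/mysql/bin, so mysql may be reported as missing unless that directory is on your PATH.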
To get started, the following tutorials in the tutorial directory will get you familiar with the HoloClean framework.
To run the tutorials in Jupyter Notebook, go to the repo's root directory in the terminal and run:
./start_notebook.sh
Data Loading & Denial Constraints Tutorial
Complete Pipeline
Error Detection