HoloClean: Weakly Supervised Data Cleaning

HoloClean over Spark and PyTorch

Status

v0.1.0

Data Cleaning with HoloClean

Noisy and erroneous data is a major bottleneck in analytics. Data cleaning and repairing account for about 60% of the work of data scientists. To address this bottleneck, we recently introduced HoloClean, a semi-automated data repairing framework that relies on statistical learning and inference to repair errors in structured data. In HoloClean, we build upon the paradigm of weak supervision and demonstrate how to leverage diverse signals, including user-defined heuristic rules (such as generalized data integrity constraints) and external dictionaries, to repair erroneous data.

HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and multiple other signals to build a probabilistic model that accurately captures the data generation process, and uses the model in a variety of data curation tasks. HoloClean allows data practitioners and scientists to save the enormous time they spend building piecemeal cleaning solutions and, instead, effectively communicate their domain knowledge in a declarative way to enable accurate analytics, predictions, and insights from noisy, incomplete, and erroneous data.
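As an illustration of the kind of quality rule mentioned above, the sketch below flags rows that violate a functional dependency (rows that agree on one attribute must also agree on another). The sample table, the column names `zip` and `city`, and the helper `find_fd_violations` are hypothetical and are not part of HoloClean's API; HoloClean uses such rules as one signal among several rather than as a hard check.

```python
from collections import defaultdict


def find_fd_violations(rows, lhs, rhs):
    """Return the indices of rows involved in a violation of lhs -> rhs."""
    # Collect every distinct rhs value observed for each lhs value.
    values_by_key = defaultdict(set)
    for row in rows:
        values_by_key[row[lhs]].add(row[rhs])
    # A row is suspicious when its lhs value maps to more than one rhs value.
    return [i for i, row in enumerate(rows)
            if len(values_by_key[row[lhs]]) > 1]


# Hypothetical sample data: the third row contains a typo in "city".
rows = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},
    {"zip": "53703", "city": "Madison"},
]

print(find_fd_violations(rows, "zip", "city"))
```

Note that the rule alone marks all three rows with zip 60608 as suspicious; it cannot say which cell is wrong. Resolving that ambiguity is where additional signals come in.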

HoloClean has three key properties:

  • It is the first holistic data cleaning framework that combines a variety of heterogeneous signals, such as integrity constraints, external knowledge, and quantitative statistics, in a unified framework.

  • It is the first data cleaning framework driven by probabilistic inference. Users only need to provide a dataset to be cleaned and describe high-level domain-specific signals.

  • It can scale to large real-world dirty datasets and perform automatic repairs that are two times more accurate than state-of-the-art methods.
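The "quantitative statistics" signal in the list above can be illustrated with a toy example. The sketch below counts how often each value co-occurs with a given context value and proposes the most frequent one as a repair candidate, together with its relative frequency. This is a deliberately simplified, hypothetical stand-in: the helper `suggest_repair` and the sample data are assumptions, and HoloClean's actual probabilistic model is not shown here.

```python
from collections import Counter


def suggest_repair(rows, context_attr, target_attr, context_value):
    """Return (most frequent target value, relative frequency) for a context value."""
    counts = Counter(row[target_attr] for row in rows
                     if row[context_attr] == context_value)
    if not counts:
        # No row shares this context value, so there is no evidence to use.
        return None, 0.0
    value, freq = counts.most_common(1)[0]
    return value, freq / float(sum(counts.values()))


# Hypothetical sample data: one of four rows for zip 60608 has a typo.
rows = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},
]

print(suggest_repair(rows, "zip", "city", "60608"))
```

A frequency like this is only weak evidence on its own, which is why the list above stresses combining it with constraints and external knowledge.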

For more information, read our blog post.

References

  • HoloClean: Holistic Data Repairs with Probabilistic Inference (VLDB 2017)

Installation

This file goes through the steps needed to install the packages and software required to run HoloClean. For a more detailed installation guide, check out the Holoclean_Installation_v3.pdf file in the git repo.

1. Setting Up and Using Conda

1.1 Ubuntu: For 32-bit machines, run:

wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda-2.3.0-Linux-x86.sh
bash Anaconda-2.3.0-Linux-x86.sh

For 64-bit machines, run:

wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda-2.3.0-Linux-x86_64.sh
bash Anaconda-2.3.0-Linux-x86_64.sh
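If you are unsure which of the two installers applies to your machine, the following sketch maps the architecture string reported by Python's `platform.machine()` to the installer file names used above. The helper `anaconda_installer` and the list of recognized architecture strings are assumptions made for illustration; other architectures are rejected rather than guessed.

```python
import platform


def anaconda_installer(machine):
    """Map a machine string (as from platform.machine()) to the installer file name."""
    machine = machine.lower()
    if machine in ("x86_64", "amd64"):
        suffix = "x86_64"   # 64-bit installer
    elif machine in ("i386", "i486", "i586", "i686"):
        suffix = "x86"      # 32-bit installer
    else:
        raise ValueError("no installer listed for architecture: %s" % machine)
    return "Anaconda-2.3.0-Linux-%s.sh" % suffix


if __name__ == "__main__":
    try:
        print(anaconda_installer(platform.machine()))
    except ValueError as err:
        print(err)
```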

1.2 MacOS:

Follow the official instructions to install Anaconda (not Miniconda) for MacOS.

1.3 Using Conda

Open or restart the terminal and create a Python 2.7 environment by running the command:

conda create -n py27Env python=2.7 anaconda

Then the environment can be activated by running:

source activate py27Env

Make sure to keep the environment activated for the rest of the installation process.
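To confirm that the activated environment really provides Python 2.7, a small check such as the one below can be run with the `python` from `py27Env`. The helper name `is_python27` is hypothetical; the script only inspects `sys.version_info`.

```python
import sys


def is_python27(version_info):
    """Return True when the (major, minor, ...) tuple denotes Python 2.7."""
    return tuple(version_info[:2]) == (2, 7)


if __name__ == "__main__":
    if is_python27(sys.version_info):
        print("OK: running Python 2.7")
    else:
        print("Warning: expected Python 2.7, found %d.%d"
              % tuple(sys.version_info[:2]))
```

If the warning appears, the environment is most likely not activated in the current terminal.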

2. Download and Install Spark

Note: before you install Spark, you may need to install Scala on your system.

Download the spark-2.2.0-bin-hadoop2.7.tgz file from the Spark website. Go to the directory where you downloaded the file and run:

tar -xzf spark-2.2.0-bin-hadoop2.7.tgz
pip install pyspark

3. Install MySQL Server

3.1 For Ubuntu: update and upgrade your packages with apt-get:

sudo apt-get update
sudo apt-get upgrade

Install MySQL by running:

sudo apt-get install mysql-server

3.2 For MacOS

Install and run the MySQL .dmg file for MacOS from https://dev.mysql.com/downloads/mysql/
After the installation is finished, open System Preferences, click on the MySQL icon, and make sure the MySQL Server Instance is running.

Next, run:

sudo /usr/local/mysql/bin/mysql_secure_installation

Set a new root password and use the default options for the other prompts.

3.3 Create MySQL User and Database

Go to the repo's root directory and run the script:

./mysql_script.sh

4. Installing Required Packages

Again, go to the repo's root directory and run:

pip install -r python-package-requirement.txt

5. Installing Pytorch

To install PyTorch, follow the instructions for your OS at http://pytorch.org/
Make sure to select Python 2.7 for the installation (the other settings can be left as default).
Make sure to install version 0.3.0 or later.
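Once the packages are installed, a quick import check can reveal anything that is still missing from the environment. The sketch below tries to import each module by name and reports the ones that fail; the helper `missing_packages` is hypothetical, and `pyspark` and `torch` are the import names of the PySpark and PyTorch packages installed in the steps above.

```python
import importlib


def missing_packages(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means both packages are importable.
    print(missing_packages(["pyspark", "torch"]))
```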

6. Install JDK 8

6.1 For Ubuntu:
Check if you have JDK 8 installed by running:

java -version

If you do not have JDK 8, run the following command:

sudo apt-get install openjdk-8-jre

6.2 For MacOS:
Check if you have JDK 8 by running:

/usr/libexec/java_home -V

If you do not have JDK 8, download and install JDK 8 for MacOS from the Oracle website: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
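Java 8 reports itself as version 1.8.x in the output of `java -version`, which can be easy to misread. The sketch below shows one way to test a captured version string for this; the helper `is_java8` and the sample strings are illustrative assumptions, not part of the HoloClean repo.

```python
import re


def is_java8(version_output):
    """Return True when `java -version` output reports a 1.8.x (Java 8) runtime."""
    match = re.search(r'version "(\d+)\.(\d+)', version_output)
    if match is None:
        return False
    return (match.group(1), match.group(2)) == ("1", "8")


# Sample first lines of `java -version` output (illustrative values).
print(is_java8('openjdk version "1.8.0_151"'))
print(is_java8('openjdk version "11.0.2" 2019-01-15'))
```

Note that `java -version` writes to standard error rather than standard output, so that is the stream to capture when feeding real output to a check like this.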

7. Getting Started

The following tutorials in the tutorial directory will familiarize you with the HoloClean framework. To run them in Jupyter Notebook, go to the repo's root directory in the terminal and run:

./start_notebook.sh

  • Data Loading & Denial Constraints Tutorial

  • Complete Pipeline

  • Error Detection

Contributors

aayushshah15, ah89, epang080516, gmichalo, ihabilyas, j48zheng, jvonderwell, jw-mcgrath, matrixpachi-w, minafarid
