fdcd

This repository contains the code for the paper "Fast Algorithms for Denial Constraint Discovery".


Installation dependencies

Before building the algorithms, make sure to install the following prerequisites:

  • Java JDK 1.8 or later
  • Maven 3.1.0 or later
  • Git
  • Boost (only for enumeration with the MMCS algorithm)
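The command-line prerequisites above can be checked with a short shell sketch before building. The helper name `missing_tools` is made up for illustration. It only tests whether each command is on the PATH; it does not verify version numbers, and it does not look for Boost, which is a library rather than a command.

```shell
#!/bin/sh
# Print the name of every tool in the argument list that is not on PATH.
# (Hypothetical helper; presence check only, versions are not inspected.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Command names assumed for the JDK, Maven, and Git prerequisites.
missing=$(missing_tools java mvn git)
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names are joined on one line.
    echo "Missing prerequisites:" $missing
else
    echo "All required tools found."
fi
```

If a tool is reported missing, install it and confirm its version manually (for example with `java -version` and `mvn -version`).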

Setup

1. Clone the code

As the first step, clone this repository:

$ git clone https://github.com/EduardoPena/fdcd.git
$ cd fdcd

2. Compile the code and generate the jar file

Then, build fdcd with the following Maven command:

.../fdcd$ mvn clean install

The command above will create a "fat" jar called discoverDCs.jar and place it into the target folder.

3. Install MMCS Algorithm (optional)

DC enumeration with the MMCS algorithm requires a C++ implementation, found in: MHS generation algorithms. If you want to use it, follow the instructions there to build the executable (we use the default name, agdmhs). Then, copy the executable agdmhs into the folder containing the fdcd jar (e.g., target).
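To confirm that the executable ended up where the step above expects it, a small check like the following may help. This is a sketch: the function name `has_agdmhs` is hypothetical, and `target` is assumed to be the folder holding the jar, as in the build step.

```shell
#!/bin/sh
# Return success if an executable file named agdmhs exists in directory $1.
# (Hypothetical helper; it does not verify that the binary actually works.)
has_agdmhs() {
    [ -x "$1/agdmhs" ]
}

jar_dir=target   # assumed location of discoverDCs.jar
if has_agdmhs "$jar_dir"; then
    echo "agdmhs found in $jar_dir"
else
    echo "agdmhs not found in $jar_dir; copy it there before using MMCS enumeration"
fi
```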


Execution

Once you have compiled the code, you can run the discovery, for example:

.../fdcd$ java -jar target/discoverDCs.jar data/tax.csv

Parameters

The only required parameter is the dataset. See the data/ folder for sample .csv files. Additionally, you can specify three optional parameters:

  • -n : number of rows. For example, the following command executes the discovery with the first 10000 rows of the dataset:
.../fdcd$ java -jar target/discoverDCs.jar data/tax.csv -n 10000
  • -o : output file path. If -o is not specified, the program only prints the number of results. The following command saves the discovered DCs to the file taxdcs.out:
.../fdcd$ java -jar target/discoverDCs.jar data/tax.csv -n 10000 -o taxdcs.out
  • -e : enumeration method to be used with the ECP algorithm. The following methods are available: INCS, EI, HEI, MMCS, HMMCS, MCS (see the paper for technical details). The default is INCS. For example, the following command runs the discovery using the HEI enumeration algorithm:
.../fdcd$ java -jar target/discoverDCs.jar data/tax.csv -n 10000 -o taxdcs.out -e HEI
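To compare enumeration methods on the same input, the three options can be combined in a loop. The sketch below only prints the command lines, so they can be inspected before anything runs. The function name `fdcd_cmd` and the output naming scheme (`tax-HEI.out` and so on) are assumptions made for illustration; the flags and method names are the ones listed above.

```shell
#!/bin/sh
# Build (but do not run) the fdcd command line for one enumeration method.
# Arguments: $1 = dataset path, $2 = number of rows, $3 = enumeration method.
# The output file name <dataset>-<method>.out is a hypothetical convention.
fdcd_cmd() {
    printf 'java -jar target/discoverDCs.jar %s -n %s -o %s -e %s\n' \
        "$1" "$2" "$(basename "$1" .csv)-$3.out" "$3"
}

# Print one command per enumeration method listed in this README.
for method in INCS EI HEI MMCS HMMCS MCS; do
    fdcd_cmd data/tax.csv 10000 "$method"
done
```

Piping the printed lines to `sh` from the fdcd folder would execute them one after another. MMCS needs the agdmhs executable from the optional installation step.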

Repository structure

  • src/: the Java implementation of fdcd
  • data/: a sample of the datasets used for experiments

More data

This repository contains only sample datasets. The full datasets used in the paper are hosted here.


Metanome and comparisons

We compare our algorithms with state-of-the-art algorithms found here. These algorithms are integrated with Metanome, a specialized data profiling platform. We intend to integrate our algorithm into the platform soon.


