
The Data Civilizer Architecture

Data Civilizer is an end-to-end data preparation system that provides data discovery and cleaning services. Its main components are shown in the figure below. Data Civilizer contains three primary layers: the CIVILIZER frontend, the CIVILIZER engine, and the CIVILIZER services. The workflow engine allows a user to string together any of the CIVILIZER services in a directed graph to accomplish their data integration goal; the engine then manages the execution of the services in that graph. Users interact with the system via the Civilizer Studio.

[Figure: The Data Civilizer system]

In a nutshell, Data Civilizer has the following primary services:

A data discovery module, Aurum, to help find datasets of interest. Aurum builds an enterprise knowledge graph (EKG) that connects similar tables. It lets a user browse the graph, run similarity searches on tables in the graph, and perform keyword searches. Aurum can also find all join paths that connect a given set of tables.

An entity resolution module that forms clusters of records thought to represent the same entity. We currently use DeepER, a deep learning-based entity resolution module; it can easily be replaced with another module that performs the same task.

A golden record module, which collapses clusters of records representing the same entity into single representative records. Our golden record system excels at cleaning when there are multiple sources of the same information.

Other cleaning tools in the system include an abbreviation-fixing system (PKDuck), a disguised-missing-value detection tool (Fahes), and ImputeDB for filling in missing values.

Docker Installation

Install Docker and Docker Compose first, then follow the instructions below.

Hardware Requirements

The DC system needs at least 8GB of RAM. The default memory assigned to Docker is 2GB; this can be adjusted in the Docker preferences.

We recommend setting Docker CPUs to 4, if possible.
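
You can confirm the resources currently allocated to Docker from the command line:

docker info | grep -E 'CPUs|Total Memory'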

The DC system needs 15GB of disk space allocated to Docker, including intermediate images such as Ubuntu and llsc/cuda-torch. To reclaim space, you can remove unused images, containers, or volumes with system prune (see https://docs.docker.com/config/pruning/):

docker system prune 

You can also list the Docker images and delete the ones you do not need:

docker images
docker rmi <REPOSITORY Name> 
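
Before pruning or removing images, you can check how much space images, containers, and volumes are consuming:

docker system df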

Development

Get the code

# clone using https
git clone --recursive https://github.com/qcri/data_civilizer_system.git
# or if you prefer ssh
git clone --recursive git@github.com:qcri/data_civilizer_system.git

cd data_civilizer_system
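
If you cloned without --recursive, you can fetch the submodules afterwards:

git submodule update --init --recursive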

Build the custom llsc/cuda-torch base image:

docker-compose build cuda_torch

Build and run all Data Civilizer images:

docker-compose build apis && docker-compose build grecord_service && docker-compose build studio && docker-compose up studio

Run the system (if already built):

docker-compose up studio
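
To run the system in the background and follow its logs instead:

docker-compose up -d studio
docker-compose logs -f studio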

Note: The apis service requires the Stanford Global Vectors for Word Representation binary file, GloVe.t7, which is stored outside the Docker image. If the file is not present when the service starts, it will be created automatically. Depending on your system, this can take up to 40 minutes, during which there is no indicator of progress. You can check progress from another terminal with either top or ps auxf. With top, watch for luajit to start and finish; CPU load should remain high throughout. With ps, wait for the apis container to execute civilizer.py.
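
The same check can be made from the host through Docker, for example:

docker stats              # CPU load should stay high while GloVe.t7 is being built
docker-compose top apis   # the service is up once civilizer.py appears in the process list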

Authoring and Running Data Civilizer Pipelines

Run the system, then visit http://localhost:5000.

The current version of the studio lets the user add a service and run it, and only then add another service; it is not possible to build a full pipeline first and then ask the studio to run it. After adding each node, you have to:

- save
- build
- execute

The following pipeline examples are based on the uploaded datasets, namely six tables from a university and an E-Commerce dataset.

Active Node

Each node has an attribute called isActive, which takes the value 'y' or 'n'. It indicates the deepest active node in the pipeline.

Fahes service to detect disguised missing values

This example checks for disguised missing values in the six tables available at /app_storage/data_sets/MIT_Demo/ and writes the output to /app_storage/data_sets/MIT_DemoResults/.

1. Run the system and go to http://localhost:5000/#!/editor
2. Create a new pipeline
3. Instantiate a node from the Fahes service
4. Add the following values to the source and destination parameters of the service:

source: /app/storage/data_sets/MIT_Demo/
destination: /app/storage/data_sets/MIT_DemoResults/
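
If the repository's app_storage directory is the host side of the container's /app/storage mount (an assumption; check docker-compose.yml to confirm), you can list the results from the host:

ls app_storage/data_sets/MIT_DemoResults/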

ImputeDB service to predict missing values

This example predicts the missing values in Sis_department, available at /app_storage/data_sets/MIT_Demo_IM/, and writes the output to /app/storage/data_sets/MIT_DemoResults/.

1. Run the system and go to http://localhost:5000/#!/editor
2. Create a new pipeline
3. Instantiate a node from the ImputeDB service
4. Add the following values to the parameters of the service:

source: /app/storage/data_sets/MIT_Demo_IM/
destination: /app/storage/data_sets/MIT_DemoResults/
tableName: Sis_department
Query: select Dept_Budget_Code from Sis_department;
ratio: 0

PKDuck service to fix abbreviations

This example fixes abbreviations in the tables available at /app_storage/data_sets/MIT_Demo/ and writes the output to /app_storage/data_sets/MIT_DemoResults/.

1. Run the system and go to http://localhost:5000/#!/editor
2. Create a new pipeline
3. Instantiate a node from the PKDuck service
4. Add the following values to the parameters of the service:

source: /app/storage/data_sets/MIT_Demo/
destination: /app/storage/data_sets/MIT_DemoResults/
Columns: 12#11#8
Tau: 0.8

Entity matching and Golden Record services

This example creates a pipeline of two services to consolidate data available in two different E-Commerce tables for Amazon and Google items, available at /app_storage/data_sets/deeper/Amazon-GoogleProducts. Note that the entity matching is done using DeepER; a pre-trained model is available in the same folder.

1. Run the system and go to http://localhost:5000/#!/editor
2. Create a new pipeline
3. Instantiate a node from the DeepER service
4. Add the following values to the DeepER parameters:

source: /app/storage/data_sets/deeper/fodors-zagats
destination: /app/storage/data_sets/deeper/output
Table1: fodors.csv
Table2: zagats.csv
predictionsFileName: matches.csv
number_of_pairs: 50000

After executing the DeepER node, instantiate another node for EntityConsolidation and add the following values to the EntityConsolidation parameters:

source: /app/storage/data_sets/deeper/output/
destination: /app/storage/data_sets/deeper/output/GRRows.csv
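
Under the same assumption about the app_storage mount as above, you can peek at the consolidated records from the host:

head app_storage/data_sets/deeper/output/GRRows.csv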


Contributors

dongdeng, elkindi, m-mahdavi, mansoure, michaeldh42, qahtanaa, raulcf, stravanni, tracyhenry, vijaygadepally


Known Issues

Package 'oracle-java8-installer' has no installation candidate

Right after running:

docker-compose build apis && docker-compose build grecord_service && docker-compose build studio && docker-compose up studio

The following error message appears:

Package oracle-java8-installer is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'oracle-java8-installer' has no installation candidate
ERROR: Service 'apis' failed to build: The command '/bin/sh -c apt-get update &&     apt-get install -y --no-install-recommends         oracle-java8-installer     &&     apt-get clean &&     rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100
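
Oracle withdrew the license that allowed oracle-java8-installer to be redistributed, so the package is no longer installable from the standard repositories. A common workaround, sketched below (not an official fix from this repo), is to edit the apis Dockerfile to install OpenJDK 8 instead:

# in the apis Dockerfile, replace the oracle-java8-installer line with:
RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-8-jdk && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*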

remove node_modules

Just checked out the repo and got tons of new node.js libraries.

Please add node_modules/ to .gitignore. The user should be able to run npm install to get the node modules (note that package.json is in the repo).
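
A standard way to make that change (a sketch; run from the repo root):

# stop tracking the checked-in modules, ignore them, and reinstall locally
git rm -r --cached node_modules
echo "node_modules/" >> .gitignore
npm install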
