aeda's Introduction

AEDA stands for Automated Exploratory Data Analysis

AEDA will automatically profile any supported database using read access privileges. The results of the profiling are stored in a second supported database with write privileges.

Profiling a database means extracting metadata from all the tables of a given database and storing this information in a second database that can then be queried for information about the source database. This metadata database is a data catalog.

AEDA generates SQL queries that are executed against the source database and stores the results in the metadata database. The structure of the metadata database can be found in this document.

Usage

1. Clone and install the repository

Download or clone this repository and install the dependencies.

git clone https://github.com/darenasc/aeda.git
cd aeda

If you don't have pipenv installed, you can install it with:

pip install pipenv

Then, you can install the dependencies with:

pipenv install
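
Then activate the environment so that the remaining commands run inside it (standard pipenv usage):

pipenv shell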

2. Create a database connection file

aeda requires a databases.ini file in the src/aeda/connection_strings/ folder to store the database connections. You can rename the databases.ini.template file included with the repo and add your connections to it. The databases.ini file is not synchronised with the repo.

3. Add database connections

The database connections have the following format.

# databases.ini
[my-source-database]
db_engine = <A-SUPPORTED-DB-ENGINE>
host = <IP-OR-HOSTNAME-SOURCE-DATABASE>
schema = <SCHEMA-SOURCE-DATABASE>
catalog = <CATALOG-SOURCE-DATABASE>
user = <SOURCE-USER>
password = <SOURCE-PASSWORD>
port = <SOURCE-PORT>

[my-metadata-database]
db_engine = <A-SUPPORTED-DB-ENGINE>
host = <IP-OR-HOSTNAME-METADATA-DATABASE>
schema = <SCHEMA-METADATA-DATABASE>
catalog = <CATALOG-METADATA-DATABASE>
user = <METADATA-USER>
password = <METADATA-PASSWORD>
port = <METADATA-PORT>
# optional parameter: yes or no
metadata_database = yes

[<SQLITE3-REFERENCE-NAME>]
db_engine = sqlite3
schema = <SQLITE3-DATABASE-NAME>
folder = <PATH/TO/THE/FOLDER/OF/THE/SQLITE3/DATABASE>
metadata_database = yes

A [connection-name] in square brackets is used by aeda to identify which database you want to use. In the example above there are two database connections: [my-source-database] and [my-metadata-database].

[my-source-database] is the database that we want to profile; we need read privileges on it. [my-metadata-database] is the database where we will store the metadata extracted from [my-source-database], and it requires write privileges.

You can check the SQL Code documentation file to learn about the database structure of the metadata database and what metadata is extracted from the profiled sources.

Note: Do not use quotes in the databases.ini file, and remove the '<' and '>' characters when filling in your values.

The metadata_database parameter is optional. It is used by the streamlit app to present the connections marked as metadata databases in a dropdown list.
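
For reference, a filled-in source entry might look like the following. All the values here are hypothetical; note that there are no quotes and no angle brackets:

# databases.ini
[my-source-database]
db_engine = postgres
host = 192.168.1.20
schema = public
catalog = sales_db
user = aeda_reader
password = a-strong-password
port = 5432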

The supported values for the db_engine property in the databases.ini file are:

  • sqlite3
  • mysql
  • postgres
  • mssqlserver
  • mariadb
  • snowflake
  • aurora
  • saphana
  • saphana_odbc

3.1 Create the metadata database

You can create the metadata database as a local SQLite3 database, or using MySQL, PostgreSQL, or MS SQL Server, with the following commands run from the terminal in the src/aeda folder:

python aeda_.py create_db sqlite3 --section <YOUR-SQLITE3-DATABASE>  # Creates a sqlite3 database, or
python aeda_.py create_db mysql --section <my-metadata-database>

A connection definition for a SQLite3 database has only three properties:

[<SQLITE3-REFERENCE-NAME>]
db_engine = sqlite3
schema = <SQLITE3-DATABASE-NAME>
folder = <PATH/TO/THE/FOLDER/OF/THE/SQLITE3/DATABASE>
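
For example, a hypothetical local metadata database could be defined as follows (the section name, schema, and folder are placeholders) and then created with create_db:

[metadata-sqlite3]
db_engine = sqlite3
schema = metadata
folder = /home/user/aeda/databases

python aeda_.py create_db sqlite3 --section metadata-sqlite3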

3.2 Check connections

You can check which connections are available with the list-connections command, which lists all the configured connections. Use the name in the section column to refer to a specific connection.

python aeda_.py list-connections

3.3 Test the connections

To test the connections to the databases you have created, you can use the following command:

cd src/aeda
python aeda_.py test-connections --source my-source-database --metadata my-metadata-database

Where my-source-database and my-metadata-database are the names of the connection definitions in the databases.ini configuration file.

This should print the following:

[ OK ]  Connection to the ****.****.**** source tested successfully...
[ OK ]  Connection to the ****.****.**** source tested successfully...

4. Exploring the source database

To explore a database you need to run the following command from the terminal in the src/aeda folder:

cd src/aeda
python aeda_.py explore --source my-source-database --metadata my-metadata-database

Where my-source-database and my-metadata-database are the names of the connection definitions in the databases.ini configuration file.
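
Putting the steps together, an end-to-end run with the connection names used above would be:

cd src/aeda
python aeda_.py test-connections --source my-source-database --metadata my-metadata-database
python aeda_.py explore --source my-source-database --metadata my-metadata-database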

5. Relax and wait for the results.

The process has six stages and will print Done! when it is finished.

The six phases of the profiling are:

  1. Get all the columns of the source database from its metadata.
  2. Compute the number of columns and the number of rows per table.
  3. Compute the number of unique values and the number of NULL values per column.
  4. Compute the frequency of the data values per column.
  5. Compute the monthly frequency of the timestamp and date columns.
  6. Compute statistics for the numeric columns.

The tables are processed in order of row count, so from step 3 onwards the tables with fewer rows are processed first. An illustrative query for step 3 is shown below.
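
As an illustration, the kind of query generated for step 3 could look like the following. This is a sketch with a hypothetical table and column; the exact SQL aeda generates depends on the database engine:

SELECT COUNT(DISTINCT my_column) AS unique_values,
       SUM(CASE WHEN my_column IS NULL THEN 1 ELSE 0 END) AS null_values
FROM my_table;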

6. Visualising the results

You can query the resulting database directly, or use a minimalistic user interface developed with streamlit from the src/aeda/streamlit folder. It publishes the report on port 5000 of your localhost.

cd src/aeda/streamlit
streamlit run aeda_app.py

Feedback is appreciated!

  • Any questions or feedback? Just create an issue.
  • There are issues with the help wanted label for testing commercial databases.

aeda's Issues

Process data in-memory

For small tables in the source database, process them from a dataframe and insert the results into the metadata database. This would apply to tables with fewer than 10k rows, or below the row-count threshold suggested by benchmark #26, for choosing between in-memory processing and querying the source database.
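
A minimal sketch of what this could look like, assuming pandas and SQLAlchemy are available (the function name and connection URI handling are hypothetical):

import pandas as pd
from sqlalchemy import create_engine

def profile_in_memory(connection_uri: str, table: str) -> pd.DataFrame:
    # Load a small table once and compute per-column metadata in memory,
    # instead of issuing one SQL query per column against the source.
    engine = create_engine(connection_uri)
    df = pd.read_sql_table(table, engine)
    return pd.DataFrame({
        "unique_values": df.nunique(),
        "null_values": df.isna().sum(),
    })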

Add anomaly analysis

Is your feature request related to a problem? Please describe.
In the profiling of columns, an optional analysis of anomalies could be included.

Describe the solution you'd like
Search for outliers in numerical columns and for biased (over-represented) classes in categorical columns; a sketch of the outlier part follows below.

Describe alternatives you've considered
None
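
One possible approach for the numerical part, as a sketch only (IQR-based outlier detection; not something aeda currently implements):

import pandas as pd

def find_outliers_iqr(values: pd.Series) -> pd.Series:
    # Flag values more than 1.5 * IQR beyond the first or third quartile.
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]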

Allow multiple profilings

Is your feature request related to a problem? Please describe.
Currently, aeda profiles one database at a time. Ideally, it would accept a list of connections defined in databases.ini and process them in parallel or in sequence, using only one API call.

Describe the solution you'd like
The explore call could receive a list of database connections separated by spaces and generate a queue to process them; the last database connection would represent the target metadata database, as sketched below.
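
A hypothetical invocation under this proposal, where the last connection is the target metadata database (this syntax does not exist yet):

python aeda_.py explore source-db-1 source-db-2 my-metadata-database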

Provide a way to transform data

Is your feature request related to a problem? Please describe.
It would be ideal to provide an interface to transform data at the column level, for example by adding representation codes so that when a code appears in a table it can be visualized with its interpretation.

Describe the solution you'd like
Include a dictionary-like structure to store the code transformations at the column level, so that reports can be visualized with the decoded data (see the sketch below).

Describe alternatives you've considered
Adding a user-input in the streamlit app or by passing a dictionary as a config file.
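
A sketch of the kind of mapping this could use, with hypothetical column names, codes, and labels:

# Hypothetical code-to-label mappings, keyed by column name.
CODE_TRANSFORMATIONS = {
    "sex": {"M": "Male", "F": "Female"},
    "status": {"A": "Active", "I": "Inactive"},
}

def decode(column: str, code: str) -> str:
    # Fall back to the raw code when no mapping exists.
    return CODE_TRANSFORMATIONS.get(column, {}).get(code, code)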

Add date related summary

A table similar to stats, but for datetime-related data, including (a sketch follows the list):

  • max date
  • min date
  • range (total number of days)
  • data received per day
  • median
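
A sketch of how these metrics could be computed with pandas (assumes a datetime64 Series; the function name is hypothetical):

import pandas as pd

def date_summary(dates: pd.Series) -> dict:
    # Proposed datetime summary metrics, computed in one pass.
    per_day = dates.dt.normalize().value_counts()
    return {
        "max date": dates.max(),
        "min date": dates.min(),
        "range (days)": (dates.max() - dates.min()).days,
        "rows per day": per_day.mean(),
        "median date": dates.quantile(0.5),
    }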

Add filter by data type

Add a filter for data types such as TEXT, NTEXT, JSONB, and others when processing data_values, as sketched below.
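
A sketch of such a filter, with a hypothetical exclusion list and column representation:

def filter_columns(columns, excluded=("TEXT", "NTEXT", "JSONB")):
    # Skip column types that are expensive or uninformative to frequency-profile.
    return [(name, dtype) for name, dtype in columns if dtype.upper() not in excluded]

# Example with hypothetical columns: only ('id', 'INT') is kept.
print(filter_columns([("id", "INT"), ("notes", "TEXT"), ("payload", "JSONB")]))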
