
Data warehouse and ETL best practice

A catalogue of best practices for managing data

I've catalogued here a list of common problems which hinder data analysis and machine learning, along with a list of good practices for mitigating them.

Common problems which hinder data analysis and machine learning

| Problem | Example(s) | Solution(s) |
|---|---|---|
| What do the fields in the data mean? How are they defined? Where does the data come from? | | Keep a data dictionary<br>Each data field owned by a specific person or team<br>Document the ETL process |
| I can't find the data that I want | | Keep a data dictionary<br>Data in relational databases should be normalised |
| Data is inconsistently formatted | e.g. different datetime formats in different parts of the database | Perform automatic data validation at ingestion (see the validation sketch after this table)<br>Perform automatic data cleaning |
| There are duplicate records in the data | | Perform automated data validation<br>Data in relational databases should be normalised |
| The database is very slow to query | | Data should be indexed along the most common query dimensions<br>Service each logical business unit with their own data mart<br>Monitor system performance metrics |
| Data is not stored in a tabular format | yaml, json, noSQL | Service each logical business unit with their own data mart |
| There are missing (NULL) values. How do I interpret them? | | Each data field owned by a specific person or team<br>Keep a data dictionary<br>Perform automated data cleaning<br>Perform automated data validation<br>Add constraints to the data entry system |
| The same entity is recorded multiple times | The same customer has been registered multiple times under different customer IDs | Perform automated data validation<br>Add constraints to the data entry system |
| Spelling errors and inconsistent category labelling | e.g. the gender field contains "male", "M", "Male" etc. | Perform automatic data validation at ingestion<br>Perform automatic data cleaning at ingestion<br>Add constraints to the data entry system |
| Data types are incorrect or unorthodox | weight="5kg" (should be weightKg=5) | Perform automated data cleaning<br>Perform automated data validation<br>Add constraints to the data entry system |
| There are data points which look obviously wrong | | Perform routine anomaly detection<br>Monitor data quality metrics |
| I can't access the latest data because the ETL process is too slow | | |
| Whoops - I deleted data by accident | | Establish procedures for scheduled backup and disaster recovery |
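
Several of the solutions above call for automated data validation at ingestion. Below is a minimal sketch in Python using pandas; the column names (`customer_id`, `gender`, `weight_kg`) and the allowed gender labels are illustrative assumptions, not a prescribed schema:

```python
# A minimal sketch of automated data validation at ingestion.
# Column names and allowed values are illustrative assumptions.
import pandas as pd

ALLOWED_GENDER_VALUES = {"male", "female", "unknown"}


def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable validation failures for one ingested batch."""
    failures = []

    # Duplicate records in the data
    if df.duplicated().any():
        failures.append(f"{int(df.duplicated().sum())} duplicate rows found")

    # The same entity recorded multiple times (e.g. one customer, many IDs)
    if df.duplicated(subset=["customer_id"]).any():
        failures.append("duplicate customer_id values found")

    # Missing (NULL) values in columns which must always be populated
    null_counts = df[["customer_id", "weight_kg"]].isna().sum()
    for column, n_null in null_counts.items():
        if n_null > 0:
            failures.append(f"column '{column}' has {n_null} NULL values")

    # Inconsistent category labelling ("male", "M", "Male", ...)
    unexpected = set(df["gender"].dropna().unique()) - ALLOWED_GENDER_VALUES
    if unexpected:
        failures.append(f"unexpected gender labels: {sorted(unexpected)}")

    # Incorrect or unorthodox data types (e.g. weight stored as the string "5kg")
    if not pd.api.types.is_numeric_dtype(df["weight_kg"]):
        failures.append("column 'weight_kg' is not numeric")

    return failures
```

A batch with a non-empty failure list could be rejected outright or routed to a quarantine area for manual review, depending on how strict the ingestion policy needs to be.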

Data Warehouse and ETL Best Practices

| Good Practice | Description | Reason |
|---|---|---|
| Access control | | |
| Failures and worsening data quality metrics should raise an alarm | Alerting means bringing issues quickly to the attention of the data administrators/engineers | In a data pipeline, data errors quickly propagate to downstream processes and data consumers, so a quick response is important |
| Add constraints to the data entry system | Limits on the types of data that can be input into the system give direct control over the inflow of data into the system | Data uniformity and quality can be enforced by limiting user options on data entry systems |
| Keep a data dictionary | A data dictionary is data documentation enabling a user to understand and use the data without the need for any other source of information, i.e. describing what each data field means, how it is calculated, how it may be used etc. | Good data documentation enables end users (analysts, ML engineers etc.) to use the data without having to trawl through your organisation trying to piece together what the data means |
| Service each logical business unit with their own data mart | A data mart is a subset (and/or transformation) of the full data designed to service the needs of a specific segment of data consumers (e.g. the marketing department) | Reduces computational load on the main data warehouse<br>Can make the data more accessible to data consumers (they are not overwhelmed with irrelevant data)<br>Can be used to facilitate data access control |
| Data in relational databases should be normalised | Normalisation is a ubiquitous relational database organisation standard proposed by Edgar F. Codd | Protects against data duplication and redundancy<br>Results in consistent and organised data with a predictable format<br>Protects against anomalies arising during INSERT, DELETE or UPDATE |
| Automated data cleaning | Transformation of data into a more usable form by an automatic process (e.g. a scheduled process, or at ingestion time) | Avoids each data consumer having to do this themselves before being able to use the data |
| Data should be indexed along the most common query dimensions | The data should be optimised for querying in the way in which it is expected to be most commonly queried | This can increase the speed at which data can be accessed by orders of magnitude |
| Automated data validation | Verifying whether the data conforms to certain prespecified data quality rules in an automated fashion | Can be used to keep data quality high by rejecting low-quality input, or to monitor input data quality |
| Document the ETL process | The process by which data arrives in its final production state in the data warehouse should be clearly documented | Enables quick onboarding of new data administrators<br>Gives contextual understanding of the data to data consumers such as analysts and machine learning engineers |
| Each data field owned by a specific person or team | Each field in (or other logical partition of) the data should have a specific individual or team responsible for understanding and maintaining its quality. This contact person/team should be clearly documented alongside the data which they are responsible for | Without this, organisational knowledge can be permanently lost and data quality can deteriorate |
| Anomaly detection | New data not matching the historic or expected distribution of the data should be identified and investigated | Anomalies are usually either data errors or unusual system phenomena, both of which are worthy of investigation |
| Failure response | | |
| Log everything | Logs are persistent written records of the actions taken by the system (e.g. the data pipeline) | When something goes wrong, logs tell us what happened |
| Monitor data quality metrics | Single-number summaries of data quality (record counts, % null values, pipeline runtime etc.) should be recorded and watched closely | These metrics provide early warning that something in the data pipeline has gone wrong |
| Observability and monitoring | Data quality and other system performance metrics should be surfaced to data administrators/engineers in an interface or platform which makes the metrics easy to visualise, interrogate and explore | Can help provide early warning of errors in the data pipeline<br>Can help with debugging, understanding and resolving errors in the data pipeline |
| Timestamp every data point | | |
| Monitor data drift and concept drift | Data drift is a change in the distribution of the data. Concept drift is a change in the relationships between fields | These kinds of changes are a powerful indicator of error or opportunity |
| Use structured logging (see the logging sketch after this table) | | |
| Establish procedures for scheduled backup and disaster recovery | | |
| Data lineage | | |
| Monitor system performance metrics | | |
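
The "Log everything" and "Use structured logging" practices above are easiest to act on when every log record is machine-parseable. Below is a minimal sketch using only the Python standard library; the logger name, event names and field names are illustrative assumptions. Each record is emitted as one JSON object per line so that downstream tools can filter and aggregate without fragile text parsing:

```python
# A minimal sketch of structured (JSON) logging for a pipeline step.
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Fields passed via `extra={"fields": ...}` end up as attributes on the record
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("etl")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage: every record carries timestamped, machine-readable context
logger.info("batch_loaded", extra={"fields": {"table": "customers", "row_count": 10432}})
```

Because each record is a self-contained JSON object with a timestamp, the same log stream can feed the monitoring and alerting practices described above.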

Security Principles

| Principle | Explanation |
|---|---|
| Observability/monitoring | |
| Incident response | |
| Defense in Depth | |
| Separation of Duties | |
| Data encryption | In transit and at rest |
| Least Privilege | |
| Secure by Design | |

Data Quality Metrics

https://www.ibm.com/downloads/cas/W6BAW9MQ
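
As a concrete illustration of the single-number summaries mentioned above (record counts, % null values, duplicate rates), here is a minimal sketch which assumes the data is available as a pandas DataFrame; the metric names are illustrative, not taken from the linked IBM material:

```python
# A minimal sketch of computing single-number data quality metrics for one batch.
import pandas as pd


def data_quality_metrics(df: pd.DataFrame) -> dict:
    """Compute simple data quality summaries worth tracking over time."""
    n_rows = len(df)
    return {
        "row_count": n_rows,
        "duplicate_row_pct": 100.0 * float(df.duplicated().mean()) if n_rows else 0.0,
        "null_pct_by_column": {
            column: (100.0 * float(df[column].isna().mean()) if n_rows else 0.0)
            for column in df.columns
        },
    }


if __name__ == "__main__":
    example = pd.DataFrame(
        {"customer_id": [1, 2, 2, None], "gender": ["male", None, "M", "female"]}
    )
    print(data_quality_metrics(example))
```

Recording these values on every pipeline run, and alerting when they move outside their historic range, is one simple way to implement the "failures and worsening data quality metrics should raise an alarm" practice.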
