A catalogue of best practices for managing data
I've catalogued here a list of common problems which hinder data analysis and machine learning, and a list of good practices for negating them.
Good Practice | Description | Reason |
---|---|---|
Access control | ||
Failures and worsening data quality metrics should raise an alarm | Alerting means bringing issues quickly to the attention of the data administrators/engineers | In a data pipeline, data errors quickly propagate to downstream processes and data consumers, so quick response is important |
Add constraints to the data entry system | Limits on the types of data that can be input into the system allow direct control over the inflow of data into the system | This can be used to enforce data uniformity and quality can be enforced by limiting user options on data entry systems |
Keep a data dictionary | A data dictionary is data documentation enabling a user to understand and use the data without the need for any other source of information i.e. describing what each data field means, how it is calculated, how it may be used etc. | Good data documentation enables end users (analytics, ML engineers etc.) to use the data without having to trawl through your organisation trying to piece together what the data means. |
Service each logical business unit with their own data mart | A data mart is a subset (and/or transformation) of the full data designed to service the needs of a specific segment of data consumers (e.g. the marketing department) | Reduces computational load on the main data warehouse |
Can make the data more accessible to data consumers (they are not overwhelmed with irrelevant data) | ||
Can be used to facilitate data access control | ||
Data in relational databases should be normalised | Normalisation is a ubiquitous relational database organisation standard proposed by Edward Codd | Protects against data duplication and redundancy |
Results in consistent and organized data with a predictable format | ||
Protects against anomalies arising during INSERT, DELETE or UPDATE | ||
Automated data cleaning | Transformation of data into a more usable form by an automatic process (e.g. a scheduled process, or at ingestion time) | Avoids each data consumer having to do this themselves before being able to use the data |
Data should be indexed along the most common query dimensions | The data should be optimized for querying in the way in which it is expected to be most commonly queried | This can increase the speed at which data can be accessed by orders of magnitude |
Automated data validation | Verifying whether the data conforms to certain prespecified data quality rules in an automated fashion | Can be used to keep the data quality high, by rejecting low quality input, or use for monitoring input data quality. |
Document the ETL process | The process by which data arrives in its final production state in the data warehouse should be clearly documented | Enables quick onboarding of new data administrators |
Gives contextual understanding of the data to data consumers like analysts and machine learning engineers | ||
Each data field owned by a specific person or team | Each field in (or other logic partition of) the data should have a specific individual or team responsible for understanding and maintaining the quality of it. This contact person/team should be clearly documented alongside the data which they are responsible for | Without this, organisational knowledge can be permanently lost and data quality deteriorate. |
Anomaly detection | New data not matching the historic or expected distribution of data should be identified and investigated | Anomalies are usually either data errors or unusual system phenomena, both of which are worthy of investigation |
Failure response | ||
Log everything | Logs are persistent written records of the actions taken by the system (e.g. data pipeline) | When something goes wrong, logs tell us what happened |
Monitor data quality metrics | Single number summaries of data quality (record counts, % null values, pipeline runtime etc.) should be recorded and watched closely | These metrics provide early warning that something in the data pipeline has gone wrong |
Observability and monitoring | Data quality and other system performance metrics should be surfaced to data administrators/engineers in an interface or platform which makes the metrics easy to visualise, interrogate and explore | Can help provide early warning of errors in the data pipeline |
Can help with debugging, understanding and resolving errors in the data pipeline | ||
Timestamp every data point | ||
Monitor data drift and concept drift | Data drift is a change in the distribution of the data. Concept drift is a change in the relationships between fields | These kinds of changes are a powerful indicator of error or opportunity |
Use structured logging | ||
Establish procedures for scheduled backup and disaster recovery | ||
Data lineage | ||
Monitor system performance metrics |
Principle | Explanation |
---|---|
Observability/monitoring | |
Incident response | |
Defense in Depth | |
Separation of Duties | |
Data encryption | In transit and at rest |
Least Privilege | |
Secure by Design |