
Getting Traction on Data and Model Drift

Welcome to our Data and Model Drift Repository! The environment of our world is constantly changing. For machine learning, this means that deployed models are confronted with unknown data and can become outdated over time. A proactive drift management approach is required to ensure that productive AI services deliver consistent business value in the long term.

Check out our background article Getting a Grip on Data and Model Drift with Azure Machine Learning for an in-depth discussion about the concepts used in this repository.

Starting with tabular data use cases, we provide the following examples to detect and mitigate data and model drift. These examples focus on the visual identification of data and model drift; the automation aspect is covered in the MLOps section below.

1. Statistical tests and expressive visualizations to detect and analyze drift in features and model predictions

KDE intersections to identify data drift

For a predictive maintenance example, we inspect the amount of drift by comparing the distributions of the training data ("reference") and the production inference observations ("current"). The statistical tests indicate significant drift for two input features, heat_deviation and speed_deviation. Furthermore, the Kernel Density Estimation (KDE) plots help us understand the amount and direction of the data drift.
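The combination of a statistical test and a KDE-intersection estimate described above can be sketched as follows. This is a minimal, hypothetical example using scipy (not the repository's code): the synthetic feature values and the 5% significance level are illustrative assumptions.

```python
# Sketch: two-sample Kolmogorov-Smirnov test plus a KDE-overlap estimate
# for one feature. Reference = training data, current = inference data.
import numpy as np
from scipy.stats import ks_2samp, gaussian_kde

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)  # training distribution
current = rng.normal(loc=0.5, scale=1.2, size=1000)    # drifted production data

# Two-sample KS test: a small p-value suggests the distributions differ.
statistic, p_value = ks_2samp(reference, current)
drift_detected = p_value < 0.05  # assumed 5% significance level

# KDE overlap: integrate the minimum of both density estimates on a grid.
# An overlap of 1.0 means identical distributions; lower values mean drift.
grid = np.linspace(min(reference.min(), current.min()),
                   max(reference.max(), current.max()), 500)
kde_ref = gaussian_kde(reference)(grid)
kde_cur = gaussian_kde(current)(grid)
overlap = float(np.minimum(kde_ref, kde_cur).sum() * (grid[1] - grid[0]))

print(f"KS p-value: {p_value:.4g}, drift: {bool(drift_detected)}, "
      f"KDE overlap: {overlap:.2f}")
```

The KDE overlap complements the test: the p-value tells you whether drift is statistically significant, while the overlap quantifies how large it is.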

2. A predictive approach to identify the impact of data and concept drift on the model

Model drift impact on predicted class probabilities

Here, we compare the performance of two classifiers in predicting the most recent inference observations. The classifier trained on current data outperforms the initial model. The diagrams show the corresponding drift in the predicted probabilities for the positive class.
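The comparison above can be sketched with scikit-learn. This is a minimal, hypothetical example, not the repository's code: the dataset, the injected concept drift, and the logistic-regression models are all synthetic assumptions.

```python
# Sketch: score a model trained on reference data against a model retrained
# on current (drifted) data, both evaluated on fresh current observations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_data(n, shift):
    """Synthetic binary task whose decision boundary moves with `shift`."""
    X = rng.normal(size=(n, 2)) + shift               # data drift
    y = (X[:, 0] + X[:, 1] > 2 * shift).astype(int)   # concept drift
    return X, y

X_ref, y_ref = make_data(2000, shift=0.0)    # original training distribution
X_cur, y_cur = make_data(2000, shift=1.5)    # drifted production data
X_test, y_test = make_data(1000, shift=1.5)  # fresh current observations

initial_model = LogisticRegression().fit(X_ref, y_ref)
retrained_model = LogisticRegression().fit(X_cur, y_cur)

acc_initial = accuracy_score(y_test, initial_model.predict(X_test))
acc_retrained = accuracy_score(y_test, retrained_model.predict(X_test))
print(f"initial: {acc_initial:.2f}, retrained: {acc_retrained:.2f}")
```

Under drift, the initial model's decision boundary no longer matches the current labels, so the retrained model scores noticeably higher, mirroring the comparison shown in the diagrams.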

3. Creating automated pipelines to identify data drift regularly as part of an MLOps architecture

MLOps architecture for evergreen models

Data and model drift management should be part of an overall MLOps solution. Here, we provide sample code for automated drift detection using Azure Machine Learning Pipelines. The MLOps implementation on Azure Machine Learning can be found in the following two subfolders of this repository:

  1. MLOps with Python SDK v2 (preview) & CLI v2 (SDK-V2)
  2. MLOps with Python SDK v1 (SDK-V1)

4. Connect AzureML pipelines to data drift dashboard in PowerBI

Based on the AzureML pipelines defined in SDK-V2, you can query the output data in PowerBI for an interactive view of potential data drift between your reference and current data. An example report is shown in the image below.

Data drift report in PowerBI

To connect your data source (coming from the AzureML Pipeline) to PowerBI, please consider the following steps:

  1. Install PowerBI Desktop
  • If you use Windows, you can download this software for free via the Microsoft Store.
  2. Retrieve the data from your Blob Storage
  • Under Get data, navigate to More..., select Azure, and finally select Azure Blob Storage. You may need to log into Azure at this stage.
  • You will now be prompted to enter the name of the Blob. You can find the name of the storage account associated with your AML workspace in the Azure Portal. Copy this name and paste it into the text box.
  • You will see multiple directories; choose the parent directory that contains the file. You can retrieve the exact location of the output file with the data drift database from the experiment that you launched via the CLI or .ipynb notebook.
  3. Select the right file
  • You will now see all files that are available in the Blob parent directory that holds your experiments. To select the relevant file, click Transform data. The Power Query prompt will now open.
  • In Power Query, filter on the Name column by pressing the inverted-triangle icon. Select Text filters and then Contains. Now paste the full path to the file, starting after the parent directory. It could look something like azureml/<BLOB ID>/pipeline_job_store_data_drift/drift_db_processed.csv
  • Power Query will now return one file in the view. Under the Content header, click on the yellow Binary link.
  • Power Query will import the .csv file from the Blob Storage. You now also have the chance to review the schema of the table and change columns as needed. Once you are finished, press Close & Apply in the top left pane. You have now established a live connection to your database.
  4. Create a report
  • Under Report view you can drag and drop the relevant columns to re-create the visuals.
  • To create the KDE plots, start by filtering on one specific column under Filters. Then select a line chart with the "current" and "reference" KDE values as the y-axis and "x axis" as the x-axis.
  • For the KDE intersection percentage you can use a Gauge chart: select the "kde_overlap" column and aggregate by average. Since it is a constant column, the average will yield the actual value. Do not forget to apply the same filter settings (e.g. filter by one column) as you did for the KDE intersection plot.
  • For the drift indication as well as the column name, you can select a Card item and select the first value. Similar to "kde_overlap", this is a constant value, hence the first value is the same as the rest of the data given the respective filter. Again, apply the same filter settings as for the KDE intersection plot.
  • Add more visuals if desired. Once you're ready, you can publish your report by clicking Publish in the top menu. Once published, you can distribute and refresh the report.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

data-model-drift's People

Contributors

andreaskopp · krisbock · microsoft-github-operations[bot] · microsoftopensource · natasha-savic-msft


data-model-drift's Issues

ImportError: cannot import name 'CronSchedule' from 'azure.ai.ml.entities'

Running DRIFT_PIPELINES_V2.ipynb:

```python
# create a cron schedule: start from the current time and fire at
# minutes 0,10 of every hour within the AEST TZ
from datetime import datetime
from dateutil import tz
from azure.ai.ml.constants import TimeZone
from azure.ai.ml.entities import (
    CronSchedule,
    RecurrenceSchedule,
    RecurrencePattern,
    ScheduleStatus,
)

schedule_start_time = datetime.now(tz=tz.gettz())
cron_schedule = CronSchedule(
    expression="0,10 * * * *",
    start_time=schedule_start_time,
    time_zone=TimeZone.AUS_EASTERN_STANDARD_TIME,
    status=ScheduleStatus.ENABLED,
)

pipeline_job.schedule = cron_schedule
```


```
ImportError                               Traceback (most recent call last)
Input In [27], in <cell line: 5>()
      3 from dateutil import tz
      4 from azure.ai.ml.constants import TimeZone
----> 5 from azure.ai.ml.entities import (
      6     CronSchedule,
      7     RecurrenceSchedule,
      8     RecurrencePattern,
      9     ScheduleStatus,
     10 )
     12 schedule_start_time = datetime.now(tz=tz.gettz())
     13 cron_schedule = CronSchedule(
     14     expression="0,10 * * * *",
     15     start_time=schedule_start_time,
     16     time_zone=TimeZone.AUS_EASTERN_STANDARD_TIME,
     17     status=ScheduleStatus.ENABLED,
     18 )

ImportError: cannot import name 'CronSchedule' from 'azure.ai.ml.entities' (/anaconda/envs/azureml_py38/lib/python3.8/site-packages/azure/ai/ml/entities/__init__.py)
```

Adding Bonferroni correction

Hi,

I have used this repo to build our team's current drift solution.

I am historizing the dataset using an azure ml tabular dataset by passing the path of my drift loadings in blob. This allows me to query the data over time and plot how the p-values are varying.

However, in order to detect whether the overall dataset has drifted, it is unfair to just look at a single feature.

I am using a Bonferroni correction: Bland JM, Altman DG: Multiple significance tests: the Bonferroni method. BMJ 1995;310(6973):170.

In the same way that it is implemented in Seldon Core:

```python
# TODO: return both feature-level and batch-level drift predictions by default
# values below p-value threshold are drift
if drift_type == 'feature':
    drift_pred = (p_vals < self.p_val).astype(int)
elif drift_type == 'batch' and self.correction == 'bonferroni':
    threshold = self.p_val / self.n_features
    drift_pred = int((p_vals < threshold).any())  # type: ignore[assignment]
elif drift_type == 'batch' and self.correction == 'fdr':
    drift_pred, threshold = fdr(p_vals, q_val=self.p_val)  # type: ignore[assignment]
else:
    raise ValueError('`drift_type` needs to be either `feature` or `batch`.')
```

Maybe something worth adding to the example!
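The suggested batch-level correction can be shown in a self-contained sketch. This is a hypothetical example, not the repository's or Seldon's code: the synthetic data, the KS test, and the 5% significance level are illustrative assumptions.

```python
# Sketch: per-feature KS tests, then a Bonferroni-corrected batch decision.
# With n_features tests, the per-test threshold becomes alpha / n_features,
# which controls the family-wise false-positive rate.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
n_features, alpha = 5, 0.05
reference = rng.normal(size=(1000, n_features))
current = rng.normal(size=(1000, n_features))
current[:, 0] += 1.0  # inject drift into the first feature only

p_vals = np.array([ks_2samp(reference[:, i], current[:, i]).pvalue
                   for i in range(n_features)])

feature_drift = p_vals < alpha                           # per-feature decision
batch_drift = bool((p_vals < alpha / n_features).any())  # Bonferroni-corrected

print(feature_drift, batch_drift)
```

The corrected threshold makes the batch-level flag robust: with many features, at least one uncorrected p-value will fall below alpha by chance alone, while the Bonferroni threshold keeps the overall false-alarm rate near alpha.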
