awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
License: Apache License 2.0
We should have integration tests, e.g. on EMR, triggered by Travis CI.
The fluent API does not currently offer a way to set the two parameters for state aggregation and state saving.
A workaround is to use the deprecated run method:
val stateProvider = InMemoryStateProvider()
VerificationSuite().run(df, checks, saveStatesWith = Some(stateProvider))
val state = stateProvider.load(analyzer)
Implementing the two additional parameters should be straightforward: add one method for state aggregation and one for state saving to VerificationRunBuilder, possibly with these signatures:
class VerificationRunBuilder {
// ...
def aggregateWith(stateLoader: StateLoader): this.type
def saveStatesWith(statePersister: StatePersister): this.type
// ...
}
and pass them on to VerificationSuite().doVerificationRun().
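For illustration, a minimal sketch of how the proposed builder methods could be used once added; aggregateWith and saveStatesWith are the suggested additions above and do not exist yet, and InMemoryStateProvider (which implements both StateLoader and StatePersister), df and checks are assumptions:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.analyzers.InMemoryStateProvider

val stateProvider = InMemoryStateProvider()

VerificationSuite()
  .onData(df)
  .addChecks(checks)
  .aggregateWith(stateProvider)  // proposed: load previously saved states and aggregate them
  .saveStatesWith(stateProvider) // proposed: persist the states computed during this run
  .run()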
Create an example for the anomaly detection functionality of deequ.
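Until that example exists, a rough sketch of what it could show, assuming a DataFrame df and a metrics repository holding results from previous runs; the strategy parameters are illustrative:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.analyzers.Size
import com.amazon.deequ.anomalydetection.RelativeRateOfChangeStrategy
import com.amazon.deequ.checks.CheckStatus
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository

val metricsRepository = new InMemoryMetricsRepository()

// flag an anomaly if the dataset size more than doubles compared to the stored history
val verificationResult = VerificationSuite()
  .onData(df)
  .useRepository(metricsRepository)
  .saveOrAppendResult(ResultKey(System.currentTimeMillis()))
  .addAnomalyCheck(RelativeRateOfChangeStrategy(maxRateIncrease = Some(2.0)), Size())
  .run()

if (verificationResult.status != CheckStatus.Success) {
  println("Anomaly detected in the Size() metric!")
}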
The Uniqueness analyzer is too expensive for very large datasets and shouldn't be included by default.
Write a markdown doc detailing how to contribute (including a style guide, expected test coverage, etc.).
Using the MetricsRepository, you can reuse metrics calculated at an earlier point. If you do that, you don't always need the data again, provided everything required was already calculated before. At the moment, it is possible to work around this for most cases by passing in an empty DataFrame and then using the method .reuseExistingResultsForKey(resultKey, failIfResultsForReusingMissing=true). However, we should properly support this use case in our API, for example by adding a method .withoutData in addition to the current .onData methods. In case we need some information about the schema, providing that instead of the data itself should be enough.
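For reference, a minimal sketch of the current workaround, assuming a repository that already holds metrics under resultKey from an earlier run, plus an existing check and a SparkSession spark:

import com.amazon.deequ.VerificationSuite

// the empty DataFrame only satisfies .onData; all metrics are expected to come
// from the repository via the previously stored resultKey
val verificationResult = VerificationSuite()
  .onData(spark.emptyDataFrame)
  .useRepository(repository)
  .reuseExistingResultsForKey(resultKey, failIfResultsForReusingMissing = true)
  .addCheck(check)
  .run()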
We should decide on a threshold for test coverage and have the build check that.
Create an example for computing and evaluating constraint suggestions for data.
We would like to support seasonality, multivariate metrics and more advanced AnomalyDetectionStrategies.
We should release a new RC for deequ today, as many of the examples depend on Stefan's latest changes which will not be in the RC0 release.
Current example here.
We would like to support referential integrity constraints and analyzers.
The AnalysisRunner is our main entry point for calculating metrics in a scenario where you don't want to write unit tests but want to do something different with the metrics. We need to make the distinction between the AnalysisRunner and the ColumnProfilerRunner clear.
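To make the distinction concrete, a minimal sketch of the AnalysisRunner path, which computes metrics without formulating any checks; df, spark and the column name "review_id" are assumptions for illustration:

import com.amazon.deequ.analyzers.{Completeness, Size}
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}

val analysisResult: AnalyzerContext = AnalysisRunner
  .onData(df)
  .addAnalyzer(Size())
  .addAnalyzer(Completeness("review_id"))
  .run()

// turn the computed metrics into a DataFrame for further processing
val metricsAsDataFrame = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)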
Automatic periodicity detection can be used as input to seasonal anomaly detection methods where users don't know the seasonality.
When running constraint suggestion with CategoricalRangeRule for columns with a cardinality of a few dozen values, the suggested constraint fails by a very small margin ("Value: 0.99999907 does not meet the constraint requirement!").
Do we need a separate markdown file where we explain all core concepts like analyzers, checks, constraints, metrics, column profiles, metric repositories, check levels etc.?
We should also have a list of all analyzers, constraints, anomaly detection methods and constraint suggestion rules we offer in one place.
Maybe even readthedocs documentation?
We should give hints on visualization and show some of the advanced options.
Constraints that are automatically suggested must be able to apply to the dataset that they were suggested from.
When using a ConstraintSuggestionRunner [1] to automatically suggest Check constraints on a DataFrame, the suggested .isNonNegative(...) constraints assume that the column they are applied to is of a numeric type. However, the constraint suggestion runner will produce this constraint on String-typed columns that contain numeric values. Therefore, this suggested .isNonNegative check will fail on the data that was used to generate it.
We should have more tests for the column profiling functionality.
Constraints that are automatically suggested must be able to apply to the dataset that they were suggested from.
When using a ConstraintSuggestionRunner [1] to automatically suggest Check constraints on a DataFrame, the suggested .hasDataType(...) constraints do not include any information about completeness. Thus, if the column that the .hasDataType constraint was suggested for has any null values (i.e. is not a complete column), then the suggested constraint will fail on the dataset that it was suggested from.
[1] Specifically, this expression is what is meant by "automatically suggested constraints":
ConstraintSuggestionRunner()
.onData(data)
.addConstraintRules(Rules.DEFAULT)
.run()
where data is an org.apache.spark.sql.DataFrame instance.
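To illustrate both issues above, the suggestions and the code they generate can be inspected before re-applying them; per the reports, this may include an .isNonNegative(...) on a String-typed column or a .hasDataType(...) on a column that also contains nulls. A sketch, reusing data from the footnote:

import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

val suggestionResult = ConstraintSuggestionRunner()
  .onData(data)
  .addConstraintRules(Rules.DEFAULT)
  .run()

// print each suggested constraint and the Scala code it would generate
suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
  suggestions.foreach { suggestion =>
    println(s"$column: ${suggestion.description} -> ${suggestion.codeForConstraint}")
  }
}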
Create an example for using our single-column data profiling to profile data.
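A possible starting point for that example, assuming a DataFrame df; this sketch uses the current profiling API:

import com.amazon.deequ.profiles.ColumnProfilerRunner

val profilingResult = ColumnProfilerRunner()
  .onData(df)
  .run()

// one profile per column, holding completeness, approximate distinct count and inferred type
profilingResult.profiles.foreach { case (columnName, profile) =>
  println(s"$columnName: completeness=${profile.completeness}, " +
    s"approxDistinct=${profile.approximateNumDistinctValues}, dataType=${profile.dataType}")
}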
The constraint suggestion and column profiling API, which is still experimental, offers the option to print status updates after each pass over the data. We should replace this with proper logging and see if we can use that in other places in the library as well.
We should be able to easily publish to maven central.
For columns with continuous values (int, double, ...), we want to suggest appropriate constraints. We need to determine when we want to use static rules (e.g., max/min; this could replace the isPositive rule also) and when to use anomaly detection. May depend on whether there is a trend change in the data distribution.
We can use MiMa to verify that new versions of the library don't break existing clients. This could be hard though because MiMa plays nicely only with SBT.
Hi Deequ team,
This is a pretty useful tool. However, it currently only supports Spark 2.2. Is there a plan to support Spark 2.3?
The isContainedIn Checks will fail whenever there exists at least one column with a special character. Here, a special character means something that is reserved in the Spark SQL language, e.g. a [ or ]. isContainedIn generates SQL, but the column name is not properly escaped. This means that, at execution time, the generated SQL will fail with a syntax error. As a consequence, this syntax error will cascade through all other Checks when associated with a VerificationRunBuilder. Thus, even valid Checks will fail due to this SQL syntax error.
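A hypothetical reproduction of the escaping problem; the column name "my[col]" and the data are made up, and a SparkSession spark is assumed:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("my[col]", "id")

// isContainedIn generates SQL that embeds the column name without escaping it,
// so this check (and every other check in the same run) fails with a syntax error
val check = Check(CheckLevel.Error, "escaping example")
  .isContainedIn("my[col]", Array("a", "b"))

VerificationSuite()
  .onData(df)
  .addCheck(check)
  .run()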
Currently, anomaly detection strategies can consume a single metric and its historic values. We would like to extend the interface for strategies such that they can be based on multiple metrics.
Write a short documentation on how to release.
Show how to use the method reuseExistingResultsForKey, and maybe show how it can be used in some example use cases.
Not sure if this is a Zeppelin-related issue or not, but I get the following error. I'm using Zeppelin 0.8.1 with Spark 2.4 and Scala 2.11.12, running on AWS EMR.
java.lang.NoSuchMethodError: scala.reflect.internal.Symbols$Symbol.originalEnclosingMethod()Lscala/reflect/internal/Symbols$Symbol;
at scala.tools.nsc.backend.jvm.GenASM$JPlainBuilder.getEnclosingMethodAttribute(GenASM.scala:1306)
at scala.tools.nsc.backend.jvm.GenASM$JPlainBuilder.genClass(GenASM.scala:1222)
at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.emitFor$1(GenASM.scala:135)
at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.run(GenASM.scala:141)
at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1625)
at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1610)
at scala.tools.nsc.Global$Run.compileSources(Global.scala:1605)
at scala.tools.nsc.interpreter.IMain.compileSourcesKeepingRun(IMain.scala:388)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.compileAndSaveRun(IMain.scala:804)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.compile(IMain.scala:763)
at scala.tools.nsc.interpreter.IMain.bind(IMain.scala:627)
at scala.tools.nsc.interpreter.IMain.bind(IMain.scala:664)
at scala.tools.nsc.interpreter.IMain$$anonfun$quietBind$1.apply(IMain.scala:663)
at scala.tools.nsc.interpreter.IMain$$anonfun$quietBind$1.apply(IMain.scala:663)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:202)
at scala.tools.nsc.interpreter.IMain.quietBind(IMain.scala:663)
at org.apache.zeppelin.spark.SparkScala211Interpreter$.loopPostInit$1(SparkScala211Interpreter.scala:179)
at org.apache.zeppelin.spark.SparkScala211Interpreter$.org$apache$zeppelin$spark$SparkScala211Interpreter$$loopPostInit(SparkScala211Interpreter.scala:214)
at org.apache.zeppelin.spark.SparkScala211Interpreter.open(SparkScala211Interpreter.scala:86)
at org.apache.zeppelin.spark.NewSparkInterpreter.open(NewSparkInterpreter.java:102)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:62)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:617)
at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:140)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Example:
"column_hasVisit": {
"column": "column_hasVisit",
"completeness": 1,
"approximateNumDistinctValues": 2,
"dataType": {
"enumClass": "com.amazon.deequ.analyzers.DataTypeInstances",
"value": "Boolean"
},
"isDataTypeInferred": false,
"typeCounts": {},
"histogram": null
}
We should have an example that showcases how to use the MetricsRepositories to store, query and re-use computed metrics.
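A sketch of what that example could cover, assuming a DataFrame df with a column "id"; the tags and key are illustrative:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository

val repository = new InMemoryMetricsRepository()
val resultKey = ResultKey(System.currentTimeMillis(), Map("pipeline" -> "example"))

// store the metrics computed during verification
VerificationSuite()
  .onData(df)
  .useRepository(repository)
  .saveOrAppendResult(resultKey)
  .addCheck(Check(CheckLevel.Error, "basic check").isComplete("id"))
  .run()

// query the stored metrics later, e.g. filtered by tag, as a DataFrame
val storedMetrics = repository.load()
  .withTagValues(Map("pipeline" -> "example"))
  .getSuccessMetricsAsDataFrame(spark)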
We should configure the RAT plugin for maven and have it check that all .scala files have appropriate license headers.
Hoping to use this in conjunction with Glue to automate testing into our data pipeline. Has anyone done this successfully?
I'm attempting to run the BasicExample.scala code using a SageMaker notebook attached to a Glue Development Endpoint. I'm working in the SparkMagic kernel.
The .isGreaterThan, .isLessThan, .isGreaterThanOrEqualTo, and .isLessThanOrEqualTo methods on the Check type will fail with a Spark SQL syntax error at runtime when applied to columns whose names contain special characters or keywords.
The normal Uniqueness analyzer is too expensive for large datasets.
We should add more examples showcasing the state support.
We have 4 main entry points in our API at the moment: VerificationSuite, AnalysisRunner, ColumnProfilerRunner and ConstraintSuggestionRunner.
Should we list these 4 in our main README in a separate paragraph and write one sentence for each of them to highlight the main use cases?
It is possible to have categorical data with numerical values, for example 1 indicating male and 2 indicating female.
It would be nice to have histogram analysis available for those columns. Currently, it is limited to boolean and string columns only.