awslabs / deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
License: Apache License 2.0
We should have integration tests, e.g. on EMR, triggered by Travis CI.
The fluent API does not currently offer a way to set the two parameters for state aggregation and state saving.
A workaround is to use the deprecated run method:
val stateProvider = InMemoryStateProvider()
VerificationSuite().run(df, checks, saveStatesWith = Some(stateProvider))
val state = stateProvider.load(analyzer)
Implementing the two additional parameters should be straightforward: add one method for state aggregation and one for state saving to VerificationRunBuilder, possibly with these signatures:
class VerificationRunBuilder {
// ...
def aggregateWith(stateLoader: StateLoader): this.type
def saveStatesWith(statePersister: StatePersister): this.type
// ...
}
and pass them on to VerificationSuite().doVerificationRun().
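For illustration, a minimal sketch of how the proposed builder methods could be used once added; aggregateWith and saveStatesWith are the suggested additions above and do not exist yet, and InMemoryStateProvider (which implements both StateLoader and StatePersister), df and checks are assumptions:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.analyzers.InMemoryStateProvider

val stateProvider = InMemoryStateProvider()

VerificationSuite()
  .onData(df)
  .addChecks(checks)
  .aggregateWith(stateProvider)  // proposed: load previously saved states and aggregate them
  .saveStatesWith(stateProvider) // proposed: persist the states computed during this run
  .run()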
Create an example for the anomaly detection functionality of deequ.
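Until that example exists, a rough sketch of what it could show, assuming a DataFrame df and a metrics repository holding results from previous runs; the strategy parameters are illustrative:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.analyzers.Size
import com.amazon.deequ.anomalydetection.RelativeRateOfChangeStrategy
import com.amazon.deequ.checks.CheckStatus
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository

val metricsRepository = new InMemoryMetricsRepository()

// flag an anomaly if the dataset size more than doubles compared to the stored history
val verificationResult = VerificationSuite()
  .onData(df)
  .useRepository(metricsRepository)
  .saveOrAppendResult(ResultKey(System.currentTimeMillis()))
  .addAnomalyCheck(RelativeRateOfChangeStrategy(maxRateIncrease = Some(2.0)), Size())
  .run()

if (verificationResult.status != CheckStatus.Success) {
  println("Anomaly detected in the Size() metric!")
}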
The Uniqueness analyzer is too expensive for very large datasets and shouldn't be included by default.
Write a markdown doc detailing how to contribute (including a style guide, expected test coverage, etc.).
Using the MetricsRepository, you can reuse metrics calculated at an earlier point. If you do that, you don't always need the data again, provided everything required was already calculated before. At the moment, it is possible to work around this for most cases by passing in an empty DataFrame and then using the method .reuseExistingResultsForKey(resultKey, failIfResultsForReusingMissing=true). However, we should properly support this use case in our API, for example by adding a method .withoutData in addition to the current .onData methods. In case we need some information about the schema, providing that instead of the data itself should be enough.
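For reference, a minimal sketch of the current workaround, assuming a repository that already holds metrics under resultKey from an earlier run, plus an existing check and a SparkSession spark:

import com.amazon.deequ.VerificationSuite

// the empty DataFrame only satisfies .onData; all metrics are expected to come
// from the repository via the previously stored resultKey
val verificationResult = VerificationSuite()
  .onData(spark.emptyDataFrame)
  .useRepository(repository)
  .reuseExistingResultsForKey(resultKey, failIfResultsForReusingMissing = true)
  .addCheck(check)
  .run()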
We should decide on a threshold for test coverage and have the build check that.
Create an example for computing and evaluating constraint suggestions for data.
We would like to support seasonality, multivariate metrics and more advanced AnomalyDetectionStrategies.
We should release a new RC for deequ today, as many of the examples depend on Stefan's latest changes which will not be in the RC0 release.
Current example here.
We would like to support referential integrity constraints and analyzers.
The AnalysisRunner is our main entry point for calculating metrics in a scenario where you don't want to write unit tests but want to do something different with the metrics. We need to make the distinction between the AnalysisRunner and the ColumnProfilerRunner clear.
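To make the distinction concrete, a minimal sketch of the AnalysisRunner path, which computes metrics without formulating any checks; df, spark and the column name "review_id" are assumptions for illustration:

import com.amazon.deequ.analyzers.{Completeness, Size}
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}

val analysisResult: AnalyzerContext = AnalysisRunner
  .onData(df)
  .addAnalyzer(Size())
  .addAnalyzer(Completeness("review_id"))
  .run()

// turn the computed metrics into a DataFrame for further processing
val metricsAsDataFrame = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)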
Automatic periodicity detection can be used as input to seasonal anomaly detection methods where users don't know the seasonality.
When running constraint suggestion with CategoricalRangeRule for columns with a cardinality of a few dozen values, the suggested constraint fails by a very small margin ("Value: 0.99999907 does not meet the constraint requirement!").
Do we need a separate markdown file where we explain all core concepts like analyzers, checks, constraints, metrics, column profiles, metric repositories, check levels etc.?
We should also have a list of all analyzers, constraints, anomaly detection methods and constraint suggestion rules we offer in one place.
Maybe even readthedocs documentation?
We should give hints on visualization and show some of the advanced options.
Constraints that are automatically suggested must be able to apply to the dataset that they were suggested from.
When using a ConstraintSuggestionRunner [1] to automatically suggest Check constraints on a DataFrame, the suggested .isNonNegative(...) constraints assume that the column they are applied to is of a numeric type. However, the constraint suggestion runner will produce this constraint on String-typed columns that contain numeric values. Therefore, this suggested .isNonNegative check will fail on the data that was used to generate it.
We should have more tests for the column profiling functionality.
Constraints that are automatically suggested must be able to apply to the dataset that they were suggested from.
When using a ConstraintSuggestionRunner [1] to automatically suggest Check constraints on a DataFrame, the suggested .hasDataType(...) constraints do not include any information about completeness. Thus, if the column that the .hasDataType constraint was suggested for has any null values (i.e. is not a complete column), then the suggested constraint will fail on the dataset that it was suggested from.
[1] Specifically, this expression is what is meant by "automatically suggested constraints":
ConstraintSuggestionRunner()
.onData(data)
.addConstraintRules(Rules.DEFAULT)
.run()
where data is an org.apache.spark.sql.DataFrame instance.
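To illustrate both issues above, the suggestions and the code they generate can be inspected before re-applying them; per the reports, this may include an .isNonNegative(...) on a String-typed column or a .hasDataType(...) on a column that also contains nulls. A sketch, reusing data from the footnote:

import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}

val suggestionResult = ConstraintSuggestionRunner()
  .onData(data)
  .addConstraintRules(Rules.DEFAULT)
  .run()

// print each suggested constraint and the Scala code it would generate
suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
  suggestions.foreach { suggestion =>
    println(s"$column: ${suggestion.description} -> ${suggestion.codeForConstraint}")
  }
}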
Create an example for using our single-column data profiling to profile data.
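A possible starting point for that example, assuming a DataFrame df; this sketch uses the current profiling API:

import com.amazon.deequ.profiles.ColumnProfilerRunner

val profilingResult = ColumnProfilerRunner()
  .onData(df)
  .run()

// one profile per column, holding completeness, approximate distinct count and inferred type
profilingResult.profiles.foreach { case (columnName, profile) =>
  println(s"$columnName: completeness=${profile.completeness}, " +
    s"approxDistinct=${profile.approximateNumDistinctValues}, dataType=${profile.dataType}")
}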
The constraint suggestion and column profiling API, which is still experimental, offers the option to print status updates after each pass over the data. We should replace this with proper logging and see if we can use that in other places in the library as well.
We should be able to easily publish to maven central.
For columns with continuous values (int, double, ...), we want to suggest appropriate constraints. We need to determine when we want to use static rules (e.g., max/min; this could replace the isPositive rule also) and when to use anomaly detection. May depend on whether there is a trend change in the data distribution.
We can use MiMa to verify that new versions of the library don't break existing clients. This could be hard though because MiMa plays nicely only with SBT.
Hi Deequ team,
This is a pretty useful tool. However, it currently only supports Spark 2.2. Is there a plan to support Spark 2.3?
The isContainedIn Checks will fail whenever there exists at least one column with a special character. Here, a special character means something that is reserved in the Spark SQL language, e.g. a [ or ]. isContainedIn generates SQL, but the column name is not properly escaped. This means that, at execution time, the generated SQL will fail with a syntax error. As a consequence, this syntax error will cascade through all other Checks when associated with a VerificationRunBuilder. Thus, even valid Checks will fail due to this SQL syntax error.
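A hypothetical reproduction of the escaping problem; the column name "my[col]" and the data are made up, and a SparkSession spark is assumed:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}
import spark.implicits._

val df = Seq(("a", 1), ("b", 2)).toDF("my[col]", "id")

// isContainedIn generates SQL that embeds the column name without escaping it,
// so this check (and every other check in the same run) fails with a syntax error
val check = Check(CheckLevel.Error, "escaping example")
  .isContainedIn("my[col]", Array("a", "b"))

VerificationSuite()
  .onData(df)
  .addCheck(check)
  .run()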
Currently, anomaly detection strategies can consume a single metric and its historic values. We would like to extend the interface for strategies such that they can be based on multiple metrics.
Write a short documentation on how to release.
Show how to use the method reuseExistingResultsForKey, and maybe show how it can be used in some example use cases.
Not sure if this is a Zeppelin-related issue or not, but I get the following error. I'm using Zeppelin 0.8.1 with Spark 2.4 and Scala 2.11.12, running on AWS EMR.
java.lang.NoSuchMethodError: scala.reflect.internal.Symbols$Symbol.originalEnclosingMethod()Lscala/reflect/internal/Symbols$Symbol;
at scala.tools.nsc.backend.jvm.GenASM$JPlainBuilder.getEnclosingMethodAttribute(GenASM.scala:1306)
at scala.tools.nsc.backend.jvm.GenASM$JPlainBuilder.genClass(GenASM.scala:1222)
at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.emitFor$1(GenASM.scala:135)
at scala.tools.nsc.backend.jvm.GenASM$AsmPhase.run(GenASM.scala:141)
at scala.tools.nsc.Global$Run.compileUnitsInternal(Global.scala:1625)
at scala.tools.nsc.Global$Run.compileUnits(Global.scala:1610)
at scala.tools.nsc.Global$Run.compileSources(Global.scala:1605)
at scala.tools.nsc.interpreter.IMain.compileSourcesKeepingRun(IMain.scala:388)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.compileAndSaveRun(IMain.scala:804)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.compile(IMain.scala:763)
at scala.tools.nsc.interpreter.IMain.bind(IMain.scala:627)
at scala.tools.nsc.interpreter.IMain.bind(IMain.scala:664)
at scala.tools.nsc.interpreter.IMain$$anonfun$quietBind$1.apply(IMain.scala:663)
at scala.tools.nsc.interpreter.IMain$$anonfun$quietBind$1.apply(IMain.scala:663)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:202)
at scala.tools.nsc.interpreter.IMain.quietBind(IMain.scala:663)
at org.apache.zeppelin.spark.SparkScala211Interpreter$.loopPostInit$1(SparkScala211Interpreter.scala:179)
at org.apache.zeppelin.spark.SparkScala211Interpreter$.org$apache$zeppelin$spark$SparkScala211Interpreter$$loopPostInit(SparkScala211Interpreter.scala:214)
at org.apache.zeppelin.spark.SparkScala211Interpreter.open(SparkScala211Interpreter.scala:86)
at org.apache.zeppelin.spark.NewSparkInterpreter.open(NewSparkInterpreter.java:102)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:62)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:617)
at org.apache.zeppelin.scheduler.Job.run(Job.java:188)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:140)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Example:
"column_hasVisit": {
"column": "column_hasVisit",
"completeness": 1,
"approximateNumDistinctValues": 2,
"dataType": {
"enumClass": "com.amazon.deequ.analyzers.DataTypeInstances",
"value": "Boolean"
},
"isDataTypeInferred": false,
"typeCounts": {},
"histogram": null
}
We should have an example that showcases how to use the MetricsRepositories to store, query and re-use computed metrics.
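A sketch of what that example could cover, assuming a DataFrame df with a column "id"; the tags and key are illustrative:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.memory.InMemoryMetricsRepository

val repository = new InMemoryMetricsRepository()
val resultKey = ResultKey(System.currentTimeMillis(), Map("pipeline" -> "example"))

// store the metrics computed during verification
VerificationSuite()
  .onData(df)
  .useRepository(repository)
  .saveOrAppendResult(resultKey)
  .addCheck(Check(CheckLevel.Error, "basic check").isComplete("id"))
  .run()

// query the stored metrics later, e.g. filtered by tag, as a DataFrame
val storedMetrics = repository.load()
  .withTagValues(Map("pipeline" -> "example"))
  .getSuccessMetricsAsDataFrame(spark)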
We should configure the RAT plugin for maven and have it check that all .scala files have appropriate license headers.
Hoping to use this in conjunction with Glue to automate testing into our data pipeline. Has anyone done this successfully?
I'm attempting to run the BasicExample.scala code using a SageMaker notebook attached to a Glue Development Endpoint. I'm working in the SparkMagic kernel.
The .isGreaterThan, .isLessThan, .isGreaterThanOrEqualTo, and .isLessThanOrEqualTo methods on the Check type will fail with a Spark SQL syntax error at runtime when applied to columns whose names contain special characters or keywords.
The normal Uniqueness analyzer is too expensive for large datasets.
We should add more examples showcasing the state support.
We have 4 main entry points in our API at the moment: VerificationSuite, AnalysisRunner, ColumnProfilerRunner and ConstraintSuggestionRunner.
Should we list these 4 in our main README in a separate paragraph and write one sentence for each of them to highlight the main use cases?
It is possible to have categorical data with numerical values, for example 1 indicating male and 2 indicating female.
It would be nice to have histogram analysis available for those columns. Currently, it is limited to boolean and string columns only.