Comments (8)
Cannot wait for these features to be added!
from deequ.
For now, you can compute the min/max for any datatype by aggregating the data yourself and then using another metric, e.g. the HistogramMetric. Example below:
import org.apache.spark.sql.functions.{min, max}
import com.amazon.deequ.analyzers.Histogram
import com.amazon.deequ.analyzers.runners.AnalysisRunner
import spark.implicits._ // enables the $"column" syntax (already in scope in spark-shell)
// init the data
case class Item(
  id: Long,
  productName: String,
  description: String,
  priority: String,
  numViews: Long
)
val rdd = spark.sparkContext.parallelize(Seq(
  Item(1, "Thingy A", "awesome thing.", "high", 0),
  Item(2, "Thingy B", "available at http://thingb.com", null, 0),
  Item(3, null, null, "low", 5),
  Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
  Item(5, "Thingy E", null, "high", 12)))
val data = spark.createDataFrame(rdd)
// compute the min/max directly, filtering out nulls first
val dataMinMax = (
  data.filter($"productName".isNotNull)
    .agg(min($"productName") as "minProductName", max($"productName") as "maxProductName")
)
// run a Histogram on the single aggregated row so the values show up as metrics
{ AnalysisRunner
  // data to run the analysis on
  .onData(dataMinMax)
  // define analyzers that compute metrics
  .addAnalyzer(Histogram("minProductName"))
  .addAnalyzer(Histogram("maxProductName"))
  .run()
  .allMetrics.foreach(println(_))
}
Ideally, Deequ should allow users to add their own Metrics and Analyzers. The framework should be open for extension.
from deequ.
Could someone elaborate a little more on what should be done in this task?
from deequ.
At the moment, deequ does not support any metric calculations on timestamp/date columns. The task here would be to integrate those. One problem is that most of our analyzers produce a DoubleMetric (e.g. for the max), but here we would need to implement a new Metric that operates on dates.
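To make the problem concrete, here is a minimal sketch in plain Scala of what a metric carrying a timestamp instead of a Double could look like. All names (SimpleMetric, SimpleDoubleMetric, SimpleTimestampMetric, maxTimestamp) are hypothetical simplifications for illustration and are not deequ's actual classes. It also assumes java.time.Instant as the value type and works on an in-memory Seq rather than a Spark column.

```scala
import java.time.Instant
import scala.util.Try

object TimestampMetricSketch {
  // Hypothetical, simplified metric abstraction (NOT deequ's real trait).
  trait SimpleMetric[T] {
    def name: String
    def column: String
    def value: Try[T]
  }

  // What most analyzers produce today: a metric whose value is a Double.
  case class SimpleDoubleMetric(name: String, column: String, value: Try[Double])
    extends SimpleMetric[Double]

  // The missing piece: a metric whose value is a point in time.
  case class SimpleTimestampMetric(name: String, column: String, value: Try[Instant])
    extends SimpleMetric[Instant]

  // Computes the latest timestamp, ignoring nulls (modelled here as None).
  // The metric holds a Failure if the column has no non-null values.
  def maxTimestamp(column: String, values: Seq[Option[Instant]]): SimpleTimestampMetric = {
    val present = values.flatten
    val result = Try {
      require(present.nonEmpty, s"no non-null values in column $column")
      present.reduce((a, b) => if (a.isAfter(b)) a else b)
    }
    SimpleTimestampMetric("Maximum", column, result)
  }
}
```

The point of the sketch is only that the result type changes from Try[Double] to Try[Instant]; anything downstream that assumes a Double value would need to handle the new metric type as well.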
from deequ.
Is it OK to change Maximum and Minimum so that they support any type with a total order instead of only Double?
I can give it a try then.
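As a rough illustration of the "any type with a total order" idea, here is a small sketch in plain Scala that uses an Ordering type class. The helpers maximumOf and minimumOf are hypothetical and are not part of deequ; they operate on an in-memory Seq, with nulls modelled as None. The Ordering for java.time.Instant is defined explicitly as an assumption, so the example does not rely on implicit resolution for Java Comparable types.

```scala
import java.time.Instant

object OrderedMinMax {
  // Explicit total order on timestamps (assumption: Instant as the value type).
  implicit val instantOrdering: Ordering[Instant] =
    Ordering.fromLessThan[Instant](_ isBefore _)

  // Hypothetical helper: maximum of the non-null values, None if there are none.
  def maximumOf[T](values: Seq[Option[T]])(implicit ord: Ordering[T]): Option[T] =
    values.flatten.reduceOption((a, b) => ord.max(a, b))

  // Hypothetical helper: minimum of the non-null values, None if there are none.
  def minimumOf[T](values: Seq[Option[T]])(implicit ord: Ordering[T]): Option[T] =
    values.flatten.reduceOption((a, b) => ord.min(a, b))
}
```

The same two functions then work for Double, String, and Instant alike. The catch is that the result is no longer always a Double, which is why the metric type would need to change too.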
from deequ.
Let's try to make it support timestamps in addition to what it supports now. In general, we only operate on Spark's supported column types.
from deequ.
I have implemented entirely new analyzers for timestamps, with metrics that support timestamp values, and added constraints that support Spark's DateType and TimestampType to cover many use cases. I would like to submit a PR, if one is wanted?
from deequ.
We would be very happy to receive such a PR!
from deequ.