
Comments (8)

apython1998 commented on July 30, 2024 3

Cannot wait for these features to be added!

from deequ.

MassyB commented on July 30, 2024 1

For now, you can compute the min/max for any data type by aggregating the data yourself and then using another metric, the HistogramMetric. Example below:

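import org.apache.spark.sql.functions.{min, max}
import com.amazon.deequ.analyzers.Histogram
import com.amazon.deequ.analyzers.runners.AnalysisRunner
// needed for the $"column" syntax outside spark-shell
import spark.implicits._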
// init the data
case class Item(
  id: Long,
  productName: String,
  description: String,
  priority: String,
  numViews: Long
)


val rdd = spark.sparkContext.parallelize(Seq(
  Item(1, "Thingy A", "awesome thing.", "high", 0),
  Item(2, "Thingy B", "available at http://thingb.com", null, 0),
  Item(3, null, null, "low", 5),
  Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
  Item(5, "Thingy E", null, "high", 12)))

val data = spark.createDataFrame(rdd)

// compute the min/max directly, filtering out nulls
val dataMinMax = (
  data
    .filter($"productName".isNotNull)
    .agg(
      min($"productName") as "minProductName",
      max($"productName") as "maxProductName")
)

// now run a Histogram on the aggregated one-row DataFrame;
// the single value in each histogram is the min/max

{ AnalysisRunner
  // data to run the analysis on
  .onData(dataMinMax)
  // define analyzers that compute metrics
  .addAnalyzer(Histogram("minProductName"))
  .addAnalyzer(Histogram("maxProductName"))
  .run()
  .allMetrics.foreach(println(_))
}

Ideally Deequ should allow users to add Metrics and Analyzers. The framework should be open for extensions.


klangner commented on July 30, 2024

Could someone elaborate a little more on what should be done in this task?


sscdotopen commented on July 30, 2024

At the moment, deequ does not support any metrics calculations on timestamp/date columns. The task here would be to integrate those. A problem is that most of our analyzers produce a DoubleMetric (e.g. for the max), but here we would need to implement a new Metric that operates on dates.
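
To make the gap concrete, here is a minimal plain-Scala sketch of what a timestamp-valued metric could look like next to a double-valued one. It is only an illustration under assumptions: `SketchMetric`, `DoubleSketchMetric`, `TimestampSketchMetric` and `maxTimestamp` are hypothetical names made up for this sketch and are not part of deequ's API, and the column values are modelled as a plain `Seq[Option[Timestamp]]` rather than a Spark column.

```scala
import java.sql.Timestamp
import scala.util.{Failure, Success, Try}

// Hypothetical stand-ins for illustration only; these are NOT deequ classes.
sealed trait SketchMetric[T] {
  def name: String
  def column: String
  def value: Try[T]
}

// The kind of metric most analyzers produce today: the value is a Double.
final case class DoubleSketchMetric(name: String, column: String, value: Try[Double])
  extends SketchMetric[Double]

// What a date-aware analyzer would need: the value is a timestamp.
final case class TimestampSketchMetric(name: String, column: String, value: Try[Timestamp])
  extends SketchMetric[Timestamp]

// Computes the maximum over the non-null values of a column.
// Comparison uses getTime, i.e. millisecond precision.
def maxTimestamp(column: String, values: Seq[Option[Timestamp]]): TimestampSketchMetric = {
  val present = values.flatten
  val result: Try[Timestamp] =
    if (present.isEmpty) Failure(new NoSuchElementException(s"no non-null values in $column"))
    else Success(present.maxBy(_.getTime))
  TimestampSketchMetric("Maximum", column, result)
}

// Made-up sample values for a hypothetical "createdAt" column.
val createdAt = Seq(
  Some(Timestamp.valueOf("2019-01-01 00:00:00")),
  None,
  Some(Timestamp.valueOf("2019-03-15 12:30:00"))
)
val metric = maxTimestamp("createdAt", createdAt)
```

The point of the sketch is only that the metric's value type has to change; a result typed as `Try[Double]` cannot carry a timestamp without a lossy conversion.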


klangner commented on July 30, 2024

Is it OK to change Maximum and Minimum so that they support any type with a total order instead of only Double?
I can give it a try then.
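
As a rough sketch of the "any type with a total order" idea, plain Scala can express it with an `Ordering[T]` type class. This is not how deequ's Maximum and Minimum analyzers are implemented; `minMax` is a hypothetical helper written only for illustration, and it works on an in-memory `Seq[Option[T]]` rather than a Spark column.

```scala
import java.time.Instant

// Hypothetical helper for illustration; not deequ's Maximum/Minimum analyzers.
// Returns (min, max) of the non-null values, or None if there are none.
def minMax[T](values: Seq[Option[T]])(implicit ord: Ordering[T]): Option[(T, T)] = {
  val present = values.flatten
  if (present.isEmpty) None else Some((present.min, present.max))
}

// An explicit total order on timestamps, passed by hand below.
val instantOrdering: Ordering[Instant] = Ordering.fromLessThan[Instant](_ isBefore _)

// Numeric values (the numViews column from the example data earlier in the thread).
val views = minMax(Seq(Some(0L), Some(0L), Some(5L), Some(10L), Some(12L)))

// String values, with a null represented as None.
val names = minMax(Seq(Some("Thingy A"), Some("Thingy B"), None, Some("Thingy D"), Some("Thingy E")))

// Timestamp values (made-up sample data).
val instants = minMax(Seq(
  Some(Instant.parse("2019-03-15T12:30:00Z")),
  Some(Instant.parse("2019-01-01T00:00:00Z"))
))(instantOrdering)
```

The same function covers numbers, strings and timestamps; the only thing that varies is the ordering supplied for the element type.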


sscdotopen commented on July 30, 2024

Let's try to make it support timestamps in addition to what it supports now. In general, we only operate on Spark's supported column types.


Yash0215 commented on July 30, 2024

I have implemented entirely new analyzers for timestamps, with metrics that support timestamp values, and added constraints that support Spark's DateType and TimestampType to cover many use cases. I would like to submit a PR, if one is wanted.


sscdotopen commented on July 30, 2024

We would be very happy to receive such a PR!

