Git Product home page Git Product logo

streamdq's Introduction

StreamDQ

StreamDQ GitHub license Build Status codecov

StreamDQ is a library built on top of Apache Flink for defining "unit tests for data", which measure data quality in large data streams.

Run StreamDQ locally

Prerequisite: Java 11, Maven

  1. Clone this repository

  2. Switch to its directory

    cd StreamDQ

  3. Install and run the tests

    mvn install

How to use StreamDQ

import com.stefan_grafberger.streamdq.VerificationSuite
import com.stefan_grafberger.streamdq.anomalydetection.detectors.aggregatedetector.AggregateAnomalyCheck
import com.stefan_grafberger.streamdq.anomalydetection.strategies.DetectionStrategy
import com.stefan_grafberger.streamdq.checks.aggregate.AggregateCheck
import com.stefan_grafberger.streamdq.checks.row.RowLevelCheck
import com.stefan_grafberger.streamdq.VerificationSuite

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.createLocalEnvironment(LOCAL_PARALLELISM)
env.streamTimeCharacteristic = TimeCharacteristic.EventTime

val rawStream = ClickData.readData(env)
val keyedStream = rawStream.keyBy { data -> data.userId }

val verificationResult = VerificationSuite()
    .onDataStream(keyedStream, env.config)
    .addRowLevelCheck(RowLevelCheck()
        .isContainedIn("priority", listOf(Priority.HIGH, Priority.LOW))
        .isInRange("numViews", BigDecimal.valueOf(0), BigDecimal.valueOf(1_000_000)
        .matchesPattern("email", Pattern.compile(EMAIL_REGEX))))
    .addAggregateCheck(AggregateCheck()
        .onWindow(TumblingEventTimeWindows.of(Time.seconds(10)))
        .hasCompletenessBetween("productName", 0.8, 1.0)
        .hasApproxUniquenessBetween("id", 0.9, 1.0)
        .hasApproxQuantileBetween("numViews", 0.5, 0.0, 10.0))
    .addAggregateCheck(AggregateCheck()
        .onContinuousStreamWithTrigger(CountTrigger.of(100))
        .hasApproxCountDistinctBetween("productName", 5_000_000, 10_000_000))
    .addAnomalyCheck(AggregateAnomalyCheck()
        .onCompleteness("productId")
        .withWindow(TumblingEventTimeWindows.of(Time.milliseconds(100)))
        .withStrategy(DetectionStrategy().onlineNormal(0.1, 1.0))
        .build())
    .build()                

License

This library is licensed under the Apache 2.0 License.

streamdq's People

Contributors

stefan-grafberger avatar tong-woo avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.