
data-unit-test

data-unit-test is a framework that allows users to test data stored in a database. The database can be Hive, Redshift, Athena, or even MySQL. Users provide a YAML file containing all the test specifications. The framework parses the YAML input file and creates the executable tests. These tests are executed against the database provided by the user, and the results are validated against the expected results the user supplied. The framework uses TestNG; each test case is executed as a JUnit-style test via the TestNG framework. A test report is also generated so users can view the output of their test cases.

Build and Run

The project requires sbt to be installed on the local system. Once sbt is installed, the user can build the project locally to generate the uber jar. This jar can then be run using the run.sh script provided with the project. After forking the project, use the following commands to clone and build it.

  • $ git clone <github_repo>
  • $ cd <github_repo>
  • $ sbt clean assembly

The last command generates the uber jar in the target folder. The user can then create the YAML file containing all the tests. Sample YAML files can be found in the examples directory of the project. For running the test cases, see Executing Tests below.

Generating JavaDocs

The Javadocs can be generated with sbt using the following command.

  • $ sbt clean doc

Notes

The generated Javadocs will be placed in the target/api directory.

Supported tests

The following test scenarios are currently supported. Every test except assert_fails accepts its expected data as a LIST, a MAP, or a CSV file; these expectation formats, with examples, are described after the test descriptions.

assert_equals

Tests the actual and expected data for equality. Note that this does not check the ordering of the data: a Hive table can contain millions of rows that are processed in a distributed fashion, so the ordering of results cannot be guaranteed. The equality check therefore disregards the order of results when comparing the expected and actual data.

assert_includes

Tests that the expected results are included in the actual result set obtained by querying the data. This does not check the ordering of data, for the same reason as above.

assert_excludes

Tests that the expected results are excluded from the actual result set obtained by querying the data. This does not check the ordering of data, for the same reason as above.

assert_ordered_equals

The most restrictive of all tests. This performs an exact match of the expected and actual results, including their ordering.

assert_fails

Checks for exceptions. Sometimes we expect an exception to be thrown while executing the query. Unfortunately, Hive mostly wraps its exceptions, and often the only way to pinpoint an issue is to look at the exception message. This test therefore expects a STRING message and checks for the existence of that message in the actual error stack trace. Examples:

"org.apache.spark.sql.AnalysisException"

"org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:No privilege 'Select' found for inputs { database:usage})"

Expectation formats (for all tests except assert_fails):

LIST. Expects a list of lists, i.e. the user provides a list of rows with comma-separated column values. Example:

- [dt=20170202/dh=00]
- [dt=20170202/dh=01]
- [dt=20170202/dh=02]

MAP. Expects a list of maps, where each map corresponds to one row: the keys are the column names and the values are the column values. Example:

- id: 1
  name: name1
- id: 2
  name: name2
- id: 3
  name: name3
- id: 4
  name: name4
- id: 5
  name: name5

CSV. Expects the name of a CSV file whose header line contains the column names and whose body contains the row values, one line per row. Example:

src/test/resources/expected_equals.csv
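The order-insensitive comparison described for assert_equals can be sketched as a multiset comparison. This is a hypothetical illustration of the technique, not the framework's actual implementation:

```python
from collections import Counter

def rows_equal_unordered(expected, actual):
    """Compare two row sets for equality while disregarding row order.

    Each row is a list of column values; converting rows to tuples makes
    them hashable so Counter can compare the two multisets, which also
    handles duplicate rows correctly (a plain set comparison would not).
    """
    return Counter(map(tuple, expected)) == Counter(map(tuple, actual))

# Rows may come back in a different order from a distributed query engine,
# but the comparison still succeeds:
expected = [[1, "name1"], [2, "name2"], [3, "name3"]]
actual = [[3, "name3"], [1, "name1"], [2, "name2"]]
print(rows_equal_unordered(expected, actual))  # True
```

assert_ordered_equals, by contrast, would compare the two lists directly, so the reordering above would make it fail.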

Return types

LIST. The expected value provided by the user is a list of lists. Corresponding Java type: ArrayList<ArrayList<Object>>. Example:

- [dt=20170202/dh=00]
- [dt=20170202/dh=01]
- [dt=20170202/dh=02]

MAP. The expected value provided by the user is a list of maps. Corresponding Java type: ArrayList<Map<String,String>>. Example:

- id: 1
  name: name1
- id: 2
  name: name2
- id: 3
  name: name3
- id: 4
  name: name4
- id: 5
  name: name5

CSV. The expected value provided by the user is the name of a local CSV file. Corresponding Java type: String. The name should be enclosed in double quotes if it contains a space, to avoid the default whitespace-based string splitting performed by Java. Example:

src/test/resources/expected_equals.csv

Exception. The expected value provided by the user is an exception message. Corresponding Java type: String. It should be enclosed in double quotes to avoid the default whitespace-based string splitting performed by Java. Examples:

"org.apache.spark.sql.AnalysisException"

"org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:No privilege 'Select' found for inputs { database:usage})"

YAML Specifications

name (required). Unique name of the test case. Example: test assert permission exception

setup (optional). A list of setup queries to run before the actual test case. Example:

- use temp
- drop table if exists test_list

test (required). A single query to run, whose result is tested against the expected data provided by the user. Example: select id, name from test_list

assert_type (required). One of the assert types mentioned above.

teardown (optional). A list of teardown queries to run after the test has been performed. Example:

- drop table test_list

ignore_failures (optional). Boolean parameter indicating whether the test should be marked as a success or a failure when an error occurs while executing the setup or teardown queries. If this is set to true and such an error occurs, the test is marked as failed. Example: true or false

return_type (required). One of the return types mentioned above.

assert_expectations (required). The expected value to test against the actual results of the test query. It must match the return type: a list of lists for LIST, a list of maps for MAP, a local CSV file name for CSV, and the expected exception message for Exception. Examples are given under the return types and assert types above.
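Putting the specification together, a complete test case might look like the following sketch. The table name, query, and expected values here are hypothetical; the field names follow the spec above, and the exact top-level layout (single test vs. list of tests) is shown by the sample files in the examples directory:

```yaml
# Hypothetical test case illustrating the YAML spec above.
- name: test list equality              # unique test name
  setup:                                # optional setup queries
    - use temp
    - drop table if exists test_list
  test: select id, name from test_list  # the query under test
  assert_type: assert_equals            # one of the supported assert types
  return_type: MAP                      # one of the supported return types
  ignore_failures: false
  assert_expectations:                  # a list of maps, matching return_type MAP
    - id: 1
      name: name1
    - id: 2
      name: name2
  teardown:                             # optional teardown queries
    - drop table test_list
```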

Executing tests

The framework packages all the necessary classes and their dependencies into an uber jar; the packaging is done using sbt. To run the test cases, the user must provide the tests as a YAML file adhering to the specs mentioned above. The following parameters are supported when running:

MAIN_JAR_LOCATION (required). Location of the main application jar. Example: lib/validation-framework-assembly-1.0.jar

EXTRA_CLASSPATH (required when using an external driver, optional otherwise). Location of the extra jars containing the external driver class. Example: lib/hive-jdbc-1.2.0.jar

YML_FILE (required). The test YAML file containing all the tests to be run. Example: tests.yml

DB_URL (required). The JDBC endpoint the framework should connect to in order to execute the tests. Note that this must be enclosed in double quotes, since it may contain many special characters that could trigger Java's argument parsing. Example: "jdbc:hive2://:/;"

USER_NAME (required). The user name to use when connecting to the JDBC endpoint.

USER_PASSWORD (required). The password to use when connecting to the database.

LOGGER (optional). The log4j.properties file to be used by the framework. By default the framework includes a console logger that prints all logs of level INFO and above to the console. Use this option to supply a separate logger, for example to dump the logs to a file or to change the log level. Example: log4j.properties

ACTIVE_CONN_COUNT (optional). The number of active connections to maintain in the connection pool. Defaults to 5.

CONN_IDLE_TIMEOUT (optional). The time the framework waits before removing an abandoned connection. Defaults to 1.

JDBC_DRIVER (required when using a third-party JDBC driver). The external JDBC driver class name. Example: driver.class.name

A utility shell script is also provided to run the framework; it can be downloaded from GitHub (https://git.autodesk.com/cloudplatform-bigdata/DataUnitTest/blob/master/run.sh).

It can be invoked as follows:

./run.sh [--MAIN_JAR_LOCATION | -M <main jar location containing the application jar>] [--EXTRA_CLASSPATH | -E <location of extra jars containing the external driver class>] [--YML_FILE | -Y <test yml file>] [--DB_URL | -D <jdbc end point>] [--USER_NAME | -U <user name>] [--USER_PASSWORD | -P <password>] [--JDBC_DRIVER | -J <JDBC driver class name>] [--LOGGER | -L <log4j properties file>] [--ACTIVE_CONN_COUNT | -A <active connections count> ] [--CONN_IDLE_TIMEOUT | -T <connection abandon timeout>]
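For illustration, a concrete invocation against a Hive endpoint might look like the sketch below. The jar names, host, credentials, and driver class here are hypothetical placeholders, not values prescribed by the framework:

```shell
# Illustrative run.sh invocation; substitute your own jar paths,
# JDBC URL, credentials, and driver class.
./run.sh \
  --MAIN_JAR_LOCATION lib/validation-framework-assembly-1.0.jar \
  --EXTRA_CLASSPATH lib/hive-jdbc-1.2.0.jar \
  --YML_FILE tests.yml \
  --DB_URL "jdbc:hive2://hive-host:10000/default" \
  --USER_NAME test_user \
  --USER_PASSWORD secret \
  --JDBC_DRIVER org.apache.hive.jdbc.HiveDriver
```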

Sample Tests

Sample test YAML files can be found in the git repo (https://git.autodesk.com/cloudplatform-bigdata/DataUnitTest) under src/test/resources.

Test results

The test results are generated in the test-output directory. The tests.html file under the validation-tests directory gives the output in HTML format, while testng-results.xml gives it as XML.

Contributing

The data-unit-test project is meant to evolve with feedback - the project and its users greatly appreciate any thoughts on ways to improve the design or features. Read below to see how you can take part and contribute:

Contributing Guide

Read our guide to learn about the development process and how to work with the core team.

License

data-unit-test is Apache-2.0 licensed

Future Enhancements

The current scope of the test framework is rather limited. Future enhancements could include the following:

  • Templatization of common tests: Common tests such as count, max, and min can be templatized. These templates would have placeholders for the table name, column name, etc., and could be referenced from the actual test YAML file. The framework would replace the placeholders with the values provided in the test YAML file and execute the corresponding queries. This would let users supply just the common parameters for the templates and reduce the scope for errors.
  • DB lookup: Currently, users have to provide static expectation data, either as a List/Map or as a CSV file. A feature could be built where the user provides a JDBC connection and a query to execute against that JDBC endpoint. The framework would query the endpoint to fetch the expected result and convert it into the appropriate format for matching against the actual results.
