
data-unit-test

data-unit-test is a framework that allows users to test data stored in a database. The database can be Hive, Redshift, Athena, or even MySQL. Users provide a YAML file containing all the test specifications. The framework parses the YAML input file and creates the executable tests. These tests are executed against the database provided by the user, and the results are validated against the expected results the user supplied. The framework uses TestNG; each test case is executed as a JUnit-style test via the TestNG framework. A test report is also generated so users can view the output of their test cases.

Build and Run

The project requires sbt to be installed on the local system. Once sbt is installed, the user can build the project locally to generate the uber jar. This jar can then be run using the run.sh script provided with the project. After forking the project, use the following commands to clone and build it.

  • $ git clone <github_repo>
  • $ cd <github_repo>
  • $ sbt clean assembly

The last command generates the uber jar in the target folder. The user can then create the YAML file containing all the tests. Sample YAML files can be found in the examples directory of the project. For running the test cases, see Executing Tests below.

Generating JavaDocs

The Javadocs can be generated with sbt using the following command.

  • $ sbt clean doc

Notes

The generated Javadocs will be placed in the target/api directory.

Supported tests

The following test scenarios are currently supported. Every test except assert_fails accepts its expected data as a LIST, a MAP, or a CSV file; these expectation formats, with examples, are described after the test descriptions.

assert_equals

Tests the actual and expected data for equality. Note that this does not check the ordering of the data: a Hive table can contain millions of rows that are processed in a distributed fashion, so the ordering of results cannot be guaranteed. The equality check therefore disregards the order of results when comparing the expected and actual data.

assert_includes

Tests that the expected results are included in the actual result set obtained by querying the data. This does not check the ordering of data, for the same reason as above.

assert_excludes

Tests that the expected results are excluded from the actual result set obtained by querying the data. This does not check the ordering of data, for the same reason as above.

assert_ordered_equals

The most restrictive of all tests. This performs an exact match of the expected and actual results, including their ordering.

assert_fails

Checks for exceptions. Sometimes we expect an exception to be thrown while executing the query. Unfortunately, Hive mostly wraps its exceptions, and often the only way to pinpoint an issue is to look at the exception message. This test therefore expects a STRING message and checks for the existence of that message in the actual error stack trace. Examples:

"org.apache.spark.sql.AnalysisException"

"org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:No privilege 'Select' found for inputs { database:usage})"

Expectation formats (for all tests except assert_fails):

LIST. Expects a list of lists, i.e. the user provides a list of rows with comma-separated column values. Example:

- [dt=20170202/dh=00]
- [dt=20170202/dh=01]
- [dt=20170202/dh=02]

MAP. Expects a list of maps, where each map corresponds to one row: the keys are the column names and the values are the column values. Example:

- id: 1
  name: name1
- id: 2
  name: name2
- id: 3
  name: name3
- id: 4
  name: name4
- id: 5
  name: name5

CSV. Expects the name of a CSV file whose header line contains the column names and whose body contains the row values, one line per row. Example:

src/test/resources/expected_equals.csv
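The order-insensitive comparison described for assert_equals can be sketched as a multiset comparison. This is a hypothetical illustration of the technique, not the framework's actual implementation:

```python
from collections import Counter

def rows_equal_unordered(expected, actual):
    """Compare two row sets for equality while disregarding row order.

    Each row is a list of column values; converting rows to tuples makes
    them hashable so Counter can compare the two multisets, which also
    handles duplicate rows correctly (a plain set comparison would not).
    """
    return Counter(map(tuple, expected)) == Counter(map(tuple, actual))

# Rows may come back in a different order from a distributed query engine,
# but the comparison still succeeds:
expected = [[1, "name1"], [2, "name2"], [3, "name3"]]
actual = [[3, "name3"], [1, "name1"], [2, "name2"]]
print(rows_equal_unordered(expected, actual))  # True
```

assert_ordered_equals, by contrast, would compare the two lists directly, so the reordering above would make it fail.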

Return types

LIST. The expected value provided by the user is a list of lists. Corresponding Java type: ArrayList<ArrayList<Object>>. Example:

- [dt=20170202/dh=00]
- [dt=20170202/dh=01]
- [dt=20170202/dh=02]

MAP. The expected value provided by the user is a list of maps. Corresponding Java type: ArrayList<Map<String,String>>. Example:

- id: 1
  name: name1
- id: 2
  name: name2
- id: 3
  name: name3
- id: 4
  name: name4
- id: 5
  name: name5

CSV. The expected value provided by the user is the name of a local CSV file. Corresponding Java type: String. The name should be enclosed in double quotes if it contains a space, to avoid the default whitespace-based string splitting performed by Java. Example:

src/test/resources/expected_equals.csv

Exception. The expected value provided by the user is an exception message. Corresponding Java type: String. It should be enclosed in double quotes to avoid the default whitespace-based string splitting performed by Java. Examples:

"org.apache.spark.sql.AnalysisException"

"org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:No privilege 'Select' found for inputs { database:usage})"

YAML Specifications

name (required). Unique name of the test case. Example: test assert permission exception

setup (optional). A list of setup queries to run before the actual test case. Example:

- use temp
- drop table if exists test_list

test (required). A single query to run, whose result is tested against the expected data provided by the user. Example: select id, name from test_list

assert_type (required). One of the assert types mentioned above.

teardown (optional). A list of teardown queries to run after the test has been performed. Example:

- drop table test_list

ignore_failures (optional). Boolean parameter indicating whether the test should be marked as a success or a failure when an error occurs while executing the setup or teardown queries. If this is set to true and such an error occurs, the test is marked as failed. Example: true or false

return_type (required). One of the return types mentioned above.

assert_expectations (required). The expected value to test against the actual results of the test query. It must match the return type: a list of lists for LIST, a list of maps for MAP, a local CSV file name for CSV, and the expected exception message for Exception. Examples are given under the return types and assert types above.
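Putting the specification together, a complete test case might look like the following sketch. The table name, query, and expected values here are hypothetical; the field names follow the spec above, and the exact top-level layout (single test vs. list of tests) is shown by the sample files in the examples directory:

```yaml
# Hypothetical test case illustrating the YAML spec above.
- name: test list equality              # unique test name
  setup:                                # optional setup queries
    - use temp
    - drop table if exists test_list
  test: select id, name from test_list  # the query under test
  assert_type: assert_equals            # one of the supported assert types
  return_type: MAP                      # one of the supported return types
  ignore_failures: false
  assert_expectations:                  # a list of maps, matching return_type MAP
    - id: 1
      name: name1
    - id: 2
      name: name2
  teardown:                             # optional teardown queries
    - drop table test_list
```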

Executing tests

The framework packages all the necessary classes and their dependencies into an uber jar; the packaging is done using sbt. To run the test cases, the user must provide the tests as a YAML file adhering to the specs mentioned above. The following parameters are supported when running:

MAIN_JAR_LOCATION (required). Location of the main application jar. Example: lib/validation-framework-assembly-1.0.jar

EXTRA_CLASSPATH (required when using an external driver, optional otherwise). Location of the extra jars containing the external driver class. Example: lib/hive-jdbc-1.2.0.jar

YML_FILE (required). The test YAML file containing all the tests to be run. Example: tests.yml

DB_URL (required). The JDBC endpoint the framework should connect to in order to execute the tests. Note that this must be enclosed in double quotes, since it may contain many special characters that could trigger Java's argument parsing. Example: "jdbc:hive2://:/;"

USER_NAME (required). The user name to use when connecting to the JDBC endpoint.

USER_PASSWORD (required). The password to use when connecting to the database.

LOGGER (optional). The log4j.properties file to be used by the framework. By default the framework includes a console logger that prints all logs of level INFO and above to the console. Use this option to supply a separate logger, for example to dump the logs to a file or to change the log level. Example: log4j.properties

ACTIVE_CONN_COUNT (optional). The number of active connections to maintain in the connection pool. Defaults to 5.

CONN_IDLE_TIMEOUT (optional). The time the framework waits before removing an abandoned connection. Defaults to 1.

JDBC_DRIVER (required when using a third-party JDBC driver). The external JDBC driver class name. Example: driver.class.name

A utility shell script is also provided to run the framework; it can be downloaded from GitHub (https://git.autodesk.com/cloudplatform-bigdata/DataUnitTest/blob/master/run.sh).

It can be invoked as follows:

./run.sh [--MAIN_JAR_LOCATION | -M <main jar location containing the application jar>] [--EXTRA_CLASSPATH | -E <location of extra jars containing the external driver class>] [--YML_FILE | -Y <test yml file>] [--DB_URL | -D <jdbc end point>] [--USER_NAME | -U <user name>] [--USER_PASSWORD | -P <password>] [--JDBC_DRIVER | -J <JDBC driver class name>] [--LOGGER | -L <log4j properties file>] [--ACTIVE_CONN_COUNT | -A <active connections count> ] [--CONN_IDLE_TIMEOUT | -T <connection abandon timeout>]
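For illustration, a concrete invocation against a Hive endpoint might look like the sketch below. The jar names, host, credentials, and driver class here are hypothetical placeholders, not values prescribed by the framework:

```shell
# Illustrative run.sh invocation; substitute your own jar paths,
# JDBC URL, credentials, and driver class.
./run.sh \
  --MAIN_JAR_LOCATION lib/validation-framework-assembly-1.0.jar \
  --EXTRA_CLASSPATH lib/hive-jdbc-1.2.0.jar \
  --YML_FILE tests.yml \
  --DB_URL "jdbc:hive2://hive-host:10000/default" \
  --USER_NAME test_user \
  --USER_PASSWORD secret \
  --JDBC_DRIVER org.apache.hive.jdbc.HiveDriver
```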

Sample Tests

Sample test YAML files can be found in the git repo (https://git.autodesk.com/cloudplatform-bigdata/DataUnitTest) under src/test/resources.

Test results

The test results are generated in the test-output directory. The tests.html file under the validation-tests directory gives the output in HTML format, while testng-results.xml gives it as XML.

Contributing

The data-unit-test project is meant to evolve with feedback - the project and its users greatly appreciate any thoughts on ways to improve the design or features. Read below to see how you can take part and contribute:

Contributing Guide

Read our guide to learn about the development process and how to work with the core team.

License

data-unit-test is Apache-2.0 licensed

Future Enhancements

The current scope of the test framework is rather limited. Future enhancements could include the following:

  • Templatization of common tests: Common tests such as count, max, and min can be templatized. These templates would have placeholders for the table name, column name, etc., and could be referenced from the actual test YAML file. The framework would replace the placeholders with the values provided in the test YAML file and execute the corresponding queries. This would let users supply just the common parameters for the templates and reduce the scope for errors.
  • DB lookup: Currently, users have to provide static expectation data, either as a List/Map or as a CSV file. A feature could be built where the user provides a JDBC connection and a query to execute against that JDBC endpoint. The framework would query the endpoint to fetch the expected result and convert it into the appropriate format for matching against the actual results.
