
soda-sql's Introduction

Soda SQL and Soda Spark have become Soda Core

Soda SQL and Soda Spark are deprecated in favor of Soda Core and the Soda Checks Language. If you are new to Soda, start with Soda Core!


Data testing, monitoring, and profiling for SQL-accessible data.

License: Apache 2.0 · Slack · PyPI · Soda SQL Build



✔ Install Soda SQL from the command-line
✔ Access comprehensive Soda SQL documentation
✔ Compatible with Snowflake, Amazon Redshift, BigQuery, and more
✔ Write tests in a YAML file
✔ Run programmatic scans to test data quality (see the sketch below)
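
Scans can also be launched from Python. Below is a minimal sketch of the programmatic scan API, consistent with the ScanBuilder usage that appears in the issue tracebacks further down; the file paths are placeholders, and the method names should be checked against your installed version.

# Minimal programmatic scan sketch (paths are placeholders)
from sodasql.scan.scan_builder import ScanBuilder

scan_builder = ScanBuilder()
scan_builder.warehouse_yml_file = 'warehouse.yml'
scan_builder.scan_yml_file = 'tables/breakdowns.yml'

scan = scan_builder.build()
scan_result = scan.execute()

if scan_result.has_test_failures():
    raise RuntimeError('Scan has test failures, stop the pipeline')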

Example scan YAML file (Deprecated)

table_name: breakdowns
metrics:
  - row_count
  - missing_count
  - missing_percentage
...
# Validates that a table has rows
tests:
  - row_count > 0

# Tests that numbers in the column are entered in a valid format as whole numbers
columns:
  incident_number:
    valid_format: number_whole
    tests:
      - invalid_percentage == 0

# Tests that no values in the column are missing
  school_year:
    tests:
      - missing_count == 0

# Tests for duplicates in a column
  bus_no:
    tests:
      - duplicate_count == 0

# Compares row count between datasets
sql_metrics:
  - sql: |
      SELECT COUNT(*) as other_row_count
      FROM other_table
    tests:
      - row_count == other_row_count

Install (Deprecated)

Contributors ✨

Thanks goes to these wonderful people! (emoji key)


  • Vijay Kiran 💻
  • abhishek khare 💻
  • Jelte Hoekstra 💻 📖
  • Cor 💻 📖
  • Milan Aleksić 🚇
  • Ayoub Fakir 💻
  • Alex Tonkonozhenko 💻
  • Todd de Quincey 💻
  • Antonin Jousson 💻
  • Jonas 🚇
  • cwouter 💻
  • Janet R 📖
  • Bastien Boutonnet 💻
  • Tom Baeyens 💻
  • AlessandroLollo 💻
  • mmigdiso 💻
  • ericmuijs 💻
  • Lieven Govaerts 💻
  • Milan Lukac 💻
  • Sebastián Villarroel 💻
  • Benjamin Berriot 💻
  • Alexey Minakov 💻

This project follows the all-contributors specification.

OpenTelemetry Tracking (Deprecated)

Soda SQL collects statistical usage and performance information via the OpenTelemetry framework. This information helps the Soda Core developer team proactively track performance issues and understand how users interact with the tool.

The collected information is strictly limited to usage and performance and does not contain Personal Identifying Information (PII). It is used for internal purposes only. Soda keeps the data in its raw form for a maximum of five years. If some information needs to be kept for longer, it will only be kept in aggregated form.

Read more about the information Soda tracks, and learn how to opt out of tracking, by consulting the Anonymous usage statistics documentation.
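
For reference, the opt-out itself is a one-line configuration change; the key name below is the one given in the usage-statistics documentation and should be verified there:

# ~/.soda/config.yml
send_anonymous_usage_stats: false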

soda-sql's People

Contributors

alessandrolollo, allcontributors[bot], antoninj, bastienboutonnet, cwouter, dirkgroenen, fakirayoub, ilhamikalkan, janet-can, jczuurmond, lgov, lucas-houles, m1n0, marciogualtieri, milanaleksic, mmigdiso, scottatdisney, sintezcs, toddy86, tombaeyens, tonkonozhenko, vijaykiran


soda-sql's Issues

[bug] Weird test execution result when using 'named tests'

Describe the bug
The test execution result reports None cause None=None when using named tests. This only seems to happen with 'named tests' like:

tests:
  dataset_size: row_count == 0

  | 62 measurements computed
  | 1 tests executed
  | 1 of 1 tests failed:
  |   None cause None=None

When providing a list of 'anonymous' tests it correctly reports which tests failed.

  | 62 measurements computed
  | 1 tests executed
  | 1 of 1 tests failed:
  |   table(ORDERS) expression(row_count == 0) cause row_count=1500000

To Reproduce
Steps to reproduce the behavior:

  1. Create a new named test in scan.yml
  2. Run soda scan ...

Context
scan.yml

table_name: ORDERS
metrics:
  - row_count
tests:
  dataset_size: row_count == 0

OS: Linux 5.10
Python Version: 3.8
Soda SQL Version: 2.0.0b6
Warehouse Type: Snowflake

Error in soda-sql:tests doc

Hi team,

There seems to be an error in the soda-sql:tests doc: https://docs.soda.io/soda-sql/documentation/tests.html. More specifically, the sample snippet for column tests seems to be wrong.

In scan.yml, the tests and validity_format entries don't need a "-".


Expected working snippet:

table_name: yourtable
metrics:
  - row_count
  - missing_percentage
  - invalid_percentage
columns:
  start_date:
    validity_format: date_eu
    tests:
      - invalid_percentage < 2.0

Thanks :)

Missing some default Metric types

Here are metrics I didn't find in the list of default data metrics being pushed from soda-sql which were pushed previously:

  • NonUniqueValuesCount
  • Kurtosis
  • Median
  • Percentile05
  • Percentile95
  • Skewness

Please note that the metric NonUniqueValuesCount is apparently not used by the frontend, so if we don't need it any more I can just remove it from the list. The others are still supported and expected.

Skip custom sql_metrics with predefined metric names

Since we are going to send measurements from soda-sql, we need to be careful when a metricName is identical to one of our own default metrics (for example, but not limited to, rowCount/row_count):

sql: |
    SELECT COUNT(*) as row_count
    FROM CUSTOMER_TRANSACTIONS
    WHERE country = 'US'
tests:
    total_volume_greater_than: row_count > 5000

I think Soda SQL should forbid these cases by throwing an exception at runtime. The backend can't actually know whether a row_count metric that comes in is an unintended override of the default metric.

Later on we can introduce namespaces, which would keep metrics separated and allow customers to use even conflicting SQL metric names.
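
A sketch of the proposed guard; DEFAULT_METRIC_NAMES and the hook point are illustrative assumptions, not actual soda-sql internals:

# Hypothetical validation: reject custom sql_metric field names that clash
# with default metric names (illustrative subset of names only).
DEFAULT_METRIC_NAMES = {'row_count', 'missing_count', 'missing_percentage'}

def assert_no_default_metric_clash(sql_metric_field_names):
    clashes = DEFAULT_METRIC_NAMES & {name.lower() for name in sql_metric_field_names}
    if clashes:
        raise ValueError(f'Custom sql_metric fields {sorted(clashes)} clash with '
                         f'default metric names; rename the SELECT aliases')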

creating venv using included script prints errors

Running scripts/recreate_venv.sh shows errors because pip-compile is not in the dev-requirements.in

soda-sql on  main [!] via 🐍 v3.8.5 took 1m4s
❯ scripts/recreate_venv.sh
Collecting pip
  Using cached pip-20.3.3-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.1.1
    Uninstalling pip-20.1.1:
      Successfully uninstalled pip-20.1.1
Successfully installed pip-20.3.3
ERROR: Invalid requirement: ''
scripts/recreate_venv.sh: line 14: pip-compile: command not found
scripts/recreate_venv.sh: line 15: pip-compile: command not found

This is caused by scripts/recreate_venv.sh line 13.

Column names are always converted to uppercase

Describe the bug
I've set up a project which includes a custom SQL metric, but I noticed that the columns/variables are always converted to, or expected to be, uppercase.

Take for example the following SQL Metric:

sql: |
  SELECT
    COUNT(CASE WHEN O_ORDERSTATUS = 'P' THEN 1 END) as order_p,
    COUNT(CASE WHEN O_ORDERSTATUS = 'O' THEN 1 END) as order_o
  FROM ORDERS
tests:
  open_orders: order_o == 0
  processing_orders: order_p > 0

When running soda scan it produces the following output:

(...)
  | Executing SQL query: 
SELECT
  COUNT(CASE WHEN O_ORDERSTATUS = 'P' THEN 1 END) as order_p,
  COUNT(CASE WHEN O_ORDERSTATUS = 'O' THEN 1 END) as order_o
FROM ORDERS

  | SQL took 0:00:00.083167
  | SQL metric 0 ORDER_P -> 38543
  | Query measurement: ORDER_P = 38543
  | SQL metric 0 ORDER_O -> 732044
  | Query measurement: ORDER_O = 732044
  | Test table(ORDERS) expression(row_count == 0) failed
(...)
  | stddev(O_ORDERKEY) = 1732051.3849205659
  | ORDER_P = 38543
  | ORDER_O = 732044
  | 28 measurements computed
  | 3 tests executed
  | 3 of 3 tests failed:
  |   table(ORDERS) sql_metric(0) expression(order_o == 0) cause order_o=None
  |   table(ORDERS) sql_metric(0) expression(order_p > 0) cause order_p=None
  |   table(ORDERS) expression(row_count == 0) cause row_count=1500000

As you can see order_p and order_o have been converted to uppercase metrics, which causes my tests to fail.

To Reproduce
Steps to reproduce the behavior:

  1. Create a new SQL metric and use lowercase metric/column names
  2. Write some tests which utilize those, lowercase, metrics
  3. Run soda scan
  4. Expect the scan to fail because the metric values are None.

Context

table_name: ORDERS
metrics:
  - row_count
  - sum
columns:
  O_ORDERSTATUS:
    valid_values:
      - O
      - F
sql_metrics:
  - sql: |
      SELECT
        COUNT(CASE WHEN O_ORDERSTATUS = 'P' THEN 1 END) as order_p,
        COUNT(CASE WHEN O_ORDERSTATUS = 'O' THEN 1 END) as order_o
      FROM ORDERS
    tests:
      open_orders: order_o == 0
      processing_orders: order_p > 0
tests:
  dataset_size: row_count == 0

OS: Linux
Python Version: 3.8
Soda SQL Version: 2.0.0b8
Warehouse Type: Snowflake
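
A plausible user-side workaround, based on Snowflake folding unquoted identifiers to uppercase while double-quoted aliases keep their case; this is an assumption about a workaround, not confirmed soda-sql behavior:

sql_metrics:
  - sql: |
      SELECT
        COUNT(CASE WHEN O_ORDERSTATUS = 'P' THEN 1 END) as "order_p",
        COUNT(CASE WHEN O_ORDERSTATUS = 'O' THEN 1 END) as "order_o"
      FROM ORDERS
    tests:
      open_orders: order_o == 0
      processing_orders: order_p > 0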

[bug] when analytical type is unknown, we should have all metrics set to null

Describe the bug

Although it is not expected, we are evaluating the metrics for an unknown column type

tableBuilder.createColumn("bytea_all_nulls_", "BYTEA")
      .values(
        "null",
        "null",
        "null",
        "null",
        "null",
        "null",
        "null",
        "null")
      .check(column -> {
        ColumnProfile profile = column.getProfile();
        assertThat(profile.getSourceType(), is("bytea"));

        // When analyticalType == unknown, then all profiling except the types above are null

        assertThat(profile.getMissingValuesCount(), nullValue());
        assertThat(profile.getMissingValuesPercentage(), nullValue());
        assertThat(profile.getValidValuesCount(), nullValue());
        assertThat(profile.getInvalidValuesCount(), nullValue());
        assertThat(profile.getInvalidValuesPercentage(), nullValue());
        assertThat(profile.getDistinctValuesCount(), nullValue());
        assertThat(profile.getUniqueValuesCount(), nullValue());

        assertThat(profile.getMin(), nullValue());
        assertThat(profile.getMins(), nullValue());
        assertThat(profile.getMax(), nullValue());
        assertThat(profile.getMaxs(), nullValue());

        assertThat(profile.getHistogram(), nullValue());
      });

Warehouse Type: PostgreSQL

Typo in soda scan help text

When soda scan --help is run

...  Non zero exist code means tests  ...

should be

...  Non zero exit code means tests  ...

[bug] date column has invalid value which is not detected

Describe the bug

If a field has an invalid value, but the other values allow analytical type deduction, we should detect obviously invalid values

tableBuilder.createColumn("char_eu_date_", "CHAR(10)")
      .values(
        "'16/04/2020'",
        "null",
        "'16/04/2020'",
        "'16/04/2020'",
        "'16/04/2020'",
        "'16/04/2020'",
        "'invalid value'",
        "'16/04/2020'")
      .check(column -> {
        ColumnProfile profile = column.getProfile();
        assertThat(profile.getSourceType(), is("character"));
        assertThat(profile.getInvalidValuesCount(), is(1l)); // currently we get 0
        assertThat(profile.getInvalidValuesPercentage(), is(12.5f)); // currently we get 0
        assertThat(profile.getDistinctValuesCount(), is(1l)); // currently we get 2
      });

Warehouse Type: PostgreSQL

Document scan.yml

Currently only the properties are listed in docs/scan.md
The goal is that this page becomes the reference doc for scan yaml files.
Add some more context describing the properties.

[bug] SnowflakeConnection has no attribute closed

Describe the bug
After running soda init on a Snowflake warehouse, the process ends with an exception:

Traceback (most recent call last):
[...]
  File "/home/dirk/work/soda/code/soda-sql/sodasql/cli/cli.py", line 201, in init
    if warehouse and warehouse.connection and not warehouse.connection.closed:
AttributeError: 'SnowflakeConnection' object has no attribute 'closed'

To Reproduce
Steps to reproduce the behavior:

  1. Setup a project which connects to Snowflake
  2. Run soda init

Expected behavior
The process should close without throwing an exception.

Context
OS: Arch Linux (5.10)
Python Version: 3.8
Soda SQL Version: 2.0.0b6

Airflow integration

Probably only documentation & templates.
As a data engineer, I want to know how to configure a Soda SQL scan as a step in my Airflow job.
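
A minimal sketch of the pattern such documentation could cover, assuming Airflow's PythonOperator together with the programmatic scan API; the DAG id and file paths are placeholders:

# Sketch: run a Soda SQL scan as an Airflow task, failing the task on test failures.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from sodasql.scan.scan_builder import ScanBuilder

def run_soda_scan(warehouse_yml, scan_yml):
    scan_builder = ScanBuilder()
    scan_builder.warehouse_yml_file = warehouse_yml
    scan_builder.scan_yml_file = scan_yml
    scan_result = scan_builder.build().execute()
    if scan_result.has_test_failures():
        raise ValueError('Soda SQL scan has test failures, failing the Airflow task')

dag = DAG('soda_sql_scan', start_date=datetime(2021, 1, 1), schedule_interval='@daily')

soda_scan_task = PythonOperator(
    task_id='soda_sql_scan',
    python_callable=run_soda_scan,
    op_kwargs={'warehouse_yml': 'warehouse.yml', 'scan_yml': 'tables/orders.yml'},
    dag=dag,
)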

Implement testing of all datasources on CLI level

Current state

We have various kinds and levels of testing, in some cases even using real warehouse types that we support.

On the "acceptance level" of the soda CLI command we are testing only Postgres database (unit test source file: https://github.com/sodadata/soda-sql/blob/main/tests/local/warehouse/cli/test_cli.py). This test connects to PostgreSQL database, inserts some data (the same SQL data that tutorial uses: https://github.com/sodadata/soda-sql/blob/main/tests/demo/demodata.sql).

Desired state

We should be able to generate test data using Dialect support for each warehouse we currently use. We should commit this test data for usage in tutorials and in tests, but we should be able to re-run that script later on if we want to add more data types, different data etc.

We should add tests like above-mentioned PostgreSQL test which would run against a particular warehouse, inserting the SQL data generated by the script and verifying metric calculation works as expected.

We should be able to add tutorial pages in our documentation that refer to the demo data we have created, making sure the steps are identical to those in our tests.

[bug] Version 2.0.0b7 is broken, sodasql.cli module can't be found

Describe the bug
Installed the latest version of soda-sql, but I'm no longer able to run any of the commands because of an import error:

(.venv) [dirk@meuk]$ soda --help
Traceback (most recent call last):
  File "/home/dirk/work/soda/code/meuk/.venv/bin/soda", line 5, in <module>
    from sodasql.cli.cli import main
ModuleNotFoundError: No module named 'sodasql.cli'

To Reproduce
Steps to reproduce the behavior:

  1. Install latest (2.0.0b7) version of soda-sql
  2. Run any Soda command: soda --help, soda create ./sql snowflake

Context
n/a

OS: Linux
Python Version: 3.8
Soda SQL Version: 2.0.0b7
Warehouse Type: na

Add support for `include` and `read` functions

As discussed earlier, last weekend's code restructure should allow for more freedom when defining a soda-sql project directory structure. To accomplish this, we no longer make 'directory' assumptions and instead allow developers to inline their sql_metrics in the scan-yml file.

In order to prevent scan-yml files from getting too large we should also support 'macro functions' which allow developers to include multiple files or read the contents of a file. As proposed we should consider adding support for either one, or both, of the following functions:

  • include(./path/*.yml)
    This include() function can be used in combination with sql_metrics: to include the contents of a list of files. The first argument can be a path containing wildcards, or a directory, in which case all yml files are read and parsed.

  • read(./path/query.sql)
    The read() function allows developers to read the contents of a given file and use it as the property's value. This can for example be used in combination with the sql_metrics[].sql property when referring to a metric.sql query file.

An example of a scan-YML file which uses read() could then look like:

// ./tables/orders.yml
table_name: CUSTOMERS
metrics:
  - row_count
sql_metrics:
  # Use inline custom metrics
  - sql: |
      SELECT
        COUNT(CASE WHEN O_COUNTRY = 'NL' THEN 1 END) as dutch_customers
      FROM CUSTOMERS
    tests:
      no_dutch_customers: dutch_customers == 0
  # Use inline custom metrics where the SQL is defined in a separate file
  - sql: read(./sql/very_complex_sql.sql)
    tests:
      order_size: order_size > 1000
tests:
  dataset_size: row_count == 0

An example using include() would look like:

// ./tables/orders.yml
table_name: ORDERS
metrics:
  - row_count
  - sum
columns:
  O_ORDERSTATUS:
    valid_values:
      - O
      - F
sql_metrics: include(./metrics/orders/*.yml)
tests:
  dataset_size: row_count == 0
// ./metrics/orders/my_metric.yml
sql: |
  SELECT
    COUNT(CASE WHEN O_ORDERSTATUS IS "P" THEN 1 END) as order_p,
    COUNT(CASE WHEN O_ORDERSTATUS IS "O" THEN 1 END) as order_o,
  FROM ORDERS
tests:
  open_orders: order_o == 0
  processing_orders: order_p > 0

Up for discussion:

  • Do we go with both read() and include() or do we want to merge them into one?

My two cents: merging them into one might create more automagical behavior and is therefore less preferable.
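
For illustration, a rough sketch of how the two macros could be resolved during scan-YML parsing; the regex, function name, and PyYAML usage are assumptions, not an actual implementation:

# Hypothetical macro resolution for read(...) and include(...) property values.
import glob
import re
from pathlib import Path

import yaml  # PyYAML, assumed available

MACRO_RE = re.compile(r'^(read|include)\((.+)\)$')

def resolve_macro(value: str, base_dir: Path):
    match = MACRO_RE.match(value.strip())
    if not match:
        return value  # plain value, no macro involved
    name, arg = match.groups()
    if name == 'read':
        # read(): inline the contents of a single file as the property value
        return (base_dir / arg).read_text()
    # include(): read and parse every matching YAML file into a list
    return [yaml.safe_load(Path(path).read_text())
            for path in sorted(glob.glob(str(base_dir / arg)))]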

[bug] no histogram for columns with analytical type "percent"

Describe the bug

We do not get any histogram calculated for the percentage column types

tableBuilder.createColumn("varchar_percent_", "VARCHAR(10)")
      .values(
        "'8%'",
        "null",
        "'3--%'",
        "'nopct'",
        "'6%'",
        "'6%'",
        "'6%'",
        "'77 %'")
      .check(column -> {
        ColumnProfile profile = column.getProfile();
        assertThat(profile.getSourceType(), is("character varying"));
        assertThat(profile.getHistogram(), notNullValue());
      });

Warehouse Type: PostgreSQL

[bug] custom SQL Metrics don't seem to be read correctly

Describe the bug
I was trying to create some custom SQL Metrics for demo purposes, but it seems like soda scan isn't able to resolve their path correctly.

The output I'm seeing:

(...)
  | Scanning ORDERS in . ...
  | sql_metric_path scan.yml does not exist
  | sql_metric_path orderstatus_count.yml does not exist
  | Executing SQL query: 

It seems like the directory path is missing when resolving the files, because they are reported as missing.

To Reproduce
Steps to reproduce the behavior:

  1. See my attached scan.yml and orderstatus_count.yml for an example of the setup
  2. Run soda scan . ORDERS from within the project directory

Context
scan.yml

table_name: ORDERS
metrics:
  - row_count
  - missing_count
  - missing_percentage
  - values_count
  - values_percentage
  - invalid_count
  - invalid_percentage
  - min
  - max
  - avg
  - sum
tests:
  row_count_min: row_count > 0

orderstatus_count.yml

sql: |
  SELECT
    COUNT(CASE WHEN O_ORDERSTATUS IS "P" THEN 1 END) as order_p,
    COUNT(CASE WHEN O_ORDERSTATUS IS "O" THEN 1 END) as order_o,
  FROM ORDERS
tests:
  open_orders: order_o == 0
  processing_orders: order_p > 0

OS: Linux 5.10
Python Version: 3.8
Soda SQL Version: 2.0.0b6
Warehouse Type: Snowflake

[bug] Athena: Property schema does not exist in warehouse.yml.connection

Describe the bug
When scanning an Athena table, I get the following error:

...
  | Property schema does not exist in warehouse.yml.connection
  | Scan failed: warehouse.yml configuration errors: 
  [error] Property schema does not exist in warehouse.yml.connection
Traceback (most recent call last):
  File "/Users/lgo/dev/soda-sql/sodasql/cli/cli.py", line 325, in scan
    scan = scan_builder.build()
  File "/Users/lgo/dev/soda-sql/sodasql/scan/scan_builder.py", line 76, in build
    parser.assert_no_warnings_or_errors()
  File "/Users/lgo/dev/soda-sql/sodasql/scan/parser.py", line 120, in assert_no_warnings_or_errors
    raise AssertionError(f'{self.description} configuration errors: \n  '
AssertionError: warehouse.yml configuration errors: 
  [error] Property schema does not exist in warehouse.yml.connection

To Reproduce
Steps to reproduce the behavior:

  1. Setup soda with an Athena database
  2. Run soda scan ... on one of the tables

OS: Mac OS X
Python Version: 3.8.6
Soda SQL Version: main branch at current HEAD
Warehouse Type: athena

No `verify` command available, but it shows up in documentation

In 'Check your CLI installation' on page https://docs.soda.io/soda-sql/#/5_min_tutorial

The commands listed there include verify, but the output of the soda command on 2.0.0b2 doesn't:

soda/samples/hello-soda via 🐍 v3.8.5 (venv385)
❯ soda
Usage: soda [OPTIONS] COMMAND [ARGS]...

  Soda CLI version 2.0.0b2

Options:
  --help  Show this message and exit.

Commands:
  create  Creates a new warehouse directory and prepares credentials in
          your...

  init    Finds tables in the warehouse and based on the contents, creates...
  scan    Computes all measurements and runs all tests on one table.
soda/samples/hello-soda via 🐍 v3.8.5 (venv385)
❯ soda verify
Usage: soda [OPTIONS] COMMAND [ARGS]...
Try 'soda --help' for help.

Error: No such command 'verify'.

[bug] Python 3.9 is not supported

Describe the bug
Failure to install the CLI

ERROR: Could not find a version that satisfies the requirement pyarrow<0.18.0,>=0.17.0
  ERROR: No matching distribution found for pyarrow<0.18.0,>=0.17.0
ERROR: Could not find a version that satisfies the requirement snowflake-connector-python==2.3.6
ERROR: No matching distribution found for snowflake-connector-python==2.3.6

To Reproduce
Steps to reproduce the behavior:

  1. Install this python docker https://hub.docker.com/_/python
  2. Run pip install soda-sql

OS: Docker
Python Version: 3.9.1
Soda SQL Version: latest

Include/exclude column measurements to Soda Cloud

Only relevant for soda-sql users with a Soda Cloud connection

Add an include/exclude columns feature which should make it possible to prevent certain columns from being pushed to Soda Cloud.

Make a note in the docs that we recommend not uploading personal (sensitive) data to the cloud, and include an explanation of which data may be uploaded to Soda Cloud, for example the values of min and max.

Snowflake dialect implementation issue

Hi guys,

I've tried using SodaSQL with Snowflake, but I've had an issue with soda init.

First, I've run soda create ., which created the default warehouse.yml.
I modified it to look like the following:

name: .
connection:
  type: snowflake
  username: MYUSERNAME
  password: env_var(SNOWFLAKE_PWD)
  account: MYACCOUNTNAME
  database: MYDB
  warehouse: MYWH
  schema: MYSCHEMA

In ~/.soda/env_vars.yml I have:

.:
  SNOWFLAKE_PWD: '<my secret password>'

When I run soda init . I get the following error:

/Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/pandas/compat/__init__.py:120: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
/Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/snowflake/connector/options.py:78: UserWarning: You have an incompatible version of 'pyarrow' installed (2.0.0), please install a version that adheres to: 'pyarrow<0.18.0,>=0.17.0; extra == "pandas"'
  warn_incompatible_dep('pyarrow', _installed_pyarrow_version.version, _expected_pyarrow_version)
  | Querying warehouse for tables
  | Exception: TODO override and implement this abstract method
Traceback (most recent call last):
  File "/Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/sodasql/cli/cli.py", line 196, in init
    warehouse.dialect.sql_tables_metadata_query())
  File "/Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/sodasql/scan/dialect.py", line 98, in sql_tables_metadata_query
    raise RuntimeError('TODO override and implement this abstract method')
RuntimeError: TODO override and implement this abstract method

Apart from the pyarrow and lzma warnings, it seems that the sql_tables_metadata_query method is not implemented for Snowflake. I have not looked into the SodaSQL code, though.

Add support for PostgreSQL column data type JSON

For this PostgreSQL table:

CREATE TABLE test_table (
    varchar_with_nulls_ VARCHAR(255),
    text_custom_missing_ TEXT,
    char_eu_date_ CHAR(10),
    varchar_percent_ VARCHAR(10),
    double_precision_ DOUBLE PRECISION,
    integer_ INTEGER,
    bytea_all_nulls_ BYTEA,
    timestamp_ TIMESTAMP,
    time_ TIME,
    date_ DATE,
    json_ JSON
);

profiling fails since data type JSON is not supported:

14:13:46.322 INFO  PostgresCo | STDERR: STATEMENT:  WITH group_by_value AS ( 
14:13:46.322 INFO  PostgresCo | STDERR: 	  SELECT 
14:13:46.322 INFO  PostgresCo | STDERR: 	    json_ AS value, 
14:13:46.322 INFO  PostgresCo | STDERR: 	    COUNT(*) AS frequency 
14:13:46.322 INFO  PostgresCo | STDERR: 	  FROM "public"."test_table" 
14:13:46.322 INFO  PostgresCo | STDERR: 	  WHERE NOT (json_ IS NULL) 
14:13:46.322 INFO  PostgresCo | STDERR: 	  GROUP BY json_ 
14:13:46.322 INFO  PostgresCo | STDERR: 	) 
14:13:46.322 INFO  PostgresCo | STDERR: 	SELECT COUNT(*), 
14:13:46.322 INFO  PostgresCo | STDERR: 	       COUNT(CASE WHEN frequency = 1 THEN 1 END), 
14:13:46.322 INFO  PostgresCo | STDERR: 	       SUM(frequency) 
14:13:46.322 INFO  PostgresCo | STDERR: 	FROM group_by_value
14:13:46.326 INFO  SodaScanne | STDOUT: Scan failed
14:13:46.326 INFO  SodaScanne | STDOUT: Traceback (most recent call last):
14:13:46.326 INFO  SodaScanne | STDOUT:   File "/usr/local/lib/python3.8/site-packages/sodasql/scan/scan.py", line 77, in execute
14:13:46.326 INFO  SodaScanne | STDOUT:     self._query_group_by_value()
14:13:46.326 INFO  SodaScanne | STDOUT:   File "/usr/local/lib/python3.8/site-packages/sodasql/scan/scan.py", line 275, in _query_group_by_value
14:13:46.327 INFO  SodaScanne | STDOUT:     query_result_tuple = self.warehouse.sql_fetchone(sql)
14:13:46.327 INFO  SodaScanne | STDOUT:   File "/usr/local/lib/python3.8/site-packages/sodasql/scan/warehouse.py", line 27, in sql_fetchone
14:13:46.327 INFO  SodaScanne | STDOUT:     return sql_fetchone(self.connection, sql)
14:13:46.327 INFO  SodaScanne | STDOUT:   File "/usr/local/lib/python3.8/site-packages/sodasql/scan/db.py", line 20, in sql_fetchone
14:13:46.327 INFO  SodaScanne | STDOUT:     return sql_fetchone_description(connection, sql)[0]
14:13:46.327 INFO  SodaScanne | STDOUT:   File "/usr/local/lib/python3.8/site-packages/sodasql/scan/db.py", line 33, in sql_fetchone_description
14:13:46.327 INFO  SodaScanne | STDOUT:     cursor.execute(sql)
14:13:46.327 INFO  SodaScanne | STDOUT: psycopg2.errors.UndefinedFunction: could not identify an equality operator for type json
14:13:46.328 INFO  SodaScanne | STDOUT: LINE 7:   GROUP BY json_ 
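
PostgreSQL's json type has no equality operator (jsonb does), which is why the GROUP BY fails. One plausible fix, an assumption rather than implemented behavior, is to cast to text before grouping:

SELECT CAST(json_ AS TEXT) AS value,
       COUNT(*) AS frequency
FROM "public"."test_table"
WHERE json_ IS NOT NULL
GROUP BY CAST(json_ AS TEXT)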

Update setup.py with git based build/release tagging

Currently, building the local version uses a hardcoded __dev__ as the version; it would be better to use the git SHA/ref as the version (see the sketch after the list below).

  • update setup.py to pickup the git ref and set SODA_SQL_VERSION
  • Update documentation on how to build/install locally
  • Make sure that CI/CD uses the same process
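
A minimal sketch of the first bullet, assuming a helper inside setup.py; the version format and fallback are illustrative only:

# Hypothetical setup.py helper: derive the version from the current git ref
# instead of the hardcoded __dev__, falling back when git is unavailable.
import subprocess

def soda_sql_version(default='dev'):
    try:
        sha = subprocess.check_output(
            ['git', 'rev-parse', '--short', 'HEAD'], text=True).strip()
        return f'0.0.0+git.{sha}'  # PEP 440 local version label
    except (OSError, subprocess.CalledProcessError):
        return default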

[bug] valid_values causes an exception when running 'scan'

Describe the bug
After adding a list of valid_values to my scan.yml, running soda scan started throwing exceptions.

FROM "ORDERS"
  | Scan failed
Traceback (most recent call last):
  File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/sodasql/scan/scan.py", line 78, in execute
    self._query_aggregations()
  File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/sodasql/scan/scan.py", line 211, in _query_aggregations
    query_result_tuple = self.warehouse.sql_fetchone(sql)
  File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/sodasql/scan/warehouse.py", line 26, in sql_fetchone
    return sql_fetchone(self.connection, sql)
  File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/sodasql/scan/db.py", line 20, in sql_fetchone
    return sql_fetchone_description(connection, sql)[0]
  File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/sodasql/scan/db.py", line 33, in sql_fetchone_description
    cursor.execute(sql)
  File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/snowflake/connector/cursor.py", line 603, in execute
    Error.errorhandler_wrapper(self.connection, self,
  File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/snowflake/connector/errors.py", line 125, in errorhandler_wrapper
    cursor.errorhandler(connection, cursor, error_class, error_value)
  File "/home/dirk/work/soda/code/meuk/sql_snowflake_demo/.venv/lib/python3.8/site-packages/snowflake/connector/errors.py", line 85, in default_errorhandler
    raise error_class(
snowflake.connector.errors.ProgrammingError: 001003 (42000): SQL compilation error:
syntax error line 20 at position 53 unexpected ')'.
  | Measurement(metric='schema', column_name=None, value=[{'name': 'O_COMMENT', 'type': 'TEXT'}, {'name': 'O_ORDERPRIORITY', 'type': 'TEXT'}, {'name': 'O_TOTALPRICE', 'type': 'NUMBER'}, {'name': 'O_CUSTKEY', 'type': 'NUMBER'}, {'name': 'O_ORDERSTATUS', 'type': 'TEXT'}, {'name': 'O_SHIPPRIORITY', 'type': 'NUMBER'}, {'name': 'O_CLERK', 'type': 'TEXT'}, {'name': 'O_ORDERKEY', 'type': 'NUMBER'}, {'name': 'O_ORDERDATE', 'type': 'DATE'}], group_values=None)
  | 1 measurements computed
  | 0 tests executed
  | All is good. No tests failed.

To Reproduce
Steps to reproduce the behavior:

  1. Alter a column in your scan.yml to contain a list of valid_values
  2. Run soda scan ...

Expected behavior
I was expecting my valid_values to be taken into account when computing the invalid_values metric, but an exception was thrown during the scan.

Context
scan.yml

table_name: ORDERS
metrics:
  - row_count
  - missing_count
  - missing_percentage
  - values_count
  - values_percentage
  - invalid_count
  - invalid_percentage
  - min
  - max
  - avg
  - sum
columns:
  O_ORDERSTATUS:
    valid_values:
      - O
      - F
      - P
tests:
  row_count_min: row_count > 0
  no_invalid_values: invalid_count == 0

OS: Linux (5.10)
Python Version: 3.8
Soda SQL Version: 2.0.0b6
Warehouse Type: Snowflake

[bug] AWS Athena - error "TODO override and implement this abstract method"

Describe the bug
I am trying to get soda-sql to connect to some AWS Athena tables and stumbled into an issue:

/Users/lgo/dev/soda-sql/venv/bin/python /Users/lgo/dev/soda-sql/venv/bin/soda init
  | 2.0.0b10
  | Initializing warehouse.yml ...
  | Querying warehouse for tables
  | Directory tables already exists
  | Exception: TODO override and implement this abstract method
Traceback (most recent call last):
  File "/Users/lgo/dev/soda-sql/venv/lib/python3.8/site-packages/sodasql/cli/cli.py", line 171, in init
    scan_initializer.initialize_scan_ymls()
  File "/Users/lgo/dev/soda-sql/venv/lib/python3.8/site-packages/sodasql/cli/scan_initializer.py", line 45, in initialize_scan_ymls
    self.warehouse.dialect.sql_tables_metadata_query())
  File "/Users/lgo/dev/soda-sql/venv/lib/python3.8/site-packages/sodasql/scan/dialect.py", line 109, in sql_tables_metadata_query
    raise RuntimeError('TODO override and implement this abstract method')
RuntimeError: TODO override and implement this abstract method

Process finished with exit code 0

Not sure why I end up here; AWS Athena support is supposed to work in this beta, right? Besides this stack trace, it would be good if the CLI returned a non-zero exit code in case of a fatal error like this.

Context
My warehouse.yml:

name: athena
connection:
    type: athena
    database: xxxxx_reporting_db
    access_key_id: env_var(AWS_ACCESS_KEY_ID)
    secret_access_key: env_var(AWS_SECRET_ACCESS_KEY)
    role_arn:
    region: us-east-1
    staging_dir: s3://xxxxx-reporting-db/db/

My env_vars.yml:

athena:
  AWS_ACCESS_KEY_ID: 'xxxxxxxxxxxxxx'
  AWS_SECRET_ACCESS_KEY: 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

OS: Mac OS X
Python Version: Python 3.8.6
Soda SQL Version: 2.0.0b10
Warehouse Type: athena

Update documentation to include the list of possible validity formats which can be used in soda-sql tests

Update the documentation to include the list of possible validity formats, and their definitions, which can be used in soda-sql tests.

Current list copied from source code (https://github.com/sodadata/soda-sql/blob/main/sodasql/scan/validity.py):

'number_whole'
'number_decimal_point'
'number_decimal_comma'
'number_percentage'
'date_eu'
'date_us'
'date_inverse'
'time'
'uuid'
'email'
'phone'
'credit_card_number'
'ip_address'

Thanks

Snowflake SUM aggregate causes value overflow

Hello there,

Describe the bug
soda scan SUM aggregate overflows

snowflake.connector.errors.ProgrammingError: 100058 (22000): Value overflow in a SUM aggregate

To Reproduce
Steps to reproduce the behavior:
soda scan on a big table

OS: Unix
Python Version: 3.8.7
Soda SQL Version: 2.0.0b10
Warehouse Type: Snowflake
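
A possible mitigation while this is open, an assumption rather than a confirmed fix: compute the sum over a wider type via a custom SQL metric instead of the default sum metric (the column and table names are placeholders):

sql_metrics:
  - sql: |
      SELECT SUM(CAST(amount AS DOUBLE)) as amount_sum
      FROM big_table
    tests:
      amount_sum_positive: amount_sum > 0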

[bug] min, mins, max and maxs metrics do not arrive for varchar column type

Describe the bug

Some metrics which are expected and supported do not arrive for the varchar column type.

    tableBuilder.createColumn("varchar_with_nulls_", "VARCHAR(255)")
      .values(
        "null",
        "'a'")
      .check(column -> {
        ColumnProfile profile = column.getProfile();
        assertThat(profile.getMin(), is("a"));
        assertThat(profile.getMins(), is(Lists.of("a")));
        assertThat(profile.getMax(), is("a"));
        assertThat(profile.getMaxs(), is(Lists.of("a")));
      });

[bug] soda-scanner doesn't mark empty strings as missing

Describe the bug

It is expected that Soda SQL detects null and '' (empty string) as missing values out of the box, in addition to the explicitly declared missingValues. For data set up like this:

    tableBuilder
      .createColumn(COL_MISSING2, "VARCHAR(255)")
      .scanConfiguration("missingValues", Lists.of("missing"))
      .values(
        "''",
        "'missing'"
      );

the missing count should be 2 (1 for '', and 1 for 'missing'), but we get 1, since only 'missing' is detected.

Generated SQL:

COUNT(CASE WHEN NOT (missing2 IS NULL OR missing2 IN ('missing')) THEN 1 END),
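
For comparison, a sketch of the condition one would expect for the missing count, treating the empty string as missing alongside NULL and the declared missingValues (assumed, not actual generated SQL):

COUNT(CASE WHEN missing2 IS NULL OR missing2 = '' OR missing2 IN ('missing') THEN 1 END),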

Warehouse Type: PostgreSQL

[bug] number column (in varchar column type) with invalid value is not detected

Describe the bug

If a field has an invalid value, but the other values allow analytical type deduction, we should detect obviously invalid values

tableBuilder.createColumn("varchar_percent_", "VARCHAR(10)")
      .values(
        "'8%'",
        "null",
        "'3--%'",
        "'nopct'",
        "'6%'",
        "'6%'",
        "'6%'",
        "'77 %'")
      .check(column -> {
        ColumnProfile profile = column.getProfile();
        assertThat(profile.getSourceType(), is("character varying"));
        assertThat(profile.getValidValuesCount(), is(5l)); // we get 7
        assertThat(profile.getInvalidValuesCount(), is(2l)); // we get 0
        assertThat(profile.getInvalidValuesPercentage(), is(25f)); // we get 0
        assertThat(profile.getDistinctValuesCount(), is(3l)); // we get 5
        assertThat(profile.getUniqueValuesCount(), is(2l)); // we get 4
      });

Warehouse Type: PostgreSQL

List potential data issues for demo data

Currently tests/demo/print_demodata_sql_script.py produces SQL statements to (re)create a table called demodata.
The goal is eventually to be able to load this demo data into new accounts.

In order to do this, we need to improve on the demo data.

We should end up with 2 or 3 datasets. One incremental, one snapshot and maybe another one.
Each of the tables should have realistic names, e.g. customer_transactions or similar.

The demo data scans should be generated over a 7 day period.

The following data issues should be included:

  • On day 4, a column should be dropped. And on day 5 it should reappear.
  • On day 2, a column should have only null values, whereas on the other days it should have almost all non-null values
  • On day 3, a table should have zero rows
  • ...

Step 1 is to collect all potential data issues and turn them into 2 or 3 realistic datasets, in which it is really easy to show how Soda captures each issue. Let's list all the potential data issues we want to include in the demo data below.

Split soda-sql in separate modules

Folder structure to be discussed

Libraries use dashes in the names, no underscores. Folder names as indicated below all use underscores.

+ src
    + sodasql (why was this again?)
        + setup.py
    + sodasql_cloud
        + setup.py
        + sodasql_cloud
        + tests
    + sodasql_scan
        + setup.py
        + sodasql_scan
        + tests
    + sodasql_cli
        + setup.py
        + sodasql_cli
        + tests
    + sodasql_airflow
    + connectors
        + sodasql_postgres
            + setup.py
            + sodasql_postgres
            + tests
        + sodasql_snowflake
        + sodasql_athena
        + sodasql_redshift
        + sodasql_bigquery
        + sodasql_spark

Also, before starting I would like to understand how we'll deal with versioning the individual libs and how this impacts the release process.

TODO: investigate if dialects can be extracted. Passing data from the core lib to dialects is no problem, but the dialects should not invoke methods on core stuff; that would cause circular dependencies. To be investigated before starting this.

Create a slack community

We currently think that GitHub Discussions serves best as the communication medium for our community.

But Slack is also on our radar. If you think we should add a Slack community given that we already have GitHub Discussions, vote this issue up by adding a +1 reaction to this post.

BigQuery documentation is misleading

Describe the bug
Our documentation currently says:

name: my_bigquery_project
connection:
    type: bigquery
    account_info: <PATH TO YOUR BIGQUERY ACCOUNT INFO JSON FILE>
    dataset: sodasql
...

but our code expects

  1. the property to be named account_info_json, and
  2. that property to contain the JSON object itself, not its file location

OS:
Python Version: *
Soda SQL Version: main
Warehouse Type: BigQuery

soda-sql doesn't show the test names of the tests executed during a scan.

Hi team,

Soda SQL doesn't show the names of the tests executed during a scan. It does so for measurements, though. Please see the output below:

(.venv) mathissedestrooper@Mathisses-MacBook-Pro soda-sql-project % soda scan ./snowflake-soda-warehouse ORDERS
| 2.0.0b5
| Scanning ORDERS in ./snowflake-soda-warehouse ...
| sql_metric_path scan.yml does not exist
| Executing SQL query:
(...)
| missing_count(O_ORDERDATE) = 0
| values_percentage(O_ORDERDATE) = 100.0
| 62 measurements computed
| 1 tests executed
| All is good. No tests failed.

It would be handy to show the test names and results instead of just showing x tests executed and x failed. This would allow the user to immediately see what's wrong and act upon it accordingly.

Thanks!
