
validly's Introduction

WHAT ?

Validly is a FastAPI application that helps validate specific domains of columns in CSV datasets with the help of Great Expectations.


WHY ?

Let's consider an arbitrary dataset that gives you details about agriculture production every year:

year     category  value  unit          note
2009     Rice      150    Metric Tonne
2009-10  Wheat     175                  value in Metric Tonne
2009     Pulses    "110"                value in Metric Tonne

Suppose there is no data-quality check after data cleaning; then there would be certain issues that an analyst might face, such as:

  • There is a mixed representation of the year: 2009 and 2009-10
  • Numeric values are represented as strings, such as "110" for Pulses
  • There is no consistent manner of representing units, and the list goes on

So, ideally, the dataset should look like:

year  category  value  unit  note
2009  Rice      150          value in Metric Tonne
2009  Wheat     175          value in Metric Tonne
2009  Pulses    110          value in Metric Tonne

If there were a tool where a user could upload a dataset/CSV file and figure out its potential problems, then data cleaning could be revisited and a much cleaner dataset could be acquired.
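As a sketch of the kind of checks involved, the rules below use Great Expectations' legacy pandas-backed API (ge.from_pandas); the column names mirror the example table above, and the exact expectations validly applies may differ:

import great_expectations as ge
import pandas as pd

# Load the CSV into a pandas-backed Great Expectations dataset
# (legacy API; newer Great Expectations releases use a different interface)
dataset = ge.from_pandas(pd.read_csv("agriculture.csv"))

# Year should be a plain four-digit year, not a range such as "2009-10"
dataset.expect_column_values_to_match_regex("year", r"^\d{4}$")

# Values should be numeric, not quoted strings such as "110"
dataset.expect_column_values_to_be_of_type("value", "int64")

# Notes/units should come from a fixed set of representations
dataset.expect_column_values_to_be_in_set("note", ["value in Metric Tonne"])

# Run every expectation at once and inspect the overall result
results = dataset.validate()
print(results.success)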


HOW ?

To run the application:

cd validly
docker compose up

Server will be up and running at: http://localhost:8000
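Once the server is running, a CSV can be posted to the dataset-expectation endpoint that appears later in the error log (POST /expectation/datasets/). A rough sketch using Python's requests library; the multipart field name ("file") is an assumption, so verify the exact request shape against the interactive docs (usually at http://localhost:8000/docs):

import requests

# Hypothetical request shape -- verify the field name against the API docs
with open("output.csv", "rb") as f:
    response = requests.post(
        "http://localhost:8000/expectation/datasets/",
        params={"format": "json"},
        files={"file": ("output.csv", f, "text/csv")},
    )

print(response.status_code)
print(response.json())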

To stop the application:

  • Command to stop Validly:
docker compose stop
  • Command to stop the application and remove all the containers and networks that were created:
docker compose down

Metadata Validations

WHAT ?

Validly is a FastAPI application that helps validate specific domains of columns in CSV metadata sheets with the help of Great Expectations.


WHY ?

Let's consider an arbitrary metadata sheet that gives you details about AISHE (All India Survey on Higher Education):

sector      organization  short_form  ....  time_saved_in_hours  price
Educations  AISHE                     ....  4                    1996

Suppose there is no data-quality check after data cleaning; then there would be certain issues that an analyst might face, such as:

  • Columns like sector, organization, etc. should only take a few expected values
  • Time Saved in Hours should be in the range of 2-6 hours
  • There should be a few columns that do not accept null values, and the list goes on

So, ideally, the dataset should look like:

sector     organization                          short_form  ....  time_saved_in_hours  price
Education  All India Survey on Higher Education  AISHE       ....  4                    1996

If there were a tool where a user could upload a metadata/CSV file and figure out its potential problems, then the metadata sheet could be revisited and updated properly.


validly's People

Contributors

100mi, paul-tharun, shreeharsha-factly, deshetti, hemanthm005, venu-sambarapu-ds

Watchers

Rakesh Dubbudu

validly's Issues

Add new validations to check bank names for proper standardisation

Description

Similar to airline names, we must add bank names from the data dictionary as part of validly-server.

Task

  • The column mapping should include a new column type which will incorporate bank-name columns
  • Then a validation rule will be added, similar to how the state validation is present (a rough sketch follows below)
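A rough sketch of what such a rule could look like with Great Expectations' legacy pandas API; the dictionary file name and column names here are placeholders rather than the ones actually used in the repo:

import great_expectations as ge
import pandas as pd

# Placeholder paths/columns -- the real data dictionary lives elsewhere in the repo
bank_names = pd.read_csv("data_dictionary/bank_names.csv")["bank_name"].tolist()

dataset = ge.from_pandas(pd.read_csv("dataset.csv"))

# Flag any bank name that is not in the standardised dictionary
dataset.expect_column_values_to_be_in_set("bank_name", bank_names)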

Update changes in the validly dictionary dynamically

Is there a way we could update CSV and dictionary files dynamically rather than keeping them as static files?

For every change made in the data dictionary, we have to make the same change in the repo and deploy it again with a new tag.

Task: Create a new series of endpoints (sketched below) which would:

  • Take sheets from a Google Sheet and create a CSV file accordingly under a given name.
  • Be able to delete a CSV file which contains data (in case we would like to delete something).
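A rough sketch of what such endpoints might look like; the route paths, the Google Sheets CSV-export URL pattern, and the storage directory are assumptions for illustration only (the export URL also requires the worksheet to be readable by link):

import os
import pandas as pd
from fastapi import APIRouter, HTTPException

router = APIRouter()
DICTIONARY_DIR = "data_dictionaries"  # assumed storage location

@router.post("/dictionaries/{name}")
async def create_dictionary(name: str, sheet_id: str, worksheet: str):
    # Google Sheets can export a single worksheet as CSV via this URL pattern
    url = (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}/"
        f"gviz/tq?tqx=out:csv&sheet={worksheet}"
    )
    df = pd.read_csv(url)
    os.makedirs(DICTIONARY_DIR, exist_ok=True)
    df.to_csv(os.path.join(DICTIONARY_DIR, f"{name}.csv"), index=False)
    return {"created": f"{name}.csv", "rows": len(df)}

@router.delete("/dictionaries/{name}")
async def delete_dictionary(name: str):
    path = os.path.join(DICTIONARY_DIR, f"{name}.csv")
    if not os.path.exists(path):
        raise HTTPException(status_code=404, detail="dictionary not found")
    os.remove(path)
    return {"deleted": f"{name}.csv"}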

Issues with Regex patterns

  1. The Candidate column is treated as 'Date', so validations failed. The regex pattern used to identify 'Date' columns should be modified.
  2. We don't have any regex pattern to identify 'Currency' columns (see the sketch below).
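For reference, a rough sketch of the kind of pattern changes being asked for; these regexes are illustrative assumptions, not the ones currently in validly's column mapping:

import re

# Tighten date detection so a column such as "candidate" is not matched
# merely because its name happens to contain the substring "date".
DATE_COLUMN_PATTERN = re.compile(r"^(date|year|month|.*_date|date_.*)$", re.IGNORECASE)

# Add a pattern for currency-style columns, which is currently missing.
CURRENCY_COLUMN_PATTERN = re.compile(
    r"^(price|amount|cost|.*_price|.*_in_(rs|inr|usd))$", re.IGNORECASE
)

print(bool(DATE_COLUMN_PATTERN.match("candidate")))   # False -- no longer misfires
print(bool(CURRENCY_COLUMN_PATTERN.match("price")))   # True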

Validly does not support decoding CSV files other than UTF-8

Validations and metadata are not generated for processed CSV files that are not in UTF-8 encoding.

Below is the error log when uploading a non-UTF-8 file to the validly server:

INFO:fastapi:Created Minio Client
INFO:fastapi:Folder name :bbfa2a4f-353f-4698-b455-e70fff120822
INFO:fastapi:File name : output.csv
INFO:fastapi:Uploaded output.csv to minio
INFO:fastapi:Error reading Dataset from : bbfa2a4f-353f-4698-b455-e70fff120822/output.csv: 'utf-8' codec can't decode byte 0xa0 in position 794: invalid start byte
INFO:     10.40.3.10:50116 - "POST /expectation/datasets/?format=json HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/uvicorn/protocols/http/httptools_impl.py", line 376, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/usr/local/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/fastapi/applications.py", line 208, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/cors.py", line 92, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "/usr/local/lib/python3.9/site-packages/starlette/middleware/cors.py", line 147, in simple_response
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/usr/local/lib/python3.9/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 259, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 61, in app
    response = await func(request)
  File "/usr/local/lib/python3.9/site-packages/fastapi/routing.py", line 226, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.9/site-packages/fastapi/routing.py", line 159, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/./app/api/api_v1/routers/dataset.py", line 96, in execute_dataset_expectation_post
    expectations = await datasets_expectation(s3_files_key, result_type)
  File "/app/./app/utils/dataset.py", line 60, in datasets_expectation
    expectations = await asyncio.gather(
  File "/app/./app/utils/dataset.py", line 24, in dataset_expectation
    dataset = await read_dataset(dataset_path, s3_client, bucket_name)
  File "/app/./app/utils/common.py", line 39, in read_dataset
    return dataset
UnboundLocalError: local variable 'dataset' referenced before assignment
ERROR:uvicorn.error:Exception in ASGI application
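The traceback indicates that read_dataset assigns the dataset inside a try block, so a UnicodeDecodeError leaves the variable unbound and the function fails with UnboundLocalError instead of a clear message. A rough sketch of one possible fix (not the actual implementation, which first fetches the object from Minio/S3): try UTF-8 and then fall back to a few common encodings before giving up.

import io
import pandas as pd

def read_dataset_bytes(raw_bytes: bytes) -> pd.DataFrame:
    # Either return a DataFrame or raise a clear error, so the variable
    # is never referenced before assignment.
    for encoding in ("utf-8", "utf-8-sig", "cp1252", "latin-1"):
        try:
            return pd.read_csv(io.BytesIO(raw_bytes), encoding=encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("Unable to decode the dataset with the supported encodings")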

Remove Existing Jinja Templates

Description

In the first version of validly, the endpoints were coupled with Jinja templates; thus, there are various instances of HTML templates, CSS, and other unnecessary files that clutter the repository.

Tasks

Make a list of all such files (HTML/CSS) that are not used in newer versions of validly and remove them.

Issues from Google Sheets

validly(Dataset check)

  • Datetime column lengths (4-10)
  • Numbers in text fields
  • Minimum number of rows or size
  • Latest data on top
  • Update data dictionaries like countries and districts
  • Detect special characters
  • Flag empty Date-Time cells
  • Column names in lower case
  • Display local path for directory
  • Numeric value format (round off to 2 decimal digits)
  • Identify negative values (flag)

metafacts(Generating Metadata from Dataset)

  • Adding temporal coverage for datasets with dates
  • Duplicate units
  • Granularity (airline, language, districts, crops, gender)
  • Separate granularity for fiscal years
  • Remove formats_available, is_public
  • Display local path for directory

validly metafacts(Metadata check)

  • Length of title (factly_dataset_name) < 200
  • Spell check
  • Check whether file_path exists in S3
  • Column order
  • Length of description to be > 50

Validly: Modify the way of reading the CSV files from core

Introduction

Instead of reading the standard names from a local file, reading them from a Google Sheet will make it easier to pick up updates from the NR team, so we are changing the way the standard names are read.

  • Make the necessary changes to read standard names from the Google Sheet (a minimal sketch follows).
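A minimal sketch, assuming the worksheet is readable by link and using the CSV-export URL; the sheet ID and worksheet name are placeholders:

import pandas as pd

SHEET_ID = "<google-sheet-id>"   # placeholder
WORKSHEET = "standard_names"     # placeholder worksheet name

# Read the worksheet directly as CSV instead of a file checked into the repo
url = (
    f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/"
    f"gviz/tq?tqx=out:csv&sheet={WORKSHEET}"
)
standard_names = pd.read_csv(url)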

Validly to support metadata validation

Description

While creating metadata, there are certain errors that could be checked automatically. Thus, we must create validation rules for metadata in a similar fashion to how they are present for dataset validation.

Consider the below-mentioned rules to start with (a validation sketch follows the task list):

  • Set of values for: Sectors, Organizations, Short-form, Frequency of update
  • Proper pattern for: File path, Data next update date
  • Range of values: Time Saved in hours

Task

  • An endpoint route that takes an uploaded file object and runs validations on it.
  • An endpoint route that accepts a Google Sheet ID and the worksheet name where the metadata is stored.
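A rough sketch of how these starter rules could be expressed with Great Expectations' legacy pandas API; the allowed value sets and the file-path/date patterns below are illustrative assumptions, not the project's actual dictionaries:

import great_expectations as ge
import pandas as pd

metadata = ge.from_pandas(pd.read_csv("metadata.csv"))

# Set of values (illustrative sets, not the real dictionaries)
metadata.expect_column_values_to_be_in_set("sector", ["Education", "Agriculture", "Health"])
metadata.expect_column_values_to_be_in_set("frequency_of_update", ["Monthly", "Quarterly", "Yearly"])

# Proper pattern (assumed s3-style path and ISO date)
metadata.expect_column_values_to_match_regex("file_path", r"^s3://[\w\-./]+\.csv$")
metadata.expect_column_values_to_match_regex("data_next_update_date", r"^\d{4}-\d{2}-\d{2}$")

# Range of values
metadata.expect_column_values_to_be_between("time_saved_in_hours", min_value=2, max_value=6)

# Mandatory columns must not be null
metadata.expect_column_values_to_not_be_null("organization")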

Flag missing values in numerical/quantity columns

Description

Cross-check whether missing values are present due to a cleaning issue or come from the source itself.
We must flag any row with a missing value so that the user can check whether it is because the raw file is empty or because of an error made while cleaning the dataset (see the sketch below).
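A minimal pandas sketch of the flagging idea; selecting all numeric columns is an assumption, and the actual rule in validly may scope this differently:

import pandas as pd

df = pd.read_csv("dataset.csv")

# Rows where any numeric/quantity column is missing a value
numeric_cols = df.select_dtypes(include="number").columns
flagged = df[df[numeric_cols].isna().any(axis=1)]

# The analyst can then check whether these rows are empty in the raw
# source file or were dropped during cleaning.
print(flagged)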

How to set up the validly dev environment

Documentation is needed on the following topics:

  • Setting up the validly project locally for development
  • How to add a new expectation for a column type
  • How to create a custom expectation and add it to the expectation suite

Validate if time series dataset is complete

Description

There should be a validation within time-series datasets that checks whether a portion of the time series has been left out.
For example: in a monthly granular file covering 2010-2022, if 2014 (or any month within it) is missing, the check should throw an error.
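A minimal sketch of such a completeness check using pandas, assuming a monthly 'date' column; the column name and granularity are assumptions:

import pandas as pd

df = pd.read_csv("timeseries.csv", parse_dates=["date"])

# Build the full expected monthly range from the data's own start and end
observed = pd.PeriodIndex(df["date"].dt.to_period("M").unique())
expected = pd.period_range(df["date"].min(), df["date"].max(), freq="M")

missing = expected.difference(observed)
if len(missing) > 0:
    raise ValueError(f"Time series is incomplete, missing periods: {list(missing)}")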

Multiple sector values causing issue in metadata

Description
Validly raises an error if multiple sector values are present in a single cell.

Task
Identify all the sector names in a single cell and check whether each sector name is proper or not.
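A minimal sketch of the splitting logic, assuming sectors are comma-separated within a cell and checked against an allowed set (the set here is illustrative):

ALLOWED_SECTORS = {"Education", "Agriculture", "Health"}  # illustrative set

def invalid_sectors(cell: str) -> list[str]:
    # Split a multi-valued cell such as "Education, Health" and report
    # any sector name that is not in the standard dictionary.
    sectors = [s.strip() for s in str(cell).split(",") if s.strip()]
    return [s for s in sectors if s not in ALLOWED_SECTORS]

print(invalid_sectors("Education, Helth"))  # ['Helth']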
