meta-facts's Introduction

meta-facts

Automatic generation of Meta-Data for a dataset


Table of Contents
  1. Motivation
  2. How to run the application
  3. Project Structure
  4. Methodology
    1. Where this Library fits in the overall architecture
    2. Approach to determine Meta-Data
      1. Column Names
      2. File Path
      3. Units
      4. Temporal Coverage
      5. Granularity
      6. Spatial Coverage
      7. File Formats Available
      8. Is Public Dataset

Motivation


How to run the application

Running on localhost

poetry run uvicorn app.main:app --reload --port 8005

Deploy app

docker compose up --build

Access Swagger Documentation

http://localhost:8005/api/docs


Project structure

Files related to application are in the app or tests directories. Application parts are:

app
├── api              - web related stuff.
│   └── routes       - web routes.
├── core             - application configuration, startup events, logging.
├── models           - pydantic models for this application.
├── services         - logic that is not just crud related.
└── main.py          - FastAPI application creation and configuration.

Methodology


Approach to determine Meta-Data


Column Names

  • How are columns categorised?
    • The library categorises columns into the following categories:

      Column Entity | Columns
      Date-Time     | non_calendar_year, calender_year, other_year, quarter, month, date
      Geography     | country, state, district
      Unit          | unit
      Note          | note
      Unmapped      | Any unmapped columns
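
The categorisation above can be sketched as a lookup on normalised column names. This is only an illustration: `COLUMN_KEYWORDS` and `categorise_columns` are hypothetical names, and exact-name matching is an assumption about how the library compares column names.

```python
from collections import defaultdict

# Hypothetical keyword map built from the categories listed above.
COLUMN_KEYWORDS = {
    "date_time": {"non_calendar_year", "calender_year", "other_year",
                  "quarter", "month", "date"},
    "geography": {"country", "state", "district"},
    "unit": {"unit"},
    "note": {"note"},
}


def categorise_columns(columns):
    """Group column names by category; unmatched names go to 'unmapped'."""
    grouped = defaultdict(list)
    for column in columns:
        # Normalise so that "State" and "state" are treated alike.
        key = column.strip().lower().replace(" ", "_")
        category = next(
            (name for name, keywords in COLUMN_KEYWORDS.items() if key in keywords),
            "unmapped",
        )
        grouped[category].append(column)
    return dict(grouped)
```

Because the match is on names and not on values, renaming a column changes the category it lands in.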

Units :

  • General Workflow

    graph LR;
      A[Dataset]-->B{Unit Column Exists ?};
      
      B -- NO --> C(RETURN Null String);
      B -- Yes --> D[Get all  unique units from UNIT Column];
    
      D --> E[Prepare List of all separate units];
      E --> F(RETURN all units as STRING SEPARATED WITH COMMAS)
    
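
A minimal sketch of the Units workflow, using plain dictionaries as rows so it runs without extra dependencies. The function name `get_units`, the column name `unit`, and the idea that a single cell may hold several comma-separated units are assumptions, not confirmed details of the library.

```python
def get_units(rows, unit_column="unit"):
    """Return the unique units as a comma-separated string.

    Returns an empty string when the dataset has no unit column.
    """
    if not rows or unit_column not in rows[0]:
        return ""
    units = []
    for row in rows:
        cell = row.get(unit_column)
        if not cell:
            continue  # skip empty or missing cells
        # Assumption: one cell may list several units separated by commas.
        for unit in str(cell).split(","):
            unit = unit.strip()
            if unit and unit not in units:
                units.append(unit)
    return ", ".join(units)
```

For example, rows holding `"kg, tonnes"` and `"kg"` would yield `"kg, tonnes"`.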

Temporal Coverage :

  • General Workflow

    flowchart LR
    
    A(Dataset) -->  B{Year column exists ?}
    B -- NO --> C(RETURN Null String) 
    B -- Yes --> D[Calendar / Non-Calendar Year Columns]
    D --> E{Years are in Sequence ?}
    E -- YES --> F(RETURN string representation of range \n example : 2012 to 2020 or \n 2012-13 to 2020-21)
    E -- NO --> G(RETURN  comma separated values for all years, \n example : 2012,2015,2018 or \n 2012-13, 2015-16, 2018-19)
    

    Notes:

    • Determination of Temporal coverage is based on the presence of year column.
    • If both a calendar year and a non-calendar year column are present in the dataset, priority is given to the calendar year.
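
The workflow and notes above could be sketched as follows. The function name and signature are hypothetical, and treating the leading four digits of a label such as `2012-13` as its start year is an assumption.

```python
import re


def temporal_coverage(calendar_years=(), non_calendar_years=()):
    """Return a range string for consecutive years, else a comma-separated list."""
    # Calendar years take priority when both kinds are present.
    labels = [str(y).strip() for y in (calendar_years or non_calendar_years)]
    # Key every label by its leading four-digit start year (also removes duplicates).
    by_start = {}
    for label in labels:
        match = re.match(r"\d{4}", label)
        if match:
            by_start[int(match.group())] = label
    if not by_start:
        return ""  # no usable year values
    starts = sorted(by_start)
    ordered = [by_start[s] for s in starts]
    # Unique sorted years are consecutive when the span equals the count minus one.
    in_sequence = len(starts) > 1 and starts[-1] - starts[0] == len(starts) - 1
    if in_sequence:
        return f"{ordered[0]} to {ordered[-1]}"
    return ", ".join(ordered)
```

Returning early when no year values are found means the function never calls `min()` or `max()` on an empty sequence.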

Granularity :

  • General Workflow

      flowchart LR
      A(Dataset) --> B{If any of Date-time or \nGeography columns exists ?}
      B -- No --> C(RETURN Null String)
      B -- YES -->  D[Map all Columns levels in \nSorted Order for respective Domains]
      D --> E[Map the columns groups according to \nproper naming convention Granularity]
      E --> F(RETURN Comma Separated Values of all Granularities \n example : Quarterly, District)
    

    Notes:

    • Granularity is calculated for 2 domains.
      • Geography
      • Date-Time
    • config.py defines the granularity ranks for each domain.
    • config.py also lists the keywords used to detect granularity in datasets.
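
As a rough sketch of the ranking idea, assuming the finest level present in each domain is the one reported. The dictionaries below are hypothetical stand-ins for what config.py holds; only the names `Quarterly` and `District` come from the example above, the other display names are invented for illustration.

```python
# Hypothetical ranks: a higher rank means a finer level.
GRANULARITY_RANKS = {
    "date_time": {"calender_year": 1, "non_calendar_year": 1,
                  "quarter": 2, "month": 3, "date": 4},
    "geography": {"country": 1, "state": 2, "district": 3},
}
# Hypothetical display names for each level.
GRANULARITY_NAMES = {
    "calender_year": "Yearly", "non_calendar_year": "Yearly",
    "quarter": "Quarterly", "month": "Monthly", "date": "Daily",
    "country": "Country", "state": "State", "district": "District",
}


def get_granularity(columns):
    """Return the finest Date-Time and Geography levels, comma separated."""
    present = {c.strip().lower() for c in columns}
    result = []
    for domain in ("date_time", "geography"):
        ranks = GRANULARITY_RANKS[domain]
        # Sort the levels found in this domain from coarse to fine.
        levels = sorted((c for c in present if c in ranks), key=ranks.get)
        if levels:
            result.append(GRANULARITY_NAMES[levels[-1]])
    return ", ".join(result)
```

With columns `calender_year`, `quarter`, `state` and `district`, this sketch returns `Quarterly, District`; with no Date-Time or Geography columns it returns an empty string.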

Spatial Coverage :

The cases for Spatial Coverage are listed below:

Spatial Location | Dataset with categories as | Methodology | Spatial Coverage
Countries | India, Pakistan, China, etc. | - | Country
Specific Country | India | Represent it with the specific country name | India
States of a Country | Andhra Pradesh, Assam, etc. | - | States of India
Regions of a Country | South India, NE states, etc. | - | Regions of India
Specific State of a Country | Andhra Pradesh | Represent it with the specific state name | Andhra Pradesh
Districts of a State / States | Adilabad, Hyderabad, etc. | - | Districts of Telangana or Districts of India
Specific District of a State | Hyderabad | Represent it with the specific district name | Hyderabad

  • General Workflow

      flowchart LR
      A(Dataset) --> B{If Geographical Columns exists ?}
      B -- NO --> C(RETURN Default Value as INDIA)
      B -- YES --> D[Sort the order of different \nGeographical Level]
      D --> E(RETURN Value of biggest order of Geographical Column \nwith proper naming convention)
    

    Notes:

    • This library currently supports only the Country, State and District levels of Spatial Coverage.
    • The level of a geographic column is decided by its column name and not by its values, so changing column names will change the mapping.
    • If there is no geographic column, the result defaults to INDIA.
    • The spatial coverage order, keyword mapping and naming convention are defined in config.py.
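
One possible reading of the cases and workflow above, sketched with plain dictionaries as rows. All names here are hypothetical, and two points are assumptions: that the "biggest order" column means the finest geographic level present, and that the parent name is taken from the next coarser column when it holds a single value.

```python
GEO_LEVELS = ["country", "state", "district"]  # ordered coarse to fine
PLURAL = {"country": "Country", "state": "States", "district": "Districts"}
DEFAULT_COVERAGE = "India"


def unique_values(rows, column):
    """Sorted unique non-empty values of a column."""
    return sorted({row[column] for row in rows if row.get(column)})


def spatial_coverage(rows):
    present = [lvl for lvl in GEO_LEVELS if rows and lvl in rows[0]]
    if not present:
        return DEFAULT_COVERAGE  # no geographic column at all
    finest = present[-1]
    values = unique_values(rows, finest)
    if len(values) == 1:
        return values[0]  # a specific country, state or district
    if finest == "country":
        return PLURAL["country"]
    # Name the parent only when the next coarser column holds one value.
    parent = DEFAULT_COVERAGE
    for lvl in reversed(GEO_LEVELS[:GEO_LEVELS.index(finest)]):
        if lvl in present:
            parent_values = unique_values(rows, lvl)
            if len(parent_values) == 1:
                parent = parent_values[0]
            break
    return f"{PLURAL[finest]} of {parent}"
```

Under these assumptions, a dataset with several districts that all belong to Telangana yields `Districts of Telangana`, while several states with no country column yield `States of India`.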

File Formats Available :

Notes:

  • Reads the format of file from the file name.
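
Reading the format from the file name can be sketched with the standard library. The helper name `file_format` is hypothetical, and lower-casing the extension is an assumption.

```python
from pathlib import PurePosixPath


def file_format(file_name):
    """Return the lower-cased extension without the dot, or "" if there is none."""
    return PurePosixPath(file_name).suffix.lstrip(".").lower()
```

Only the last extension is considered, so `archive.tar.gz` is reported as `gz`.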

meta-facts's People

Contributors

100mi, paul-tharun, hemanthm005, deshetti

meta-facts's Issues

Error in determining Temporal coverage

meta-facts-metafacts-server-1  | /app/./app/utils/spatial_coverage.py:17: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
meta-facts-metafacts-server-1  |   self.dataset[self.get_geographic_columns].nunique(dropna=True)
meta-facts-metafacts-server-1  | /app/./app/utils/spatial_coverage.py:17: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
meta-facts-metafacts-server-1  |   self.dataset[self.get_geographic_columns].nunique(dropna=True)
meta-facts-metafacts-server-1  | INFO:     192.168.96.1:61472 - "POST /meta-data/s3 HTTP/1.1" 500 Internal Server Error
meta-facts-metafacts-server-1  | ERROR:    Exception in ASGI application
meta-facts-metafacts-server-1  | Traceback (most recent call last):
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/uvicorn/protocols/http/httptools_impl.py", line 390, in run_asgi
meta-facts-metafacts-server-1  |     result = await app(self.scope, self.receive, self.send)
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
meta-facts-metafacts-server-1  |     return await self.app(scope, receive, send)
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/fastapi/applications.py", line 208, in __call__
meta-facts-metafacts-server-1  |     await super().__call__(scope, receive, send)
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/starlette/applications.py", line 112, in __call__
meta-facts-metafacts-server-1  |     await self.middleware_stack(scope, receive, send)
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 181, in __call__
meta-facts-metafacts-server-1  |     raise exc
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 159, in __call__
meta-facts-metafacts-server-1  |     await self.app(scope, receive, _send)
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/starlette/middleware/cors.py", line 92, in __call__
meta-facts-metafacts-server-1  |     await self.simple_response(scope, receive, send, request_headers=headers)
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/starlette/middleware/cors.py", line 147, in simple_response
meta-facts-metafacts-server-1  |     await self.app(scope, receive, send)
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/starlette/exceptions.py", line 82, in __call__
meta-facts-metafacts-server-1  |     raise exc
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/starlette/exceptions.py", line 71, in __call__
meta-facts-metafacts-server-1  |     await self.app(scope, receive, sender)
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 656, in __call__
meta-facts-metafacts-server-1  |     await route.handle(scope, receive, send)
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 259, in handle
meta-facts-metafacts-server-1  |     await self.app(scope, receive, send)
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 61, in app
meta-facts-metafacts-server-1  |     response = await func(request)
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/fastapi/routing.py", line 226, in app
meta-facts-metafacts-server-1  |     raw_response = await run_endpoint_function(
meta-facts-metafacts-server-1  |   File "/usr/local/lib/python3.9/site-packages/fastapi/routing.py", line 159, in run_endpoint_function
meta-facts-metafacts-server-1  |     return await dependant.call(**values)
meta-facts-metafacts-server-1  |   File "/app/./app/api/api_v1/routers/meta_data.py", line 70, in get_meta_data_from_s3
meta-facts-metafacts-server-1  |     meta_data = await create_meta_data_for_s3_bucket(
meta-facts-metafacts-server-1  |   File "/app/./app/utils/meta_data.py", line 135, in create_meta_data_for_s3_bucket
meta-facts-metafacts-server-1  |     results = await asyncio.gather(*tasks)
meta-facts-metafacts-server-1  |   File "/app/./app/utils/meta_data.py", line 106, in get_dataset_meta_data_for_s3_file
meta-facts-metafacts-server-1  |     result = await asyncio.gather(
meta-facts-metafacts-server-1  |   File "/app/./app/utils/temporal_coverage.py", line 107, in get_temporal_coverage
meta-facts-metafacts-server-1  |     year_in_sequence = is_sequence(year_mapping)
meta-facts-metafacts-server-1  |   File "/app/./app/utils/temporal_coverage.py", line 65, in is_sequence
meta-facts-metafacts-server-1  |     min_val = min(combine_all_years)
meta-facts-metafacts-server-1  | ValueError: min() arg is an empty sequence
meta-facts-metafacts-server-1  | /app/./app/utils/spatial_coverage.py:17: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
meta-facts-metafacts-server-1  |   self.dataset[self.get_geographic_columns].nunique(dropna=True)
meta-facts-metafacts-server-1  | /app/./app/utils/spatial_coverage.py:17: FutureWarning: Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.

  1. An error in determining any single meta-data property should not terminate the whole process.
  2. Fix the FutureWarning issue.
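
For the first point, one option is to let `asyncio.gather` collect exceptions instead of raising the first one. The sketch below is self-contained: the two extractor coroutines are hypothetical stand-ins, and replacing a failed property with an empty string is an assumption about the desired fallback.

```python
import asyncio


async def units(dataset):
    return "kg"


async def temporal_coverage(dataset):
    # Reproduces the failure from the log: min() on an empty sequence.
    return min(dataset["years"])


async def build_meta_data(dataset):
    extractors = {"units": units, "temporal_coverage": temporal_coverage}
    results = await asyncio.gather(
        *(fn(dataset) for fn in extractors.values()),
        return_exceptions=True,  # exceptions are returned as results, not raised
    )
    # Keep successful values and fall back to "" for failed properties.
    return {
        name: "" if isinstance(result, Exception) else result
        for name, result in zip(extractors, results)
    }


meta_data = asyncio.run(build_meta_data({"years": []}))
```

Here `units` still produces its value even though `temporal_coverage` raises `ValueError`. For the second point, the warning text itself suggests converting the set of column names to a list before indexing.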

Reassign order for Meta-facts

  • Use the following order for meta-facts:
output_file_name, units, temporal_coverage, granularity, spatial_coverage, formats_available, is_public

Discrete values in Temporal Coverage should be a combination of discrete year ranges

Description

Temporal Coverage for a dataset with discrete years (example: 2010,2011,2013,2014,2015,2016,2017,2019,2020,2021) is currently represented as a comma-separated string, as in the image below:

[Screenshot 2022-11-01 at 3 16 27 PM]

A better representation would group the years into continuous ranges, such as: 2010-11, 2013-17 and 2019-2021

Task

All the Temporal Coverage code is in app > utils > temporal_coverage.py. Go through the existing logic that computes temporal coverage, modify it accordingly, and test all possible cases.
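
A minimal sketch of the grouping step, independent of the existing code. Both function names are hypothetical, and the output uses full four-digit end years, which is an assumption about the final display format.

```python
def group_years(years):
    """Collapse years into (start, end) runs of consecutive values."""
    runs = []
    for year in sorted(set(years)):
        if runs and year == runs[-1][1] + 1:
            runs[-1][1] = year  # extend the current run
        else:
            runs.append([year, year])  # start a new run
    return [tuple(run) for run in runs]


def format_runs(runs):
    """Render single years as 'YYYY' and longer runs as 'YYYY-YYYY'."""
    return ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)
```

For the years in the description, `group_years` gives `[(2010, 2011), (2013, 2017), (2019, 2021)]`, which `format_runs` renders as `2010-2011, 2013-2017, 2019-2021`.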

Temporal Coverage for Date columns

Description
Temporal Coverage for a dataset with dates (example: 10-11-2019, 23-12-2021, 10-11-2022, 23-12-2023, etc.) is currently not generated:

[Image: matafacts]

For the example in the Description, the expected result is: 2019, 2021-23.

Task
All the Temporal Coverage code is in app > utils > temporal_coverage.py. Go through the existing logic that computes temporal coverage, modify it accordingly, and test all possible cases.
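
One way to approach this is to extract the year from each date and then group consecutive years. The sketch below assumes dates in day-month-year form (`%d-%m-%Y`), as in the example; the function names are hypothetical.

```python
from datetime import datetime


def years_from_dates(dates, fmt="%d-%m-%Y"):
    """Parse date strings and return the sorted unique years."""
    return sorted({datetime.strptime(d.strip(), fmt).year for d in dates})


def coverage_from_dates(dates, fmt="%d-%m-%Y"):
    """Render years as 'YYYY' or 'YYYY-YY' runs, comma separated."""
    parts = []
    start = prev = None
    # The trailing None acts as a sentinel that flushes the last run.
    for year in years_from_dates(dates, fmt) + [None]:
        if prev is not None and year == prev + 1:
            prev = year  # still inside a consecutive run
            continue
        if start is not None:
            parts.append(str(start) if start == prev else f"{start}-{prev % 100:02d}")
        start = prev = year
    return ", ".join(parts)
```

For the dates in the description this returns `2019, 2021-23`. The two-digit end year becomes ambiguous for runs that cross a century, so a full four-digit end year may be a safer choice.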

Add more Granular values

Description:
Currently we have Date and Geographical granularity. We need to add Airline, Crops, Gender and Language.

Add error messages and headers in response for failed requests on the s3 endpoint

  1. When the user sends an invalid bucket name or path name, the failed response must include an error message. The error shown in the server logs below, "The specified bucket does not exist", must be included in the failed response message.
    Expected sample error response:
{
    "status": 422 , // http status
    "details":{
        "field":"s3_bucket" ,  // JSON field which errored
        "message":"specified bucket does not exist" // error message to be shown in form
    }
}

Server logs -

metafacts-server_1  | ERROR:    Exception in ASGI application
metafacts-server_1  | Traceback (most recent call last):
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/uvicorn/protocols/http/httptools_impl.py", line 390, in run_asgi
metafacts-server_1  |     result = await app(self.scope, self.receive, self.send)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
metafacts-server_1  |     return await self.app(scope, receive, send)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/fastapi/applications.py", line 208, in __call__
metafacts-server_1  |     await super().__call__(scope, receive, send)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/starlette/applications.py", line 112, in __call__
metafacts-server_1  |     await self.middleware_stack(scope, receive, send)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 181, in __call__
metafacts-server_1  |     raise exc
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/starlette/middleware/errors.py", line 159, in __call__
metafacts-server_1  |     await self.app(scope, receive, _send)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/starlette/middleware/cors.py", line 92, in __call__
metafacts-server_1  |     await self.simple_response(scope, receive, send, request_headers=headers)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/starlette/middleware/cors.py", line 147, in simple_response
metafacts-server_1  |     await self.app(scope, receive, send)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/starlette/exceptions.py", line 82, in __call__
metafacts-server_1  |     raise exc
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/starlette/exceptions.py", line 71, in __call__
metafacts-server_1  |     await self.app(scope, receive, sender)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 656, in __call__
metafacts-server_1  |     await route.handle(scope, receive, send)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 259, in handle
metafacts-server_1  |     await self.app(scope, receive, send)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/starlette/routing.py", line 61, in app
metafacts-server_1  |     response = await func(request)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/fastapi/routing.py", line 226, in app
metafacts-server_1  |     raw_response = await run_endpoint_function(
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/fastapi/routing.py", line 159, in run_endpoint_function
metafacts-server_1  |     return await dependant.call(**values)
metafacts-server_1  |   File "/app/./app/api/api_v1/routers/meta_data.py", line 71, in get_meta_data_from_s3
metafacts-server_1  |     meta_data = await create_meta_data_for_s3_bucket(
metafacts-server_1  |   File "/app/./app/utils/meta_data.py", line 124, in create_meta_data_for_s3_bucket
metafacts-server_1  |     tasks = [
metafacts-server_1  |   File "/app/./app/utils/meta_data.py", line 124, in <listcomp>
metafacts-server_1  |     tasks = [
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/boto3/resources/collection.py", line 81, in __iter__
metafacts-server_1  |     for page in self.pages():
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/boto3/resources/collection.py", line 171, in pages
metafacts-server_1  |     for page in pages:
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/botocore/paginate.py", line 269, in __iter__
metafacts-server_1  |     response = self._make_request(current_kwargs)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/botocore/paginate.py", line 357, in _make_request
metafacts-server_1  |     return self._method(**current_kwargs)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 508, in _api_call
metafacts-server_1  |     return self._make_api_call(operation_name, kwargs)
metafacts-server_1  |   File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 915, in _make_api_call
metafacts-server_1  |     raise error_class(parsed_response, operation_name)
metafacts-server_1  | botocore.errorfactory.NoSuchBucket: An error occurred (NoSuchBucket) when calling the ListObjects operation: The specified bucket does not exist

  2. The response headers of a failed request differ from those of a successful request, which causes errors in the front end. I think this is mostly because the CORS headers are not included in the failed response.

response headers received for successful request -

access-control-allow-credentials: true
access-control-allow-origin: *
content-length: 928
content-type: application/json
date: Tue, 29 Nov 2022 06:42:23 GMT
server: uvicorn

response headers received for failed request -

content-length: 2707
content-type: text/plain; charset=utf-8
date: Tue, 29 Nov 2022 06:43:17 GMT
server: uvicorn
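
For the first item, the expected payload could be built from the exception message seen in the server logs. This is a sketch only: `error_response` is a hypothetical helper, and taking the text after the last `": "` as the user-facing message is an assumption based on the single log line above.

```python
def error_response(field, exc, status=422):
    """Build the error payload shape requested in this issue."""
    # The logged message ends with ": <human readable reason>".
    message = str(exc).rsplit(": ", 1)[-1]
    return {"status": status, "details": {"field": field, "message": message}}


# A plain Exception stands in for the error raised by the S3 client.
exc = Exception(
    "An error occurred (NoSuchBucket) when calling the ListObjects operation: "
    "The specified bucket does not exist"
)
payload = error_response("s3_bucket", exc)
```

Catching the error inside the route and returning a payload like this as a normal JSON response, instead of letting the exception propagate, would probably also resolve the second item, since the response would then pass through the CORS middleware like any successful one.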

Request entity too large

Nginx returned a "Request entity too large" error while uploading multiple files.
It seems we are passing multiple files in a single request and Nginx cannot transfer all of them at once. We should therefore look into sending concurrent requests, each carrying a single file.
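
The one-file-per-request idea could look roughly like this on the client side. `upload_one` is a hypothetical stand-in for the real HTTP call, and the concurrency limit of 3 is an arbitrary choice for illustration.

```python
import asyncio


async def upload_one(name):
    # Stand-in for a real HTTP request that sends a single file.
    await asyncio.sleep(0)
    return f"uploaded {name}"


async def upload_all(names, limit=3):
    semaphore = asyncio.Semaphore(limit)  # cap the number of in-flight requests

    async def guarded(name):
        async with semaphore:
            return await upload_one(name)

    # gather returns the results in the same order as the inputs.
    return await asyncio.gather(*(guarded(n) for n in names))


results = asyncio.run(upload_all(["a.csv", "b.csv", "c.csv", "d.csv"]))
```

Each request then carries only one file, so the body size seen by Nginx is bounded by the largest single file.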
