Latest Ingestion Pipeline

The new latest ingestion pipeline is designed to ingest data asynchronously into a Redis database. This pipeline enables the IUDX Resource Server to serve the latest data for IUDX-specified resources available in the database.

Design Implementation for IUDX release v2:

The implementation for retrieving the latest data for IUDX resources in IUDX v2 guarantees performant retrieval from the ElasticSearch database. In IUDX v1, this involved the large computational overhead of sorting directly on the database, which was not a desirable trait: as the dataset grew, the sorting incurred high latency in responses.

The implementation consists of 3 main components-

  1. Rabbit MQ
  2. Logstash
  3. ElasticSearch

Logstash consumes the data subscribed through the RabbitMQ (hereafter referred to as RMQ) data broker using the RabbitMQ plugin and pushes it into a separate ElasticSearch (hereafter referred to as ES) index, Latest. The Latest index contains one document per IUDX resource ID. The data is always upserted: if the resource is not present in the index it is inserted, otherwise it is updated. The ES document ID is the SHA-1 hash of the resource ID, which is what enables upsertion. Therefore, when the Latest API is triggered on the Resource Server, there is exactly one document per resource in the Latest index, which can be retrieved directly using the ES REST GET API. No "search" is required, ensuring faster retrieval of data. To ensure data consistency, we used the version feature in ES: the version of each document is the timestamp at which the data was published, so a document is never overwritten by data that is not more recent. This also ensures that out-of-order data packets arriving at RMQ do not lead to inconsistent data.
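The SHA-1 document ID and timestamp-based versioning described above can be sketched in a few lines. This is a minimal illustration, not the actual pipeline: a plain dict stands in for the ES Latest index, and `upsert_latest` and `doc_id_for` are hypothetical names chosen here for clarity.

```python
import hashlib

# A dict stands in for the ES "Latest" index: doc_id -> (version, document).
latest_index = {}

def doc_id_for(resource_id: str) -> str:
    """The document ID is the SHA-1 of the resource ID, so every
    upsert for a given resource targets the same document."""
    return hashlib.sha1(resource_id.encode("utf-8")).hexdigest()

def upsert_latest(resource_id: str, payload: dict, observed_at: int) -> bool:
    """Upsert one data packet, using its publish timestamp as the
    document version. Returns True if written, False if stale."""
    key = doc_id_for(resource_id)
    current = latest_index.get(key)
    if current is not None and current[0] >= observed_at:
        return False  # out-of-order packet: keep the newer document
    latest_index[key] = (observed_at, payload)
    return True
```

In real ES this corresponds to indexing with an externally supplied version, so the database itself rejects writes carrying an older timestamp.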

Motivation for new design

It was then observed that the size of the Latest index was growing abnormally because of frequent updates. Since the data from all IUDX resources was merged into one index, the update rate shot up to hundreds of thousands per second, and the index size grew accordingly. This was attributed to the underlying implementation of ES and is a classic trade-off between indexing and querying speed. A manual flush, or reducing the refresh interval (which was already at 1 second), would have helped only partially.

To get rid of this manual flushing whenever the index size grew too large, we decided to move to a new caching mechanism using Redis.

We tried a few designs and weighed their pros and cons against our use case. Eventually, we decided to write our own ingestion pipeline.

RMQ Logstash-websocket Redis Plugin

The implementation consists of 5 main components-

  1. RabbitMQ
  2. Logstash
  3. Websocket server
  4. Python Client
  5. Redis

Drawbacks

  • The Redis input plugin for Logstash did not support any data structure other than lists, which is why a simple Websocket plugin was a suitable candidate to pull the data from Logstash and ingest it into Redis.
  • The system, although very simple in both design and implementation, introduced huge latency. End-to-end ingestion of just one packet from RMQ to the Redis database took around 30 seconds, which was not acceptable. This was due to the multiple hops between individual modules [RMQ -> Logstash -> Websocket Server -> Python Client -> Redis Database].
  • It was synchronous.

This led to moving to an asynchronous, event-loop-based system, which is discussed below.

Asynchronous using Aio_pika

The implementation consists of 3 main components-

  1. RabbitMQ
  2. Redis Client written in Python using Aio_pika
  3. Redis
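The shape of this event-loop-based client can be sketched with the standard library alone. This is a hedged illustration, not the project's code: an `asyncio.Queue` stands in for the RMQ queue and a dict stands in for Redis, while the real client uses aio_pika's consume callback against an actual broker.

```python
import asyncio

# Stand-ins for the real infrastructure: a dict mocks Redis.
redis_store = {}

async def consume(queue: asyncio.Queue) -> None:
    """Consume packets and write each to the key-value store without
    blocking the event loop, mirroring an aio_pika consume callback."""
    while True:
        packet = await queue.get()
        if packet is None:  # sentinel used here to stop the demo consumer
            break
        redis_store[packet["id"]] = packet["data"]
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()  # mocks the RMQ queue
    consumer = asyncio.create_task(consume(queue))
    for i in range(3):
        await queue.put({"id": f"res-{i}", "data": {"seq": i}})
    await queue.put(None)
    await consumer

asyncio.run(main())
```

Because every packet is handled on one event loop with no inter-process hops, the multi-module latency of the earlier Websocket design disappears.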

Asynchronous using Vertx

The implementation consists of 3 main components-

  1. RabbitMQ
  2. Redis Client written in Java using Vertx and JReJSON
  3. Redis

A detailed explanation of the implementation can be found here.

Reactive Pattern using aio_pika

The design is similar to the async Python + Aio_pika approach, but the implementation follows the reactive pattern, where we isolate and parallelize the IO-bound and CPU-bound computations.
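The IO/CPU isolation described above can be sketched with asyncio's executor support. This is an illustrative pattern under assumed names (`transform`, `handle` are hypothetical): awaiting the broker and Redis stays on the event loop, while the CPU-bound transformation is dispatched to a worker pool.

```python
import asyncio
import json
from concurrent.futures import ThreadPoolExecutor

results = {}  # stands in for the Redis write target

def transform(raw: str) -> dict:
    """CPU-bound step (parsing/reshaping), kept off the event loop."""
    doc = json.loads(raw)
    doc["observed"] = True
    return doc

async def handle(raw: str, pool: ThreadPoolExecutor) -> None:
    loop = asyncio.get_running_loop()
    # The CPU-bound transform runs in the pool; the coroutine only
    # awaits its result, so IO handling is never blocked.
    doc = await loop.run_in_executor(pool, transform, raw)
    results[doc["id"]] = doc

async def main() -> None:
    with ThreadPoolExecutor(max_workers=4) as pool:
        raws = [json.dumps({"id": f"res-{i}"}) for i in range(3)]
        await asyncio.gather(*(handle(r, pool) for r in raws))

asyncio.run(main())
```

A process pool could replace the thread pool if the transformation were heavy enough to contend with the GIL; the reactive structure is the same either way.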

Testing

Unit tests

  1. Run the tests using mvn clean test checkstyle:checkstyle pmd:pmd
  2. Reports are stored in ./target/

Integration tests

Integration tests are written using Rest Assured

  1. Run the server through either docker, maven or redeployer
  2. Run the integration tests mvn test-compile failsafe:integration-test -DskipUnitTests=true -DintTestHost=localhost -DintTestPort=8080
  3. Reports are stored in ./target/

Contributing

We follow a Git merge-based workflow

  1. Fork this repo
  2. Create a new feature branch in your fork. Branches covering multiple features must have a hyphen-separated name, or refer to a milestone name as mentioned in GitHub -> Projects
  3. Commit to your fork and raise a Pull Request with upstream
  4. If you find any issues with the implementation please raise an issue on Issues in GitHub.

License

View License

latest-ingestion-pipeline's People

Contributors

abhi4578, ananjaykumar2, ankitmashu, code-akki, gopal-mahajan, isridharrao, kailash, karun-singh, kranthi-guribilli, manasakoraganji, pranavrd, sushanthakumar, swaminathanvasanth, tharak-ram1


latest-ingestion-pipeline's Issues

2.2. Broadcast Integration

The Broadcast Unique Attributes Information as per section 1.5 shall be consumed and updated accordingly.

2.1. Read Unique Attributes from Database

The required attribute information for understanding the unique attribute to be used to query for the latest data is now read from a configuration file. With this requirement, we need to read the unique-id and unique attribute from a database whenever required.

Database to use: PostgreSQL
Database schema: id, unique-attribute
