Just (Ad)D is a distributed streaming data pipeline for analyzing ad performance in real time, so that well-performing ads can be placed on websites with high user traffic.
Advertising is all about gaining user attention: targeting the right consumers, at the right times, through the right channels. Capturing that attention to improve conversion rates is therefore crucial for advertisers.
For ease of deployment, and to avoid tedious network configuration across individual EC2 instances, this pipeline uses managed services: Confluent Cloud for the Kafka cluster and AWS EMR for the Spark cluster.
- The sample dataset is stored on an EC2 instance.
- From the EC2 instance, simulated messages are produced to the page-view and click-event topics in Confluent Cloud, which provides Kafka as a managed service.
- Spark consumes and processes the stream of messages.
- Two main calculations:
  - Count of clicks/views for each advertisement within an event-time window.
  - Top 3 websites with a user surge, broken down by platform (Mobile/Tablet/Desktop) and traffic source (Internal/Social/Search). Windowing and watermarks are used to handle late or out-of-order data.
- The processed results of each stream are stored in a MySQL database with a timestamp.
- The continuously updated database is queried and visualized on a live dashboard built with Plotly Dash.
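The ingestion step above (producing simulated page-view events) can be sketched as follows. This is a minimal illustration, not the repo's actual producer: the event fields, topic name, and schema here are assumptions loosely based on the Outbrain-style platform/traffic-source attributes described above.

```python
import json
import random
import time

# Assumed categorical values, mirroring the platforms/sources listed above
PLATFORMS = ["Mobile", "Tablet", "Desktop"]
SOURCES = ["Internal", "Social", "Search"]

def make_page_view(document_id: int) -> str:
    """Build one simulated page-view event as a JSON string."""
    event = {
        "document_id": document_id,
        "platform": random.choice(PLATFORMS),
        "traffic_source": random.choice(SOURCES),
        "timestamp": int(time.time() * 1000),  # event time in ms
    }
    return json.dumps(event)

# With confluent-kafka installed and Confluent Cloud credentials configured,
# the event would be sent to the topic roughly like this:
#   producer.produce("page_views", value=make_page_view(42))
#   producer.flush()
```

In the repo, `kafka_ingestion/pvproducer.py` plays this role, using the shared Confluent Cloud helpers in `ccloud_lib.py`.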
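To make the windowing-and-watermark behavior concrete, here is a toy, pure-Python model of what Spark's event-time window aggregation does: events are bucketed into tumbling windows keyed by ad, and events arriving later than the watermark allows are dropped. The window and watermark sizes are illustrative assumptions, not values taken from the repo's Spark jobs.

```python
from collections import defaultdict

WINDOW_MS = 60_000      # 1-minute tumbling windows (assumed size)
WATERMARK_MS = 30_000   # tolerate events up to 30 s late (assumed)

def count_clicks(events):
    """Count clicks per (ad_id, window_start), dropping events that fall
    behind the watermark -- a simplified model of Spark's
    withWatermark + window aggregation."""
    counts = defaultdict(int)
    max_event_time = 0
    for ad_id, event_time in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - WATERMARK_MS
        if event_time < watermark:
            continue  # too late: the engine discards this event
        window_start = (event_time // WINDOW_MS) * WINDOW_MS
        counts[(ad_id, window_start)] += 1
    return dict(counts)

# In PySpark (as in spark_processing/kafspar2.py) this corresponds roughly to:
#   df.withWatermark("timestamp", "30 seconds") \
#     .groupBy(window("timestamp", "1 minute"), "ad_id").count()
```

Note the asymmetry: the watermark only advances with the maximum observed event time, so an event that arrives out of order is still counted as long as it is within the allowed lateness.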
A subset of the Outbrain Click Prediction Kaggle dataset is used.
Just-Ad-D/
├── dash_frontend
│ └── app.py
├── kafka_clicks.sh
├── kafka_ingestion
│ ├── ccloud_lib.py
│ ├── producer.py
│ └── pvproducer.py
├── kafka_pv.sh
├── LICENSE
├── README.md
├── sparkjob_clicks.sh
├── sparkjob_pv.sh
├── spark_processing
│ ├── kafspar2.py
│ └── pvkafspar2.py