Just (Ad)D is a distributed streaming data pipeline for analyzing ad performance in real time, so that well-performing ads can be placed on websites with high user traffic.
Advertising is all about gaining user attention: targeting the right consumers, at the right times, through the right channels. Capturing that attention to improve conversion rates is therefore crucial for advertisers.
For ease of deployment, and to avoid tedious network configuration across individual EC2 instances, this pipeline uses managed services: Confluent Cloud for the Kafka cluster and AWS EMR for the Spark cluster.
- The sample dataset is stored on an EC2 instance.
- From the EC2 instance, simulated messages are produced to the page-view and click-event topics in Confluent Cloud, which provides Kafka as a managed service.
- Spark consumes and processes the stream of messages.
- Two main calculations:
  - Count of clicks/views for each advertisement within an event-time window.
  - Top 3 websites with a user surge, broken down by platform (Mobile/Tablet/Desktop) and traffic source (Internal/Social/Search). Windowing and watermarks are used to handle late or out-of-order data.
- The processed results of each stream are stored in a MySQL database with a timestamp.
- The continuously updated database is queried and visualized on a live dashboard built with Plotly Dash.
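The ingestion step above (producing simulated page-view events) can be sketched as follows. This is a minimal illustration, not the repo's actual producer: the event fields, topic name, and schema here are assumptions loosely based on the Outbrain-style platform/traffic-source attributes described above.

```python
import json
import random
import time

# Assumed categorical values, mirroring the platforms/sources listed above
PLATFORMS = ["Mobile", "Tablet", "Desktop"]
SOURCES = ["Internal", "Social", "Search"]

def make_page_view(document_id: int) -> str:
    """Build one simulated page-view event as a JSON string."""
    event = {
        "document_id": document_id,
        "platform": random.choice(PLATFORMS),
        "traffic_source": random.choice(SOURCES),
        "timestamp": int(time.time() * 1000),  # event time in ms
    }
    return json.dumps(event)

# With confluent-kafka installed and Confluent Cloud credentials configured,
# the event would be sent to the topic roughly like this:
#   producer.produce("page_views", value=make_page_view(42))
#   producer.flush()
```

In the repo, `kafka_ingestion/pvproducer.py` plays this role, using the shared Confluent Cloud helpers in `ccloud_lib.py`.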
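To make the windowing-and-watermark behavior concrete, here is a toy, pure-Python model of what Spark's event-time window aggregation does: events are bucketed into tumbling windows keyed by ad, and events arriving later than the watermark allows are dropped. The window and watermark sizes are illustrative assumptions, not values taken from the repo's Spark jobs.

```python
from collections import defaultdict

WINDOW_MS = 60_000      # 1-minute tumbling windows (assumed size)
WATERMARK_MS = 30_000   # tolerate events up to 30 s late (assumed)

def count_clicks(events):
    """Count clicks per (ad_id, window_start), dropping events that fall
    behind the watermark -- a simplified model of Spark's
    withWatermark + window aggregation."""
    counts = defaultdict(int)
    max_event_time = 0
    for ad_id, event_time in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - WATERMARK_MS
        if event_time < watermark:
            continue  # too late: the engine discards this event
        window_start = (event_time // WINDOW_MS) * WINDOW_MS
        counts[(ad_id, window_start)] += 1
    return dict(counts)

# In PySpark (as in spark_processing/kafspar2.py) this corresponds roughly to:
#   df.withWatermark("timestamp", "30 seconds") \
#     .groupBy(window("timestamp", "1 minute"), "ad_id").count()
```

Note the asymmetry: the watermark only advances with the maximum observed event time, so an event that arrives out of order is still counted as long as it is within the allowed lateness.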
A subset of the Outbrain Click Prediction Kaggle dataset is used.
Just-Ad-D/
├── dash_frontend
│ └── app.py
├── kafka_clicks.sh
├── kafka_ingestion
│ ├── ccloud_lib.py
│ ├── producer.py
│ └── pvproducer.py
├── kafka_pv.sh
├── LICENSE
├── README.md
├── sparkjob_clicks.sh
├── sparkjob_pv.sh
├── spark_processing
│ ├── kafspar2.py
│ └── pvkafspar2.py