
License: MIT
Languages: Python 93.94%, Shell 6.06%
Topics: a-priori-algorithm, amazon, metadata, pcy, python, similarity-search


Amazon Metadata Streaming Data Pipeline and Itemset Mining

This repository houses an implementation of finding frequent itemsets using the A-Priori and PCY algorithms on Apache Kafka.

It uses a 15GB .json file, sampled from the 100+GB Amazon_Reviews_Metadata dataset, and was developed as part of an assignment for the course Fundamentals of Big Data Analytics (DS2004).

The project leverages:

  1. Apache Kafka for robust real-time data streaming.
  2. (Optional) Azure VMs and Blob Storage, providing a scalable solution for large datasets.

Repository Structure:

├── preprocessing.py            # Script for preprocessing data locally
├── sampling.py                 # Script to randomly sample the original 100+GB dataset down to 15GB
├── preprocessing_for_azure.py  # Script for preprocessing and loading data to Azure Blob Storage
├── blob_to_kafka_producer.py   # Script for streaming data from Azure Blob to Kafka
├── consumer1.py                # Kafka consumer implementing the Apriori algorithm
├── consumer2.py                # Kafka consumer implementing the PCY algorithm
├── consumer3.py                # Kafka consumer for anomaly detection
├── producer_for_1_2.py         # Kafka producer for the Apriori and PCY consumers
└── producer_for_3.py           # Kafka producer for the anomaly detection consumer

Setup Instructions

1. Data Preparation

The first step is to download and preprocess the Amazon Metadata dataset.

  • Download the dataset from the provided Amazon link, then preprocess it with EITHER of:
        └── preprocessing_for_azure.py if using Azure,
        └── preprocessing.py if not.

  • Upload the preprocessed data to Azure Blob Storage (set the blob container and connection string in the script). If not using Azure, skip this step.
  • The original dataset's size necessitated sampling for efficient analysis; sampling.py draws a random sample while keeping a good mix of metadata (a sketch of the idea follows below).
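
The sampling step can be approximated with simple per-line random selection. Below is a minimal sketch in the spirit of sampling.py (not its exact code); the file names and sampling fraction are placeholders:

    import random

    SAMPLE_FRACTION = 0.15  # assumption: keep ~15GB out of 100+GB

    # One JSON record per line; stream the file so it never loads fully.
    with open("Amazon_Reviews_Metadata.json", "rb") as src, \
         open("sampled_metadata.json", "wb") as dst:
        for line in src:
            if random.random() < SAMPLE_FRACTION:
                dst.write(line)

A single streaming pass like this keeps memory usage constant regardless of input size, which matters at the 100+GB scale.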
2. Streaming Pipeline

Next, set up Kafka (and optionally Azure Blob Storage):

  • Deploy Apache Kafka and ensure the brokers are accessible.
  • Modify blob_to_kafka_producer.py with your Azure Blob Storage connection details and Kafka bootstrap servers.
  • Run blob_to_kafka_producer.py to stream data from Azure Blob Storage to Kafka (a sketch of this step follows below).
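
As a rough illustration of the blob-to-Kafka step, the sketch below streams a blob in chunks and publishes one JSON record per line. It assumes the azure-storage-blob and kafka-python packages; the connection string, container, blob, topic, and broker address are all placeholders:

    import json

    from azure.storage.blob import BlobServiceClient
    from kafka import KafkaProducer

    CONNECTION_STRING = "<azure-connection-string>"  # placeholder
    blob_client = (
        BlobServiceClient.from_connection_string(CONNECTION_STRING)
        .get_blob_client(container="preprocessed", blob="metadata.json")
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Stream the blob chunk by chunk, re-assembling JSON lines that
    # span chunk boundaries, and publish each record to Kafka.
    buffer = b""
    for chunk in blob_client.download_blob().chunks():
        buffer += chunk
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line.strip():
                producer.send("amazon-metadata", json.loads(line))
    if buffer.strip():  # flush any final record without a trailing newline
        producer.send("amazon-metadata", json.loads(buffer))
    producer.flush()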
3. Consumer Applications

Then deploy the consumer scripts:

  • consumer1.py: Consumes data for frequent itemset mining using Apriori. Adjust the Kafka topic and MongoDB details.
  • consumer2.py: Same setup as the Apriori consumer, but implements the PCY algorithm (a simplified PCY sketch follows below).
  • consumer3.py: Implements anomaly detection. Configure it for the relevant Kafka topic.
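
For reference, here is a minimal two-pass PCY sketch over a window of baskets. This is a simplified illustration of the technique, not the repository's consumer2.py; NUM_BUCKETS and SUPPORT are hypothetical values:

    from collections import Counter
    from itertools import combinations

    NUM_BUCKETS = 10_000  # assumption: sized to available memory
    SUPPORT = 5           # hypothetical support threshold

    def pcy_first_pass(baskets):
        # Count singletons and hash every pair into a fixed bucket array.
        item_counts = Counter()
        buckets = [0] * NUM_BUCKETS
        for basket in baskets:
            item_counts.update(basket)
            for pair in combinations(sorted(set(basket)), 2):
                buckets[hash(pair) % NUM_BUCKETS] += 1
        frequent_items = {i for i, c in item_counts.items() if c >= SUPPORT}
        frequent_buckets = {b for b, c in enumerate(buckets) if c >= SUPPORT}
        return frequent_items, frequent_buckets

    def pcy_second_pass(baskets, frequent_items, frequent_buckets):
        # Count only pairs whose items are frequent AND whose bucket is frequent,
        # which is where PCY saves memory over plain Apriori.
        pair_counts = Counter()
        for basket in baskets:
            items = sorted(set(basket) & frequent_items)
            for pair in combinations(items, 2):
                if hash(pair) % NUM_BUCKETS in frequent_buckets:
                    pair_counts[pair] += 1
        return {p: c for p, c in pair_counts.items() if c >= SUPPORT}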

Technologies and Challenges:

Used Technologies:

  • Azure Blob Storage: Stores the preprocessed large-scale dataset.
  • Apache Kafka: Provides robust real-time data streaming.
  • Python: Scripting language for the data processing and mining algorithms.
  • MongoDB (optional): Recommended for storing consumer outputs for persistent analysis (see the sketch below).
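
If MongoDB is used, persisting each window's results can be as simple as the following pymongo sketch; the database and collection names are assumptions:

    from datetime import datetime, timezone

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    results = client["frequent_items"]["windows"]  # hypothetical db/collection

    def store_window_results(window_id, itemsets):
        # One document per processed window, timestamped for later analysis.
        results.insert_one({
            "window_id": window_id,
            "computed_at": datetime.now(timezone.utc),
            "frequent_itemsets": [
                {"items": list(items), "support": count}
                for items, count in itemsets.items()
            ],
        })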

Streaming Challenges and Solutions:

  • Sliding Window Approach
  • Approximation Techniques

Why Kafka and a Sliding Window Approach?

This project leverages Apache Kafka and a sliding window approach for real-time data processing due to several key advantages:

Scalability of Kafka:

Kafka's distributed architecture allows horizontal scaling by adding more nodes to the cluster. This ensures the system can handle ever-increasing data volumes in e-commerce scenarios without performance degradation.

Real-time Processing with a Sliding Window:

Traditional batch processing would not be suitable for real-time analytics. The sliding window approach, implemented within the Kafka consumers, processes data in chunks (windows) as they arrive in the stream, providing near-real-time insights without waiting for the entire dataset.
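
A minimal sketch of this pattern with kafka-python follows; the topic name, record field, and window sizes are assumptions, not the repository's exact consumer code:

    import json
    from collections import deque

    from kafka import KafkaConsumer

    WINDOW_SIZE = 1000  # hypothetical: baskets held in the window
    SLIDE = 200         # hypothetical: new baskets between mining runs

    def mine_frequent_itemsets(baskets):
        # Placeholder: consumer1.py / consumer2.py would run an
        # Apriori or PCY pass over the current window here.
        ...

    consumer = KafkaConsumer(
        "amazon-metadata",  # hypothetical topic
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    window = deque(maxlen=WINDOW_SIZE)  # oldest baskets fall off automatically
    new_records = 0
    for message in consumer:
        # assumption: each record carries a list of related product IDs
        window.append(message.value.get("also_buy", []))
        new_records += 1
        if new_records >= SLIDE:
            new_records = 0
            mine_frequent_itemsets(list(window))

The bounded deque is what makes the window "slide": each mining run sees only the most recent WINDOW_SIZE baskets, so memory stays constant as the stream grows.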

Low Latency with Kafka:

Kafka's high throughput and low latency are crucial for e-commerce applications. With minimal delays in data processing, businesses can gain quicker insights into customer behavior and product trends, allowing for faster decision-making.


While Azure Blob Storage provides excellent cloud storage for the preprocessed data, and Azure VMs allow for easier clustering, it is Kafka that enables the real-time processing central to this assignment's goals. The combination of Kafka's streaming capabilities and the sliding window approach within the consumers unlocks real-time analytics for e-commerce data.

Team:

  • mohammad-malik
  • manal-aamir