Big-Data---Instagram-by-location

This repository contains the code artifacts for the final project of COMP-548DL (Big Data Management and Processing) at the University of Nicosia.

The project analyses data from Instagram posts to provide insights into the popularity of different locations over different time periods. The dataset contains information on 42M posts, 1.2M locations and 4.5M profiles from Instagram, covering 2010 to 2019.

A document store database (Firestore) was selected because Instagram posts involve far more read operations than write operations: a post is created only once but viewed many times. When someone wants to view a post, all the relevant information is gathered in a single compact document.

If the data were to be processed many times, or on a regular basis, a better approach would be to join the collections once at the beginning and add more location and user information to each post's document. However, joining these 3 collections, each with millions of documents, would be very intensive and costly. So I processed the posts first and only then joined the results with some fields from the other collections.
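The "aggregate first, join afterwards" idea can be illustrated with a minimal plain-Python sketch. It is not the code from the notebooks: the field names (`location_id`, `name`, `lat`, `lng`), the toy records and both helper functions are hypothetical and only show why joining the small aggregated result is cheaper than joining every post.

```python
from collections import Counter

def count_posts_per_location(posts):
    """Aggregate first: one pass over the posts gives a small {location_id: count} table."""
    return Counter(
        p["location_id"] for p in posts if p.get("location_id") is not None
    )

def enrich_with_locations(counts, locations, fields=("name", "lat", "lng")):
    """Join only the aggregated rows with a few selected location fields."""
    rows = []
    for location_id, n_posts in counts.most_common():
        location = locations.get(location_id, {})
        row = {"location_id": location_id, "posts": n_posts}
        row.update({field: location.get(field) for field in fields})
        rows.append(row)
    return rows

# Hypothetical toy records; these are NOT the dataset's real schema or values.
posts = [
    {"post_id": "p1", "location_id": "loc_a"},
    {"post_id": "p2", "location_id": "loc_b"},
    {"post_id": "p3", "location_id": "loc_a"},
    {"post_id": "p4", "location_id": None},  # post without a location is skipped
]
locations = {
    "loc_a": {"name": "Place A", "lat": 35.1, "lng": 33.3},
    "loc_b": {"name": "Place B", "lat": 34.7, "lng": 33.0},
}

rows = enrich_with_locations(count_posts_per_location(posts), locations)
```

The join here touches one row per distinct location instead of one row per post, which is the cost saving the paragraph above describes.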

The procedure I followed is divided across the 3 notebooks of the repository as follows:

  1. Load_data

The first notebook loads the data and creates the files needed for the project. It explains how the data was loaded into Firestore and how to create the files needed for the processing part and the streaming demo.

  2. Historical_Data_Processing

This notebook shows the processing performed on the stored historical data. Since the volume of the data was very high, this notebook was run on a Google Cloud Platform Dataproc cluster consisting of 1 master node and 5 worker nodes (E2, 4 CPU). Processing was done using PySpark RDDs through some MapReduce operations and, for the sake of comparison, using the Spark SQL DataFrame API.
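The MapReduce-style processing used in the second notebook follows a pattern that can be sketched in plain Python, without a cluster. This is an illustration under stated assumptions, not the notebook's code: the `created_at` timestamp format, the field names and the helpers `map_post` and `reduce_by_key` are all hypothetical.

```python
from collections import defaultdict
from functools import reduce
from operator import add

def map_post(post):
    """Map step: emit ((location_id, 'YYYY-MM'), 1) for one post."""
    month = post["created_at"][:7]  # assumes ISO-like 'YYYY-MM-DD hh:mm:ss' strings
    return ((post["location_id"], month), 1)

def reduce_by_key(pairs, fn):
    """Reduce step: group values by key, then fold each group with fn."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(fn, values) for key, values in grouped.items()}

# Hypothetical toy posts; field names are assumptions, not the dataset's schema.
posts = [
    {"location_id": "loc_a", "created_at": "2019-05-17 10:00:00"},
    {"location_id": "loc_a", "created_at": "2019-05-20 08:30:00"},
    {"location_id": "loc_b", "created_at": "2019-05-20 09:00:00"},
    {"location_id": "loc_a", "created_at": "2019-06-01 12:00:00"},
]

# Posts per location per month
counts = reduce_by_key(map(map_post, posts), add)
```

On a PySpark RDD the same pattern would typically be written as `rdd.map(map_post).reduceByKey(add)`, and with the DataFrame API as a `groupBy(...).count()`. Whether the notebook uses exactly these keys is an assumption.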

  3. Streaming_Data_Demo

The third notebook contains the streaming data simulation. The last 30 days of the dataset were taken, split into daily batches and fed into a stream processing pipeline using Spark Streaming. Some simple aggregate statistics were calculated after each trigger, and a Results table was updated continuously.
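The daily-batch simulation can be sketched in plain Python as a loop that feeds one day of posts at a time into a running results table. This is a simplified stand-in for the Spark pipeline, not the notebook's implementation: the `date` and `likes` fields, the chosen statistics and both function names are hypothetical.

```python
from collections import defaultdict

def daily_batches(posts):
    """Split posts into per-day batches and yield them in chronological order."""
    by_day = defaultdict(list)
    for post in posts:
        by_day[post["date"]].append(post)
    for day in sorted(by_day):  # ISO 'YYYY-MM-DD' strings sort chronologically
        yield day, by_day[day]

def run_stream(posts):
    """Process one batch per 'trigger' and snapshot the running results table."""
    results = defaultdict(lambda: {"posts": 0, "likes": 0})
    snapshots = []
    for day, batch in daily_batches(posts):
        for post in batch:
            row = results[post["location_id"]]
            row["posts"] += 1
            row["likes"] += post["likes"]
        # Copy the table so later batches do not mutate earlier snapshots
        snapshots.append((day, {loc: dict(row) for loc, row in results.items()}))
    return snapshots

# Hypothetical toy posts, deliberately out of order; not real dataset values.
posts = [
    {"date": "2019-06-02", "location_id": "loc_a", "likes": 7},
    {"date": "2019-06-01", "location_id": "loc_a", "likes": 10},
    {"date": "2019-06-01", "location_id": "loc_b", "likes": 5},
]

snapshots = run_stream(posts)
```

Each entry in `snapshots` corresponds to the state of the results table after one simulated trigger.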

To visualize the results, I prepared a report with some map charts in Google Looker Studio at:
https://datastudio.google.com/reporting/e647d5ac-e2e2-437f-ac48-cb63d82fe382/page/p_zbw36zck2c

A brief overview of the project can be found in the video below:
https://youtu.be/BcUGLOHGWo0

Contributors

andreasrousos
