The analytics department of a municipality in my country asked me to design and implement a data system capable of storing readings from the environmental sensors they have installed throughout the city. The objective is to provide urban planners with insights that can help improve the city's environmental conditions. Additionally, the collected data will be used to build an application that promptly alerts citizens when measurements exceed recommended thresholds. While the specifics of the implementation cannot be disclosed, this overview outlines the fundamental structure of the system.
To initiate the project, I have chosen an open dataset from Kaggle as the sample data source. This dataset contains environmental sensor data, encompassing metrics such as temperature, humidity, and air quality. The structured format of this dataset mirrors the initial data gathered by the municipality's sensors. The dataset originates from custom-built sensor arrays connected to Raspberry Pi devices. Each device recorded multiple readings from various sensors, including temperature, humidity, carbon monoxide (CO), liquid petroleum gas (LPG), smoke, light, and motion. The data spans a timeframe from 07/12/2020 to 07/19/2020, comprising a total of 405,184 records. The sensor readings were transmitted using the MQTT network protocol.
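As a concrete illustration, a single reading could be parsed from an MQTT-style JSON payload roughly as follows. The field names (`ts`, `device`, `co`, `humidity`, `light`, `lpg`, `motion`, `smoke`, `temp`) follow the Kaggle dataset's columns; the exact payload shape and sample values are assumptions for the sketch:

```python
import json

def parse_reading(payload: str) -> dict:
    """Parse one MQTT-style JSON payload into typed values.

    Field names mirror the Kaggle dataset's columns; real sensor
    payloads may differ.
    """
    raw = json.loads(payload)
    return {
        "ts": float(raw["ts"]),              # Unix timestamp, seconds
        "device": raw["device"],             # device identifier
        "co": float(raw["co"]),              # carbon monoxide
        "humidity": float(raw["humidity"]),
        "light": str(raw["light"]).lower() == "true",   # booleans may arrive as strings
        "lpg": float(raw["lpg"]),
        "motion": str(raw["motion"]).lower() == "true",
        "smoke": float(raw["smoke"]),
        "temp": float(raw["temp"]),          # temperature
    }

sample = ('{"ts": "1594512094.4", "device": "b8:27:eb:bf:9d:51", '
          '"co": "0.005", "humidity": "51.0", "light": "false", '
          '"lpg": "0.0076", "motion": "false", "smoke": "0.02", "temp": "22.7"}')
reading = parse_reading(sample)
```

Coercing everything to proper types at the edge of the pipeline keeps the downstream load step simple.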
The primary objective of the data system is to effectively manage and store the environmental sensor data acquired from diverse sensors dispersed across the city. The system's architecture must accommodate substantial data volumes, diverse environmental metrics, and guarantee reliability and scalability. Additionally, the system should be adaptable for migration to larger setups, including cloud-based solutions.
For this project, Apache Cassandra was selected as the database. Apache Cassandra is a distributed NoSQL database renowned for handling extensive time-series data efficiently, offering scalability, fault tolerance, and strong write performance. Its wide-column data model makes schema evolution straightforward: future sensor data with varying structures can be accommodated by adding columns or tables without disruptive migrations.
The selection of Apache Cassandra is well-justified given its ability to handle large distributed deployments, ensuring scalability and reliability. Traditional relational databases such as MySQL or PostgreSQL are a poorer fit here because of the sustained high write load and the difficulty of scaling them horizontally. Furthermore, Cassandra's flexible data model eases schema evolution, a crucial aspect when dealing with diverse and changing sensor data.
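To make the data-model point concrete, here is a hypothetical table design in the style Cassandra encourages: tables built around known query patterns, with the partition key chosen to spread writes and bound partition growth. The keyspace, table, and column names below are illustrative assumptions, not the municipality's actual schema:

```python
# Hypothetical CQL statements, in the style of the project's sql_queries.py.
# Partitioning by (device, day) bounds each partition to one device-day of
# readings, while clustering by ts keeps them ordered by time.

CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS iot
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
"""

CREATE_READINGS_BY_DEVICE = """
CREATE TABLE IF NOT EXISTS iot.readings_by_device (
    device text,
    day date,
    ts timestamp,
    co double,
    humidity double,
    light boolean,
    lpg double,
    motion boolean,
    smoke double,
    temp double,
    PRIMARY KEY ((device, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
"""
```

The `CLUSTERING ORDER BY (ts DESC)` clause makes "latest readings for a device" queries cheap, which suits the alerting use case.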
- Develop scripts for database schema setup and configuration.
- Craft a script to establish a connection with the Cassandra database and load sample data into relevant tables.
- Set up a Docker container housing Apache Cassandra from Docker Hub.
- Create a Dockerfile incorporating all necessary steps for automated setup, including Cassandra installation, script execution, and data loading.
- Establish a GitHub repository for storing code and associated files.
- Upload Dockerfile and pertinent files to the repository, ensuring straightforward container building and execution across different environments.
This project comprises a variety of files and scripts that collectively lay the foundation for the Environmental Sensor Data Management System. These files are essential for various stages, from setup and data population to smooth system operation. Together, they offer a comprehensive solution for efficiently managing the municipality's environmental data.
- create_tables.py: This script is responsible for dropping and recreating the tables. It is intended to reset the tables each time before the ETL script is executed.
- etl.py: This script reads and processes data from the `telemetry_source_data` folder, subsequently loading it into the designated tables.
- sql_queries.py: Within this file, you'll find all the CQL queries necessary for the project's database interactions. It is imported into both of the scripts above to ensure consistency.
- requirements.txt: Listed within this file are all the essential dependencies and libraries required for the project.
- Dockerfile: Contained in this file are the instructions for constructing a Docker image for the project, simplifying the process of containerization and deployment.
- README.md: You are reading it!!
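As a rough sketch of the kind of loop `etl.py` might run (the actual implementation is not disclosed), the following reads every CSV file in `telemetry_source_data`, coerces each field to a typed value, and feeds the rows to a prepared cassandra-driver statement. The column names and the insert signature are assumptions:

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

def iter_rows(source_dir: str = "telemetry_source_data"):
    """Yield one typed tuple per reading from every CSV in the source folder."""
    for path in sorted(Path(source_dir).glob("*.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                ts = datetime.fromtimestamp(float(row["ts"]), tz=timezone.utc)
                yield (row["device"], ts.date(), ts,
                       float(row["co"]), float(row["humidity"]),
                       row["light"].lower() == "true",
                       float(row["lpg"]),
                       row["motion"].lower() == "true",
                       float(row["smoke"]), float(row["temp"]))

def load(session, insert_cql: str, source_dir: str = "telemetry_source_data"):
    """Insert every reading via a prepared statement (cassandra-driver Session)."""
    prepared = session.prepare(insert_cql)  # prepare once, execute many times
    for row in iter_rows(source_dir):
        session.execute(prepared, row)
```

Preparing the insert statement once and reusing it is the idiomatic way to do bulk writes with the Python driver.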
The ETL pipeline extracts data from files in the `telemetry_source_data` directory, then transforms and loads the data into the four tables of the `iot` keyspace. This is handled by the project's Python and CQL files:

- Running `create_tables.py` creates and initializes the tables for the `iot` keyspace.
- `sql_queries.py` contains all CQL queries and is imported into both `create_tables.py` and `etl.py`.
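The reset step that `create_tables.py` performs can be sketched as follows. The table names and the localhost connection are illustrative assumptions, not the project's actual values:

```python
# Sketch of the reset flow: drop every table, then recreate it, so each ETL
# run starts from a clean keyspace. Table names here are illustrative; the
# real statements live in sql_queries.py.

DROP_QUERIES = [f"DROP TABLE IF EXISTS iot.{t}"
                for t in ("readings_by_device", "readings_by_metric")]

def reset_tables(session, drop_queries, create_queries):
    """Drop and recreate every table (session: a cassandra-driver Session)."""
    for q in drop_queries:
        session.execute(q)
    for q in create_queries:
        session.execute(q)

if __name__ == "__main__":
    # Requires a running Cassandra node and the cassandra-driver package.
    from cassandra.cluster import Cluster
    session = Cluster(["127.0.0.1"]).connect()
    reset_tables(session, DROP_QUERIES, [])  # pass the real CREATE statements
```

Keeping the drop/create statements in lists, separate from the loop that executes them, mirrors the split between `sql_queries.py` and `create_tables.py`.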
Take the following steps:

- Clone the repository: `git clone repo-link`
- Build the Docker image: `docker build . -t sensor-cassandra-image`
- Create the Docker container: `docker run -v ./cassandra_data:/usr/src/app -it --name sensor-cassandra-container sensor-cassandra-image:latest /bin/bash`
- You will be taken to the terminal of the Docker container; from there you can run the pipeline.
- In the container terminal, run `python create_tables.py` to reset the tables in the `iot` keyspace.
- In the container terminal, run `python etl.py` to process all the datasets.