The analytics department of a municipality in my country asked me to design and implement a data system capable of storing readings from the environmental sensors they have installed throughout the city. The objective is to provide urban planners with insights that can help improve the city's environmental conditions. Additionally, the collected data will be used to build an application that promptly alerts citizens when measurements exceed recommended thresholds. While the specifics of the implementation cannot be disclosed, this overview outlines the fundamental structure of the system.
To initiate the project, I have chosen an open dataset from Kaggle as the sample data source. This dataset contains environmental sensor data, encompassing metrics such as temperature, humidity, and air quality. The structured format of this dataset mirrors the initial data gathered by the municipality's sensors. The dataset originates from custom-built sensor arrays connected to Raspberry Pi devices. Each device recorded multiple readings from various sensors, including temperature, humidity, carbon monoxide (CO), liquid petroleum gas (LPG), smoke, light, and motion. The data spans a timeframe from 07/12/2020 to 07/19/2020, comprising a total of 405,184 records. The sensor readings were transmitted using the MQTT network protocol.
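As a concrete illustration, a single reading could be parsed from an MQTT-style JSON payload roughly as follows. The field names (`ts`, `device`, `co`, `humidity`, `light`, `lpg`, `motion`, `smoke`, `temp`) follow the Kaggle dataset's columns; the exact payload shape and sample values are assumptions for the sketch:

```python
import json

def parse_reading(payload: str) -> dict:
    """Parse one MQTT-style JSON payload into typed values.

    Field names mirror the Kaggle dataset's columns; real sensor
    payloads may differ.
    """
    raw = json.loads(payload)
    return {
        "ts": float(raw["ts"]),              # Unix timestamp, seconds
        "device": raw["device"],             # device identifier
        "co": float(raw["co"]),              # carbon monoxide
        "humidity": float(raw["humidity"]),
        "light": str(raw["light"]).lower() == "true",   # booleans may arrive as strings
        "lpg": float(raw["lpg"]),
        "motion": str(raw["motion"]).lower() == "true",
        "smoke": float(raw["smoke"]),
        "temp": float(raw["temp"]),          # temperature
    }

sample = ('{"ts": "1594512094.4", "device": "b8:27:eb:bf:9d:51", '
          '"co": "0.005", "humidity": "51.0", "light": "false", '
          '"lpg": "0.0076", "motion": "false", "smoke": "0.02", "temp": "22.7"}')
reading = parse_reading(sample)
```

Coercing everything to proper types at the edge of the pipeline keeps the downstream load step simple.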
The primary objective of the data system is to effectively manage and store the environmental sensor data acquired from diverse sensors dispersed across the city. The system's architecture must accommodate substantial data volumes, diverse environmental metrics, and guarantee reliability and scalability. Additionally, the system should be adaptable for migration to larger setups, including cloud-based solutions.
For this project, Apache Cassandra was selected as the database. Apache Cassandra is a distributed NoSQL database renowned for handling extensive time-series data efficiently, offering scalability, fault tolerance, and strong write performance. Its wide-column data model makes schema evolution straightforward: future sensor data with varying structures can be accommodated by adding columns or tables without disruptive migrations.
The selection of Apache Cassandra is well-justified given its ability to handle large distributed deployments, ensuring scalability and reliability. Traditional relational databases such as MySQL or PostgreSQL are a poorer fit here because of the sustained high write load and the difficulty of scaling them horizontally. Furthermore, Cassandra's flexible data model eases schema evolution, a crucial aspect when dealing with diverse and changing sensor data.
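To make the data-model point concrete, here is a hypothetical table design in the style Cassandra encourages: tables built around known query patterns, with the partition key chosen to spread writes and bound partition growth. The keyspace, table, and column names below are illustrative assumptions, not the municipality's actual schema:

```python
# Hypothetical CQL statements, in the style of the project's sql_queries.py.
# Partitioning by (device, day) bounds each partition to one device-day of
# readings, while clustering by ts keeps them ordered by time.

CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS iot
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
"""

CREATE_READINGS_BY_DEVICE = """
CREATE TABLE IF NOT EXISTS iot.readings_by_device (
    device text,
    day date,
    ts timestamp,
    co double,
    humidity double,
    light boolean,
    lpg double,
    motion boolean,
    smoke double,
    temp double,
    PRIMARY KEY ((device, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
"""
```

The `CLUSTERING ORDER BY (ts DESC)` clause makes "latest readings for a device" queries cheap, which suits the alerting use case.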
- Develop scripts for database schema setup and configuration.
- Craft a script to establish a connection with the Cassandra database and load sample data into relevant tables.
- Set up a Docker container housing Apache Cassandra from Docker Hub.
- Create a Dockerfile incorporating all necessary steps for automated setup, including Cassandra installation, script execution, and data loading.
- Establish a GitHub repository for storing code and associated files.
- Upload Dockerfile and pertinent files to the repository, ensuring straightforward container building and execution across different environments.
This project comprises a variety of files and scripts that collectively lay the foundation for the Environmental Sensor Data Management System. These files are essential for various stages, from setup and data population to smooth system operation. Together, they offer a comprehensive solution for efficiently managing the municipality's environmental data.
- create_tables.py: This script is responsible for dropping and recreating the tables. It is intended to reset the tables each time before the ETL script is executed.
- etl.py: This script reads and processes data from the `telemetry_source_data` folder, subsequently loading it into the designated tables.
- sql_queries.py: Within this file, you'll find all the CQL queries necessary for the project's database interactions. It is imported into both of the scripts above to ensure consistency.
- requirements.txt: Listed within this file are all the essential dependencies and libraries required for the project.
- Dockerfile: Contained in this file are the instructions for constructing a Docker image for the project, simplifying the process of containerization and deployment.
- README.md: You are reading it!!
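As a rough sketch of the kind of loop `etl.py` might run (the actual implementation is not disclosed), the following reads every CSV file in `telemetry_source_data`, coerces each field to a typed value, and feeds the rows to a prepared cassandra-driver statement. The column names and the insert signature are assumptions:

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

def iter_rows(source_dir: str = "telemetry_source_data"):
    """Yield one typed tuple per reading from every CSV in the source folder."""
    for path in sorted(Path(source_dir).glob("*.csv")):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                ts = datetime.fromtimestamp(float(row["ts"]), tz=timezone.utc)
                yield (row["device"], ts.date(), ts,
                       float(row["co"]), float(row["humidity"]),
                       row["light"].lower() == "true",
                       float(row["lpg"]),
                       row["motion"].lower() == "true",
                       float(row["smoke"]), float(row["temp"]))

def load(session, insert_cql: str, source_dir: str = "telemetry_source_data"):
    """Insert every reading via a prepared statement (cassandra-driver Session)."""
    prepared = session.prepare(insert_cql)  # prepare once, execute many times
    for row in iter_rows(source_dir):
        session.execute(prepared, row)
```

Preparing the insert statement once and reusing it is the idiomatic way to do bulk writes with the Python driver.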
The ETL pipeline extracts data from files in the `telemetry_source_data` directory, then transforms and loads the data into the four tables of the `iot` keyspace. This is handled by the project's Python and CQL files:

- Running `create_tables.py` creates and initializes the tables for the `iot` keyspace.
- `sql_queries.py` contains all CQL queries and is imported into both `create_tables.py` and `etl.py`.
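The reset step that `create_tables.py` performs can be sketched as follows. The table names and the localhost connection are illustrative assumptions, not the project's actual values:

```python
# Sketch of the reset flow: drop every table, then recreate it, so each ETL
# run starts from a clean keyspace. Table names here are illustrative; the
# real statements live in sql_queries.py.

DROP_QUERIES = [f"DROP TABLE IF EXISTS iot.{t}"
                for t in ("readings_by_device", "readings_by_metric")]

def reset_tables(session, drop_queries, create_queries):
    """Drop and recreate every table (session: a cassandra-driver Session)."""
    for q in drop_queries:
        session.execute(q)
    for q in create_queries:
        session.execute(q)

if __name__ == "__main__":
    # Requires a running Cassandra node and the cassandra-driver package.
    from cassandra.cluster import Cluster
    session = Cluster(["127.0.0.1"]).connect()
    reset_tables(session, DROP_QUERIES, [])  # pass the real CREATE statements
```

Keeping the drop/create statements in lists, separate from the loop that executes them, mirrors the split between `sql_queries.py` and `create_tables.py`.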
Take the following steps:

- Clone the repository: `git clone repo-link`
- Build the Docker image: `docker build . -t sensor-cassandra-image`
- Create the Docker container: `docker run -v ./cassandra_data:/usr/src/app -it --name sensor-cassandra-container sensor-cassandra-image:latest /bin/bash`
- You will be taken to the terminal of the Docker container; from there you can run the pipeline.
- In the container terminal, run `python create_tables.py` to reset the tables in the `iot` keyspace.
- In the container terminal, run `python etl.py` to process all the datasets.