
Open Street Works Data Pipeline

Progress: █████████████████████████████████████████████████░░░░░ (90%)

Street Manager Monthly Permit Pipeline example: (diagram: permit-pipeline-data-flow)

Important

Word on the Street, built with Evidence, can now be found here

Currently working on

  1. Integrating OS' Open USRN data (complete and now being tested)

  2. Integrating Geoplace's SWA Code List.

  3. Speeding up SRWR processing and improving its integration into the project.

Quickstart Guide:

This repository contains an efficient data pipeline for processing:

  1. DfT's Street Manager archived permit data.

  2. Scottish Road Works Register (SRWR) archived permit data.

  3. OS' Open USRN data

The goal is to create a flexible, easily deployable pipeline that produces insights from street works data.

Open Street Works Data Pipeline in 3 points:

Note

The aim of this project is simple...

Reduce the time it takes to deliver value from open street works data.

  1. It's fast
  • Process an entire month of Street Manager archived permit data ready for analysis in 5 minutes.
  • Process an entire year of Street Manager archived permit data ready for analysis in around 1 hour.
  • Process all Street Manager archived permit data from 2020 to 2024 in the morning and be ready to analyse the data in the afternoon.
  • The pipeline uses batch processing, so there's no need to download, unzip, and save files to disk - everything is kept in memory (see the sketch after this list).
  2. It's not fussy
  • Run it where you want!
  • Run it locally with Docker or a Python venv if you want.
  • Run it on a Google Cloud Function/AWS Lambda Function (with some caveats).
  • Run it as a Fargate Task on AWS or on a Google Compute Engine instance.
  3. It's flexible
  • The project is modular, so you can customise it to fit your own needs.
  • Don't want to use AWS Secrets Manager for environment variables? Fine! Use another provider or a simple .env file (recommended for local dev only).
  • Don't want to use MotherDuck as your data warehouse? Fine! Add in a function so the end destination is Google BigQuery instead.
  • Only want to focus on Street Manager data? Fine! Launch the entry point that doesn't process SRWR data.
  • You can integrate other tools from the Modern Data Stack, such as DLT, DBT, or orchestrators like Airflow and Mage, if you want more functionality.
  • You can run several instances of the project for different purposes.
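
A minimal sketch of that in-memory approach, assuming the monthly archive is a ZIP of JSON permit records; the URL, batch size, and load_batch destination are illustrative placeholders rather than the project's actual configuration:

# Minimal sketch: stream a ZIP archive into memory, iterate its entries,
# and batch records without writing anything to disk.
# The URL, batch size, and load_batch() are placeholders, not the project's real config.
import io
import json
import zipfile

import requests

ARCHIVE_URL = "https://example.com/street-manager/2024-01.zip"  # placeholder URL
BATCH_SIZE = 10_000

def load_batch(batch: list[dict]) -> None:
    # Placeholder: in the real pipeline this would insert into the warehouse.
    print(f"Loaded {len(batch)} records")

response = requests.get(ARCHIVE_URL, timeout=300)
response.raise_for_status()

batch: list[dict] = []
with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
    for name in archive.namelist():
        if not name.endswith(".json"):
            continue
        with archive.open(name) as file:
            batch.append(json.load(file))
        if len(batch) >= BATCH_SIZE:
            load_batch(batch)
            batch.clear()

if batch:
    load_batch(batch)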

Why use this Project?

Both DfT's Street Manager and Scotland's SRWR are the authoritative sources of street work permit data for England and Scotland.

They make available large quantities of archived permit data every month and have done so for several years.

This equates to a lot of data, and processing it can be slow and painful if you're not careful.

This project can help you:

  • Maintain a consistent and structured way to develop, test, and deploy street work permit data pipelines.
  • Automate your development and deployment so you can focus on analysis and delivering value from the data.
  • Utilise the power of Cloud Compute to process data faster.
  • Utilise elements of the Modern Data Stack to allow for easy Dashboarding.

Please follow the steps below to clone, set up, and run the project.

Important

This is only meant as a quickstart guide. It's not perfect and I'll try to make it more detailed as time goes on :)

You will need to complete a few extra steps before being able to fully use this pipeline so please read the pre-requisites.

Local Deployment

0. Pre-Requisites (if you want to follow the Street Manager Monthly Permit Pipeline example)

  1. You'll need Python installed locally on your system - I use Python 3.11.

  2. You'll need a MotherDuck account.

  3. You'll need an AWS service account.

  4. You'll need both the AWS CLI and Terraform correctly configured on your local system.

  5. You'll need to ensure that you have the correct environment variables.

    If you follow my setup you'll need...

    • AWS Secrets Manager.
    • A MotherDuck connection token.
    • A MotherDuck database name.

    If you want to use MotherDuck as your end destination then you'll need to make sure that you have env variables set up that point to the correct database and schemas.
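
As a rough illustration, a check like the one below can be run at the start of the pipeline; the variable names here are assumptions for illustration, so check the entry point scripts for the names the project actually uses.

# Illustrative check that the expected environment variables are present.
# The variable names below are assumptions, not necessarily those used by the project.
import os

REQUIRED_VARS = ["MOTHERDUCK_TOKEN", "MOTHERDUCK_DATABASE"]

missing = [var for var in REQUIRED_VARS if not os.getenv(var)]
if missing:
    raise EnvironmentError(f"Missing environment variables: {', '.join(missing)}")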

1. Clone the Repository

git clone https://github.com/CHRISCARLON/Open-Street-Works-Data-Pipeline.git

cd Open-Street-Works-Data-Pipeline

This will create a local copy of the repository on your machine and take you into it.

2. Create a Virtual Environment

python -m venv venv
source venv/bin/activate

3. Install Dependencies

pip install poetry
poetry install

Poetry will read the pyproject.toml file and install the required packages in your newly created virtual environment.

4. Set Up AWS Secrets Manager

Create an AWS account if you don't have one already.

If you're not comfortable using the AWS CLI then navigate to the AWS Secrets Manager console in your browser and create a new secret to store your environment variables.

At runtime, call the below function in the code to retrieve your secrets using the AWS SDK (boto3).

Please note that there are charges when using AWS Secrets Manager - but it's peanuts.

(Screenshot: the project's secret-retrieval function.)
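
For reference, a minimal sketch of what such a retrieval function can look like with boto3; the secret name and region below are placeholders, and the project's own function may differ.

# Minimal sketch of retrieving a secret with boto3.
# The secret name and region are placeholders; adapt them to your own setup.
import json

import boto3

def get_secrets(secret_name: str = "my-pipeline-secrets", region: str = "eu-west-2") -> dict:
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])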

Alternatively, you can use a different method to manage environment variables, such as a .env file or another secrets management tool.

5. Set Up MotherDuck

Sign up for a MotherDuck account at MotherDuck.

Obtain the necessary credentials and connection details for your MotherDuck account - a MotherDuck Token, for example.

You can use the default my_db that is provided with your new MotherDuck account, but I recommend that you create a specific database for your permit data.

Create database schemas for each year - this is where you'll load your data tables for each month of the relevant year.

For example...

(Screenshot: example MotherDuck database with one schema per year.)
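
A minimal sketch of creating that structure from Python with DuckDB, assuming your MotherDuck token is available as the motherduck_token environment variable; the database and schema names are placeholders, so match them to your own env variables.

# Illustrative set-up of a MotherDuck database with one schema per year.
# Database and schema names are placeholders; match them to your env variables.
import duckdb

con = duckdb.connect("md:")  # connects to MotherDuck using the motherduck_token env variable
con.execute("CREATE DATABASE IF NOT EXISTS street_works_permits")
con.execute("USE street_works_permits")
for year in (2020, 2021, 2022, 2023, 2024):
    con.execute(f"CREATE SCHEMA IF NOT EXISTS raw_data_{year}")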

6. Run the Pipeline

Note

For the most recent permit data use monthly_permit_main.py and for bulk historic data use historic_permit_main.py.

For this example use the most recent permit data.

Important

Make sure that you have configured everything before running.

I'd recommend going through the monthly_permit_main.py file and becoming comfortable with it first.

Ensure that your MotherDuck token is accessible at runtime and that you have the correct schemas set up. For local development, you could EXPORT the token via the terminal.

Run the following commands to launch the pipeline and process your data.

python src/monthly_permit_main.py

or if you want to use Docker

docker-compose up 

The docker-compose.yml and the Dockerfile are already set up.

This will execute the pipeline, process the Street Manager permit data, and then load it into MotherDuck for further processing and/or analytics.

7. Analyse the Street Manager permit data and/or perform further transformations

This project leverages the power of DBT to make analysis quick, easy, and scalable.

If you run this in a Python venv then you will need to run run_dbt_jobs.sh separately to run the DBT models. If you don't run the DBT models you will only have the raw data tables in MotherDuck.

There are 10+ DBT models that automate the creation of different aggregated analysis tables - each one is focused on a different type of insight.

To run the DBT shell script, open up your terminal, cd into the DBT folder, and run:

chmod +x run_dbt_jobs.sh

and then

./run_dbt_jobs.sh

If you run this in Docker then the DBT models automatically run.

You can use MotherDuck's UI to run further queries on the data if you'd like to!

MotherDuck connects to many modern BI tools, such as Hex and Preset.

Check out the videos below from MotherDuck for some cool tutorials on how to build dashboards with data in MotherDuck.

  1. https://www.youtube.com/watch?v=F9yHuAO50PQ

  2. https://www.youtube.com/watch?v=gemksL8YvOQ

Cloud Deployment

Note

I'm still writing this up!

It will contain instructions on:

  • Deploying a Docker container to AWS ECR (making use of the handy Makefile!)

  • Using Terraform to configure and deploy an AWS Fargate Task on AWS ECS.

