
fetch-data-engineer-intern's Introduction

Fetch Data Engineering Take Home: ETL off a SQS Queue

pii-masking script

The pii-masking script reads user login records from an SQS queue, encrypts the personal data (PII) they contain, and persists the masked records in a Postgres database.

The encryption/masking is reversible: the data can be decrypted with the key, and duplicate values can still be identified in their masked form.

Requirements and Installations

  1. docker: Installation Steps
  2. python: Installation Steps
  3. localstack: pip install localstack
  4. aws-local: pip install awscli
  5. pandas: pip install pandas
  6. psycopg2: pip install psycopg2-binary
  7. cryptography: pip install cryptography
  8. Homebrew: /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  9. Postgres: brew install postgresql@14
  10. boto3: pip install boto3

Usage

  1. With the directory containing pii-masking.py and docker-compose.yml as the working directory in the terminal, run: docker-compose up -d
  2. Run the Python script: python pii-masking.py
  3. To bring down docker, run: docker-compose down --remove-orphans
  4. To clean docker build files, run: docker system prune -f

Next Steps

  1. The SQS queue, server, and postgres configs can be moved to a dedicated config file away from the core logic.
  2. For encryption, the script currently initializes a new private key every time it runs. Instead, the key must be generated only once and stored somewhere safe for later decryption, and the same key must be used on every run.
  3. DB insert, select, and truncate queries should ideally sit in dedicated SQL files outside of the core logic.
  4. Due to lack of time, the application reads the queue completely, then encrypts the data using a private key, and finally persists the masked data in the db. With more time, I would use an in-memory buffer as an interface between the SQS queue consumer and the Postgres db writer. The consumer and the db writer should run on dedicated threads or processes to isolate/containerize them. The current implementation (read everything, then process, then write) is not ideal for queue-based or event-driven systems. A fault in either the queue consumer or the db writer can break the whole application, but the damage can be contained if the consumer and writer are isolated.
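The buffered design described in item 4 can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names (consume, write, run_pipeline), the buffer_size parameter, and the in-memory list standing in for the Postgres table are all hypothetical, and a plain Python iterable stands in for the SQS queue.

```python
import queue
import threading

_STOP = object()  # sentinel telling the writer that no more records will arrive


def consume(messages, buffer):
    """Stand-in for the SQS consumer: push each message into the buffer."""
    try:
        for message in messages:
            buffer.put(message)  # blocks while the buffer is full (backpressure)
    finally:
        buffer.put(_STOP)  # always signal completion, even if reading failed


def write(buffer, mask, sink):
    """Stand-in for the db writer: mask each record and append it to the sink."""
    while True:
        record = buffer.get()
        if record is _STOP:
            break
        sink.append(mask(record))


def run_pipeline(messages, mask, buffer_size=100):
    """Run the consumer and the writer on separate threads, joined by a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_size)
    sink = []  # hypothetical replacement for the Postgres table
    consumer = threading.Thread(target=consume, args=(messages, buffer))
    writer = threading.Thread(target=write, args=(buffer, mask, sink))
    consumer.start()
    writer.start()
    consumer.join()
    writer.join()
    return sink
```

The bounded queue keeps memory use flat because the consumer pauses whenever the writer falls behind. The sketch leaves out error handling on the writer side: if mask raised an exception, the writer thread would stop and a full buffer could leave the consumer blocked, so a real version would need to catch and report failures in both threads.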

Please read Data-Engineer-Intern - Questions.txt for more detailed next steps towards deploying this application to production.
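As a small illustration of the key-handling step from the Next Steps list (generate the key once, then reuse it), the sketch below creates a key file on first use and reads it back on later runs. It rests on several assumptions: the function name load_or_create_key and the file path are hypothetical, and the key format (32 random bytes, URL-safe base64 encoded) is the one Fernet keys in the cryptography package use, which may or may not be the cipher the script relies on.

```python
import base64
import os
from pathlib import Path


def load_or_create_key(path):
    """Return the key stored at `path`, creating it on first use."""
    key_file = Path(path)
    if key_file.exists():
        return key_file.read_bytes()
    # Assumption: a Fernet-style key, i.e. 32 random bytes, URL-safe base64 encoded.
    key = base64.urlsafe_b64encode(os.urandom(32))
    # Create the file with owner-only permissions; O_EXCL fails if it already exists.
    fd = os.open(key_file, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(key)
    return key
```

Calling load_or_create_key("pii.key") on every run would then return the same key, so values masked in an earlier run stay decryptable. A local file is only a stand-in here; in production the key would more likely live in a secrets manager.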

fetch-data-engineer-intern's People

Contributors

hafeezali
