Data Engineering Zoomcamp

Syllabus

Note: This is preliminary and may change

Week 1: Introduction & Prerequisites

Duration: 1h

Week 2: Data ingestion + data lake + exploration

  • Data ingestion: a two-step process
    • Download and unpack the data
    • Save the data to GCS
  • Data Lake (20 min)
    • What is a data lake?
    • Convert the raw data to Parquet and partition it
    • Alternatives to GCS (S3/HDFS)
  • Exploration (20 min)
    • Taking a look at the data
    • Data Fusion => Glue crawler equivalent
    • Partitioning
    • Google Data Studio -> Dashboard
  • Terraform code for that

Duration: 1h
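
The two-step ingestion listed for Week 2 can be sketched as below. This is only an illustration under stated assumptions: `fetch` and `upload` are hypothetical callables standing in for the real download and the real GCS client, and the `year=/month=` object layout is one common partitioning convention, not necessarily the one the course uses.

```python
import gzip
from datetime import date
from typing import Callable


def unpack(raw: bytes) -> bytes:
    """Step 1: decompress a gzip archive held in memory."""
    return gzip.decompress(raw)


def object_key(prefix: str, day: date, filename: str) -> str:
    """Build a date-partitioned object name (assumed layout)."""
    return f"{prefix}/year={day.year}/month={day.month:02d}/{filename}"


def ingest(
    fetch: Callable[[], bytes],          # hypothetical: downloads the archive
    upload: Callable[[str, bytes], None],  # hypothetical: writes to the bucket
    prefix: str,
    day: date,
    filename: str,
) -> str:
    """Download and unpack the data, then save it under a partitioned key."""
    data = unpack(fetch())
    key = object_key(prefix, day, filename)
    upload(key, data)  # Step 2: save the data to object storage
    return key


# Usage with in-memory stand-ins for the network and the bucket.
bucket: dict[str, bytes] = {}
saved = ingest(
    fetch=lambda: gzip.compress(b"ride_id,zone\n1,A\n"),
    upload=bucket.__setitem__,
    prefix="raw",
    day=date(2021, 1, 15),
    filename="rides.csv",
)
print(saved)
```

Passing the download and upload steps in as callables keeps the sketch runnable without cloud credentials; in a real pipeline they would be replaced by an HTTP client and a storage client.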

Week 3 & 4: Batch processing (BigQuery, Spark and Airflow)

  • Data warehouse (BigQuery) (25 minutes)
    • What is a data warehouse solution
    • What is BigQuery and why is it so fast (5 min)
    • Partitioning and clustering (10 min)
    • Pointing to a location in Google Cloud Storage (5 min)
    • Loading data into BigQuery (5 min)
    • Alternatives (Snowflake/Redshift)
  • Distributed processing (Spark) (40 + ? minutes)
    • What is Spark, Spark cluster (5 mins)
    • Explaining the potential of Spark (10 mins)
    • What are broadcast variables, partitioning, shuffle (10 mins)
    • Pre-joining data (10 mins)
    • Use case?
    • What else is out there (Flink) (5 mins)
  • Orchestration tool (Airflow) (30 minutes)
    • Basics: Airflow DAGs (10 mins)
    • BigQuery on Airflow (10 mins)
    • Spark on Airflow (10 mins)
  • Terraform code for that

Duration: 2h
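
The partitioning and shuffle topics above can be illustrated with a small pure-Python sketch. It is not Spark code and does not reproduce Spark's actual partitioner: `partition_for`, `shuffle` and `sum_by_key` are hypothetical helper names, and CRC32 is used only because it gives a hash that is stable across runs. The point it shows is that a shuffle routes every record with the same key to the same partition, so a per-key aggregate can then be computed partition by partition.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(records, num_partitions: int) -> dict:
    """Group (key, value) records by target partition."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


def sum_by_key(partitions: dict) -> dict:
    """Aggregate each partition independently; keys never span partitions."""
    totals = {}
    for rows in partitions.values():
        for key, value in rows:
            totals[key] = totals.get(key, 0) + value
    return totals


# Usage: fares keyed by pickup zone (made-up illustrative values).
fares = [("zone_a", 10), ("zone_b", 7), ("zone_a", 5), ("zone_c", 3)]
print(sum_by_key(shuffle(fares, num_partitions=3)))
```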

Week 5: Analytics engineering

  • Basics (15 mins)
    • What is dbt?
    • ETL vs ELT
    • Data modeling
    • Where dbt fits in the tech stack
  • Usage (combination of coding + theory) (1h 30 min to 1h 45 min)
    • Anatomy of a dbt model: written code vs compiled Sources
    • Materialisations: table, view, incremental, ephemeral
    • Seeds
    • Sources and ref
    • Jinja and Macros
    • Tests
    • Documentation
    • Packages
    • Deployment: local development vs production
    • dbt Cloud: scheduler, sources and data catalog (Airflow)
  • Extra knowledge:
    • dbt CLI (local)

Duration: 1.5-2h

Week 6: Streaming

  • Basics
    • What is Kafka
    • Internals of Kafka, broker
    • Partitioning of a Kafka topic
    • Replication of a Kafka topic
  • Consumer-producer
  • Streaming
    • Kafka Streams
    • Spark Streaming: transformations
  • Kafka Connect
  • ksqlDB?
  • Streaming analytics?
  • (pretend rides are coming in a stream)
  • Alternatives (PubSub/Pulsar)

Duration: 1-1.5h
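
The idea of pretending that rides arrive as a stream can be sketched without Kafka at all. The example below is a minimal illustration under assumptions: the `Ride` fields, `ride_stream` and `tumbling_counts` are hypothetical names, and a Python generator stands in for a topic consumer. It replays stored rides in event-time order and counts pickups per zone in fixed (tumbling) windows.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple, Tuple


class Ride(NamedTuple):
    pickup_minute: int  # minutes since the start of the replay
    zone: str


def ride_stream(rides: Iterable[Ride]) -> Iterator[Ride]:
    """Replay stored rides one at a time, ordered by pickup time."""
    yield from sorted(rides, key=lambda r: r.pickup_minute)


def tumbling_counts(
    stream: Iterable[Ride], window_minutes: int
) -> Iterator[Tuple[int, Counter]]:
    """Yield (window_start, pickups per zone) for each non-empty window."""
    current_start = None
    counts: Counter = Counter()
    for ride in stream:
        start = ride.pickup_minute - ride.pickup_minute % window_minutes
        if current_start is not None and start != current_start:
            yield current_start, counts  # window closed, emit its result
            counts = Counter()
        current_start = start
        counts[ride.zone] += 1
    if current_start is not None:
        yield current_start, counts  # flush the last open window


# Usage with made-up rides.
history = [Ride(12, "A"), Ride(1, "A"), Ride(4, "A"), Ride(3, "B")]
for window_start, per_zone in tumbling_counts(ride_stream(history), 10):
    print(window_start, dict(per_zone))
```

The sketch assumes events arrive in order; a real streaming system also has to deal with late and out-of-order events.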

Upcoming buzzwords

  • Delta Lake/Lakehouse
    • Databricks
    • Apache Iceberg
    • Apache Hudi
  • Data mesh

Duration: 10 mins

Week 7, 8 & 9: Project

  • Putting everything we learned to practice

Duration: 2-3 weeks

Architecture diagram

Instructors

FAQ

  • Q: I registered, but haven't received a confirmation email. Is that normal? A: Yes, it's normal. The process is not automated, but you will receive an email eventually.
  • Q: At what time of the day will it happen? A: Most likely on Mondays at 17:00 CET. But everything will be recorded, so you can watch it whenever it's convenient for you.
  • Q: Will there be a certificate? A: Yes, if you complete the project.
  • Q: I'm not 100% sure I'll be able to attend. Can I still sign up? A: Yes, please do! You'll receive all the updates and then you can watch the course at your own pace.
  • Q: Do you plan to run an ML engineering course as well? A: Glad you asked. We do :)
