Data Engineering Zoomcamp 2024 Week 5

Batch processing using Apache Spark

Lesson materials

Lessons Learned


1. Introduction to Batch Processing

1. Batch vs Streaming

2. Types of Batch Processing

3. Orchestrating batch jobs

4. Advantages and disadvantages of Batch jobs

2. Spark Introduction

  • What is Apache Spark?
    • Data processing engine for batch and streaming jobs
    • Supports multiple languages
      • Java
      • Scala
      • Python
      • R
  • When to use Spark?
    • When the workload cannot be handled in SQL alone (e.g. machine learning)

4. First look at Spark/PySpark

5. Spark DataFrames

6. Spark SQL

7. Joins in Spark

8. RDDs

9. Spark Internals

10. Spark and Docker

11. Running Spark in the Cloud

12. Connecting Spark to a DWH

Homework

In this homework we'll put what we learned about Spark in practice.

For this homework we will use the FHV 2019-10 data (FHV Data).

Question 1:

Install Spark and PySpark

  • Install Spark
  • Run PySpark
  • Create a local spark session
  • Execute spark.version.

What's the output? 3.4.1
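
A minimal sketch of the steps above, assuming PySpark is already installed. The helper name `create_local_session` and the application name are illustrative choices, not part of the homework.

```python
def create_local_session(app_name: str = "zoomcamp-week5"):
    """Build (or reuse) a SparkSession that runs locally on all cores."""
    # Imported inside the function so this module still loads where
    # PySpark is missing; the import only fails when the helper is called.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")  # local mode, one worker thread per core
        .appName(app_name)
        .getOrCreate()
    )
```

Calling `create_local_session().version` returns the version string of the installed Spark, which is the value Question 1 asks for.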

Note

To install PySpark follow this guide

Question 2:

FHV October 2019

Read the October 2019 FHV data into a Spark DataFrame with a schema as we did in the lessons.

Repartition the DataFrame to 6 partitions and save it as Parquet.

What is the average size of the Parquet (ending with .parquet extension) Files that were created (in MB)? Select the answer which most closely matches.

  • 1MB
  • -> 6MB
  • 25MB
  • 87MB
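
One way to approach this question is sketched below. The column names and types in `FHV_SCHEMA` are an assumption about the FHV CSV header and should be checked against the actual file; the function names and paths are hypothetical. The schema is passed as a DDL string, which `DataFrameReader.schema` accepts in place of a `StructType`.

```python
from pathlib import Path

# Assumed column layout of the FHV CSV; verify it against the file header.
FHV_SCHEMA = (
    "dispatching_base_num STRING, pickup_datetime TIMESTAMP, "
    "dropOff_datetime TIMESTAMP, PUlocationID INT, DOlocationID INT, "
    "SR_Flag STRING, Affiliated_base_number STRING"
)


def write_repartitioned(spark, csv_path: str, out_dir: str, partitions: int = 6) -> None:
    """Read the CSV with an explicit schema and write it as N Parquet files."""
    df = spark.read.option("header", "true").schema(FHV_SCHEMA).csv(csv_path)
    df.repartition(partitions).write.mode("overwrite").parquet(out_dir)


def average_parquet_size_mb(out_dir: str) -> float:
    """Average size of the *.parquet files in out_dir, taking 1 MB = 1024 * 1024 bytes."""
    sizes = [p.stat().st_size for p in Path(out_dir).glob("*.parquet")]
    if not sizes:
        raise ValueError(f"no .parquet files found in {out_dir}")
    return sum(sizes) / len(sizes) / (1024 * 1024)
```

Only files ending in `.parquet` are counted, so Spark's `_SUCCESS` marker and `.crc` checksum files do not skew the average.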

Question 3:

Count records

How many taxi trips were there on the 15th of October?

Consider only trips that started on the 15th of October.

  • 108,164
  • 12,856
  • 452,470
  • -> 62,610

Important

Be aware of the column order when defining the schema.
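
The count can be sketched as a filter on the pickup date. This assumes the DataFrame has a timestamp column named `pickup_datetime`; the helper name is hypothetical. `DataFrame.filter` accepts a SQL expression string, so no extra imports from PySpark are needed here.

```python
from datetime import date


def count_trips_started_on(df, day: str) -> int:
    """Count rows whose pickup timestamp falls on the given ISO date."""
    # Raises ValueError unless `day` parses as a valid ISO date,
    # which also keeps arbitrary text out of the SQL expression.
    date.fromisoformat(day)
    return df.filter(f"to_date(pickup_datetime) = DATE '{day}'").count()
```

For this question the call would be `count_trips_started_on(df, "2019-10-15")`. Filtering on the pickup column only matches the requirement to consider trips that started on that day.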

Question 4:

Longest trip for each day

What is the length of the longest trip in the dataset in hours?

  • -> 631,152.50 Hours
  • 243.44 Hours
  • 7.68 Hours
  • 3.32 Hours
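
A possible way to compute the trip length, assuming timestamp columns named `pickup_datetime` and `dropOff_datetime` (the helper names are hypothetical). The difference of the two Unix timestamps is in seconds, so dividing by 3600 gives hours. The second helper is plain arithmetic for sanity-checking the result.

```python
# Trip duration in hours as a Spark SQL expression (column names are assumed).
HOURS_EXPR = "(unix_timestamp(dropOff_datetime) - unix_timestamp(pickup_datetime)) / 3600"


def longest_trip_hours(df) -> float:
    """Return the maximum trip duration in hours over the whole DataFrame."""
    row = df.selectExpr(f"max({HOURS_EXPR}) AS longest_hours").first()
    return row["longest_hours"]


def hours_to_years(hours: float) -> float:
    """Convert hours to years, using 365.25 days per year."""
    return hours / (24 * 365.25)
```

`hours_to_years(631152.5)` comes out at roughly 72 years, which suggests the maximum is driven by a record with an implausible drop-off timestamp rather than by a real trip.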

Question 5:

User Interface

Spark’s user interface, which shows the application's dashboard, runs on which local port?

  • 80
  • 443
  • -> 4040
  • 8080
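
The address can also be read from a running session instead of being memorized. A small sketch, with a hypothetical helper name:

```python
def ui_address(spark):
    """Return the URL of the web UI started by this session's SparkContext."""
    return spark.sparkContext.uiWebUrl
```

The port is controlled by the `spark.ui.port` setting. If the default port is already taken, for example by another Spark application on the same machine, Spark tries the next port up, so the reported URL may end in 4041 or higher.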

Question 6:

Least frequent pickup location zone

Load the zone lookup data (Zone Data) into a temp view in Spark.

Using the zone lookup data and the FHV October 2019 data, what is the name of the LEAST frequent pickup location Zone?

  • East Chelsea
  • -> Jamaica Bay
  • Union Sq
  • Crown Heights North
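
The join and aggregation can be sketched with Spark SQL over two temp views. The view names `fhv` and `zones` are illustrative, and the column names `PUlocationID`, `LocationID`, and `Zone` are assumptions about the two files that should be checked against their headers.

```python
# Count pickups per zone and keep the zone with the fewest.
LEAST_FREQUENT_ZONE_SQL = """
SELECT z.Zone AS pickup_zone, COUNT(*) AS trips
FROM fhv AS f
JOIN zones AS z ON f.PUlocationID = z.LocationID
GROUP BY z.Zone
ORDER BY trips ASC
LIMIT 1
"""


def least_frequent_pickup_zone(spark, fhv_df, zones_df) -> str:
    """Register both DataFrames as temp views and run the query above."""
    fhv_df.createOrReplaceTempView("fhv")
    zones_df.createOrReplaceTempView("zones")
    return spark.sql(LEAST_FREQUENT_ZONE_SQL).first()["pickup_zone"]
```

Because this is an inner join, zones with no pickups at all do not appear in the result; the query returns the least frequent zone among those that occur at least once in the trip data.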

Submitting the solutions
