We will be exploring Polars, DuckDB and RAPIDS. These tools represent something of a shift from horizontal scaling to vertical scaling, i.e. relying on improvements in the underlying compute of a single machine and making better use of that compute.
Note: remember you need to be on the Saxanet wifi if you are on campus.
In this lab we will analyze the famous NYC TLC dataset using Python Polars.
- We will start by doing some analysis using Pandas and then compare the time taken to do the same analysis using Polars.
- We will follow it up with an exercise to find anomalies in the dataset using Polars.
In this lab we will analyze the famous NYC TLC dataset using DuckDB.
- We will start by doing some analysis using Pandas and then compare the time taken to do the same analysis using DuckDB.
- We will follow it up with an exercise to find the most frequent pickup and dropoff locations using DuckDB.
Watch the video for setting up an Amazon SageMaker Notebook (link) and create a SageMaker Notebook with an ml.g4dn.xlarge instance type and 100 GB of disk space. Open Jupyter Lab, open a new terminal, and create a new conda environment using the following command:
conda create --solver=libmamba -n rapids-24.02 -c rapidsai -c conda-forge -c nvidia rapids=24.02 python=3.9 cuda-version=12.0 dask-sql jupyterlab
Creating this conda environment usually takes 15 to 20 minutes. Now run the RAPIDS notebook. This activity is optional and is meant for interested students to do on their own time; there is nothing to submit for it.
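Once the environment is ready, a quick sanity check along these lines may help confirm that the GPU DataFrame library loads. This is a sketch under assumptions, not part of the RAPIDS notebook: the sample columns are made up, and the code falls back to pandas when `cudf` is not installed so that it still runs on a machine without RAPIDS.

```python
# cuDF is the GPU DataFrame library in RAPIDS; its API mirrors pandas for
# simple operations like this one. Fall back to pandas if it is missing.
try:
    import cudf as xdf
    backend = "cudf"
except ImportError:
    import pandas as xdf
    backend = "pandas"

# Hypothetical sample data, not taken from the lab dataset.
df = xdf.DataFrame(
    {
        "passenger_count": [1, 1, 2, 2, 2],
        "fare_amount": [10.0, 14.0, 20.0, 22.0, 24.0],
    }
)

# Average fare per passenger count, computed on the GPU when cuDF is used.
avg_fare = df.groupby("passenger_count")["fare_amount"].mean().sort_index()

# Copy the result to host memory (cuDF only) and convert to plain Python.
host_series = avg_fare.to_pandas() if backend == "cudf" else avg_fare
result = {int(k): float(v) for k, v in host_series.items()}
print(backend, result)
```

If the printed backend is `pandas` inside the `rapids-24.02` environment, the notebook kernel is probably not using that environment.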
Make sure you commit and push your repository to GitHub!
The files to be committed and pushed to the repository for this lab are:
polars.ipynb
duckdb.ipynb
polars_12month_count.csv
polars_12month_daily_averages.csv
polars_12months_fareamount_anomalies.csv
duckdb_nyc_frequent_pickup_dropoff_pairs.csv
- Make a final commit with the message "final-submission" and push it to your repo. This is critical so that the instructional team can evaluate your work. Do not change your GitHub repo after pushing the "final-submission" commit.