- read/listen/watch as much as you can from the annotated materials below
- watch pre-recorded lectures on Canvas
Data Science work in general must fulfill three aims to be useful for collaboration:
- replicability
- portability
- scalability
This week we'll discuss some principles - and tricks - that take your projects to that level. To prepare for the discussion, you may want to skim through some of the links below to get a sense of how people have implemented DS projects and the thinking behind their proposed structures.
People have put thought into standardizing project creation practices in Data Science, to the point of automating the creation of projects in ways that are generic and flexible enough to encompass multiple use cases. Two very useful ones are Cookiecutter for Data Science, tailored for Python, and ProjectTemplate, developed for R.
While you may or may not want to use them to create your projects, they are certainly a rich source of food for thought on what needs to be considered when creating a Data Science project.
As Cookiecutter would put it: "be liberal in changing the folders around for your project, but be conservative in changing the default structure for all projects."
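To make this concrete, here is an abridged sketch of the kind of default layout the Cookiecutter for Data Science template generates (folder names follow its published convention; your own projects may add or rename folders as needed):

```text
my-ds-project/
├── data/
│   ├── raw/          # original, immutable source data
│   ├── interim/      # intermediate, partially transformed data
│   └── processed/    # final datasets ready for modeling
├── notebooks/        # exploratory Jupyter notebooks
├── src/              # reusable source code (data prep, features, models)
├── models/           # trained model artifacts
├── reports/          # generated analyses and figures
├── README.md
└── requirements.txt  # pinned dependencies for replicability
```

Notice how the structure directly serves the three aims above: pinned dependencies and an immutable `data/raw/` support replicability, a self-contained folder supports portability, and separating exploratory notebooks from reusable `src/` code supports scalability.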
We will be using "the Cloud" for the majority of our collaborative work in this course, including our live in-class workshops. It is important to understand the basics of what Cloud Computing means. This Medium article covers traditional IT infrastructure, the types of cloud services, and the benefits of the cloud.
What is Apache Spark? - Many organizations have adopted Apache Spark for large-scale data processing and machine learning. Apache Spark has proven to provide speed, efficiency, and reliability in many use cases. In this course we will be using Databricks, a managed Apache Spark platform that makes managing Spark easy. In addition, Databricks provides a highly collaborative platform where Data Scientists and Data Engineers can build data pipelines and productionize machine learning models with ease. Skim as much as you can from the materials below (it will pay off later in the class):
- Intro to Apache Spark: a 60-min video providing an easy-to-digest introduction to Spark for all audiences
- "An Architecture for Fast and General Data Processing on Large Clusters": Matei Zaharia's doctoral dissertation, a great theoretical introduction to Spark
- Learning Spark, 2nd edition: a book-length introduction to Spark - from zero to Spark in 12 chapters!
- free self-paced learning courses from Databricks for university students
- Databricks notebook gallery featuring sample notebooks for a large range of use cases
- Apache Spark GitHub Repo
- Research Papers
We will cover many current best practices in the field throughout the course. But there is one that still stirs passions: Agile. Some swear by it, while others swear it's not a good idea for Data Science work. (Not surprisingly, one of us is a big believer in the Agile way, while the other believes in implementing Agile-like principles only when it makes sense.) The jury is still out on Agile, but precisely for that reason you should understand what it is and what makes it attractive in the field. A few recommended posts will help you get there.
Start by reading The Psychology of Agile, which summarizes the benefits people perceive from the Agile way. Read also Stop Brainstorming and Start Sprinting about what a sprint is, how it works, and why it's useful, but don't forget to read a complementary perspective in From Agile to Fragile: How to unravel your team in one sprint on how and where sprints can go wrong. An incredibly helpful reading is Don’t Make Data Scientists Do Scrum, which makes a seasoned and compelling argument for why it may not always be the path to follow for Data Science. Wrap up with MVP Paradox And Here’s How To Fix Your MVP Before Its Too Late! on how to think about minimum viable products (MVPs).