
testrepo's Introduction

WEEK 3 - WORKSHOP: Setting up Projects: Data Science & Data Engineering Perspectives

In preparation for this week:

  • read/listen/watch as much as you can from the annotated materials below
  • watch pre-recorded lectures on Canvas

Data Science work in general must fulfill three aims to be useful for collaboration:

  • replicability
  • portability
  • scalability

This week we'll discuss some principles - and tricks - that take your projects to that level. To prepare for the discussion, you may want to skim through some of the links below to get a sense of how people have implemented DS projects and the thought they have put into their proposed structures.

Creating DS project structures

People have put thought into standardizing project creation practices in Data Science, to the point of automating the creation of projects in ways that are generic and flexible enough to encompass multiple use cases. Two very useful ones are Cookiecutter for Data Science, tailored for Python, and ProjectTemplate, developed for R.

While you may or may not want to use them to create your projects, they are certainly a rich source of food for thought on what needs to be considered when creating a Data Science project.

As Cookiecutter would put it: "be liberal in changing the folders around for your project, but be conservative in changing the default structure for all projects."
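To make the idea of an automated project structure concrete, here is a minimal sketch in plain Python that creates a standard folder skeleton. It is not how Cookiecutter or ProjectTemplate work internally; the folder names in DEFAULT_LAYOUT and the function create_project are hypothetical, only loosely inspired by the kind of layout such templates propose.

```python
from pathlib import Path
import tempfile

# Hypothetical layout, loosely inspired by data science project templates;
# adjust the folder names to suit your own project.
DEFAULT_LAYOUT = (
    "data/raw",
    "data/processed",
    "notebooks",
    "src",
    "models",
    "reports/figures",
)


def create_project(root, layout=DEFAULT_LAYOUT):
    """Create the project skeleton under `root` and return the created folders."""
    root = Path(root)
    created = []
    for relative in layout:
        folder = root / relative
        # exist_ok=True makes the function safe to run more than once.
        folder.mkdir(parents=True, exist_ok=True)
        # A .gitkeep file lets Git track otherwise-empty folders.
        (folder / ".gitkeep").touch()
        created.append(folder)
    # Only write a README if the project does not already have one.
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(f"# {root.name}\n", encoding="utf-8")
    return created


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_project(Path(tmp) / "my_ds_project"):
            print(folder.relative_to(tmp))
```

Keeping the default layout in one shared constant while letting each call pass its own layout mirrors the advice quoted above: individual projects can move folders around freely, while the default for all projects changes rarely.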

Cloud computing for our live workshops

We will be using "the Cloud" for the majority of our collaborative work in this course, including our live in-class workshops. It is important to understand the basics of what Cloud Computing means. This Medium Article talks about the traditional IT Infrastructure, types of cloud services and benefits of the cloud.

An Introduction to Apache Spark

What is Apache Spark? - Many organizations have adopted Apache Spark for large-scale data processing and machine learning models. Apache Spark has been proven to provide speed, efficiency and reliability in many use cases. In this course we will be using Databricks, which is a managed Apache Spark platform that makes managing Apache Spark easy. In addition, Databricks provides a very collaborative platform for both Data Scientists and Data Engineers to build data pipelines and productionize machine learning models very easily. Skim as much as you can from the materials below (it will pay off later in the class):

Current Best Practices in Data Science

We will cover many current best practices in the field throughout the course. But there is one which still stirs passions: Agile. Some swear by it, while others swear it's not a good idea for Data Science work. (Not surprisingly, one of us is a big believer in the Agile way, but the other is a big believer in only implementing Agile-like principles when it makes sense.) The jury is still out on Agile, but precisely for that reason you should understand what it is and what makes it attractive in the field. A few recommended posts will help you get there.

Start by reading The Psychology of Agile, which summarizes the benefits that people perceive from the Agile way. Read also Stop Brainstorming and Start Sprinting, about what a sprint is, how it works, and why it's useful, but don't forget to read a complementary perspective in From Agile to Fragile: How to unravel your team in one sprint, on how and where sprints can go wrong. An incredibly helpful reading is Don’t Make Data Scientists Do Scrum, which makes a seasoned and compelling argument explaining why Scrum may not always be the path to follow for Data Science. Wrap up with MVP Paradox And Here’s How To Fix Your MVP Before Its Too Late! on how to think about minimum viable products (MVPs).
