Tagup data science challenge

Getting started

You will need the following programs available:

git
python3
the AWS client
pipenv

Once all are installed and configured, you are ready to begin. First, fork this repository. Then run make jupyter; this will download the dataset, install dependencies, and start a Jupyter server in the current directory. It should also open a browser tab to the Jupyter interface. Open the data science challenge notebook and follow the instructions it contains. Those instructions are repeated here for convenience.
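Before running make jupyter, it can save time to confirm that the prerequisites are actually on your PATH. The following is a minimal sketch, not part of the repository. The executable names are assumptions (in particular, aws for the AWS client and make, which is implied by the make jupyter step), and finding a tool on PATH does not prove it is configured correctly.

```python
import shutil

# Executable names are assumptions: the list above names the programs,
# not the commands they install.
REQUIRED_TOOLS = ["git", "python3", "aws", "pipenv", "make"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```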

The data science challenge

ExampleCo, Inc. has a problem: maintenance on their widgets is expensive. They have contracted with Tagup to help them implement predictive maintenance. They want us to provide a model that will allow them to prioritize maintenance for the units most likely to fail and, in particular, to gain some warning (even just a few hours) before a unit fails.

They collect two kinds of data for each unit. First, they have a remote monitoring system for the motors in each unit, which records motor measurements (rotation speed, voltage, current) as well as readings from two temperature probes (one on the motor and one at the inlet). Unfortunately, this system is antiquated and prone to communication errors, which manifest as nonsense measurements. Second, they have a rule-based alarming system, which can emit either warnings or errors; the system is known to be noisy, but it's the best they've got.
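One possible way to handle the nonsense measurements is to flag readings that sit implausibly far from the bulk of the data. The sketch below uses the modified z-score based on the median absolute deviation (MAD), which is one technique among many and is not prescribed by the challenge. The function name and the threshold of 5.0 are illustrative assumptions.

```python
import statistics


def mask_outliers(values, threshold=5.0):
    """Replace readings far from the median with None.

    Uses the modified z-score: 0.6745 * |x - median| / MAD.
    The threshold of 5.0 is an illustrative assumption, not a tuned value.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        # More than half the readings are identical, so the score is
        # undefined; leave the data untouched rather than guess.
        return list(values)
    return [
        v if 0.6745 * abs(v - median) / mad <= threshold else None
        for v in values
    ]
```

A single global median is crude for a signal that drifts over months or years; applying the same idea over a rolling window may be more appropriate, but that choice depends on what the data actually look like.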

They have given us just over 100 MB of historical remote monitoring data from twenty of their units that failed in the field. The shortest-lived units failed after a few days; the longest-lived units failed after several years. Typical lifetimes are on the order of a year. This data is available in .csv files under data/train in this repository. In addition, they have provided us with operating data from their thirty active units for the past month; this data is available under data/test in this repository.
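As a starting point for exploring the files, the sketch below reads every .csv file in a directory using only the standard library. It assumes each file has a header row and sits directly inside the directory; the file names and column names are not specified here, so values are kept as strings and the function name is hypothetical.

```python
import csv
from pathlib import Path


def load_units(directory):
    """Read every .csv file in `directory` into a list of row dicts.

    Returns {file stem: [row, ...]}. Values stay as strings because the
    column names and types are not specified here; convert them once the
    actual headers are known.
    """
    units = {}
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="") as handle:
            units[path.stem] = list(csv.DictReader(handle))
    return units
```

Calling load_units("data/train") would then give one entry per file. For a dataset of this size, a dataframe library such as pandas is likely to be faster and more memory-efficient than lists of dicts.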

You have two main objectives. First, tell us as much as you can about the process that generated the data. Does it show meaningful clustering? Do the observations appear independent? How accurately can we forecast future observations, and how long a window do we need to make an accurate forecast? Feel free to propose multiple models, but be sure to discuss the ways each is useful and the ways each is not useful. Second, predict which of the thirty active units are most likely to fail. The data from these units is in data/test. Be sure to quantify these predictions, and especially your certainty.
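One simple check on whether observations appear independent is the sample autocorrelation: values near zero at every lag are consistent with independence, while large values suggest that past readings carry information about future ones. The sketch below is a plain-Python illustration with a hypothetical function name; it assumes a series with no missing values.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at a positive integer `lag`."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(values)")
    mean = sum(values) / n
    variance_sum = sum((v - mean) ** 2 for v in values)
    if variance_sum == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    # Sum of products of deviations for pairs that are `lag` steps apart.
    covariance_sum = sum(
        (values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag)
    )
    return covariance_sum / variance_sum
```

Evaluating this over a range of lags also gives a rough sense of how long a history window might be informative for forecasting.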

A few notes to help:

A good place to start is by addressing the noise due to comm errors.
There is a signal in the data that you can identify and exploit to predict failure.
If you can't find the signal in the noise, don't despair! We're much more interested in what you try and how you try it than in how successful you are at helping a fictional company with their fictional problems.
Feel free to use any libraries you like, or even other programming languages. However, your final results should be presented in the challenge notebook.
There are no constraints on the models or algorithms you can bring to bear. Some ideas include: unsupervised clustering algorithms such as k-means; hidden Markov models; forecasting models like ARMA; neural networks; survival models built using features extracted from the data; etc.
Don't feel compelled to use all the data if you're not sure how. Feel free to focus on data from a single unit if that makes it easier to get started.
Be sure to clearly articulate what you did, why you did it, and how the results should be interpreted. In particular, be aware of the limitations of whichever approach or approaches you take.
Don't hesitate to reach out with any questions.
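Several of the ideas in the notes above, such as clustering or survival models built on extracted features, need a fixed-length summary of each sensor stream. The sketch below shows one possible approach: summarising consecutive, non-overlapping windows by their mean and standard deviation. The function name, the choice of statistics, and the convention that missing readings are represented as None are all assumptions made for illustration.

```python
import statistics


def window_features(values, size):
    """Summarise consecutive, non-overlapping windows of `size` readings.

    Readings equal to None are treated as missing and skipped. Windows with
    no valid readings are omitted, as is a trailing partial window.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    features = []
    for start in range(0, len(values) - size + 1, size):
        window = [v for v in values[start:start + size] if v is not None]
        if not window:
            continue
        features.append({
            "start": start,                       # index of the first reading
            "mean": statistics.fmean(window),
            "std": statistics.pstdev(window),     # population standard deviation
            "count": len(window),                 # valid readings in the window
        })
    return features
```

The window size is a modelling decision: short windows react quickly but are noisy, while long windows are smoother but may hide the few hours of warning the problem statement asks for.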
