This course follows the Machine Learning and the High Dimensional & Deep Learning courses. In those courses, you acquired knowledge of machine learning and deep learning algorithms and their application to various types of data. This knowledge is essential to becoming a Data Scientist.
As a Data Scientist, you will also need to know the tools that will allow you to run these algorithms efficiently. The two main goals of this course are:
- Discover these different tools:
  - Distributed computation with Spark.
  - Cloud computing with Google Cloud.
  - Containerization with Docker.
- Use these tools in various learning domains, with real datasets and useful learning libraries:
  - Natural language processing with NLTK, Scikit-Learn and Gensim.
  - Recommendation systems.
  - Reinforcement learning with OpenAI Gym.
You will follow an introduction to each of these technologies.
This course builds on the following earlier courses:
- R Tutorial
- Python Tutorial
- Elementary statistic tools
- Data Exploration and Clustering.
- Machine Learning
- High Dimensional & Deep Learning
The course is divided into 5 sessions.
- Session 1 - 04-11-19 Spark
  - Python complement.
  - Introduction to Spark via the PySpark API.
- Session 2 - 25-11-19 Cloud computing and containerization
  - Configure and start a Google Cloud instance.
  - Build a Docker image and run a container.
- Session 3 - 02-12-19 NLP (Natural Language Processing)
  - Cdiscount dataset: classification of product descriptions.
  - Text cleaning, vectorization, word embeddings, supervised classification, RNN.
- Session 4 - 16-12-19 Reinforcement Learning
  - Use OpenAI Gym environments.
  - Policy gradient, Q-Learning.
- Session 5 - Recommendation System
  - MovieLens dataset.
You will be evaluated on your ability to act like a Data Scientist, i.e. to:
- Choose an algorithm you haven't seen during the course and understand it.
- Run it on a dataset to evaluate its performance.
- Run it with the appropriate tools (Spark? Cloud? GPU?).
- Share it and make your results easily reproducible (Git, Docker? Conda environment?).
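The reproducibility criterion can be illustrated with a minimal Python sketch. It is only one possible approach, not a course requirement: `run_experiment` and `describe_environment` are hypothetical names, and the "score" is a placeholder for a real training run. The idea is to fix the random seed and to record the software environment next to the result.

```python
import json
import platform
import random
import sys


def run_experiment(seed: int) -> dict:
    """Hypothetical stand-in for a real training run; returns a placeholder score."""
    # A local generator makes the result depend only on the seed,
    # not on any other code that touches the global random state.
    rng = random.Random(seed)
    score = sum(rng.random() for _ in range(100)) / 100
    return {"seed": seed, "score": score}


def describe_environment() -> dict:
    """Record the interpreter and platform used, to store alongside the results."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }


if __name__ == "__main__":
    report = {
        "environment": describe_environment(),
        "result": run_experiment(seed=42),
    }
    # Saving this report in the Git repository lets others compare their rerun with yours.
    print(json.dumps(report, indent=2))
```

Running the script twice with the same seed should give the same score. Pinning library versions (for example in a Conda environment file or a Docker image) complements this by fixing the rest of the software stack.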
Exam
- Project (50%): a Git repository where:
  - All code is available, so that your results can be easily reproduced.
  - Instructions are clear.
  - Deadline: January 11th, 2019.
- Oral presentation (50%):
  - In-depth explanation of the chosen algorithm.
  - Critical commentary on the results.
  - Choice of the tools and infrastructure used.
  - Difficulties you encountered.
  - Date: January 14th, 2019.