Project carried out in collaboration with Martina Trigilia, Francesco Santucciu and Michele Andreucci.
Distributed Data Analysis and Mining - Spark (Hadoop)
Analysis of the Australia "Rain Tomorrow" dataset.
Tasks:
- Data Understanding
- Data Preparation
- Classification and Clustering
- Regression
About the course: "this course aims at teaching the basic theoretical concepts behind the MapReduce distributed computing paradigm, and Hadoop in particular, and at building expertise in the practical usage of high-performance computing tools for data engineering, analysis and mining. In particular, the students will learn how classical data mining algorithms can be applied to Big Data using Hadoop (Spark). Real (and open source) datasets will be used to present examples and to let the students build their own projects".
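
As a rough illustration of the MapReduce paradigm mentioned in the course description, the following is a minimal pure-Python sketch, not code from this project. It counts records per (location, label) pair in three explicit phases: map, shuffle, reduce. The field names `Location` and `RainTomorrow` and the toy records are assumptions made up for the example; Spark runs the same kind of computation distributed across a cluster.

```python
from collections import defaultdict
from functools import reduce

# Toy records invented for illustration; they are NOT taken from the real dataset.
# The field names "Location" and "RainTomorrow" are assumptions.
records = [
    {"Location": "Sydney", "RainTomorrow": "Yes"},
    {"Location": "Sydney", "RainTomorrow": "No"},
    {"Location": "Perth", "RainTomorrow": "No"},
    {"Location": "Sydney", "RainTomorrow": "Yes"},
]


def map_phase(record):
    """Map: emit one ((location, label), 1) pair per input record."""
    return ((record["Location"], record["RainTomorrow"]), 1)


def shuffle_phase(pairs):
    """Shuffle: group all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the grouped values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Chain the three phases; on a cluster each phase would run in parallel over partitions.
counts = reduce_phase(shuffle_phase(map(map_phase, records)))

for (location, label), n in sorted(counts.items()):
    print(location, label, n)
```

Keeping the three phases as separate functions mirrors how the paradigm splits work: the map and reduce steps are independent per record and per key, which is what makes them easy to distribute.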