https://www.edx.org/course/big-data-analysis-spark-uc-berkeleyx-cs110x
This project is course that I took in the EDX platform (The course was archived)
- Learning objectives
- How to use Apache Spark to perform data analysis
- First steps with machine learn library MLlib
- How to use parallel programming and distributed file systems to explore and analyze massive data sets
- Use Spark to analyze log mining, textual entity recognition and collaborative filtering techniques to real-world data questions
- Start learn about recommendation system (movie recommendation)
- Use emacs, tramp and sunrise to mage and edit remote file
- Get familiar with vagrant (Manager virtual development environments)
See mooc-setup-master/README.md
Spark is a general purpose cluster computing framework that provides efficient in-memory computations for large data sets by distributing computation across multiple computers.
- Workflow
- Create or load a RDD (Resilient Distributed Data)
- Invoke operation on RDD by passing function closures to each
element
- When Spark runs a closure on a worker, any variables used in the closure are copied to that node, but are maintained within the local scope of that closure.
- Shared variables: Broadcast variables (read only) is distributed to all cluster
- Use the result RDD with actions: count, collect (the result is not big) and save
- Tips
- Try to think as much as possible in term of Key value pairs
- Do not copy all elements of large RDD to the driver (Avoid
collect() in big data) because of the memory limit
- Use take() or takeSample()
- Or use filter
- Avoid GroupByKey() when is possible. GrouByKey shuffle more data
than ReduceByKey. Think in the traffic and the cost to move data
in massive data sets
- Instead of GrouByKey, prefer to use: combineByKey() or foldByKey()
To see the labs exercise, see Iptyhon Notebooks labs
- text analytics exercise: mooc-setup-master/lab3_text_analysis_and_entity_resolution_student.ipynb
- log analysis: mooc-setup-master/lab2_apache_log_student.ipynb
- MLlib and movie recommendations: mooc-setup-master/lab4_machine_learning_student.ipynb
Vagrant is software to help you to manage virtual environments. It becomes easy to configure, reproducible and portable work environments.
Emacs is powerful editor that provides modes to manager vagrants virtual machines, access files and folder remotely transparently. So it is a piece of cake to develop and work with vagrant. For this project we worked with 2 modes: vagrant-tramp and sunrise (see also: pragmatic_emacs)
The snapshot bellow shows a print screen of the sunrise file manager while I am managing remote files (vagrant virtual machine) on the left and local files on the right
cd mooc-setup-master
vagrant up --provider=virtualbox