In this project, a service that delivers assessments for an education technology firm was created. The resulting data is ready for further querying according to the customer's requirements.
Main tasks of this project:
- Publish and consume messages with Kafka
- Use Spark to transform the messages
- Use Spark to transform the messages so that they can be landed in HDFS
The original data was acquired with:
`curl -L -o assessment-attempts-20180128-121051-nested.json https://goo.gl/ME6hjp`
The data is contained in the file assessment-attempts-20180128-121051-nested.json
in the Data folder, which is included in this repository.
Questions answered in this project:
- How many assessments are in the dataset?
- What's the name of your Kafka topic? How did you come up with that name?
- How many people took Learning Git?
- What is the least common course taken? And the most common?
- Add any query(ies) you think will help the data science team
-
Docker Images:
- cloudera
- kafka
- mids
- spark
- zookeeper_1
(The Docker configuration file docker-compose.yml is included in the Code folder.)
-
Google Cloud virtual machine
-
Jupyter Notebook
-
Publish and consume messages with Kafka in topic assessment
-
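As a rough illustration of the publishing step, the sketch below splits the nested JSON array into one message per assessment and sends each to the `assessment` topic. It is not necessarily how this repository publishes the data: the use of the `kafka-python` package, the broker address `kafka:29092`, and the function names `to_messages` and `publish` are all assumptions made for the example.

```python
import json


def to_messages(json_array_text):
    """Split a JSON array of assessments into one UTF-8 encoded message per record."""
    records = json.loads(json_array_text)
    return [json.dumps(record).encode("utf-8") for record in records]


def publish(messages, topic="assessment", bootstrap_servers="kafka:29092"):
    """Send each message to the Kafka topic.

    Assumption: the kafka-python package is installed and a broker is
    reachable at bootstrap_servers (hypothetical address).
    """
    # Imported lazily so that to_messages can be used without kafka-python.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    for message in messages:
        producer.send(topic, value=message)
    producer.flush()  # block until all buffered messages have been sent
    producer.close()
```

With a broker running, reading the downloaded file and calling `publish(to_messages(text))` on its contents would place one message per assessment in the topic.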
Use Spark to transform the messages (Spark TempTable: select_assessment, forced schema: final_schema) so that they can be landed in HDFS
-
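A minimal PySpark sketch of this step might look like the following. The names `select_assessment` and `final_schema` come from this project, but the fields inside the schema (`base_exam_id`, `exam_name`, `user_exam_id`), the broker address `kafka:29092`, and the output path `/tmp/assessments` are assumptions chosen for illustration; the real schema may differ. The helper `project_record` is hypothetical and only shows, on a plain Python dict, what forcing a schema does to one record.

```python
# Assumed field names; the actual final_schema in this project may differ.
ASSUMED_FIELDS = ("base_exam_id", "exam_name", "user_exam_id")


def project_record(record):
    """Keep only the assumed fields; missing ones become None (null in Spark)."""
    return {name: record.get(name) for name in ASSUMED_FIELDS}


def build_schema():
    """Build a forced schema of nullable string columns from ASSUMED_FIELDS."""
    from pyspark.sql.types import StringType, StructField, StructType

    return StructType(
        [StructField(name, StringType(), True) for name in ASSUMED_FIELDS]
    )


def transform_and_land(spark, output_path="/tmp/assessments"):
    """Read the topic, apply the schema, register a temp view, write Parquet.

    Assumes the Spark session was started with the Kafka SQL connector
    available and that output_path (hypothetical) resolves to HDFS.
    """
    from pyspark.sql.functions import col, from_json

    raw = (
        spark.read.format("kafka")
        .option("kafka.bootstrap.servers", "kafka:29092")  # assumed address
        .option("subscribe", "assessment")
        .option("startingOffsets", "earliest")
        .option("endingOffsets", "latest")
        .load()
    )
    final_schema = build_schema()
    parsed = raw.select(
        from_json(col("value").cast("string"), final_schema).alias("a")
    ).select("a.*")
    parsed.createOrReplaceTempView("select_assessment")
    parsed.write.mode("overwrite").parquet(output_path)
    return parsed
```

Once the view is registered, queries such as `spark.sql("select exam_name from select_assessment")` can run against it, provided the assumed column exists.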
Questions solved
- 3242 distinct assessments in the dataset
- 390 people took Learning Git
- The least common courses: Learning to Visualize Data with D3.js; Native Web Apps for Android; Nulls, Three-valued Logic and Missing Information; The Closed World Assumption
- The most common course: Learning Git
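The course counts above could be cross-checked outside Spark with plain Python. The sketch below is a substitute technique, not the Spark SQL used in this project, and it assumes each record stores the course name under a field called `exam_name`; the function names are hypothetical. Ties are returned together, which matters here because several courses share the lowest count.

```python
import json
from collections import Counter


def load_records(path):
    """Load the nested JSON file, which is assumed to hold one JSON array."""
    with open(path) as f:
        return json.load(f)


def course_counts(records, field="exam_name"):
    """Count how many assessment attempts exist per course name."""
    return Counter(r[field] for r in records if r.get(field) is not None)


def most_and_least_common(counts):
    """Return (most common names, least common names), including ties."""
    highest = max(counts.values())
    lowest = min(counts.values())
    most = sorted(name for name, n in counts.items() if n == highest)
    least = sorted(name for name, n in counts.items() if n == lowest)
    return most, least
```

For example, `course_counts(load_records(path))["Learning Git"]` gives the number of attempts recorded for Learning Git under these assumptions.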