Jobs for transforming data logged from Reddit into useful analytics.
./gradlew build
java -jar build/libs/<thejar>.jar tstData/ tmp/output
To run the job on EMR with input and output in S3:
- Upload the jar somewhere into S3 (s3JarLoc)
- Create an input bucket folder (s3In)
- Choose an output folder location (s3Out). It can be in the same bucket as s3In as long as the folders are distinct.
- Create an EMR job and add a custom JAR step that points to s3JarLoc. The step takes two arguments: s3In and s3Out.
- Run!
Make sure that s3Out does not exist before the job runs; if it already exists, the job will fail.
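The EMR steps above could be scripted with the AWS CLI. This is only a rough sketch: the bucket names, jar path, step name, and cluster ID are hypothetical placeholders, and it assumes an EMR cluster is already running. By default it only prints the commands (dry run); set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Hypothetical placeholders: replace with your own values.
LOCAL_JAR="path/to/thejar.jar"                 # jar produced by ./gradlew build
S3_JAR_LOC="s3://my-bucket/jars/thejar.jar"    # s3JarLoc
S3_IN="s3://my-bucket/input/"                  # s3In
S3_OUT="s3://my-bucket/output/"                # s3Out (must not exist when the job starts)
CLUSTER_ID="j-XXXXXXXXXXXXX"                   # assumption: an existing EMR cluster

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Build the custom JAR step definition: jar location plus the two job arguments.
emr_step() {
  printf 'Type=CUSTOM_JAR,Name=RedditAnalytics,ActionOnFailure=CONTINUE,Jar=%s,Args=[%s,%s]' \
    "$1" "$2" "$3"
}

# 1. Upload the jar to S3.
run aws s3 cp "$LOCAL_JAR" "$S3_JAR_LOC"

# 2. Remove any previous output so the job does not fail.
#    Destructive: this deletes everything under S3_OUT.
run aws s3 rm "$S3_OUT" --recursive

# 3. Submit the custom JAR step with s3In and s3Out as arguments.
run aws emr add-steps --cluster-id "$CLUSTER_ID" \
  --steps "$(emr_step "$S3_JAR_LOC" "$S3_IN" "$S3_OUT")"
```

Keeping the dry run as the default makes it easy to review the exact commands, in particular the recursive delete of s3Out, before anything touches S3.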
- Resolve the MRUnit issue. The jar is currently hardcoded into the repo because downloading MRUnit through Maven fails with a 404; that download problem needs to be fixed first.
- Address some of the TODOs in comments.
- Add more analytics jobs.