Simple Spark project built in Scala
https://circleci.com/gh/necosta/spark-x
Data source (one month of data, January, is 100+ MB deflated): https://transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
- Create the DataFrame by joining with the master (lookup) tables, optimized with broadcast joins
- Find the airports with the least delay (sorting and selecting the top)
- Multiple groupings, such as which airline has the most flights to New York (uses reduce and combine operators)
- Secondary sorts, such as which airlines have the worst arrivals at which airport, and by how much delay
- Custom partitions using the airline ID (in combination with the secondary sorts above)
- Any other interesting insights
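The goals above could be approached along the following lines. This is a minimal sketch, not the project's actual code: the column names (AIRLINE_ID, DEST_AIRPORT_ID, DEST_CITY_NAME, ARR_DELAY), the lookup columns (Code, Description) and every object, class and method name are assumptions made for illustration. "Least delay" is read here as the lowest average arrival delay per destination airport.

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, broadcast, col}

// Hypothetical helper object; all names are illustrative assumptions.
object FlightInsightsSketch {

  // Join the large flights table with a small lookup table. broadcast() hints
  // Spark to copy the lookup to every executor instead of shuffling flights.
  def withAirlineNames(flights: DataFrame, airlines: DataFrame): DataFrame =
    flights
      .join(broadcast(airlines), flights("AIRLINE_ID") === airlines("Code"))
      .withColumnRenamed("Description", "AIRLINE_NAME")
      .drop("Code")

  // Airports with the least delay: average arrival delay per destination,
  // sorted ascending, keeping the top n rows.
  def airportsWithLeastDelay(flights: DataFrame, n: Int): DataFrame =
    flights
      .groupBy("DEST_AIRPORT_ID")
      .agg(avg("ARR_DELAY").as("AVG_ARR_DELAY"))
      .orderBy(col("AVG_ARR_DELAY").asc)
      .limit(n)

  // Flights per airline into one city, counted with reduceByKey and
  // returned with the busiest airline first.
  def flightsPerAirlineTo(flights: DataFrame, city: String): Array[(Int, Long)] =
    flights
      .filter(col("DEST_CITY_NAME") === city)
      .select(col("AIRLINE_ID").cast("int"))
      .na.drop()
      .rdd
      .map(row => (row.getInt(0), 1L))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .collect()

  // Custom partitioner: every record of one airline lands in the same
  // partition. Keys are (airlineId, sortField) pairs.
  class AirlinePartitioner(override val numPartitions: Int) extends Partitioner {
    override def getPartition(key: Any): Int = key match {
      case (airlineId: Int, _) => Math.floorMod(airlineId, numPartitions)
    }
  }

  // Secondary sort: partition by airline, then sort inside each partition by
  // (airline, negated average delay) so the worst airport comes first.
  // Returns (airlineId, destAirportId, averageArrivalDelayInMinutes).
  def worstArrivalsByAirline(flights: DataFrame, partitions: Int): RDD[(Int, Int, Int)] =
    flights
      .groupBy(col("AIRLINE_ID").cast("int"), col("DEST_AIRPORT_ID").cast("int"))
      .agg(avg("ARR_DELAY").cast("int"))
      .na.drop()
      .rdd
      .map(r => ((r.getInt(0), -r.getInt(2)), r.getInt(1)))
      .repartitionAndSortWithinPartitions(new AirlinePartitioner(partitions))
      .map { case ((airline, negDelay), airport) => (airline, airport, -negDelay) }
}
```

reduceByKey merges the counts on each partition before the shuffle, which is why it is used here rather than grouping all records first. Note that the city names in the dataset may include the state (for example "New York, NY"), so the value passed to flightsPerAirlineTo has to match the data exactly.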
- Install SBT
Available here: https://transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
Select fields:
- Time Period > Flight Date
- Airline > AirlineID (L_AIRLINE_ID lookup)
- Airline > FlightNum
- Origin > OriginAirportID (L_AIRPORT_ID lookup)
- Destination > DestAirportID (L_AIRPORT_ID lookup)
- Destination > DestCityName
- Departure Perf > DepDelay
- Arrival Perf > ArrDelay
- Cancellations and Diversions > Cancelled
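Once the file is downloaded, the selected fields could be loaded with an explicit schema, roughly as sketched below. The CSV header names (FL_DATE, AIRLINE_ID, FL_NUM and so on), the assumption that columns arrive in the order listed above, and the names FlightsReaderSketch, flightsSchema and readFlights are all illustrative guesses, not taken from the project; check them against the header of the actual download.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Hypothetical reader; column names and order are assumptions.
object FlightsReaderSketch {

  // One field per selected column. The schema is applied by position,
  // so the order must match the order of columns in the file.
  val flightsSchema: StructType = StructType(Seq(
    StructField("FL_DATE", StringType),            // kept as text; parse later if needed
    StructField("AIRLINE_ID", IntegerType),        // key into the L_AIRLINE_ID lookup
    StructField("FL_NUM", IntegerType),
    StructField("ORIGIN_AIRPORT_ID", IntegerType), // key into the L_AIRPORT_ID lookup
    StructField("DEST_AIRPORT_ID", IntegerType),   // key into the L_AIRPORT_ID lookup
    StructField("DEST_CITY_NAME", StringType),
    StructField("DEP_DELAY", DoubleType),          // minutes
    StructField("ARR_DELAY", DoubleType),          // minutes
    StructField("CANCELLED", DoubleType)           // assumed 1.00 = cancelled, 0.00 = not
  ))

  // Read the downloaded CSV, skipping its header row and applying the schema.
  def readFlights(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .schema(flightsSchema)
      .csv(path)
}
```

Supplying the schema up front avoids a second pass over a 100+ MB file for type inference and keeps the ID columns typed consistently for the later lookup joins.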
ToDo: Automate import of base table from webpage
- Build:
sbt compile
- Test:
sbt test
- Package:
sbt package
- Build image:
docker build --build-arg VERSION=x.y.z -t sparkx .
Note: take the version (x.y.z) from the version.sbt file
- Run image:
docker run sparkx