Determining Airline Prices

By: Christopher Kuzemka

Problem Statement

Aviation is one of the largest industries in today's global market. Commercial aviation has made it possible for people to connect with each other in ways that may have been unimaginable over a century ago. However, a lot of thought must be put into the FAA standards and the routes that modern planes must follow to make such connections possible.

Consider the case of a startup airline, known as "Kruze" (pronounced like "Cruise"), that wants to establish itself as a top competitor against existing airlines. Part of this startup process focuses on understanding the costs that come into play when managing flights. Our job as data scientists is to help Kruze determine the minimum price the airline must charge its passengers in order to at least break even. To do this, we are going to analyze seven different locations where Kruze would like to establish itself and examine existing flight route data as well as existing flight ticket prices (as the prediction target) to help us create a supervised learning model.

To start, we will approach the project with the intention of delivering a minimum proof of concept. With that in mind, we will place some limitations on our study and reduce the potential for scope creep by:

  • only analyzing the seven airports as destination hubs for different flights:
    • New York John F. Kennedy Airport (KJFK)
    • Chicago O'Hare International Airport (KORD)
    • Los Angeles International Airport (KLAX)
    • Houston George Bush Intercontinental Airport (KIAH)
    • Miami International Airport (KMIA)
    • Hartsfield-Jackson Atlanta International Airport (KATL)
    • Portland International Airport (KPDX)
  • assuming costs external to the study (including maintenance and crew salaries) to be negligible
  • using price data from future flights as opposed to previous flights, as previous flight pricing is not readily available

All of the assumptions listed above are set to allow us to achieve (or attempt to achieve) our goal within a limited time frame, as Kruze requires an answer from us quickly! With this in mind, we will discuss how such assumptions can contribute to error throughout our study, and remind ourselves that integrating the neglected features in future work may be very beneficial in achieving a stronger prediction.

As we are working with a continuous target variable, we will analyze common price trends using supervised regression models, such as Linear Regression, KNN Regression, Bagging Regression, Decision Tree Regression, and Random Forest Regression. We will ultimately use the Mean Absolute Error of our predictions to gauge how well the selected model predicts the price, and discuss what issues may be observed from the limitations of this study. The Mean Absolute Error seems the most appropriate metric for pricing values because the error is on the same scale as our predictions, which makes the general error of the model easy for anyone to understand in absolute terms.
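As a minimal sketch of the metric described above, the Mean Absolute Error averages the absolute differences between the true and predicted prices, so the result is expressed in the same units as the ticket prices. The function name and the sample prices below are illustrative assumptions, not values from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ticket prices in dollars (not taken from the study's data).
true_prices = [100.0, 250.0, 80.0]
predicted_prices = [110.0, 240.0, 95.0]

mae = mean_absolute_error(true_prices, predicted_prices)
print(f"MAE: ${mae:.2f}")  # absolute errors are 10, 10, and 15
```

Because the errors are not squared, a single badly mispriced route does not dominate the score the way it would with a squared-error metric.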

Executive Summary

A study was conducted on flight data and flight pricing data to help the startup airline, known as Kruze, determine ticket price thresholds for passengers. The data was gathered through an API produced by FlightAware.com, known as "FlightXML2." FlightXML2 is a popular API utilized by various companies affiliated with the airline and travel business to observe different types of flight data. With this API, a search was conducted on the seven destination airports listed in the problem statement above, where up to 15 flights per airport were searched on a real-time basis on a given day in May. These 15 flights (origin-destination combinations) per airport were then searched through the API again to determine their May 2020 schedule at an 8-hour frequency. The search ultimately cost the data scientist approximately $120. The data returned included the departure and arrival times for every flight combination discovered within the month of May, along with airline data, aircraft type data, technical details, and more.
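One way to read the "8-hour frequency" above is as a series of consecutive 8-hour search windows covering May 2020. The sketch below only generates those windows; it does not call FlightXML2. The helper name `time_windows`, the use of UTC, and the conversion to epoch seconds are all assumptions made for illustration, not details confirmed by the study.

```python
from datetime import datetime, timedelta, timezone


def time_windows(start, end, hours=8):
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    step = timedelta(hours=hours)
    current = start
    while current < end:
        # Clip the last window so it never runs past the end of the range.
        yield current, min(current + step, end)
        current += step


# Assumed search range: all of May 2020, expressed in UTC.
may_start = datetime(2020, 5, 1, tzinfo=timezone.utc)
june_start = datetime(2020, 6, 1, tzinfo=timezone.utc)

windows = list(time_windows(may_start, june_start))

# Many flight APIs accept time ranges as epoch seconds, so convert each window.
epoch_windows = [(int(s.timestamp()), int(e.timestamp())) for s, e in windows]
print(len(windows), "windows; first:", epoch_windows[0])
```

Each window could then be paired with an origin-destination combination to form one schedule query, which also makes it easy to estimate how many billable requests a full search would need.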

To gather quotes on flights, the Skyscanner API was utilized. It was impossible to determine the pricing data of past flights, so future quotes were substituted for the flight pricing; this meant past flight information was paired with future pricing. The Skyscanner search retrieved the best monthly quote for every origin-destination combination from June 2020 to December 2021. Various features, including actual city names, carrier IDs, and country information, were also gathered from this quotes search. This dataframe containing quotes was combined with the flight scheduling dataframe to create a large dataframe meant directly for modeling. Through a bootstrapping method, where different prices were sampled randomly onto the flight scheduling dataframe, a dataframe of approximately 5,000 rows was ultimately used to show relationships between the features and our main target, the price.
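The price-sampling step described above might look roughly like the following sketch, which uses plain dictionaries rather than dataframes. It assumes that quotes are grouped by origin-destination pair and that each scheduled flight receives one quote drawn at random (with replacement) from its route. The function name, the sample rows, and the prices are hypothetical and are not taken from the study.

```python
import random


def attach_sampled_prices(flights, quotes_by_route, seed=0):
    """Return copies of flight rows with a MinPrice sampled (with replacement)
    from the quotes available for the same origin-destination pair."""
    rng = random.Random(seed)  # seeded so the sampling is reproducible
    priced = []
    for flight in flights:
        route = (flight["origin_IATA"], flight["destination_IATA"])
        quotes = quotes_by_route.get(route)
        if not quotes:
            continue  # no quote exists for this route, so the row cannot be priced
        priced.append({**flight, "MinPrice": rng.choice(quotes)})
    return priced


# Hypothetical schedule rows and quotes (illustrative values only).
flights = [
    {"ident": "AAA100", "origin_IATA": "BOS", "destination_IATA": "JFK"},
    {"ident": "AAA200", "origin_IATA": "BOS", "destination_IATA": "JFK"},
    {"ident": "BBB300", "origin_IATA": "SEA", "destination_IATA": "PDX"},
]
quotes_by_route = {("BOS", "JFK"): [89.0, 120.0, 150.0]}

priced = attach_sampled_prices(flights, quotes_by_route)
for row in priced:
    print(row["ident"], row["MinPrice"])
```

Note that a quote sampled this way is matched only on the route, so the carrier attached to the quote is not guaranteed to be the carrier that actually operated the scheduled flight.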

After all the arduous cleaning necessary to create a dataframe for modeling, four different supervised models were implemented: Linear Regression, KNN Regression, Bagging Regression, and Decision Tree Regression. The Bagging Regression model was considered the best performing of these models due to the smaller differences among its training, testing, and cross-validated mean absolute error scores. With all this mentioned, the study has room for improvement. Much bias exists in the data, mainly regarding the role the Covid-19 pandemic has played, as well as how that bias was combatted by introducing other biases into the search. The bootstrapping method also contributed to inaccuracies in carrier identification.
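The selection criterion described above, preferring the model whose training, testing, and cross-validated MAE scores differ the least, can be sketched as follows. The helper names and the scores are placeholders chosen for illustration; they are not the results reported by the study, and the study may have compared the scores differently.

```python
def mae_spread(scores):
    """Largest gap between the train, test, and cross-validated MAE scores."""
    values = [scores["train"], scores["test"], scores["cv"]]
    return max(values) - min(values)


def most_consistent_model(results):
    """Name of the model whose MAE scores are closest to one another."""
    return min(results, key=lambda name: mae_spread(results[name]))


# Placeholder MAE scores in dollars (NOT the study's actual results).
results = {
    "model_a": {"train": 10.0, "test": 30.0, "cv": 28.0},  # large train/test gap
    "model_b": {"train": 20.0, "test": 24.0, "cv": 23.0},  # scores close together
}

best = most_consistent_model(results)
print(best, mae_spread(results[best]))
```

A small spread suggests the model generalizes consistently, but it says nothing about whether the errors themselves are low, so in practice the spread would be weighed alongside the absolute MAE values.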

Data Dictionary

| Data Variable | Type | Significance |
|---|---|---|
| ident | String Object | identification of a flight |
| actual_ident | String Object | original identification of a codeshared flight |
| departuretime | String Object | departure time of flight including date, hours, and minutes |
| arrival_time | String Object | arrival time of flight including date, hours, and minutes |
| origin | String Object | origin ICAO code |
| destination | String Object | destination ICAO code |
| aircrafttype | String Object | type of aircraft flown |
| meal_service | String Object | type of meal service provided on flight |
| seats_cabin_first | Integer | number of first class seats available |
| seats_cabin_business | Integer | number of business class seats available |
| seats_cabin_coach | Integer | number of coach class seats available |
| origin_IATA | String Object | origin IATA code |
| destination_IATA | String Object | destination IATA code |
| flight_duration | String Object | flight duration in seconds |
| MinPrice | Float | minimum price for a ticket on the flight |
| Direct | Boolean Object | boolean state if the route was direct or had stops |
| OriginName | String Object | origin airport name |
| OriginCityName | String Object | origin city name |
| OriginCountryName | String Object | origin country name |
| DestinationName | String Object | destination airport name |
| DestinationCityName | String Object | destination city name |
| DestinationCountryName | String Object | destination country name |
| CarrierName | String Object | carrier name affiliated with quote |

Conclusions and Future Work

The model performed well, but that performance came at a great cost. We must remember that:

  • this study was conducted during the Covid-19 pandemic, so the data gathered reflects a domain affected by the pandemic

  • the prices were bootstrapped in a manner where some carrier IDs do not accurately depict the true carrier of a flight

  • we are using past flights to predict future flight pricing. Better coordination would be needed for more accurate models (consider grabbing future quotes first, then grabbing the flight data for each quote once the flight occurs)

  • there are many skews and biases in our data, including geographical location limitations, dominating imbalances among categorical types, and a risk of overfitting

With the issues mentioned above, which do not cover every issue in this study, there is much future work to be done to create a better model and a better platform for Kruze. Future considerations include incorporating more relevant features, such as jet fuel pricing, passenger data, jet engine analysis, more routes, travel trends, seasonality, the cost of housing planes, and more. The project was semi-successful in that it showcases a minimum proof of concept, but it fails to serve as a reliable model for price prediction.
