
 Data Preprocessing

      We obtained our datasets from the Bureau of Transportation Statistics website, which provides a comprehensive set of on-time flight performance data.

      The inputs include airline-related features (airline name, flight number), airport-related features (departure and destination airports), and flight-related features (scheduled departure and arrival times, scheduled elapsed time, etc.). During peer review, many reviewers suggested that weather should be a powerful input, and we acknowledge that weather can be a critical contributor to flight delays. However, since weather data is not always available ahead of time for real-world predictions, we do not include weather as an input feature in our model. The output is a binary variable: 1 indicates the flight is delayed (arriving more than 15 minutes after the scheduled time), and 0 indicates the flight arrives on time or within 15 minutes of the scheduled time.
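The 15-minute threshold above can be sketched as a label-construction step. This is a minimal illustration, assuming the raw data exposes the arrival delay in minutes under a column we call ARR_DELAY (that column name is our assumption, not confirmed by the text):

```python
import pandas as pd

def make_delay_label(df: pd.DataFrame) -> pd.Series:
    # 1 if the flight arrived MORE than 15 minutes late, 0 otherwise.
    # "ARR_DELAY" (arrival delay in minutes) is an assumed column name.
    return (df["ARR_DELAY"] > 15).astype(int)

# Toy example: a flight exactly 15 minutes late still counts as on-time.
flights = pd.DataFrame({"ARR_DELAY": [-5, 0, 15, 16, 90]})
labels = make_delay_label(flights)
```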


      Our inputs are listed below:

      - Airline name
      - Flight number
      - Departure airport
      - Destination airport
      - Scheduled departure time
      - Scheduled arrival time
      - Scheduled elapsed time

      Output: arrival delay (binary, YES/NO).


      We collected monthly airline on-time performance data from March 2017 to February 2018. For preprocessing, we used the Python library pandas to merge the monthly datasets, and removed every record of a flight that was cancelled or diverted, because these flights have no arrival-time information. We also removed all records of flights that were delayed due to extreme weather, because weather is not an input feature of our model.
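The merge-and-filter step above could look like the sketch below. The column names CANCELLED, DIVERTED, and WEATHER_DELAY are assumptions about the raw files; monthly files would first be concatenated, e.g. `pd.concat([pd.read_csv(p) for p in monthly_paths], ignore_index=True)`:

```python
import pandas as pd

def clean_flights(df: pd.DataFrame) -> pd.DataFrame:
    # Cancelled or diverted flights have no arrival time, so drop them.
    kept = df[(df["CANCELLED"] == 0) & (df["DIVERTED"] == 0)]
    # Drop flights delayed by extreme weather (weather is not an input).
    kept = kept[kept["WEATHER_DELAY"].fillna(0) == 0]
    return kept.reset_index(drop=True)

# Toy example: one normal flight, one cancelled, one diverted,
# one delayed by weather. Only the first survives cleaning.
raw = pd.DataFrame({
    "CANCELLED":     [0, 1, 0, 0],
    "DIVERTED":      [0, 0, 1, 0],
    "WEATHER_DELAY": [0.0, 0.0, 0.0, 120.0],
})
cleaned = clean_flights(raw)
```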

    

      The original dataset contains a total of 6,363,687 records. Given the enormous size of the dataset, we narrowed it down to flights departing from the busiest US airport, Atlanta International Airport. We then partitioned the dataset into a training set (75%) and a testing set (25%). For our first experiment, we wanted to investigate the impact of dataset size on the performance of the trained models, so we built three datasets of varying sizes by randomly sampling from the training set: small (7,576 records), medium (18,939 records), and large (75,753 records).
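The partitioning and sampling steps can be sketched as follows. This assumes a cleaned DataFrame of Atlanta departures; the subset sizes are parameters whose defaults match the counts reported in the text, and the function name and seed are our own choices:

```python
import pandas as pd

def split_and_sample(atl: pd.DataFrame, sizes=None, n_eval=1894, seed=0):
    # Defaults match the reported subset sizes: small / medium / large.
    sizes = sizes or {"small": 7576, "medium": 18939, "large": 75753}
    train = atl.sample(frac=0.75, random_state=seed)  # 75% training set
    test = atl.drop(train.index)                      # remaining 25% testing set
    # Random subsets of the training set, one per requested size.
    subsets = {name: train.sample(n=n, random_state=seed)
               for name, n in sizes.items()}
    # Smaller evaluation set sampled from the testing set.
    eval_set = test.sample(n=n_eval, random_state=seed)
    return train, test, subsets, eval_set

# Toy example with 1,000 rows and scaled-down subset sizes.
df = pd.DataFrame({"x": range(1000)})
train, test, subsets, eval_set = split_and_sample(
    df, sizes={"small": 10, "medium": 20, "large": 100}, n_eval=50)
```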

For the final evaluation of the trained classifiers, we randomly sampled from the testing set to build a smaller evaluation set of 1,894 records.
    
      
