
Evaluation Metrics

      In the preliminary analysis, we found that the majority of flights in the dataset were on time, so a ZeroR baseline that always predicts the majority class already reached an accuracy of around 85%. While accuracy is one of the most common metrics for evaluating the performance of ML models, the imbalanced nature of the dataset motivated us to explore other evaluation metrics. For our task, two measures are particularly interesting and relevant:

      (1) recall for delayed flights: the percentage of actual delayed flights that are correctly classified as delayed
      (2) precision for on-time predictions: the percentage of predicted on-time flights that are actually on time
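
      Both measures are ordinary precision/recall computations restricted to one class. The original experiments appear to have been run in WEKA (ZeroR is a WEKA baseline), so the scikit-learn snippet below is only an illustrative equivalent, with made-up labels and predictions:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Illustrative ground truth and predictions; the real dataset is much larger.
y_true = ["on-time", "on-time", "delayed", "on-time", "delayed", "on-time"]
y_pred = ["on-time", "on-time", "on-time", "on-time", "delayed", "delayed"]

# Plain accuracy: the metric that a majority-class (ZeroR-style) baseline inflates.
print("accuracy:", accuracy_score(y_true, y_pred))

# (1) Recall for delayed flights: of all actually delayed flights,
#     the fraction the model correctly flags as delayed.
print("recall (delayed):", recall_score(y_true, y_pred, pos_label="delayed"))

# (2) Precision for on-time predictions: of all flights predicted on-time,
#     the fraction that really were on time.
print("precision (on-time):", precision_score(y_true, y_pred, pos_label="on-time"))
```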

     

      The priority for our task is to ensure high precision for on-time predictions, so that people can place more trust in an on-time prediction generated by the model. Additionally, the model should have high recall for delayed flights, meaning the system is capable of identifying the flights that will indeed be delayed.

      While any misclassification is annoying, in practice the costs associated with the two types of error are different. A false alarm (a flight predicted to be delayed that is actually on time) is much less detrimental than a miss (a flight predicted to be on time that is actually delayed): it doesn't hurt to set up a backup plan and end up not needing it, whereas dealing with an unexpected delay is genuinely disruptive.

      We therefore experimented with a cost-sensitive classifier. The default cost matrix for a two-class problem is shown below, where a Miss and a False Alarm incur the same cost.

                            Predicted on-time    Predicted delayed
      Actual on-time                 0                    1
      Actual delayed                 1                    0
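
      A cost-sensitive classifier typically uses this matrix by choosing, for each flight, the prediction with the lowest expected cost under the model's class probabilities. The NumPy sketch below (with made-up probabilities) illustrates the rule; note that under the default, equal-cost matrix it reduces to simply picking the most probable class:

```python
import numpy as np

# Cost matrix: rows = actual class, columns = predicted class.
# Class order in this sketch: index 0 = on-time, index 1 = delayed.
default_cost = np.array([[0.0, 1.0],   # actual on-time: a False Alarm costs 1
                         [1.0, 0.0]])  # actual delayed: a Miss costs 1

# Illustrative model probabilities for three flights: [P(on-time), P(delayed)].
proba = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.3, 0.7]])

# Expected cost of predicting class j = sum_i P(actual = i) * cost[i, j].
expected_cost = proba @ default_cost

# Predict the class with the lowest expected cost.
print(expected_cost.argmin(axis=1))  # [0 0 1] -- identical to proba.argmax(axis=1)
```

      With equal error costs, the rule cannot behave differently from the plain classifier, which is why the matrix has to be reweighted.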

      Given our reasoning above, we decided to increase the cost of a Miss: since we want the model to achieve higher precision for on-time predictions, we penalize it more heavily for classifying a delayed flight as on time. The new cost matrix is shown below, with an increased cost c > 1 for a Miss.

                            Predicted on-time    Predicted delayed
      Actual on-time                 0                    1
      Actual delayed              c (> 1)                 0

Weighted Cost Matrix
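
      Rerunning the same expected-cost rule with the weighted matrix shows the intended shift. The Miss cost of 4 below is a hypothetical value chosen purely for illustration, not the one from our experiments:

```python
import numpy as np

# Weighted matrix: same layout as before, but a Miss is now heavily penalized.
# The value 4.0 is purely illustrative.
weighted_cost = np.array([[0.0, 1.0],   # actual on-time: a False Alarm still costs 1
                          [4.0, 0.0]])  # actual delayed: a Miss now costs 4

proba = np.array([[0.9, 0.1],
                  [0.6, 0.4],   # borderline flight
                  [0.3, 0.7]])

print((proba @ weighted_cost).argmin(axis=1))  # [0 1 1]
```

      The borderline flight flips from on-time to delayed. In general, with a Miss cost of c (and a False Alarm cost of 1), a flight is predicted as delayed whenever P(delayed) > 1 / (1 + c), so increasing c lowers the decision threshold below the usual 0.5, trading more False Alarms for fewer Misses, which is exactly the direction our priorities call for.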
