Predicting flight delay using Machine Learning models and Azure Notebook
A Step-by-Step Implementation Guide
This story is part of the MSP Developer Stories initiative by the Microsoft Student Partner (India) program. Microsoft Student Partners (MSPs) are on-campus leaders with a passion for making a difference, building vibrant communities and sharing the latest tech with their peers.
If you think you have these qualities, apply to become an MSP at https://studentpartners.microsoft.com/
INTRODUCTION
In this story we will predict whether a flight will be delayed, based on a historical dataset. The goal of the model is to predict whether a flight you are considering booking is likely to arrive on time. We will take a step-by-step approach: first we learn how to create a notebook in Azure Notebook, then we import the data into the notebook using curl. After that we use Pandas to clean and prepare the data, scikit-learn to build the model, and finally Matplotlib to visualise the results.
Let’s begin by creating an Azure Notebook.
Creating Azure Notebook
Azure Notebook is a hosted Jupyter (formerly IPython) Notebook that combines Python code, markdown and interactive graphics. It is also a free service that lets anyone run code in their browser using Jupyter. Its available kernels currently include Python 2.7, Python 3.5, Python 3.6, R 3.4.1 and F# 4.1.9.
We will use Python throughout. Beyond the base Python distribution, the Python kernel also includes the numpy, matplotlib, scikit-learn, pandas and bokeh libraries.
If you want to know more about what Azure Notebook can do, you can read this blog post by a Microsoft employee.
To start, navigate to https://notebooks.azure.com/ and sign in using your Microsoft account. You will see something like this:
Click on My Projects in the top left corner.
On the My Projects page, click on “+ New Project” at the top, as shown below.
Name the new project “Flight Delay” or something similar. You can make your project public, and you can change this setting later. Finally, click the Create button.
You will land in the Flight Delay project.
Now click on + New. You will see options such as Notebook, Folder, Blank File and Markdown.
We will use a notebook here, so click on Notebook. For reference, Blank File lets you specify both the file name and its extension; use it when needed.
Name the notebook “Predicting Flight Delay.ipynb” or something similar. Select Python 3.6 as the language. This attaches the Python 3.6 kernel to the notebook.
Then click on the notebook to open it. You can now edit it as well.
Importing Data to work on
We have now learnt how to create an Azure Notebook. As you keep working with Azure Notebook, you can create additional projects and notebooks. You can also upload existing notebooks or start from scratch.
We will now import the data. To do so, we will use the Bash command curl. To run a Bash command from a notebook cell, prefix it with an exclamation mark (!).
My notebook containing all the code is shared at the end of this story.
Add the following command to the first cell of the notebook and press Shift + Enter (or click the Run button) to execute it. You should see output like that shown below.
Now load the data into a Pandas DataFrame, name it df, and display it using df.head(). A DataFrame is a 2-D labeled data structure, much like a spreadsheet.
The dataframe contains on-time arrival information of major US airlines.
It has 26 columns, and the first 5 rows are displayed. Each row represents the details of one flight.
By looking at the column names, can you guess what each column represents?
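The loading step can be sketched roughly as follows. To stay self-contained, this sketch reads a tiny made-up CSV from memory instead of the downloaded file; the rows are illustrative and not real flight data, and flightdata.csv is only an assumed name for the file saved by curl.

```python
import io

import pandas as pd

# Tiny illustrative sample (not real data). Every line ends with a
# trailing comma, mimicking the layout of the downloaded file.
sample_csv = io.StringIO(
    "YEAR,MONTH,ORIGIN,DEST,ARR_DEL15,\n"
    "2016,1,ATL,SEA,0.0,\n"
    "2016,1,DTW,MSP,1.0,\n"
    "2016,1,JFK,ATL,,\n"
)

# In the notebook this would be something like:
#   df = pd.read_csv("flightdata.csv")   # assumed file name
df = pd.read_csv(sample_csv)

print(df.head())  # first rows of the frame (at most 5)
print(df.shape)   # (number of rows, number of columns)
```

Because of the trailing commas, Pandas adds one extra empty column and names it after its position, which is where a name like “Unnamed: 25” comes from in the full dataset.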
Understanding the Data
If you are curious about how many rows the dataframe contains, run df.shape in the next empty cell.
Let’s understand what each column in the dataframe means.
YEAR — Year in which flight took place
QUARTER — Quarter in which flight took place (1–4)
MONTH — Month in which flight took place (1–12)
DAY_OF_MONTH — Day of the month in which flight took place (1–31)
DAY_OF_WEEK — Day of the week on which flight took place (1 for Monday, 2 for Tuesday, etc.)
UNIQUE_CARRIER — Airline carrier code
TAIL_NUM — Aircraft tail number
FL_NUM — Flight number
ORIGIN_AIRPORT_ID — ID of origin airport
ORIGIN — Code of origin airport (ATL, DFW, SEA, etc.)
DEST_AIRPORT_ID — ID of destination airport
DEST — Code of destination airport (ATL, DFW, SEA, etc.)
CRS_DEP_TIME — Scheduled departure time
DEP_TIME — Actual departure time
DEP_DELAY — Departure Delay in minutes
DEP_DEL15 — 1 if departure is delayed by 15 minutes or more else 0
CRS_ARR_TIME — Scheduled arrival time
ARR_TIME — Actual arrival time
ARR_DELAY — Arrival Delay in minutes
ARR_DEL15 — 1 if arrived late by 15 minutes or more else 0
CANCELLED — 1 if Flight was cancelled else 0
DIVERTED — 1 if Flight was diverted else 0
CRS_ELAPSED_TIME — Scheduled flight time in minutes
ACTUAL_ELAPSED_TIME — Actual flight time in minutes
DISTANCE — Distance traveled in miles
Clean and Prepare Data
Before we apply machine learning, there are two more tasks we need to do:
- Figure out which “feature” columns are relevant to the output we are predicting.
- Eliminate missing values, either by deleting rows or columns or by filling in meaningful values.
As a data scientist, you will first look for missing values in the dataset. There’s an easy way to do that in Pandas.
Run df.isnull().values.any(); if the output is True, the dataset has missing values.
The next task is to find out where the missing values are. Run df.isnull().sum() to show the number of missing values in each column.
We can see that the 26th column, “Unnamed: 25”, has 11231 missing values. This is because the CSV file that you imported contains a comma at the end of each line, which Pandas reads as an extra, empty column. Let’s delete that column.
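As a minimal sketch of these checks, the snippet below uses a small stand-in dataframe (made-up values) with an empty “Unnamed: 25” column, counts the missing values and then drops that column.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with made-up values; "Unnamed: 25" imitates the
# empty column created by the trailing comma in the CSV file.
df = pd.DataFrame({
    "MONTH": [1, 1, 2],
    "ARR_DEL15": [0.0, np.nan, 1.0],
    "Unnamed: 25": [np.nan, np.nan, np.nan],
})

has_missing = df.isnull().values.any()  # True if any cell is missing
missing_per_column = df.isnull().sum()  # missing-value count per column
print(has_missing)
print(missing_per_column)

# Drop the empty column (axis=1 selects columns rather than rows).
df = df.drop("Unnamed: 25", axis=1)
```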
The tail number of an aircraft has very little influence on whether a flight will be delayed. Similarly, at the time of booking you have no idea whether the flight will depart late, be cancelled or be diverted, so our model cannot make use of those columns.
So, let’s filter the dataset down to the columns we need.
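One way to do the filtering is to index the dataframe with a list of column names. The list below is an assumption, chosen to match the columns used in the later steps (date parts, airports, scheduled departure time and the ARR_DEL15 label); the single row is made up.

```python
import pandas as pd

# One made-up row carrying a few of the columns from the full dataset.
df = pd.DataFrame({
    "MONTH": [1], "DAY_OF_MONTH": [1], "DAY_OF_WEEK": [5],
    "TAIL_NUM": ["N00000"], "ORIGIN": ["ATL"], "DEST": ["SEA"],
    "CRS_DEP_TIME": [1905], "DEP_DEL15": [0.0],
    "CANCELLED": [0.0], "DIVERTED": [0.0], "ARR_DEL15": [0.0],
})

# Assumed feature columns plus the label column ARR_DEL15.
keep = ["MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "ORIGIN", "DEST",
        "CRS_DEP_TIME", "ARR_DEL15"]
df = df[keep]

print(df.columns.tolist())
```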
Now we see that the ARR_DEL15 column still has some missing data. Pandas represents missing values as NaN (Not a Number). You can see those rows by typing df[df.isnull().values.any(axis=1)].head(188).
We will now use the fillna method to replace those missing values with 1, which treats those flights as having arrived late. Do this using
df = df.fillna({'ARR_DEL15': 1})
Use df.iloc[177:185] to view rows 177 through 184 of the dataset. You will find that some rows that previously showed NaN now show 1.
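Here is a small sketch of the same idea on a stand-in dataframe with made-up values: first select the rows that contain a NaN, then fill the missing ARR_DEL15 values with 1.

```python
import numpy as np
import pandas as pd

# Made-up values; the second row has a missing label.
df = pd.DataFrame({
    "CRS_DEP_TIME": [1905, 1345, 940],
    "ARR_DEL15": [0.0, np.nan, 1.0],
})

# Rows that contain at least one missing value.
rows_with_nan = df[df.isnull().values.any(axis=1)]
print(rows_with_nan)

# Replace missing ARR_DEL15 values with 1 (treated as "arrived late").
df = df.fillna({"ARR_DEL15": 1})
print(df)
```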
Observe that CRS_DEP_TIME is in military time (for example, 1345 means 1:45 p.m.). Let’s bin the departure times by hour so that they fall in the range 0 to 23.
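A compact way to do this binning, assuming the times are stored as integers such as 940 or 2359, is integer division by 100. This is a sketch of one possible approach, not necessarily the exact code in the shared notebook.

```python
import pandas as pd

# Made-up scheduled departure times in military (hhmm) format.
df = pd.DataFrame({"CRS_DEP_TIME": [5, 940, 1345, 2359]})

# Integer-divide by 100 to keep only the hour: 940 -> 9, 2359 -> 23.
df["CRS_DEP_TIME"] = df["CRS_DEP_TIME"] // 100

print(df["CRS_DEP_TIME"].tolist())
```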
Let’s now generate indicator columns from the ORIGIN and DEST columns, and drop the original columns.
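Pandas offers get_dummies for exactly this kind of encoding. The sketch below runs it on three made-up rows; each airport code becomes its own indicator column, and ORIGIN and DEST themselves disappear.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "ORIGIN": ["ATL", "JFK", "SEA"],
    "DEST": ["SEA", "ATL", "JFK"],
    "ARR_DEL15": [0.0, 1.0, 0.0],
})

# One indicator column per airport code; ORIGIN and DEST are replaced
# by columns such as ORIGIN_ATL and DEST_SEA.
df = pd.get_dummies(df, columns=["ORIGIN", "DEST"])

print(df.columns.tolist())
```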
Now our dataset is finally prepared and ready for a model to work on.
Build the model
To create a machine learning model we need two datasets, one for training and one for testing, but right now we have only one. So let’s split it in two with a ratio of 80:20. We will also divide the dataframe into feature columns and a label column.
Here we import the train_test_split function from sklearn and use it to split the dataset. Setting test_size = 0.2 makes the split 80% training data and 20% test data.
The random_state parameter seeds the random number generator used to split the dataset, so the split is reproducible.
The function returns four datasets, in the order train_x, test_x, train_y, test_y. Calling shape on these datasets shows how the data was split.
Can you guess what would happen if you called shape on the other two datasets? Try it out yourself!
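The split can be sketched as below. The dataframe is random stand-in data, and random_state=42 is an arbitrary seed chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data: 100 rows, two features and a 0/1 label.
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "MONTH": rng.randint(1, 13, size=100),
    "CRS_DEP_TIME": rng.randint(0, 24, size=100),
    "ARR_DEL15": rng.randint(0, 2, size=100).astype(float),
})

# Features are every column except the label; the label is ARR_DEL15.
# Return order: train features, test features, train labels, test labels.
train_x, test_x, train_y, test_y = train_test_split(
    df.drop("ARR_DEL15", axis=1), df["ARR_DEL15"],
    test_size=0.2, random_state=42,
)

print(train_x.shape, test_x.shape)
print(train_y.shape, test_y.shape)
```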
Using scikit-learn gives an added advantage: it includes a variety of machine learning models, so we don’t need to implement the models or the algorithms they use by hand.
We will use RandomForestClassifier, which fits multiple decision trees to the data. Finally, we train the model by passing train_x and train_y to the fit method.
Once the model is trained, we need to test it. For that we pass test_x to the predict method.
Then we check the accuracy of the model using the score method.
You can see that the model gave a mean accuracy of 86%. However, for a classification model, mean accuracy isn’t always a reliable measure. Let’s dig a little deeper and figure out how good our model is.
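The train, predict and score sequence looks roughly like this. The data here is synthetic, so the accuracy it prints says nothing about the flight model; the seeds are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 500 rows, 4 features; the label depends on
# the first feature only.
rng = np.random.RandomState(0)
X = rng.rand(500, 4)
y = (X[:, 0] > 0.7).astype(int)

train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=13)  # arbitrary seed
model.fit(train_x, train_y)

predicted = model.predict(test_x)       # predicted class labels (0 or 1)
accuracy = model.score(test_x, test_y)  # mean accuracy on the test set
print(accuracy)
```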
One of the best measures for a binary classification model like this is the Area Under the Receiver Operating Characteristic Curve (commonly referred to as “ROC AUC”). Before computing ROC AUC we must generate prediction probabilities for the test dataset.
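A sketch of these two steps, again on synthetic data: predict_proba returns one probability per class, and roc_auc_score is given the probabilities of the positive class (column 1).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a noisy label, so the classes overlap.
rng = np.random.RandomState(0)
X = rng.rand(500, 4)
y = ((X[:, 0] + 0.3 * rng.randn(500)) > 0.7).astype(int)

train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=13)  # arbitrary seed
model.fit(train_x, train_y)

# One row per test sample; column 0 = P(class 0), column 1 = P(class 1).
probabilities = model.predict_proba(test_x)
auc = roc_auc_score(test_y, probabilities[:, 1])
print(auc)
```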
Why is the AUC score lower than the mean accuracy?
The output of the score method tells us what fraction of the items in the test set the model predicted correctly. The dataset contains many more rows with on-time arrivals than late arrivals. Because of this imbalance, a model that simply predicts an on-time arrival is more likely than not to be correct.
The ROC AUC score takes this problem into account and gives a more accurate indication of how good the model is.
To learn more about the behaviour of the model, we will generate a confusion matrix, also known as an error matrix. It quantifies the number of true positives, true negatives, false positives and false negatives.
The first row shows the flights that were on time. The first column of that row shows the number of flights correctly predicted to be on time. The second column shows how many flights were predicted to be delayed but weren’t.
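To make the layout concrete, here is a small sketch with hand-written labels (not model output), where 0 means on time and 1 means late.

```python
from sklearn.metrics import confusion_matrix

# Hand-written labels for illustration: 0 = on time, 1 = late.
actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

# Rows are actual classes, columns are predicted classes:
# [[on time predicted on time, on time predicted late],
#  [late predicted on time,    late predicted late   ]]
matrix = confusion_matrix(actual, predicted)
print(matrix)
# [[5 1]
#  [2 2]]
```

Here five on-time flights were predicted correctly and one was wrongly flagged as late, while two late flights were missed and two were caught.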
Other measures of accuracy for classification models include precision and recall. Up next, we will compute the precision and recall of the model. Scikit-learn provides the precision_score and recall_score functions for this.
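As a sketch with hand-written labels (0 = on time, 1 = late): precision is the share of flights predicted late that really were late, and recall is the share of late flights that the predictions caught. By default both functions treat class 1 as the positive class.

```python
from sklearn.metrics import precision_score, recall_score

# Hand-written labels for illustration: 0 = on time, 1 = late.
actual    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 0, 0, 1, 1]

# precision = TP / (TP + FP) = 2 / (2 + 1)
precision = precision_score(actual, predicted)
# recall = TP / (TP + FN) = 2 / (2 + 2)
recall = recall_score(actual, predicted)

print(precision, recall)
```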
Visualise Model
Visualising the output of a model is the most exciting part, as you get to see your work. We will use matplotlib for this purpose.
The first line is one of the magic commands; it enables Jupyter to render matplotlib output without repeated calls to the show function. The final statement enhances the appearance of the matplotlib output.
Next we will plot the ROC curve for the probabilities we generated earlier.
The dotted line in the middle represents a 50–50 chance of getting a correct answer. The blue curve represents the model’s accuracy. If this chart appears, it means you can use matplotlib in a Jupyter notebook.
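A ROC plot of this kind can be sketched as follows. The labels and scores are hand-written for illustration. The Agg backend line is only there so the sketch also runs outside a notebook; inside Jupyter you would leave it out and rely on a magic command such as %matplotlib inline instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Hand-written labels (1 = late) and predicted probabilities of class 1.
actual = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

# False positive rate and true positive rate at each threshold.
fpr, tpr, _ = roc_curve(actual, scores)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label="model")
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="50-50 chance")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.legend()
```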
The main aim behind developing this model is to predict whether a flight will be delayed.
So, let’s write a function that calls the above model and calculates the chance of a flight arriving on time.
The function takes a date (in dd/mm/yyyy format), a time, and origin and destination airport codes as input, and returns a value between 0.0 and 1.0 indicating the probability of the flight arriving on time.
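A function along these lines might look like the sketch below. To keep it runnable on its own, it first trains a toy model on random stand-in data with the same column layout; the name predict_delay, the "%d/%m/%Y %H:%M:%S" input format and the seeds are assumptions made for this example, and the printed probability is meaningless because the training labels are random.

```python
from datetime import datetime

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

AIRPORTS = ["ATL", "DTW", "JFK", "MSP", "SEA"]

# Random stand-in training data (not real flights) so the sketch runs.
rng = np.random.RandomState(0)
n = 300
raw = pd.DataFrame({
    "MONTH": rng.randint(1, 13, n),
    "DAY_OF_MONTH": rng.randint(1, 29, n),
    "DAY_OF_WEEK": rng.randint(1, 8, n),
    "CRS_DEP_TIME": rng.randint(0, 24, n),
    "ORIGIN": rng.choice(AIRPORTS, n),
    "DEST": rng.choice(AIRPORTS, n),
})
labels = rng.randint(0, 2, n)  # 0 = on time, 1 = late
features = pd.get_dummies(raw, columns=["ORIGIN", "DEST"])
model = RandomForestClassifier(random_state=13).fit(features, labels)


def predict_delay(departure_date_time, origin, destination):
    """Return the estimated probability (0.0 to 1.0) of an on-time arrival."""
    parsed = datetime.strptime(departure_date_time, "%d/%m/%Y %H:%M:%S")
    row = {
        "MONTH": parsed.month,
        "DAY_OF_MONTH": parsed.day,
        "DAY_OF_WEEK": parsed.isoweekday(),  # 1 = Monday ... 7 = Sunday
        "CRS_DEP_TIME": parsed.hour,         # hour bin, 0 to 23
    }
    for code in AIRPORTS:
        row["ORIGIN_" + code] = 1 if origin.upper() == code else 0
        row["DEST_" + code] = 1 if destination.upper() == code else 0
    # Reorder the columns to match the training data exactly.
    frame = pd.DataFrame([row])[features.columns]
    # Column 0 of predict_proba is the probability of class 0 (on time).
    return model.predict_proba(frame)[0][0]


print(predict_delay("1/10/2018 21:45:00", "JFK", "ATL"))
```

The input row has to carry the same indicator columns, in the same order, as the data the model was trained on, which is why the sketch reorders the columns before calling predict_proba.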
Let’s test it.
What if we check it for the next day or at a different time? Try it yourself!
Experiment as you like, but use only ATL, DTW, JFK, MSP and SEA as airport codes, since these are the only airports the model is trained on.
Lastly, let’s plot the probabilities of an evening flight from JFK to ATL over the span of a week.