Data Loading and Cleaning
We first removed columns that are either unrelated to the subset of data we want to focus on or redundant with other columns. The columns removed during cleaning include:
“data_dte” (Data Date), since the existing “year” and “month” columns already carry this information and will make filtering easier once we decide exactly how to organize the data.
“type”, since this variable takes only one value (“passengers”). The distinction is already implied by the title of the data set, so dropping the column doesn’t affect the data.
“usg_apt_id” (US Gateway Airport ID), “fg_apt_id” (Foreign Gateway Airport ID), and “airlineid” (Airline ID), since these are numerical IDs that are uninformative without a key explaining what each ID means. They are also redundant, because each has a paired column (“usg_apt”, “fg_apt”, and “carrier”) that holds the same information as more readable character codes.
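The column removal described above could be sketched as follows. This is a minimal illustration in Python with pandas rather than the exact code we ran: the small data frame is a made-up stand-in for the real file, and the “total” column and all values in it are assumptions added only so the example has something to keep.

```python
import pandas as pd

# Made-up stand-in for the passenger data; the "total" column and every
# value here are illustrative assumptions, not rows from the real file.
df = pd.DataFrame({
    "data_dte": ["05/01/2014", "06/01/2014"],
    "year": [2014, 2014],
    "month": [5, 6],
    "usg_apt_id": [14492, 13204],
    "usg_apt": ["RDU", "MCO"],
    "fg_apt_id": [11032, 16085],
    "fg_apt": ["CUN", "YHZ"],
    "airlineid": [19534, 20364],
    "carrier": ["AM", "C6"],
    "type": ["passengers", "passengers"],
    "total": [110, 83],
})

# Columns identified above as unrelated or redundant.
columns_to_drop = ["data_dte", "type", "usg_apt_id", "fg_apt_id", "airlineid"]

# Sanity check: "type" should hold a single value before we discard it.
assert df["type"].nunique() == 1

# drop() returns a new data frame and leaves the original untouched.
cleaned = df.drop(columns=columns_to_drop)
print(cleaned.columns.tolist())
```

Checking that “type” has a single unique value before dropping it guards against silently discarding information if the real file turns out to contain other values.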
Because this dataset covers a period of 31 years, we will also shorten its timeframe at a later stage, once we have decided which secondary dataset we want to work with.
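Once a timeframe is chosen, the restriction could look something like the sketch below. The bounds START_YEAR and END_YEAR are hypothetical placeholders, since we have not yet settled on a range, and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows spanning several decades; values are illustrative only.
df = pd.DataFrame({
    "year": [1990, 2005, 2015, 2020],
    "month": [1, 6, 12, 3],
    "carrier": ["AA", "DL", "UA", "AM"],
})

# Hypothetical bounds; the real range depends on the secondary dataset.
START_YEAR, END_YEAR = 2010, 2020

# between() is inclusive on both ends by default.
recent = df[df["year"].between(START_YEAR, END_YEAR)]
print(recent)
```

Filtering on the “year” column directly is one reason the separate “year” and “month” columns are more convenient to keep than the combined date column.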