Data

Data Sources:

The US international travel data was collected from the United States Department of Transportation.

https://data.transportation.gov/Aviation/International_Report_Passengers/xgub-n9bw

The data was put together by Randall Keizer of the Department of Transportation. It was collected in order for the United States government to keep track of the number of passengers on non-stop flights to most of the countries around the world.

The data for GDP, GDP growth, GDP per capita, and Population for these countries was collected from the World Bank.

GDP: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
GDP growth: https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG
GDP per Capita: https://data.worldbank.org/indicator/NY.GDP.PCAP.CD
Population: https://data.worldbank.org/indicator/SP.POP.TOTL

The data was collected by the World Bank in order to reduce global poverty. The data help the organization to gauge global prosperity.

Data File Explanation

plane_data_clean is the file for international travel data. Year and month indicates the year and month of the flight, US_Airport indicates the us airport code, FG_Airport indicates the foreign airport code, US_wac indicates the 3 digit world area code of the us airport, FG_wac indicates the 3 digit world area code of the foreign airport, and Total indicates the total passengers for that flight. The world area codes indicate a specific area of the world that the airport is in, and foreign world area codes typically correspond to a single specific country.

gdp_data_clean, gdp_percapita_data_clean, and gdp_growth_data_clean are for the different countries’ GDP data. Each file has a Country Name column which indicates the name of the country, and year which indicates the year for the specified data. In gdp_data_clean, GDP indicates the GDP in US dollars, in gdp_percapita_clean, GDPPerCapita indicates the GDP per capita in US dollars, while in gdp_growth_data_clean, GDPGrowth indicates the GDP growth by positive or negative percent of the previous year.

population_data_clean is formatted in the same way as the GDP datasets, but with population instead of GDP value. The Country Name and Year columns correspond to the year and country of that population.

Data Cleaning

load_and_clean_data.R

The plane data from the US department of transportation was cleaned by only keeping relevant columns from the original dataset. The following redundant columns were removed:

A date column that had the date formatted by MM/DD/Year as separate Month and Year columns were present.
Any numeric id columns which already had a corresponding 3 character code, such as US airport id, foreign airport id, and airlineid.
The type column, as there was only one option available which was “passenger”.

We then loaded and cleaned the GDP data. The country code, indicator name and indicator code columns of the dataset were removed, as they were unnecessary for our analysis. We then applied a pivot longer to the dataset in order to match the formatting of the original international passenger data. All the years were put into a “Year” column, instead of each year having its own column, and the corresponding GDP was put into a “GDP” column. The final format had a “Country Name” column, a “Year” column, and a “GDP” column for the GDP of that country in that year.

This data was then joined to the cleaned passenger data. However, the original passenger data only specified countries in the form of a 3 digit world area code. Since the GDP data specified country by country name, we had to load and join a dataset which specified the world area code, and corresponding country name for the code. The data for this was collected from the following wikipedia page https://en.wikipedia.org/wiki/World_Area_Codes. Joining this data then allowed us to left_join the cleaned international flight data by Country Name and Year.

The GDP per capita, GDP growth data, and Population data was loaded, cleaned, and joined in the same way, as these datasets were formatted the same way as the GDP data as the World Bank collected all 4 datasets.

Previous Big Picture