Dataset Proposal

2022-03-04

Dataset 1

NBA Players

The dataset is from Kaggle and describes NBA players from the 1996-1997 season through the 2020-2021 season. The data has 22 columns and about 11.7K rows. The columns describe either player information, such as name, draft year, and draft number, or statistics, such as games played, points, assists, and rebounds. A player who was in the league for multiple seasons has one row per season. For example, LeBron James has 18 rows, because he had played 18 seasons through the 2020-2021 season. The data was originally collected via the NBA Stats API and Basketball Reference and was likely aggregated into a CSV file to allow faster analysis of player data by both the creator of the dataset and other fans of the sport.

Loading and cleaning the data should not be a problem, as filtering functions will let us select the values we are interested in. The main question I hope to answer is how scoring has increased over various eras of basketball. We will likely divide the player data into five-year chunks and look at the leaders in points per game, as well as the average points per game, in each era. We can also compute the average draft position of players in each season and see whether there is a trend in the average draft position of players still in the league from 1996 to 2021. This question can provide insight into NBA front offices and their drafting ability. The main challenge I see with this data is handling non-numerical fields, such as player name, which might lead to data cleaning issues.
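
As a rough illustration of the five-year grouping described above, here is a minimal Python sketch. The sample rows, the `"1996-97"` season format, and the helper names `era_start` and `summarize_by_era` are all assumptions made for the example and are not taken from the Kaggle file.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows: (player name, season, points per game).
# The real file would be read from the CSV; these values are made up.
rows = [
    ("Player A", "1996-97", 20.0),
    ("Player B", "1998-99", 10.0),
    ("Player A", "2001-02", 25.0),
    ("Player C", "2003-04", 15.0),
    ("Player C", "2020-21", 30.0),
]


def era_start(season, first_year=1996, width=5):
    """Map a season string such as '2001-02' to the first year of its era."""
    start_year = int(season[:4])
    return first_year + ((start_year - first_year) // width) * width


def summarize_by_era(rows):
    """Return the mean points per game and the scoring leader for each era."""
    by_era = defaultdict(list)
    for name, season, pts in rows:
        by_era[era_start(season)].append((pts, name))

    summary = {}
    for era, entries in sorted(by_era.items()):
        leader_pts, leader = max(entries)  # highest points per game in the era
        summary[era] = {
            "mean_pts": mean(pts for pts, _ in entries),
            "leader": leader,
            "leader_pts": leader_pts,
        }
    return summary


summary = summarize_by_era(rows)
for era, stats in summary.items():
    print(era, stats)
```

With 25 seasons starting in 1996, a width of five gives five equal eras; a different width would leave a shorter final chunk.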

(link: https://www.kaggle.com/justinas/nba-players-data)

Dataset 2

Commercial Passenger Traffic International Report

The dataset is from the U.S. Department of Transportation and describes commercial passenger traffic between international points and U.S. airports from the early 2000s to 2022. The data has about 698K rows and 16 columns. The columns describe the date of the traffic, U.S. gateway information such as airport ID and airport code, the same information for the foreign gateway, the airline ID, and carrier information. The data was collected by the Department of Transportation to gather information on foreign commercial passenger travel to and from the United States.

One question I hope to address is how international travel trends have changed over time. For example, we can analyze which foreign gateways were most popular in a given time frame. We can also analyze how foreign travel rises and falls by month of the year, as well as which years saw decreased foreign travel. We can also analyze which airlines have increased or decreased in customer usage. Some challenges I foresee are figuring out what exactly the foreign gateway airport IDs and codes stand for, and how to connect the different identifiers to a specific gateway.
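
The yearly, monthly, and per-gateway comparisons mentioned above could be sketched in Python roughly as follows. The records, the `MM/DD/YYYY` date format, and the three-field layout are assumptions for illustration only; the actual report has 16 columns whose exact names would need to be checked.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: (date, foreign gateway code, passenger count).
# Values are made up; the date format is an assumption.
records = [
    ("01/01/2019", "LHR", 100),
    ("07/01/2019", "LHR", 300),
    ("07/01/2019", "CUN", 200),
    ("01/01/2020", "CUN", 50),
    ("07/01/2020", "LHR", 25),
]

by_year = Counter()     # total passengers per calendar year
by_month = Counter()    # total passengers per month of the year, across years
by_gateway = Counter()  # total passengers per foreign gateway

for date_text, gateway, passengers in records:
    date = datetime.strptime(date_text, "%m/%d/%Y")
    by_year[date.year] += passengers
    by_month[date.month] += passengers
    by_gateway[gateway] += passengers

print("By year:", dict(by_year))
print("By month:", dict(by_month))
print("Most popular gateway:", by_gateway.most_common(1))
```

The same pattern would extend to airlines by adding a counter keyed on the carrier field.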

(link: https://data.transportation.gov/Aviation/International_Report_Passengers/xgub-n9bw)

Dataset 3

Border Crossing Entry Data

The dataset is from the U.S. Bureau of Transportation Statistics (BTS) and provides summary statistics for inbound crossings at U.S./Canada and U.S./Mexico border ports. The data is collected by U.S. Customs and Border Protection (CBP) at ports of entry and reflects the number of vehicles, passengers, containers, and pedestrians that enter the United States through these ports. The dataset consists of 371,000 rows that each represent a unique inbound Border Crossing Entry, and 7 columns, each representing a variable: Port Name, State, Port Code, Border, Date, Measure, and Value. Measure is the type of inbound crossing (trucks, pedestrians, etc.), and Value is the count for that inbound crossing.

The data is used by the CBP and BTS to analyze trends in North American trade and travel, and is combined with other BTS datasets to assess trade and travel flows in North America. BTS obtains the monthly data that the CBP gathers at ports, then summarizes and disseminates it in monthly reports. The data also gives a sense of expected port traffic and allows visualizing which ports are busiest and when.

I anticipate a few challenges when working with this dataset. The first is that the data is limited to land ports, which excludes all information on air and sea ports. Another challenge may be organizing the variables so that the resulting set is informative. Port Code and Port Name seem slightly redundant, for example, but that could certainly be fixed or modified. As for questions to answer through data exploration, I believe using Value as the dependent variable would be the best way to begin, and then determining which other columns are correlated with or predictive of Value.
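
One simple way to start treating Value as the dependent variable is to compare its mean across the categories of each other column. The Python sketch below does this on made-up rows; the port names, border labels, and the helper `mean_value_by` are hypothetical, and only the column names come from the description above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows using four of the seven columns; all values are made up.
crossings = [
    {"Port Name": "Port X", "Border": "US-Canada Border", "Measure": "Trucks", "Value": 100},
    {"Port Name": "Port X", "Border": "US-Canada Border", "Measure": "Pedestrians", "Value": 20},
    {"Port Name": "Port Y", "Border": "US-Mexico Border", "Measure": "Trucks", "Value": 300},
    {"Port Name": "Port Y", "Border": "US-Mexico Border", "Measure": "Pedestrians", "Value": 500},
]


def mean_value_by(rows, column):
    """Return the mean of Value for each distinct entry in the given column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row["Value"])
    return {key: mean(values) for key, values in groups.items()}


by_measure = mean_value_by(crossings, "Measure")
by_border = mean_value_by(crossings, "Border")
print(by_measure)
print(by_border)
```

Columns whose group means differ widely are natural candidates for a closer look, though a gap in means alone does not establish predictive power.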

(link: https://data.bts.gov/Research-and-Statistics/Border-Crossing-Entry-Data/keg4-3bc2)
