Chapter 3 Data transformation

The script for data transformation can be found here:

3.1 For Short Count Volume dataset

3.1.1 Drop unnecessary columns

We dropped the following columns because they are not needed for our tasks:


3.1.2 Aggregate the Time Columns

The original dataset records the volume count for every 15 minutes. That level of detail is more than we need, and so many columns would make our plots overly complex. We therefore combined the counts to get one count per hour for each set of observations.
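The aggregation can be sketched as follows. This is a minimal illustration rather than our actual script: it assumes that "combining" means summing, that each observation holds 96 quarter-hour counts in chronological order, and that missing values have already been handled. The function name `to_hourly` is hypothetical.

```python
def to_hourly(quarter_hour_counts):
    """Sum consecutive groups of four 15-minute counts into hourly totals.

    Assumes a full day of 96 counts in chronological order (an assumption
    made for this sketch, not a documented property of the dataset).
    """
    if len(quarter_hour_counts) != 96:
        raise ValueError("expected 96 quarter-hour counts for one day")
    return [
        sum(quarter_hour_counts[hour * 4:(hour + 1) * 4])
        for hour in range(24)
    ]


# Illustrative input only: the values 0..95 stand in for real counts.
hourly = to_hourly(list(range(96)))
```

Each day of observations shrinks from 96 columns to 24, which keeps the hourly pattern visible while making the plots much easier to read.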

3.1.3 Use region name to replace region code

Many of our plots need to identify regions by name rather than by code so that readers can tell which region is meant. We therefore replaced each region code in the dataset with the corresponding region name.

3.1.4 Use county name to replace the combination of region code and county code

We need the county name as the key for pairing our data with the shapefile, but the dataset does not provide one. We therefore combined the region code and the county code to identify each county, then translated that pair into the county name following the official document “NYSDOT Region and County Codes”.
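A lookup keyed on the code pair illustrates the idea. This is a sketch under stated assumptions: the column names `REGION` and `COUNTY`, the function `county_name`, and every entry in the table are hypothetical placeholders, not values copied from the official document.

```python
# Placeholder mapping: (region code, county code) -> county name.
# These entries are invented for illustration; the real table comes from
# the "NYSDOT Region and County Codes" document.
COUNTY_NAMES = {
    (1, 1): "Example County A",
    (1, 2): "Example County B",
    (2, 1): "Example County C",
}


def county_name(region_code, county_code):
    """Return the county name for a code pair, or None if it is unknown."""
    return COUNTY_NAMES.get((int(region_code), int(county_code)))


# Codes may arrive as strings or integers, so both are normalized to int.
records = [
    {"REGION": "1", "COUNTY": "2"},
    {"REGION": 2, "COUNTY": 1},
    {"REGION": 9, "COUNTY": 9},  # pair absent from the placeholder table
]
named = [
    dict(r, COUNTY_NAME=county_name(r["REGION"], r["COUNTY"]))
    for r in records
]
```

Returning `None` for an unknown pair makes unmatched records easy to find before joining with the shapefile.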

3.1.5 Fixing the obvious errors in the dataset

As mentioned in Chapter 2, a few records have data in the wrong columns. Because the number of these errors is small (12 in total), we fixed them by hand.

3.2 For Average Weekday Speed dataset

We extracted the ‘SPEED’ column and paired each record with the vehicle count dataset by “COUNT_ID”, “YEAR”, “MONTH”, “DAY” and “FEDERAL_DIRECTION”. However, we later found that too many ‘SPEED’ values are missing, so we decided not to analyze the ‘SPEED’ data any further (see Chapter 4, ‘Missing Values’, for more details).
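The pairing on the five key columns, and the check that led us to drop ‘SPEED’, can be sketched as below. The helper names `attach_speed` and `missing_share` and the sample rows are hypothetical; only the key column names come from the dataset. The sketch also assumes each key combination appears at most once in the speed data.

```python
KEYS = ("COUNT_ID", "YEAR", "MONTH", "DAY", "FEDERAL_DIRECTION")


def attach_speed(count_rows, speed_rows):
    """Add a SPEED field to each count row, or None when no speed row matches."""
    speed_by_key = {
        tuple(row[k] for k in KEYS): row["SPEED"] for row in speed_rows
    }
    return [
        dict(row, SPEED=speed_by_key.get(tuple(row[k] for k in KEYS)))
        for row in count_rows
    ]


def missing_share(rows):
    """Fraction of rows whose SPEED is missing."""
    return sum(row["SPEED"] is None for row in rows) / len(rows)


# Invented sample rows, for illustration only.
counts = [
    {"COUNT_ID": "A1", "YEAR": 2020, "MONTH": 3, "DAY": 2, "FEDERAL_DIRECTION": 1},
    {"COUNT_ID": "A1", "YEAR": 2020, "MONTH": 3, "DAY": 2, "FEDERAL_DIRECTION": 5},
]
speeds = [
    {"COUNT_ID": "A1", "YEAR": 2020, "MONTH": 3, "DAY": 2,
     "FEDERAL_DIRECTION": 1, "SPEED": 42.5},
]
merged = attach_speed(counts, speeds)
share = missing_share(merged)
```

Keeping unmatched count rows, with `SPEED` set to `None`, is what makes the share of missing values measurable after the pairing.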

3.3 For New York State Statewide COVID-19 Testing dataset

We did not transform this dataset, as it is already in the format we need and is simple in structure.