Chapter 2 Data sources

The primary data sources of this project are New York State Department of Transportation and New York State Department of Health. They are the organizations who are responbsible for collecting thest data.

The New York State Department of Transportation(NYSDOT) monitored and gathered the traffic data including vehicle classification, volume and speed, and they are accessible as comma delimited text files.The data is collected by using the portable traffic counters. All portable traffic counters have been tested annually, to ensure that their count deviation does not exceed 2%. To be more specific, each count is collected by one of the following equipment:

1. Diamond Traffic Products(models Unicorn & Phoenix)
2. MetroCount MC5805
3. MetroCount 5600

TRAFMAN (Diamond Traffic Products), Road Reporter (International Road Dynamics), and Traffic Executive (MetroCount) software are used by the Main Office and Regional Offices of NYSDOT to read traffic count data from electronic counters and prepare the requisite output format files. These counters and programs relevant to the generation of NYSDOT format files may be assisted by the Main Office.

The New York State Department of Health(NYSDOH) reported the New York State Statewide COVID-19 testing results and and it is also accessible as comma delimited text files.

There are also some other data sources similar to what we used, but they are not published by official orginazation and they are adopted from the data sources above. Thus, to make our result accurate, we used the original and official data.

In summary, we utilized two datasets from NYSDOT: Short Count Volume dataset and Average Weekday Speed dataset, and one dataset from NYSDOH: New York State Statewide COVID-19 Testing dataset.

2.1 Short Count Volume dataset from the New York State Department of Transportation

Link to the dataset: https://www.dot.ny.gov/divisions/engineering/technical-services/highway-data-services/hdsb

This dataset contains information about the traffic count for different time slots on different days from 2018 to 2020, so we can get the general pattern, and help us analyze how COVID-19 impact the pattern of traffic volume.

It has very detailed information for each record, such as region, county, direction, etc., so that we can group data by different categories and get insights about the patterns for each kinds of groups.

Table 2.1: Major columns used
Column Description
COUNT_ID A unique ID for each count session loaded, each count has one Count_ID for all data types.
REGION A single digit code for each NYSDOT Region. Can be concatenated with County_Code and Station number to create a unique ID
COUNTY_CODE A single digit code for each County within a NYSDOT Region. Can be concatenated with Region_Code and Station number to create a unique ID
STATION A four digit number unique within a county representing a specific segment of road for traffic counting purposes. Can be concatenated with Region_Code and County_Code to create a unique ID.
SPECIFIC_RECORDER_PLACEMENT Verbal description of the primary counter placement.
YEAR The year in which the data was collected.
MONTH The month in which the data was collected.
DAY The Day of data collection.
DAY_OF_WEEK The day of week the data was collected.
FEDERAL_DIRECTION The federal direction code for the data record. 1 – North, 3 – East, 5 – South, 7 – West, 9 – North/South Combined, 0 – East/West combined
LANE_CODE The lane code by direction, starting with 1 as the rightmost lane.
LANES_IN_DIRECTION The total number of lanes expected in this direction.
Interval_1_1 through Interval_24_4 Volume Data for each 15 minute interval. The hourly data will be represented in intervals 1_1, 2_1, etc.
TOTAL The sum of all intervals for the record.

Number of Records:

There are 99454 records for year 2018, 122967 records for year 2019, and 110536 records for year 2020. In total there are 332,957 records of data.

Potential issues found in this dataset:

  1. The data for January 2020 and February 2020 seems to be missing.
  2. There are some records with the data in the wrong columns, for example, there are 6 rows with DATATYPE=2020, YEAR=6, MONTH=1, DAY=Tuesday, which seems not valid but may be an input error, we need to fix them manually.

2.2 Average Weekday Speed dataset from the New York State Department of Transportation

Link to the dataset: https://www.dot.ny.gov/divisions/engineering/technical-services/highway-data-services/hdsb

As it provided the average speed data for each day, we can combine it with the short count volume dataset to see the relationship between the traffic count and the average speed for each day on average. This helps us solve the second question above. We only used the column ‘SPEED’(Which represents the average speed for a testing period), and pair each records to the vehicle count dataset by “COUNT_ID”, “YEAR”, “MONTH”, “DAY” and “FEDERAL_DIRECTION”.

As we have discussed in chapter 4, we found too many missing data for speed, so we did not use them in analysis. Thus, we did not includes too much details about this dataset here.

2.3 New York State Statewide COVID-19 Testing dataset from the New York State Department of Health

Link to the dataset: https://health.data.ny.gov/Health/New-York-State-Statewide-COVID-19-Testing/xdss-u53e/data

This dataset provides the number of new positive and cumulative cases of Covid-19 in the New York State from 2020 March 1st to the present. Thus, we can use it to make a plot showing the trend of the pandemic for the entire state, and then compare the trend with traffic volume in the New York state. This helps us solve the third question above.

Table 2.2: Major columns used
Column Description
Test Date The date the test result was processed by the NYS Electronic Clinical Laboratory Reporting System (ECLRS).
County The county of residence for the person tested.
New positives The number of new persons tested positive for COVID-19 infection on the test date in each county.
Cumulative number of positives Running total for the number of persons tested positive for COVID19 infection in each county as of the test date.
Total number of tests performed The number of tests of individuals performed on the test date in each county. This total includes positives, negatives, and inconclusive results.
Cumulative number of tests performed Running total for the number of tests of individuals performed in each county as of the last update to the dataset. This total includes positives, negatives, and inconclusive results.

Number of Records:

There are 40,424 records of data in total

Potential issues found in this dataset:

No potential issues found in this simple dataset

2.4 Other resources

  1. https://www.dot.ny.gov/regional-offices

The above website provided us the way to interpret the region data, the original region data is the number corresponding to each regions.(ex. 11 for New York City)

  1. https://www.dot.ny.gov/gisapps/functional-class-maps

This website provided us the way to interpret the region county codes, the original region data are 2 separated number corresponding to region and county.

  1. https://www.dot.ny.gov/divisions/engineering/technical-services/hds-respository/Traffic%20Data%20Report%202011%20Appendix%20D%20-%20NYSDOT%20Regions%20and%20County%20Codes.pdf

This document provided us the way to interpret the functional class data, the original region data is the number corresponding to each classification.(ex. 01 represents interstate principal arterial in the rural area)

  1. https://www.dot.ny.gov/divisions/engineering/technical-services/highway-data-services/hdsb/repository/Field_Definitions_SC%20Formats.pdf

This document provided us the way to know the meaning of each columns in the ‘New York State Statewide COVID-19 Testing dataset’

  1. https://health.data.ny.gov/Health/New-York-State-Statewide-COVID-19-Testing/xdss-u53e/data

This document provided us the way to know the meaning of each columns in the ‘Short Count Volume dataset’

  1. https://cugir.library.cornell.edu/catalog/cugir-007865

The above website provided us the shapefile for New York Sate and it is used in the interactive component