Investigation of US Traffic Accidents and Prediction of Accident Severity


Traffic accidents are a leading cause of non-natural death in the US. In 2019 there were 33,244 fatal motor vehicle crashes in the US, in which 36,096 people died. This corresponds to 11 deaths per 100,000 people and 1.11 deaths per 100 million miles traveled. An additional 4.4 million people are injured seriously enough to require medical attention. The available evidence suggests that the US suffers the most road crash deaths among high-income countries, about 50% higher than similar countries in Western Europe, Canada, Australia and Japan. It is therefore urgent to understand the circumstances under which traffic accidents occur. This analysis investigates how accident occurrence relates to time of day, day of the week, season, and weather conditions, and builds a neural network for instantaneous prediction of accident severity. The findings provide useful information for police departments deciding how to distribute patrols, together with an efficient tool for predicting the severity of a newly reported accident.


The dataset was obtained from Moosavi et al. (2019a) and Moosavi et al. (2019b), who collected the data from MapQuest and Bing and continue to update it. The dataset includes about 4.2 million traffic accidents covering 49 US states from 2016 to 2020.

Data cleaning

First, all accidents with a missing value in any feature were excluded, so that no manual imputation could influence the performance of the neural network. Next, point-of-interest features with very few positive samples were excluded. To this end, the percentage of “True” values was calculated for each point-of-interest feature, and the 75th percentile of these percentages was taken as a threshold. Point-of-interest features whose percentage of “True” values was smaller than the threshold were excluded. The retained features were “Crossing”, “Junction”, and “Traffic_Signal”. Finally, all other columns not needed for the analysis were deleted: ID, Source, TMC, Start_Time, End_Time, End_Lat, End_Lng, Number, Street, City, County, Country, Zipcode, Timezone, Airport_Code, Wind_Chill(f), Turning_Loop, Sunrise_Sunset, Civil_Twilight, Nautical_Twilight, Astronomical_Twilight, Weather_Timestamp, Description.
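The point-of-interest filter described above can be sketched in pandas as follows. This is a minimal illustration rather than the code used in the analysis: the function name `filter_poi_features` and the toy data frame are invented here, and only the 75th-percentile rule comes from the text.

```python
import pandas as pd


def filter_poi_features(df: pd.DataFrame, poi_cols: list[str], q: float = 0.75) -> pd.DataFrame:
    """Drop rows with missing values, then drop sparse point-of-interest columns."""
    df = df.dropna()
    # Share of True values in each boolean point-of-interest column.
    true_share = df[poi_cols].mean()
    # Threshold: the q-th quantile of those shares (75th percentile by default).
    threshold = true_share.quantile(q)
    # Columns strictly below the threshold are considered too sparse.
    sparse_cols = true_share[true_share < threshold].index
    return df.drop(columns=list(sparse_cols))


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Severity": [1, 2, 2, 3, 4, 2, 3, 2],
    "Crossing": [True, True, False, True, False, True, False, True],
    "Junction": [True, False, True, True, False, True, True, False],
    "Bump": [False, False, False, False, False, False, False, True],
    "Roundabout": [False] * 8,
})
cleaned = filter_poi_features(toy, ["Crossing", "Junction", "Bump", "Roundabout"])
print(list(cleaned.columns))
```

Because the threshold is itself a percentile of the per-feature shares, roughly the top quarter of point-of-interest features survive, however rare "True" is in absolute terms.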

Data preprocessing

First, the start time of each accident was converted to a representative time period, i.e. early morning, morning, afternoon, evening, night, or late night. The date and month in which each accident occurred were converted to the day of the week and the season, respectively. Second, because the data report a wide variety of weather conditions, each accident was assigned to one of the most common conditions. For example, “Snow”, “Wintry”, “Sleet”, and “Ice” were all treated as “Snow”. The most common weather conditions were clear, fair, cloudy, windy, rain, snow, obscuration, and sandstorm. Wind directions were simplified to north, south, west, east, northeast, northwest, southeast, and southwest. Third, a Box-Cox transformation was applied to the continuous features, i.e. Visibility(mi), Pressure(in), Wind_Speed(mph), and Precipitation(in), to reduce their skewness and make their distributions more Gaussian-like. Fourth, one-hot encoding was applied to the categorical features whose values were not Boolean. Next, the entire dataset was split into training and testing datasets (80% vs. 20%). The training dataset was then split again, with 80% used for training and the remaining 20% used for validation. Finally, the continuous variables in each dataset were standardized to avoid poor performance due to large, unstable weights during learning and to minimize generalization error (REF).
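The time and weather conversions can be sketched with the standard library alone. The hour boundaries, the use of meteorological seasons, the "rain" keywords, and the "other" fallback are assumptions made for this example, since the text does not state them. Only the grouping of snow-like conditions is taken from the description above.

```python
from datetime import datetime


def time_period(hour: int) -> str:
    """Map an hour (0-23) to a named period. Boundaries are assumed, not from the analysis."""
    if hour < 4:
        return "late night"
    if hour < 8:
        return "early morning"
    if hour < 12:
        return "morning"
    if hour < 17:
        return "afternoon"
    if hour < 21:
        return "evening"
    return "night"


def season(month: int) -> str:
    """Map a month (1-12) to a meteorological season (northern hemisphere, assumed)."""
    return ["winter", "spring", "summer", "fall"][(month % 12) // 3]


# Keyword groups checked in order. The snow group follows the text;
# the rain group is a hypothetical example of the same idea.
WEATHER_GROUPS = [
    ("snow", ("snow", "wintry", "sleet", "ice")),
    ("rain", ("rain", "drizzle", "shower")),
]


def simplify_weather(raw: str) -> str:
    """Collapse a raw weather string into a coarse category by keyword match."""
    lowered = raw.lower()
    for label, keywords in WEATHER_GROUPS:
        if any(keyword in lowered for keyword in keywords):
            return label
    return "other"  # fallback label, assumed for this sketch


start = datetime(2019, 1, 5, 7, 30)
print(time_period(start.hour), start.strftime("%A"), season(start.month))
print(simplify_weather("Light Sleet"))
```

Functions of this shape can be applied column-wise to a data frame, for example with `Series.map`, before one-hot encoding the resulting categories.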

First, for each severity level and state, the number of accidents was counted and divided by the total number of accidents of the corresponding severity, giving the accident rate of each state. Second, for each severity level, the percentage of accidents was calculated by time period, day of the week, season, and weather, respectively.
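Both calculations are row-normalized cross-tabulations, which can be sketched with `pandas.crosstab`. The toy data below are invented, and the column names `Severity` and `State` are assumed to match the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Severity": [1, 1, 2, 2, 2, 2, 3, 4],
    "State": ["AZ", "AZ", "CA", "CA", "TX", "FL", "GA", "NY"],
})

# Rows are severity levels, columns are states. normalize="index" divides each
# row by its total, so every row sums to 1 (share of that severity per state).
rate_by_state = pd.crosstab(toy["Severity"], toy["State"], normalize="index")
print(rate_by_state.round(2))
```

Replacing `State` with a time-period, day, season, or weather column gives the corresponding percentage of occurrence for each severity level.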

A neural network with three fully connected hidden layers was created with a pyramid structure, which reduces the dimension layer by layer towards the output layer (REF). In this structure the number of hidden units is halved on each subsequent layer. A ReLU activation function with the “he_uniform” kernel initializer, followed by batch normalization, was used on each hidden layer (REF). Batch normalization stabilizes the learning process and dramatically reduces the learning time (REF). On the output layer, a softmax activation function was employed for severity classification. Early stopping, which monitors the validation loss with a patience of 10 epochs, was used to prevent overfitting while still allowing sufficient training epochs.
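To make the layer structure concrete, here is a NumPy sketch of a forward pass through such a pyramid network. It is not the model used in the analysis, which is found in the linked notebook. The first hidden width (64), the input width (20), the omission of biases and of the learned batch-normalization scale and shift, and the use of the same initializer on the output layer are all assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)


def pyramid_sizes(first_hidden: int, n_hidden: int = 3) -> list[int]:
    """Hidden widths that halve on each subsequent layer."""
    return [first_hidden // (2 ** i) for i in range(n_hidden)]


def he_uniform(fan_in: int, fan_out: int) -> np.ndarray:
    """He uniform initialization: U(-limit, limit) with limit = sqrt(6 / fan_in)."""
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def batch_norm(x: np.ndarray, eps: float = 1e-3) -> np.ndarray:
    """Normalize each unit over the batch (no learned scale or shift in this sketch)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)


def softmax(z: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(x: np.ndarray, weights: list[np.ndarray]) -> np.ndarray:
    """Dense -> ReLU -> batch norm on each hidden layer, then dense -> softmax."""
    *hidden, out = weights
    for w in hidden:
        x = batch_norm(np.maximum(x @ w, 0.0))
    return softmax(x @ out)


n_features, n_classes = 20, 4  # n_features is hypothetical; 4 severity levels
sizes = [n_features] + pyramid_sizes(64) + [n_classes]
weights = [he_uniform(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
probs = forward(rng.normal(size=(8, n_features)), weights)
print(sizes, probs.shape)
```

Each row of `probs` is a probability distribution over the four severity levels, which is what the softmax output layer provides.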

Neural Network Architecture

Because the data has a highly imbalanced class distribution, as shown in the figure below, a weight for each accident severity was calculated using scikit-learn and assigned to the corresponding class during training.
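The text does not say which weighting scheme was used. A common choice is scikit-learn's "balanced" heuristic, which weights each class by n_samples / (n_classes * class_count). The sketch below reimplements that formula in NumPy on invented labels, under the assumption that this is the scheme in question.

```python
import numpy as np


def balanced_class_weights(y: np.ndarray) -> dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


# Toy labels, invented for illustration: severity 2 dominates.
y_toy = np.array([1] * 1 + [2] * 6 + [3] * 2 + [4] * 1)
class_weights = balanced_class_weights(y_toy)
print(class_weights)
```

Rare classes receive weights above 1 and the dominant class a weight below 1, so each class contributes equally to the weighted loss in aggregate.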

Highly Imbalanced Class Distribution

A batch size of 1024 was adopted in the training process. This resulted in about 2.6k batches per epoch, a setting that retains sufficient generalization ability (REF). With a learning rate of 1e-6, the neural network was trained for 50 epochs, which led to the most balanced performance across the four severity levels.
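Two parts of this training setup reduce to simple bookkeeping: the number of batches per epoch, and the patience rule of early stopping on the validation loss (a patience of 10 epochs in this analysis). The sketch below is framework-independent. The training-set size and the loss values are hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def batches_per_epoch(n_train: int, batch_size: int) -> int:
    """Number of batches needed to cover the training set once."""
    return math.ceil(n_train / batch_size)


def stopping_epoch(val_losses: list[float], patience: int = 10) -> int:
    """Return the 1-based epoch at which patience-based early stopping ends training."""
    best = math.inf
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Validation loss improved: remember it and reset the counter.
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # Patience never ran out: training uses every available epoch.
    return len(val_losses)


# Hypothetical training-set size that yields roughly 2.6k batches.
print(batches_per_epoch(2_660_000, 1024))

# Hypothetical validation losses: improvement stops after epoch 3.
losses = [1.0, 0.9, 0.8] + [0.85] * 12
print(stopping_epoch(losses, patience=10))
```

If the validation loss keeps improving at least once every 10 epochs, the rule never triggers and training runs for the full epoch budget.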

Finally, the features used as input to the neural network included the location, time, season, and weather at which each accident occurred. The details can be found in this notebook.

Results and Discussion

The figure below shows the accident rate by state. Arizona ranked first for mild accidents. California and Florida had accidents across all severity levels. Texas had moderate to serious accidents, and Georgia and New York had serious and severe accidents.

Accident Rate by State

As for the occurrence of accidents by time period (the figure below), most accidents occurred in the morning and evening, which are the peak traffic periods. Accidents that occurred at night or late night tended to be severe.

Percentage of Occurrence by Time

More accidents typically occurred on weekdays than on weekends, regardless of severity, as shown in the figure below. Accidents that did occur on weekends, however, were more likely to be severe.

Percentage of Occurrence by Day Name

Most mild accidents occurred in the summer, while most moderate to severe accidents occurred in the winter.

Percentage of Occurrence by Season

Regardless of severity, most accidents occurred in good weather, such as clear, fair, and cloudy conditions. When accidents occurred in windy, rainy, snowy, or obscured conditions, however, they were most likely to be serious.

Percentage of Occurrence by Weather

Police departments may distribute their patrols according to this analysis. For example, more patrols may be needed in California and Florida, because these states had accidents across all severity levels during 2016–2020. More patrols in the morning and evening may reduce the overall number of accidents, and more patrols at night and late night may reduce the number of severe accidents. Police departments may also pay more attention to winter, to reduce the occurrence of moderate to severe accidents, and may need to increase patrols in good weather, such as clear, fair, and cloudy conditions, when most accidents occur.

The confusion matrix below shows the performance of the neural network on an independent testing dataset. The network correctly predicted 75.4%, 58.6%, 65.0%, and 53.3% of the mild, moderate, serious, and severe accidents, respectively. The recall scores are shown below.

Confusion Matrix
Recall Score of Each Severity Level
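Per-class recall can be read directly from a confusion matrix: it is each diagonal entry divided by its row total, assuming rows hold the true labels and columns the predictions. The matrix below is a small invented example, not the one from this analysis.

```python
import numpy as np


def per_class_recall(cm) -> np.ndarray:
    """Recall per class: correct predictions divided by true instances (row sums)."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)


# Toy confusion matrix, invented for illustration.
# Rows are true severity levels, columns are predicted severity levels.
toy_cm = [
    [3, 1, 0, 0],
    [1, 2, 1, 0],
    [0, 1, 3, 0],
    [0, 0, 2, 2],
]
recalls = per_class_recall(toy_cm)
print(recalls)
```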

However, a certain number of accidents were misclassified at each severity level. This could be a result of the data itself: severity is assessed by traffic delay, and there is no clear boundary between severity levels in terms of delay.

Nevertheless, the neural network offers a useful tool for police departments and companies working on traffic accidents to instantaneously predict the severity of a new accident as soon as it is reported. In particular, police departments would have better control over the distribution of patrols, because the softmax activation function returns a probability for each severity level.


In this analysis, the relevance of accident occurrence to time period, location, season, and weather has been investigated. The results provide police departments with useful information for reducing the occurrence of traffic accidents, and the neural network can serve as a tool for instantaneous prediction of accident severity. Future work may include natural language processing of the accident descriptions to enhance the predictive power of the neural network. The neural network could also be retrained to predict the traffic delay and the distance affected by accidents.

This analysis was done at Metis Data Science Bootcamp and the code is posted here.
