Tornado Data Analysis Project – Intro to Industrial Engineering

Authors
Published

May 7, 2025

As part of my Intro to Industrial Engineering course, I completed a data analysis project focused on tornado occurrences in the United States between 1950 and 2021. Using Excel, I created four assertion-evidence slides to explore different questions related to tornado frequency, severity, location, and impact. Each graph was designed to support a clear conclusion based on the data, following principles of effective data visualization and storytelling.

The dataset contained historical records of all reported tornadoes in the U.S. during that time period, and included both numerical and categorical variables. The represented columns were as follows: year, month, day, state, EF level (magnitude), number of injuries, number of deaths, start and end longitude, start and end latitude, tornado path length (in miles), and width (in yards). This wide range of variables allowed for a detailed exploration of various tornado-related phenomena, both temporal and geographic.

The first question I explored was: From 1950 to 2021, is there a change in the average number of injuries associated with the average number of tornadoes each year? To answer this, I created two separate time-series plots, each using “year” as the x-axis. The first graph shows the average number of injuries per tornado in each year, and the second shows the total number of tornadoes recorded per year. I built both graphs using pivot tables. For the first plot, I grouped the data by year and calculated the average of the “injuries” column. For the second, I simply counted the number of tornadoes per year by summarizing the “year” column. The results revealed a clear inverse pattern: while tornado frequency has steadily increased over the decades, the average number of injuries per tornado has decreased. This suggests improvements in tornado detection systems, public warning protocols, and shelter infrastructure. All formatting was done in Excel, where I cleaned up the visuals by removing gridlines, adding axis titles, and applying consistent styling across both charts for easy comparison.

The second question asked: Historically, what is the distribution of tornado widths? To investigate this, I created a histogram that visualizes how tornado widths are distributed across the dataset. I first cleaned the data to exclude any missing or invalid width values. Then, using Excel’s histogram tool, I created bins with intervals of 75 yards, adjusting the default bin size for better readability. I also added an overflow bin for extreme cases where tornadoes were wider than 600 yards. The resulting histogram shows a clear exponential distribution: the majority of tornadoes are under 150 yards wide, while only a small fraction exceed that. This pattern aligns with meteorological expectations and validates the data. For aesthetics, I edited the bar colors and axis labels directly in Excel, and added the vertical axis title in PowerPoint since Excel doesn’t allow for horizontal formatting of vertical text.

The third question I analyzed was: Do states in Tornado Alley show similar patterns in tornado occurrences from 1950 to 2021, or are some consistently more affected than others? I started by filtering the “state” column to include only the core Tornado Alley states: Oklahoma, Kansas, Nebraska, Texas, and South Dakota. I then used a pivot table to count the number of tornadoes recorded in each of these states. However, I realized that comparing raw counts alone wouldn’t be fair, since states like Texas are much larger than others. To address this, I normalized the data by calculating the number of tornadoes per 10,000 square miles for each state, using publicly available land area data. This adjustment revealed that Oklahoma had the highest tornado density, followed by Kansas and Nebraska. The final bar chart was created in Excel with clearly labeled axes and bars shaded in similar tones, with Oklahoma highlighted to emphasize the key finding. I also rounded all values to whole numbers for readability and realism.

The fourth and final question was: Is there a geographical pattern of the most deadly tornadoes (EF4 and EF5) recorded in the U.S. from 1950 to 2021? For this, I focused on the most severe tornadoes in the dataset. I filtered the EF level column to show only EF4 and EF5 tornadoes. Although I initially tried to use a pivot table, it wasn’t the most effective method for plotting geographic data, so instead I copied the filtered “start latitude” and “start longitude” columns to a new worksheet. Then I used Excel’s scatterplot tool to graph the points. I color-coded the points by EF rating (EF4 in blue and EF5 in orange) to make differences more visible. The final scatterplot revealed a dense clustering of deadly tornadoes in the central region of the U.S.—consistent with the area known as Tornado Alley. I added a legend and full axis titles for clarity. Although I attempted to overlay a transparent U.S. map to better illustrate the locations, this proved difficult to align accurately in Excel, so I focused instead on keeping the chart clean and readable.

This project gave me the opportunity to apply Excel-based data analysis tools such as pivot tables, custom binning, and multi-variable filtering. More importantly, it taught me how to craft compelling visualizations that connect data to insights, and how to interpret large datasets to answer complex real-world questions. Each slide was carefully designed to balance clarity, simplicity, and impact—core principles in both engineering analysis and effective communication.