At the end of this section, you should be able to download storm events data, fill in missing values using various methods, and understand when one method may be more appropriate than another.
Data mining begins with the data. It’s the most important part of the process. You should carefully consider which variables to actually examine and how to create a clean dataset. Although data mining can look at an abundance of variables, we still want to be able to understand the results, and too many variables can be overwhelming to digest.
Let’s begin by selecting the data and the specific variables for this case study. There are two types of variables we will need. The first is the response variable(s), or the variable(s) we will try to predict. The second is the factor variables, or the variables that will be used to predict the response variable(s).
Corn Data
Let’s start by choosing the response variables, which are going to be related to the corn harvest. We need to rely on what’s freely available. For agricultural purposes in the United States, a good place to start is the Department of Agriculture. They have an Economics, Statistics and Market Information System that archives statistics on different agricultural crops. One such crop is sweet corn. (Note that there are other types of corn, specifically feed corn, but for now I’m going to focus on sweet corn.)
Before even looking at the details in the data (search for sweet corn), you will probably notice that some datasets are organized by state while others represent the entire Continental United States (CONUS). Weather is location-specific, so right away I know I want to look at the individual states instead of CONUS.
You might also notice that some datasets are labeled ‘fresh’ while others are labeled ‘processed’. If you remember from the motivational video at the beginning of the lesson, they specifically discussed processed corn. Corn futures were bought and sold primarily for companies that would use the corn to create a product. Sticking with that route, I will focus on processed sweet corn.
If you open up one of the Excel sheets, you will find seven variables: Year, Acreage Planted, Acreage Harvested, Yield per Acre, Production, Season-Average Price, and Value of Production. Most of these variables are self-explanatory. The only one I will mention is the value of production, which is simply the production amount times the average price. Aside from Year and Acreage Planted, all of these variables are metrics used to measure how good the harvest was. Five response variables to predict is not unreasonable, so we will keep them all. The acreage planted can be used as a known or factor variable to aid in prediction, since we will know that information each year before the growing season begins. Let’s refine the question now.
Question:
How does the weather, along with acreage planted, affect the value of production, the average price, the total production, the yield per acre, or the acreage harvested for sweet corn intended for processing in the United States?
You will notice that not all states in CONUS are listed. This could be due to a lack of data, or simply that those states do not grow corn. Whatever the reason, we will stick with what’s available and let those states ‘represent’ the outcome for the United States. Remember, the futures market is not on a state-by-state basis, so we need to create a model that reflects the United States as a whole.
I’ve already combined all of the state data into one file. You can download the file here.
Storm Events Summary
We have the response variables; now we need the factor variables, the ones we will use to predict the response variables. There are a number of weather variables we could look at, such as daily mean temperature or precipitation. Although these more typical variables are important, I think that high-impact events might have a bigger effect on the corn harvest.
For this example, I’m going to look at the Storm Events Database, which provides records of several different weather phenomena. Data is available from January 1950 through July 2016; however, only tornadoes were documented from 1950-1954; tornadoes, hail, and thunderstorm wind events from 1955-1995; and all 48 event types from 1996 onward. To learn more about each variable, you can find the documentation here.
What’s great about this database is that it's extremely localized, meaning events are listed at the county level. This is beneficial to us because corn is not grown everywhere in one state. In fact, most states have specific counties where corn production is more prevalent. You can find this list here: Vegetables - Usual Planting and Harvesting Dates.
If we know the counties in each state where corn is grown, we can extract out the storm events data for only those counties. This means we are examining only cases where a storm or weather event likely affected the crop. Furthermore, we have typical harvesting dates by county. We can extract out events for each county only between the planting and harvesting dates. This narrows our time and spatial sampling to something more specific and realistic. Let’s rework our question now into the final form.
Question:
How does the frequency of storm events during the growing season, along with acreage planted, affect the value of production, the average price, the total production, the yield per acre, or the acreage harvested for sweet corn intended for processing in the United States?
We have a refined and clear question. Now we need to prepare the data. We want the frequency of each type of storm event only for the counties where corn is typically grown during the growing season. How do we actually extract that?
Here is an Excel sheet that contains the harvest dates and a sheet that contains the principal counties where corn is grown, broken down by state. All you need to do is bulk download the storm events data. You want to download only the files that say ‘StormEvents_details’. Download these files (by year) into a folder. Now we can start extracting out the data. The goal is to have one file that contains the response and factor variables temporally and spatially matched up. To begin, load in the growing season dates and key counties by state, as well as the sweet corn data.
Your script should look something like this:
# load in growing season dates for all states
growingDates <- read.csv("ProcessedSweetCorn_GrowingDates.csv", header=TRUE)
# load in key counties for corn
keyCounties <- read.csv("ProcessedSweetCorn_PrincipleCounties.csv", header=TRUE)
# load in processed sweet corn data
ProcessedSweetCorn <- read.csv("ProcessedCorn.csv", header=TRUE)
# determine which states we have data for
# (note: in R >= 4.0, read.csv no longer creates factors by default;
#  if levels() returns NULL, use unique(ProcessedSweetCorn$State) instead)
states <- levels(ProcessedSweetCorn$State)
Next, we are going to prep the storm events data.
Your script should look something like this:
### Storm Event data is in multiple files (1 for each year)
# path of data
path <- "StormEvents/"
# names of files
file.names <- dir(path, pattern=".csv")
# create an array for each meteorological variable we will include
tornado <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
hail <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
tsWind <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
# the rest of these variables are only available from 1996 onward
flashFlood <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
iceStorm <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
heavySnow <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
coldWindChill <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
strongWind <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
tropicalStorm <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
winterStorm <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
flood <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
winterWeather <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
blizzard <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
excessiveHeat <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
freezingFog <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
heat <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
highWind <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
sleet <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
drought <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
extremeColdWindChill <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
frostFreeze <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
heavyRain <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
hurricane <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
tropicalDepression <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
wildfire <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
The code above saves the path of your CSV files for storm events (the ones you downloaded in bulk). It also creates, for each storm variable we will include, an array with one element per corn observation (each state/year row); I selected a subset of the event types that are available. For each observation, the array will hold the frequency of that event. Now we will loop through each year and count the number of times each storm event type occurred.
Your script should look something like this:
# loop through each year of storm event files
for(ifile in 1:length(file.names)){
  # open the file
  file <- read.csv(paste0(path, file.names[ifile]), header=TRUE)
For each year, we open the corresponding storm events file. Next, we loop through each state and extract out the key counties for that state.
Your script should look something like this:
# loop through each state
for(istate in 1:length(states)){
  # determine the counties to look for --> change to uppercase
  counties <- factor(toupper(keyCounties$Principal.Producing.Counties[which(keyCounties$State == states[istate])]))
For this given state, we now have the counties in which corn is grown. We can use this to subset the storm events data. We also want to subset the storm data to the growing season. Let’s begin by determining the typical planting date for this state.
Your script should look something like this:
### Determine the growing season dates
# extract out typical planting dates
plantingStartDate <- growingDates$Planting.Begins[which(growingDates$State == states[istate])]
plantingEndDate <- growingDates$Planting.Ends[which(growingDates$State == states[istate])]
# create a planting date that is in the middle of the usual start/end dates
plantingDate <- as.Date(plantingStartDate) + floor((as.Date(plantingEndDate) - as.Date(plantingStartDate))/2)
# change the planting date to the Julian day for the year of data being examined
plantingDateJulian <- format(as.Date(ISOdate(substr(file$BEGIN_YEARMONTH[1],1,4),
                                             format(plantingDate,'%m'), format(plantingDate,'%d'))), '%j')
The data provide a range of when planting typically begins and ends. What I’ve done is taken the midpoint of that range, which will be used as the planting date. Now we need to determine the harvest date.
Your script should look something like this:
# extract out typical harvest dates
harvestStartDate <- growingDates$Harvest.Begins[which(growingDates$State == states[istate])]
harvestEndDate <- growingDates$Harvest.Ends[which(growingDates$State == states[istate])]
# create a harvest date that is in the middle of the usual start/end dates
harvestDate <- as.Date(harvestStartDate) + floor((as.Date(harvestEndDate) - as.Date(harvestStartDate))/2)
# change the harvest date to the Julian day for the year of data being examined
harvestDateJulian <- format(as.Date(ISOdate(substr(file$BEGIN_YEARMONTH[1],1,4),
                                            format(harvestDate,'%m'), format(harvestDate,'%d'))), '%j')
We have done the same thing as with the planting date: taken the midpoint of the range as the harvest date. We want to extract out only the storm events that fall between these two dates. For this particular file (year), let’s create a spatial index (counties) and a time index (growing season).
Your script should look something like this:
### Extract out storm events that occurred in the state's key counties over the growing season
state_index <- which(states[istate] == file$STATE)
county_index <- match(as.character(file$CZ_NAME[state_index]), unlist(strsplit(as.character(counties), ',')))
spatial_index <- state_index[which(!is.na(county_index))]
# create a time variable for the meteorological data (Julian days)
stormDateJulian <- format(as.Date(ISOdate(substr(file$BEGIN_YEARMONTH,1,4),
                                          substr(file$BEGIN_YEARMONTH,5,6), file$BEGIN_DAY)), '%j')
# find time stamps that fall in the typical growing season for the state
# (convert the Julian day strings to numbers so the comparison is numeric, not lexical)
time_index <- which(as.numeric(stormDateJulian) <= as.numeric(harvestDateJulian) &
                    as.numeric(stormDateJulian) >= as.numeric(plantingDateJulian))
Now extract out the intersection of these.
Your script should look something like this:
# intersect the spatial and time indices -> for this year (ifile) and state (istate),
# these are the indices of events that occurred in the counties typical of growing
# sweet corn during the growing season
index <- intersect(spatial_index, time_index)
This index represents all the cases in which storms occurred in our key counties for this particular state during the growing season. For each corn production observation, we want a corresponding storm events data observation. To determine the correct index, use this code:
Your script should look something like this:
# determine the index for the corn data
stormYear <- format(as.Date(ISOdate(substr(file$BEGIN_YEARMONTH,1,4),
                                    substr(file$BEGIN_YEARMONTH,5,6), file$BEGIN_DAY)), '%Y')
stormYear <- stormYear[1]
corn_index <- intersect(which(ProcessedSweetCorn$Year == stormYear),
                        which(ProcessedSweetCorn$State == states[istate]))
The corn index is just the row in which we will store the storm frequency (refer to the initialization of the variables). Now we can save the frequency of each event.
Your script should look something like this:
# count the frequency of each meteorological event for this specific corn observation
tornado[corn_index] <- length(which(file$EVENT_TYPE[index] == "Tornado"))
hail[corn_index] <- length(which(file$EVENT_TYPE[index] == "Hail"))
tsWind[corn_index] <- length(which(file$EVENT_TYPE[index] == "Thunderstorm Wind"))
If the year is 1996 or later, we have more variables to look at.
Your script should look something like this:
# if the year is 1996 or later, check for the additional variables
if(as.numeric(stormYear) >= 1996){
  flashFlood[corn_index] <- length(which(file$EVENT_TYPE[index] == "Flash Flood"))
  iceStorm[corn_index] <- length(which(file$EVENT_TYPE[index] == "Ice Storm"))
  heavySnow[corn_index] <- length(which(file$EVENT_TYPE[index] == "Heavy Snow"))
  coldWindChill[corn_index] <- length(which(file$EVENT_TYPE[index] == "Cold/Wind Chill"))
  strongWind[corn_index] <- length(which(file$EVENT_TYPE[index] == "Strong Wind"))
  tropicalStorm[corn_index] <- length(which(file$EVENT_TYPE[index] == "Tropical Storm"))
  winterStorm[corn_index] <- length(which(file$EVENT_TYPE[index] == "Winter Storm"))
  flood[corn_index] <- length(which(file$EVENT_TYPE[index] == "Flood"))
  winterWeather[corn_index] <- length(which(file$EVENT_TYPE[index] == "Winter Weather"))
  blizzard[corn_index] <- length(which(file$EVENT_TYPE[index] == "Blizzard"))
  excessiveHeat[corn_index] <- length(which(file$EVENT_TYPE[index] == "Excessive Heat"))
  freezingFog[corn_index] <- length(which(file$EVENT_TYPE[index] == "Freezing Fog"))
  heat[corn_index] <- length(which(file$EVENT_TYPE[index] == "Heat"))
  highWind[corn_index] <- length(which(file$EVENT_TYPE[index] == "High Wind"))
  sleet[corn_index] <- length(which(file$EVENT_TYPE[index] == "Sleet"))
  drought[corn_index] <- length(which(file$EVENT_TYPE[index] == "Drought"))
  extremeColdWindChill[corn_index] <- length(which(file$EVENT_TYPE[index] == "Extreme Cold/Wind Chill"))
  frostFreeze[corn_index] <- length(which(file$EVENT_TYPE[index] == "Frost/Freeze"))
  heavyRain[corn_index] <- length(which(file$EVENT_TYPE[index] == "Heavy Rain"))
  hurricane[corn_index] <- length(which(file$EVENT_TYPE[index] == "Hurricane"))
  tropicalDepression[corn_index] <- length(which(file$EVENT_TYPE[index] == "Tropical Depression"))
  wildfire[corn_index] <- length(which(file$EVENT_TYPE[index] == "Wildfire"))
} # end if statement for additional variables
Now finish the loop of states and storm files (years):
Your script should look something like this:
  } # end loop of states
} # end loop of storm files
Finally, assign the storm event variables to the corn data frame. Remember, the goal was to create one CSV file that contains both the response variables and the factor variables.
Your script should look something like this:
# assign the storm event frequencies to the processed sweet corn data frame
ProcessedSweetCorn$tornado <- tornado
ProcessedSweetCorn$hail <- hail
ProcessedSweetCorn$tsWind <- tsWind
ProcessedSweetCorn$flashFlood <- flashFlood
ProcessedSweetCorn$iceStorm <- iceStorm
ProcessedSweetCorn$heavySnow <- heavySnow
ProcessedSweetCorn$coldWindChill <- coldWindChill
ProcessedSweetCorn$strongWind <- strongWind
ProcessedSweetCorn$tropicalStorm <- tropicalStorm
ProcessedSweetCorn$winterStorm <- winterStorm
ProcessedSweetCorn$flood <- flood
ProcessedSweetCorn$winterWeather <- winterWeather
ProcessedSweetCorn$blizzard <- blizzard
ProcessedSweetCorn$excessiveHeat <- excessiveHeat
ProcessedSweetCorn$freezingFog <- freezingFog
ProcessedSweetCorn$heat <- heat
ProcessedSweetCorn$highWind <- highWind
ProcessedSweetCorn$sleet <- sleet
ProcessedSweetCorn$drought <- drought
ProcessedSweetCorn$extremeColdWindChill <- extremeColdWindChill
ProcessedSweetCorn$frostFreeze <- frostFreeze
ProcessedSweetCorn$heavyRain <- heavyRain
ProcessedSweetCorn$hurricane <- hurricane
ProcessedSweetCorn$tropicalDepression <- tropicalDepression
ProcessedSweetCorn$wildfire <- wildfire
# save the data frame as a new file
write.csv(ProcessedSweetCorn, file="FinalProcessedSweetCorn.csv")
We are left with one file. For every observation of corn production for each state/year, we have a corresponding frequency of storm events that is spatially (key counties) and temporally (growing season) relevant. For your convenience, here is the R code from above in one file.
Missing Values
Now that the variables have been selected and the data has been extracted, we need to talk about missing values. In data mining, unknown values can be problematic. Choosing the best method for dealing with missing values will depend on the question and the data. There’s no right or wrong way. I’m going to discuss three ways we can deal with missing values (some of which may be a review from Meteo 810).
To show you these three methods, I will use our actual data, but to make things simpler I will only look at one state. This will keep the number of observations manageable. Let’s begin by loading in our final data file.
Your script should look something like this:
# load in the final table of extracted meteorological events and processed sweet corn data
mydata <- read.csv("FinalProcessedSweetCorn.csv", stringsAsFactors = FALSE)
Let’s extract the data for one state, picking a state that has a lot of observations.
Your script should look something like this:
# start with a state that has the most comprehensive amount of data available
states <- unique(mydata$State)
numObs <- c()
for(istate in 1:length(states)){
  state_index <- which(mydata$State == states[istate])
  # count the non-missing Production values (is.na catches both NA and NaN)
  numObs[istate] <- length(which(!is.na(mydata$Production[state_index])))
}
# pick one of the states (the third of those tied for the most observations)
state <- states[which(numObs == max(numObs))[3]]
# create a data frame for this state
stateData <- subset(mydata, mydata$State == state)
The state selected should be Minnesota. Before I go into the different methods for filling missing values, let’s start by looking at the data and determining how many missing values we are dealing with. Run the code below to see the number of variables that have missing data.
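A minimal way to do this count, using the same apply call that appears again later in this section:

# count the number of missing values in each row (each year)
apply(stateData, 1, function(x) sum(is.na(x)))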
For each row (each year) of data in the stateData data frame, we count the number of variables that have missing data. You will see that the majority of years have 22 missing values; these are the years prior to 1996. If you change the 1 to 2 in the apply call, you will get the total number of missing values for each column, that is, the number of missing values for each variable. If you have a threshold that you want to meet, say requiring each variable to have at least 80% of its values (no more than 20% missing), you can use the function ‘manyNAs’.
Your script should look something like this:
# determine which variables have more than 20% of their values missing
# (note: the DMwR package has since been archived on CRAN; if install.packages
#  cannot find it, you may need to install it from the CRAN archive)
library(DMwR)
manyNAs(t(stateData), 0.2)
By changing the 0.2, you can adjust the threshold. This is a helpful function for pointing out which variables need to be addressed.
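For example, a looser cutoff is just a different second argument (the 0.5 below is purely illustrative):

# flag only the variables with more than half of their values missing
manyNAs(t(stateData), 0.5)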
So, how do we deal with the missing values? Well, the first method is to fill the missing values in with the most representative value. But how do you decide what the most representative value is? Do you use the mode? Median? Mean? Let’s take, for example, the variable flashFlood. If we want to fill in the missing values with the median, we could use the following code.
Your script should look something like this:
# fill in missing values with the median
stateData$flashFlood[is.na(stateData$flashFlood)] <- median(stateData$flashFlood, na.rm=TRUE)
The median was 1, so now all the observations in that variable that were NAs have been replaced with 1. If we use the function ‘centralImputation’, we can fill in all missing values automatically for all variables in the data frame using this method.
Your script should look something like this:
# fill in missing values for all variables
stateData <- centralImputation(stateData)
Now rerun the code to determine the number of missing values in each row:
Your script should look something like this:
# determine the number of missing values in each row
apply(stateData, 1, function(x) sum(is.na(x)))
# determine the number of missing values in each column
apply(stateData, 2, function(x) sum(is.na(x)))
You should get 0. But is it realistic to say that from 1960-1995 we can replace missing values with the median from 1996-2009? Probably not. So let’s look at another method. If you are following along in R, you need to reset your stateData data frame.
Your script should look something like this:
# create a data frame for this state
stateData <- subset(mydata, mydata$State == state)
The next method we will talk about is based on correlations. Often, meteorological variables (or variables in general) are correlated with other variables. If you have the observations for variable A and you know that variable B is correlated with variable A, you can estimate variable B using the correlation. Start by looking at the correlations among the variables. Run the code below to see a table of correlation coefficients and to synthesize the results.
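A minimal sketch of such code, assuming we compute correlations over the numeric columns using only the complete observations (symnum is one way to render the table symbolically):

# correlation matrix of the numeric variables, using only the complete observations
corMatrix <- cor(stateData[, sapply(stateData, is.numeric)], use="complete.obs")
# summarize the matrix with symbols that flag the strength of each correlation
symnum(corMatrix)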
This displays symbols that flag the strength of each correlation. You will see that the highest correlation is 0.629, between extremeColdWindChill and hail. (You will also notice a correlation of 1 between highWind and iceStorm, but I will talk about this in a bit.) So how do we use this correlation? We create a linear model between the two variables. Since hail has fewer missing values, I want to use hail to predict extremeColdWindChill. Use this code to create the linear model and fill in the missing values of extreme cold wind chill with predictions from the model.
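A minimal sketch of the idea, assuming a simple linear model of extremeColdWindChill on hail, might look like:

# fit a linear model predicting extremeColdWindChill from hail
# (lm drops rows with missing values when fitting)
fit <- lm(extremeColdWindChill ~ hail, data=stateData)
# find the rows where extremeColdWindChill is missing but hail is known
missing <- is.na(stateData$extremeColdWindChill) & !is.na(stateData$hail)
# fill in the missing values with the model's predictions
stateData$extremeColdWindChill[missing] <- predict(fit, newdata=stateData[missing, ])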
The code above fills in the missing values of extremeColdWindChill by finding the corresponding hail values and using the linear model to predict new values. What are the problems with this method? Well, the primary issue is that we are calculating correlations from only 14 observations (1996-2009). That is not nearly enough to create a valid result; hence the correlation coefficient of 1 for highWind and iceStorm. If you look at the data, each has exactly 1 instance of its event over the 14 observations. This correlation is probably spurious rather than real.
The third, and for this example more realistic, method is to simply remove cases. Make sure you reset your data if you are following along in R.
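Resetting just means re-creating the state data frame as before:

# create a data frame for this state
stateData <- subset(mydata, mydata$State == state)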
You can do two things: either set a threshold requiring no more than x% of missing values or completely get rid of all missing values. If we wanted a dataset of only complete cases, we would use the following code:
Your script should look something like this:
# keep only the complete cases (rows with no missing values)
stateData <- na.omit(stateData)
This code removes any row (year) in which any variable has a missing value. This means we would only retain values from 1996 onward. In this case, I think it would be better to just remove variables that have too many unknowns. I’m going to say any variables with more than 20% of the data missing should be omitted. To do this, use the following code (remember to reset your state data frame):
Your script should look something like this:
# remove variables with too many missing values (manyNAs defaults to a 20% threshold)
clean.stateData <- stateData[, -manyNAs(t(stateData))]
You will be left with the corn variables for all years, along with the tornado, hail, and thunderstorm wind storm event variables. In this case, I decided it was more valuable to have fewer variables with more observations than more variables with limited observations, so as to avoid overfitting when too many variables are used to fit the model. Any of these methods is acceptable; it just depends on the situation and your justification.