At the end of this section, you should be able to download storm events data, fill in missing values using various methods, and understand when one method may be more appropriate than another.
Data mining begins with the data. It’s the most important part of the process. You should carefully consider which variables to actually examine and how to create a clean dataset. Although data mining can look at an abundance of variables, we still want to be able to understand the results, and too many variables can be overwhelming to digest.
Let’s begin by selecting the data and the specific variables for this case study. There are two types of variables we will need. The first is the response variable(s), or the variable(s) we will try to predict. The second is the factor variables, or the variables that will be used to predict the response variable(s).
Corn Data
Let’s start by choosing the response variables, which are going to be related to the corn harvest. We need to rely on what’s freely available. For agricultural purposes in the United States, a good place to start is the Department of Agriculture. They have an Economics, Statistics and Market Information System that archives statistics on different agricultural crops. One such crop is sweet corn. (Note that there are other types of corn, specifically feed corn, but for now I’m going to focus on sweet corn.)
Before even looking at the details in the data (search for sweet corn), you will probably notice that some datasets are organized by state while others represent the entire Continental United States (CONUS). Weather is location-specific, so right away I know I want to look at the individual states instead of CONUS.
You might also notice that some datasets are labeled ‘fresh’ while others are labeled ‘processed’. If you remember from the motivational video at the beginning of the lesson, they specifically discussed processed corn. Corn futures were bought and sold primarily for companies that would use the corn to create a product. Sticking with that route, I will focus on processed sweet corn.
If you open up one of the Excel sheets, you will find seven variables: Year, Acreage Planted, Acreage Harvested, Yield per Acre, Production, Season-Average Price, and Value of Production. Most of these variables are self-explanatory. The only one I will mention is the value of production, which is simply the production amount times the average price. Aside from Year and Acreage Planted, all of these variables are metrics used to measure how good the harvest was. Five response variables to predict is not unreasonable, so we will keep them all. The acreage planted can be used as a known or factor variable to aid in prediction, since we will know that information each year before the growing season begins. Let’s refine the question now.
Question:
How does the weather, along with acreage planted, affect the value of production, the average price, the total production, the yield per acre, or the acreage harvested for sweet corn intended for processing in the United States?
You will notice that not all states in CONUS are listed. This could be due to a lack of data, or simply that those states do not grow corn. Whatever the reason, we will stick with what’s available and let those states ‘represent’ the outcome for the United States. Remember, the futures market is not on a state-by-state basis, so we need to create a model that reflects the United States as a whole.
I’ve already combined all of the state data into one file. You can download the file here.
Storm Events Summary
We have the response variables; now we need the factor variables, the ones we will use to predict the response variables. There are a number of weather variables we could look at, such as daily mean temperature or precipitation. Although these more typical variables are important, I think that high-impact events might have a bigger effect on the corn harvest.
For this example, I’m going to look at the Storm Events Database, which provides records of several different weather phenomena. Data is available from January 1950 through July 2016; however, only tornadoes were documented from 1950-1954; tornadoes, hail, and thunderstorm wind events from 1955-1995; and all 48 event types from 1996 onward. To learn more about each variable, you can find the documentation here.
What’s great about this database is that it's extremely localized, meaning events are listed at the county level. This is beneficial to us because corn is not grown everywhere in one state. In fact, most states have specific counties where corn production is more prevalent. You can find this list here: Vegetables - Usual Planting and Harvesting Dates.
If we know the counties in each state where corn is grown, we can extract out the storm events data for only those counties. This means we are examining only cases where a storm or weather event likely affected the crop. Furthermore, we have typical harvesting dates by county. We can extract out events for each county only between the planting and harvesting dates. This narrows our time and spatial sampling to something more specific and realistic. Let’s rework our question now into the final form.
Question:
How does the frequency of storm events during the growing season, along with acreage planted, affect the value of production, the average price, the total production, the yield per acre, or the acreage harvested for sweet corn intended for processing in the United States?
We have a refined and clear question. Now we need to prepare the data. We want the frequency of each type of storm event only for the counties where corn is typically grown during the growing season. How do we actually extract that?
Here is an Excel sheet that contains the harvest dates and a sheet that contains the principal counties where corn is grown, broken down by state. All you need to do is bulk download the storm events data. You want to download only the files that say ‘StormEvents_details’. Download these files (by year) into a folder. Now we can start extracting out the data. The goal is to have one file that contains the response and factor variables temporally and spatially matched up. To begin, load in the growing season dates and key counties by state, as well as the sweet corn data.
Your script should look something like this:
# load in growing season dates for all states
growingDates <- read.csv("ProcessedSweetCorn_GrowingDates.csv", header=TRUE)
# load in key counties for corn
keyCounties <- read.csv("ProcessedSweetCorn_PrincipleCounties.csv", header=TRUE)
# load in processed sweet corn data
ProcessedSweetCorn <- read.csv("ProcessedCorn.csv", header=TRUE)
# determine which states we have data for
# (note: in R >= 4.0, read.csv no longer creates factors by default;
#  if levels() returns NULL, use unique(ProcessedSweetCorn$State) instead)
states <- levels(ProcessedSweetCorn$State)
Next, we are going to prep the storm events data.
Your script should look something like this:
### Storm Event data is in multiple files (1 for each year)
# path of data
path <- "StormEvents/"
# names of files
file.names <- dir(path, pattern=".csv")
# create an array for each meteorological variable we will include
tornado <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
hail <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
tsWind <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
# the rest of these variables are only available from 1996 onward
flashFlood <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
iceStorm <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
heavySnow <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
coldWindChill <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
strongWind <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
tropicalStorm <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
winterStorm <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
flood <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
winterWeather <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
blizzard <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
excessiveHeat <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
freezingFog <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
heat <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
highWind <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
sleet <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
drought <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
extremeColdWindChill <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
frostFreeze <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
heavyRain <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
hurricane <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
tropicalDepression <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
wildfire <- array(data=NA, dim=length(ProcessedSweetCorn$Year))
The code above saves the path of your CSV files for storm events (the ones you downloaded in bulk). It also creates, for each storm variable we will include, an array with one element per corn observation (each state/year row); I selected a subset of the event types that are available. For each observation, the array will hold the frequency of that event. Now we will loop through each year and count the number of times each storm event type occurred.
Your script should look something like this:
# loop through each year of storm event files
for(ifile in 1:length(file.names)){
  # open the file
  file <- read.csv(paste0(path, file.names[ifile]), header=TRUE)
For each year, we open the corresponding storm events file. Next, we loop through each state and extract out the key counties for that state.
Your script should look something like this:
# loop through each state
for(istate in 1:length(states)){
  # determine the counties to look for --> change to uppercase
  counties <- factor(toupper(keyCounties$Principal.Producing.Counties[which(keyCounties$State == states[istate])]))
For this given state, we now have the counties in which corn is grown. We can use this to subset the storm events data. We also want to subset the storm data to the growing season. Let’s begin by determining the typical planting date for this state.
Your script should look something like this:
### Determine the growing season dates
# extract out typical planting dates
plantingStartDate <- growingDates$Planting.Begins[which(growingDates$State == states[istate])]
plantingEndDate <- growingDates$Planting.Ends[which(growingDates$State == states[istate])]
# create a planting date that is in the middle of the usual start/end dates
plantingDate <- as.Date(plantingStartDate) + floor((as.Date(plantingEndDate) - as.Date(plantingStartDate))/2)
# change the planting date to the Julian day for the year of data being examined
plantingDateJulian <- format(as.Date(ISOdate(substr(file$BEGIN_YEARMONTH[1],1,4),
                                             format(plantingDate,'%m'), format(plantingDate,'%d'))), '%j')
The data provide a range of when planting typically begins and ends. What I’ve done is taken the midpoint of that range, which will be used as the planting date. Now we need to determine the harvest date.
Your script should look something like this:
# extract out typical harvest dates
harvestStartDate <- growingDates$Harvest.Begins[which(growingDates$State == states[istate])]
harvestEndDate <- growingDates$Harvest.Ends[which(growingDates$State == states[istate])]
# create a harvest date that is in the middle of the usual start/end dates
harvestDate <- as.Date(harvestStartDate) + floor((as.Date(harvestEndDate) - as.Date(harvestStartDate))/2)
# change the harvest date to the Julian day for the year of data being examined
harvestDateJulian <- format(as.Date(ISOdate(substr(file$BEGIN_YEARMONTH[1],1,4),
                                            format(harvestDate,'%m'), format(harvestDate,'%d'))), '%j')
We have done the same thing as with the planting date: taken the midpoint of the range as the harvest date. We want to extract out only the storm events that fall between these two dates. For this particular file (year), let’s create a spatial index (counties) and a time index (growing season).
Your script should look something like this:
### Extract out storm events that occurred in the state's key counties over the growing season
state_index <- which(states[istate] == file$STATE)
county_index <- match(as.character(file$CZ_NAME[state_index]), unlist(strsplit(as.character(counties), ',')))
spatial_index <- state_index[which(!is.na(county_index))]
# create a time variable for the meteorological data (Julian days)
stormDateJulian <- format(as.Date(ISOdate(substr(file$BEGIN_YEARMONTH,1,4),
                                          substr(file$BEGIN_YEARMONTH,5,6), file$BEGIN_DAY)), '%j')
# find time stamps that fall in the typical growing season for the state
# (convert the Julian day strings to numbers so the comparison is numeric, not lexical)
time_index <- which(as.numeric(stormDateJulian) <= as.numeric(harvestDateJulian) &
                    as.numeric(stormDateJulian) >= as.numeric(plantingDateJulian))
Now extract out the intersection of these.
Your script should look something like this:
# intersect the spatial and time indices -> for this year (ifile) and state (istate),
# these are the indices of events that occurred in the counties typical of growing
# sweet corn during the growing season
index <- intersect(spatial_index, time_index)
This index represents all the cases in which storms occurred in our key counties for this particular state during the growing season. For each corn production observation, we want a corresponding storm events data observation. To determine the correct index, use this code:
Your script should look something like this:
# determine the index for the corn data
stormYear <- format(as.Date(ISOdate(substr(file$BEGIN_YEARMONTH,1,4),
                                    substr(file$BEGIN_YEARMONTH,5,6), file$BEGIN_DAY)), '%Y')
stormYear <- stormYear[1]
corn_index <- intersect(which(ProcessedSweetCorn$Year == stormYear),
                        which(ProcessedSweetCorn$State == states[istate]))
The corn index is just the row in which we will store the storm frequency (refer to the initialization of the variables). Now we can save the frequency of each event.
Your script should look something like this:
# count the frequency of each meteorological event for this specific corn observation
tornado[corn_index] <- length(which(file$EVENT_TYPE[index] == "Tornado"))
hail[corn_index] <- length(which(file$EVENT_TYPE[index] == "Hail"))
tsWind[corn_index] <- length(which(file$EVENT_TYPE[index] == "Thunderstorm Wind"))
If the year is 1996 or later, we have more variables to look at.
Your script should look something like this:
# if the year is 1996 or later, check for the additional variables
if(as.numeric(stormYear) >= 1996){
  flashFlood[corn_index] <- length(which(file$EVENT_TYPE[index] == "Flash Flood"))
  iceStorm[corn_index] <- length(which(file$EVENT_TYPE[index] == "Ice Storm"))
  heavySnow[corn_index] <- length(which(file$EVENT_TYPE[index] == "Heavy Snow"))
  coldWindChill[corn_index] <- length(which(file$EVENT_TYPE[index] == "Cold/Wind Chill"))
  strongWind[corn_index] <- length(which(file$EVENT_TYPE[index] == "Strong Wind"))
  tropicalStorm[corn_index] <- length(which(file$EVENT_TYPE[index] == "Tropical Storm"))
  winterStorm[corn_index] <- length(which(file$EVENT_TYPE[index] == "Winter Storm"))
  flood[corn_index] <- length(which(file$EVENT_TYPE[index] == "Flood"))
  winterWeather[corn_index] <- length(which(file$EVENT_TYPE[index] == "Winter Weather"))
  blizzard[corn_index] <- length(which(file$EVENT_TYPE[index] == "Blizzard"))
  excessiveHeat[corn_index] <- length(which(file$EVENT_TYPE[index] == "Excessive Heat"))
  freezingFog[corn_index] <- length(which(file$EVENT_TYPE[index] == "Freezing Fog"))
  heat[corn_index] <- length(which(file$EVENT_TYPE[index] == "Heat"))
  highWind[corn_index] <- length(which(file$EVENT_TYPE[index] == "High Wind"))
  sleet[corn_index] <- length(which(file$EVENT_TYPE[index] == "Sleet"))
  drought[corn_index] <- length(which(file$EVENT_TYPE[index] == "Drought"))
  extremeColdWindChill[corn_index] <- length(which(file$EVENT_TYPE[index] == "Extreme Cold/Wind Chill"))
  frostFreeze[corn_index] <- length(which(file$EVENT_TYPE[index] == "Frost/Freeze"))
  heavyRain[corn_index] <- length(which(file$EVENT_TYPE[index] == "Heavy Rain"))
  hurricane[corn_index] <- length(which(file$EVENT_TYPE[index] == "Hurricane"))
  tropicalDepression[corn_index] <- length(which(file$EVENT_TYPE[index] == "Tropical Depression"))
  wildfire[corn_index] <- length(which(file$EVENT_TYPE[index] == "Wildfire"))
} # end if statement for additional variables
Now finish the loop of states and storm files (years):
Your script should look something like this:
  } # end loop of states
} # end loop of storm files
Finally, assign the storm event variables to the corn data frame. Remember, the goal was to create one CSV file that contains both the response variables and the factor variables.
Your script should look something like this:
# assign the storm event frequencies to the processed sweet corn data frame
ProcessedSweetCorn$tornado <- tornado
ProcessedSweetCorn$hail <- hail
ProcessedSweetCorn$tsWind <- tsWind
ProcessedSweetCorn$flashFlood <- flashFlood
ProcessedSweetCorn$iceStorm <- iceStorm
ProcessedSweetCorn$heavySnow <- heavySnow
ProcessedSweetCorn$coldWindChill <- coldWindChill
ProcessedSweetCorn$strongWind <- strongWind
ProcessedSweetCorn$tropicalStorm <- tropicalStorm
ProcessedSweetCorn$winterStorm <- winterStorm
ProcessedSweetCorn$flood <- flood
ProcessedSweetCorn$winterWeather <- winterWeather
ProcessedSweetCorn$blizzard <- blizzard
ProcessedSweetCorn$excessiveHeat <- excessiveHeat
ProcessedSweetCorn$freezingFog <- freezingFog
ProcessedSweetCorn$heat <- heat
ProcessedSweetCorn$highWind <- highWind
ProcessedSweetCorn$sleet <- sleet
ProcessedSweetCorn$drought <- drought
ProcessedSweetCorn$extremeColdWindChill <- extremeColdWindChill
ProcessedSweetCorn$frostFreeze <- frostFreeze
ProcessedSweetCorn$heavyRain <- heavyRain
ProcessedSweetCorn$hurricane <- hurricane
ProcessedSweetCorn$tropicalDepression <- tropicalDepression
ProcessedSweetCorn$wildfire <- wildfire
# save the data frame as a new file
write.csv(ProcessedSweetCorn, file="FinalProcessedSweetCorn.csv")
We are left with one file. For every observation of corn production for each state/year, we have a corresponding frequency of storm events that is spatially (key counties) and temporally (growing season) relevant. For your convenience, here is the R code from above in one file.
Missing Values
Now that the variables have been selected and the data has been extracted, we need to talk about missing values. In data mining, unknown values can be problematic. Choosing the best method for dealing with missing values will depend on the question and the data. There’s no right or wrong way. I’m going to discuss three ways we can deal with missing values (some of which may be a review from Meteo 810).
To show you these three methods, I will use our actual data, but to make things simpler I will only look at one state. This will keep the number of observations manageable. Let’s begin by loading in our final data file.
Your script should look something like this:
# load in the final table of extracted meteorological events and processed sweet corn data
mydata <- read.csv("FinalProcessedSweetCorn.csv", stringsAsFactors = FALSE)
Let’s extract the data for one state, picking a state that has a lot of observations.
Your script should look something like this:
# start with a state that has the most comprehensive amount of data available
states <- unique(mydata$State)
numObs <- c()
for(istate in 1:length(states)){
  state_index <- which(mydata$State == states[istate])
  # count the non-missing Production values (is.na catches both NA and NaN)
  numObs[istate] <- length(which(!is.na(mydata$Production[state_index])))
}
# pick one of the states (the third of those tied for the most observations)
state <- states[which(numObs == max(numObs))[3]]
# create a data frame for this state
stateData <- subset(mydata, mydata$State == state)
The state selected should be Minnesota. Before I go into the different methods for filling missing values, let’s start by looking at the data and determining how many missing values we are dealing with. Run the code below to see the number of variables that have missing data.
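A minimal way to do this count, using the same apply call that appears again later in this section:

# count the number of missing values in each row (each year)
apply(stateData, 1, function(x) sum(is.na(x)))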
For each row (each year) of data in the stateData data frame, we count the number of variables that have missing data. You will see that the majority of years have 22 missing values; these are the years prior to 1996. If you change the 1 to 2 in the apply call, you will get the total number of missing values for each column, that is, the number of missing values for each variable. If you have a threshold that you want to meet, say requiring each variable to have at least 80% of its values (no more than 20% missing), you can use the function ‘manyNAs’.
Your script should look something like this:
# determine which variables have more than 20% of their values missing
# (note: the DMwR package has since been archived on CRAN; if install.packages
#  cannot find it, you may need to install it from the CRAN archive)
library(DMwR)
manyNAs(t(stateData), 0.2)
By changing the 0.2, you can adjust the threshold. This is a helpful function for pointing out which variables need to be addressed.
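For example, a looser cutoff is just a different second argument (the 0.5 below is purely illustrative):

# flag only the variables with more than half of their values missing
manyNAs(t(stateData), 0.5)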
So, how do we deal with the missing values? Well, the first method is to fill the missing values in with the most representative value. But how do you decide what the most representative value is? Do you use the mode? Median? Mean? Let’s take, for example, the variable flashFlood. If we want to fill in the missing values with the median, we could use the following code.
Your script should look something like this:
# fill in missing values with the median
stateData$flashFlood[is.na(stateData$flashFlood)] <- median(stateData$flashFlood, na.rm=TRUE)
The median was 1, so now all the observations in that variable that were NAs have been replaced with 1. If we use the function ‘centralImputation’, we can fill in all missing values automatically for all variables in the data frame using this method.
Your script should look something like this:
# fill in missing values for all variables
stateData <- centralImputation(stateData)
Now rerun the code to determine the number of missing values in each row:
Your script should look something like this:
# determine the number of missing values in each row
apply(stateData, 1, function(x) sum(is.na(x)))
# determine the number of missing values in each column
apply(stateData, 2, function(x) sum(is.na(x)))
You should get 0. But is it realistic to say that from 1960-1995 we can replace missing values with the median from 1996-2009? Probably not. So let’s look at another method. If you are following along in R, you need to reset your stateData data frame.
Your script should look something like this:
# create a data frame for this state
stateData <- subset(mydata, mydata$State == state)
The next method we will talk about is based on correlations. Often, meteorological variables (or variables in general) are correlated with other variables. If you have the observations for variable A and you know that variable B is correlated with variable A, you can estimate variable B using the correlation. Start by looking at the correlations among the variables. Run the code below to see a table of correlation coefficients and to synthesize the results.
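A minimal sketch of such code, assuming we compute correlations over the numeric columns using only the complete observations (symnum is one way to render the table symbolically):

# correlation matrix of the numeric variables, using only the complete observations
corMatrix <- cor(stateData[, sapply(stateData, is.numeric)], use="complete.obs")
# summarize the matrix with symbols that flag the strength of each correlation
symnum(corMatrix)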
This displays symbols that flag the strength of each correlation. You will see that the highest correlation is 0.629, between extremeColdWindChill and hail. (You will also notice a correlation of 1 between highWind and iceStorm, but I will talk about this in a bit.) So how do we use this correlation? We create a linear model between the two variables. Since hail has fewer missing values, I want to use hail to predict extremeColdWindChill. Use this code to create the linear model and fill in the missing values of extreme cold wind chill with predictions from the model.
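A minimal sketch of the idea, assuming a simple linear model of extremeColdWindChill on hail, might look like:

# fit a linear model predicting extremeColdWindChill from hail
# (lm drops rows with missing values when fitting)
fit <- lm(extremeColdWindChill ~ hail, data=stateData)
# find the rows where extremeColdWindChill is missing but hail is known
missing <- is.na(stateData$extremeColdWindChill) & !is.na(stateData$hail)
# fill in the missing values with the model's predictions
stateData$extremeColdWindChill[missing] <- predict(fit, newdata=stateData[missing, ])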
The code above fills in the missing values of extremeColdWindChill by finding the corresponding hail values and using the linear model to predict new values. What are the problems with this method? Well, the primary issue is that we are calculating correlations from only 14 observations (1996-2009). That is not nearly enough to create a valid result; hence the correlation coefficient of 1 for highWind and iceStorm. If you look at the data, each has exactly 1 instance of its event over the 14 observations. This correlation is probably spurious rather than real.
The third, and for this example more realistic, method is to simply remove cases. Make sure you reset your data if you are following along in R.
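Resetting just means re-creating the state data frame as before:

# create a data frame for this state
stateData <- subset(mydata, mydata$State == state)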
You can do two things: either set a threshold requiring no more than x% of missing values or completely get rid of all missing values. If we wanted a dataset of only complete cases, we would use the following code:
Your script should look something like this:
# keep only the complete cases (rows with no missing values)
stateData <- na.omit(stateData)
This code removes any row (year) in which any variable has a missing value. This means we would only retain values from 1996 onward. In this case, I think it would be better to just remove variables that have too many unknowns. I’m going to say any variables with more than 20% of the data missing should be omitted. To do this, use the following code (remember to reset your state data frame):
Your script should look something like this:
# remove variables with too many missing values (manyNAs defaults to a 20% threshold)
clean.stateData <- stateData[, -manyNAs(t(stateData))]
You will be left with the corn variables for all years, along with the tornado, hail, and thunderstorm wind storm event variables. In this case, I decided it was more valuable to have fewer variables with more observations than more variables with limited observations, so as to avoid overfitting when too many variables are used to fit the model. Any of these methods is acceptable; it just depends on the situation and your justification.