METEO 815
Applied Atmospheric Data Analysis

Obtain Model and Predict

Prioritize...

After reading this section you should be able to obtain the best model and use it. 

Read...

Now we know which type of model is best (linear model, regression tree, or random forest). But how do we actually obtain the model parameters and use the model?

Obtain Model

At this point, we have determined which type of model best represents each response variable, but we have yet to obtain the model itself in a usable form. Let's extract the best model for each response variable. Start by determining the names of the best models.

# obtain the best model for each predicted variable
NamesBestModels <- sapply(bestScores(allResults),
                          function(x) x['nmse','system'])

Next, define a lookup from each cross-validation variant to its learner function.

modelTypes <- c(cvRF='randomForest',cvRT='rpartXse',cvLM='lm')

Now use this lookup to determine which learner function corresponds to each best model.

modelFuncs <- modelTypes[sapply(strsplit(NamesBestModels,split="\\."),
                         function(x) x[1])]

Then obtain the parameters for each model.

modelParSet <- lapply(NamesBestModels,
                   function(x) getVariant(x,allResults)@pars)

Now we can loop through each response variable and save the best model.

Show me the code...

Your script should look something like this:

bestModels <- list()
for(i in 1:length(NamesBestModels)){
  # build the formula: response variable ~ all factor variables
  form <- as.formula(paste(names(clean.stateData)[4+i],'~.'))
  # call the winning learner with its best parameter set
  bestModels[[i]] <- do.call(modelFuncs[i],
                             c(list(form,
                                    lapply(clean.stateData[,c(4,4+i,10:12)],as.numeric)),
                               modelParSet[[i]]))
}

The code above begins by creating a formula for the model (response variable ~ factor variables). It then uses do.call to invoke the appropriate learner from 'modelFuncs' (lm, randomForest, or rpartXse), supplying the data and the best parameter set. You are left with a list of five models; for each response variable, the list contains the fitted version of its best model.
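
A quick way to sanity-check what ended up in 'bestModels' is to inspect the list directly; here is a minimal sketch (the object names follow the code above):

# check which learner was fitted for each response variable
sapply(bestModels, function(m) class(m)[1])

# look at the first fitted model (Acreage.Harvested) in more detail
bestModels[[1]]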

Predict

Now that we have the fitted models, let's try predicting values that were not used in the training set. Here is a set of corn and storm data from 2010-2015. We can use this data to test how well each model actually does at predicting values not seen during training. Remember, the reason I didn't use this dataset for testing before was its limited sample size, but we can use it as a secondary check.

Begin by loading the data and extracting it out for our state of interest (Minnesota).

Show me the code...

Your script should look something like this:

# load in test data 
mydata2 <- read.csv("finalProcessedSweetCorn_TestData.csv",stringsAsFactors = FALSE) 

# create a data frame for this state 
stateData2 <- subset(mydata2,mydata2$State == state)

Now we will predict the values using the best model we extracted previously.

Show me the code...

Your script should look something like this:

# create predictions
modelPreds <- matrix(ncol=length(NamesBestModels),
                     nrow=length(stateData2$Observation.Number))
for(i in 1:nrow(stateData2)){
  # apply the best model for each response variable to observation i
  modelPreds[i,] <- sapply(1:length(NamesBestModels),
                           function(x)
                             predict(bestModels[[x]],
                                     lapply(stateData2[i,4:12],as.numeric)))
}
# replace implausible (negative) predictions with 0
modelPreds[which(modelPreds < 0)] <- 0

We loop through each observation and apply the best model for each response variable, then replace any implausible (negative) predictions with 0. You are left with a set of predictions for each year for each corn response variable. Although we only have 6 years to work with, we can still calculate the NMSE to see how well each model does with data not included in the training dataset. The hope is that these NMSE values are similar to the ones calculated during model selection.

Show me the code...

Your script should look something like this:

# calculate NMSE for all models
avgValue <- apply(data.matrix(clean.stateData[,5:9]),2,mean,na.rm=TRUE)
apply((data.matrix(stateData2[,5:9])-modelPreds)^2,2,mean,na.rm=TRUE)/
  apply(scale(data.matrix(stateData2[,5:9]),avgValue,F)^2,2,mean)

The code starts by calculating an average value of each corn response variable based on the longer dataset. Then we calculate the MSE from our 6-year prediction and normalize it by scaling the test data by the long-term average value. You should get these scores:

Normalized MSE scores from 6-year prediction for each variable
Acreage.Harvested Yield.Acre Production Season.Average.Price Value.of.Production
0.1395567 0.7344319 0.4688927 0.6716387 0.7186660
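
Written out for a single response variable, the calculation above reduces to a simple ratio. Here is a minimal sketch (the nmse helper below is only for illustration and is not part of the earlier code):

# NMSE for one response variable: the mean squared error of the predictions
# divided by the mean squared error of always predicting the training average
nmse <- function(obs,pred,trainMean){
  mean((obs-pred)^2,na.rm=TRUE)/mean((obs-trainMean)^2,na.rm=TRUE)
}

# e.g., for the first response variable (Acreage.Harvested)
nmse(as.numeric(stateData2[,5]),modelPreds[,1],avgValue[1])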

For comparison, these were the best scores during model selection. Remember, the scores listed below may differ slightly from yours because of the random element in the k-fold cross-validation.

$Acreage.Harvested
       system     score
nmse cv.lm.v1 0.2218191

$Yield.Acre
       system    score
nmse cv.rf.v2 0.624105

$Production
       system     score
nmse cv.lm.v1 0.2736667

$Season.Average.Price
       system     score
nmse cv.rf.v3 0.4404177

$Value.of.Production
       system     score
nmse cv.lm.v1 0.4834414

The scores are relatively similar. Yield/acre has the highest (worst) NMSE, while the lowest scores come from acreage harvested and production. This confirms that our factor variables are better at predicting acreage harvested than yield per acre.

US Example

Let’s go through one more example of the whole process, this time for the entire United States (not just Minnesota). In addition, I'm going to use more factor variables and only look at observations from 1996 onward (this will include event types such as flash floods, wildfires, etc.).

To start, let's extract out the data from 1996 onward for all states.

Show me the code...

Your script should look something like this:

# extract data from 1996 onward to get all variables to include 
allData <- subset(mydata,mydata$Year>= 1996)

Now clean up the data. I've decided that if no events of a given type occurred, I will remove that event type. Start by summing the number of events for each event type (over the whole time period) and determining which factor variables to keep.

Show me the code...

Your script should look something like this:

# compute total events 
totalEvents <- apply(data.matrix(allData[,4:34]),2,sum,na.rm=TRUE) 
varName <- names(which(totalEvents>0))

Now subset the data for only event types where events actually occurred and remove any missing cases.

Show me the code...

Your script should look something like this:

# subset data for only events where actual events occurred 
clean.data <- allData[varName] 
clean.data <- clean.data[complete.cases(clean.data),]

Now create the functions to train, test, and evaluate the different model types.

Show me the code...

Your script should look something like this:

# create functions to train, test, and evaluate the models through cross-validation
cvRT <- function(form,train,test,...){
  modelRT <- rpartXse(form,train,...)
  predRT <- predict(modelRT,test)
  predRT <- ifelse(predRT < 0,0,predRT)
  mse <- mean((predRT-resp(form,test))^2)
  c(nmse=mse/mean((mean(resp(form,train))-resp(form,test))^2))
}
cvLM <- function(form,train,test,...){
  modelLM <- lm(form,train,...)
  predLM <- predict(modelLM,test)
  predLM <- ifelse(predLM < 0,0,predLM)
  mse <- mean((predLM-resp(form,test))^2)
  c(nmse=mse/mean((mean(resp(form,train))-resp(form,test))^2))
}
library(randomForest)
cvRF <- function(form,train,test,...){
  modelRF <- randomForest(form,train,...)
  predRF <- predict(modelRF,test)
  predRF <- ifelse(predRF < 0,0,predRF)
  mse <- mean((predRF-resp(form,test))^2)
  c(nmse=mse/mean((mean(resp(form,train))-resp(form,test))^2))
}

Now apply these functions to all the response variables.

Show me the code...

Your script should look something like this:

# apply function for all variables to predict 
dataForm <- sapply(names(clean.data)[c(2:6)], 
                   function(x,names.attrs){
                     f <- as.formula(paste(x,"~.")) 
                     dataset(f,lapply(clean.data[,c(names.attrs,x)],as.numeric),x) 
                   }, 
                   names(clean.data)[c(1,7:27)]) 
allResults <- experimentalComparison(dataForm, 
                                     c(variants('cvLM'), 
                                       variants('cvRT',se=c(0,0.5,1)), 
                                       variants('cvRF',ntree=c(70,90,120))), 
                                     cvSettings(5,10,100))

Finally, look at the scores.

Show me the code...

Your script should look something like this:

# look at scores
bestScores(allResults)

By including more data points (in the form of multiple states) as well as more factor variables, we hope to decrease the NMSE. The best score is still for acreage harvested, with an NMSE of 0.0019, but production and value of production are also quite good, with NMSEs of 0.039 and 0.11, respectively.

Again, the end goal is prediction, so let's obtain the best model for each response variable.

Show me the code...

Your script should look something like this:

# obtain the best model for each predicted variable
NamesBestModels <- sapply(bestScores(allResults),
                          function(x) x['nmse','system'])
modelTypes <- c(cvRF='randomForest',cvRT='rpartXse',cvLM='lm')
modelFuncs <- modelTypes[sapply(strsplit(NamesBestModels,split="\\."),
                                function(x) x[1])]
modelParSet <- lapply(NamesBestModels,
                      function(x) getVariant(x,allResults)@pars)
bestModels <- list()
for(i in 1:length(NamesBestModels)){
  form <- as.formula(paste(names(clean.data)[1+i],'~.'))
  bestModels[[i]] <- do.call(modelFuncs[i],
                             c(list(form,lapply(clean.data[,c(1,i+1,7:27)],as.numeric)),
                               modelParSet[[i]]))
}

And let's test the model again using the separate test dataset. We need to create a clean test dataset first.

Show me the code...

Your script should look something like this:

# create a data frame for test data 
data2 <- subset(mydata2,mydata2$Year >= 2010) 

# subset data for only events where actual events occurred 
clean.test.data <- data2[varName] 
clean.test.data <- subset(clean.test.data,!is.na(as.numeric(clean.test.data$Acreage.Planted)))

Now calculate the predictions.

Show me the code...

Your script should look something like this:

# create predictions
modelPreds <- matrix(ncol=length(NamesBestModels),nrow=nrow(clean.test.data))
for(i in 1:nrow(clean.test.data)){
  modelPreds[i,] <- sapply(1:length(NamesBestModels),
                           function(x)
                             predict(bestModels[[x]],
                                     lapply(clean.test.data[i,c(1,7:27)],as.numeric)))
}
# replace implausible (negative) predictions with 0
modelPreds[which(modelPreds < 0)] <- 0

Finally, calculate the NMSE.

Show me the code...

Your script should look something like this:

# calculate NMSE for all models
avgValue <- apply(data.matrix(clean.data[,2:6]),2,mean,na.rm=TRUE)
apply((data.matrix(clean.test.data[,2:6])-modelPreds)^2,2,mean,na.rm=TRUE)/
  apply(scale(data.matrix(clean.test.data[,2:6]),avgValue,F)^2,2,mean)

In this case, I get relatively high scores for every corn response variable except acreage harvested. Remember, there are fewer points going into this secondary test, so it may be skewed. One thing is clear, though: acreage harvested seems to be the best route to take. If we want to add value to corn futures, we would predict the acreage harvested using storm event frequency and acreage planted. Let's look at the acreage harvested model in more detail. First, let's visually inspect the predictions and the observations for acreage harvested.

Show me the code...

Your script should look something like this:

# plot test results for acreage harvested (the first response variable)
plot(modelPreds[,1],clean.test.data$Acreage.Harvested,main="Linear Model",
     xlab="Predictions",ylab="True values")
abline(0,1,lty=2)

Figure: Observations versus predictions for Acreage Harvested (linear model).
Credit: J. Roman

After visually inspecting the plot, we see a very good fit. One thing to note is that acreage planted is a factor in this model. If there is no difference between acreage planted and acreage harvested, then we aren't really providing more information by looking at storm event frequency. Let's start by looking at the difference between acreage harvested and acreage planted.

Show me the code...

Your script should look something like this:

# difference between planted and harvested 
differenceHarvestedPlanted <- clean.data$Acreage.Planted-clean.data$Acreage.Harvested

If you now apply the summary function to the difference, you will see that although there are instances where the acreage planted and the acreage harvested are identical, the difference is generally between 800 and 1,700 acres. We can also perform a t-test. As a reminder, the paired t-test will tell us whether the mean acreage planted is statistically different from the mean acreage harvested.
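
For reference, the summary step described above is simply:

# summarize the difference between acreage planted and acreage harvested
summary(differenceHarvestedPlanted)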

Show me the code...

Your script should look something like this:

# perform a t-test 
t.test(clean.data$Acreage.Planted,clean.data$Acreage.Harvested,alternative = "two.sided",paired=TRUE)

The result is a p-value < 0.01, which means the mean acreage planted and the mean acreage harvested are statistically different.

Let's check which variables have statistically significant coefficients in this model (coefficients that are different from 0).

Show me the code...

Your script should look something like this:

# follow up analysis 
summary(bestModels[[1]])

We see that the coefficients for the factor variables cold wind chill events, winter storm events, and droughts are statistically significant. The last thing we can do is compare three linear models: one using only acreage planted (as this is where most of the variability is probably being captured), one using acreage planted plus the other statistically significant factor variables (cold wind chill events, winter storm events, and droughts), and the one selected by our automated process.

Show me the code...

Your script should look something like this:

# create three separate linear models for comparison
# lm.1: acreage planted only; lm.2: acreage planted plus the significant
# storm variables; lm.3: the model selected through cross-validation
lm.1 <- lm(clean.data$Acreage.Harvested~clean.data$Acreage.Planted)
lm.2 <- lm(clean.data$Acreage.Harvested~as.numeric(clean.data$Acreage.Planted)+
             as.numeric(clean.data$coldWindChill)+as.numeric(clean.data$winterStorm)+
             as.numeric(clean.data$drought))
lm.3 <- lm(Acreage.Harvested~.,lapply(clean.data[,c(1,2,7:27)],as.numeric))

If you apply the summary function to all three models, you will see that the second model is almost a perfect fit and that lm.3 (the best model selected through cross-validation) has a slightly higher R-squared value than lm.1. If we perform an ANOVA between the models, we can determine whether they are statistically different.
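
For a quick side-by-side comparison of the fits described above, you can pull out the R-squared values directly (just a sketch using the three models defined earlier):

# compare the R-squared values of the three candidate models
sapply(list(lm.1=lm.1,lm.2=lm.2,lm.3=lm.3),
       function(m) summary(m)$r.squared)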

Show me the code...

Your script should look something like this:

anova(lm.1,lm.2)
anova(lm.1,lm.3) 

The results suggest that the improvement from lm.1 to lm.2 is statistically significant, while the difference between lm.1 and lm.3 (the model actually selected) is not as strong. This follow-up analysis suggests that it might be best to use linear model 2, which is less complex than linear model 3 but still provides a statistically different result from simply using acreage planted. You can see that even though the automation created a good model, digging in a little more allows us to refine it further. Data mining is a starting point, not the final solution.

By combining a number of tools you have already learned, we were able to create a model that effectively predicts an aspect of the corn harvest. We dug through a relatively large dataset and pulled out patterns that could bring added knowledge to investors and farmers, hopefully aiding better business decisions.