METEO 815
Applied Atmospheric Data Analysis

Pattern Analysis Part 1: Multiple Linear Regressions

Prioritize...

By the end of this section, you should be able to perform multiple linear regressions on a dataset and optimize the resulting model.

Read...

There are many techniques for pattern analysis, but in this course we are going to talk about two that build on what we have already learned: multiple linear regressions and regression trees. In other courses within this program, we may come back to data mining and explore more advanced techniques. But for now, let's focus on expanding what we already know.

Initial Analysis

We discussed linear regression in a previous lesson, so you are already familiar with this method. A multiple linear regression extends that method to several factor (predictor) variables at once; we will then refit the model repeatedly to home in on an optimal solution. Before we get into how we decide what's optimal, we need an initial model to start with. Let's begin by creating a linear model for each of our response variables (yield/acre, production, season average price, value of production, and acreage harvested) using all the factor variables.

Show me the code...

Your script should look something like this:

# fit an initial linear model for each response variable, using all four
# factor variables: acreage planted (column 4) and the tornado, hail, and
# thunderstorm wind frequencies (columns 10:12)
data <- clean.stateData[,c(4,6,10:12)] # factor variables plus Yield.Acre
lm1.YieldAcre <- lm(Yield.Acre~.,lapply(data,as.numeric))
lm1.Production <- lm(Production~.,lapply(clean.stateData[,c(4,7,10:12)],as.numeric))
lm1.SeasonAveragePrice <- lm(Season.Average.Price~.,lapply(clean.stateData[,c(4,8,10:12)],as.numeric))
lm1.ValueofProduction <- lm(Value.of.Production~.,lapply(clean.stateData[,c(4,9:12)],as.numeric))
lm1.AcreageHarvested <- lm(Acreage.Harvested~.,lapply(clean.stateData[,c(4:5,10:12)],as.numeric))
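A quick note on the construction: lapply(..., as.numeric) coerces every column to numeric and returns a named list, which lm() accepts through its data argument. The formula Yield.Acre ~ . then treats all remaining columns in that list as factor variables.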

For each response variable, we have a linear equation that takes as inputs our four factor variables: acreage planted, tornado, hail, and thunderstorm wind frequency. Run the code below to summarize each model. 
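As a sketch, one summary() call per model prints the coefficient table, R-squared values, and overall P-value discussed below:

# summarize each initial model
summary(lm1.YieldAcre)
summary(lm1.Production)
summary(lm1.SeasonAveragePrice)
summary(lm1.ValueofProduction)
summary(lm1.AcreageHarvested)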

The first thing to look at is the P-value at the bottom of the summary, which tells us whether we should reject the null hypothesis. In this case, the null hypothesis is that there is no dependence: the response variable does not depend on any of the factor variables (tornado, hail, thunderstorm wind, acreage planted). If this P-value is greater than 0.05, there is not enough evidence to support a linear model with these variables, which means we should not continue on! If the P-value is less than 0.05 (remember, this corresponds to a 95% confidence level - you can change this depending on what confidence level you want to test), then we can reject the null hypothesis and continue examining the model. All P-values are less than 0.05 for this example, so we can continue on.

The next step is to look at the R-squared and adjusted R-squared values. Remember, R-squared tells us how much of the variability is captured by the model - it's a metric of fit. The closer to 1, the better the fit. If the R-squared value is really low (< 0.5), a linear model is probably not the best method for the data at hand and should no longer be pursued. If it's greater than 0.5, we can continue on. The adjusted R-squared is a more robust metric because it penalizes each additional factor variable (it is usually lower than the plain R-squared). One thing to note about the adjusted R-squared: if it is substantially smaller than the R-squared, you do not have enough training cases for the number of factor variables you are using. This will be important to look at later on. For now, all the R-squared values are greater than 0.5; we can move on.
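For reference, with n training cases and p factor variables, the adjusted R-squared is

$$ R^2_{adj} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1} $$

The (n-1)/(n-p-1) penalty is why dropping a weak factor variable can raise the adjusted value even though the plain R-squared falls.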

The next step is to look at the individual coefficients of the factor variables. We want to know whether each coefficient is statistically significant. We test whether the null hypothesis, that the coefficient is equal to 0, can be rejected. The probability is shown in the last column, Pr(>|t|). If a coefficient is statistically significant, it will be marked by a symbol. As long as you have at least one significant coefficient, you should keep going. In this example, each model has at least one coefficient that's statistically significant.
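If you prefer to pull these values out programmatically rather than reading the printed summary, the coefficient table is stored in the summary object; for example:

# extract the coefficient table (estimates, t-values, and Pr(>|t|))
coef.table <- summary(lm1.YieldAcre)$coefficients
coef.table[,"Pr(>|t|)"] # just the coefficient P-values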

At this point, we have an initial model (for each response variable) that is statistically significant. Now we want to begin optimizing the model. Optimizing a model is a balance between creating a simple and efficient model (few factor variables) and minimizing errors from the fit. 

Backward Elimination

One method for optimization is called backward elimination. We start with an initial model containing all the factor variables, determine which coefficients are least significant, and remove the corresponding factor variables. We then update the model using just the remaining factor variables. There are several metrics for deciding which factor variable to reject next. You could look at which coefficient fails to reject the null hypothesis in the summary function, suggesting that the coefficient is not statistically different from 0. Another option is to look at how well the model predicts the response variable. We can do this by using ANOVA. If you remember from our previous lesson, we used ANOVA as a sort of multiple t- or Z-test: it told us whether the response variable is affected by a factor variable with different levels. If we rejected the null hypothesis, the evidence suggested the factor variable affected the response variable. Since ANOVA is computed directly from the fitted linear model, we can also use it to test the fit of the model. Run the code below to give this a try:
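As a sketch, the ANOVA table for each initial model comes from a single anova() call:

# ANOVA table for each initial model
anova(lm1.YieldAcre)
anova(lm1.Production)
anova(lm1.SeasonAveragePrice)
anova(lm1.ValueofProduction)
anova(lm1.AcreageHarvested)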

For ANOVA, the null hypothesis for each term is that adding that factor variable does not reduce the residual error of the fit. You want to look at the values of Pr(>F), which are again marked by symbols when they are statistically significant. If a term is statistically significant, then that factor variable statistically reduces the error of the fit. If a term is not marked, it would be optimal to remove that variable from our model.

For this example, I would get rid of ts_Wind for the response variables yield/acre, production, season average price, and value of production, and I would get rid of tornado for the response variable acreage harvested. Let's update our models by removing these variables, as sketched below. Fill in the missing code for the 'AcreageHarvested' linear model. Make sure you remove the variable 'tornado'.
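A sketch of the updates, assuming illustrative second-round names like lm2.YieldAcre and the ts_Wind column name used in the text (update() refits a model after modifying its formula; the AcreageHarvested model is deliberately left for you):

# drop ts_Wind from the first four models
lm2.YieldAcre <- update(lm1.YieldAcre, .~.-ts_Wind)
lm2.Production <- update(lm1.Production, .~.-ts_Wind)
lm2.SeasonAveragePrice <- update(lm1.SeasonAveragePrice, .~.-ts_Wind)
lm2.ValueofProduction <- update(lm1.ValueofProduction, .~.-ts_Wind)
# lm2.AcreageHarvested: your turn - remove 'tornado' instead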

Now we repeat the process by summarizing our new models.
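Continuing with the illustrative lm2 names from the sketch above:

# summarize the updated models
summary(lm2.YieldAcre)
summary(lm2.Production)
summary(lm2.SeasonAveragePrice)
summary(lm2.ValueofProduction)
summary(lm2.AcreageHarvested)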

The P-value should still be less than 0.05 and the adjusted R-squared should have increased. If you remove a factor variable from a regression, the R-squared will always decrease (or at best stay the same), but the adjusted R-squared may increase. The goal is to minimize the number of factor variables and drive up the adjusted R-squared, even though dropping each factor variable costs you a bit of R-squared. If the adjusted R-squared didn't increase, the update wasn't successful. Let's compare the two models (before and after). Add in the code for the Acreage Harvested models.
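For the comparison, anova() takes both fits as arguments; a sketch for the first four response variables (the Acreage Harvested comparison is yours to add):

# compare each initial model against its updated version
anova(lm1.YieldAcre, lm2.YieldAcre)
anova(lm1.Production, lm2.Production)
anova(lm1.SeasonAveragePrice, lm2.SeasonAveragePrice)
anova(lm1.ValueofProduction, lm2.ValueofProduction)
# anova(lm1.AcreageHarvested, ...): your turn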

This form of the ANOVA test (with two models as arguments) tells us how significant the update was. If the change were statistically significant, the value of Pr(>F) would be less than 0.05 and would be marked by a symbol. None of our models changed significantly; we have still optimized each model by reducing the number of factor variables and increasing the adjusted R-squared, without significantly degrading the fit. Now we can repeat this process on the new model and remove another variable, further refining the model.

Automated Backward Elimination

If you have a lot of factor variables, you would not want to run this routine manually. There are functions in R that automate it. One such function is 'step'. By default, 'step' performs backward elimination, but instead of using the ANOVA fit as its metric, it uses the AIC score. As a reminder, AIC is the Akaike Information Criterion, another metric describing the quality of a model relative to other models. AIC focuses on the information lost when a given model is used, whereas R-squared focuses on prediction power. 'step' compares the AIC scores before and after each removal to optimize the model. You still need to create the initial linear model, but then you can run 'step' to optimize it automatically. Fill in the missing code for the Value of Production final linear model.
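A sketch, assuming illustrative final-model names like lm.final.YieldAcre; calling step() on each initial fit runs the backward elimination, and the Value of Production model is left for you to fill in:

# automated backward elimination using AIC
lm.final.YieldAcre <- step(lm1.YieldAcre)
lm.final.Production <- step(lm1.Production)
lm.final.SeasonAveragePrice <- step(lm1.SeasonAveragePrice)
# lm.final.ValueofProduction: your turn
lm.final.AcreageHarvested <- step(lm1.AcreageHarvested)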

We can then summarize the final models by running the code below:
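Continuing with the illustrative names from the sketch above:

# summarize the optimized models
summary(lm.final.YieldAcre)
summary(lm.final.Production)
summary(lm.final.SeasonAveragePrice)
summary(lm.final.ValueofProduction)
summary(lm.final.AcreageHarvested)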

You are left with the optimized models. Please note that there is a random component, so the results will not always be exactly the same. For the example I ran, the models with high adjusted R-squared values are those for acreage harvested (0.885) and production (0.769). This means that 88.5% of the variability in acreage harvested can be explained by the acreage planted and the frequency of hail events, and 76.9% of the variability in production can be explained by the combination of acreage planted, hail, and ts_Wind events. We will discuss evaluating these models in more detail later on. For now, make sure you understand what the 'step' function does and how we perform a data mining process using a multiple linear regression approach.