We will now investigate the multivariate generalization of ordinary linear regression, using a data set of Northern Hemisphere land temperatures over the past century. We will attempt to statistically model the observed data in terms of a set of three predictors: (1) estimates from a simple climate model (discussed in our next lesson) known as an Energy Balance Model (EBM), which has been driven by estimated historical anthropogenic (greenhouse gas and aerosol) and natural (volcanic and solar) radiative forcing histories, and two internal climate phenomena discussed in the previous subsection: (2) the DJF (December-February) Niño3.4 index, measuring the influence of the El Niño phenomenon, and (3) the DJFM (December-March) NAO index.
The demonstration is in 4 parts below:
Video: Linear Regression Demonstration - Part 1 (1:52)
Transcript of Demonstration Part 1:
PRESENTER: We're going to take a look at an example of multivariate regression, and I'm going to read in a data set here. This data set contains the average temperature of the Northern Hemisphere land regions over the past century. So let's start out by plotting the data so we can see what it looks like. And there it is. We're going to look at three potential quantities that may explain some of the variations that we see in this temperature series. So the temperature series is our dependent variable, and we're going to look at three different independent variables. One of these independent variables is a simulation that I've done using a climate model called an energy balance model, which we'll be talking about in a future lesson. I've subjected this theoretical climate model to both human impacts (estimated radiative forcing by greenhouse gases and anthropogenic sulfate aerosol emissions) and radiative forcing by natural causes, including explosive volcanic eruptions and estimated changes in solar output over time. So let's take a look at what that simulation looks like. Click here on EBM. So we have both series here: the temperature anomaly in blue and the EBM output in yellow, with the values on the right-hand side. You can see that the energy balance model simulation, the yellow curve, does capture quite a bit of the variation in the blue curve, the instrumental surface temperature data. There's still quite a bit of variation that's left unexplained, and we will now look at two other factors that are internal to the climate system, rather than external in nature, that might explain some of that residual variability.
Video: Linear Regression Demonstration - Part 2 (2:09)
Transcript of Demonstration Part 2:
PRESENTER: Before we go on, let's actually do the formal regression using only the energy balance model simulation as a single predictor of the Northern Hemisphere land temperatures. We'll run the regression here: EBM, our observation is temperature, run. We see that the r-squared value is 0.716. That means a fairly impressive share, just under 72%, of the variation in Northern Hemisphere land temperatures is explained just using the energy balance model. If we look at the value of rho, the lag-1 autocorrelation coefficient, 0.067 tells us that autocorrelation of the residuals doesn't appear to be a problem. So let's go back to the plot, and now we're going to plot the regression model output instead of the EBM, but we'll put them both on the same scale here, 1. You can see that it does provide, as we saw before, a fairly good fit to the data. It explains just under 72 percent of the variation in the data. If I turn off these plots and look at just the residuals from the model, we can see that the residuals still look pretty random, and there doesn't appear to be a whole lot of structure, although there's quite a bit of internal variability. Perhaps we can explain some of the interannual variability through two other predictors, the El Niño phenomenon and the North Atlantic Oscillation phenomenon, two internal climate modes that influence Northern Hemisphere land temperatures. So we'll look at that next.
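The two diagnostics quoted in Part 2, the r-squared value and the lag-1 autocorrelation coefficient (rho) of the residuals, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data below are synthetic stand-ins rather than the course data set, the function names `r_squared` and `lag1_autocorrelation` are hypothetical, and the Linear Regression Tool may compute these quantities somewhat differently.

```python
import numpy as np

def r_squared(y, y_hat):
    """Fraction of the variance in y explained by the fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)          # unexplained (residual) sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)     # total sum of squares about the mean
    return 1.0 - ss_res / ss_tot

def lag1_autocorrelation(residuals):
    """Lag-1 autocorrelation coefficient (rho) of a residual series."""
    e = residuals - np.mean(residuals)
    return np.sum(e[:-1] * e[1:]) / np.sum(e ** 2)

# Synthetic stand-in data (an assumption, NOT the course data set):
# x plays the role of a single predictor, y the role of the observed temperatures.
rng = np.random.default_rng(0)
n = 100
x = np.linspace(-0.5, 0.5, n) + 0.1 * rng.standard_normal(n)
y = 0.1 + 1.2 * x + 0.15 * rng.standard_normal(n)

# Ordinary least squares with one predictor: y ~ c0 + c1 * x
c1, c0 = np.polyfit(x, y, 1)   # polyfit returns the slope first, then the intercept
y_hat = c0 + c1 * x
residuals = y - y_hat

print("R^2 =", r_squared(y, y_hat))
print("rho =", lag1_autocorrelation(residuals))
```

A rho close to zero, as in the demonstration, suggests that successive residuals are nearly independent, so the usual regression statistics are less likely to be distorted by serial correlation.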
Video: Linear Regression Demonstration - Part 3 (1:56)
Transcript of Demonstration Part 3:
PRESENTER: So let's look at the other factors that might potentially explain some of the variation in this temperature series. First of all, we'll look at the Niño 3.4 index. We'll plot that here. That, as we know, is a measure of El Niño variability; you can see that measure on the right axis. As we can see, there seems to be somewhat of a correlation between Northern Hemisphere land temperatures and El Niño. One of the more obvious examples is the 1997-98 El Niño, one of the largest El Niño events on record, which was also associated with an unusually warm year. We see other evidence of relationships between El Niños and warmer or cooler temperatures: where the El Niño index, this yellow curve, dips down to this extreme negative value, that was an unusually cold year, in 1917. So we might expect that there's a positive relationship between these two time series, and that we can explain some of the variation in the Northern Hemisphere temperature series with El Niño. Now let's finally look at the NAO index. Let's put that here, NAO, as a line plot. That's here in green. Let's turn this one off. Again we can see somewhat of a positive relationship: unusually warm years in some cases appear to be associated with a positive NAO index. So we might expect that El Niño and the NAO can explain some of the remaining variation that our energy balance model simulation didn't explain, and so our next step would be to try all three factors at once.
Video: Linear Regression Demonstration - Part 4 (3:23)
Transcript of Demonstration Part 4:
PRESENTER: So now let's try our multivariate regression using all three predictors: the energy balance model simulation, El Niño, and the NAO. So we go to the regression model tab here, and with a click of the mouse we can select all three quantities at once: EBM, Niño 3.4, and the NAO. We're trying to predict temperature. So we run that model, and now we can see that we explain nearly 80% of the variation in the temperature series. We went from just under 72% to now essentially 80% of the variation using these three predictors. That's about as good as you can expect to do in a simple multivariate regression of this sort: to explain roughly four-fifths of the total variation in the data. We can also see that the autocorrelation coefficient is small, this rho value down at the bottom. It's not going to be statistically significant, and we don't have to worry about autocorrelation in the residuals, which is nice. So now let's go back to plot settings, and we're going to plot, alongside our temperature series, our model output from the simulation. Let's make that a line plot and put these on the same scale: click over here, click one, down here, minus one. So our model simulation result, remember, includes both the energy balance simulation and El Niño and the NAO, these two internal factors. The yellow curve is our statistical model based on these three predictors; the blue curve is the actual temperature series. We've explained a fairly impressive amount of the variation in the data. We can see the effect of volcanic eruptions in some of the short-term coolings that are seen in the record, and then a lot of the other internal fluctuations are at least partially explained by the NAO and El Niño. If we like, we can recover the regression coefficients in our regression model: the constant term, and the terms multiplying the energy balance model, Niño 3.4, and the NAO index. The sum of these terms is our statistical model, and it does quite well in this particular case. Finally, we can take a look at the residuals, what's left over after all this. Let's get rid of that, plot the residuals, and that's what's shown here in the blue curve. There is some variability, of course, left over that isn't explained by the factors we've considered, but there isn't a whole lot of structure in that time series, suggesting that the results of this multivariate regression are probably meaningful and are telling us something about the underlying factors that explain long-term variations, year-to-year variations, and decadal variations in the Northern Hemisphere land temperatures over time. We can see our total model values go from -1 to 1; the residuals here range from minus 0.4 to roughly 0.3.
You can play around with the data sets used in this example yourself using the Linear Regression Tool and selecting the data 'tempexample.txt'. Note that the final model from this example is: Y(t)=c0+c1*x1(t)+c2*x2(t)+c3*x3(t), where Y(t) is the predicted temperature anomaly at time t, x1 is the EBM value, x2 is the Niño 3.4 value, and x3 is the NAO value for a given time t. Here c0 is the constant term and c1, c2, and c3 are the regression coefficients multiplying each predictor.
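For readers who would like to see how a model of this form can be fit outside the Linear Regression Tool, here is a minimal sketch in Python using NumPy's least-squares solver. It rests on several assumptions: the three predictor series are synthetic stand-ins (the format of 'tempexample.txt' is not described here, so it is not read), the coefficient values used to generate the data are invented for illustration, and the helper names `fit_linear_model` and `predict` are hypothetical.

```python
import numpy as np

def fit_linear_model(y, predictors):
    """Least-squares fit of y ~ c0 + c1*x1 + ... + ck*xk.

    predictors is a list of 1-D arrays, one per predictor.
    Returns the coefficients as an array [c0, c1, ..., ck].
    """
    # Design matrix: a column of ones for the constant c0, then one column per predictor.
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict(coeffs, predictors):
    """Evaluate Y(t) = c0 + c1*x1(t) + ... + ck*xk(t)."""
    X = np.column_stack([np.ones(len(predictors[0]))] + list(predictors))
    return X @ coeffs

# Synthetic stand-ins (assumptions, NOT the course data):
rng = np.random.default_rng(42)
n = 100
x1 = np.linspace(-0.3, 0.6, n)     # plays the role of the EBM series
x2 = rng.standard_normal(n)        # plays the role of the Nino 3.4 index
x3 = rng.standard_normal(n)        # plays the role of the NAO index
y = 0.05 + 1.0 * x1 + 0.08 * x2 + 0.06 * x3 + 0.1 * rng.standard_normal(n)

coeffs = fit_linear_model(y, [x1, x2, x3])
y_hat = predict(coeffs, [x1, x2, x3])
residuals = y - y_hat

# Fraction of variance explained by the three-predictor model.
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - np.mean(y)) ** 2)

print("c0, c1, c2, c3 =", coeffs)
print("R^2 =", r2)
```

Dropping `x2` and `x3` from the predictor list gives the single-predictor fit from Part 2, so the same sketch can be used to compare how much additional variance the two internal climate modes explain.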