EME 210
Data Analytics for Energy Systems

Linear Algebra Solution of Linear Regression


Introduction to Linear Regression

Read It: Linear Algebra Solution of Linear Regression

Recall from earlier that linear regression involves finding the best fit line for a given dataset. Traditionally, this best fit line is found through linear algebra. If you have taken a linear algebra course, then you may recall that to multiply two matrices, you pair each row of the first matrix with each column of the second, summing the products of their entries, as demonstrated below.
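
For example, multiplying a 2 x 2 matrix by a 2 x 1 matrix pairs each row of the first with the column of the second:

$\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 \\ 6 \end{bmatrix} = \begin{bmatrix} 1 \cdot 5 + 2 \cdot 6 \\ 3 \cdot 5 + 4 \cdot 6 \end{bmatrix} = \begin{bmatrix} 17 \\ 39 \end{bmatrix}$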

We will use this concept to find the slope and intercept through linear regression. Below we outline the steps.

  1. Start with a simple linear model for each point in the dataset:
     $y_1 = b_1 x_1 + b_0 \cdot 1$
     $y_2 = b_1 x_2 + b_0 \cdot 1$
     $\vdots$
     $y_n = b_1 x_n + b_0 \cdot 1$
  2. Convert the series of equations to matrix form:
     $\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_n & 1 \end{bmatrix} \begin{bmatrix} b_1 \\ b_0 \end{bmatrix}$
  3. Rewrite in matrix notation: $\mathbf{y} = \mathbf{X}\mathbf{b}$
  4. Multiply both sides of the equation by the transpose of $\mathbf{X}$: $\mathbf{X}^T \mathbf{y} = \mathbf{X}^T \mathbf{X} \mathbf{b}$
  5. Multiply both sides by the inverse of $\mathbf{X}^T \mathbf{X}$: $(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} = \mathbf{b}$
  6. Plug the values for $\mathbf{b}$ into the equation for the line: $y = b_1 x + b_0$. (A code sketch implementing these steps follows this list.)
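
Taken together, these steps translate almost directly into NumPy. Below is a minimal sketch using small made-up data arrays (the values are hypothetical, purely for illustration); the video that follows walks through the same process on a real data set.

    import numpy as np

    # Hypothetical example data
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

    # Step 2: design matrix X with a column of ones for the intercept
    X = np.column_stack([x, np.ones(len(x))])

    # Steps 4-5: b = (X^T X)^(-1) X^T y
    b = np.linalg.inv(X.T @ X) @ X.T @ y

    b1, b0 = b  # slope and intercept
    print(f"y = {b1:.3f}x + {b0:.3f}")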

In the video below, we demonstrate how to implement these steps in Python.

Watch It: Video - Linear Algebra Solution (12:29 minutes)

Transcript:

All right, welcome back to the video series on linear regression. So in the previous set of videos we talked about correlation and how we can use the hypothesis test for correlation to determine the significance of an association. Today we're going to start getting into actually conducting linear regression. So the correlation can give us an idea about that association, but the linear regression is what really provides that best fit line and allows us to make predictions based off of that linear model. And so, in this particular video we're going to talk about the manual, linear algebra style of doing linear regression. And this is an important base for the next couple videos that will focus on linear regression in Python, because whatever your computer is doing to conduct linear regression, it follows the exact same framework that we're about to show in this video. And so, while it may seem a little mathematical and a little abstract, just know that this is the basis for all linear regression that is conducted in any computer program. So with that, let's go ahead and dive right in.

So we are here in Google Colab. And we've got a few additional libraries that we will get to at a later time, but for most of this lecture series I'm actually going to focus on using a data frame that we talked about when we discussed spurious correlation, and that is the drowning and nuclear data frame, or data set, that has been published right here. And essentially, this data set has just correlated the number of drowning deaths in the US with the amount of power generated by nuclear energy. And as we've discussed before, this actually has a very high, positive correlation value. So to demonstrate that correlation, I'm going to start with ggplot. So I've called the data frame df, and I'm also going to add in the aes statement here in the ggplot command like I did in the correlation video. And this just really helps to streamline the plotting process, because I no longer need to add an aes statement into each individual plotting command. I can focus on adding in the command-specific values, so method, color, and for now, I'm actually going to say se equals False, which says, don't put those confidence interval bands that we often see around the data. So we can see here, we've got nuclear and drowning, and it's a very strong, positive association. But we know, just from logic, that there is no causal relationship between drowning and nuclear power, just that they seem to have increased at the same rate over time. And so, nonetheless, we're going to use this data set to conduct manual linear regression. And this is going to follow those same steps that we talked about in the lesson itself, but here I'm going to demonstrate how you can do that linear algebra in Python. So, the first step is to create a column of ones, and I'm just going to call it ones, and I'm going to say take a scalar of one and multiply it by the number of rows in df. So, if we print that, we can see that we now have this column of ones alongside our drowning and our nuclear power data.
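
As a rough sketch, the plotting and ones-column steps narrated above might look like the following, assuming a pandas DataFrame df with columns 'drowning' and 'nuclear' as in the video (the red color is an assumption; the narration only mentions setting a color):

    from plotnine import ggplot, aes, geom_point, geom_smooth

    # Scatter plot with a best fit line; se=False suppresses the confidence band
    (ggplot(df, aes(x='drowning', y='nuclear'))
     + geom_point()
     + geom_smooth(method='lm', color='red', se=False))

    # Column of ones: a scalar 1 repeated once per row of df
    df['ones'] = [1] * len(df)
    print(df)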

So then the next step is to extract the x data, and the ones, into a separate matrix. So when we're doing matrix multiplication and matrix math, we need that y matrix, that X matrix, and then we're going to be solving eventually for that beta matrix. And so, to get this X data, I'm going to create a new sort of list structure where I put my column names, so drowning, and ones, and then I'm going to say X is pd.DataFrame(df[cols]).to_numpy(). So first I'm converting it into a data frame, and then I'm converting it into a NumPy array. So then we can print X, and now we can see that this is the same data that we had up here in drowning, but now it's in matrix array form. So then we need to separate out the y data. And that's going to look very similar. This time we don't need to create a column list because it's just one column, so we can say pd.DataFrame(df['nuclear']).to_numpy(). And then we can print y.
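
In code, the matrix setup narrated above looks roughly like this (again assuming the df from earlier):

    import numpy as np
    import pandas as pd

    # Predictor column plus the ones column form the X matrix
    cols = ['drowning', 'ones']
    X = pd.DataFrame(df[cols]).to_numpy()
    print(X)

    # The y matrix is just the single response column
    y = pd.DataFrame(df['nuclear']).to_numpy()
    print(y)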

And so, now we've just got a matrix of our nuclear data. And now that we've got our y matrix and our X matrix, we can start to solve that equation. So the first thing we need to do is transpose the X matrix. And so I'm using this transpose command from NumPy. And I'm just going to create a new variable, XT = np.matrix.transpose(X), and this is just where we flip our rows into columns, and our columns into rows. So we can see that now instead of being a long data frame, it's now a wide data frame, with the first row being drowning, and the second row being ones. And so, then continuing on with this linear regression, the fifth step is to multiply XT by X. And so, if you have taken a linear algebra class then you know that you would do this manually by multiplying rows by columns and adding up the products. But in Python, we have this nice command called matmul, for matrix multiplication, and we just give it the two values, and I call that with (XT, X) for XT times X. And so, now we can see we end up with a two by two matrix that is the X transpose multiplied by X. And now the next step, step six, is to do that same process but on the y. So you can sort of imagine the equation in your head. You've got y equals X times b. And so, if we do something to the X side, we need to also do something to the y side, in order to make sure that we're not unbalancing our equation, essentially. So here I'm going to call it XTy, and it is np.matmul(XT, y), and then we can run that. And so, we've got a two by one matrix on the y side. And so, now, if you're following the equation, we've got XTy equals XTX times b. So we need to get that XTX to the other side of the equation, and the way that we do that is through the inverse. So dividing by a matrix corresponds to multiplying by its inverse. And so, in order to do that, we're still using NumPy, but we're using linalg, for linear algebra, dot inv, for inverse. So it's XTX_inv, and this is just np.linalg.inv, and then we just give it the variable XTX, and then we can print that.
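
The transpose, multiplication, and inverse steps narrated above correspond to roughly the following sketch (np.transpose is used here in place of the video's np.matrix.transpose; both flip rows and columns):

    # Transpose X: rows become columns and vice versa
    XT = np.transpose(X)

    # X^T X (2x2) and X^T y (2x1)
    XTX = np.matmul(XT, X)
    XTy = np.matmul(XT, y)

    # Inverse of X^T X
    XTX_inv = np.linalg.inv(XTX)
    print(XTX_inv)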

And so, then we can see we've still got this two by two matrix, but this is now the inverse. And so, now that we have the inverse, we're ready to move XTX from the right side of the equation over to the left. And this is how we essentially get our beta values. So now we're multiplying by the inverse on both sides. We've done it on the X side, and now we do it on the y side, which essentially just means multiplying XTX inverse with XTy.
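
That final multiplication is one line, continuing the sketch from above:

    # b = (X^T X)^(-1) X^T y
    b = np.matmul(XTX_inv, XTy)
    print(b)  # 2x1 matrix: slope in the first row, intercept in the second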

And the result is our beta values, beta naught and beta one. And so, we can see here that we've got this two by one matrix that contains our slope in the first row and our intercept in the second row. And so, through this linear algebra we come to this conclusion: the results are the beta values. And so, if we take our regression line y = mx + b, we can change that to y = 0.313x + 614. And this is now our equation of the line that we could use to make predictions for new y values given x values. And like I said, this linear algebra solution might be a little abstract, but it's a really important base, because this is what, for the rest of this lesson, will be happening behind the scenes. This is exactly what Python is doing every single time we ask it to do linear regression. So in the next several videos, we'll actually get into some more Python commands for making this process a little easier on us.
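
To make a prediction from the fitted line, you would pull the slope and intercept out of the beta matrix, for example (the new x value below is hypothetical, purely for illustration):

    b1, b0 = b[0, 0], b[1, 0]   # slope and intercept
    x_new = 4000                # hypothetical number of drowning deaths
    y_pred = b1 * x_new + b0    # predicted nuclear generation
    print(y_pred)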

Credit: © Penn State is licensed under CC BY-NC-SA 4.0