EME 210
Data Analytics for Energy Systems

Introduction to Linear Regression



Read It: Introduction to Linear Regression

The goal of linear regression is to predict some value y based on its relationship with x. In practice, this means finding the best-fit line, i.e., the line that minimizes the total difference between the data points and the line, as shown in the figure below. These differences are denoted with the letter e. We then define the line by its slope and intercept, following the classic equation y = mx + b, where m is the slope and b is the intercept.
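As a quick sketch of this idea (the data values here are hypothetical, chosen only for illustration), the best-fit slope m and intercept b can be computed with NumPy's least-squares polynomial fit:

```python
import numpy as np

# Hypothetical data: x values and noisy y measurements
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit y = m*x + b by minimizing the sum of squared differences e
m, b = np.polyfit(x, y, deg=1)

# The differences e between each point and the fitted line
e = y - (m * x + b)
print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
```

The fitted line will not pass through every point; the leftover differences e are exactly what the least-squares fit makes as small as possible (in the squared sense).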

Best Fit Line
Credit: © Penn State is licensed under CC BY-NC-SA 4.0

When working with a sample of data, we usually denote the line as y_i = b_1 x_i + b_0 + e_i. For a population, we change the equation to use uppercase variables and Greek-letter coefficients: Y = β_1 X + β_0 + ε.
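The sample estimates b_1 and b_0 have closed-form least-squares solutions, which can be sketched directly (the data below are hypothetical, used only to show the arithmetic):

```python
import numpy as np

# Hypothetical sample data
x = np.array([10.0, 12.0, 15.0, 18.0, 20.0])
y = np.array([3.1, 3.6, 4.4, 5.0, 5.7])

# Least-squares estimates: b_1 from the co-deviations of x and y,
# b_0 so that the line passes through the point of means
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Residuals e_i are whatever the line leaves unexplained
e = y - (b1 * x + b0)
print(f"b_1 = {b1:.4f}, b_0 = {b0:.4f}")
```

A useful property of this fit is that the residuals always sum to (essentially) zero, which is one reason the error term in the population model is centered at zero.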

A key aspect of this latter equation is that ε is normally distributed, i.e., ε ~ N(0, σ_e). In fact, normally distributed error is a critical assumption of linear regression models. Additional assumptions include: (a) a linear relationship between x and y, (b) constant variability in the data, and (c) no outliers present. Examples of how each of these conditions might be broken are provided below.
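To make the population model concrete, we can simulate data from it; the parameter values below are arbitrary choices for illustration. Because we know the true line here, the residuals recover ε exactly (in practice, you would inspect residuals from the fitted line):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate Y = beta_1*X + beta_0 + epsilon with epsilon ~ N(0, sigma_e)
beta1, beta0, sigma_e = 2.0, 1.0, 0.5
X = rng.uniform(0, 10, size=1000)
epsilon = rng.normal(0, sigma_e, size=1000)
Y = beta1 * X + beta0 + epsilon

# Residuals from the true line: centered at 0 with spread sigma_e
resid = Y - (beta1 * X + beta0)
print(f"residual mean = {resid.mean():.3f}, residual std = {resid.std():.3f}")
```

The residual mean is near zero and the residual standard deviation is near σ_e, which is exactly what the normality assumption ε ~ N(0, σ_e) says should happen.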

Figure A: Scatter plot with a distinctly curved pattern, illustrating a violation of the linearity condition.
Credit: © Penn State is licensed under CC BY-NC-SA 4.0
Figure B: Scatter plot with a fan or wedge shape, illustrating a violation of the constant-variability condition.
Credit: © Penn State is licensed under CC BY-NC-SA 4.0
Figure C: Scatter plot containing four likely outliers, illustrating a violation of the no-outliers condition.
Credit: © Penn State is licensed under CC BY-NC-SA 4.0

In particular, Figure (a) shows how the first condition (linearity) might be broken: the pattern is distinctly curved rather than straight. In Figure (b), the second condition (constant variability) is broken. Here we see a fan or wedge shape, which indicates that the variance in the y data at x = 20 is much larger than at x = 10; the variance is not constant, which breaks one of the conditions for linear regression. Finally, Figure (c) shows the third condition (no outliers) being broken: there are four likely outliers. Below is an example of data that meets all three conditions. Notice that the pattern is relatively straight, the variance does not increase, and there are no outliers. When conducting linear regression, this is how we want our data to look in an ideal world. Unfortunately, real-world data is rarely this perfect, so we often have to apply transformations to ensure the model is statistically valid. We will go over these transformations later in the course.

Scatter plot of ideal data for linear regression: a straight-line pattern with constant variance and no outliers.
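A simple numerical check can flag the fan/wedge pattern from Figure (b) before you even plot: compare the residual spread at small x to the spread at large x. This sketch uses simulated data (all values hypothetical) whose noise grows with x:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fan-shaped data: the error scale grows with x,
# so the constant-variability condition is broken
x = rng.uniform(1, 20, size=500)
y = 2 * x + rng.normal(0, 0.3 * x)

# Fit a line, then compare residual spread at low vs. high x
m, b = np.polyfit(x, y, 1)
resid = y - (m * x + b)
spread_low = resid[x < 10].std()
spread_high = resid[x >= 10].std()
print(f"spread at low x = {spread_low:.2f}, at high x = {spread_high:.2f}")
```

For data meeting the constant-variability condition, the two spreads would be roughly equal; here the high-x spread is clearly larger, mirroring the fan shape in Figure (b).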


Assess It: Check Your Knowledge

Knowledge Check