EME 210
Data Analytics for Energy Systems

Confidence and Prediction Intervals


We've previously seen confidence intervals for the mean and proportion in Lesson 4. We can also construct a confidence interval around our line of best fit; that is, a lower and upper bound within which we have a certain level of confidence (e.g., 95%) that the mean response is contained. Similarly, a prediction interval gives a range that we are confident (to a certain level) contains an individual observation at that specific value of x.

95% Confidence Interval for a Linear Model: there's a 95% chance that the MEAN response at a specific x lies within this interval.
95% Prediction Interval for a Linear Model: there's a 95% chance that the response VALUE at a specific x lies within this interval.

The video below shows how to calculate both confidence intervals and prediction intervals in Python.

  Watch It: Video -  Prediction Intervals (11:59 minutes)

Prediction Intervals
Click here for a transcript.

Hi, in this video, I'm going to show you how to get confidence and prediction intervals for a linear regression model in Python. So, let's get to it. The key function we're going to use is get_prediction, and this comes from the statsmodels package that we imported in the last video when we wanted to get R-squared. The command to do that is: import statsmodels.api as sm. Recall that in the previous video, on the previous page, we ran code to define an ordinary least squares, or simple linear regression, model, fit the model to the data, store the fitted model in 'res', and print the summary of the output. That code needs to be run first in order to construct the confidence and prediction intervals using the method I'm going to show you here.

So, let's start off with a simple example. To make a prediction, we call our object 'res', shorthand for results. 'res', you know, could be anything; this is user-defined. We defined it in the previous chunk. The key function here is 'res.get_prediction', and the arguments that you give it are the values at which you want to make the prediction. I have to give it two values, let's say one and 625. Now, here's where I'm getting these values from: the one corresponds to the intercept. Because this model adds a constant, and thus has a constant term, I need to give it a one first. This is basically a placeholder for the intercept, and it has to be one. The second value is the value that's going to be substituted for the explanatory variable, in this case, drowning. I'm picking 625 somewhat arbitrarily because, if you go back up to the data, 625 is a bit higher than anything else that we've seen in the data. So, I'm saying, okay, let's predict outside of the data here, but really, this could be any value I want.

So then, on our object 'predictions', I'm going to call 'summary_frame', and I'll give it an alpha value. This is my significance level. We'll use our default significance level of 0.05, which corresponds to a 95% confidence interval and a 95% prediction interval. So, let's go ahead and run that. This is the output I get: 'mean'. This value is the prediction of my linear model; it's the value that falls on the line of best fit at the drowning value of 625. 'mean_se' is the standard error about that estimate. Next are the lower and upper bounds of my confidence interval for that mean prediction. And then there are the lower and upper bounds of the prediction interval. Note that the output doesn't say 'prediction'; the columns are labeled 'obs_ci_lower' and 'obs_ci_upper', short for 'observation confidence interval'. So the labels don't align very well with the language used in the lesson, but you just have to bear in mind that these last two values are the prediction interval.

Okay, so that's how we can, at a very basic level, get the confidence and prediction interval for one point, one X value. But oftentimes, it's useful to get it for many values at once and to plot that graphically. So, let me show you how to do that for many instances of X, and then we can plot that.

So, let's make a new code cell. First, let's make a new data frame from scratch, using Pandas' DataFrame command. I'm going to make a new set of X values using the range command, and I want to span the range of drowning data that we have. So, I'm just going to generally say 400 to 625, and the third optional argument I can give to range is a step size. This is going to be the values 400 to 625 in steps of one — 400, 401, 402, and so on, all the way to 625 — and that'll be stored in the object 'new_df'. The other thing I need to create is a column of ones with the same number of rows as what I currently have. So, I can make a new column, called 'constant' or 'const', for example (I could call it whatever I wanted), and this will be a bunch of ones. I'm going to write a one in square brackets and then a multiplication. In Python, this replicates the value one that many times; it doesn't do arithmetic. So, '[1] * new_df.shape[0]' replicates the value one as many times as there are rows in this new data frame.

Okay, so let's take a look at what that looks like, just to show that this works as expected. We can look at the first five rows. There we go. So, I have my X values and just ones. Again, these ones are placeholders for the intercept.
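The grid-building step described above can be sketched as follows. One detail to watch: Python's range excludes its upper endpoint, so to include 625 the stop value needs to be 626 (the exact stop value used in the video isn't shown on this page).

```python
import pandas as pd

# Grid of x values spanning (and slightly beyond) the observed drowning data.
# range(400, 626) gives 400, 401, ..., 625 (the stop value is exclusive).
new_df = pd.DataFrame({"drowning": range(400, 626)})

# Column of ones as the intercept placeholder; [1] * n replicates the
# value 1 once per row rather than doing arithmetic.
new_df["const"] = [1] * new_df.shape[0]

print(new_df.head())   # first five rows: x values alongside ones
```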

So, now that I have those, I can then use 'get_prediction', this function. I can exercise it over all of these values. So, now I give it my column of ones, not just a single one value, and my column of X's, not just the single X value, and I will store the summary of these into a data frame. So, 'summary_frame' again, I need to specify an alpha. We'll use our default alpha of 0.05, and let's see what we have now.

So, there we go. So now we've got the prediction at X values 400, 401, 402, etc. It's not showing those values here, but it's giving us the predictions, confidence intervals, and the prediction intervals. Okay, let's merge this then with 'new_df'. So, we're going to override 'new_df', which has the X values in it, using our 'pd.concat' command. And so, I'm going to take our existing 'new_df', and I'm going to just append on these predictions. I need to say axis equals one so it does this over the columns, and we'll see what this looks like.

There we go. So now I have the X values in there as well. Okay, let's visualize. We'll import good old Plotnine — I had this backwards at first, sorry; it should be 'from plotnine', where we'll import all the commands. Then we can start with our initial data frame that has the data in it, and we'll just start by plotting those points. Our X is drowning, and our Y is nuclear, and let's see what that looks like. Okay, so there's our scatter plot, just the data points.

We can use 'geom_smooth' to conveniently make not only our line of best fit but also our confidence interval. We give it the aesthetic and a line color, set the method to 'lm', or linear model, say 'se = True' to use the standard error method, set level to 0.95 for a 95% confidence interval, and fill it with red as well. Let's see what this looks like. There we go: the red line is our line of best fit, and around it is our 95% confidence interval.

This doesn't give us our prediction interval, though. We had to calculate that manually; we don't have a convenient built-in function in ggplot to make the prediction interval. But what we can do is reference our 'new_df', which has the prediction interval information in it, so we just have to plug in this new data frame here. We're not borrowing from the data frame that we put in ggplot; we have to give this layer the new data frame. And we'll specify a slightly different aesthetic: our X is the X column in that data frame, and our Y is 'obs_ci_lower', the lower limit of the prediction interval. Let's give this some color to differentiate it from the red — we'll make it blue — and say linetype = 'dashed', just to differentiate a little bit more. And we also have to, of course, give it the upper bound too. So, we'll just copy and paste this line and change 'lower' to 'upper'.
And there we go. There's our 95% prediction interval in blue. You'll note that it extends a bit beyond what we have in red; that is because of the range I specified, 400 to 625. When I use the 'geom_smooth' command, it automatically limits itself to the extent of the data. But if I wanted to extend the line of best fit and/or the 95% confidence interval to the same limits as the prediction interval, then instead of using 'geom_smooth', I could just use 'geom_line' with 'new_df', because all that information is contained in there too. I'd just need to give it the right Ys, change the colors how I want, and so on and so forth.

Okay, so that is how to do prediction and confidence intervals in Python. Thank you.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0
The Colab file used in the video is available here. Make sure to sign in with your PSU credentials.


  Assess It: Check Your Knowledge

Knowledge Check

Use the table below to answer the following question:

mean         mean_se      mean_ci_lower  mean_ci_upper  obs_ci_lower   obs_ci_upper
8466.215372  5258.166436  -2411.130644   19343.561387   -38519.240164  55451.670907