Lesson 4: Confidence intervals and bootstrapping

Overview

A critical step in making statistical inferences from a sample is defining how confident you are in those results. In this lesson, we will introduce sampling distributions, bootstrapping, and confidence intervals to provide this measure of confidence. You will learn how to create a sampling distribution from bootstrapping, as well as how to use that bootstrapped sampling distribution to determine the confidence interval. In particular, we will focus on two key methods for calculating confidence intervals: the standard error method and the percentile method.

Learning Outcomes

By the end of this lesson, you should be able to:

define a confidence interval for a dataset
draw conclusions from a confidence interval
estimate a confidence interval using bootstrapping

Lesson Roadmap

Type	Assignment	Location
To Read	Lock, et al. 3.1-3.4	Textbook
To Do	Complete Homework: H05 Confidence Intervals	Canvas
To Do	Take Quiz 4	Canvas

Questions?

If you prefer to use email:

If you have any questions, please send a message through Canvas. We will check daily to respond. If your question is one that is relevant to the entire class, we may respond to the entire class rather than individually.

If you prefer to use the discussion forums:

If you have questions, please feel free to post them to the General Questions and Discussion forum in Canvas. While you are there, feel free to post your own responses if you, too, are able to help a classmate.

Sampling Distributions

Read It: Sampling Distributions

Recall the animation from Lesson 1, showing the process of collecting a sample from a population, then using that sample to make an inference about the population.

Cyclical diagram showing how data are collected from a population, summarized as a statistic, and used to make an inference about the population.

Illustrates the collection of a small sample of data from a much larger population. This course will primarily focus on the "Statistical Inference" process in this diagram: what do we learn about the whole population from just a small collection of data?

Eugene Morgan © Penn State is licensed under CC BY-NC-SA 4.0

Often, we will want to collect a number of samples in order to bolster the statistical power of our inference. In other words, we want to obtain a sampling distribution, which we can use to conduct statistical inference tests. This sampling distribution is made up of some statistic (e.g., the mean) of each individual sample. For example, say we roll two dice 10 times and record the average roll; this average is $\bar{x}$. Then we repeat the process with a new set of 10 rolls, and so on, until we have a large number of samples. These individual $\bar{x}$ values make up the sampling distribution, which is visualized in the figure below.

Sampling distribution of average dice rolls.

Credit: Eugene Morgan © Penn State is licensed under CC BY-NC-SA 4.0 [1]

In this figure, each point represents the statistic (e.g., $\bar{x}$, $\hat{p}$) from one sample. Note that for most parameters, the sampling distribution will be centered at the population parameter if the following conditions are met:

The samples are random
The sample size is large enough
There are enough samples

Definition: The sampling distribution is the distribution of sample statistics (e.g., $\bar{x}$, $\hat{p}$) calculated on different samples of the same size and from the same population.
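
To make the definition concrete, the short simulation below sketches how a sampling distribution of $\bar{x}$ could be built for the dice example. It is only an illustration under stated assumptions: the number of samples (1,000), the seed, and the variable names are chosen here and do not come from the lesson's notebook.

```python
import numpy as np

# Assumed settings for illustration only: 1,000 samples and a fixed seed.
rng = np.random.default_rng(seed=42)
n_samples = 1000   # how many samples make up the sampling distribution
n_rolls = 10       # rolls per sample, as in the dice example

# Each roll is the total of two fair six-sided dice.
# rng.integers(1, 7) draws integers 1..6 (the upper bound is exclusive).
rolls = rng.integers(1, 7, size=(n_samples, n_rolls, 2)).sum(axis=2)

# One x-bar per sample; together these values form the sampling distribution.
sample_means = rolls.mean(axis=1)

print(sample_means[:5])      # the first few sample means
print(sample_means.mean())   # should land close to the population mean of 7
```

Because the samples are random, reasonably sized, and numerous, the center of `sample_means` should sit close to 7, which matches the conditions listed above.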

Watch It: Video - Introduction to Sampling Distributions (10:33 minutes)

Click here for a transcript.

Hello and welcome to this new lesson on confidence intervals. Before we get into confidence intervals, we're going to talk about sampling distributions, because we ultimately compute the confidence interval from the sampling distribution. So we need to talk about how we actually get those sampling distributions. In this video series, we're going to be using some simulated dice rolls. You can imagine a situation in which you roll two dice, take the total, and record that total. We did this four times, with 10 rolls each time, so we've got four different samples of 10 data points. With that, let's go ahead and jump into the video.

So, we are here in Google Colab. We're going to be using two main libraries in this series of videos. The first is pandas and the second is plotnine. And then, as I mentioned, we're going to go over sampling distributions. I went ahead and preloaded this sampling data and, as I mentioned, we've got four samples here, with 10 dice rolls in each sample, and we recorded the total. What we're actually doing here with these curly brackets is creating a dictionary and then converting it into a data frame right away. If you recall, in Lesson 1 we talked about dictionaries and data frames and how we can convert between the two. Here we're just doing it all in one step. So we've got our data_df, for data frame. However, we're going to convert this into a long-form data frame, which will help with plotting down the line. So we're going to be using the melt command that we talked about in Lesson 3.

So, I'm just going to overwrite the data_df name and say data_df.melt. Then we need to give it our value_vars: which columns do we want to keep? In this case we want to keep sample1, sample2, and so on. Each of these samples is a column, so we want to keep them as they are. So those are going to be our value_vars (and I forgot an equal sign, right there). We also need to give it a var_name: what do we want the new column of labels to be called? We'll call it sample. And we give it a value_name: what do we want the values in them to be called? We'll call that outcome.

And so then I'll just print the data frame. We'll just show the first five rows using the .head command. We can see here we've got our sample and we've got our outcomes, and we can use this to create plots. As we talked about in Lesson 3, it's often a lot easier to plot using this long form, especially when we're working with ggplot. Our first visualization is to just look at how these different samples relate, so we're going to do a histogram with filled bars. We're going to have our basic histogram, and then we're going to add a variable using the fill argument. So I'll start with the ggplot command: we say ggplot(data_df).

And then we give our geom, so geom_histogram, and then we give our aes. This is the requirement: we need an x value, which will be our outcome, so our actual dice rolls. It's a histogram, so we don't need a y value because it's just going to give us the count. But we will add a fill based on sample within the aes statement, because we want to attach it to a variable. Then I'm going to specify that I want 12 bins (we want one bin per possible dice roll) and that I want each of those bins to have a width of one. If we run this, we can see our plot here. Each of our samples has been automatically put into a legend. We can see the distribution, and because we've used fill, it has stacked each of these samples on top of each other, so we can see the whole group of samples: all 40 data points. The most common value was seven across all the samples, and it does tend to follow a vaguely normal distribution. But we can see that, for example, sample 4 was heavily skewed towards the lower values. It looks like samples two and three might have more of the higher values, so we do see that there are differences between our samples. So this is just a good way to get an idea of what your samples look like.

But we still need to determine the sampling distribution, and we have two steps to do that. The first is to find the mean of each sample, and then we're going to plot it. The sampling distribution is the distribution of statistics, and in this case we're going to say x-bar is our statistic. So we need to first find the mean of each sample. We say data_means, and we're going to use a groupby function. First we group by sample. Then we take the mean of the outcomes within each group. And then we tag on a rename function: here we rename our outcome column to sample means, which is a little more descriptive because it's no longer the outcome, it's the mean of the outcomes. In theory you can do each of these steps individually, but one of the benefits of working with pandas is that you can chain these additional commands using the period, or dot, right here. So we can do .groupby, .mean, .rename, and it'll execute them all in order. If we print this, we can now see our sample means here: sample one with a mean of 6.1, sample two 6.3, then 7.5 and 5.7. So this is our sampling distribution: the distribution of the sample means. To visualize that, we'll often use what's called a dot plot, which functions very similarly to a histogram. We'll start with just plotting data_means, and now the geom is geom_dotplot. It still requires an aes.

Here we want our sample means. There's no other data that we're adding: we're not adding a fill, and there's no need to specify a y-axis. Normally we will specify the dot size, just so that the dots don't overwhelm the plot. Here we can see that it has plotted a single dot for each of our sample means, but it has stacked them into bins the way a histogram would. So you can use this to see the distribution of the data the way you would with a histogram, but you can also see where each individual point is located. If we wanted to, we could add additional variables to show the fill, or lines that make this a bit more descriptive. So with that, this is how we can create a sampling distribution from a small set of samples.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]
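
The data-wrangling steps narrated in the video above (melt to long form, then group, average, and rename) can be sketched roughly as follows. This is a reconstruction from the transcript rather than the Colab notebook itself; in particular, the column name `sample_means` (with an underscore) and the variable name `data_long` are assumptions made for this sketch.

```python
import pandas as pd

# The four samples of 10 dice totals used in this lesson.
data_df = pd.DataFrame({'sample1': [5, 7, 4, 6, 3, 8, 7, 7, 8, 6],
                        'sample2': [3, 11, 7, 9, 7, 2, 7, 5, 4, 8],
                        'sample3': [7, 8, 6, 10, 8, 4, 9, 3, 9, 11],
                        'sample4': [5, 6, 6, 6, 4, 7, 6, 7, 3, 7]})

# Wide -> long: one row per roll, with a column naming the sample it came from.
data_long = data_df.melt(value_vars=['sample1', 'sample2', 'sample3', 'sample4'],
                         var_name='sample', value_name='outcome')

# Chain groupby -> mean -> rename to get one mean per sample.
# 'sample_means' is an assumed column name for this sketch.
data_means = (data_long.groupby('sample')
                       .mean()
                       .rename(columns={'outcome': 'sample_means'}))

print(data_long.head())
print(data_means)
```

The four values in `data_means` are the (very small) sampling distribution that the video goes on to draw as a dot plot.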

Watch It: Video - Comparing Samples to Population (13:51 minutes)

Click here for a transcript.

Welcome back to the video series on sampling distributions. We're going to continue to work with our sampling distribution, and today we're going to get into how we can compare it to the true values. If you recall from the last video, we had the samples of dice rolls that we were able to visualize, and then we created a sampling distribution of the means of each sample. One of the reasons that we develop these sampling distributions is that we use them to test things. So we can ask: does the sampling distribution center around the true mean? One of the benefits of working with dice rolls as an example is that we know what the true mean is. It's seven. In theory, if you roll enough dice, you will always find that seven is the most common and central total. That's because of how the dice outcomes combine: you can only roll a one and a one to get a two, but you can roll a one and a two or a two and a one to get a three, so there are two ways to get it. And so on, up to seven, which has the most combinations.

So let's go ahead and jump right into the coding here. I've pre-coded this part with the theoretical outcomes. This is a slightly faster way to develop a large data frame, because we've pre-written all of these lists. If we run this and come down here, we can print out twos, for example, and it's just a single two. But if we look at sevens, it's a list of six sevens. So this is a quick way to develop the pieces of a data frame, and then we can connect them. We can create a new data frame called true_df. We use our pd.DataFrame command and again use curly brackets so that we can specify the outcome. The outcome is just twos plus threes plus fours plus fives and so on. This is why we created these variables above: we can quickly add them together to create a data frame without having to sit there and write out every single number that is included in our true data frame. We can come down here and print true_df, and we can see that we've got the outcome here, with all of our data in a single column.

And so then we can also plot those values. We're going to use a histogram, so we'll say ggplot(true_df). Then we say geom_histogram and give it our aes statement, so our x is just equal to outcome. We don't have any y and we're not doing any fill. We say bins equals 12 and binwidth equals one. Then, outside of the histogram, I'm going to add another line where we say scale_x_continuous, and we say breaks equals range from 2 to 13. This tells the plot where we want our specific axis breaks. The range starts at the first value, and you'll notice that it only goes to 12 even though we said 13. This is something that you have to do in Python whenever you use a range: the last value is not inclusive, so range from 2 to 13 excludes 13, meaning it really runs from 2 to n minus 1, and in this case n minus 1 is 12. So here we've got our data frame of true values, which is exactly what we expect. We have this nice bell curve. Seven up here is the maximum, and it falls off on either side. So this is our true population, and what we want to do is compare our samples to that theoretical, or true, population.

So we've got several steps here. The first thing we're going to do is create a new column called label, which specifies that this is the true outcome as opposed to a sample. To create a new column, we just add it here and tell it what goes into it, so it's just going to be true. If we print out the first five rows, we can now see that we've got this label and everything is set to true. The next thing that we need to do is reset the index of our data frame up here, so that we make sure the sample labels are in a column instead of in the index. This is just data_df.reset_index. Now we can see that what used to be an index is now a column. Then we need to rename, just so that it matches the true_df column: we want them both to be called label. We need to save it as something, so data_df equals data_df.rename, and we change sample to label, making sure that we match exactly what we wrote up here, down to the case. Just to show how that is changing, we can print out the first five rows, and now we can see that this is no longer called sample. It's called label. Finally, we can concatenate these two data frames into one while maintaining our long format. We'll call it df_concat. We use the pd.concat function, and we give it our first data frame, comma, our second data frame, in square brackets. If we print that, we can see that it starts off with all of our true data and ends with all of our sample data.

The last thing that we can do to compare the true population with our samples is create a histogram. Again using ggplot, we'll say df_concat is our data. Again, geom_histogram. We give it the aes statement, so we say x equals outcome. In this case, we're going to add a y, and we're going to use it to make the histogram show a statistic rather than a count. So we're going to say after_stat density. This tells it to use the density calculation as the y value instead of a count. Then we want to fill based on the label. Again, we want to specify 12 bins with a binwidth of one, and we're going to add an extra argument called position, which prevents the bars from stacking on top of each other and places them next to each other instead. Then we'll do our scale_x_continuous so that our range is as expected. We can run this, and now we can see the distribution of all of these values. Our true value is in purple here, and we can see it's got this nice bell curve, but we can see that sample 4, for example, had a lot of sixes and fewer values down here. We can see that samples one, two, and four have more sevens than the true value, and so forth. So we can use this to see how these distributions compare.

If we want the sampling distribution again, we need to get the distribution of the means. We'll call this df_means2. We say df_concat.groupby, and in this case we're going to group by label. We say mean, and then we also rename so that the columns are a bit more descriptive: outcome will become sample means. Now we can see each of our sample means and the true population mean down here at the bottom, and we can use that to compare. What we'll get into in the next series of videos is how we can use this data to develop confidence intervals and start to make some inferences based on our samples.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]
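
As a compact illustration of the comparison described in the video above, the sketch below builds the theoretical population of two-dice totals and stacks it with one sample. Note the swapped-in technique: the video assembles the population from pre-written lists (twos, threes, and so on), whereas this sketch enumerates all 36 dice pairs with a list comprehension. The names `totals`, `sample_df`, and `sample_means` are assumptions for the sketch.

```python
import pandas as pd

# Each of the 36 equally likely (die1, die2) pairs contributes one total,
# so this list is the theoretical population of two-dice outcomes.
totals = [d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7)]
true_df = pd.DataFrame({'outcome': totals})
true_df['label'] = 'true'   # mark these rows as the population

# One sample from the lesson, already in long form with a matching 'label' column.
sample_df = pd.DataFrame({'label': 'sample1',
                          'outcome': [5, 7, 4, 6, 3, 8, 7, 7, 8, 6]})

# Stack population and sample in long format, then compare means by label.
df_concat = pd.concat([true_df, sample_df], ignore_index=True)
df_means = (df_concat.groupby('label')
                     .mean()
                     .rename(columns={'outcome': 'sample_means'}))

print(true_df['outcome'].value_counts().sort_index())  # 7 appears six times
print(df_means)
```

The population mean comes out to exactly 7, while the single sample's mean differs from it, which is the gap that confidence intervals are meant to quantify.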

Try It: Apply Your Coding Skills in Google Colab

The Google Colab file used in the video is linked here. [2]
Go to the Colab file and click "File" then "Save a copy in Drive". This will create a new Colab file that you can edit in your own Google Drive account.
Once you have it saved in your Drive, use the partial code below to add two additional samples to the dataframe with data from 10 dice rolls each (use either real dice or a virtual roller [3]), then calculate the new sampling distribution.

Note: You must be logged into your PSU Google Workspace in order to access the file.

# Load pandas (needed for pd.DataFrame)
import pandas as pd

# Dice roll data, as data frame
# Replace each [...] placeholder with your own 10 dice-roll totals
data_df = pd.DataFrame({'sample1' : [5, 7, 4, 6, 3, 8, 7, 7, 8, 6],
                        'sample2' : [3, 11, 7, 9, 7, 2, 7, 5, 4, 8],
                        'sample3' : [7, 8, 6, 10, 8, 4, 9, 3, 9, 11],
                        'sample4' : [5, 6, 6, 6, 4, 7, 6, 7, 3, 7],
                        'sample5' : [...],
                        'sample6' : [...]})
 
# melting the data to get it into "long form"
data_df = data_df.melt(value_vars = [...], var_name = 'Sample', value_name = 'Outcome')
 
# Find the mean of each sample
data_means = ...
 
data_means

Assess It: Check Your Knowledge

Knowledge Check

FAQ

Standard Deviation and Standard Error

Read It: Standard Deviation and Standard Error

When working with sampling distributions, we distinguish the standard deviation from the standard error. Both measures are calculated the same way (i.e., with the usual standard deviation formula); the difference is the type of data they are applied to. The standard deviation is applied to the actual data (sample or population) and measures how much individual values vary around their mean. The standard error is applied to the sampling distribution and measures how much the sample statistic varies from sample to sample around the population parameter. The video below shows how to calculate each measure in Python.
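
As a small worked example of this distinction, the sketch below applies the same `.std()` call to two different things: the raw rolls in each sample (standard deviation) and the four sample means (an estimate of the standard error). It works on the wide-form data for brevity, which is an assumption of this sketch; the video uses a grouped long-form data frame instead.

```python
import pandas as pd

# The four samples of 10 dice totals used in this lesson (wide form).
data_df = pd.DataFrame({'sample1': [5, 7, 4, 6, 3, 8, 7, 7, 8, 6],
                        'sample2': [3, 11, 7, 9, 7, 2, 7, 5, 4, 8],
                        'sample3': [7, 8, 6, 10, 8, 4, 9, 3, 9, 11],
                        'sample4': [5, 6, 6, 6, 4, 7, 6, 7, 3, 7]})

# Standard deviation: spread of the raw outcomes within each sample (one per column).
sample_sd = data_df.std()

# Standard error: the same formula, applied to the sample means instead.
sample_means = data_df.mean()
standard_error = sample_means.std()

print(sample_sd)
print(standard_error)   # roughly 0.77 for these four means
```

With only four sample means, this standard error is a rough estimate; it becomes more stable as the sampling distribution grows.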

Watch It: Video - Standard Deviation and Standard Error (03:26 minutes)

Click here for a transcript.

Welcome back to the videos where we're discussing sampling distributions. In this video, we're going to talk about the difference between standard deviation and standard error. And so, this is a really important distinction because, while they sound very similar, they measure two different things. Our standard deviation is essentially going to measure the variability of the sample or the population. Whereas, the standard error is going to measure the variability of the sample means. And what makes it more confusing is that we use the same function to calculate both. It's just what we call it becomes different. So this is an important distinction to keep in mind, but we'll go ahead and walk through how we can calculate each of these in Python.

So we're going to start with the standard deviation of each sample and of the true population. We're going to use our df_concat data, and we're going to say groupby label. This is similar to what we did to come up with the sample means, but in this case we're going to use .std, the standard deviation function. Then we say rename and give new column names: outcome will become standard deviation. We can go ahead and print that. So this is our standard deviation: we've got one for each sample, and we've got the true standard deviation. Recall that we had our sample means from above. We called them data_means, and I'll just print them here for you.

And so, these were our sample means. If we want the standard error, it is the standard deviation of the sample means. So the standard error is data_means.std. It's the same calculation, but we can see here that this is our standard error, whereas these were our standard deviations. The only difference is what you're using the command on. If you want the standard error, it should be applied to a sampling distribution of some statistic. If you want the standard deviation, it should be applied to actual data, whether that's the sample that you collected or the population.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

Assess It: Check Your Knowledge

Knowledge Check

FAQ


Introduction to For Loops

Read It: For Loops

For loops can be used to perform calculations and store variables iteratively. In Python, every for loop has three required pieces of syntax:

for --- This keyword starts the for loop
in --- This keyword introduces the sequence to iterate over, which sets the number of iterations
: --- The colon says that everything that follows is meant to be executed in the loop (until the indentation changes)

Below, we demonstrate the basics of for loops.
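
A minimal sketch of these three pieces in action is shown here. The list name `squares` and the choice of three iterations are arbitrary and used only for illustration.

```python
# Loop over the integers 0, 1, 2 (range excludes the stop value, 3).
squares = []
for i in range(0, 3):
    print(i, i + 2)          # indented lines run once per iteration
    squares.append(i ** 2)   # store a result from each iteration

# Back at the original indentation: the loop has ended.
print(squares)
```

The indentation is what marks the body of the loop, so the final `print` runs only once, after all iterations are complete.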

Watch It: Video - Introduction to For Loops (07:15 minutes)

Click here for a transcript.

Hello and welcome back to the videos in this lesson. We're now going to move into bootstrapping and how we can use it to develop more samples, which we can then use to develop confidence intervals. In the previous lectures we had four samples, and while we can use those to develop confidence intervals, we really want a lot more data to ensure that our results are as statistically sound as they can be. That's where bootstrapping comes in. But before we get into bootstrapping, I want to briefly introduce for loops, because for loops are how we actually conduct the bootstrapping. So this video will just cover how we implement for loops in Python.

So we're back in Google Colab. We've got the libraries here that we're going to use throughout the bootstrapping lectures, but in this case we're simply going to talk about for loops. There are a few key words that we need to be aware of when working with for loops: the word for starts the loop, the word in tells Python what to iterate over and therefore how many iterations to complete, and the colon tells Python that everything that comes after it is included in the loop. To give a very simple example, we can say for i in range(0, 10), colon. Everything after this will be executed in the loop. The i here is a placeholder that counts up the iterations for us, from 0 to 9. Again, remember that Python does that weird thing where it excludes the last number, so we're really doing 0 to 10 minus 1. The key thing is to make sure that whatever variable you name in this opening statement of the loop is the one you use in the body of the loop.

So here we want to print i. If we were to print j, it wouldn't work; you can see that we've got this red line underneath. It's telling us that there is no variable j, and that's because we used the wrong counter. We need to use i, which is defined up here. We can also do operations, so we can print i plus 2. And then we can keep going, and as long as we maintain the same indent level, it will execute everything. When we're done with the loop, we can backspace. I like to write a comment to let me know that I'm done with the loop, and then we can continue on with whatever we were doing. If we run this, we can see that it prints i, which is zero, and then it prints i plus two. Then it goes back and prints i again, and now i is equal to one, followed by i plus two. Then i is equal to two, and so forth, until it ends and says this is the end of the loop. Once we get into bootstrapping we'll write much more intensive loops, but the basic structure will always remain the same.

The next style of loop that I want to demonstrate is a nested loop. This is where we add a for loop inside of a for loop. It starts off with the same style. We'll say range(0, 4) this time, which actually stops at three. And we'll say that we want to print i is, and then print i, so here we're mixing text and a value. Then, for each iteration of i, we also want to run through a series of values for j. We'll say range(0, 3).

And so now you can see that when I press Enter, I'm at another indent level. So this is our first loop and this is our second loop. We can print j equals, so we are printing what j is equal to. Now, to make sure that our loops are organized, we can go back one indent level and add a comment. If we want to do anything outside of the inner loop but still in the outer loop, we can add it there, so we can say this is the end of the inner loop. Then we backspace again, add a comment, and we can also print. If we run this, we can see that it starts where i equals zero, and then it gives j zero through two. It iterates through all of j during the first iteration of i, and then it does our print statement. Then it increases i by one and iterates through all of j again, and so forth until we reach the end. The purpose of these nested loops is for when you have multiple things that you're iterating over and you need to do them in sequence. We'll primarily use the single version of the loop throughout the bootstrapping lectures, but just know that you can also do a lot of what we're doing with nested loops as well.

Credit: Eugene Morgan & Renee Obringer © Penn State is licensed under CC BY-NC-SA 4.0 [1]
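
The nested-loop pattern described in the video above might look roughly like the sketch below. The loop bounds are shortened and the list `pairs` is added here purely so the order of iteration can be inspected; neither comes from the video's notebook.

```python
pairs = []
for i in range(0, 2):              # outer loop: i = 0, 1
    print('i is', i)
    for j in range(0, 3):          # inner loop: j = 0, 1, 2 for every value of i
        print('  j =', j)
        pairs.append((i, j))       # record the order in which (i, j) is visited
    # Same indent as the inner "for": the inner loop is done,
    # but we are still inside the outer loop.
    print('end of inner loop')

# No indent: both loops are done.
print(len(pairs))
```

The inner loop runs to completion for each value of `i`, so `pairs` lists every `j` for `i = 0` before moving on to `i = 1`.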

Try It: DataCamp - Apply Your Coding Skills

Try to implement a simple for loop below. You want to iterate the loop 5 times, each time printing the index number.

Assess It: Check Your Knowledge

Knowledge Check

FAQ

Bootstrapping

Read It: Bootstrapping

Recall from the Sampling Distributions section that a sampling distribution needs to be sufficiently large in order to be considered representative of the population. However, this is often difficult to achieve, as sampling distributions often contain upwards of 1000 samples. Remember how you rolled two dice 10 times, recording the total value each time? Imagine doing that 1000 times in order to collect a large enough sampling distribution! To avoid this arduous process, we use bootstrapping.

Bootstrapping Theory

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

Bootstrapping, as demonstrated above, allows us to approximate the sampling distribution through resampling the initial sample. In other words, we randomly select values from the original sample with replacement over many iterations to develop a bootstrapped sampling distribution. The process is outlined below.

Bootstrapped Sampling Distribution

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

When we implement bootstrapping in code, we will leverage for loops that follow this basic pseudo-code process:

Obtain a sample of size n
For i in 1, 2, ..., N
1. Randomly draw n values from the original sample, with replacement
2. Store these values in the ith row of the bootstrap sample
3. Calculate the statistic of interest for the ith bootstrap sample
4. Store this value as the ith bootstrap statistic
Combine all N bootstrap statistics into the bootstrap sampling distribution
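
The pseudo-code above can be sketched in a few lines of Python. This is a minimal illustration, not the lesson's notebook code: it uses NumPy's `default_rng` generator with an arbitrary seed, takes sample1 from this lesson as the original sample, and stores only the bootstrap statistics (the means) rather than every resampled value.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # assumed seed, for repeatability only

# Original sample of size n (sample1 from this lesson).
sample = np.array([5, 7, 4, 6, 3, 8, 7, 7, 8, 6])
n = len(sample)
N = 1000                              # number of bootstrap iterations

boot_means = np.empty(N)
for i in range(N):
    # Steps 1-2: draw n values from the original sample, with replacement.
    resample = rng.choice(sample, size=n, replace=True)
    # Steps 3-4: compute and store the statistic for this bootstrap sample.
    boot_means[i] = resample.mean()

# boot_means is the bootstrap sampling distribution of the mean.
print(boot_means.mean())   # close to the original sample mean of 6.1
print(boot_means.std())    # spread of the bootstrap statistics
```

Setting `replace=True` is the essential choice here; without it, every resample of size n would simply be a reshuffling of the original sample and every bootstrap mean would be identical.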

Watch It: Video - Introduction to Bootstrapping (11:29 minutes)

Click here for a transcript.

So now we're going to get into the actual bootstrapping. We're going to start with a very brief review of how we got the original four samples and that original sampling distribution in the previous videos. Then we're going to leverage a for loop to create a thousand samples that are equal in size to our original samples.

So here we are in Google Colab. We've got the libraries that we're going to be using in this script, and we've got the same code that we went over in the previous videos. We've got our four samples, 10 dice rolls each, and we found the sample mean of each one. This is a sampling distribution, but it's quite small, so it's hard to make any statistical inference from it. Generally, we want something that is sufficiently large. One way that we can improve our statistical standing is to get a bootstrapped distribution. To do this, we will essentially collect a thousand samples by resampling from our existing data, and we'll use that to ultimately make some inferences through confidence intervals. I've laid out our pseudocode here. We're going to start by initializing all of the variables, then we're going to create the loop. We're going to use a function called np.random.choice, and then we're going to move the counter forward and complete the loop.

So the first thing that we do is initialize the variables. We can say step one. There are a few key variables that we have to initialize. The first is capital N, which is the number of bootstrap iterations. Then we need lowercase n, which is our sample size. In the previous lectures we had four samples, but we had 10 rolls in each sample, so we'll say little n equals 10. Then we're going to create an empty data frame, which we call boot_df. For this data frame, we're again going to use the dictionary-within-a-data-frame approach that we've used before. We're going to initialize one column called sample. For now it's going to be empty, so we use np.nan times capital N times little n. This will give us N times n, so 10,000, rows of not-a-number. Then we want another column called outcome, in which we do the same thing, so we fill it with N times n not-a-number values. Finally we will have a type column and do the same. If we print the first five rows to see what that looks like, we can see that we've got sample, outcome, and type, all empty.

So then we need to get into the actual loop, which is step two, so I'll add a new code cell for step two. It follows the basic syntax that we went over in the previous video. We say for i in range(N). Unlike the previous video, I'm just going to give it a single value, and it will assume that 0 is our first number and that the value we give is the end. So we'll do a thousand iterations. That has started the loop. Now our next step, step three, is to extract a random sample. To do this, we first need to develop some counters within our for loop. We're going to have one counter called low_i, which is just little n times i. And then we're going to have high_i, which is low_i plus n minus 1. These allow us to move through our boot_df data frame. The first iteration of i will take care of rows 0 to 9, the second will take care of rows 10 to 19, and so forth. So these are our indices, and then we can start to actually fill in this empty data frame. We can say boot_df.loc, which allows us to locate rows and a column by label, so we can say from low_i to high_i. These are our indices, and we've got a colon there to say from this to that. For the outcome column, we say np.random.choice. We give it our original data frame, which we called data_df, so I'll just copy and paste that. The first argument of np.random.choice tells it what we're actually drawing from: we're drawing from the outcome column of data_df. The second argument tells it what size we're drawing. We're going to go with little n, which is our sample size.

And then the third option, we say replace equals true. And this is a key option when you're doing bootstrapping. You want to make sure you're always replacing it. This means that there might be a case where you have a bunch of the same numbers that have been drawn in a row that are all the same, but that is the purpose of doing this bootstrapping. We don't want to end up with only one option to draw from because we haven't replaced the marbles in the bag, for example. We want to draw a marble, record the color, and replace it. So this is a really critical component of the bootstrapping code. So that actually extracts the random sample. The next case was moving the counter. And so in this case we're going to say boot df dot loc. Again from low I to high I. but in this case we want to do sample, and we're going to say boot and plus SDR I plus one. And so what this will do is it will essentially create a label for us that tells us which bootstrap iteration we're on. And then we have step five, which is tracking the type of sample.

And so again from low I to high I. In this case we want the type column, and we're just going to call it bootstrap. And for now this will all be all bootstrap. But if we were also tracking our truth, for example, we could change that type so that we can identify it later. And then step six is to end the loop. So I've backspaced so that I'm now realigned with the very edge of the pane. And then I will print the first five rows of bootstrap DF, so that we can see what it looks like. And so there is an error. It looks like I've got the wrong name. Yes, so we need this to be a little outcome. So one of those things that python is very particular with the cases that you have in the capitalization and where it occurs. So pretty common error there. But now we can see we've run through it, we've got our sample, we've got the random draw, and then we've got the type here. So now we've got the bootstrap samples but in order to develop the bootstrapping sampling distribution, we need to make sure that we do the mean of this bootstrap sample. So we'll get into that in the next video.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]
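
The loop described in the video can be sketched end to end as follows. This is a minimal illustration rather than the course's reference solution: the contents of data_df are simulated here (10 sums of two dice, an assumption chosen so that the true mean is 7), the random seed is arbitrary, and the empty columns are pre-allocated with explicit dtypes so that text labels are not written into numeric columns.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for the original sample: 10 rolls of two dice (sums 2-12).
data_df = pd.DataFrame(
    {'outcome': np.random.randint(1, 7, size=10) + np.random.randint(1, 7, size=10)})

N = 1000  # number of bootstrap iterations
n = 10    # size of each bootstrap sample

# Pre-allocate N*n empty rows: text columns as object, outcome as float NaN.
boot_df = pd.DataFrame({'sample': pd.Series([None] * (N * n), dtype='object'),
                        'outcome': np.full(N * n, np.nan),
                        'type': pd.Series([None] * (N * n), dtype='object')})

for i in range(N):
    low_i = n * i             # first row of this iteration's block
    high_i = low_i + (n - 1)  # last row; .loc slices include both endpoints
    # Draw n values WITH replacement from the original outcomes.
    boot_df.loc[low_i:high_i, 'outcome'] = np.random.choice(
        data_df['outcome'].to_numpy(), size=n, replace=True)
    boot_df.loc[low_i:high_i, 'sample'] = 'boot' + str(i + 1)
    boot_df.loc[low_i:high_i, 'type'] = 'bootstrap'

print(boot_df.head())
```

Because .loc slicing is inclusive at both ends, low_i to high_i covers exactly n rows: rows 0 to 9 for the first iteration, rows 10 to 19 for the second, and so on.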

Try It: Apply Your Coding Skills in Google Colab

Click the Google Colab file used in the video here. [4]
Go to the Colab file and click "File" then "Save a copy in Drive", this will create a new Colab file that you can edit in your own Google Drive account.
Once you have it saved in your Drive, use the partial code below to implement a bootstrapping procedure.

Note: You must be logged into your PSU Google Workspace in order to access the file.

# STEP 1: initialize variables
N = ...  # insert a sufficiently large iteration value
n = 10   # sample size = 10
 
# empty dataframe:
boot_df = pd.DataFrame({'sample': [np.nan] * (N*n),
                        'outcome': [np.nan] * (N*n),
                        'type': [np.nan] * (N*n)})
 
# implement the bootstrapping procedure:
for i in range(...):
  low_i = n*i
  high_i = low_i + (n-1)
  boot_df.loc[low_i:high_i, 'outcome'] = np.random.choice(..., ..., ...)
  boot_df.loc[low_i:high_i, 'sample'] = 'boot' + str(i+1)
  boot_df.loc[low_i:high_i, 'type'] = 'bootstrap'

Assess It: Check Your Knowledge

Knowledge Check

FAQ

Bootstrapped Sampling Distributions

Read It: Bootstrapped Sampling Distributions

Once we have the bootstrapped samples, we need to calculate the sampling distribution, which contains the statistic of each sample. Below we demonstrate this process.

Watch It: Video - Bootstrapped Sampling Distributions (07:52 minutes)

Click here for a transcript.

Hello and welcome back. This video picks up right at the end of the previous video, where we went through and developed this loop to create a thousand samples using bootstrap. But we still haven't created the sampling distribution, because that's the distribution of sample statistics, and in this case we'll use x bar. So in order to do this, we start off by finding the mean of each bootstrapped sample. So I'll just call this data frame boot_means, and we say boot_df.groupby. We want to group by sample, which is what we called each of our bootstraps up here. We want to take the mean, and we want to tack on a rename function so that it's a bit more descriptive, because it's technically no longer the outcome. It's now the bootstrap sample means. So we can print the first five rows of that. And so now we can see we've got different samples, and we've got different means of those. So this is now our sampling distribution, but it's usually a little bit better to plot these things. And so we can go in here and use ggplot with boot_means. And so we can do a dot plot, which we introduced in the previous series, where our x is just the bootstrap sample means, and for our dotsize we'll do 0.25.

And so we can see that this looks pretty good. So compared to what we previously had, because we have a thousand iterations, we're now starting to see this normal distribution that we expect from rolling dice. But we want to continue to build off of this plot. And so I want to show you how to add the original sample means from those original four samples to this dot plot. In order to do that, we need to reset the index somehow, and so previously we've done .reset_index(), but another way you can do it is to say .index. So you can just create a new column that is equal to the index here. And so now we can see that the index stayed the same, but we've got this extra column called sample that now matches what the index was.

And so then we can build up our previous plot. I'm actually going to come up here and just copy this. So we'll run it, and so this is our basic plot. If we wanted to add in our original sample means, we can also add a second geom_dotplot, where x equals sample mean, which is what we called it up here, and we want to fill based off of sample, so that we can see the different colors. And then we have to do an additional thing here. We need to specify the data, and that's because this data is different than the data we specified up here. So this data set will be carried throughout the entire plotting function unless we specify a new one in a specific geom. So because we're working with data_means now, we need to add it this way. And then we can set dotsize. We'll make those a little bit bigger. So we can run this. And so we can see these giant dots, so maybe that's a little too big. We'll do 0.25 there. And so now we can see our bootstrap in the back with our black dots. We can see our original sample means here. So you can see how the distribution has changed just by increasing the number of samples. And then the last thing that we'll normally do is plot where the actual truth is. So here I'm using a new geom called geom_vline, which will just add a vertical line. And within the aes, we just say xintercept equals seven. It's just the true value. And then outside of the aes, we can just set the color equal to blue, and we can set the size equal to one.

And so now we can see the completed plot. We've got our bootstrap sampling distribution in the back, our original sample means here, and where the true value is. So we can also use this to see that, while the bootstrap does have this nice normal shape, it is shifted towards the lower half of our rolled values. It's not directly centered around our true value, seven. And that's because the bulk of our data, as we can see from our original samples, was way down here. And so when we're pulling the bootstrap, we're more likely to get those lower values than to have some of the values up here. And so that's a really critical point to remember: yes, you can use bootstrapping to generate a number of samples, but it can't give you anything that you haven't given it. So if you give it biased data, your bootstrap sample is always going to be biased, even if you increase this to ten thousand or to a million samples. It all depends on what you give it. And so this is how we can do bootstrapping to develop a sampling distribution. Later we can talk about how to actually take this bootstrapped sampling distribution and develop a confidence interval around it.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]
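
As a rough illustration of the two points made in the video (one mean per bootstrap sample, and a bootstrap that inherits any bias in the original data), the sketch below builds a sampling distribution with groupby. The original sample is hypothetical, chosen so that its mean (6.3) sits below the true value of 7; the column name bootstrap_sample_means and the vectorized resampling are assumptions made for brevity, and plotting is omitted.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed so the sketch is reproducible

# Hypothetical original sample whose mean (6.3) sits below the true value of 7.
original = np.array([4, 5, 5, 6, 6, 6, 7, 7, 8, 9])
N, n = 1000, len(original)

# Each row of `draws` is one bootstrap sample drawn with replacement.
draws = np.random.choice(original, size=(N, n), replace=True)
boot_df = pd.DataFrame({
    'sample': np.repeat(['boot' + str(i + 1) for i in range(N)], n),
    'outcome': draws.ravel(),
})

# One mean per bootstrap sample = the bootstrapped sampling distribution.
boot_means = (boot_df.groupby('sample')
                     .mean(numeric_only=True)
                     .rename(columns={'outcome': 'bootstrap_sample_means'}))

print(boot_means.head())
print('original sample mean:', original.mean())
print('mean of bootstrap means:', round(boot_means['bootstrap_sample_means'].mean(), 3))
```

The mean of the bootstrap means lands close to the original sample mean, not the true value, no matter how large N is made.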

Try It: Apply Your Coding Skills in Google Colab

Click the Google Colab file used in the video here [5].
Go to the Colab file and click "File" then "Save a copy in Drive", this will create a new Colab file that you can edit in your own Google Drive account.
Once you have it saved in your Drive, use the partial code below to calculate and plot the bootstrapped sampling distribution. Alternatively, you can continue to edit your Colab file from the previous module.

Note: You must be logged into your PSU Google Workspace in order to access the file.

# Find the mean of each sample
boot_means = ...
 
# plot
(ggplot(...) + geom_dotplot(...))

Assess It: Check Your Knowledge

Knowledge Check

FAQ

Introduction to Confidence Intervals

Read It: Confidence Intervals

When we start making statistical inferences, we need a way to represent the uncertainty in the results, or the confidence that we, as the analysts, have in those results. For this, we turn to confidence intervals.

Definition: A confidence interval is an interval, computed from a sample, that has a predetermined chance of capturing the value of the population parameter.

There are some key aspects here that we can further define. The first is that confidence intervals are computed from a sample. Therefore, we are not computing confidence intervals on the population, but using the sample or sampling distribution. The next aspect is the emphasis on predetermined chance. This refers to the confidence level that you are using in your analysis. Essentially, this level specifies how confident you are that the true population parameter falls within a certain range. Often, you will see the 95% confidence interval, which is the most common interval to use. An example of this 95% confidence interval is shown below.

an example of a confidence interval graph; the mean is 0 for all samples; and the confidence interval for each sample is 95%

The 95% confidence interval for several sample means. The true mean is shown as a red bar.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]
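
One way to make the "predetermined chance" concrete is a small simulation in the spirit of the figure: draw many samples from a population whose mean is known, build an interval from each, and count how often the interval captures the true mean. The sketch below is an illustration under stated assumptions, not the code behind the figure: the population is simulated as normal with mean 0, and each interval uses the sample mean plus or minus 2 standard errors, with the standard error estimated as the sample standard deviation divided by the square root of n.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed so the sketch is reproducible

true_mean = 0.0        # population parameter (known only because we simulate it)
n, trials = 50, 2000   # sample size and number of repeated samples

captured = 0
for _ in range(trials):
    sample = rng.normal(loc=true_mean, scale=1.0, size=n)
    x_bar = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)   # estimated standard error of the mean
    lower, upper = x_bar - 2 * se, x_bar + 2 * se
    captured += int(lower <= true_mean <= upper)

coverage = captured / trials
print('fraction of intervals capturing the true mean:', coverage)
```

The printed fraction should come out near 0.95: most intervals capture the true mean, and a few miss it.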

In the upcoming videos, we will demonstrate two methods for using bootstrapped sampling distributions to determine the 95% confidence interval.

Assess It: Check Your Knowledge

Knowledge Check

FAQ

Confidence Intervals: The Percentile Method

Read It: Confidence Intervals: The Percentile Method

For some confidence intervals, there is no easy standard error equation. In these situations, we can use the percentile method to estimate the confidence interval. In this method, we assume that the data is normally distributed so that the confidence interval can be represented by the percentage of data outside the interval. For example, with the 95% confidence interval, we can use the percentile method to say that the lower bound is at the 2.5% mark and the upper bound is at the 97.5% mark, with 95% of the data in between. To calculate this, we follow these steps:

100 - 95 = 5
5/2 = 2.5
Lower Bound: 0 + 2.5 = 2.5
Upper Bound: 100 - 2.5 = 97.5
Thus, the CI is: [quantile(0.025), quantile(0.975)]
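
The arithmetic above generalizes to any confidence level. The helper below is a hypothetical convenience function (its name and signature are not from the course materials) that turns a confidence level in percent into the pair of quantiles, in decimal form, that a quantile function expects.

```python
def percentile_bounds(confidence_level):
    """Return (lower, upper) quantiles as fractions for a confidence level in percent."""
    if not 0 < confidence_level < 100:
        raise ValueError('confidence_level must be between 0 and 100')
    tail = (100 - confidence_level) / 2   # percent left in each tail
    return tail / 100, (100 - tail) / 100

# The 95% case from the steps above, plus two less common levels.
for level in (95, 80, 75):
    lower, upper = percentile_bounds(level)
    print(level, lower, upper)
```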

This is demonstrated below.

Watch It: Video - Introduction to Sampling Distributions (07:11 minutes)

Click here for a transcript.

So far in this lesson, we have been working with the standard error method for determining confidence intervals. And this is the method where we come through, find the mean and the standard error, and then do a calculation. However, we really only have those values here, one, two, and three, for certain specific intervals. Occasionally you might want to do an 80 percent confidence interval or maybe a 75 percent confidence interval, and there are no easy calculations in terms of the standard error in that sense. That's where this percentile method comes in.

And so with this method, we make the assumption that the data is perfectly normal. And so we can just take the confidence interval that we're interested in, and just assume that it's centered on the median, and it's just plus or minus however much percentage on either side. And so with the 95 percent confidence interval, to calculate this we do 100 minus 95 equals 5, divided by 2 is 2.5. And so our confidence interval is 2.5 on the lower end and 100 minus 2.5, or 97.5, on the upper end. And we're going to use a function called quantile in pandas, but you can also use a function called np.percentile from NumPy, if you would like.

So there are three steps to this. The first is to get the lower percentile, LP. And again we're working with the bootstraps, so you still need to do the bootstrap sampling in order to get to this point. But instead of calculating the mean, the x bar, and standard error, we say .quantile. And we give it our quantile in decimal form. So 2.5 percent is 0.025. And then we can just print the lower bound rounded. So 5.1. Then we calculate the upper bound, the upper percentile, UP. And again it functions very similarly, so we say bootstrap sample means .quantile and, in this case, we do 0.975 to get the upper bound. And so then we can print this and then, technically, this is our confidence interval. So we can then say the 95 percent confidence interval is, and then we can give it LP, with just a space in there to separate LP from UP. And so here we have our confidence interval. If we want to plot this, however, we also need to get the median.

And so unlike the standard error method, where we center around x bar, in this case we center around the median, or the 50th percentile. And so, we could use the median command to get this, but it's the same as the 0.5 quantile. And so we have 6.4 as our median that this is centered around. And so, if we wanted to continue to build up our plot here, I'm just going to copy this and paste it down here. And so this plot contains two confidence intervals right now. The red confidence interval is our bootstrap standard error. The green is our data standard error. And now we're going to add a third one, which is our percentile method. So again geom_errorbarh, aes. In this case, we will set y to 0.75. xmin is just going to be LP, xmax is UP, and then we'll make it purple, and we'll give it a height of 0.25. And then we'll add our point. Within the aes, y equals 0.75, and x equals the median this time. The color is purple, and the size is three.

So we can run this and we can see our final completed plots. We can see the difference between the three confidence intervals. And in this case, because our data is actually fairly normally distributed, we can see that the percentile method does match fairly well with the standard error method. Sometimes you'll see that the percentile method is much larger or much smaller simply because of the assumptions that we make when we use the percentile method. So that concludes our bootstrapping and confidence interval lectures. Now you can use all of these skills in the lesson below.
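
The percentile calculation from the video can be sketched as below. The original sample and seed are hypothetical, the sampling distribution is built with a vectorized shortcut rather than the loop from earlier in the lesson, and the numbers will not match the 5.1 and 6.4 quoted in the transcript.

```python
import numpy as np
import pandas as pd

np.random.seed(3)  # arbitrary seed so the sketch is reproducible

# Hypothetical original sample; any 1-D numeric data would do.
original = np.array([4, 5, 5, 6, 6, 6, 7, 7, 8, 9])
N = 1000

# Bootstrapped sampling distribution of the mean (one mean per resample).
draws = np.random.choice(original, size=(N, len(original)), replace=True)
boot_means = pd.DataFrame({'bootstrap_sample_means': draws.mean(axis=1)})

LP = boot_means['bootstrap_sample_means'].quantile(0.025)     # lower bound
UP = boot_means['bootstrap_sample_means'].quantile(0.975)     # upper bound
median = boot_means['bootstrap_sample_means'].quantile(0.5)   # center used when plotting

# np.percentile takes percentages (0-100) rather than fractions (0-1).
LP_np, UP_np = np.percentile(boot_means['bootstrap_sample_means'], [2.5, 97.5])

print('the 95% confidence interval is', round(LP, 2), round(UP, 2))
print('median:', round(median, 2))
```

With their default settings, quantile and np.percentile use the same linear interpolation, so LP and UP agree with LP_np and UP_np.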

Try It: Apply Your Coding Skills in Google Colab

Click the Google Colab file used in the video here. [6]
Go to the Colab file and click "File" then "Save a copy in Drive", this will create a new Colab file that you can edit in your own Google Drive account.
Once you have it saved in your Drive, use the partial code below to calculate the 95% confidence interval using the percentile method.

Note: You must be logged into your PSU Google Workspace in order to access the file.

# step 1: calculate lower percentile (this is the lower bound of the CI)
LP = ...
 
# step 2: calculate the upper percentile (this is the upper bound of the CI)
UP = ...
 
print('the 95% confidence interval is ', ...)

Assess It: Check Your Knowledge

Knowledge Check

FAQ

Confidence Intervals: The Standard Error Method

Read It: Confidence Intervals: The Standard Error Method

To determine the 95% confidence interval through the standard error method, we use the following equation:

$\bar{x} \pm 2 \times SE$

This equation centers the confidence interval around the sampling distribution mean ($\bar{x}$), as shown in the figure below.

Confidence interval around the sampling distribution mean ($\bar{x}$)

In order to calculate this, we take three key steps in the code, following the development of the bootstrapped sampling distribution.

Calculate the mean of the bootstrapped sampling distribution using mean()
Calculate the standard error of the bootstrapped sampling distribution using std()
Calculate the 95% confidence interval using the equation above
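
The three steps can be sketched as follows. The original sample and seed are hypothetical, the sampling distribution is generated with a vectorized shortcut, and the variable names XB, SE, and CI follow the exercise scaffold used in this lesson.

```python
import numpy as np
import pandas as pd

np.random.seed(7)  # arbitrary seed so the sketch is reproducible

# Hypothetical original sample and its bootstrapped sampling distribution.
original = np.array([4, 5, 5, 6, 6, 6, 7, 7, 8, 9])
draws = np.random.choice(original, size=(1000, len(original)), replace=True)
boot_means = pd.DataFrame({'bootstrap_sample_means': draws.mean(axis=1)})

XB = boot_means['bootstrap_sample_means'].mean()   # step 1: x bar
SE = boot_means['bootstrap_sample_means'].std()    # step 2: standard error
CI = [XB - 2 * SE, XB + 2 * SE]                    # step 3: x bar +/- 2 * SE

# A list comprehension rounds each bound for printing.
print('the 95% confidence interval is', [round(y, 3) for y in CI])
```

Here std() is applied to the bootstrap sample means, not to the raw outcomes; the spread of the sampling distribution is what serves as the standard error.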

Below we demonstrate this process.

Watch It: Video - Introduction to Sampling Distributions (07:12 minutes)

Click here for a transcript.

Welcome back to our series on bootstrapping. We left off the previous video after having made this plot where we show the bootstrapping distribution compared to the actual data in the original sample means. What we primarily use bootstrapping for is to do statistical inference. We want a larger sample size so that we can feel more confident in our final statistical inference. And one of those things is performing confidence intervals. And so a confidence interval is generally going to be two numbers. We've got the mean plus or minus something times the standard error. And so depending on the confidence level, you can have different calculations. We're going to focus on the 95 percent confidence interval. As this is by far the most common confidence level that you will see. In practice as well as in this course.

And so there are three steps to calculating the confidence interval with the bootstrap samples. The first is to actually calculate x bar, so what the average of the sample means is. And so it is critical to note that everything that we're doing with confidence intervals is on the bootstrapped sampling distribution. So we're using boot_means, which we calculated up here, to show right here. And so essentially we want boot_means, bootstrap sample means, .mean(). So the average of the sampling distribution. And so then we can print, x bar is. So this is our mean that goes into these equations. The next step is to calculate the standard error. And recall from our previous video we are working with the standard error because we're working with a sampling distribution. So we say SE is boot_means, bootstrap sample means, .std(). Then we can also print that. We can say SE is.

And so now we have the standard error. And so now we're ready; those are the only two things we need in order to calculate the confidence interval with the bootstrap distribution. And so we can create a list using square brackets, where the first value is XB minus 2 times SE, and the second value is XB plus 2 times SE. And then we can print that, and we could just print CI, similar to what we've been doing up here, but to show you a slightly different way to print things, still using the print function, instead of just typing out the variable name here, I'm going to use a mini function. So I'm going to say round y to the third decimal place for all y in CI. And so, similar to our for loops, this y is just a placeholder. It just needs to be this to match that, and then it'll apply the rounding to each of our values in CI. So if we run that, we can see the 95 percent confidence interval is 5.058, 7.75.

And so that is all you need to do in order to get the confidence interval. But if we wanted to visualize it, I'm actually going to come up here and take this plot from our previous video and add it here for reference. We want to add the 95 percent confidence interval to this. So we have a new geom. This is called geom_errorbarh, h for horizontal, with aes. We give it a y value. And this is just going to be the position on the y-axis based off of what numbers we see here. So I'm just going to do 0.5 to put it halfway, and then we need to give it the xmin of our error bar. So this is the first value of CI, at the zero index. And then we do the same for xmax, here CI at the one index, and outside the aes we can give it a color and say red. And then we're going to also add a point, a single point here. We're going to say same y, but our x is going to actually be x bar, so that we show where the mean of the sampling distribution is in reference to our confidence interval. Again, outside the aes we say color is red, and we'll give it a sufficiently large size so that it stands out. So if we run this, we can see that now we've added this confidence interval to our plot, and we've got our x bar here in the middle, which lines up with the middle of our normal distribution. And so with our confidence interval, essentially we're saying that we're 95 percent confident that the true mean lies within this interval. So this is how you can add the confidence interval to the plot after calculating it.

Watch It: Video - Confidence Intervals SEmethod (04:58 minutes)

Click here for a transcript.

Welcome back to another video on confidence intervals and bootstrapping. Where we left off, we developed this plot to show our bootstrap or our confidence interval using the bootstrap sampling distribution. We're going to continue to build up that plot, but we're going to use the original samples. And so the process of calculating the confidence intervals is still the same.

I'm going to create XBD, or x bar data, and we're going to use data_means, which, if we scroll back up here, has our original four samples and those sample means. And so, we want the mean of those four sample means, and I won't do any fancy printing here. I'll just show you what the value is. I had a typo again with the capitalization. You have to make sure it's exact. But we can see this is what our data x bar is. Similarly, we can calculate the standard error with data_means.

And so here we can see. Oh, I forgot my parentheses there. And so we can see a single value here for our data standard error. And then we can calculate the confidence interval, CID. So here we would say XBD minus 2 times SED, and then XBD plus 2 times SED. And so here we have our basic confidence interval for the data. And then we can build up this plot. So once again I'm going to copy this. I'm going to bring it down here and run it again to show you what we had before. And so now we want to add in the data confidence interval. So again we have this geom_errorbarh, aes. We'll put this down a little lower. So we'll put it at 0.25. We still need to say xmin is CID at the zero index and xmax is CID at the one index. And then outside the aes we'll say color is, we'll make it green. And I'm going to add in an extra term called height, and this tells it how high we actually want the error bar to be. So it defaults to 0.5, so this is going from 0.25 to 0.75. I'll shorten this one so that it's 0.25. And then we also add the point at the middle, making sure that our y matches up, so 0.25, and our x is at XBD. Our color is the same green, and we'll make it size three.

So if we run this, we can now see that we've got the confidence interval based off of the original four samples. It's much wider. That means that you know there was more variation, which is creating a wider confidence interval even though the means are fairly close to each other. So the benefit of doing that bootstrapping method is that because we have more samples, we're able to narrow our confidence interval a little bit, and therefore have a little bit more confidence in our statistical inference.

Try It: Apply Your Coding Skills in Google Colab

Click the Google Colab file used in the video here. [7]
Go to the Colab file and click "File" then "Save a copy in Drive", this will create a new Colab file that you can edit in your own Google Drive account.
Once you have it saved in your Drive, use the partial code below to calculate and plot the 95% confidence interval.

Note: You must be logged into your PSU Google Workspace in order to access the file.

# step 1: calculate x bar
XB = ...
 
# step 2: calculate the standard error
SE = ...
 
# step 3: calculate the 95% Confidence Interval
CI = ...
 
(ggplot(boot_means) +
    geom_dotplot(...) +
    geom_vline(aes(xintercept = 7), color = 'blue', size = 1) +
    geom_errorbarh(...) +
    geom_point(...)
)

Assess It: Check Your Knowledge

Knowledge Check

FAQ

Summary and Final Tasks

Summary

Through this lesson, you are now able to define a confidence interval and calculate the interval using bootstrapped sampling distributions. You also learned how to create for loops, which we will continue to use throughout the course.

Assess It: Check Your Knowledge Quiz

Reminder - Complete all of the Lesson 4 tasks!

You have reached the end of Lesson 4! Double-check the to-do list on the Lesson 4 Overview page to make sure you have completed all of the activities listed there before you begin Lesson 5.
