EME 210
Data Analytics for Energy Systems

Bootstrapped Sampling Distributions


  Read It: Bootstrapped Sampling Distributions

Once we have the bootstrapped samples, we need to calculate the sampling distribution, which contains the statistic of each sample. Below we demonstrate this process.

  Watch It: Video - Bootstrapped Sampling Distributions (07:52 minutes)


Hello and welcome back. This video picks up right at the end of the previous video, where we developed a loop to create a thousand samples using the bootstrap. But we still haven't created the sampling distribution, because that is the distribution of a statistic across samples, and in this case we'll use x-bar, the sample mean. To do this, we start by finding the mean of each bootstrapped sample. I'll call this data frame boot_means, and we say boot_df.groupby. We want to group by sample, which is what we called each of our bootstraps up here. We want to take the mean, and we want to tack on a rename function so that the column name is a bit more descriptive, because it's technically no longer the outcome; it's now the bootstrap sample means. We can print the first five rows of that, and now we can see we've got different samples and the mean of each one. So this is now our sampling distribution, but it's usually better to plot these things. We can go in here and use ggplot with boot_means and make a dot plot, which we introduced in the previous series, where our x is just the bootstrap sample means, and for dotsize we'll use 0.25.
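The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the column names `sample` and `outcome`, the new name `bootstrap_sample_means`, and the simulated dice data are all assumptions made so the block runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the original data: 10 rolls of two dice (sums).
original = rng.integers(1, 7, size=10) + rng.integers(1, 7, size=10)

# Rebuild a long-format bootstrap frame: 1000 resamples drawn with replacement.
n_boot = 1000
boot_df = pd.DataFrame({
    "sample": np.repeat(np.arange(n_boot), original.size),
    "outcome": rng.choice(original, size=n_boot * original.size, replace=True),
})

# One mean per bootstrap sample, renamed so the column describes what it holds.
boot_means = (
    boot_df.groupby("sample")
    .mean()
    .rename(columns={"outcome": "bootstrap_sample_means"})
)

print(boot_means.head())  # first five rows of the sampling distribution
```

Each row of `boot_means` is one bootstrap sample, so the single column as a whole is the bootstrapped sampling distribution of the mean.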

And so we can see that this looks pretty good. Compared to what we had previously, because we now have a thousand iterations, we're starting to see the normal distribution that we expect from rolling dice. But we want to continue to build on this plot, so I want to show you how to add the original sample means from those original four samples to this dot plot. To do that, we need to get the index into a column somehow. Previously we've done .reset_index, but another way you can do it is with .index: you just create a new column that is equal to the index. And so now we can see that the index stayed the same, but we've got this extra column called sample that matches what the index was.
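The two ways of turning the index into a column can be compared in a small sketch. The frame name `sample_means`, the column `sample_mean`, and the four values are hypothetical placeholders, not the results from the video.

```python
import pandas as pd

# Hypothetical means of four original samples, indexed by sample number.
sample_means = pd.DataFrame(
    {"sample_mean": [5.8, 6.4, 6.1, 7.2]},  # illustrative values only
    index=pd.Index([1, 2, 3, 4], name="sample"),
)

# Option 1: reset_index() moves the index into a column and
# replaces the index with a fresh 0..n-1 range.
via_reset = sample_means.reset_index()

# Option 2: assign the index to a new column; the index itself is unchanged.
sample_means["sample"] = sample_means.index

print(via_reset)
print(sample_means)
```

Either form gives the plotting code a regular `sample` column to map to an aesthetic such as fill.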

And so then we can build up our previous plot. I'm actually going to come up here and just copy this. We'll run it, and this is our basic plot. If we want to add in our original sample means, we can add a second geom_dotplot, where x equals sample mean, which is what we called it up here, and we want to fill based on sample so that we can see the different colors. Then we have to do one additional thing: we need to specify the data, because this data is different from the data we specified up here. The data set given to ggplot is carried throughout the entire plotting call unless we specify a new one in a specific layer. Because this layer uses the original sample means, we need to add the data argument this way. And then we can set dotsize; we'll make those a little bit bigger. We can run this, and we see these giant dots, so maybe that's a little too big. We'll do 0.25 there. And so now we can see our bootstrap in the back with the black dots, and we can see our original sample means here. So you can see how the distribution has changed just by increasing the number of samples. The last thing that we'll normally do is plot where the actual truth is. Here I'm using a new geom called geom_vline, which will just add a vertical line. Within the aes, we just say xintercept equals seven, which is the true value. And then outside of the aes, we can set the color equal to blue and the size equal to one.

And so now we can see the completed plot. We've got our bootstrap sampling distribution in the back, our original sample means here, and where the true value is. We can also use this to see that, while the bootstrap does have this nice normal shape, it is shifted toward the lower half of our rolled values. It's not centered on our true value, seven. That's because the bulk of our data, as we can see from our original samples, was way down here. So when we're pulling the bootstrap, we're more likely to get those lower values than some of the values up here. And that's a really critical point to remember: yes, you can use bootstrapping to generate a large number of samples, but it can't give you anything that you haven't given it. If you give it biased data, your bootstrap samples are always going to be biased, even if you increase this to ten thousand or a million samples. It all depends on what you give it. And so this is how we can use bootstrapping to develop a sampling distribution. Later we can talk about how to take this bootstrap sampling distribution and develop a confidence interval around it.
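The point about biased input can be checked numerically with a small sketch. The ten values below are a made-up "unlucky" sample of two-dice sums, not the data from the video; the names `sample` and `centers` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical unlucky sample of two-dice sums; its mean (5.9) is below the true mean of 7.
sample = np.array([3, 4, 5, 5, 6, 6, 6, 7, 8, 9])

centers = {}
for n_boot in (1_000, 10_000, 100_000):
    # Each row is one bootstrap sample drawn with replacement from the original sample.
    resamples = rng.choice(sample, size=(n_boot, sample.size), replace=True)
    # Center of the bootstrapped sampling distribution of the mean.
    centers[n_boot] = resamples.mean(axis=1).mean()

for n_boot, center in centers.items():
    print(f"{n_boot:>7} resamples: center = {center:.2f}")
```

Under these assumptions the center stays close to the sample mean of 5.9 however many resamples are drawn, so adding more bootstrap samples does not move the distribution toward 7.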

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

  Try It: Apply Your Coding Skills in Google Colab

  1. Click the Google Colab file used in the video here.
  2. Go to the Colab file and click "File", then "Save a copy in Drive". This will create a new Colab file that you can edit in your own Google Drive account.
  3. Once you have it saved in your Drive, use the partial code below to calculate and plot the bootstrapped sampling distribution. Alternatively, you can continue to edit your Colab file from the previous module.

Note: You must be logged into your PSU Google Workspace in order to access the file. 

# Find the mean of each sample
boot_means = ...

# plot
(ggplot(...) + geom_dotplot(...))


  Assess It: Check Your Knowledge

Knowledge Check
