Summarizing Quantitative Data: Measures of Spread

Read It: Measures of Spread: Standard Deviation

In addition to knowing the center of a distribution of data values, it is often useful to know how dispersed, or spread out, the distribution is. A common measure of this spread is the standard deviation:

Standard Deviation: the typical distance of a data value from the mean. $s = \sqrt{\frac{\sum (x - \bar{x})^{2}}{n - 1}}$ Here $x$ ranges over the data values, $\bar{x}$ is their mean, and $n$ is the number of values.
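As a minimal sketch of the formula, the standard deviation can be computed step by step in plain Python and compared with `statistics.stdev` from the standard library. The values in `x` are hypothetical and chosen only for illustration; they are not from any data set used in this lesson.

```python
import math
import statistics

# Hypothetical sample values, chosen only for illustration
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(x)
x_bar = sum(x) / n                                    # sample mean
squared_deviations = [(xi - x_bar) ** 2 for xi in x]  # (x - x_bar)^2 for each value
s = math.sqrt(sum(squared_deviations) / (n - 1))      # divide by n - 1, as in the formula

print("by hand:         ", s)
print("statistics.stdev:", statistics.stdev(x))       # standard-library sample standard deviation
```

Both printed values should agree, since `statistics.stdev` also divides by n - 1.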

A larger standard deviation indicates a more dispersed collection of values, whereas a smaller standard deviation indicates a narrower distribution. To make the interpretation of the standard deviation more quantitative: for a roughly bell-shaped (normal) distribution, a new, randomly drawn value from the population has about a 95% chance of falling within two standard deviations of the mean (i.e., in $[\bar{x} - 2s, \bar{x} + 2s]$). This is depicted in the example image below.
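The 95% interpretation can be checked with a small simulation. This is only a sketch: it assumes the values come from a normal (bell-shaped) distribution, uses simulated rather than real data, and the 95% figure is approximate.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated draws from a normal distribution (an assumption made for this
# illustration; the roughly-95% rule relies on a bell-shaped distribution)
values = [random.gauss(0, 1) for _ in range(10_000)]

x_bar = statistics.mean(values)
s = statistics.stdev(values)

# Interval from two standard deviations below to two above the mean
lower, upper = x_bar - 2 * s, x_bar + 2 * s
inside = sum(lower <= v <= upper for v in values)
fraction_inside = inside / len(values)

print(f"fraction within 2 standard deviations: {fraction_inside:.3f}")
```

For strongly skewed data, the fraction inside this interval can differ noticeably from 95%.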

A histogram of a quantitative variable with a bell-curve shape. 95% of the data is within two standard deviations of the mean. Mean = 0.002, Median = 0.007, Standard Deviation = 0.997. At +2 standard deviations, the “z-score” is z = 2, so z*s = 2*s.

Measures of Spread: Standard Deviation


Credit: © Penn State is licensed under CC BY-NC-SA 4.0

The image above also introduces the z-score, which we will come back to later in the course. The z-score, $z$ , is simply the number (or fraction) of standard deviations that a value is from the mean. The z-score can be reported for a data point, or for a hypothetical value, such as the far right red line in the example above (where $z = 2$ ).
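A z-score can be computed for an observed data point or for a hypothetical value, as in this short sketch. The sample values and the helper name `z_score` are hypothetical and used only for illustration.

```python
import statistics

# Hypothetical sample values, for illustration only
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

x_bar = statistics.mean(x)
s = statistics.stdev(x)

def z_score(value, mean, std):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / std

z_data_point = z_score(9.0, x_bar, s)               # z-score of an observed data point
z_hypothetical = z_score(x_bar + 2 * s, x_bar, s)   # hypothetical value two standard deviations above the mean

print("z for the data point 9.0:      ", z_data_point)
print("z for the value x_bar + 2 * s: ", z_hypothetical)
```

A negative z-score would indicate a value below the mean.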

A Note on Notation

The formula above gives the standard deviation for a sample, denoted $s$. The standard deviation for the entire population is denoted $\sigma$ ("sigma", a Greek letter).

Cyclical diagram showing how data are collected from a population, quantified as a statistic, and used to make an inference; the notation for the standard deviation is described in the text above.

Notation for Standard Deviation


Credit: © Penn State is licensed under CC BY-NC-SA 4.0

Watch It: Video - Measures of Spread (4:36 minutes)

Transcript:

Hi, in this video we're going to keep on using the EPA FLIGHT data set. These are facility-level greenhouse gas emissions data, but now we'll use them to exemplify how to calculate measures of spread. So we're going to use those data to find the standard deviation and then the interquartile range in Python. So, let's get to it. So first off, to find the standard deviation, we can do this in a very similar way to how we found the mean and median above. So I'll do this in a print function, as we did before. So standard deviation is equal to, and then we call whatever variable it is we want to do this on, so total reported direct emissions in this case. And our function now is not mean or median, but std for standard deviation. And this actually shares the same units as the underlying variable. So in this case, millions of metric tons of CO2 equivalent. Okay, so our measure of spread, basically our measurement of how much these direct emissions vary across all the facilities in this data set, comes out to be a little bit more than a million. Pretty large number!

Okay. Well, let's compare this to the interquartile range now. So we're going to calculate the interquartile range using the describe function. And there are different ways to do this. I want to do it this way because the describe function is actually quite handy for a number of things. So let's first find that. So we'll call our variable total reported direct emissions, and our function here is just describe, and here we have to put in parentheses at the very end. And then we'll view the output.

Here we go. So this is a Series, actually, and it's got a lot of nice values. So we see our standard deviation here. We see our mean value. It also tells us how many values are being used here: 6,481, in this case. Note the use of scientific notation here. And then the minimum of zero, a max of 20 million or so, and then 25, 50, and 75 percent. This 50 percent is the median. So this agrees exactly with what we had for the median above, because this 50 percent refers to the 50th percentile. And that is another name for the median. So those are synonymous. And then we also have the 25th percentile and the 75th percentile. So 25 percent of the values are less than 31,000 and some change, here. And 75 percent of the data values are less than 186,000 and some change, here.

So our interquartile range is the difference between the 75th percentile and the 25th percentile. So we can simply find this by referencing these values in the describe series above. So it's our 75th percentile minus our 25th percentile. And you can view that value. Turns out to be 154,000 or so. So quite a bit less than the standard deviation. So these are two different measures of the spread. The standard deviation is, you know, the typical distance away from the mean. And if you recall, in the comparison of the mean and median, we have a right-skewed distribution up here. And so, our mean was a lot larger than our median. And so, there's a lot more spread about that mean, if we're looking at it that way. Whereas the median and the interquartile range are much more concentrated down in this lower end, where we have more observations.

So in summary, that's how you can find the standard deviation and the interquartile range. Along the way we found this handy describe function, which is useful for summarizing the data in one fell swoop. And again, we got to do all this with the fascinating FLIGHT data set on greenhouse gas emissions across the U.S. Okay, thank you.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0
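The steps narrated in the video can be sketched roughly as follows. This is an illustration under stated assumptions, not the code shown on screen: the Series `emissions`, its name, and its values are hypothetical stand-ins for the total reported direct emissions column, which is not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for the emissions column used in the video;
# the real data set and its column names are not reproduced here
emissions = pd.Series(
    [0, 12_000, 31_000, 45_000, 80_000, 186_000, 950_000, 20_000_000],
    name="total_reported_direct_emissions",
)

# Standard deviation, called the same way as mean() or median()
print("Standard deviation =", emissions.std())

# describe() returns a Series with count, mean, std, min, 25%, 50%, 75%, max
summary = emissions.describe()
print(summary)

# Interquartile range: 75th percentile minus 25th percentile
iqr = summary["75%"] - summary["25%"]
print("Interquartile range =", iqr)
```

Because these made-up values are right-skewed, the standard deviation comes out much larger than the interquartile range, which mirrors the comparison made in the video.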

The above video demonstrates how to calculate the standard deviation using the std function from pandas. There is also a std function in NumPy; however, it divides by n by default (ddof=0), so to calculate the standard deviation as given in the above formula, you need to specify the argument ddof=1, which represents the 1 being subtracted from n in the denominator:

`numpy.std(x, ddof=1)`

where x is an array of values.
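The effect of ddof can be seen in a short comparison. This sketch uses hypothetical values for `x`; it relies only on the documented defaults, namely ddof=0 for `numpy.std` and ddof=1 for the pandas `std` method.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_style = np.std(x)           # NumPy default (ddof=0): divides by n
sample_style = np.std(x, ddof=1)       # divides by n - 1, matching the formula for s
pandas_default = pd.Series(x).std()    # pandas default (ddof=1): divides by n - 1

print("numpy, ddof=0: ", population_style)
print("numpy, ddof=1: ", sample_style)
print("pandas default:", pandas_default)
```

The ddof=1 result matches the pandas value, while the default NumPy result is slightly smaller because it divides by n rather than n - 1.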

Try It: DataCamp - Find the standard deviation

You already found the mean and median of the total electricity consumed in a year (KWH) on the previous page.

HOME ID	DIVISION	KWH
10460	Pacific	3491.900
10787	East North Central	6195.942
11055	Mountain North	6976.000
14870	Pacific	10979.658
12200	Mountain South	19472.628
12228	South Atlantic	23645.160
10934	East South Central	19123.754
10731	Middle Atlantic	3982.231
13623	East North Central	9457.710
12524	Pacific	15199.859

* Data Source: Residential Energy Consumption Survey (RECS), U.S. Energy Information Administration (accessed Nov. 15th, 2021)

Now can you find the standard deviation using Python?

Assess It: Check Your Knowledge

Use the table below to answer the following questions.

Make	Model	Type	City MPG
Audi	A4	Sport	18
BMW	X1	SUV	17
Chevy	Tahoe	SUV	10
Chevy	Camaro	Sport	13
Honda	Odyssey	Minivan	14
