EME 210
Data Analytics for Energy Systems

Summarizing Quantitative Data: Histograms and Measures of Center


  Read It: Histograms

Let's start our discussion with histograms, because this particular form of data visualization is useful for illustrating the measures of center that we discuss a little later in this section. Histograms simply show the number (or "Count") of observations whose values fall within pre-defined intervals (or "bins"). There are many ways to make a histogram in Python, and a convenient one is the hist method in the pandas library, which you can apply to any quantitative variable:

df['x'].hist()

where you would replace df with your dataframe name and x with the variable name. The height of the columns indicates the relative number of observations in each interval. For example, in the histogram below, the tallest column belongs to the interval 0 to 0.1.
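
For instance, a minimal sketch is shown below, using made-up normally distributed data (the DataFrame and column names just mirror the generic df and x above):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up example data: 1,000 draws from a standard normal distribution,
# which produces a bell-shaped, symmetric histogram like the first figure below.
df = pd.DataFrame({'x': np.random.normal(loc=0, scale=1, size=1000)})

# Pandas chooses default bin edges; the bins argument lets you control them.
df['x'].hist(bins=20)
plt.xlabel('x')
plt.ylabel('Count')
plt.show()

Specifying the bins explicitly is often useful in practice, as discussed in the videos later in this section.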

Bell-shaped histogram; the x-axis spans approximately -4 to 4 and the y-axis count ranges from 0 to 400; summarized in the text above and below
Bell-shaped & Symmetric
Credit: © Penn State is licensed under CC BY-NC-SA 4.0 

Histograms effectively visualize the distribution of a quantitative variable, showing us what values that variable spans (i.e., the minimum and maximum), where most values are concentrated, and how spread-out typical values are. Some typical shapes of histograms are represented in this section. In the figure above, we have a bell-shaped and symmetric distribution, in which the counts fall off in a similar pattern above and below the center (this example happens to be centered at 0). The next figure shows an example of a right-skewed distribution, where observations are concentrated at lower values, with fewer observations at much larger values. This distribution appears to have a "tail" going off to the right. Note that the values need not be concentrated at 0, or even be positive; these are just features of this example.

Example of right-skewed data; the distribution drops off quickly because most observations are concentrated at lower values
Right-skewed
Credit: © Penn State is licensed under CC BY-NC-SA 4.0 

The next figure shows a left-skewed distribution, with a "tail" going off to the left. Most observations are concentrated at larger values. Note that the values need not be concentrated at 0, or even be negative; these are just features of this example.

Example of left-skewed data; the distribution starts slowly and then rises rapidly toward larger values
Left-skewed
Credit: © Penn State is licensed under CC BY-NC-SA 4.0 

An example of a distribution that is symmetric but not bell-shaped is depicted below. Even though it is centered at 0, that is not where most values are concentrated. This particular example is "bimodal", in that it has two locations where values are concentrated (about -2 and 2).

Example of a symmetric but not bell-shaped distribution that rises from -4, peaks at -2, drops, rises again to a second peak at +2, and drops off again by +4 (from left to right); described in the text above
Symmetric but not bell-shaped
Credit: © Penn State is licensed under CC BY-NC-SA 4.0

  Read It: Measures of Center

Most often, the primary characteristic of interest for a distribution of data is its center, because one wants to know the most representative value for the variable of interest. In the example below, even though we observe values as large as 4 and as small as -4, these are very rare; the most common values are close to the center at approximately 0.

bell curve rising from around -3, centered at a mean of approximately 0, then tailing off again at +3; as described in text above
Centered: Mean = 0.004   Median = 0.005
Credit: © Penn State is licensed under CC BY-NC-SA 4.0 
Perhaps the most common way to measure the center of a quantitative variable is with the mean, in which you add up all the values and divide by the number of values:
$$\text{Mean} = \frac{\text{Sum of all data values}}{\text{Number of data values}} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n} = \frac{\sum x}{n}$$

Here, $n$ is the length of $x$, or the number of values in that vector. When someone talks about the "average", they usually are referring to the mean. When one calculates the mean from a sample of data, we call it the "sample mean" and denote it with $\bar{x}$ or "x-bar", whereas the mean of the entire population is denoted with $\mu$ or "mu" (a Greek letter). The image below shows this distinction in notation graphically. Throughout this course, whenever we talk about a value that characterizes the population, we refer to that as a "parameter"; and when we talk about a value calculated from a sample, we call that a "statistic".
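
As a quick check of this formula, here is a small sketch with made-up numbers, comparing the sum-divided-by-n calculation against the built-in pandas mean method:

import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 8.0])   # made-up sample values

mean_by_hand = x.sum() / len(x)   # sum of all data values divided by n
mean_builtin = x.mean()           # pandas' built-in mean

print(mean_by_hand, mean_builtin)   # both print 5.0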

Another measure of the center of a distribution is the median, which is the middle value after you sort the values from lowest to highest (or the mean of the middle two if there are an even number of values):

Median: 

  • If n is odd, median is middle value of ordered data values
  • If n is even, median is mean of the middle two values of ordered data values

The median value is denoted with $m$.
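
A small sketch with made-up values illustrates both the odd-n and even-n cases:

import pandas as pd

odd = pd.Series([7, 1, 5])        # n = 3 (odd); ordered values are 1, 5, 7
even = pd.Series([7, 1, 5, 3])    # n = 4 (even); ordered values are 1, 3, 5, 7

print(odd.median())    # 5.0, the middle value of the ordered data
print(even.median())   # 4.0, the mean of the two middle values: (3 + 5) / 2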

A third measure of the center of a distribution is the mode, which is simply the value that appears most often in a variable:

Mode: The value that appears most frequently

We won't work with the mode much in this course, as it is a less useful measure of the center. Often with quantitative variables, the mode cannot be calculated because each value is unique. It is better applied to ranges of values (e.g., comparing the heights of bins in the histograms above), or for categorical data, where values do tend to repeat and one usually sees a distinct category with more observations.
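
For completeness, here is a minimal sketch of the mode using made-up categorical data, where repeated values make the mode meaningful:

import pandas as pd

fuel = pd.Series(['gas', 'coal', 'gas', 'solar', 'gas'])   # made-up categories

print(fuel.mode())           # the mode is 'gas', the value that appears most frequently
print(fuel.value_counts())   # counts per category, largest first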

Cyclical diagram showing how data are collected (sampled) from a population, summarized as a sample statistic, and used to make an inference about the corresponding population parameter; as described in the text above
Notation for Mean

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 

The mean is more affected by skewness (and outliers) than the median!

For symmetric distributions, the mean and median will agree on the location of the center. Take the example below, where the mean estimates the center at 0.004 and the median at 0.005. Although not exactly equal, these values are very close (indeed, both are represented by vertical dark blue lines, which are indistinguishable because they plot right on top of one another). 

Bell curve rising from around -3, centered at approximately 0, then tailing off again at +3; the mean and median both fall at approximately 0, as described in the text above
Influence of Skewness: Mean = 0.004. Median = 0.005
Credit: © Penn State is licensed under CC BY-NC-SA 4.0

However, for skewed distributions, the mean and median will generally not agree, and the more skewed the distribution, the more these two estimates will disagree. The reason for this disagreement is that the mean is more influenced by the extreme values in the tail of the distribution, since it sums the values in the numerator. The median doesn't care about the magnitude of the values, just their rank from smallest to largest. Take, for example, the right-skewed distribution below, where the mean of 1.123 is noticeably greater than the median of 0.993 (here the two dark blue vertical lines representing these values are distinguishable). The large positive values in the right tail have pulled the mean above the median. 

Right-skewed distribution showing how the mean and median differ, as described in the text above
Influence of Skewness: Mean = 1.123 Median = 0.993
Credit: © Penn State is licensed under CC BY-NC-SA 4.0

Thus, the median tends to give a more stable measure of the center that is robust to extreme values and outliers. For this reason, the median is often the go-to measure of the center for skewed distributions, such as salaries and housing prices, for example.
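
To see this robustness concretely, here is a short sketch with made-up salary values showing how a single extreme value pulls the mean upward while barely moving the median:

import pandas as pd

salaries = pd.Series([40_000, 45_000, 50_000, 55_000, 60_000])   # made-up salaries
with_outlier = pd.concat([salaries, pd.Series([1_000_000])], ignore_index=True)

print(salaries.mean(), salaries.median())           # 50000.0 and 50000.0
print(with_outlier.mean(), with_outlier.median())   # about 208333.3 versus 52500.0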

  Read It: EPA Flight Tool

The videos below use data from the EPA's Greenhouse Gas Reporting Program (GHGRP) to demonstrate how to make histograms and find the mean and median in Python. You may want to first explore the EPA's web app: Facility Level Information on GreenHouse gases Tool (FLIGHT). This will help you understand the content of the dataset, which gives the quantity of emissions from power plants, factories, refineries, etc. across the country.

screen capture of EPA's web app: Facility Level Information on GreenHouse gases Tool
Caption: A screenshot of FLIGHT.
Image Credit: EPA

  Watch It: Measures of Center (10:39 minutes)

Transcript:

Hi, in this video we're going to take a look at the underlying data for the EPA's FLIGHT tool. These are data on greenhouse gas emissions from facilities all across the United States. By facilities, I mean power plants, waste processing centers, factories, and so on and so forth. So, I highly recommend you click on this FLIGHT tool to explore the database in another format. And if you click on that, you'll go to their web app. So, this is the FLIGHT tool here, and it's a map-based overview of the data. You can zoom in on different localities and see where these facilities are located and information about them. You can also filter by different facilities and variables, and so on and so forth. So go ahead and check that out. However, what we're going to do here is we're going to look again at the underlying data. We're going to do our own thing with it outside of the web app. And we are going to ultimately find which state has the largest amount of emissions from facilities. So this is not emissions from the automotive sector or anything else, just from facilities. And along the way we're going to look at histograms. We're going to see how to calculate the mean and the median and then ultimately how to summarize all the emissions within each state.

So, let's get to it. So, we're all familiar with importing the data. We've got those lines of code here, except there's one difference now that you may not be familiar with, and that is this skiprows argument. So, why do I have skiprows here? Let me explain. If we take a look at the underlying CSV, our data frame or our table doesn't really start until row four, here with the column headers, and then all the data below. We've got three superfluous lines here that really aren't part of the data frame at all and that we don't want to include. So, we just need to skip those three rows. So, pandas makes this incredibly easy. We say skiprows equals three, so we skip over those unnecessary lines there. And this highlights the importance of taking a look at the CSV file before doing the importing, just to understand exactly what is in this data file. And you can open up these CSV files in Google Sheets or Microsoft Excel. So, once we have this imported, let's go ahead and take a look at the size of this data set. And one handy function to do this is shape. And all that shape gives us are the number of rows in this data set and the number of columns. So, 15,386-some-odd rows. The rows here are different facilities, and we've got 66 columns. Furthermore, we should take a look at what the columns are exactly. So, let's print out the column names here.

So, we've got facility ID, facility names, city, state, etc., etc. And one really important variable here is total reported direct emissions. So, we're going to focus on that, although there's a lot of other interesting things here too, in particular the industry type. So, what form of industry is producing these emissions. Lots of other useful things here, so let's get down to it. Let's start by finding the histogram. So, with a histogram we want to make sure that the data that we plot in our visualization, in the histogram, spans from the minimum of the data to the maximum of the data. At least it needs to cover the span of the data. So, if we want to find the min and the max, let's just say ghg min. We want to find our minimum total reported direct emissions. We just use the min function, very easy. We'll do the same thing for the max. And let's print these results. We'll say min is ghg min and max is ghg max. So, our direct emission values for all the facilities span from zero to some really large number here. So, what is this, about 20 million? More than 20 million, so a very large number here. So, let's see what happens when we make a histogram of this. So, we'll call our data frame, and one easy way to make a histogram is through pandas. We just say .hist and we give it the column that we want to make the histogram of, and this, in this case, is total reported direct emissions. And we'll run this. And so, here we go. We get our histogram, and what it's done is it's used some predefined, you know, rules of thumb to make adequate bin sizes for our histogram here. So, the histogram is counting the values that fall within specified ranges and then plotting those frequencies as bars. So, this is essentially a frequency table represented graphically, where the categories are just different ranges of the quantitative variable. However, when we use the default method here, we don't know what those ranges are. And that's a little bit problematic. We're not sure basically what the break point is between this first bin and the second bin. We don't know where this stops. So, it's good practice to manually specify the number of bins that you have. So, we could do this fairly crudely, and we could say 1e7, 1e7 being 10 million, then up to 20 million. But then we have to go up to 30 million because our largest value is slightly more than 20 million. So, we could do this; we'll have three bins. We've got four values specifying the break points between those three bins. We visualize this and we see a pretty crude histogram. I say this is crude because we pretty much have all our values in this first bin here, and then a few values in the second bin, and probably one value in this third bin. So, these bins are too fat. We don't have good enough resolution. And so, we could break this up some more. We could subdivide it. A handy tool here is the range function. The built-in range will just give us a sequence of values from some integer to some other integer, so let's say 21 million, and by some step, let's say we want to go by 1 million. And we get that, the list of values going up in that increment. So, this is a handy way to build these bins if we want to have many of them and we don't want to manually specify the break point between each and every one. So, there we go. We get a much finer histogram, and we know where the break points are. 
We know this is going from zero to one million, one million to two million, two million to three million, and so on, so forth.

Okay, so that's the histogram, and it's telling us the spread of the emissions from the facilities. In particular, we see we've got a lot of values that are close to zero and very few values that are really, really large. So this is what we call right-skewed. This is an example of a right-skewed histogram. Let's go ahead and calculate some measures of center. So, what is the average level of emissions in these facilities? I'll just do this in a couple of print statements. So, if we want to find the mean, also denoted x bar, this is going to be equal to our quantitative variable, which in this case is total reported direct emissions, and we just say dot mean. Calculate the mean. It's also good practice to include units. Our units here are metric tons of CO2 equivalent. So, that'll give us the mean. And let's compare this to the median. So instead of the mean function we just have the median function there. And this doesn't have the notation x bar; the median is just the median. Let's run this. We see that our mean is about 388,000, our median 65,000 or so. A lot less. What this indicates to us, with the mean being much larger than the median, is that the distribution is indeed right-skewed. In other words, the value of the mean gets pulled up by these much larger values up here. These are most likely outliers. So, the mean is more susceptible to these outliers. The median is more robust to them. Note that the median falls, you know, well within this first spike here, and the mean somewhere up here.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0
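
For reference, here is a rough sketch of the steps narrated above. The file name is a placeholder and the column label is assumed from the narration, so check them against your own copy of the GHGRP download.

import pandas as pd

# Read the facility-level GHGRP file, skipping the three rows above the column
# headers (the file name here is a placeholder).
ghg = pd.read_csv('flight_facilities.csv', skiprows=3)

print(ghg.shape)     # (number of facilities, number of columns)
print(ghg.columns)   # inspect the available variables

em = ghg['Total reported direct emissions']   # column label assumed from the narration
print('min =', em.min(), 'max =', em.max())

# Histogram with explicit break points: bins 1 million wide, from 0 up past the
# maximum value of roughly 20 million.
em.hist(bins=list(range(0, 22_000_000, 1_000_000)))

print('mean   =', em.mean())
print('median =', em.median())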

  Watch It: Measures of Center Part 2 (6:16 minutes)

Transcript:

Lastly, let's find our total emissions for each state. So what we need to do here is within each state, sum up all of the emissions from all of the facilities within that state. So far we've just been looking at all the facilities collectively, but let's actually tally up the quantities within each state. So in order to do that, let's create a new object called stem, for state emissions. And we'll reference our initial data set. And we're going to use this group by function. So we're going to say we want to group by the variable, state, here. So they're going to be our groups. And within each group we want to perform whatever comes next. In this case we want to sum up the values.

When we do this, we retain all of our columns there. But ultimately, our total reported direct emissions are the sum within each state. Also one byproduct of this is that state becomes an index here. So let's reset that. Let's include in here this function reset index. And that'll just turn the state back into a normal column. There we go. So now we see state is just a regular column. Okay, so we've done that. We've summed all the total reported direct emissions within each state. Now let's find what the max is. So our max emissions will be the max of stem total reported direct emissions. And let's see what that max value is. Some really large value. What is this? This is 372 million and some change.

Okay. But we don't know what state that belongs to, so now we need to find which row contains this max value. So let's just make some dummy variable, we'll call it f. And we'll say stem total reported direct emissions equals equals max em. Note that I'm making a logical statement here. I'm using this equals equals to say which of these values are equal to max em. And what we'll get then is a Boolean variable. In other words, a variable containing only true and false. So it's false everywhere that it doesn't contain max em. And then we've got this one true in row 45, or index 45, here. But still that doesn't tell us what state, it just tells us the index. But we're getting there. So now we just need to query that row. We say stem of f, and you get the row that has the max total reported direct emissions, the state being Texas. And there we have our answer. A quicker way to do this would be to say max row equals stem dot loc, using our loc function here, of stem total reported direct emissions. And instead of max we're going to use idx max. So return the index where the max is located, not the max value itself. Okay. And I need to type square brackets here so we can then see what max row is. And it reports all the variables, all the variable values, for whatever the max is. We see it's Texas here. And we could put this into a nice print statement. I could say print the state with the highest facility emissions is, and then we say max row state.

There we go. So we found our answer. All right, so we've covered histograms, mean, median, and then ultimately summarizing data within groups. Note that Texas here, you know, is the largest state in the continental United States. And so you might say, well, of course it has the highest total direct emissions; it's a really, really large state - lots of people, lots of industrial activity, makes a lot of sense. If we wanted to normalize this by state size, we could look at what the average is across all the facilities, and see if there's any state whose facilities are producing above-average emissions, in which case we'd replace the sum here with mean. So you can replace this sum with any other function if you want to summarize your data within groups differently. Okay, we'll stop there for now.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0
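
And here is a rough sketch of the group-by steps from this second video, again assuming the placeholder file name and the column labels 'State' and 'Total reported direct emissions' from the narration:

import pandas as pd

ghg = pd.read_csv('flight_facilities.csv', skiprows=3)   # as in the previous sketch

# Sum emissions across all facilities within each state; numeric_only=True
# drops the text columns, which can't be summed.
stem = ghg.groupby('State').sum(numeric_only=True).reset_index()

# Row holding the largest state total, located via the index of the maximum
max_row = stem.loc[stem['Total reported direct emissions'].idxmax()]
print('The state with the highest facility emissions is', max_row['State'])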

  Try It: DataCamp - Find the mean and median

Let's return to our sample of residential energy use, which gives the total electricity consumed in a year (KWH, kilowatt-hours):

HOME ID   DIVISION             KWH
10460     Pacific              3491.900
10787     East North Central   6195.942
11055     Mountain North       6976.000
14870     Pacific              10979.658
12200     Mountain South       19472.628
12228     South Atlantic       23645.160
10934     East South Central   19123.754
10731     Middle Atlantic      3982.231
13623     East North Central   9457.710
12524     Pacific              15199.859

* Data Source: Residential Energy Consumption Survey (RECS), U.S. Energy Information Administration (accessed Nov. 15th, 2021)

Can you use Python to determine, from this sample, the typical annual home electricity use? Find both the mean and median.
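
One possible setup is sketched below, typing the sample KWH values from the table into a pandas Series by hand (the variable name is just a placeholder):

import pandas as pd

# Annual household electricity use (KWH) from the table above
kwh = pd.Series([3491.900, 6195.942, 6976.000, 10979.658, 19472.628,
                 23645.160, 19123.754, 3982.231, 9457.710, 15199.859])

print('mean   =', kwh.mean())     # sample mean
print('median =', kwh.median())   # sample median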


  Assess It: Check Your Knowledge

Use the table below to answer the following questions.

Make    Model     Type      City MPG
Audi    A4        Sport     18
BMW     X1        SUV       17
Chevy   Tahoe     SUV       10
Chevy   Camaro    Sport     13
Honda   Odyssey   Minivan   14
