Lesson 3: Data visualization

Overview

Visualizing data is a critical step in the analysis process. In this lesson, we will introduce a plotting tool in Python, as well as several types of plots that can be used to show different types of data. We will start with an introduction to plotting with one variable, demonstrating bar plots, box plots, and histograms. Then, we will delve into plots with multiple variables, including bar plots (again) and scatter plots. Finally, we will wrap up with an example of working with time series and line plots.

Learning Outcomes

By the end of this lesson, you should be able to:

Recognize different types of plots in Python
Implement Python code to plot one variable at a time
Create plots with multiple variables
Distinguish correlation vs. causation

Lesson Roadmap

Type	Assignment or Assessment	Location
To Read	Lock et al.: 2.4-2.7	Textbook
To Do	Complete Homework: H04 Exploratory Data Analysis	Canvas
To Do	Take Lesson 3 Quiz	Canvas

Questions?

If you prefer to use email:

If you have any questions, please send a message through Canvas. We will check daily to respond. If your question is one that is relevant to the entire class, we may respond to the entire class rather than individually.

If you prefer to use the discussion forums:

If you have questions, please feel free to post them to the General Questions and Discussion forum in Canvas. While you are there, feel free to post your own responses if you, too, are able to help a classmate.

Visualization: One Variable

Read It: Visualizations with One Variable

In this lesson, you will learn about various types of plots and how to create those in Python. In particular, we will start with plots that consider one variable. If the variable that you want to plot is a categorical variable, often you will want to create a bar plot, as shown below. Here, the variable we are plotting is the majors of students that took this survey, and on the y-axis we are showing the count of those majors. This type of plot can also be considered a graphical representation of a frequency table.

seven column bar graph depicting number of students within programs of study who completed survey; As described above with results described below

Bar plot of a single categorical variable. Here, we show the major of students on the x-axis and the number of students in that major on the y-axis.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

You can also use bar plots to show a quantitative variable that is separated by a category. For example, the plot below shows the total number of credit hours taken by students in the survey, separated by major. We can see that PNGE students took the most credit hours in total, largely because they represent the largest major that took the survey.

seven column bar graph depicting number of credit hours taken by students within programs of study; As described above with results described below

Bar plot of a single quantitative variable separated by categories. Here, we show the major of students on the x-axis and the total number of credits being taken by students in that major on the y-axis.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

Another way that you can plot quantitative data is through histograms. Histograms look similar to bar plots, but instead of showing categories, they group the values of a quantitative variable into bins and show the count of observations in each bin. Below, we show an example histogram that plots the credit hours taken by students in the survey. Plotting a histogram can also be a great way to visualize a variable's distribution.

8 column histogram depicting count for student credit hours as characterized above

Histogram of a single quantitative variable. Here, we show the number of credits being taken by students on the x-axis and the count (or frequency) of the number of students that are taking a given number of credits on the y-axis.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]
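
The counts on a histogram's y-axis can be sketched directly: the values are grouped into bins, and the observations in each bin are counted. The example below is a minimal sketch of that counting step using NumPy. The credit-hour values are made up for illustration (they are not the survey data), and the choice of three bins is an assumption.

```python
import numpy as np

# Hypothetical credit hours for ten students (illustrative values, not the survey data)
credits = np.array([12, 15, 15, 16, 17, 18, 18, 18, 19, 21])

# Group the values into 3 equal-width bins and count the observations in each bin
counts, bin_edges = np.histogram(credits, bins=3)

# Print one line per bin: the bin's range and how many students fall in it
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.0f} to {right:.0f} credits: {count} students")
```

Changing the number of bins changes the shape of the histogram, so it is worth trying a few values when exploring a distribution.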

Similar to histograms, box plots can be used to visualize a distribution. They can also be used to identify the 5-number summary of a quantitative variable. Below, we show a box plot of the credit hours taken by students. We can see the minimum value, the maximum value, the median, and the first and third quartiles. Additionally, box plots will show you which values, if any, are outliers. In this plot, we do not have any outliers.

box shaped plot depicting rectangular data range for minimum and maximum data values as described above and below

Box plot for a single quantitative variable. Here, we show the number of credits taken by each student on the y-axis and nothing on the x-axis.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]
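
The 5-number summary that a box plot displays can also be computed directly. The sketch below uses NumPy with made-up credit-hour values (not the survey data). Different tools use slightly different conventions for computing quartiles, so the quartiles drawn on a box plot may differ a little from the ones computed here.

```python
import numpy as np

# Hypothetical credit hours for ten students (illustrative values, not the survey data)
credits = np.array([12, 15, 15, 16, 17, 18, 18, 18, 19, 21])

# The 0th, 25th, 50th, 75th, and 100th percentiles form the 5-number summary
minimum, q1, median, q3, maximum = np.percentile(credits, [0, 25, 50, 75, 100])

print("Minimum:", minimum)
print("First quartile:", q1)
print("Median:", median)
print("Third quartile:", q3)
print("Maximum:", maximum)
```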

Assess It: Check Your Knowledge

Knowledge Check

FAQ

Introduction to Plotnine

Read It: Introduction to Plotnine

In this course, we will use the Plotnine library [2] to create plots in Python. Plotnine uses a style of plotting known as "Grammar of Graphics". The name stems from the idea that you build graphics from the bottom up using a specific syntax, similar to how you build a sentence up using specific grammar! For example, you could write a basic sentence: "The fox jumps." It has all of the necessary components, but is fairly simple. You could increase the complexity of the sentence by adding some adjectives: "The quick, brown fox jumps." Next, you could keep building this sentence with a preposition: "The quick, brown fox jumps over the dog." Finally, you could add another adjective, adding further complexity to the sentence: "The quick, brown fox jumps over the lazy dog." In this way, we will use the "Grammar of Graphics" to start simple and build the complexity of our plots.

There are three key components of any "Grammar of Graphics" (or "ggplot") plot:

`ggplot(...)`: This command starts the plotting object and specifies the pandas dataframe that you are going to use for visualization.
`aes(...)`: This command defines the variables that you are plotting.
`geom_xyz(...)`: This command defines the type of plot. You can use points, lines, bars, and many others. For example, in today's tutorial, we use geom_bar to demonstrate bar plots.

The most basic syntax, similar to our basic sentence above, is:

`ggplot(...) + geom_xyz(aes(...), ...)`

As the plots get more advanced, we can add things like labels:

`ggplot(...) + geom_xyz(aes(...), ...) + xlab(...) + ylab(...)`

Or axis controls via `theme`:

`ggplot(...) + geom_xyz(aes(...), ...) + xlab(...) + ylab(...) + theme(...)`

Other plotting tools in Python include matplotlib [3] and seaborn [4], but for the purpose of this co

Assess It

Knowledge Check

Bar Plots: One Variable

Read It: Bar Plots

Watch It: Video - Importing and Cleaning the Data (9:03 minutes)

Click here for a transcript.

Hello. In this lesson we are going to talk about visualization and how we can use Python to create graphs to aid our statistical inference and data analysis. Before we get into the actual visualization, however, we're going to import and clean some of the data that we will be using later on in the lesson. So, with that, we can go ahead and get started.

All right, so we've got our data set, or our code here. We've got three libraries that we will be using within this demonstration. Throughout the lesson we'll use our Google Colab library to import the data. We'll use pandas to read in a CSV file, and we'll use Plotnine to actually do the plotting. And I want to point out that we are using this asterisk to tell Python to import all the commands from Plotnine without the need for a nickname. So, this is one of the few libraries that we'll use throughout the course in which you don't need to give it a nickname. You can simply import all the commands as is.

So, with that let's go ahead and get started. I'm using the first option from lesson one to import our data, the mount Google Drive option, and I've already got it set up here. I also commented out my file path for easy use. And so, I will go ahead and read this data in. We'll just call it survey, and we'll say pd dot read csv and just give it the file path. I've looked into this file before, and we know that there is no extra header or anything, so no need to skip rows. But to give you an idea about what this looks like, this is a survey that was collected from students that took this course in University Park, during the Spring of 2023. And we can see that they have answered all these questions. But we can also see that there's a number of unnecessary columns, some metadata, all of this that we want to remove in order to make it usable in the future. And so, the first thing I'm going to do is print the names of the columns to the screen. And this is a good practice to get into because it very quickly allows us to see what columns there are and possibly which columns then need to be removed.

And so, we can see that there's all of these sort of metadata about which section the students were in. This was a canvas-based survey, so it's all of that stuff. There are also these numbers here that aren't attached to any data at all that we need to remove. And so, we'll start in to step three of our data cleaning, which is to remove some of these unnecessary columns.

And so, the first thing that I'm going to do, I'm going to override my existing data frame and I'm going to say “survey”, which is what I call my data, dot drop, and this will drop any columns that we tell it to. So, I'm going to. But we need to tell it which columns and we give them those columns in square brackets. And so, for this command I'm going to come back up here. I'm going to copy these top command, top columns here, as well as these bottom ones here. And you'll notice that I have not removed any of the number columns and that's because there's so many of them that I'm actually going to do a secondary removal that sort of speeds up the process. So, I'll go ahead and run this command. So, we've now dropped these columns. And so, if we come in here and say, survey dot columns, and run that… and there is an error because I've already dropped those columns. So, when I go to rerun them, it doesn't like it. So, this is something that you might come across in your own homework in your own coding assignments, is when you try to rerun something that you already did, sometimes it doesn't find what you're trying to do. Maybe you've removed a row that no longer exists and now you're trying to call it. In this case I've removed columns that don't exist so it can't drop them. But if we go through here and just type it in another code block, we can see that the columns are very similar to up here, but we no longer have these section base columns, or this incorrectly score base columns.

Okay. To remove these, you know, random numbers, 1.0, 1.0.1, we're going to do a similar process, but we're going to tell Python to do all of the columns that start with one simultaneously. So, to do that, we say survey dot loc, give a colon to say all of the rows, and then we do tilde survey dot columns dot str dot startswith one. And this will actually drop every column whose name is a string that starts with one. So, if we run this, and then our printing -- oops typo there -- and we print the columns at the end, we can now see that we've got just the questions, which is exactly what we want.

However, it's not exactly where we would like it to be. We need to rename these columns. And so, to do that I'm going to once again override my original data and say survey dot rename and then we say columns and then we open curly brackets. And now the basic way that we do this is that we come up here and take the first column name that we want to rename, put it in quotes, and then outside of that quote, do colon whatever we want to rename it. So here, we'll rename it to home. And so, this will do the first column, comma, and enter down to do the second column. And so, we can retype all of these, but I am going to just grab it from off screen and copy and paste it here and that just makes it a little faster. So, you aren't watching me type all of these, but it all follows the same basic formula, original column name, new column name, comma, next one. So, we can run that and then again if we look at our column names, we can say survey dot columns and once again see that now all of these are easier to work with and will ultimately help us when we're plotting, which we’ll return to in the next video.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]
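
The three cleaning steps described in the video (dropping named columns, dropping every column whose name starts with "1", and renaming the rest) can be sketched as follows. The raw dataframe and all of its column names are invented for illustration. The `errors="ignore"` argument is an addition that does not appear in the video; it is one way to avoid the rerun error mentioned in the transcript.

```python
import pandas as pd

# Hypothetical raw survey export; every column name here is invented for illustration
survey = pd.DataFrame({
    "section": ["001", "001"],
    "What is your major?": ["PNGE", "EBF"],
    "1.0": [1.0, 1.0],
    "How many credits are you taking?": [15, 18],
    "1.0.1": [1.0, 1.0],
    "n incorrect": [0, 0],
})

# Step 1: drop named metadata columns.
# errors="ignore" (an addition, not from the video) lets the cell be rerun
# without raising a KeyError once the columns are already gone.
survey = survey.drop(columns=["section", "n incorrect"], errors="ignore")

# Step 2: keep all rows (:) and only the columns whose names do NOT start with "1"
survey = survey.loc[:, ~survey.columns.str.startswith("1")]

# Step 3: rename the remaining columns using {old name: new name} pairs
survey = survey.rename(columns={
    "What is your major?": "Major",
    "How many credits are you taking?": "Credits",
})

print(list(survey.columns))
```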

Watch It: Video - Creating Bar Plots with One Variable (5:45 minutes)

Below we will create two bar plots. The first will show a categorical variable with counts. The second will show a quantitative variable separated by category.

Click here for a transcript.

Hello, and welcome back to the video series. For this lesson, we're going to go ahead and dive into how we can visualize data and create plots using Plotnine and ggplot in Python. And we're going to demonstrate this through bar charts today. So, here we are in Google Colab. This is the same file that we were working with to import and clean the data. So, we've got our libraries here. We've got our file read in and imported, and then we did some cleaning to ultimately remove and then rename the remaining columns. So, with that, let's go ahead and dive in.

So generally, bar charts are going to be used to display some sort of categorical variable and here we're going to demonstrate how to display counts for that. So, this is in effect similar to a frequency table, but just in graphical form. And so, we're working with ggplot and for most of the ggplot lectures I'm actually going to wrap the whole command within a set of soft parentheses. And this just helps me to enter down in between different parts of the ggplot command, and makes it a little easier to read. So, with any ggplot figure we start off with the ggplot command and we give it whatever our data frame is called. Here, I call it survey. And then we add in the geom, and for bar charts that geom is known as geom_bar. And the final piece that we always need is the aes or aesthetics mapping. And in this case, all we need to do is specify a categorical variable for our x-axis.

But we're not quite done. If we go outside of the aes parentheses, we need to set the stat equal to count. So, this tells Python what statistical measure, what calculation you want to show on the y-axis. And then for demonstration purposes we can add a fill color, again outside of that aes statement. And so, we can run this, and we can see how the count by major looks. And so, when we took this particular survey, we had the majority of students from the PNGE major with a few from Energy Engineering and EBF, while the other majors were very small groups. But this is a primary way to use a bar plot: showing a categorical variable with counts on the y-axis.

Occasionally you will want to show a quantitative variable on the y-axis that isn't just the count of how many were in a category. Maybe you want to see how many credit hours were being taken by all the students within different majors. And for that, we can also do a bar plot, but we need to change it just a little bit. And so, we'll start off again with these open and closed parentheses, and then our ggplot command for survey. And then we're still going to do a bar chart, so we still say geom_bar and then we have aes. And so, we have our categorical variable on the x-axis. And now, we left it at here for the above plot, but for this particular plot we are going to add a y-axis with credits. So, this is a quantitative variable that will now show up on our y-axis, but once again we still need to add a stat regardless of what type of bar plot we're doing. And in this plot, we're going to say stat equals identity. And this means you now use the data as you see it. Don't do any calculation on it. Just identify what those values are and plot them.

And then we'll also add a fill. And in this case, I'm going to show you a slightly different way to set colors, and that is through these hex codes. So, these are six-digit long codes that are a mix of numbers and letters, and they all attach to a specific color. And so, if you're ever interested in using a certain color palette, you can find these hex codes fairly easily online. This particular hex code is for a very dark, almost dark blue, that is almost black. But here, we see our plot. We've got our categories, our majors on the x-axis, we've got credits, our numerical variable on the y-axis, and we can see that it has plotted those variables. And it's actually plotted them as like a total. And so, you can imagine that each of the credit hours for an individual student has been stacked on top of each other for a total value based off of the majors.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

Try It: Apply Your Coding Skills in Google Colab

The Google Colab file used in the video is here [5].
Go to the Colab file and click "File" then "Save a copy in Drive". This will create a new Colab file that you can edit in your own Google Drive account.
Once you have it saved in your Drive, plot two bar plots by following the outline below:

Note: You must be logged into your PSU Google Workspace in order to access the file.

from plotnine import * # import the library
 
# fill in the blank (...) to plot a bar chart with 'Major' on the x-axis and the count on the y-axis
ggplot(survey) + ...
 
# fill in the blank (...) to plot a bar chart with 'Major' on the x-axis and 'Credits' on the y-axis
ggplot(survey) + ...

Once you have implemented this code on your own, come back to this page to test your knowledge.

Assess It: Check Your Knowledge

Knowledge Check

Box Plots

Read It: Box Plots

In this lesson, we are going to demonstrate how to create box plots using ggplot/Plotnine.

Watch It: Video - Box Plots (6:24 minutes)

Click here for a transcript.

Hello. Welcome back. In this video we're going to continue to talk about how we can create visualizations in Python using Plotnine or ggplot. And in particular we're going to focus on box plots in this video. All right, so we are back in the same visualization Colab file. We've got our libraries, continuing to use Plotnine. We have our data that we went through in the first video to clean up and rename a number of columns, and now we're going to get into box plots. And now once again, we're using ggplot. And I'm going to create these parentheses so I can enter down between options. So, we start off with what we do for every ggplot object, and so we start off with ggplot survey, and then we have some kind of geom. And here our geom is going to be box plot. And then we need to have our aes. And so, the box plots in ggplot are a little bit different than some other plots: they do require both an X and a Y variable to be put within our aes parentheses. However, you can just state a single value, such as 0 for the x-axis. And then here we need a quantitative variable for the y-axis, for which we will use credits. And then outside of the aes, we can add a fill. And let's just fill this one with pink. So, we can run this, and we can see this box plot.

And so, we've got credits and we can use this, we can see where our median is, where our first and third quartile is, as well as the min and max. But we've got this sort of weird x-axis that isn't necessarily the best to include, because it has nothing to do with the data itself. And so, if we wanted to change that, we would need to work with a function called theme. And within theme, you can use this to do about anything. You can change legends, you can change text, you can change the background. Here, we're going to be using it to remove the x-axis label, the x-axis text, and the tick marks. So, we'll start off with the title. So, we can say axis_title_x. And so, here we can add any number of things. We can change it. But what I'm going to do is say element_blank. This will just say the x-axis title, make it blank.

And then the next we'll do is we'll do the text. So, this is actually these numbers down here, negative 0.4, negative 0.2, 0, and so forth. So, axis_text_x, and again element_blank, and then the last thing we wanted to remove were the tick marks. So, we can say axis_ticks_major_x. So, you can start to see the pattern here. We tell it what bit of what part of the plot we're working with, so the axis, then we tell it what part of the axis that we're working with, here the title, the text, or the major ticks. And then we tell it which axis we're doing. So, when we run this, we can now see that we've taken away all of that that was on the x-axis, makes it look a little cleaner without that additional set of numbers. That comes from the fact that we are forced to provide some x value.

So, this is how you can make a single box plot, but say you wanted to break up this box plot by major. So, similar to our bar plot from that video, we want to look at how credits are separated by major. And so, here we can once again say ggplot. With the survey data we can say geom_boxplot(aes. And now, because we actually have two variables that we're doing, we can give our categorical variable on the x-axis and our quantitative variable on the Y, and we can still fill with pink. And so, here we can see how each of these different majors varies within their major in terms of credit hours. So, we can see that PNGE had the greatest variation in credit hours being taken, whereas Other and these two majors each represent a single data point, so not much… no variation there. You can see that they had less.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

Try It: Apply Your Coding Skills in Google Colab

The Google Colab file used in the video is linked here [6].
Go to the Colab file and click "File" then "Save a copy in Drive". This will create a new Colab file that you can edit in your own Google Drive account.
Once you have it saved in your Drive, use the partial code below to create box plots using plotnine/ggplot. Alternatively, you can continue to build off of the code used in the bar plots example.

Note: You must be logged into your PSU Google Workspace in order to access the file.

from plotnine import * # import the library
  
# fill in the blank (...) to plot a box plot of 'Credits'
ggplot(survey) + ...
  
# fill in the blank (...) to plot a box plot with 'Major' on the x-axis and 'Credits' on the y-axis
ggplot(survey) + ...

Once you have implemented this code on your own, come back to this page to test your knowledge.

Assess It: Check Your Knowledge

Knowledge Check

Visualization: Multiple Variables

Read It: Visualization with Multiple Variables

So far in this lesson, you have learned about various types of plots that demonstrate one variable. In this lesson, we will extend these plots to include multiple variables.

First, we will return to bar plots. To add a third variable to a bar plot, you can create a fill variable, which can be used to show a categorical variable. Here, we are plotting a categorical variable on the y-axis (Emissions Scenario), a quantitative variable on the x-axis (Change in Emissions Relative to 2005), and a categorical variable as the fill (Source).

bar plot with multiple variables described in paragraph above and below in caption

Bar plot showing three variables. Here, we show a categorical variable (the emissions scenario) on the y-axis, a quantitative variable (change in emissions relative to 2005) on the x-axis, and a categorical variable (source of emissions) as a fill in the bars.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

If we have two quantitative variables, we can plot a scatter plot. Below, we show a scatter plot of the maximum gas and the total gas extracted from wells in the Marcellus Shale.

scatter plot with 2 variables as described in paragraph above and below in caption

A scatter plot of two quantitative variables. Here, we show the maximum gas extracted from a Marcellus Shale well on the x-axis and the total gas extracted from a Marcellus Shale well on the y-axis.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

When we have two quantitative variables, we can also begin to discuss the correlation or association between the two variables. For example, based on the plot above, we can say that the maximum and total gas extracted from a well are positively associated. However, if we want to quantitatively measure that association, we need to understand the correlation.

When working with correlation, you need to use distinct notation, as shown below. We use the Greek letter "rho" (ρ) to denote the population correlation and the Latin letter "r" to denote the sample correlation.

diagram depicting population as rho; sample as r;

Recall the population and sample illustration from before. Here, we show the notation for correlation, with "rho" representing the population correlation and "r" representing the sample correlation.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

Correlation values range from -1 to 1, with values near the ends of the range representing strong correlation (positive or negative, depending on the sign) and values closer to 0 representing weak correlation. The figure below shows possible correlation values. It is important to note, however, that correlation only measures linear relationships! Non-linear relationships can have strong associations and still have a correlation at or near 0, as shown below.

visualization of values for correlation coefficients

Demonstration of different values for correlation coefficients.

Credit: This file is made available under the Creative Commons CC0 1.0 Universal Public Domain Dedication [7].
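
The sample correlation r can be computed directly, which also makes the linear-only limitation easy to check. The sketch below uses NumPy with made-up data: one perfectly linear relationship and one symmetric curved relationship. Both datasets are illustrative assumptions, not course data.

```python
import numpy as np

# Made-up x values, symmetric around 0
x = np.arange(-5, 6)

# A perfectly linear relationship and a symmetric curved (quadratic) relationship
y_linear = 2 * x + 1
y_curved = x ** 2

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print("r for the linear relationship:", round(r_linear, 3))
print("r for the curved relationship:", round(r_curved, 3))
```

In the curved case, y is completely determined by x, yet r is essentially 0 because the increasing and decreasing halves of the curve cancel out.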

A good example of this is the line plot below, which shows the average hourly kWh produced by Dr. Morgan's solar panels in 2021. There is clearly an association between the time of day and the kWh production, but it is not linear, so the correlation coefficient will be close to 0.


A line plot of a quantitative variable. Here, we show the hour of the day on the x-axis and the average produced kWh of solar panels during that hour over the course of the year on the y-axis.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

A final important note when determining correlation is that correlation does NOT equal causation. For example, the plot below shows the number of people that have died in swimming pools plotted against the electricity generated by nuclear power plants in the US. The correlation coefficient is 0.90, which is very strongly positive! This correlation, however, doesn't mean that the increased use of nuclear power has caused the increased swimming pool deaths, simply that they have both increased at roughly the same rate over the years.

Scatter plot that shows erroneously that the number of pool drownings is directly related to nuclear power generation

Scatter plot of two quantitative variables with a best fit line and 95% confidence interval band. Here, we show the power generated by nuclear plants in the US on the x-axis and the number of people that drowned in swimming pools in the US on the y-axis.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

Assess It: Check Your Knowledge

Knowledge Check

FAQ

Bar Plots: Multiple Variables

Read It: Bar Plots with Multiple Variables

In this lesson, we are going to demonstrate how to create bar plots using ggplot/Plotnine. First, however, we will need to prepare the dataset for plotting. This will involve importing the data and melting the data to get it into "long" form. In other words, the melt command will convert the column names of select variables into their own categorical variable. For example, the code below will convert the data as shown in the figure.

`df.melt(id_vars='Country', value_vars=['1999', '2000'], var_name='Year', value_name='Cases')`

Demonstration of how the "melt" command works in Python.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]
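
To try the melt command on its own, here is a minimal sketch that builds a small wide dataframe and converts it to long form. The country names and case counts are illustrative values chosen for this example and are not necessarily the ones shown in the figure.

```python
import pandas as pd

# Hypothetical wide-form data: one row per country, one column per year
df = pd.DataFrame({
    "Country": ["Afghanistan", "Brazil", "China"],
    "1999": [745, 37737, 212258],
    "2000": [2666, 80488, 213766],
})

# Melt to long form: the year column names become values of a new 'Year' variable,
# and the numbers move into a single 'Cases' column
df_long = df.melt(
    id_vars="Country",
    value_vars=["1999", "2000"],
    var_name="Year",
    value_name="Cases",
)

print(df_long)
```

The long dataframe has one row per country-year pair, so three countries and two years give six rows.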

Below, we will demonstrate this command with the Emissions Scenarios dataset.

Watch It: Video - Importing and Cleaning the Data for Visualization (5:14 minutes)

Click here for a transcript.

Hello. In this set of videos, we're going to be talking about how we can add second, third, and sometimes even fourth variables into our plots. But before we get into the visualization, I'm also going to take a video to show you how we can use the melt function to transform our data into a more plottable format.

So, here we are in a new Google Colab file for this part of the lesson. And in particular, we've got the libraries that we're using that we've used before, but we are now adding in NumPy, which we have nicknamed as np. So, let's get into the data. So, I've already mounted my Google Drive using option one from earlier. And I'm going to go ahead and read it in. And the data that we are working with is from emission scenarios. So, this data, actually going to cut that, cut that short to make it a little bit easier to type later on. And essentially what these emission scenarios are, they're published by the US, and they're saying what percent change that we might have in emissions based off of different scenarios. So, in 2005 we assumed 100 percent fossil fuel emissions. And then, when we do reference in 2035, those emissions drop because we're actually having negative emissions from zero carbon and solar, and we've got differences here based off of these different scenarios.

And now we could go ahead and plot this to see a variety of different plots, but it's very difficult to work with data sets that are, what we call, wide data sets. Meaning, that there's many columns that fall into different row-based categories, instead of being long and having a short amount of column-based categories. And so, in order to get this data set into what we call long form, we need to use the melt command. And so, I'm going to create a new data frame called emsc_melt. And I'm just going to type our original data frame dot melt.

And there are several things that go into melting the data set. The first is that we need to tell it which variables we want to maintain as they are now. And in this case, we want to maintain category. So, this is the column name. Up here, we want these variables to be the same in our new melted data set. What we want, though, is all of these numbers to be in a single column. And we want these row or column names to become their own categorical variable. And so, we need to provide what we will name our new variable, and in this case, this is whatever your column names currently are. So, we'll just name it source because that's the source of electricity up here. And then we also need to provide a value name, and this is whatever your actual numbers are. And so, we will just call that, emissions. And so, then if we print this, we can see that this data is now what we call long form. All the numbers are in a single column, and we've got these categories. Instead of being individual columns, they're now within a single column to specify. And this really helps us when we go to plot, which we will do in the upcoming videos.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

Below, we will use the melted dataframe to create a bar plot with three variables.

Watch It: Video - Bar Plots with Multiple Variables (4:54 minutes)

Click here for a transcript.

Welcome back. So, we're going to talk about how we can add a third variable to our bar plots today. So, in the previous bar plot video, we showed how to display a categorical variable with counts on the y-axis, and then a categorical variable with a quantitative variable on the y-axis. But if you wanted to add a third variable, you could do that through adding fills, which is what we will demonstrate here using the Plotnine or ggplot plotting tool.

So, we are here in the Google Colab file that we started in the previous video. We've got our libraries. We've got our data that has already been read in. And at the end of the previous video, we melted the data set. And this will become very important when we go to use ggplot to make this bar plot. And so, we can go ahead and just get started with our ggplot. So, similar to the previous videos, we start off with listing our data frame. And then we're still doing a bar plot, so we say geom_bar. And we still need an aes. And so, in the past we specified an x-axis, we'll say category. We specified a y-axis, emissions. And then we did stat equals identity to say we want the actual data; we don't want to do any calculation on it. And then we added the fill outside of the aes. And that's how we got a two-variable plot.

However, if you want to add a third variable, you need to do it within the aes statement. And so, here we say fill equals source. So, if we run this, we can see that now we've got emissions, we've got category, and we've got source. And it automatically creates this legend for us, so we don't need to worry about that. And we can see how it works. Now, this plot isn't the most descriptive, nor is it the most readable. And so, to show you a few other things that you can do within ggplot, we can first flip the coordinates with coord underscore flip. So, if we run that, we can see that now, because the categories are on the y-axis, we can actually read what they are, which makes it a lot nicer.

And then we can actually add some labels. So, we can say xlab equals emissions scenario. And note that I said x label. It needs to follow whatever you mapped your x value to in the aes statement. So, in this case, when I run this, the x label is actually on the y-axis, because I flipped those coordinates. But the program is able to track which value you coded as the x value, and it knows that's the value you mean when you do the label, instead of assuming that the label goes on the x-axis automatically. So, instead of tracking a specific axis at this point, we're tracking a variable. And the program knows where to put that label based off of the variable. And so, this is an example of how you can add a third, fill-based variable to your bar plots.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0 [1]

Try It: Apply Your Coding Skills in Google Colab

The Google Colab file used in the video is linked here [8].
Go to the Colab file and click "File" then "Save a copy in Drive". This will create a new Colab file that you can edit in your own Google Drive account.
Once you have it saved in your Drive, use the partial code below to create a multiple variable bar plot using plotnine/ggplot.

Note: You must be logged into your PSU Google Workspace in order to access the file.

from plotnine import * # import the library
  
# fill in the blank (...) to plot a bar plot with 'Category' on the y-axis, 'Emissions' on the x-axis, and 'Source' as the fill variable
ggplot(emsc_melt) + ...

Once you have implemented this code on your own, come back to this page to test your knowledge.

Assess It: Check Your Knowledge

Knowledge Check

Scatter Plots

Read It: Scatter Plots

Scatter plots can be used to plot two (or more) quantitative variables. Below, we will first demonstrate how to create a basic scatter plot. Then, we will show how to add a third categorical or quantitative variable to that plot.

Watch It: Video - Basic Scatter Plots (4:40 minutes)

Click here for a transcript.

In this video, we're going to talk about scatter plots. So, scatter plots are used between two quantitative variables to show how the two variables change together. And as in the previous videos, we will be using the plotnine, or ggplot, plotting tool. So, here we are in Google Colab. We're actually going to use a new data set for this particular video. In particular, we're using the Marcellus Shale data set, which I've already read in using the first method in our import lesson. And so, to give you an idea of what this data frame looks like, we can look at the first five rows. And so, here we've got all of these different columns; they've got a lot of metadata on different wells within the Marcellus Shale formation. But the things that we are going to be interested in are these two columns here, total gas and max gas. These are two quantitative variables that we will use to create scatter plots.

And so, because the data is already ready to go, there isn't much that we need to do in terms of cleaning. And so, I'm going to jump right in to the plots themselves. So, we're going to start off with a basic scatter plot. And again, this is quantitative versus quantitative. And again we're using ggplot, and we give it our data frame, which we call marc, for Marcellus. And here we have another geom for ggplot; this time we're going to give it geom underscore point. And so, this is how you specify a scatter plot, by having points.

And similar to previous plots, we need to have an aes statement. So, we can say x is just max gas and y is total gas. And so, for a scatter plot you do need to specify both x and y, and they both need to be quantitative. But we can see here from this plot we could make some assumptions about whether total gas increases with max gas. We can see that maybe there are some outliers over here, but it's really just a basic scatter plot. And so, what we can do to make this a little bit more descriptive is we can actually add a best fit line and a confidence interval.

And so, I'm going to just copy this from above because we're going to be building off of this existing plot. And right here I'm going to add a second plot, and this plot is going to be stat_smooth. And stat_smooth works very similarly to a geom, so we still need to say max gas and total gas, and then outside of the aes we need to specify the method, so this is how it's going to fit the best fit line. And in this case, lm is for linear regression. And if we wanted to, we can give it a color so that it looks different from our dots. And so, if we run this, we can see here that we now have this best fit line. And if you zoom in, you can see there's a slight confidence interval there that shows what the 95 percent confidence interval for this data set is.

Watch It: Video - Scatter Plots with Three Variables (6:04 minutes)

Click here for a transcript.

Similar to what we did with the bar plot, sometimes we want to add a third variable to our scatter plot. So I'm going to copy this down. And so, if that third variable is a category, we need to come into our aes, say color equals, and give it whatever variable. So, in this case, it's the type of drill used to create the well. And you'll notice that we're using color, whereas before we used fill. And a good rule of thumb is that if it's a point or a line, you want to use color, but if it's a polygon, you want to use fill. So we'll use fill with bar charts or histograms, but we'll use color with line plots or scatter plots.

And so, if we run this, because it's a categorical variable, we see this pre-populated legend that tells us what color each drill type is associated with. And you'll notice that we can't really see all of them, and that's because the plotting tool overlays dots on top of each other. So, we're actually hiding some of the dots with the H drill type, and so if we wanted to make that a little bit easier to see, we can use this argument called alpha, which sets the opacity of the dots.

And so, we can start to see a little bit more of those purple dots here, and we can see how we've sort of made these dots more see-through, and that's why we see some of this purple coming through. But again, it's not necessarily the most helpful way to see the differences between drill types. And so, another way to show these differences is to add multiple panels, one for each drill type. So again, I've just copied and pasted our previous plot down, and we're just going to keep adding to it. So, if we do a plus sign and then another plot, in this case our plot is called facet wrap. And so, the facet wrap allows us to specify a variable that will create panels. So, within quotes, we do tilde variable name. And then, we add to this scales equals free, which lets each panel fit its axes to its own data instead of sharing one scale across all of them.

And so, if we run this, we can now see that it's created a single plot for each of our drill types and still maintained the color, which helps. And we can even see how different the 95 percent confidence intervals are for our different data. The downside is that our labels are running into each other, and they're not necessarily matching. If you're interested, you can look into the documentation and figure out how to fix these, but as a demonstration, this is how you can show these different panels. So, for our final plot, we're going to come back up here and take our original basic scatter plot.

And so, until now, our third variable that we've been using has been a qualitative variable. But if you wanted to add a quantitative variable, the process is still the same. You still specify color within the aes, and then you give it whatever variable. Here, I'm going to use this perf interval length, which is a quantitative variable that they calculate in hydraulic fracturing.

And so, we can color it based off of that and go ahead and run it. And you'll notice that the main difference here is that we now get a continuous legend, instead of individual categories. And so, we can see where the majority of these low length values are and where some of the high ones are. You will notice there are some gray points in here. This means that these particular data points had values for max gas and total gas, but they had n/a values for the perf interval length. And so, it still plots those, it just doesn't give them a color. So you can still see all of the data, just colored where they have a value. And so, this is how you can add a third variable to your scatter plots, both continuous and, as we showed earlier, as individual categories.

Try It: Apply Your Coding Skills in Google Colab

The Google Colab file used in the video is linked here [9].
Go to the Colab file and click "File" then "Save a copy in Drive". This will create a new Colab file that you can edit in your own Google Drive account.
Once you have it saved in your Drive, use the partial code below to create scatter plots using plotnine/ggplot.

Note: You must be logged into your PSU Google Workspace in order to access the file.

from plotnine import * # import the library
  
# fill in the blank (...) to plot a scatter plot with 'MaxGas' on the x-axis and 'TotalGas' on the y-axis
ggplot(marc) + ...
  
# fill in the blank (...) to add a best fit line to your scatter plot from above
ggplot(marc) + ... + stat_smooth(...)
 
# fill in the blank (...) to add a third variable, 'PerfIntervalLength', to your plot
ggplot(marc) + ... + stat_smooth(...)

Once you have implemented this code on your own, come back to this page to test your knowledge.

Assess It: Check Your Knowledge

Knowledge Check

Working with Time Series

Read It: Working with Time Series

Working with time series requires us to work with DateTime objects, which is a special data type in python. Below, we will present three videos that walk you through various ways to plot time series, as well as how to create and work with DateTime objects.

Watch It: Video - Basic Line Plots (2:56 minutes)

Click here for a transcript.

Watch It: Video - Creating and Working with DateTime Objects (7:01 minutes)

Click here for a transcript.

Where we wanted to plot, maybe we just want to focus on September 8th. And so, before we can get into that, we need to create a new column. So, if we create new columns in data frames, we just need to specify it within quotes and square brackets, and in particular, we're going to use pd.to_datetime, and you can see it's telling me to click it. And what this does is, it takes the dates and it takes the times and combines them into a single date-time object. And so, in order to do that, we need to provide it what our date is. And then we tell it quote space quote, and then we tell it what our time is. And that space just allows the date-time object to not have the time running against the date, but to space it out.

And so, if we look at the top five rows, we can now see that we've got this date-time object, which is a combination of the date and the time. Additionally, if we look at the data types of our new data set, we can now see that date time is this special type of data called a date-time object, which means that we can use it in Boolean or conditional statements to find dates that are greater than or less than our date of interest.

And so, to go ahead and get that data, we can create a new variable called solar subset in which we want the solar date time that is greater than 2021 September 08. And we want similar data in which the date time is less than the ninth. And so what this will do is, it will extract the data that is after September 8th at midnight but before September 9th at midnight, so we'll effectively get all 24 hours of September 8th. And then, we don't really need all of the subset. So I'm going to extract a few columns. So, we still want all the rows, but, well, to show you, we can really just extract date time and produced kilowatt hour. I forgot this print statement here. And so, that is how we could extract just the date time and just the produced kilowatt hour. But it's not necessary because we can always just call out specific values within our ggplot.

So, this is what our new data set looks like, and we can see that now it's just September 8th. So then we can create our line plot. So we can say ggplot with the subset of data, because we're just plotting the eighth at this point, and we can say geom_line, aes, x equals date time as before and y equals produced kilowatt hour. And so, here we can see that this is looking a lot better. We can see that this is the early hours of September 8th. This is the late hours. We can see that it spiked sometime in the middle of the day, as solar power is likely to do. But we can improve how this looks. I'm just going to copy it, paste it down here, and we can use the theme command as we did with the box plots early on to actually change this axis.

So, we can say axis_text_x, and before we said element_blank because we were removing it, but in this case we're changing it. So, we say element underscore text, and we tell it what we're changing, and so we can just change the angle to 90. And so, now we can see that this is a little bit more readable. And we can see that this spike happens, but it hasn't attached the time to it, so it's not terribly useful because it's cut off that little bit, but we could assume that if the time was here it would be an even better representation.

Watch It: Video - Using Statistics Within a Time Series Plot (5:29 minutes)

Click here for a transcript.

So, this is the actual production over a single day. But something that we'll often want to do is to plot averages or some other statistic. And so in this last bit we will go over how to plot the average hourly production. And to do that I'm going to create another new variable called solar hour, which is equal to the date time, but just the hour value. And one of the benefits of date-time objects is that you can use a special set of commands that extract bits of time or dates for you without having to do some sort of complicated read of the string to figure out what that first number is. So now we can see the hour is over here.

And so we can go ahead and plot this. Say solar again, do a geom_line, aes, and here say x equals hour and y is still produced kilowatt hours. And so this is showing, essentially, the total kilowatt hours produced over each hour in a day across the entire year. So we haven't gotten to the average yet, which is our goal. But to do that, we need to use a special plotting tool within ggplot. So we still tell it which data to use, but now instead of saying geom we say stat_summary, and then it still sort of works the same way. We still have an aes statement in which we have our x and y that we've been using. But outside of that aes statement, we need to specify the geom. And so we've sort of just done things a little bit backwards. If you remember, with the bar plot we specified geom_bar and then stat inside that, but in order to work with the line plot, we need to specify stat outside and line inside. And then we also need to give it the function that we're applying to the y-axis, for which we are going to use np.mean.

And so, now we can see that this looks a lot more like we would expect. We can see the average kilowatt hour produced. It's less than three across each hour of the day, given all of the data in 2021. So we can see that there was a spike just after six, but that by and large, on average, three o'clock tends to be the best time to generate solar power here in central Pennsylvania. If we wanted to show a confidence interval on this, I'm just going to copy this and paste it down here. We can add a second stat_summary, maintain the same aes values, but our geom then can be ribbon, and our function isn't just on the y-axis, now it's on the data. And it's mean_cl_boot, that is, mean confidence level bootstrapping, which we will get into in the next lesson. But for now this is the way that we can get a 95 percent confidence interval, and I'm also going to specify alpha so that we can see the line through the ribbon. And so, now we can see how this ribbon will follow the data. And so, this is a way that you can work with time series and create line plots that have different functionality, whether that's showing the actual data or the average or the average plus a confidence interval.

Try It: Apply Your Coding Skills in Google Colab

The Google Colab file used in the video is linked here [10].
Go to the Colab file and click "File" then "Save a copy in Drive". This will create a new Colab file that you can edit in your own Google Drive account.
Once you have it saved in your Drive, use the partial code below to create line plots using plotnine/ggplot.

Note: You must be logged into your PSU Google Workspace in order to access the file.

from plotnine import * # import the library
import numpy as np # needed for np.mean in the last plot
  
# fill in the blank (...) to plot a line plot with 'Date' on the x-axis and 'Produced (kWh)' on the y-axis
ggplot(solar) + ...
 
# fill in the blank (...) to subset all times > '2021-09-08' and < '2021-09-09'
solar_subset = solar[(solar['DateTime'] > ...) ... (solar['DateTime'] < ...)]
 
# fill in the blank (...) to plot a line plot with 'DateTime' on the x-axis and 'Produced (kWh)' on the y-axis
ggplot(solar_subset) + ... 
 
# fill in the blank (...) to plot the average hourly 'Produced (kWh)' with a 95% confidence interval ribbon
(ggplot(solar) +
 stat_summary(..., geom = ..., fun_y = np.mean) +
 stat_summary(..., geom = ..., fun_data = 'mean_cl_boot', alpha = 0.5)
 )

Once you have implemented this code on your own, come back to this page to test your knowledge.

Assess It: Check Your Knowledge

Knowledge Check

Summary and Final Tasks

Summary

Through this lesson, you are now able to create visualizations in Python and differentiate between different types and purposes of various graphs. You plotted bar charts, box plots, scatter plots, and time series data. You will continue to grow and use these visualization skills throughout the course. Additionally, you learned how to work with time series data, which we will come back to later in the course.

Assess It: Check Your Knowledge Quiz

Reminder - Complete all of the Lesson 3 tasks!

You have reached the end of Lesson 3! Double-check the to-do list on the Lesson 3 Overview page to make sure you have completed all of the activities listed there before you begin Lesson 4.
