EME 210
Data Analytics for Energy Systems

Scatter Plots


 Read It: Scatter Plots

Scatter plots can be used to plot two (or more) quantitative variables. Below, we will first demonstrate how to create a basic scatter plot. Then, we will show how to add a third categorical or quantitative variable to that plot.

 Watch It: Video - Basic Scatter Plots (4:40 minutes)


In this video, we're going to talk about scatter plots. Scatter plots are used with two quantitative variables to show how the two variables change together. And as in the previous videos, we will be using the plotnine, or ggplot, plotting tool. So, here we are in Google Colab. We're going to use a new data set for this video: the Marcellus Shale data set, which I've already read in using the first method in our import lesson. To give you an idea of what this data frame looks like, we can look at the first five rows. Here we've got all of these different columns, which hold a lot of metadata on different wells within the Marcellus Shale formation. But the things that we are interested in are these two columns here, total gas and max gas. These are two quantitative variables that we will use to create scatter plots.

And so, because the data is already ready to go, there isn't much that we need to do in terms of cleaning, so I'm going to jump right in to the plots themselves. We're going to start off with a basic scatter plot. And again, this is quantitative versus quantitative. Again we're using ggplot, and we give it our data frame, which we call marc, for Marcellus. And here we have another geom for ggplot: this time we give it geom underscore point (geom_point). This is how you specify a scatter plot, by having points.

And similar to previous plots, we need to have an aes statement. So, we can say x is max gas and y is total gas. For a scatter plot you do need to specify both x and y, and they both need to be quantitative. From this plot we could make some assumptions about whether total gas increases with max gas. We can see that maybe there are some outliers over here, but it's really just a basic scatter plot. What we can do to make this a little more descriptive is add a best fit line and a confidence interval.

And so, I'm going to just copy this from above because we're going to be building off of this existing plot. Right here I'm going to add a second layer, and this layer is going to be stat_smooth. stat_smooth works very similarly to a geom, so we still need to say max gas and total gas in the aes, and then outside of the aes we need to specify the method, which is how it's going to fit the best fit line. In this case, lm is for linear regression. And if we want to, we can give it a color so that it looks different from our dots. If we run this, we can see that we now have this best fit line. And if you zoom in, you can see there's a narrow band around it that shows the 95 percent confidence interval for the fit.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

  Watch It: Video - Scatter Plots with Three Variables (6:04 minutes)


Similar to what we did with the bar plot, sometimes we want to add a third variable to our scatter plot. So I'm going to copy this down. If that third variable is a category, we need to come into our aes, say color equals, and give it whatever variable. In this case, it's the type of drill used to create the well. You'll notice that we're using color, whereas before we used fill. A good rule of thumb is that if it's a point or a line, you want to use color, but if it's a polygon, you want to use fill. So we'll use fill with bar charts or histograms, but we'll use color with line plots or scatter plots.

And so, if we run this, because it's a categorical variable, we see this pre-populated legend that tells us what color each drill type is associated with. You'll notice that we can't really see all of them, and that's because the plotting tool overlays dots on top of each other. So, we're actually hiding some of the dots with the H drill type. If we want to make that a little easier to see, we can use an argument called alpha, which sets the opacity of the dots.

And so, we can start to see a little bit more of those purple dots here. We've made these dots more see-through, and that's why some of this purple is coming through. But again, it's not necessarily the most helpful way to see the differences between drill types. Another way to show these differences is to add multiple panels, one for each drill type. So again, I've just copied and pasted our previous plot down, and we're going to keep adding to it. We do a plus sign and then another component, which in this case is called facet_wrap. facet_wrap allows us to specify a variable that will create panels. So, within quotes, we write a tilde followed by the variable name. Then we add scales equals "free", which lets each panel fit its own axis ranges instead of sharing one scale across all panels.

And so, if we run this, we can now see that it has created a separate panel for each of our drill types and still maintains the color, which helps. We can even see how different the 95 percent confidence intervals are for our different drill types. The downside is that our axis labels are running into each other, and the scales don't necessarily match from panel to panel. If you're interested, you can look into the documentation and figure out how to fix these, but as a demonstration, this is how you can show these different panels. So, for our final plot, we're going to come back up here and take our original basic scatter plot.

And so, until now, the third variable that we've been using has been a qualitative variable. But if you want to add a quantitative variable, the process is still the same. You still specify color within the aes, and then you give it whatever variable. Here, I'm going to use perf interval length, which is a quantitative variable that they calculate in hydraulic fracturing.

And so, we can color it based on that and go ahead and run it. You'll notice that the main difference here is that we now get a continuous legend, instead of individual categories. We can see where the majority of the low-length values are and where some of the high ones are. You will also notice there are some gray points in here. This means that these particular data points had values for max gas and total gas, but they had NA values for the perf interval length. It still plots those; it just doesn't give them a color. So you can still see all of the data, colored wherever a value exists. And so, this is how you can add a third variable to your scatter plots, both as a continuous variable and, as we showed earlier, as individual categories.

Credit: © Penn State is licensed under CC BY-NC-SA 4.0

 Try It: Apply Your Coding Skills in Google Colab

  1. The Google Colab file used in the video is linked here.
  2. Go to the Colab file and click "File", then "Save a copy in Drive". This will create a new Colab file that you can edit in your own Google Drive account.
  3. Once you have it saved in your Drive, use the partial code below to create scatter plots using plotnine/ggplot. 

Note: You must be logged into your PSU Google Workspace in order to access the file. 

from plotnine import * # import the library
 
# fill in the blank (...) to plot a scatter plot with 'MaxGas' on the x-axis and 'TotalGas' on the y-axis
ggplot(marc) + ...
 
# fill in the blank (...) to add a best fit line to your scatter plot from above
ggplot(marc) + ... + stat_smooth(...)

# fill in the blank (...) to add a third variable, 'PerfIntervalLength', to your plot
ggplot(marc) + ... + stat_smooth(...)

Once you have implemented this code on your own, come back to this page to test your knowledge. 


 Assess It: Check Your Knowledge

Knowledge Check