Bar Plots: One Variable
Read It: Bar Plots
In this lesson, we are going to demonstrate how to create bar plots using ggplot/Plotnine. First, however, we will need to prepare the dataset for plotting. This will involve importing the data, removing columns, and renaming columns.
Watch It: Video - Importing and Cleaning the Data (9:03 minutes)
Hello. In this lesson we are going to talk about visualization and how we can use Python to create graphs to aid our statistical inference and data analysis. Before we get into the actual visualization, however, we're going to import and clean some of the data that we will be using later on in the lesson. So, with that, we can go ahead and get started.
All right, so we've got our code here. We've got three libraries that we will be using within this demonstration. Throughout the lesson we'll use our Google Colab library to import the data. We'll use Pandas to read in a CSV file, and we'll use Plotnine to actually do the plotting. And I want to point out that we are using this asterisk to tell Python to import all the commands from Plotnine without the need for a nickname. So, this is one of the few libraries that we'll use throughout the course in which you don't need to give it a nickname; you can simply import all the commands as is.
So, with that let's go ahead and get started. I'm using the first option from lesson one to import our data, the mount Google Drive option, and I've already got it set up here. I also commented out my file path for easy use. And so, I will go ahead and read this data in. We'll just call it survey, and we'll say pd dot read csv and just give it the file path. I've looked into this file before, and we know that there is no extra header or anything, so no need to skip rows. But to give you an idea about what this looks like, this is a survey that was collected from students that took this course in University Park, during the Spring of 2023. And we can see that they have answered all these questions. But we can also see that there's a number of unnecessary columns, some metadata, all of this that we want to remove in order to make it usable in the future. And so, the first thing I'm going to do is print the names of the columns to the screen. And this is a good practice to get into because it very quickly allows us to see what columns there are and possibly which columns then need to be removed.
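The read-and-inspect step described above can be sketched as follows. This is a minimal illustration, not the code from the video: the file path used in the video is not shown, so an in-memory string stands in for the CSV file, and every column name in it is a hypothetical placeholder for the real survey export.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey CSV; in Colab you would pass the
# file path on your mounted Google Drive to pd.read_csv instead.
csv_text = """section,section_id,What is your major?,1.0,How many credits are you taking?,1.0.1
1,12345,PNGE,1.0,15,1.0
1,12345,EBF,1.0,12,1.0
"""

# Read the data; there is no extra header, so no rows need to be skipped.
survey = pd.read_csv(io.StringIO(csv_text))

# Print the column names to see quickly which columns could be removed.
print(list(survey.columns))
```

Printing the column names as a list keeps the output compact when the survey has many columns.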
And so, we can see that there's all of this sort of metadata about which section the students were in. This was a Canvas-based survey, so it's all of that stuff. There are also these numbers here that aren't attached to any data at all that we need to remove. And so, we'll start in on step three of our data cleaning, which is to remove some of these unnecessary columns.
And so, the first thing that I'm going to do is override my existing data frame. I'm going to say “survey”, which is what I call my data, dot drop, and this will drop any columns that we tell it to. But we need to tell it which columns, and we give it those columns in square brackets. And so, for this command I'm going to come back up here and copy these top columns here, as well as these bottom ones here. And you'll notice that I have not removed any of the number columns, and that's because there are so many of them that I'm actually going to do a secondary removal that speeds up the process. So, I'll go ahead and run this command. So, we've now dropped these columns. And so, if we come in here and say survey dot columns, and run that… there is an error, because I've already dropped those columns. So, when I go to rerun the drop, it doesn't like it. So, this is something that you might come across in your own homework and coding assignments: when you try to rerun something that you already did, sometimes it can't find what you're asking for. Maybe you've removed a row that no longer exists and now you're trying to call it. In this case, I've already removed the columns, so they no longer exist and it can't drop them again. But if we go through here and just type it in another code block, we can see that the columns are very similar to up here, but we no longer have these section-based columns or these incorrect- and score-based columns.
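The drop step, and the error you get when the cell is run a second time, can be reproduced on a small example. This is only a sketch: the data frame and its column names are hypothetical, and the `errors="ignore"` option at the end is an addition that is not used in the video.

```python
import pandas as pd

# Hypothetical miniature of the survey data (column names are placeholders).
survey = pd.DataFrame({
    "section": [1, 1],
    "section_id": [12345, 12345],
    "What is your major?": ["PNGE", "EBF"],
    "1.0": [1.0, 1.0],
})

# Override the data frame with a copy that no longer has the metadata columns.
survey = survey.drop(columns=["section", "section_id"])
print(list(survey.columns))

# Running the same drop again raises a KeyError: the columns are already gone.
try:
    survey.drop(columns=["section", "section_id"])
    rerun_failed = False
except KeyError:
    rerun_failed = True
print("Second drop failed:", rerun_failed)

# Optional: errors="ignore" skips missing columns, so the cell can be re-run.
survey = survey.drop(columns=["section", "section_id"], errors="ignore")
```

If you hit this error in your own notebook, re-running the cell that reads the CSV restores the original columns.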
Okay. Next, we want to remove these, you know, random number columns, 1.0, 1.0.1, and so on. To do that, we're going to do a similar process, but we're going to tell Python to handle all of the columns that start with one simultaneously. So, to do that, we say survey dot loc, give a colon to say all of the rows, and then we do tilde survey dot columns dot str dot startswith one. The tilde means "not", so we keep only the columns that do not start with one, and this will actually drop every column whose name is a string that starts with one. So, if we run this (oops, typo there) and print the columns at the end, we can now see that we've got just the questions, which is exactly what we want.
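As a rough sketch of this pattern-based removal, the example below builds a boolean mask over the column names and negates it with the tilde. The data frame and its question-style column names are hypothetical; only the `1.0` and `1.0.1` style names come from the transcript.

```python
import pandas as pd

# Hypothetical data with leftover numbered columns from the export.
survey = pd.DataFrame({
    "What is your major?": ["PNGE", "EBF"],
    "1.0": [1.0, 1.0],
    "How many credits are you taking?": [15, 12],
    "1.0.1": [1.0, 1.0],
})

# True for every column whose name starts with the character "1".
starts_with_one = survey.columns.str.startswith("1")

# The colon selects all rows; the tilde flips the mask so that only the
# columns that do NOT start with "1" are kept.
survey = survey.loc[:, ~starts_with_one]
print(list(survey.columns))
```

This relies on every column name being a string. It would also remove any real question whose text happens to begin with "1", so it is worth printing the columns afterwards to check.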
However, it's not exactly where we would like it to be. We need to rename these columns. And so, to do that I'm going to once again override my original data and say survey dot rename, and then we say columns and then we open curly brackets. And now the basic way that we do this is that we come up here and take the first column name that we want to rename, put it in quotes, and then outside of that quote, put a colon followed by whatever we want to rename it to. So here, we'll rename it to home. And so, this will do the first column; then comma, and enter down to do the second column. And so, we can retype all of these, but I am going to just grab it from off screen and copy and paste it here, and that just makes it a little faster, so you aren't watching me type all of these. But it all follows the same basic formula: original column name, colon, new column name, comma, next one. So, we can run that, and then again, if we look at our column names, we can say survey dot columns and once again see that now all of these are easier to work with, which will ultimately help us when we're plotting, which we’ll return to in the next video.
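The renaming formula (original name, colon, new name, comma, next one) might look like the sketch below. The original question texts are hypothetical placeholders, and the capitalized `Home` is an assumption chosen to match the `Major` and `Credits` names used in the exercise later on this page.

```python
import pandas as pd

# Hypothetical question-style column names left after the cleaning steps.
survey = pd.DataFrame({
    "Where is your hometown?": ["State College", "Pittsburgh"],
    "What is your major?": ["PNGE", "EBF"],
    "How many credits are you taking?": [15, 12],
})

# Each entry maps an original column name to its shorter replacement.
survey = survey.rename(columns={
    "Where is your hometown?": "Home",
    "What is your major?": "Major",
    "How many credits are you taking?": "Credits",
})
print(list(survey.columns))
```

Note that `rename` silently skips any key that does not match an existing column, so a typo in an original name leaves that column unchanged rather than raising an error.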
Watch It: Video - Creating Bar Plots with One Variable (5:45 minutes)
Below we will create two bar plots. The first will show a categorical variable with counts. The second will show a quantitative variable separated by category.
Hello, and welcome back to the video series. For this lesson, we're going to go ahead and dive into how we can visualize data and create plots using Plotnine and ggplot in Python. And we're going to demonstrate this through bar charts today. So, here we are in Google Colab. This is the same file that we were working with to import and clean the data. So, we've got our libraries here. We've got our file read in and imported, and then we did some cleaning to ultimately remove the unneeded columns and then rename the remaining ones. So, with that, let's go ahead and dive in.
So generally, bar charts are going to be used to display some sort of categorical variable, and here we're going to demonstrate how to display counts for that. So, this is in effect similar to a frequency table, but just in graphical form. And so, we're working with ggplot, and for most of the ggplot lectures I'm actually going to wrap the whole command within a set of parentheses. And this just lets me enter down in between different parts of the ggplot command and makes it a little easier to read. So, with any ggplot figure we start off with the ggplot command and we give it whatever our data frame is called. Here, I call it survey. And then we add in the geom, and for bar charts that geom is known as geom bar. And the final piece that we always need is the aes, or aesthetics, mapping. And in this case, all we need to do is specify a categorical variable for our x-axis.
But we're not quite done. If we go outside of the aes parentheses, we need to set the stat equal to count. So, this tells Python what statistical measure, what calculation, you want to show on the y-axis. And then for demonstration purposes we can add a fill color, again outside of that aes statement. And so, we can run this, and we can see how the count by major looks. And so, when we took this particular survey, we had the majority of students from the PNGE major with a few from Energy Engineering and EBF, while the other majors were very small groups. But this is a primary way to use a bar plot: showing a categorical variable with counts on the y-axis.
Occasionally you will want to show a quantitative variable on the y-axis that isn't just the count of how many were in a category. Maybe you want to see how many credit hours were being taken by all the students within different majors. And for that, we can also do a bar plot, but we need to change it just a little bit. And so, we'll start off again with these open and closed parentheses, and then our ggplot command for survey. And then we're still going to do a bar chart, so we still say geom_bar and then we have aes. And so, we have our categorical variable on the x-axis. And now, we left it at that for the above plot, but for this particular plot we are going to add a y-axis with credits. So, this is a quantitative variable that will now show up on our y-axis, but once again we still need to add a stat regardless of what type of bar plot we're doing. And in this plot, we're going to say stat equals identity. And this means: use the data as you see it. Don't do any calculation on it. Just identify what those values are and plot them.
And then we'll also add a fill. And in this case, I'm going to show you a slightly different way to set colors, and that is through these hex codes. So, these are six-character codes that are a mix of numbers and letters, and each one corresponds to a specific color. And so, if you're ever interested in using a certain color palette, you can find these hex codes fairly easily online. This particular hex code is for a very dark blue that is almost black. But here, we see our plot. We've got our categories, our majors, on the x-axis, we've got credits, our numerical variable, on the y-axis, and we can see that it has plotted those variables. And it's actually plotted them as a total. And so, you can imagine that the credit hours for each individual student have been stacked on top of each other to give a total value for each major.
Try It: Apply Your Coding Skills in Google Colab
- Open the Google Colab file used in the video.
- Go to the Colab file and click "File", then "Save a copy in Drive". This will create a new Colab file that you can edit in your own Google Drive account.
- Once you have it saved in your Drive, plot two bar plots by following the outline below:
Note: You must be logged into your PSU Google Workspace in order to access the file.
    from plotnine import *  # import the library

    # fill in the blank (...) to plot a bar chart with 'Major' on the x-axis and the count on the y-axis
    ggplot(survey) + ...

    # fill in the blank (...) to plot a bar chart with 'Major' on the x-axis and 'Credits' on the y-axis
    ggplot(survey) + ...
Once you have implemented this code on your own, come back to this page to test your knowledge.