Hello. In this lesson we are going to talk about visualization and how we can use Python to create graphs to aid our statistical inference and data analysis. Before we get into the actual visualization, however, we're going to import and clean some of the data that we will be using later on in the lesson. So, with that, we can go ahead and get started.
All right, so we've got our code here. We've got three libraries that we will be using within this demonstration and throughout the lesson. We'll use the Google Colab library to import the data, we'll use pandas to read in a CSV file, and we'll use plotnine to actually do the plotting. And I want to point out that we are using this asterisk to tell Python to import all the commands from plotnine without the need for a nickname. So, this is one of the few libraries that we'll use throughout the course in which you don't need to give it a nickname; you can simply import all the commands as is.
So, with that let's go ahead and get started. I'm using the first option from lesson one to import our data, the mount Google Drive option, and I've already got it set up here. I also commented out my file path for easy use. And so, I will go ahead and read this data in. We'll just call it survey, and we'll say pd.read_csv and just give it the file path. I've looked into this file before, and I know that there is no extra header or anything, so no need to skip rows. But to give you an idea about what this looks like, this is a survey that was collected from students who took this course at University Park during the spring of 2023. And we can see that they have answered all these questions. But we can also see that there are a number of unnecessary columns, some metadata, all of which we want to remove in order to make the data usable in the future. And so, the first thing I'm going to do is print the names of the columns to the screen. And this is a good practice to get into because it very quickly allows us to see what columns there are and which columns might need to be removed.
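As a rough sketch of the read-and-inspect step just described: the real file path and the survey's actual column names are not given in this transcript, so the snippet below uses a small in-memory CSV with made-up (hypothetical) columns in place of the file on Google Drive.

```python
import io
import pandas as pd

# Hypothetical stand-in for the survey file. The column names and values
# here are invented for illustration; they are not the real survey data.
csv_text = io.StringIO(
    "section,section_id,Where is your home?,How many siblings do you have?,1.0,1.0.1\n"
    "001,1234,Pennsylvania,2,,\n"
    "002,5678,New Jersey,0,,\n"
)

# With a real file this would be pd.read_csv(file_path).
# No skiprows argument is needed because there is no extra header.
survey = pd.read_csv(csv_text)

# Print the column names to see which ones may need to be removed.
print(survey.columns)
```

Printing the columns right after reading the file is a quick way to spot metadata columns before doing any cleaning.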
And so, we can see that there is all of this metadata about which section the students were in. This was a Canvas-based survey, so it's all of that stuff. There are also these numbers here that aren't attached to any data at all that we need to remove. And so, we'll start into step three of our data cleaning, which is to remove some of these unnecessary columns.
And so, the first thing that I'm going to do is overwrite my existing data frame, and I'm going to say survey, which is what I called my data, dot drop, and this will drop any columns that we tell it to. But we need to tell it which columns, and we give it those columns in square brackets. And so, for this command I'm going to come back up here. I'm going to copy these top columns here, as well as these bottom ones here. And you'll notice that I have not removed any of the number columns, and that's because there are so many of them that I'm actually going to do a secondary removal that sort of speeds up the process. So, I'll go ahead and run this command. So, we've now dropped these columns. And so, if we come in here and say survey.columns, and run that… and there is an error, because I've already dropped those columns. So, when I go to rerun the drop, it doesn't like it. So, this is something that you might come across in your own homework, in your own coding assignments: when you try to rerun something that you already did, sometimes it can't find what you're asking for. Maybe you've removed a row, so it no longer exists, and now you're trying to call it. In this case I've already removed the columns, so they don't exist anymore and it can't drop them. But if we go through here and just type it in another code block, we can see that the columns are very similar to up here, but we no longer have these section-based columns, or these incorrect and score-based columns.
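Here is a minimal sketch of the drop step and of the rerun error. The column names are hypothetical, and the lesson does not show whether the columns were passed through the columns argument or some other way, so treat the exact call as an assumption. The errors="ignore" option at the end is one way to make the cell safe to rerun; it is not something used in the video.

```python
import pandas as pd

# Hypothetical data frame standing in for the survey (invented columns).
survey = pd.DataFrame({
    "section": [1, 2],
    "section_id": [1234, 5678],
    "Where is your home?": ["Pennsylvania", "New Jersey"],
    "1.0": [None, None],
})

# Columns to remove, given as a list in square brackets.
cols_to_drop = ["section", "section_id"]

# Overwrite the data frame with a copy that no longer has those columns.
survey = survey.drop(columns=cols_to_drop)

# Running the same drop again fails, because the columns are already gone.
try:
    survey.drop(columns=cols_to_drop)
    rerun_failed = False
except KeyError:
    rerun_failed = True

# Assumption: errors="ignore" skips labels that are not found,
# so this line can be run repeatedly without an error.
survey = survey.drop(columns=cols_to_drop, errors="ignore")

print(survey.columns)
```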
Okay. Next we want to remove these, you know, random number columns, 1.0, 1.0.1, and so on. To do that we're going to do a similar process, but we're going to tell Python to handle all of the columns that start with one simultaneously. So, to do that, we say survey.loc, give a colon to say all of the rows, and then we do tilde survey.columns.str.startswith with "1" in quotes. The startswith part flags every column whose name starts with the character 1, and the tilde flips that, so we keep only the columns that do not start with 1. In effect, this drops every column whose name starts with one. So, if we run this, and then our printing (oops, typo there), and we print the columns at the end, we can now see that we've got just the questions, which is exactly what we want.
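The column filter described here can be sketched as follows, again with a small hypothetical data frame rather than the real survey. The intermediate variable named keep is only there to make the two steps visible.

```python
import pandas as pd

# Hypothetical data frame with two question columns and two number columns.
survey = pd.DataFrame({
    "Where is your home?": ["Pennsylvania", "New Jersey"],
    "How many siblings do you have?": [2, 0],
    "1.0": [None, None],
    "1.0.1": [None, None],
})

# True for columns whose name starts with "1"; the tilde flips it,
# so keep is True only for the columns we want to retain.
keep = ~survey.columns.str.startswith("1")

# The colon selects all rows; keep selects the columns.
survey = survey.loc[:, keep]

print(survey.columns)
```

Note that this would also remove a real question whose name happened to start with the character 1, so it is worth printing the columns afterward to check.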
However, it's not exactly where we would like it to be. We need to rename these columns. And so, to do that I'm going to once again overwrite my original data and say survey.rename, and then we say columns equals and then we open curly brackets. And now the basic way that we do this is that we come up here and take the first column name that we want to rename, put it in quotes, and then outside of that quote, put a colon and whatever we want to rename it to, also in quotes. So here, we'll rename it to home. And so, this will do the first column, then comma, and enter down to do the second column. And so, we could retype all of these, but I am going to just grab it from off screen and copy and paste it here, and that just makes it a little faster. So, you aren't watching me type all of these, but it all follows the same basic formula: original column name, colon, new column name, comma, next one. So, we can run that, and then again if we look at our column names, we can say survey.columns and once again see that now all of these are easier to work with, and that will ultimately help us when we're plotting, which we’ll return to in the next video.
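The renaming pattern can be sketched like this. The original question text and the second new name (siblings) are hypothetical; only the new name home comes from the lesson.

```python
import pandas as pd

# Hypothetical data frame with long question-style column names.
survey = pd.DataFrame({
    "Where is your home?": ["Pennsylvania", "New Jersey"],
    "How many siblings do you have?": [2, 0],
})

# Each entry follows the pattern "original name": "new name".
survey = survey.rename(columns={
    "Where is your home?": "home",
    "How many siblings do you have?": "siblings",
})

print(survey.columns)
```

Short names without spaces or punctuation are easier to type later, for example when referring to a column inside a plotting call.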