EME 210
Data Analytics for Energy Systems

Supervised Learning: Training Sets and Validation Sets


Read It: Supervised Learning: Training Sets and Validation Sets


[00:26:27.59] Any questions, guys? OK, so let's get into some of the details here. So that's the overall workflow. What do I mean by some of these things? Let's dig down.  

[00:26:38.69] All right, so suppose this is our data set. And generalizing here, we have some y's. These are our labels. And some x's-- these are our features. And so, first step is we randomly select some of these cases, some of the rows, to be our training set. And the remaining ones will be our validation set.  
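
The random row selection described above can be sketched with pandas. This is only an illustration under assumptions: the column names (`y`, `x1`, `x2`), the toy values, and the 80% fraction are hypothetical and not taken from the course notebook.

```python
import pandas as pd

# Hypothetical toy data set: one label column (y) and two feature columns (x1, x2).
data = pd.DataFrame({
    "y":  [3.1, 4.0, 5.2, 6.1, 6.9, 8.2, 9.0, 9.8, 11.1, 12.0],
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "x2": [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0],
})

# Randomly select 80% of the rows as the training set.
# random_state fixes the random draw so the split is repeatable.
train = data.sample(frac=0.8, random_state=42)

# The remaining rows (those not drawn above) form the validation set.
validation = data.drop(train.index)

print(len(train), "training rows;", len(validation), "validation rows")
```

Dropping the sampled index from the full DataFrame guarantees that no row appears in both sets.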

[00:27:05.21] Now, just a few more words about these. Before, I said it's good to have more data in the training set than in the validation set. A typical split is 80/20. If you are having trouble getting the testing error down, if it seems like you can't get a model that fits very well, you might want to have more training data. So have maybe 90% of your data be training data, and then only validate on 10%.  

[00:27:36.10] Or if you think that you're overfitting, if you're fitting on too much data and you have a really complex model, you might want to give it less data. So, you might want to have maybe 70% or 60% training data. But you don't want to go to extremes. You don't want to have all but one data point be in your training set, and then you're only predicting on one validation data point.  
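
To see what the different split ratios mean in terms of row counts, here is a small sketch using only the Python standard library. The helper name `split_indices`, the data set size of 100 rows, and the fixed seed are assumptions made for illustration.

```python
import random

def split_indices(n_rows, train_fraction, seed=0):
    """Randomly split row indices 0..n_rows-1 into training and validation lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # shuffle so the selection is random
    n_train = round(n_rows * train_fraction)  # number of rows used for training
    return indices[:n_train], indices[n_train:]

n_rows = 100  # hypothetical data set size
split_sizes = {}
for train_fraction in (0.9, 0.8, 0.7, 0.6):
    train_idx, val_idx = split_indices(n_rows, train_fraction)
    split_sizes[train_fraction] = (len(train_idx), len(val_idx))
    print(f"{train_fraction:.0%} training: {len(train_idx)} train, {len(val_idx)} validation")
```

Changing `train_fraction` is the only adjustment needed to move between the 90/10, 80/20, 70/30, and 60/40 splits mentioned above.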

[00:28:00.55] There are some circumstances in which you can do that, but as a general principle, it's not good to do that. If you do do that, you do it over and over and over again.  

[00:28:16.02] So you hold one point out. You use all your data but one point for training, and you have that one point for validation. But then you repeat the whole thing and use a different data point for validation. You do that many, many times over, and that's it. That's a valid way to go about it. But you wouldn't just do it once.  
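
The repeated hold-one-out procedure described here is commonly called leave-one-out cross-validation. Below is a minimal sketch of it in plain Python. The toy data, the choice of a straight-line model, and the helper `fit_line` are all assumptions for illustration; the lecture does not specify a model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical toy data: one feature (x) and one label (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

squared_errors = []
for i in range(len(x)):
    # Train on every point except point i ...
    x_train = x[:i] + x[i + 1:]
    y_train = y[:i] + y[i + 1:]
    intercept, slope = fit_line(x_train, y_train)
    # ... and validate on the single held-out point.
    prediction = intercept + slope * x[i]
    squared_errors.append((y[i] - prediction) ** 2)

# Average over all held-out points, one per repetition.
loo_mse = sum(squared_errors) / len(squared_errors)
print(f"Leave-one-out mean squared error: {loo_mse:.4f}")
```

Each data point serves as the validation set exactly once, so the final error is an average over as many repetitions as there are rows rather than a single one-point estimate.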


Try It: OPTION 1 GOOGLE COLAB

  1. Click the Google Colab file used in the video here.
  2. Go to the Colab file and click "File", then "Save a copy in Drive". This will create a new Colab file that you can edit in your own Google Drive account.
  3. Once you have it saved in your Drive, try to implement the following code to import a file of your choice by mounting your Google Drive:

Note: You must be logged into your PSU Google Workspace in order to access the file.

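from google.colab import drive
import pandas as pd

# Mount your Google Drive so Colab can read files stored there
drive.mount('/content/drive')

# Replace 'yourfilename.csv' with the path to your own file; files in a
# mounted Drive typically live under '/content/drive/MyDrive/'
df = pd.read_csv('yourfilename.csv')

df  # display the DataFrame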

Once you have implemented this code on your own, come back to this page to test your knowledge.



 Try It: OPTION 2 DataCamp - Apply Your Coding Skills

Dictionaries are a quick way to create a variable from scratch. However, their functionality is limited, so we will often want to convert those dictionaries into DataFrames. Try to code this conversion in the cell below. Hint: Make sure to import the Pandas library.

Starter code:

# Create a Simple Dictionary
mydict = {'Name': ['Amy', 'Bob', 'Clair', 'Daisy'],
          'Birthday': ['9/3/1991', '4/21/1988', '4/21/1990', '11/11/1989'],
          'Age': [31, 34, 32, 33]}

# convert the dictionary to a DataFrame

# print the values in the 'Birthday' column

Solution:

# Create a Simple Dictionary
mydict = {'Name': ['Amy', 'Bob', 'Clair', 'Daisy'],
          'Birthday': ['9/3/1991', '4/21/1988', '4/21/1990', '11/11/1989'],
          'Age': [31, 34, 32, 33]}

# convert the dictionary to a DataFrame
import pandas as pd
mydataframe = pd.DataFrame(mydict)

# print the values in the 'Birthday' column
print(mydataframe['Birthday'])


Assess It: Check Your Knowledge
