In this video we're going to talk about logistic regression, which is a form of regression that lets us fit an equation when the outcome has only two options. So, think about a situation where you can either succeed or fail, and you still want to know the probability of success or the probability of failure. This is what logistic regression can do, and it's what we're going to demonstrate today.
So we're here in Google Colab. I've got my libraries loaded, and we're going to be working with the Marcellus Wells and the PA Wells Frac data set that we used during the correlation video earlier in this lesson. I've already got the data loaded, and I've already gone through our grouping and merging. If you want more details on this process, go back and check out the correlation video, where it's explained in detail.

Let's move on to the logistic regression. Logistic regression only works with a binary outcome: zero/one, yes/no, success/fail. So we need to come up with a way to form one of those variables from our data set, and in order to do that, we need some constraints. We're essentially going to compare the cost and profit of drilling a new well in the Marcellus Shale, based on the total water needed to extract the most gas.

The first constraint is that water costs one cent per gallon. The next is that renting a truck to carry that water costs $200, and the truck can hold 6257 gallons. Drilling a well costs $850,000 per 1000 feet, and there's a base cost of two million dollars for renting land, paying employees, etc. Now, these numbers are not for anything specific; they're just estimates that we pulled together for this demonstration.

So the cost of our well is the cost of water, plus the cost of trucks, plus the cost of drilling, plus that base cost. Our profit is based on a price from a couple of years ago, when gas was selling for five dollars and twenty cents per 1000 cubic feet. And then we define success as whenever cost is less than profit. Throughout this, we'll be using a new function called numpy.where, which allows us to assign a value based on a condition. So the first thing we need to do is set up our new variables, cost and profit.
So, I'm just going to create a variable cost, and it's going to follow this formula right here. That's 0.01 times our total base water volume, which is already in gallons, so we don't need to do any unit conversions. Then we have 200 per truck, and we need to figure out how many trucks we need, which is the total base water volume divided by 6257. Then I add a plus and put a backslash here; that backslash lets me continue on the next line while staying in the same line of code. If you're doing this on your own and writing it on a single line, you don't need the backslash, but for me it helps so that you can keep seeing everything that I'm doing. The next part of the code is the cost of drilling, which is 850,000 times the merge_df total depth divided by 1000, and then finally the 2 million base cost. So that is our cost. Then we can define our profit, which is just 5.2 times our total gas. Now we've got our cost and profit, so we need to calculate our successes. I'm going to create one variable called success, and this is where numpy.where comes in.
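The cost and profit formulas described here can be sketched roughly as follows. This is only an illustration under assumptions: the column names (total_base_water_volume, total_depth, total_gas) and the two rows of data are made up, total gas is assumed to be in thousands of cubic feet so that it matches the price, and parentheses are used for the line breaks in place of the backslash used in the video.

```python
import pandas as pd

# Toy stand-in for merge_df. Column names and values are hypothetical,
# chosen only so the arithmetic is easy to follow.
merge_df = pd.DataFrame({
    "total_base_water_volume": [6257.0, 62570.0],  # gallons
    "total_depth": [1000.0, 2000.0],               # feet
    "total_gas": [1_000_000.0, 500_000.0],         # thousand cubic feet (assumed unit)
})

# Constraints stated in the video (estimates for demonstration only).
WATER_COST_PER_GALLON = 0.01
TRUCK_RENTAL = 200
TRUCK_CAPACITY_GALLONS = 6257
DRILLING_COST_PER_1000_FT = 850_000
BASE_COST = 2_000_000
GAS_PRICE_PER_1000_CUBIC_FT = 5.20

# cost = water + trucks + drilling + base cost
merge_df["cost"] = (
    WATER_COST_PER_GALLON * merge_df["total_base_water_volume"]
    + TRUCK_RENTAL * merge_df["total_base_water_volume"] / TRUCK_CAPACITY_GALLONS
    + DRILLING_COST_PER_1000_FT * merge_df["total_depth"] / 1000
    + BASE_COST
)

# profit = gas price times total gas
merge_df["profit"] = GAS_PRICE_PER_1000_CUBIC_FT * merge_df["total_gas"]

print(merge_df[["cost", "profit"]])
```

Note that, as in the video, the number of trucks is left as a fraction and is not rounded up to whole trucks.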
So the command is numpy dot where, and inside this command there are three parts: the condition, comma, what to do if true, comma, what to do if false. Our condition here is merge_df cost less than merge_df profit. If this is true, we want to write the word yes into that particular row; otherwise, we want to write the word no. Then I'm going to create a second column, which I call successnum. It's the same statement, which I'm just going to copy, but instead of words I'm going to change it to the number one and the number zero. This will become important later on, when we need numbers to represent our success. If we look at the first five rows, we can see that now we've got our cost, our profit, our success, and our successnum.

Now, this where statement is essentially performing a mixture of an if statement and a for loop all in one. You could instead write that out yourself. That process ends up being a bit longer, but I'm going to go ahead and give a demonstration just for the sake of contrast. So we would have for i in the range of the length of merge_df, so for every row i in merge_df. Then we need to test that row to see if cost is less than profit. So we say if merge_df dot iloc, the ith row in the cost column, is less than merge_df dot iloc, again the ith row, but in the profit column. If that is true, we say merge_df dot iloc, set the ith row in the success column equal to yes, and then repeat that for successnum.
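As a minimal sketch of the numpy.where step described above, here is the same idea on a small made-up table. The cost and profit values are invented for illustration; only the column names cost, profit, success, and successnum come from the video.

```python
import numpy as np
import pandas as pd

# Hypothetical cost and profit values, standing in for the real merge_df.
merge_df = pd.DataFrame({
    "cost": [2.8e6, 3.7e6, 3.0e6],
    "profit": [5.2e6, 2.6e6, 3.0e6],
})

# np.where(condition, value_if_true, value_if_false), applied to every row at once.
merge_df["success"] = np.where(merge_df["cost"] < merge_df["profit"], "yes", "no")
merge_df["successnum"] = np.where(merge_df["cost"] < merge_df["profit"], 1, 0)

print(merge_df.head())
```

The third row is a tie, so the condition cost less than profit is false and the row is labeled no.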
So that's if our cost is less than profit. Then we need the other branch. We can say elif, so else and if combined. I'm going to copy the same lines here because we want to make sure it follows the same process. But instead we're saying if cost is greater than profit, then our success is no and our successnum is zero. Then I backspace twice to end up outside of both the if statement and the loop. We can run this, and we can see that it's taking a little bit longer to run: it has to go through every single row and test every single condition. It's also telling me that it doesn't like how I've sliced the data. We can come down here and see the same results as we had before, but instead of two lines of code we've got seven lines to do virtually the same thing. So either way is appropriate, but in this case the where statement works a little better. The last thing we're going to do in this video is visualize the variables colored by success.
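The loop version can be sketched along these lines. This is not the exact code from the video: the data are made up, the cells are addressed with a single iloc call using column positions (one way to avoid the chained slicing that pandas tends to complain about), and a plain else is used so that ties get no, the same as numpy.where. An elif on cost greater than profit would skip a tie entirely.

```python
import numpy as np
import pandas as pd

# Hypothetical cost and profit values, standing in for the real merge_df.
merge_df = pd.DataFrame({
    "cost": [2.8e6, 3.7e6, 3.0e6],
    "profit": [5.2e6, 2.6e6, 3.0e6],
})

# Create the output columns first so the loop has somewhere to write.
merge_df["success"] = ""
merge_df["successnum"] = 0

# iloc works with integer positions, so look up each column's position once.
cost_col = merge_df.columns.get_loc("cost")
profit_col = merge_df.columns.get_loc("profit")
success_col = merge_df.columns.get_loc("success")
successnum_col = merge_df.columns.get_loc("successnum")

for i in range(len(merge_df)):
    if merge_df.iloc[i, cost_col] < merge_df.iloc[i, profit_col]:
        merge_df.iloc[i, success_col] = "yes"
        merge_df.iloc[i, successnum_col] = 1
    else:
        merge_df.iloc[i, success_col] = "no"
        merge_df.iloc[i, successnum_col] = 0

# Check that the loop agrees with the two-line numpy.where version.
where_result = np.where(merge_df["cost"] < merge_df["profit"], "yes", "no")
matches_where = bool((merge_df["success"] == where_result).all())
print(matches_where)
```

On a real data set the loop touches one cell at a time, which is why it runs noticeably slower than the vectorized where call.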
So, I'm going to use ggplot with merge_df. I'm just going to do a point plot. Our x variable is total base water volume, our y variable is total gas, and inside the aes statement I'm going to specify color as success. Then, outside of the aes statement but inside the point, I'm going to set alpha to 0.25 so that our plot points are partially transparent. Here we can see our success data, where we have some nos down here and some yeses up here. There are a few outliers, but by and large they do seem to follow this line. So logistic regression might be a good model to use here, and we will get into that in the next video.