
Logistic Regression
Read It: Logistic Regression
Occasionally, you may encounter a response variable that is categorical rather than quantitative. In this case, you can use logistic regression instead of linear regression. Logistic regression lets you perform a form of regression on a binary variable (e.g., 0/1 or yes/no). Below is a figure that shows a dataset that could be used in logistic regression. In particular, the aim of logistic regression in this case would be to find the curve that separates the "yes" points from the "no" points, based on the input data (e.g., TotalGas and TotalBaseWaterVolume).
Mathematically, logistic regression works by defining a response variable with two options: Y = 1 for "yes" and Y = 0 for "no". The model then tries to predict p, the probability that Y = 1, given the input data. The equation looks like:
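For a single predictor x, the standard form models the log-odds of success as a linear function:

log(p / (1 − p)) = β₀ + β₁x

Solving for p gives the sigmoid:

p = 1 / (1 + e^−(β₀ + β₁x))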
This results in a sigmoid curve, as shown below, that indicates how the probability of success (Y = 1) changes as a predictor's value changes. Here, we show how the probability of success increases as TotalBaseWaterVolume increases, with the actual data plotted in black at Y = 0 and Y = 1.
Below, we include two videos demonstrating logistic regression. The first sets the data up, including getting our "yes" and "no" values. The second video is where we implement the logistic regression.
Watch It: Video - Logistic Regression Set Up (12:21 minutes)
In this video we're going to talk about logistic regression, which is a form of regression that allows us to come up with an equation of a line when we only have two options. So, think about a situation where you can either succeed or you can fail. And you still want to know the probability of success or the probability of failure. This is what logistic regression can do, and what we're going to demonstrate today.
So we're here in Google Colab. I've got my libraries loaded, and we're going to be working with the Marcellus Wells and the PA Wells Frac data set that we used in the correlation lesson earlier. I've already got the data loaded, and I've already gone through our grouping and merging. If you want more details on this process, go back and check out the correlation video where it's explained in detail. Let's move on to the logistic regression. So, logistic regression only works with a binary data set: zero/one, yes/no, success/fail. So, we need to come up with a way to form one of those variables with our data set. And, in order to do that, we need some constraints. We're essentially going to compare the cost and profit of drilling a new well in the Marcellus Shale based on the total water needed to extract the most gas. And so, we have some constraints. The first is that water costs one cent per gallon. The next is that renting a truck to carry that water costs $200, that the truck can hold 6,257 gallons, that drilling a well costs $850,000 per 1,000 feet, and that there's a base cost of two million dollars for renting land, paying employees, etc. Now, these numbers are not from anything specific; they're just estimates that we pulled together for this demonstration. And so, essentially, the cost of our well is the cost of water, plus the cost of trucks, plus the cost of drilling, plus that base cost. And then our profit is based on a price from a couple of years ago, when gas was selling for five dollars and twenty cents per 1,000 cubic feet. And then we define success as whenever cost is less than profit. So that's what we want. And throughout this, we'll be using a new function called numpy.where, which allows us to assign a variable based on a condition. So the first thing we need to do is set up our new variables, cost and profit.
So, I'm just going to create a variable cost, and it's going to follow this function right here: 0.01 times our total base water volume, which is already in gallons, so we don't need to do any unit conversions. Then we have $200 per truck, and we need to figure out how many trucks we need, which is the total base water volume divided by 6,257. And then we do a plus, and I'm going to put a slash mark here; that slash allows me to continue onto the next line while staying in the same line of code. So, if you're doing this on your own in a single line, you don't need to put that slash in, but for me it helps so that you can keep seeing everything that I'm doing. The next part of the code is the cost of drilling: $850,000 times merge_df total depth divided by 1,000. And then finally, the 2 million base cost. So, that is our cost. And then we can define our profit, which is just 5.2 times our total gas. So, now we've got our cost and profit, and we need to calculate our successes. So I'm going to create one variable called success, and this is where our numpy.where comes in.
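As a rough sketch of what this step looks like in code (the column names TotalDepth and TotalGas are assumptions based on the video, and TotalGas is assumed to be in thousands of cubic feet):

import numpy as np

# Cost: water at $0.01/gal, $200 per truck (6,257-gal capacity),
# $850,000 per 1,000 ft drilled, plus a $2,000,000 base cost
merge_df['cost'] = (0.01 * merge_df['TotalBaseWaterVolume']
                    + 200 * (merge_df['TotalBaseWaterVolume'] / 6257)
                    + 850000 * (merge_df['TotalDepth'] / 1000)
                    + 2000000)

# Profit: $5.20 per 1,000 cubic feet of gas
merge_df['profit'] = 5.2 * merge_df['TotalGas']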
So the command is numpy dot where, and inside this command there are going to be three parts. First, you give the condition, comma, what to do if true, comma, what to do if false. Our condition here is when merge_df cost is less than merge_df profit. If this is true, we want to write the word yes into that particular row; otherwise, we want to write the word no. Then I'm going to create a second column, which I call successnum, which is just the same statement copied, but instead of words I'm going to change it to the number one and the number zero. This will become important later on, when we need numbers to represent our success. And so, if we look at the first five rows, we can see that now we've got our cost, our profit, our successes, and our numeric successes. Now, this where statement is essentially performing a mixture of an if statement and a for loop all in one. Alternatively, you could write it out, and that process ends up being a bit longer, but I'm going to go ahead and give a demonstration just for the sake of providing contrast. So we would have for i in the range of merge_df, so for every row i in merge_df. Then we need to test that row to see if cost is less than profit. So we say if merge_df dot iloc, the ith row, in the cost column, is less than merge_df dot iloc, again the ith row, but in the profit column. If that is true, we set merge_df dot iloc, the ith row, in the success column, equal to yes, and then repeat that for the successnum.
So that's if our cost is less than profit. Then we need to do the other statement. So we can say elif, which is else and if combined. I'm going to copy the same basis here, because we want to make sure it follows the same process, but instead we're saying if cost is greater than profit, then our success is no and our successnum is zero. And then two un-indents to end up outside of both the if statement and the loop. And so, we can run this, and we can see that it's taking a little bit longer to run; it has to go through every single row and test every single condition. It's also telling me that it doesn't like how I've sliced the data. We can come down here and see the same results as we had before, but instead of having two lines of code, we've got seven lines to do virtually the same thing. So, either way is appropriate, but in this case the where statement does work a little better. And so, the last thing that we are going to do in this video is to visualize the variables colored by success.
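A minimal sketch of both approaches, assuming the cost and profit columns from the previous step; the loop assigns by row and column position, which avoids the pandas slicing warning mentioned above:

# Vectorized: np.where(condition, value_if_true, value_if_false)
merge_df['success'] = np.where(merge_df['cost'] < merge_df['profit'], 'yes', 'no')
merge_df['successnum'] = np.where(merge_df['cost'] < merge_df['profit'], 1, 0)

# Loop equivalent: slower and more verbose
s_col = merge_df.columns.get_loc('success')
n_col = merge_df.columns.get_loc('successnum')
for i in range(len(merge_df)):
    if merge_df['cost'].iloc[i] < merge_df['profit'].iloc[i]:
        merge_df.iloc[i, s_col] = 'yes'
        merge_df.iloc[i, n_col] = 1
    else:
        merge_df.iloc[i, s_col] = 'no'
        merge_df.iloc[i, n_col] = 0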
So, I'm going to use ggplot with merge_df, and I'm just going to do a point plot. Our x variable is total base water volume, our y variable is total gas, and inside the aes statement I'm going to specify the color as success. Then, outside of the aes statement but still inside the point, I'm going to say alpha is 0.25, so that our plotted points are partially transparent. And here we can see our success data, where we have some noes down here and some yeses up here. There are a few outliers, but by and large they do seem to follow this line. So logistic regression might be a good model to use here, which we will get into in the next video.
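In code, assuming the ggplot interface comes from the plotnine library, the plot looks roughly like this:

from plotnine import ggplot, aes, geom_point

# Scatter of gas produced vs. water used, colored by the success label
(ggplot(merge_df)
 + geom_point(aes(x='TotalBaseWaterVolume', y='TotalGas', color='success'),
              alpha=0.25))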
Watch It: Video - Logistic Regression Modeling (10:45 minutes)
In this video we're going to actually conduct the logistic regression. Much of what we did earlier was prepping the data that we needed. We got this nice plot, and now we're going to use the statsmodels.formula.api library in order to fit a logistic regression to our data. And there's some information here that you can use to read up on what the logistic regression is doing. So, I'm going to create a variable called logregmod, for logistic regression model. The library is nicknamed smf; up here you can see that I've nicknamed this library smf. And the command is smf.logit, for logistic regression. And this is the formula: successnum tilde np.log10 of total base water volume. This particular model requires that numeric success value, which is why we had to create both the yes/no version and the zero/one version. Then our data is merge_df, and I'm going to tack the fit right onto the end of that. And then we can print the results: we can say print logregmod dot summary.
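A sketch of the fit, assuming the formula log-transforms the water volume (consistent with the log10 variables used later in the video):

import numpy as np
import statsmodels.formula.api as smf

# np.log10 can be used directly inside the model formula
logregmod = smf.logit('successnum ~ np.log10(TotalBaseWaterVolume)',
                      data=merge_df).fit()
print(logregmod.summary())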
And we can run that. The summary is much like what we got earlier when we did linear regression; we can see a lot of information. It even gives us this pseudo R-squared value. Now, technically, logistic regression isn't linear, so the ordinary correlation coefficient and R-squared don't apply; we can't use those measures for non-linear functions. However, what this command does in logit is calculate a pseudo R-squared, which is a way to assess goodness of fit for non-linear functions. And so, this tells us that technically only about 11% of our data is being explained by this logistic regression model. So not ideal, but still a good exercise to do.
So now that we have the model ready, we can use that fitted model to make predictions. I'm going to create a new variable called log10x, which is just going to use our linspace command: start from four, end at eight, and add as many data points as there are in merge_df. And then I'm going to create a predictions variable, which is just our logregmod dot predict, a slightly different function. And there's this argument called exog, which stands for exogenous and refers to the explanatory (x) variable. It is a little bit funny: it needs to be in a dictionary, and it needs to have some connection to the original data set, so I'm calling it TotalBaseWaterVolume, which matches our original data. That's also why, up here, I put the log10 transform of TotalBaseWaterVolume directly in the formula instead of creating a new column for the log of the water volume, just so that these two names can be the same. But then we can print predictions. And so, we can see here that now we have these predictions for our evenly spaced variable log10x.
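A sketch of the prediction step; because the formula applies np.log10, this version passes raw volumes of 10**log10x so that the transform reproduces the 4-to-8 grid (the video's exact handling may differ slightly):

# Evenly spaced log10(water volume) values spanning the observed range
log10x = np.linspace(4, 8, len(merge_df))

# predict() needs the explanatory variable (exog) under the same name used
# in the formula; the formula then applies np.log10 to these raw volumes
predictions = logregmod.predict(exog={'TotalBaseWaterVolume': 10 ** log10x})
print(predictions)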
And so, then we can actually go in and visualize that data. We're going to start with visualizing the actual data. So we say ggplot with merge_df, then geom_point, where x is log10water, which is a variable that was created earlier as just the log transform of the water variable, and y is again going to be our successnum.
So, this is not how the plot is supposed to look, and I believe that is because I forgot my aes statement, so ggplot didn't know what to do with x and y, because those aren't normal inclusions outside of the aes statement. So, let's run that again, and there we have what we would actually expect. We've got success rate, and we've got log10water here. And we can see that there were some successes over here, but not as many, and a lot of successes over here, but there are still a lot of failures at this higher range too.
So this is the original data, and now we want to add in our logistic regression curve, which is going to give us the probability between these values. In order to do that, we first need to create new columns in our data frame, because ggplot doesn't like to work with values that aren't in a data frame. So we can create a column called log10x, which just contains our log10x values from up here, as well as a predictions column that just contains our predictions. Then I'm going to come up here, grab the same plot, and we're just going to add to it. So we can add geom_line with its own aes statement.
And so here our x value is going to be log10x, our y value is going to be predictions, and then I'm going to add the color blue and make the line a little bit bigger with size two. Then I'm going to add a y label. This is not something we have done too much, but what the y-axis is actually showing is the probability of success. So although we're plotting successnum, and later predictions, on the y-axis, what it's actually telling us is probability, so this label is a little more descriptive. And here we can see this idealized curve as it goes up. This is our actual data, and even though there are some data points over here, what the logistic regression is doing is saying that this has to fit some sort of S curve. So it's more likely to be a failure on the low end, with these being some outliers, and more likely to be a success on the high end, with some of those being outliers as well. And so, this is ultimately the result of a logistic regression. We could then come in and say that at a log10 water volume of 6, sometimes there were failures and sometimes there were successes, but if we come up to this curve and estimate over here, it's just over 0.25, maybe 0.3. So there's still about a 30% chance of success. That's on the low end, whereas once we get up past this inflection point, we start to see higher chances of success for the higher amounts of water.
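Putting the plotting steps together, again assuming plotnine, with log10water as the precomputed log transform of TotalBaseWaterVolume:

from plotnine import ggplot, aes, geom_point, geom_line, ylab

# ggplot works best with data frame columns, so store the curve there too
merge_df['log10water'] = np.log10(merge_df['TotalBaseWaterVolume'])
merge_df['log10x'] = log10x
merge_df['predictions'] = predictions

# Actual 0/1 outcomes as points, fitted sigmoid as a blue line
(ggplot(merge_df)
 + geom_point(aes(x='log10water', y='successnum'), alpha=0.25)
 + geom_line(aes(x='log10x', y='predictions'), color='blue', size=2)
 + ylab('Probability of Success'))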
Try It: DataCamp - Apply Your Coding Skills
Edit the code to run a logistic regression model. Your response variable should be the pre-programmed "success" variable and the explanatory variable should be the pre-programmed "x" variable.
# libraries
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
# create some data
df = pd.DataFrame({'x': np.random.randint(0,100,100),
                   'y': np.random.randint(-50,150,100)})
df['success'] = np.where(df['x'] > df['y'], 1, 0)
# run model
model = ...
print(...)
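Solution: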
# libraries
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
# create some data
df = pd.DataFrame({'x': np.random.randint(0,100,100),
                   'y': np.random.randint(-50,150,100)})
df['success'] = np.where(df['x'] > df['y'], 1, 0)
# run model
model = smf.logit('success ~ x', data = df).fit()
print(model.summary())