Click here for a transcript of the L07V08 ANOVA video.
Hi. In this video, we're going to walk through an example of one-way ANOVA. And in this analysis, the example we'll use comes from research conducted by people in my own department. That's the Department of Energy and Mineral Engineering here at Penn State. So, the researchers here predict the volume of emissions that will come out of PA based facilities under three different policy scenarios.
Scenario one is the base case, which is that PA does not join the Regional Greenhouse Gas Initiative, which is essentially a cap and trade program, and keeps the existing alternative energy portfolio standard, which mandates a certain amount of energy has to come from alternative energy sources. Our second scenario is that PA joins RGGI and keeps the portfolio standard. And then the third scenario is that the PA joins RGGI but drops the portfolio standard. It should be noted that this research was conducted in 2021 before Pennsylvania joined RGGI in the summer of 2022. So, let's get down to the analysis. .
So, our hypotheses, which we'll always have with ANOVA, are that all three means across all three groups are equal. So, our three groups are the three different policy scenarios, and our quantitative variable here is the volume of emissions. To simplify things, we're just going to look at the volume of nitrous oxide. But do bear in mind that the data set does contain other pollutants as well, if you want to look at those. So, we're saying that the volume of nitrous oxide on average is equal across the base case scenario RGGI and no AEPS. .
The alternative is that at least one of these means is not equal from the others. So, let's get down to it. Let's see what the data say. So, first things first is to import the data, and I've already done that with these familiar commands here. One thing that should be noted is that we've got three distinct Excel sheets. .
So, we're going to read directly from Excel using read underscore Excel, this function from Pandas. We can do that, which is nice. Another thing, too, is to simplify this analysis, we're only going to look at the year 2030. So, if you open up one of these Excel files, you'll note that it's got multiple spreadsheets by year. And so we're just going to pull out the sheet for year 2030. .
So, we've imported these three Excel sheets as data frames called Base RGGI and no AEPS. Let's examine one. Let's just take a look at base here. .
And so, here's our data. Now, the task at hand for data processing is that we need to merge all three of these data frames. So, if we want to compare these three, it's helpful to have them in one data frame so that we can make a fair comparison using our statistical tools here. So, first things first is if we're going to merge these, we need to go back and use our merge function that we learned about in lesson two. And we need to identify some keys for doing this merging. .
And so, we can tell the merge function what our key variables are by setting them as indices. So, we'll use the set index command to define the key variables. And these are the variables that uniquely identify the rows here. So, we need to provide a list here and the variables that do that, If you go and inspect the data set, is the unit unique ID, the latitude, the longitude, the state, the county and the fuel type. Okay. .
And we'll use these same ones. So, all three of those data frames are structured the same and have the same variables. So, we'll do this for all three data frames that we have here and just change these names. So, RGGI and no AEPS. .
Okay, so all the set index function is doing is designating these variables as indices. Another thing that we'll do to set up the data before doing the merging is to add a suffix to the variables in the data frames. And what this will allow us to do is distinguish the variables that are in base from the variables in RGGI and from the variables in no AEPS because these variables are all currently called the same thing because they're in separate data frames. But when we merge them together, we don't want to get confused about which column belongs to which. So we'll add a suffix to just make sure that they're distinct. .
So, use this add suffix command and to base we'll add the suffix underscore base and same thing with RGGI and no AEPS, except we want to change these suffices to RGGI and no AEPS. Okay? And then we can do the merging. So, we'll make a new data frame, let's call it DF underscore merged stand for data frame that is merged. And from the Pandas library we'll use our merge function and we can't merge all three at once. So, we'll merge two into one data frame and then this merge data frame will merge with the third item. So, let's first start off by just merging Base and RGGI. .
Our left index here we'll say left index is true and right index is true just to indicate that we're using the indices here and we want to keep everything. So, we'll say how is equal to outer. So, we're going to do an outer merge in order to keep everything. And then once these two are merged, what we'll do is we'll overwrite our DF merged again using the merge command here and we'll use all of these same settings except instead of Base, we're going to use DF merged. So we're going to first merge Base and RGGI store that in DF merged and then DF merged is going to be used in another merge command to attach not RGGI but the no AEPS data frame. .
Okay? And then we'll visualize how that plays out. Okay, so we'll go ahead and run this and there we go. So, note that these items that are in bold, these are the indices. So from setting the index, these are all in bold here. .
And then we've got and then we've got all our variables. So, the different pollutants here's NOx that we'll be looking at. And note that we've attached the underscore base to distinguish that from the NOx that's currently hidden by this dot, dot, dot, but the NOx for underscore RGGI and no APS. Okay, so that all worked well. And you might be wondering, oh, why did we include latitude, longitude, state, county and fuel type in the indices, even though that they're not really keys. That we've got the same values across these first five entries, for example, for those variables. .
Whereas really unit unique ID is really the key here in that it's the unique identifier for each row. Well, it doesn't hurt to include these variables in the keys and essentially retain them for the items. These are properties of the individual units. The units here are basically power plants, factories and things like that. And so, we can retain these variables by including them in the indices without duplicating them across all the different data frames that are merged in here.
So anyway, a little bit about that. So, let's move on. So, once we've done that, let's take those variables that we have set as indices and bring them back as variables. So, we'll reset index, this reset index command will just turn these variables that were currently indices again into variables. So, let's check that out. There we go. That worked. They're not bold anymore. We have these row numbers as bold. Those are our indices now.
Okay, so that's all well and good. We've got those three data sets merged. Now the other thing that we want to do is format our data set so that we've got one quantitative variable and one categorical variable. Now the categorical variable is going to contain our groups. So, each category will represent one of the policy scenarios here, whereas the quantitative variable will contain our variable of interest, the thing that we're going to calculate the mean upon, which will be the emissions, in this case, for this particular example, the NOx emissions.
So, in order to create that format, we're going to use the melt function here. And again, you can go back to lesson two or reference the Pandas cheat sheet to learn about the basics of the melt function. But with this melt function, our ID VARS, whatever we specify in ID VARS is going to be left alone. And so, here we'll just specify unit unique ID. So, we'll keep that as is.
Now, everything that we specify in value VARS here, these are going to be called names that are going to be turned into a new variable. And so, this is going to include our NOx base, our NOx RGGI and our NOx no AEPS. So, these three columns, these are the column names. These are going to get turned into a new variable here and we're going to call that variable Pollutant. We call it Policy or whatever we want.
Let's just call it Pollutant here. And our value name is going to be Emissions because whatever we specify, basically all the values that are under these three columns are then get put into a new column that will call emissions here. But again, we could call this whatever we want. So let's see this in action.
Go ahead and run this. And there we go. So, it's taken again these three columns out of the original data frame above. Put those into the pollutant column here. So NOx space, NOx RGGI, NOx no AEPS.
And then it's taken all the variables that are under each of these and put them in the right columns here based upon where the unit unique ID falls. Okay, so we've got that nicely arranged. Now one thing that bugs me here is that we've got a pollutant variable and it starts with NOx and that's fine. This adequately describes the policy scenario. But one nice touch would be to separate out the different policy names here.
So, let's just do that real quick. So, what we'll do is we'll keep our pollutant column but we're going to change this a little bit. So, we're going to overwrite it and we're going to make a new column called Policy. And so, what we're going to do is we're going to take each of these entries and separate out the first part, which is the pollutant, put that in the pollutant column and the second part after the underscore, we'll put in the Policy column. And in order to do that we're going to use the string or STR split function.
So, we'll call our current pollutant column, say STR split and we'll tell it what thing to split upon and that is the underscore. So, we want to split upon the underscore. And another argument we're going to give it here is expand equals true, which is basically to take. So, after splitting on this underscore we'll have two strings and this will expand it into two separate items, not one single string that contains everything but the underscore. So, let's see how that plays out.
There we go. Now we still have our pollutant column and it just contains the pollutant type. In this case NOx because that's all we're looking at here. And now we have a policy call that's got the policy names. These are our three policy scenarios.
So, base the dot dot dot is covering up the RGGI but it's there. And then our no AEPS. Now, the third thing that we want to do is you might notice that we have some NAN values here. Let's just get rid of those. This is important to do because our one way ANOVA test won't work if we have NAN values in there. So, our NANS are just in our emissions column.
And really, that's the only thing we need to remove the NA values from because that's the variable that we're going to plug into. Our one way ANOVA will mainly operate on. And we only have these Nas under the no AEPS scenario. And you can get that by inspection of the data set here. I won't waste time doing that here.
And we'll just use our drop NA function to remove those NA values. Okay, there we go. Now, another good thing to do is to do some exploratory data analysis. So, here we'll look at the mean values in each of our groups.
And so, let's make a new data frame where we just kind of clean this up a little bit. We'll drop some labels here, so our unit unique ID will get rid of. And actually, we don't need to keep pollutant either, because it's NOx for everything. We don't need the unit unique IDs because we don't need to distinguish the different units from one another in the subcontinent analysis.
So, we're just going to get rid of those variables just because they're currently unnecessary. And then we'll do our group by and we're going to group by policy, and we'll calculate the mean within each group. We're going to calculate the mean emissions within each type of policy.
Let's just rename some columns here, because once we do this group mean, we'll have a column called Emissions that would really be much better called mean. And then just one other thing that we'll need to do in here is reset the index after doing the group by. So, let's see what we get as a result of this.
There we go. So, we've grouped by policy. We have our three different policy groups, and we've got the mean NOx emissions within each group here. And certainly these mean values are different from one another, so none of them are equal. But the real question here, which we'll answer with the ANOVA test, is whether these differences are significant or not.
So, do we have a statistically significant difference of means here? So, let's go ahead and do that test. So, we'll use the built in function, or I should say the function that's available from the SCIPY stats package here. And this is just Stats F underscore one way. The F is referring to the F statistic that's inherent to the ANOVA test. So, that's why it's called that name.
You don't see a nova in the test. It's referring to F test, which is, again, in reference to that F statistic. And so, we want to give it the three series of values from our quantitative variable, in this case, emissions, and we need to pull these values out of each policy. So, we need to do a bit of a filtering statement here. And so, first case is under the policy, that's equal to base, and we can include our just for safety sake, we'll include our drop NA here and just copy and paste this for our other scenarios, too.
And we'll keep the drop NA in there. It doesn't hurt. So, even though there aren't any NAs in these other two, it won't affect them, it just won't drop anything. But we'll certainly get rid of the NAs from this no APS here. Okay, so there we go.
We're supplying the emissions values for the base case and then separated with the comma from the emissions values from RGGI case, another column. And then our third argument is emissions from the no AEPS case. Okay, so let's go ahead and run this. We run it and we get some output. We get one line of output here we've got our statistic.
So, this is the F statistic value and we have our P value. You might note that this P value is pretty high. It's much higher than zero five. So, our default significance level. So, we fail to reject the null hypothesis. So, our conclusion here is you fail to reject the null hypothesis at the 0.5 significance level, exclamation point capital W. And so, what does this mean? It means that these data do not support the conclusion that any policy scenario leads to different mean emissions from the other scenarios.
So, with respect to just the mean emissions, there's no significant difference. This is interesting. Clearly, the base case is quite a bit higher than the other two. And you would think that adopting RGGI would lead to a significant difference. But this test is saying that's just not true.
In order to illuminate this conclusion a little bit better, let's visualize it. This will make a clear explanation here. So in order to do this, though, we need to calculate some error bars. So, bear with me here. So we'll make a new data frame.
We're going to take our DF two, we're going to group. Our DF two, by the way, is what we have up here after dropping these variables. And we're going to group by our policy again, except instead of doing the mean, we're going to calculate our standard deviation on the values. And we'll rename some columns here because again, it's going to call it emissions, when in fact it doesn't have the emissions, it has the standard deviation of the values there. All right, so let's do that.
And then we need another summary statistic here. I'm going to borrow the same structure. We're still going to group by policy, except instead of doing standard deviation, we're going to count. That is, we're going to count the values in each policy group. And again, we need to rename the emissions and we'll rename this.
Instead of a CD, we'll rename it count. And then we'll do some more merging. So, our new DF three, which has the means in it already. So, this is our DF three currently. We're going to add on the results of these other grouped summaries.
So, PD merge, DF three. We're going to tack on our DF four, and we're going to merge on policy. So, policy will be our key here because that's really making each row distinct. And then similarly, once we tack on DF four, we'll also then merge on DF five. And then we need to do one final calculation here, and that is we need to find our standard error with a capital E.
And this is equal to our standard deviation divided by the square root of our sample size or our count. Okay. And this formula is coming. You can go back to the lesson six to see where this formula is coming from. Basically, you can see the traditional inference stuff there on confidence intervals and where this comes from. So, let's see what this looks like now. There we go. We've got our mean, our standard deviation, our count, and our standard error. All right, finally, let's make a nice plot here. So, we'll use plot nine.
We'll import everything, and we're going to make a GG plot based upon this DF three info. And what I would like to do is I would like to make a bar plot that's showing that's breaking up our values by policy and plotting the mean for each policy. Now, here we just have a single value for each policy. So what we need to do is we need to set stat or statistic to identity, meaning that it's just going to use this single value for the height of the bar. Okay, so we can take a look at that.
There we go. There's our bar plot showing that base has a bit higher emissions. These two are pretty close, but that's just showing what we calculated already before we did the hypothesis test. Let's add in some error bars because it's the variance about these means. That's the critical information, and that's why it's called analysis of variance, is because we want to see how much these mean values could vary.
So again, we'll have an error bar for each policy and we need to specify a minimum. So, a low level of the error bar. And this will be our mean minus two times our standard error based upon our going back to confidence intervals. Right. So ,we have our statistic plus or minus two times our standard error to give us a 95% confidence interval.
And so, that would mean then that our Y max or the top of the error bar should be plus mean plus two times standard error. Okay, so let's see what this looks like. Oh, one other just esthetic thing here is that we can specify width of the error bars. Let's say zero two should look nice, but we can adjust that if it doesn't. There we go.
You've got our error bars. So, the error bars, these are showing our 95% confidence interval for the mean and otherwise showing how much this mean could vary within a reasonable realm of possibilities. Right. And so, interpreting our hypothesis test result that was much larger than zero five and where we failed to reject the null, well, we see that the mean of each one falls within the 95% confidence interval of all the other bars. So, that is, considering the variance of the means, these overlap significantly.
There's no real big significant difference. And so, that's the critical aspect to ANOVA is that you need to take into account the variability of the mean and not just the single mean value itself. Okay, thank you. That's it.