EME 210
Data Analytics for Energy Systems

Introduction to Hypothesis Testing

In many instances, it is not clear whether a data sample supports a claim about the population from which it came. Some examples include:

  • One might want to see if particulate matter (PM2.5) pollution exceeds healthy concentration levels in economically disadvantaged areas. The population is the air in all economically disadvantaged areas over a long period of time. It is infeasible to collect data for the whole population, so one collects air quality samples from several areas at several different times throughout the year. Hypothesis testing then answers the question: do the sampled data indicate that the average PM2.5 concentration is statistically significantly greater than the limit to be deemed healthy?
  • A popular electric car company has heard reports that some of the individual cars it has made do not meet the mileage range, on a single charge, reported by the company. If the proportion of such defective cars is greater than a certain amount, the company needs to recall that particular vehicle model and pay to repair all the cars. The population is all the individual cars of that model. Since it is impossible to test the entire population to see if the range is met or not, the company randomly samples 100 of the cars and tests them. Hypothesis testing then answers the question: is the proportion of defective cars less than the limit for issuing a recall?
  • A natural gas company has come up with a new well design that it thinks will boost production. Here there are two populations: the population of wells with the new design and the population of wells with the old design. The population parameter of interest is the difference in average natural gas production between the two populations. Since it is financially risky to drill a large number of wells with the new design (enough to constitute a population), the company drills only a few new wells with this design. It then measures production from the new wells and compares their average to the average production from the old wells. Hypothesis testing then answers the question: is average production from the new wells statistically significantly greater than the average production from the old wells?

To perform hypothesis testing, one needs an unbiased sample from a population that has variability in the measure of interest. If, for example, the PM2.5 concentration were constant across all areas and all times, then a single measurement would equal the population average, and one could make a definitive statement about this population average without hypothesis testing. There would be no uncertainty in this population parameter. Using a biased sample in a hypothesis test may lead to a false conclusion. If, for example, the electric car company tested only the next 100 cars produced at one of its factories, it might miss a defect that originates at another factory.

As we go through this lesson, you will see that all hypothesis testing follows a similar process:

  1. Formulate your null and alternative hypotheses
  2. Calculate the relevant sample statistic
  3. Conduct the hypothesis test: compute the randomization distribution
  4. (Optional) Visualize the randomization distribution and sample statistic
  5. Calculate the p-value
  6. State the conclusion of your hypothesis test
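
As a rough illustration of how these six steps fit together, the sketch below runs a randomization test for a single mean using the PM2.5 example. Everything specific in it is an assumption made for illustration: the readings are made-up numbers, the limit of 12.0 is a hypothetical threshold rather than an official standard, and the function name `randomization_test_mean` is not from this lesson. The randomization distribution is built by shifting the sample so that its mean matches the null value and then resampling with replacement, which is one common way to simulate the null for a single mean.

```python
import numpy as np

def randomization_test_mean(sample, null_mean, n_sim=5000, seed=0):
    """One-sided (greater-than) randomization test for a single mean."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)

    # Step 2: calculate the relevant sample statistic.
    observed = sample.mean()

    # Step 3: build the randomization distribution. Shift the data so the
    # null hypothesis is true (mean equals null_mean), then resample with
    # replacement and record the mean of each simulated sample.
    shifted = sample - observed + null_mean
    resamples = rng.choice(shifted, size=(n_sim, sample.size), replace=True)
    sim_means = resamples.mean(axis=1)

    # Step 5: the p-value is the fraction of simulated means that are at
    # least as large as the observed mean.
    p_value = np.mean(sim_means >= observed)
    return observed, sim_means, p_value

# Hypothetical PM2.5 readings (made-up values) and a hypothetical limit.
pm25 = np.array([14.1, 11.8, 15.3, 13.2, 12.9, 16.0,
                 10.7, 14.8, 13.5, 15.1, 12.2, 14.4])
limit = 12.0

# Step 1: H0: mean PM2.5 = limit; Ha: mean PM2.5 > limit.
observed, sim_means, p_value = randomization_test_mean(pm25, limit)

# Step 4 (optional) would plot a histogram of sim_means with a vertical
# line at the observed mean; it is omitted here to keep the sketch short.

# Step 6: state the conclusion at an assumed significance level of 0.05.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"observed mean = {observed:.2f}, p-value = {p_value:.4f}, {decision}")
```

The significance level of 0.05 is also an assumption here. The same structure carries over to other statistics by swapping the line that computes the mean, although the way the null hypothesis is simulated has to change with the type of test.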

Even before these steps, an incredibly useful, if not necessary, preliminary step is to explore your data. Visualizations and five-number summaries will help you formulate appropriate hypotheses, which is important for interpreting the results of the subsequent steps. 
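
As a minimal sketch of this exploratory step, a five-number summary (minimum, first quartile, median, third quartile, maximum) can be computed with NumPy's `percentile` function. The readings below are made-up values for illustration, and the helper name `five_number_summary` is hypothetical.

```python
import numpy as np

def five_number_summary(values):
    """Return the min, quartiles, and max of a one-dimensional sample."""
    values = np.asarray(values, dtype=float)
    # Percentiles 0, 25, 50, 75, 100 give min, Q1, median, Q3, max.
    minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
    return {"min": minimum, "q1": q1, "median": median, "q3": q3, "max": maximum}

# Hypothetical PM2.5 readings (made-up values).
pm25 = [14.1, 11.8, 15.3, 13.2, 12.9, 16.0,
        10.7, 14.8, 13.5, 15.1, 12.2, 14.4]

summary = five_number_summary(pm25)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

If the median of a sample like this sat well above a threshold of interest, for instance, that would suggest a one-sided "greater than" alternative hypothesis is worth testing.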

Throughout this lesson, hypothesis testing will be conducted via randomization distributions. Besides allowing you to practice your Python coding, this simulation-based approach provides a more intuitive understanding of what a p-value means and how it is determined than the method traditionally presented in a statistics class. Furthermore, the randomization distribution approach is flexible and adapts readily to statistics other than the means and proportions we'll cover in this lesson.

This lesson will start by going over the guidelines for formulating your hypotheses, and then will introduce the concept of randomization distributions. Data from the Texas Power Crisis of 2021 will be used to illustrate several common varieties of hypothesis tests, and the lesson will conclude with some important considerations regarding sample size, significance, and multiple testing.