
Formulating and Writing Hypotheses
After exploring and visualizing your data, the first step in hypothesis testing is to write the hypotheses you will test. Visualizing and summarizing your data help in formulating these hypotheses, in particular in choosing the appropriate inequality (<, >, or ≠) for the alternative hypothesis. We can break the formulation of hypotheses into the following steps.
Three Steps to Writing a Hypothesis
- Write your question in plain language. Based on the data at hand and the prior exploratory data analysis (e.g., visualization, summarizing), what question are you seeking to answer? An example might be: “Are PM2.5 concentrations unsafe in certain areas?” It is important to frame the question so that it can be answered by the data you have. For instance, you wouldn’t want to ask a question about all pollution when you only have data on particulate matter concentration.
- Write your hypothesis in plain language. The goal in this step is to take a stance on the answer to the question posed in Step 1. Following the example given above, a viable hypothesis is: “Yes, PM2.5 concentrations are at unsafe levels.” This statement doesn’t need to be correct, since the rest of the hypothesis testing procedure will test this statement. In particular, it will tell you whether or not the data support this statement.
- Write your statistical hypotheses. The task here is to convert your plain-language hypothesis from Step 2 into statistical notation. Specifically, it needs to be written in terms of the appropriate population parameter for your question. For example, to say whether PM2.5 concentrations are safe or unsafe, we can look at the mean ($\mu$), since this captures the general, average PM2.5 conditions. Furthermore, it needs to be written as two different hypotheses:
  - The null hypothesis (or just “null” for short). Here you are stating that the population parameter for your variable of interest is equal to a specific value (single-sample tests) or to another population parameter (two-sample tests). Again, this is a hypothesis, so it does not need to be correct; this is what will be tested. The null hypothesis should contradict your hypothesis from Step 2. So, for example, with the EPA defining the threshold between safe and unsafe PM2.5 levels as 12 micrograms per cubic meter (40 C.F.R. § 50.18 (2013)), our null hypothesis would be $H_0: \mu = 12$. (You might be wondering why ≤ is not used instead of =, since safe levels would be anything at or below 12 micrograms per cubic meter. Indeed, many statistics texts introduce the null hypothesis this way, and either way is allowable. However, when using ≤ (or ≥ in other situations), you end up only testing the = portion of this statement anyway.)
  - The alternative hypothesis (“alternative” for short). Here you replace the equality in the null hypothesis with an inequality (>, <, or ≠). The correct choice is the one that aligns with your hypothesis in Step 2. So, following the example above, our alternative hypothesis is $H_a: \mu > 12$, because any average PM2.5 concentration exceeding 12 micrograms per cubic meter is unsafe.
The subsequent hypothesis testing procedure will give us a quantitative metric (the p-value) for judging whether or not the data support the alternative hypothesis. The conclusion is framed as one of the following:
- We reject the null hypothesis in favor of the alternative hypothesis, or
- We fail to reject the null hypothesis.
Note that the focus here is on rejecting or not rejecting the null hypothesis. This is because the null contains the equality statement, and thus something definitive to test. However, as we will see later, this null is tested against the alternative, which is why in the first option we can say “in favor of the alternative”.
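To make the PM2.5 example concrete, here is a minimal sketch of how the null hypothesis (mean equal to 12) and the alternative hypothesis (mean greater than 12) could be tested in Python. Several things here are assumptions and not part of the lesson so far: the readings are made up, the 0.05 significance level is a common convention, the helper name `one_sided_mean_test` is hypothetical, and a one-sample t-test is only one possible choice of test. It relies on `scipy.stats.ttest_1samp`, whose `alternative` argument requires SciPy 1.6 or later.

```python
from scipy import stats

# Hypothetical PM2.5 readings in micrograms per cubic meter (made up for illustration).
pm25 = [14.2, 11.8, 13.5, 15.1, 12.9, 13.8, 12.2, 14.6, 13.1, 12.7]

THRESHOLD = 12.0  # value of the population mean under the null hypothesis
ALPHA = 0.05      # assumed significance level; not specified in the text


def one_sided_mean_test(sample, mu0, alpha):
    """Test H0: mu = mu0 against Ha: mu > mu0 with a one-sample t-test."""
    # alternative="greater" encodes the ">" in the alternative hypothesis.
    result = stats.ttest_1samp(sample, popmean=mu0, alternative="greater")
    if result.pvalue < alpha:
        decision = "reject the null hypothesis in favor of the alternative"
    else:
        decision = "fail to reject the null hypothesis"
    return result.statistic, result.pvalue, decision


t_stat, p_value, decision = one_sided_mean_test(pm25, THRESHOLD, ALPHA)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}: {decision}")
```

Changing `alternative` to `"less"` or `"two-sided"` would correspond to using < or ≠ in the alternative hypothesis, while the null hypothesis keeps the same equality.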
Parameters
Because our statistical hypotheses need to be written in terms of population parameters, let's review some common parameters in the table below:
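Quantity | Population Parameter | Sample Statistic
---|---|---
Mean | $\mu$ | $\bar{x}$
Proportion | $p$ | $\hat{p}$
Standard Deviation | $\sigma$ | $s$
Correlation | $\rho$ | $r$
Slope (linear regression) | $\beta_1$ | $b_1$
Difference of Means | $\mu_1 - \mu_2$ | $\bar{x}_1 - \bar{x}_2$
Difference of Proportions | $p_1 - p_2$ | $\hat{p}_1 - \hat{p}_2$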
This lesson covers tests for only some of these parameters; correlation and slope are covered in Linear Regression.