EME 210
Data Analytics for Energy Systems

Significance Level & Multiple Testing



Read It: Significance Level

Why do we use 0.05 as the threshold for deciding if a p-value is low enough to reject the null hypothesis? Well... we don't have to!

"0.05" is the default value of the significance level, noted with α. Other typical values it can take are 0.10 and 0.01. Any p-value that falls below your chosen significance level will indicate statistically-significant evidence against the null hypothesis. 

A smaller significance level will require stronger evidence to reject the null hypothesis.

In other words, all else being equal, the sample statistic must fall further from the center of the randomization distribution in order to reject the null.
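To make this concrete, the sketch below (with hypothetical numbers, not data from this course) simulates a null randomization distribution of sample proportions and prints the two-tailed rejection cutoffs for several significance levels. Notice how the cutoffs move further from the center of the distribution as α shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

# Null randomization distribution: sample proportions from 1000 simulated
# samples of n = 100 under H0: p = 0.5 (illustrative values only)
null_props = rng.binomial(n=100, p=0.5, size=1000) / 100

# Two-tailed rejection cutoffs for each significance level
for alpha in (0.10, 0.05, 0.01):
    lo, hi = np.percentile(null_props, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    print(f"alpha = {alpha:.2f}: reject if statistic < {lo:.2f} or > {hi:.2f}")
```

The cutoffs for α = 0.01 sit outside those for α = 0.10, illustrating that a smaller significance level demands a more extreme sample statistic.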

Read It: Multiple Testing

The significance level also represents the probability of committing a "Type I" error with a hypothesis test: falsely rejecting the null hypothesis when, in actuality, the null hypothesis is true. For example, take our default of α = 0.05. A randomization distribution shows what sample statistics to expect when the null hypothesis is true. So, data could arise, under the condition of the null being true, that yield a sample statistic falling in the lower-most or upper-most 2.5% of the randomization distribution. The probability of this happening is 0.05, which is the significance level in this example.


 Try It: A Numerical Experiment for Multiple Testing

The DataCamp below has code for running multiple experiments. Each experiment tests whether we have a normal deck of playing cards, insofar as it contains 26 red cards and 26 black cards (in other words, 50% of each color). The population here is the normal deck of playing cards, and each experiment randomly draws a sample of n = 9 cards from the deck, with replacement (so, putting each card back into the deck after it is drawn). Let's say that our proportion of interest, p, stands for the proportion of red cards. Thus, our hypotheses are:

  • Ho: p = 0.5
  • Ha: p ≠ 0.5

For each experiment, we then test the hypothesis using a randomization distribution, find the p-value for that test, and record it in the object "p-vals". Run the code a few times. In each run, what proportion of p-values is less than the default significance level of 0.05? For those p-values, the conclusion is to reject the null hypothesis; in other words, the experimental data support the deck NOT being normal. Of course, in this situation we have the luxury of knowing that the deck is, in fact, normal. Thus, these low p-values represent cases of Type I error, where the null hypothesis is falsely rejected.

Edit the code to adjust the significance level from 0.05 to 0.01. How many cases of Type I error do you have now? 
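If you want to experiment outside of DataCamp, the sketch below reproduces the spirit of this exercise (it is a hypothetical re-implementation, not the exact DataCamp code; the number of experiments and randomization samples are assumed values). Each experiment draws 9 cards with replacement from a fair deck, builds a randomization distribution under H₀: p = 0.5, and records a two-tailed p-value:

```python
import numpy as np

rng = np.random.default_rng(1)
n_experiments = 100  # number of repeated experiments (assumed value)
n_cards = 9          # cards drawn per experiment, with replacement
n_perm = 1000        # randomization samples per test (assumed value)

p_vals = np.empty(n_experiments)
for i in range(n_experiments):
    # Sample proportion of red cards; each draw is red with probability 0.5
    obs_prop = rng.binomial(n_cards, 0.5) / n_cards
    # Randomization distribution of sample proportions under H0: p = 0.5
    null_props = rng.binomial(n_cards, 0.5, size=n_perm) / n_cards
    # Two-tailed p-value: double the smaller tail, capped at 1
    lower = np.mean(null_props <= obs_prop)
    upper = np.mean(null_props >= obs_prop)
    p_vals[i] = min(1.0, 2 * min(lower, upper))

# Count Type I errors (the deck really is normal, so every rejection is one)
for alpha in (0.05, 0.01):
    print(f"Type I errors at alpha = {alpha}: {np.sum(p_vals < alpha)}")
```

Because the deck truly is fair, every rejection counted here is a Type I error, and lowering α from 0.05 to 0.01 should reduce (or leave unchanged) that count.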

Note on the Code: "if" Statements

The code above contains an example of an "if" statement, used to find the smaller tail when calculating the p-value of the two-tailed test. The general structure of an "if" statement is:

if LOGICAL CONDITION:
   CODE TO RUN IF LOGICAL CONDITION IS MET
else:
   CODE TO RUN IF LOGICAL CONDITION IS NOT MET
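For instance, here is a hypothetical snippet (not the exact DataCamp code) showing how an "if" statement can pick the smaller tail for a two-tailed p-value, using a tiny made-up randomization distribution:

```python
import numpy as np

# Hypothetical randomization distribution and observed statistic
null_props = np.array([0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 0.6, 0.7])
obs_stat = 0.7

# Proportion of the null distribution in each tail
lower_tail = np.mean(null_props <= obs_stat)
upper_tail = np.mean(null_props >= obs_stat)

# Double the smaller tail to get the two-tailed p-value
if lower_tail < upper_tail:
    p_value = 2 * lower_tail
else:
    p_value = 2 * upper_tail

print(p_value)  # 0.25: the upper tail (1/8) is smaller, doubled to 2/8
```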

One could have multiple lines of code in the CODE TO RUN spaces. This can be extended to include other logical conditions:

if FIRST LOGICAL CONDITION:
   CODE TO RUN IF FIRST LOGICAL CONDITION IS MET
elif SECOND LOGICAL CONDITION:
   CODE TO RUN IF SECOND LOGICAL CONDITION IS MET
else:
   CODE TO RUN IF NEITHER LOGICAL CONDITION IS MET

Furthermore, the "else" statement at the end is optional; you could choose to run no code if the logical condition(s) is/are not met.