
06. Power Analysis

Updated on February 26, 2025
How reliable is a statistical test? After the unsettling conclusion of the previous lesson, we are now left with the realisation that the p-value can be very misleading. Thus far, we have been controlling the likelihood of a type-I error through the significance level. However, there is also a type-II error that appears to be the source of our problem. In this lesson, we will learn about the statistical power of our test, how to assess it, and learn about factors that affect it.

Learning Goals #

  • Understand that a p-value by itself is insufficient to evaluate a statistical test.
  • Calculate the number of repeated experiments needed to limit the probability of a false negative.
  • Evaluate the power of a statistical test based on the effect size, sample size and significance level.
  • Discuss the ethical implications of the choice of hypotheses in a statistical test.
Analytical Separation Science by B.W.J. Pirok and P.J. Schoenmakers
READ SECTION 9.3.5

POWER ANALYSIS

1. Type-II error #

We touched upon the significance level in Lesson 4, but never really addressed the different types of errors and what they mean. The previous lesson demonstrated that the p-value itself is not very insightful in assessing these errors either. It is therefore due time that we dive into the types of errors that may occur. We start by once more regarding the confusion matrix in Figure 1.
EXERCISE 1

Take another look at the confusion matrix and examine the figure.

Figure 1. The so-called confusion matrix shows the relation between the hypotheses, the two types of errors that can be made, and several definitions for the true negative and true positive.

On one hand we have the significance level, \alpha, which is the probability of a type-I error: a false positive, where we reject H_0 while it was true. On the other hand we have the false negative, where we wrongly accept H_0 while it was in fact false. This is a type-II error, for which the probability is \beta. We speak of the statistical power of a method, or statistical sensitivity, as the probability of a true positive (correctly finding an effect), which equals 1 - \beta.
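To make these definitions concrete, the following simulation sketch (not from the book; the effect size, sample size and seed are all illustrative choices of ours) draws many synthetic two-sample experiments and counts how often a t-test commits each type of error.

```python
# A minimal simulation sketch: estimate alpha and beta for a two-sample
# t-test by repeating many synthetic experiments. All numbers here
# (effect of 1 standard deviation, n = 12 per group) are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, alpha, n_sim = 12, 0.05, 20_000
shift = 1.0  # hypothetical true effect under H1, in units of sigma

false_pos = false_neg = 0
for _ in range(n_sim):
    a = rng.normal(0.0, 1.0, n)
    b_null = rng.normal(0.0, 1.0, n)   # H0 is true: same mean
    b_alt = rng.normal(shift, 1.0, n)  # H1 is true: means differ
    if stats.ttest_ind(a, b_null).pvalue < alpha:
        false_pos += 1                 # type-I error: rejected a true H0
    if stats.ttest_ind(a, b_alt).pvalue >= alpha:
        false_neg += 1                 # type-II error: kept a false H0

print(f"estimated alpha ~ {false_pos / n_sim:.3f}")  # close to 0.05
print(f"estimated beta  ~ {false_neg / n_sim:.3f}")  # power = 1 - beta
```

The estimated \alpha sits close to the chosen significance level by construction, whereas \beta depends on the sample size and effect size, which is exactly what the rest of this lesson explores.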

A scientist is testing whether the use of 0.1% trifluoroacetic acid (TFA) as additive to the buffer significantly affects retention. The scientist wrongly concludes that there is no effect. What type of error is this?

The type-II error can be better understood by returning to the probability distributions we have been plotting. The usual t-distribution that we have plotted thus far always assumes that H_0 is true. Similarly, there also exists a distribution that describes the test statistic when the alternative hypothesis H_1 is true. This is illustrated in Figure 2.

Figure 2. Power analysis of the non-matched pairs comparison from Section 4 of the previous lesson reveals a huge likelihood of a type-II error (\beta).

Figure 2 dramatically reveals the importance of the type-II error, \beta. The data displays the situation of the non-matched pairs comparison at the end of the previous lesson (Table 9.4), which is shown again below as Table 1. In that lesson, the conclusion was that H_0 was accepted, whereas the matched-pairs design revealed that it should have been rejected. Figure 2 clearly shows the high likelihood of a type-II error, which supports these conclusions. We can now understand why the p-value of 0.5 was of little use: the probability of making a type-II mistake is 0.89, that is, 89%!

Table 1. Summary of measured concentrations of a compound in water using RPLC before and after heating of the samples. The table is identical to the final table from the previous lesson and to Table 9.4 from the book.
                 n    \bar{x}    s
Before heating   12   21.95      2.699
After heating    12   22.65      2.283
Difference       12   -0.81      0.604

What is happening? Next to the PDF that applies when H_0 is true, we now also see the PDF that applies when H_1 is true. In this case, the test statistic under H_1 follows a non-central t-distribution, the mathematical details of which are beyond the scope of this course.
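Even so, the 89% figure can be verified, because the non-central t-distribution is available in scipy. A minimal sketch, assuming the pooled standard deviation formula for equal group sizes; small rounding differences from the value quoted above are possible.

```python
# Sketch: reproduce beta for the non-matched comparison of Table 1 using
# the non-central t-distribution in scipy.
import numpy as np
from scipy import stats

n = 12                                   # measurements per group
x1, s1 = 21.95, 2.699                    # before heating (Table 1)
x2, s2 = 22.65, 2.283                    # after heating (Table 1)
alpha = 0.05

s_pooled = np.sqrt((s1**2 + s2**2) / 2)  # pooled s for equal group sizes
d = abs(x2 - x1) / s_pooled              # effect size, roughly 0.28
ncp = d * np.sqrt(n / 2)                 # non-centrality parameter
df = 2 * n - 2

t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value under H0
# Power is the mass of the H1 distribution beyond the critical limits;
# beta is everything in between.
power = stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)
print(f"power ~ {power:.2f}, beta ~ {1 - power:.2f}")
```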

2. Factors affecting statistical power #

We have learned earlier that the area under the H_0 curve that is more extreme than the critical limits (e.g. t_{\text{crit}}) is equal to the type-I error. Similarly, the area under the H_1 curve that is less extreme than the critical limits is equal to \beta.

In other words, the area of the H_1 distribution that lies between the critical limits is equal to \beta. The more the two distributions overlap, the higher \beta will be. Therefore, we want these two distributions to be far apart from each other. We will now focus on the factors that affect this.

2.1. Sample size #

An important factor is the sample size. We already noted during Lesson 3 that a higher number of repeated measurements reduces the standard deviation of the mean and therefore narrows the confidence interval. This is in line with our intuitive notion that the more we measure, the more confident we can be that our understanding of the situation is correct. We can see this in Figure 3, where a high sample size strongly reduces \beta.

Figure 3. Effect of sample size on \beta when effect size and \alpha are kept constant, with n = 50 (top) and n = 5 (bottom).
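The trend of Figure 3 can also be reproduced numerically. A sketch using statsmodels (our tool choice here; G*Power computes the same quantity), with the effect size d ≈ 0.28 derived from the Table 1 comparison:

```python
# Sketch: power as a function of sample size at fixed d and alpha.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n in (5, 12, 50, 100, 300):
    p = analysis.power(effect_size=0.28, nobs1=n, alpha=0.05,
                       ratio=1.0, alternative='two-sided')
    print(f"n = {n:>3} per group -> power = {p:.2f}")
```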

2.2. Effect size #

The second important factor is the effect size. The effect size depicts the magnitude of the effect that causes the difference, irrespective of the sample size. It is given by

Equation 9.32: d=\frac{|\mu-\mu_0|}{\sigma}

The reference \mu_0 can also be the mean of the second sample in a comparison of two means. Figure 4 shows that the effect size strongly affects \beta.
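As a worked example, the effect sizes of the two designs in Table 1 can be computed directly. A sketch, in which using the pooled standard deviation for the non-matched case is our assumption for equal group sizes:

```python
# Worked sketch: effect sizes for the two designs of Table 1. For the
# matched-pairs design, sigma is the standard deviation of the
# differences (this variant of d is often called d_z).
import numpy as np

d_unmatched = abs(22.65 - 21.95) / np.sqrt((2.699**2 + 2.283**2) / 2)
d_matched = abs(-0.81) / 0.604

print(f"non-matched d ~ {d_unmatched:.2f} (small)")  # ~ 0.28
print(f"matched d     ~ {d_matched:.2f} (large)")    # ~ 1.34
```

The matched-pairs design removes the large sample-to-sample variation, which is why its effect size is so much larger and why it detected the effect that the non-matched comparison missed.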

Figure 4. Effect of the effect size on \beta when sample size and \alpha are kept constant, with d = 0.2 (top) and d = 1 (bottom).

One way to picture this is shown in Figure 5. In Figure 5A, the beakers are very different in that one is blue and one is pink. In this case, the difference is so profound (i.e. the effect size is large) that we do not need a statistical test to discern them. One could argue that the distributions for H_0 being true and for H_1 being true are very far away from each other.

In this light, Figure 5B expresses a much smaller difference between the two beakers, and therefore the effect size is smaller.

Figure 5. When comparing the two beakers in panel A, we do not need a statistical test to tell that they are different; the effect size here is large. For the case of panel B the difference is less profound, so the effect size is small.

In essence, the effect size expresses the difference between the two datasets. If the difference between the two situations is very small, it is more difficult for a statistical test to discern between them; such a situation would have a d of around 0.2, which is considered small. If the situations are very different from each other, it is much easier for the statistical test to capture this; a d of 1.0 is considered large.

2.3. Significance level #

The third and final factor is the significance level, as the critical value set by \alpha also acts as the critical value for \beta. Figure 6 illustrates this relation clearly. Unfortunately, a smaller \alpha translates directly into a larger \beta.

Figure 6. Effect of the significance level on the probability of a type-II error when the sample size and effect size are kept constant.
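Numerically, the trade-off looks as follows. A sketch that keeps n = 12 per group and d = 0.28 from the running example and only varies \alpha:

```python
# Sketch: lowering alpha raises beta when nothing else changes.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.10, 0.05, 0.01):
    p = analysis.power(effect_size=0.28, nobs1=12, alpha=alpha)
    print(f"alpha = {alpha:.2f} -> power = {p:.2f}, beta = {1 - p:.2f}")
```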


3. Optimising experiments with power analysis #

It is certainly useful to be aware of the type-II error, but are there other benefits? Well, yes! Power analysis is useful when designing or evaluating an experiment. The previous section showed that four factors are related: significance level, sample size, effect size and statistical power. If we know three of them, we can calculate the fourth. This allows us to answer various questions.
  • How many repetitions do I need to do?
  • Did I have a sufficiently high number of repetitions to study the effect I was interested in?
There are four types of power analysis, depending on which quantity you want to solve for.
  1. A priori: Compute the sample size when designing a study. This gives the sample size required to detect a given effect size at the chosen significance level (a sketch follows after this list).
  2. Post-hoc: Compute the statistical power after completing a study. This tells you whether you had a sufficient number of subjects to detect the effect you actually found.
  3. Criterion: Compute \alpha. This provides the probability of a type-I error. We rarely run this; we will discuss why shortly.
  4. Sensitivity: Compute the effect size when the sample size is predetermined by study constraints. For example, if only 10 subjects are available, this analysis shows what level of effect we can still detect.
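As a minimal sketch of an a priori analysis (type 1 above), statsmodels can solve for the required sample size directly, here for the small effect d = 0.28 from Table 1:

```python
# Sketch of an a priori power analysis: solve for the sample size that
# reaches 90% power for d = 0.28 at alpha = 0.05.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.28, alpha=0.05,
                                          power=0.90, ratio=1.0,
                                          alternative='two-sided')
print(f"required n ~ {n_per_group:.0f} per group")  # several hundred total
```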

4. Ethics and hypotheses #

At this stage we must touch upon an ethical point that is connected to statistics. You may have noticed that in Lesson 4 we determined that "no effect" is always encompassed by H_0, whereas H_1 represents that there is an effect. One could wonder what would happen if we turned this around. This would have serious consequences for the meaning of \alpha and \beta.

4.1. Reject-support testing #

Our gold standard is reject-support testing, where H_0 is the opposite of the effect that a scientist is investigating. It is rooted in the deductive standard of falsifiability introduced by Karl Popper: a hypothesis is falsifiable if it can be logically contradicted by an empirical test.

One could argue that, as a society, we want to keep \alpha low, because a false positive will induce further studies to investigate an effect that in reality does not exist, which is a waste of resources.

In this context, a false negative just means that an effect that did exist was not picked up: a missed opportunity. A disadvantage is that too much power renders trivial effects highly significant.

In the middle of the night, a large vessel is sailing over the Atlantic Ocean. The captain verifies that his spectacles are spotless before he peers into the endless distance of ocean once more. “Was that an iceberg?”, he wonders. After serious deliberation, the captain relaxes into the back of his seat and concludes there is no risk ahead for his majestic ship. Five minutes later, the ship would crash into the iceberg.

What type of mistake did the captain just make? First try to intuitively choose which of the two answers feels correct. Only then draw up hypotheses as reject-support testing and see if this matches.

In a forensic analogy, the false positive (\alpha) is sending an innocent person to jail, and the false negative (\beta) is not convicting a guilty person. We ourselves determine \alpha prior to the test, and can thus strictly control it, whereas \beta we can control by pouring enough money into the investigation (sample size). In other words, the researcher is forced to conduct the study well. Reject-support testing can thus be compared to the principle of "innocent until proven guilty".

REJECT SUPPORT (POPULAR)
  • Rooted in the scientific principle of falsifiability.
  • H_0 is the opposite of the researcher's theory.
  • Rejecting H_0 supports the researcher's theory.
  • \alpha = false positive: theory (H_1) incorrectly accepted. Society does not like this: a waste of resources.
  • \beta = false negative: theory incorrectly rejected. The researcher does not like this.
ACCEPT SUPPORT
  • H_0 is the theory of the researcher.
  • Accepting H_0 supports the researcher's theory.
  • \alpha = false negative for the researcher's theory.
  • \beta = false positive for the researcher's theory.
Figure 7. Comparison of reject-support and accept-support testing.

4.2. Accept-support testing #

In accept-support testing, the null hypothesis is in agreement with the theory of the researcher. Here, \alpha means that the theory is falsely rejected. The scientist can prevent this by lowering \alpha prior to the study. In contrast, \beta is now a false positive: an incorrectly accepted theory may lead to useless follow-up studies. Society cannot do anything about this, because the scientist determined the fate of \beta by choosing the sample size.

5. Calculating statistical power #

Power analysis is certainly possible using the tools that we have been using thus far in this course. However, based on our experience in class, it is much easier for students to use G*Power.

G*Power

In this course we recommend the use of G*Power for power analysis. The program is created and maintained by the Heinrich Heine Universität Düsseldorf. It is freely available for both personal and commercial use.

The graphical user interface offers an accessible method to quickly calculate the statistical power. Figure 8 shows the most important functions.

Figure 8. Overview of the G*Power interface. Note that depending on your operating system and system settings the window may look slightly different.

There is an additional side panel for the calculation of the effect size, which is shown in Figure 9.

Figure 9. Effect size calculation interface.
EXERCISE 5

Regard the data in Table 1 (Table 9.4 in the book). Assume that \alpha is 0.05. Use G*Power to demonstrate that:

  1. The matched-pairs treatment of the data yields a statistical power of more than 0.98.
  2. For the non-matched pairs comparison, a total sample size of over 500 would be needed to achieve a statistical power of 0.9.
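If you would like to cross-check your G*Power results, the following sketch performs the same two calculations with statsmodels, using the effect sizes derived from Table 1 earlier in this lesson; expect small rounding differences.

```python
# Sketch to cross-check the G*Power results of the exercise.
from statsmodels.stats.power import TTestIndPower, TTestPower

# 1. Matched pairs: d_z = 0.81 / 0.604 with n = 12 pairs.
p_matched = TTestPower().power(effect_size=0.81 / 0.604, nobs=12,
                               alpha=0.05, alternative='two-sided')
print(f"matched-pairs power ~ {p_matched:.3f}")  # above 0.98

# 2. Non-matched: total sample size for 90% power at d = 0.28.
n1 = TTestIndPower().solve_power(effect_size=0.28, alpha=0.05, power=0.90)
print(f"total sample size ~ {2 * n1:.0f}")       # above 500
```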

Concluding remarks #

We have learned that power analysis is a quick method to assess whether our statistical test carries a serious probability of a type-II mistake. We can even use it to calculate the number of experiments needed to achieve a certain statistical power. It is recommended to always accompany a p-value with the complementary \beta-value.
