Introduction
Power calculations are used to find the probability that your experiment will detect an effect (or difference) in the world, if it exists.
For example, an experiment might test behavioral differences between men and women on a measure of aggression. If there is an effect of gender on the measure of aggression (e.g. men are more aggressive than women), then there is a certain probability that the experiment will reveal this difference. Of course, it is possible that the experiment might not show a difference even though it exists because of the variability within the groups (or the error). In other words, when we take the samples of men and women, there is a possibility that the samples might contain unusually un-aggressive men or unusually aggressive women. Effects of variability are best understood by considering how the mean distributions of the two populations (i.e. "men" and "women" in this case) overlap.
The best way to understand power is to examine how the populations' distributions overlap, and what the overlap implies.
Consider the differences between men and women on a measure of aggression in the example above. Suppose the average score for a population of men is 102 (top graph, dist 2 in red) and the average score for a population of women is 100 (top graph, dist 1 in green).
Display of two population score and mean distributions
Note that the score distributions (upper display) overlap a lot more than the mean distributions (middle display).
This is always the case because the mean distributions are skinnier by the square root of sample size (according to the Central Limit Theorem). The bottom graph is the same as the middle graph, but adjusted to show the overlap in terms of standard errors. The bottom graph is crucial for understanding what we mean by statistical power.
Power is defined to be the probability of selecting a mean from Mean Distribution 2 and having that mean be above the critical value for distinguishing it from Mean Distribution 1.
In other words, if we randomly pick a mean from Dist. 2, how likely is it that we will reject the null hypothesis of Distribution 1? In geometric terms, the power is the area in red which represents all possible means from Dist. 2 that are above the critical value of Dist. 1.
Importantly, the power calculations on this website assume that we know the population standard deviation.
This allows us to use the Z distribution and greatly simplifies the power calculation. Without this assumption, we would most likely be dealing with t-distributions. This complicates the calculation of power because each t-distribution has a distinct shape depending on the degrees of freedom. With real data, this assumption often cannot be made. However, the simplification of the calculation is worthwhile in the current context because we are primarily concerned with getting familiar with the basic ideas behind power analysis, not in developing a practical skill. Another simplification we make is that the power calculations on this website are limited to the case where one mean is compared against another. For understanding power calculations for 3 or more groups (e.g. ANOVA) or two-way ANOVAs, the reader should refer to more advanced statistical texts.
Estimating the power involves assumptions about the reality of what the experiment will try to measure. In real experiments, this information is not known but might be guessed at by examining other similar experiments, or by reasoning about the sources of variability.
To some students, power analysis seems strange because it relies on assumed facts, such as the average scores of groups, and these facts are exactly what the experiment was intended to measure. Power analysis is useful though, because in many cases an experimenter can make very good guesses about how an experiment will probably turn out if their hypothesis is true. In this case, it is important to design the experiment so that it has a chance of detecting the effect, and also to not waste time making an experiment much more sensitive than it has to be (if, for example, the size of the effect is very large). It is also useful to understand power because it ties together many other important concepts such as mean distributions, critical values, directionality, effect size and distribution variability.
Factors Affecting Power
As mentioned in the Hypothesis Testing section of this site, power is affected by many factors, including:
Effect size. This is the distance between the means of the two populations. For example, if women score 10 points higher than men on a standardized test, then the effect size is 10 points.
Standard deviation of scores in the population. This is the width of the score distribution of the two populations. For ease of calculation, the power analyses presented here assume that both populations have the same standard deviation.
Number of tails of hypothesis test. This is determined by the previous evidence. If there is no previous evidence or if the previous evidence does not suggest an outcome in a particular direction, then the hypothesis test would use two tails. If the previous evidence suggested an outcome in a particular direction, then a one-tailed hypothesis would be appropriate.
Alpha level. This is the type I error rate that is set by the experimenter. By convention, alpha is usually set to .05.
Sample size. This is the size of the sample used in the hypothesis test.
Before we examine how each of these factors affects power, it is important to note that the experimenter has limited control over most of them. Consider the table below, which orders the factors by how much control the experimenter has over each one.
Factor
How much control does the experimenter have?
Sample Size
This is the factor the experimenter has the most control over. Limited only by resources, the experimenter's easiest way to increase power is to increase the sample size.
Standard Deviation of Population Scores
The experimenter may have some control over this factor by measuring variables in a way that accurately mirrors the psychological process. Usually, more measurements per participant yield more accurate assessments of the underlying psychological process, but resource limitations may prevent experimenters from creating the variable with the most accurate measurement.
Effect Size
The experimenter has no control over the size of the differences between the populations.
Alpha Level
Convention dictates that .05 is a standard alpha level. An experimenter can be more conservative and set the alpha level to .01, but this is rare.
Number of Tails
Almost all hypothesis tests published in journals are two-tailed, even when a one-tailed hypothesis might be appropriate (because many researchers are suspicious of one-tailed hypothesis tests).
 
Experimenter Control over Factors Affecting Power
To understand the factors that affect power, we need to remember that power is the probability of correctly rejecting the null hypothesis. This is best understood in visual terms by realizing that the power of an experiment is the probability that a mean drawn from one population's mean distribution will be above the critical value of another mean distribution. In other words, the red area below is the power of the experiment, where a sample mean is taken from the red mean distribution and compared to the green mean distribution -- sometimes the sample mean will be above the critical value (the area in red) and sometimes the sample mean will be below the critical value.
Overlap of Two Mean Distributions
In the display above, we can see the power is a little more than 0.5, because a little more than 1/2 the time, a sample mean from the red distribution will be above the critical value of the green mean distribution.
Three of the five power factors involve how the mean distributions overlap.
When the mean distributions overlap very little, the power is likely to be high because most of the second distribution will be above the critical value of the first distribution.
The effect size. This is the distance between the means of the two populations. The central limit theorem tells us that the mean distributions have the same means as the score distributions, so the effect size is also the distance between the means of the mean distributions.
The standard deviation of the score distributions. From the central limit theorem, we know that the width of the mean distributions (the standard error) is determined by the standard deviation of the scores and the sample size, so any change in the standard deviation of the scores will cause the standard error to change as well.
The sample size. From the central limit theorem, we know that the width of the mean distributions (the standard error) is determined by the standard deviation of the scores and the sample size, so any change in the sample size will cause the standard error to change as well.
After we have determined how the mean distributions overlap, we need to know where the critical value is on the first distribution. Determining the critical value is straightforward -- we simply look it up in the z table using the alpha level and the number of tails.
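Where a z table isn't handy, the same lookup can be done with an inverse normal CDF. Here is a minimal sketch using Python's standard library (not part of the original text); the rounded results match the z-table values used throughout this site:

```python
from statistics import NormalDist

def critical_value(alpha, tails):
    """Upper critical z value for a given alpha level and number of tails."""
    # A two-tailed test splits alpha across both tails, so only alpha/2 sits above the critical value.
    return NormalDist().inv_cdf(1 - alpha / tails)

print(round(critical_value(0.05, 2), 2))  # 1.96 (two tails, alpha = .05)
print(round(critical_value(0.01, 2), 2))  # 2.58 (two tails, alpha = .01)
print(round(critical_value(0.01, 1), 2))  # 2.33 (one tail,  alpha = .01)
```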
Before we examine how each of these factors affects power, let's summarize the factors in the two groupings that make them easier to remember.
Factor Group
Power Factors
Mean Distribution Overlap
From the central limit theorem, we know that the width of the mean distributions (the standard error) is determined by the standard deviation of the scores and the sample size, so any change to either of these factors will affect the standard error. But the overlap is affected not only by the standard error, but also by the distance between the mean distributions -- which is the effect size.
Critical Value
The alpha level and the number of tails determine the critical value using the z table lookup procedure.
 
Grouping of Factors Affecting Power
What happens if the sample size increases in an experiment?
To see how this would affect power, see the display below which will be used for all of the examples of how factors affect power.
Power Display: Sample Size Increase
The top panel shows the distribution of scores for two different populations. Remember that a power calculation requires that there are two different populations because the power calculation assumes that we are sampling a mean from one distribution and determining the probability that the mean will be above the critical value of the other population mean distribution.
The second panel shows the mean distribution resulting from samples of size 16. Notice how the mean distribution is 4 times skinnier: the Central Limit Theorem tells us that the mean distribution is √16 = 4 times skinnier than the score distribution.
The bottom panel shows us what happens when we use a larger sample size. In this case, the sample size has increased from 16 to 64 (i.e., by a factor of 4). From the Central Limit Theorem, we know that a four-fold increase in sample size will produce mean distributions that are twice as skinny. As you can see, the mean distributions remain in the same absolute place, but because they are now twice as skinny, they overlap a lot less.
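The square-root relationship is easy to check numerically. A small sketch (the standard deviation of 15 is a made-up value for illustration):

```python
from math import sqrt

sigma = 15  # hypothetical population standard deviation
for n in (16, 64):
    se = sigma / sqrt(n)  # Central Limit Theorem: standard error = sigma / sqrt(N)
    print(n, se)
# quadrupling N from 16 to 64 halves the standard error (3.75 -> 1.875)
```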
Because the mean distributions overlap much less, the probability of taking a mean from the red distribution that is above the critical value of the green distribution (i.e. the red area) rises dramatically.
What if the sample size decreases?
Power Display: Sample Size Decrease
In this case, the smaller sample size (changing from 16 to 4) produces much wider mean distributions that now overlap a lot more. The result is a much smaller area of the red distribution that is above the critical value of the green distribution -- and this red area is the definition of power, so power decreases.
What if the effect size increases?
Power Display: Effect Size Increase
If the effect size increases, then the distance between the score distributions increases, which will cause an increase in the distance between the mean distributions. If the red distribution is now further away from the green distribution, then more of it will be above the critical value, and the power will increase.
What if the effect size decreases?
Power Display: Effect Size Decrease
Of course, if the effect size decreases, then the distance between the red and green distributions of scores will decrease -- which will result in the red and green mean distributions being closer together. With the red distribution closer to the green, less of it will be above the critical value and the power will decrease.
What if the standard deviation decreases?
Power Display: Standard Deviation Decrease
If the standard deviation decreases, then the standard error will also decrease because of the Central Limit Theorem. As a result, the mean distributions will overlap less and the power will increase as more of the red distribution is above the critical value of the green distribution.
What if the standard deviation increases?
Power Display: Standard Deviation Increase
If the standard deviation increases, then the standard error will also increase because of the Central Limit Theorem. As a result, the mean distributions will overlap more and the power will decrease as less of the red distribution is above the critical value of the green distribution.
What if the alpha level decreases from the standard .05 to .01?
Power Display: Alpha Level Decrease
With a lower alpha, the critical value will increase. This means less of the red distribution lies above the critical value of the green distribution, which means -- you guessed it, a decrease in power.
What if we run a one-tailed test instead of a two-tailed test?
Power Display: Change Two-tailed Test to One-Tailed test
As you can see, a one-tailed test lowers the critical value, which places more of the red distribution above the critical value, resulting in a higher power. Of course, this assumes that the one-tailed hypothesis is in the right direction -- if we were creating a one-tailed test in the direction opposite to reality (e.g. expecting a higher mean for the red distribution when in reality the mean was lower), then our power would be essentially 0.
Finally, if we want to get fancy, we can look at two changes at the same time. Here is a display where the effect size becomes twice as small and the sample size becomes 4 times as large...
Power Display: Effect Size Smaller by 50% and N changes to 4N
As you can see, these changes offset each other and the power remains the same.
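We can verify this offsetting numerically. A sketch with made-up numbers (an effect of 4 points, a standard deviation of 10, and a starting sample of 25, assuming a two-tailed test at alpha = .05 so the critical value is 1.96): halving the effect while quadrupling the sample size leaves the effect size in standard errors, and hence the power, unchanged.

```python
from math import sqrt
from statistics import NormalDist

def upper_tail_power(effect, sigma, n, crit=1.96):
    """Power (upper tail only) of a one-sample Z test."""
    se = sigma / sqrt(n)  # standard error
    # area of the second mean distribution above the critical value
    return 1 - NormalDist().cdf(crit - effect / se)

print(round(upper_tail_power(4, 10, 25), 4))   # effect 4, N = 25
print(round(upper_tail_power(2, 10, 100), 4))  # effect halved, N quadrupled: same power
```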
Now we can summarize our factors with their influence on power
Factor
Effect on power
Sample Size
Increase in sample size will increase power
Effect Size
Increase in effect size will increase power
Standard Deviation
Decrease in standard deviation will increase power
Alpha
Decrease in alpha will decrease power
Tails
Changing from two-tailed to one-tailed will increase power (unless directionality is wrong)
 
Summary of Power Factors and Effect on Power
Calculating Power
Calculating power is not difficult if we have a good understanding of what power is -- the area of one distribution that is above the critical value of another distribution.
Ultimately, calculating power becomes a normal distribution area problem. The difficulty is in determining which area to look up in the z table. As mentioned before, this website only does power calculations for a single sample Z test because this is the simplest kind of power calculation.
Consider the display below that shows the mean distributions and their overlap with a critical value.
Power Calculation Display: Power > 0.5
The key to power calculation is to calculate the red area by determining where the critical value of the green distribution is located on the second red distribution.
In the display above, the critical value is below the mean of the red distribution, so the critical value has a negative z value on the red distribution. Because the z value is negative, the area of the red distribution above it is greater than 0.5. In fact, the total area of the red distribution above the critical value of the green would be the area in pink plus 0.5 (since the area in dark red, the entire area above the mean of the red distribution, is 0.5).
Of course, sometimes the critical value will be above the mean of the red distribution...
Power Calculation Display: Power < 0.5
In this case, because the critical value is above the mean, then the critical value of the green distribution will be a positive Z value. Therefore, the power (the red area) will be less than 0.5.
We've seen the two cases in which power is either above or below 0.5, but we still need to figure out how to calculate the exact z values so we can use the area tables to calculate the power. In order to calculate power, we will need:
μ1 = the mean of the first population (the green distributions in the displays)
μ2 = the mean of the second population (the red distributions in the displays)
σ_X = the standard deviation of the populations (assumed to be the same for both populations for simplification)
N = the sample size
α = the alpha level of the hypothesis test
The number of tails of the hypothesis test
If we don't have all six of these inputs, then we can't complete the power calculation.
Let's work through a sample problem.
Consider the following information for a power calculation (normal score distribution with μ1 known and σ of population 1 given): μ1 = 100, μ2 = 104, σ_X = 10, N = 10, Tails = 2, α = .05. Calculate the power of this experiment.
Step 1: Calculate the standard error. Power calculations are determined by the way in which mean distributions overlap, so we need to calculate the standard error given the standard deviation and the sample size. The Central Limit Theorem says:
σ_X̄ = σ_X / √N
σ_X̄ = 10 / √10 = 3.1623
Step 2: Calculate the effect size in standard errors. In order to see how much the distributions overlap, we need to determine how far apart the mean distributions are in terms of the standard error. This effect size in standard errors (Effect-Size_σX̄) is key to understanding power because it will help us determine where the critical value of the green distribution is located on the red mean distribution.
Effect-Size_σX̄ = (μ2 - μ1) / σ_X̄
Effect-Size_σX̄ = (104 - 100) / 3.1623 = 1.26
On this website, the effect size in standard errors is rounded to 2 decimal places in this step. At first, this may seem like too much rounding, but given that we will be using z values that are also rounded to 2 digits, this makes good sense.
Step 3: Calculate the critical value of Distribution 1. We need to know how far the critical value of the green distribution is above the mean of the green distribution. This is easy -- we just use our given alpha and number of tails:
For tails = 2, α = .05: critical value = 1.96
Step 4: Calculate Z_Dist2 of Critical-Value_Dist1. In this step, we determine what the z value on the red distribution will be for the critical value of the green distribution. This one is hard to say because there are so many terms in one sentence. Once we know where the critical value is on the red distribution, we can use the z table to look up the area. But this is simply a subtraction: we already know the distance between the mean of the green distribution and the critical value (that is the critical value, by definition), so we can subtract the effect size in standard errors to find the z value on the red distribution of the critical value of the green distribution.
Z_Dist2 = Critical-Value_Dist1 - Effect-Size_σX̄
Z_Dist2 = 1.96 - 1.26 = 0.70
What we've found here is that the critical value of the green distribution has a z value of 0.70 on the red distribution. The rest is easy...
Step 5: Use the z table to look up the area. Since the z value is > 0, we can use the 'area beyond' column of the z table to find the area beyond z = 0.70. When we do that, we find the area to be .242. And that's it.
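The five steps above can be collected into a short function. This is a sketch, not this site's own code: the z-table critical values are hard-coded for the common alpha/tails combinations (rounded as in the text), and the effect size is rounded to 2 decimal places to match the hand calculation.

```python
from math import sqrt
from statistics import NormalDist

# Critical values as read from a z table, rounded as in the text
CRITICAL = {(2, 0.05): 1.96, (2, 0.01): 2.58,
            (1, 0.05): 1.64, (1, 0.01): 2.33}

def power(mu1, mu2, sigma, n, alpha=0.05, tails=2):
    """One-sample Z-test power, following steps 1-5 above."""
    se = sigma / sqrt(n)                    # Step 1: standard error
    effect_se = round((mu2 - mu1) / se, 2)  # Step 2: effect size in standard errors
    crit = CRITICAL[(tails, alpha)]         # Step 3: critical value of Dist 1
    z2 = crit - effect_se                   # Step 4: z on Dist 2 of Dist 1's critical value
    return 1 - NormalDist().cdf(z2)         # Step 5: area of Dist 2 above the critical value

print(round(power(100, 104, 10, 10), 3))  # 0.242, the answer to the sample problem
```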
Here's a display of the mean distributions in the previous problem.
Mean distributions for sample problem 1
Here's a problem where the power works out to be more than 0.5.
Consider the following information for a power calculation (normal score distribution with μ1 known and σ of population 1 given): μ1 = 100, μ2 = 104, σ_X = 10, N = 100, Tails = 2, α = .01. Calculate the power of this experiment.
Answer: 0.9222
Step 1: Calculate the standard error.
σ_X̄ = σ_X / √N = 10 / √100 = 1
Step 2: Calculate the effect size in standard errors.
Effect-Size_σX̄ = (μ2 - μ1) / σ_X̄ = (104 - 100) / 1 = 4
Step 3: Calculate the critical value of Distribution 1.
Critical-Value_Dist1 = look up in z table: for tails = 2, α = .01, critical value = 2.58
Step 4: Calculate Z_Dist2 of Critical-Value_Dist1.
Z_Dist2 = Critical-Value_Dist1 - Effect-Size_σX̄ = 2.58 - 4 = -1.42
Step 5: Calculate the power. Examine the mean distributions and look up the area. Power > 0.5, so:
Power = area from z = -1.42 to 0 (0.4222) + 0.5 (the upper half of the distribution) = 0.9222
Mean Distributions for Sample Problem 2
As you can see, because the effect size is greater than the critical value, step 4 calculates a negative z value for the critical value, which means the pink area is added to 0.5 to determine the power.
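As a quick check, the step 4 and step 5 arithmetic can be reproduced with the normal CDF:

```python
from statistics import NormalDist

# area of the red distribution above z = -1.42 (critical value 2.58 minus effect size 4)
print(round(1 - NormalDist().cdf(2.58 - 4), 4))  # 0.9222
```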
Easy Questions
1. Why are power calculations easier when we know the standard deviation of the population?
Because we can use Z area tables to calculate the power. If the standard deviation had to be estimated from the sample, the calculations would involve t distributions, which vary by sample size.
2. If two mean distributions overlap very little, then what can we say about the power of an experiment?
It is likely to be high.
3. Power calculations tell us ...
The probability of correctly rejecting the Null Hypothesis
4. Which power factors affect the overlap of the mean distributions?
Effect size, sample size, and standard deviation of population scores.
5. Which power factors affect the critical value?
Alpha level and number of tails
6. Which power factor is the easiest to control?
Sample size
7. Name a power factor the experimenter has no control over.
The effect size
8. How can an experimenter have an impact on the standard deviation of scores in a population?
By creating variables that vary less.
9. What are five factors that affect the Type II error rate?
Effect Size, Sample Size, Number of Tails, Alpha value and the Standard Deviation of Scores. These are the same five factors that affect power because power = 1 - β (type II error rate)
10. The power of an experiment is 0.88 (known population standard deviation testing a single sample mean against a population mean). If everything in the experiment remained the same except that the sample size was 9 times as large and the standard deviation of the scores was 3 times as large, what would the experiment's power be?
0.88.
The power remains the same. The increase in sample size makes the mean distribution √9 = 3 times skinnier, but the increase in standard deviation makes the distribution 3 times wider. These two changes perfectly offset each other, so the power stays the same.
11. The power of an experiment is 0.73 (known population standard deviation testing a single sample mean against a population mean). If everything in the experiment remained the same, but the sample size was 4 times as small and the standard deviation of the scores was 2 times as small, what would the experiment's power be?
0.73.
The power remains the same. The decrease in sample size makes the mean distribution √4 = 2 times wider, but the decrease in standard deviation makes the distribution 2 times skinnier. These two changes perfectly offset each other, so the power stays the same.
12. Consider the following information for a power calculation (normal score distribution with μ1 known and σ of population 1 given): μ1 = 100, μ2 = 110, σ_X = 10, N = 100, Tails = 2, α = .05. What is the effect size in standard errors in this experiment?
Answer: 10
Step 1: Calculate the standard error.
σ_X̄ = σ_X / √N = 10 / √100 = 1
Step 2: Calculate the effect size in standard errors.
Effect-Size_σX̄ = (μ2 - μ1) / σ_X̄ = (110 - 100) / 1 = 10
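The two steps reduce to a couple of lines:

```python
from math import sqrt

se = 10 / sqrt(100)            # Step 1: standard error = 1.0
effect_se = (110 - 100) / se   # Step 2: effect size in standard errors
print(effect_se)  # 10.0
```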
Medium Questions
13. Consider the following information for a power calculation (normal score distribution with μ1 known and σ of population 1 given): μ1 = 100, μ2 = 105, σ_X = 36, N = 36, Tails = 2, α = .05. Calculate the power of this experiment.
Answer: 0.1292
Step 1: Calculate the standard error.
σ_X̄ = σ_X / √N = 36 / √36 = 6
Step 2: Calculate the effect size in standard errors.
Effect-Size_σX̄ = (μ2 - μ1) / σ_X̄ = (105 - 100) / 6 = 0.83
Step 3: Calculate the critical value of Distribution 1.
Critical-Value_Dist1 = look up in z table: for tails = 2, α = .05, critical value = 1.96
Step 4: Calculate Z_Dist2 of Critical-Value_Dist1.
Z_Dist2 = Critical-Value_Dist1 - Effect-Size_σX̄ = 1.96 - 0.83 = 1.13
Step 5: Calculate the power. Examine the mean distributions and look up the area. Power < 0.5, so:
Power = area above z = 1.13 = 0.1292
Answer Image
14. Consider the following information for a power calculation (normal score distribution with μ1 known and σ of population 1 given): μ1 = 100, μ2 = 105, σ_X = 36, N = 400, Tails = 1, α = .01. Calculate the power of this experiment.
Answer: 0.6736
Step 1: Calculate the standard error.
σ_X̄ = σ_X / √N = 36 / √400 = 1.8
Step 2: Calculate the effect size in standard errors.
Effect-Size_σX̄ = (μ2 - μ1) / σ_X̄ = (105 - 100) / 1.8 = 2.78
Step 3: Calculate the critical value of Distribution 1.
Critical-Value_Dist1 = look up in z table: for tails = 1, α = .01, critical value = 2.33
Step 4: Calculate Z_Dist2 of Critical-Value_Dist1.
Z_Dist2 = Critical-Value_Dist1 - Effect-Size_σX̄ = 2.33 - 2.78 = -0.45
Step 5: Calculate the power. Examine the mean distributions and look up the area. Power > 0.5, so:
Power = area from z = -0.45 to 0 (0.1736) + 0.5 (the upper half of the distribution) = 0.6736
Answer Image
Hard Questions
15. The power of an experiment is 0.5. If the effect size is 0.98 standard deviations (alpha = .05, two tails, known population standard deviation testing a single sample mean against a population mean), then how many subjects were used in the study?
4.
If power = 0.5, then half of the second distribution is above the critical value (the critical value is exactly at the middle or the mean of the second distribution). Since alpha = .05, we know that the effect size in standard errors is 1.96. The effect size in standard deviations is given, and because standard errors are related to standard deviations by the size of the sample:
Effect size = 0.98 standard deviations = 1.96 standard errors
If 0.98 standard deviations = 1.96 standard errors, then
1 standard deviation = (1.96 standard errors) / 0.98 = 2 standard errors
But remember that 1 standard deviation = √N standard errors, so:
√N = 2
N = 2² = 4
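The algebra above can be checked in a couple of lines:

```python
effect_sd = 0.98  # effect size in standard deviations (given)
effect_se = 1.96  # power of 0.5 puts the critical value at the mean of Dist 2
# 1 standard deviation = sqrt(N) standard errors, so effect_se = effect_sd * sqrt(N)
n = (effect_se / effect_sd) ** 2
print(round(n))  # 4
```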