The Power of the Chi-Square "Goodness of Fit" Test
This page will remind you why it is a terrible idea to accept the null hypothesis upon failing to find a significant result in a one-dimensional chi-square test. Suppose a researcher frames the hypothesis that four variations on a certain genetic characteristic, A, B, C, and D, exist within the muglout species in the proportions
 A = .20   B = .30   C = .30   D = .20
To test his hypothesis, he examines n=60 randomly selected muglouts and sorts each according to whether it displays variation A, B, C, or D. He then plugs his data into a chi-square "goodness of fit" test, with expected cell frequencies of 12, 18, 18, and 12 for A, B, C, and D, respectively. Finding a non-significant value of chi-square, he joyfully exclaims: "Aha! The observed pattern of frequencies does not significantly differ from the pattern stipulated by my hypothesis. It's a fit. My hypothesis is supported by the data. Tenure Heaven, here I come!"
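The chi-square statistic the researcher computes is just the sum of (observed minus expected) squared, divided by expected, across the four categories. A minimal sketch in Python, using invented observed counts for illustration (the page does not report the researcher's actual data):

```python
# Chi-square goodness-of-fit statistic for the muglout example.
# The observed counts below are hypothetical, invented for illustration;
# the expected counts are the hypothesized proportions (.20, .30, .30, .20)
# applied to a sample of n = 60.

def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [15, 15, 16, 14]   # hypothetical tallies for A, B, C, D
expected = [12, 18, 18, 12]   # 60 * (.20, .30, .30, .20)

chi2 = chi_square_statistic(observed, expected)
print(round(chi2, 3))         # 1.806 -- below the df=3 critical value of 7.815
```

With df = 3, any statistic below 7.815 is "non-significant" at the 0.05 level, which is exactly the outcome the researcher mistakes for positive support.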

The logical fallacy in this scenario is contained in the phrase "It's a fit. My hypothesis is supported by the data." The non-significant test result in this case would allow the investigator to conclude only that this particular sample provides no evidence contradicting his hypothesis. But the absence of evidence contradicting a hypothesis is not at all the same thing as positive evidence in support of a hypothesis. The point of this not-too-far-fetched scenario is that chi-square is a test of rather low power; its ability to reject the null hypothesis, even when the null hypothesis is patently false, is quite weak. And the smaller the size of the sample, the weaker it is.

In both of the following simulations, pseudo-random numbers are drawn and shaped in such a way as to ensure that the actual proportions within the imaginary muglout population are not

 A = .20   B = .30   C = .30   D = .20

but rather

 A = .25   B = .25   C = .25   D = .25

clearly a substantial difference.

In the first simulation, random samples of size n are drawn from the population one sample at a time. With df=3, the critical value of chi-square for significance at or beyond the 0.05 level is 7.815; hence, any calculated value of chi-square equal to or greater than 7.815 is recorded as "significant," while any value smaller than that is noted as "non-significant." The default value of n is set at 60 to correspond to the scenario described above. To simulate this scenario, click the "Run Simulation 1" button and note the results; then do the same thing 15 or 20 times over. And recall as you are doing all this that here is a situation where the null hypothesis is patently false. The greater the power of the test, the greater will be the percentage of results that turn up as "significant"; the lower the power, the more often you will end up with "non-significant." For a sample size as small as 60, you will find "non-significant" turning up distressingly often. Plug in a smaller sample size, and it will turn up even more often. With a larger sample size, "non-significant" will turn up less often, though even with samples as large as n=100 it comes up quite a lot more often than real-life researchers would ever want to contemplate.
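The page's first simulation is not reproduced here, but its logic can be sketched in plain Python: repeatedly draw samples of size n from the actual population (.25, .25, .25, .25), test each against the hypothesized expected frequencies, and record the fraction of chi-square values reaching 7.815. The function and seed below are assumptions for the sketch, not the page's own code:

```python
import random

def chi2(observed, expected):
    """Chi-square goodness-of-fit statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def estimate_power(n, reps=1000, seed=1):
    """Fraction of samples whose chi-square reaches the df=3, .05-level
    critical value of 7.815, when the null hypothesis is in fact false."""
    random.seed(seed)
    true_probs = [.25, .25, .25, .25]   # the actual population
    hyp_probs = [.20, .30, .30, .20]    # the researcher's hypothesis
    expected = [n * p for p in hyp_probs]
    hits = 0
    for _ in range(reps):
        picks = random.choices(range(4), weights=true_probs, k=n)
        observed = [picks.count(i) for i in range(4)]
        if chi2(observed, expected) >= 7.815:
            hits += 1
    return hits / reps

print(estimate_power(60))    # low at n = 60; try n = 300 to watch it climb
```

Running this with n = 60 mirrors what repeated clicks of "Run Simulation 1" reveal: even though the null hypothesis is patently false, "non-significant" comes up most of the time.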

The second simulation does the same thing, except that it draws random samples 100 at a time. Here again, the default value for sample size is set at n=60. You can change it to another value if you wish, but be advised that the underlying calculations for larger values of n might take a while, depending on the inherent speed of your computer.
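The batch version can be sketched the same way: draw 100 samples of size n in one go and report how many reach significance. Again, the helper below is an illustrative assumption, not the page's own implementation:

```python
import random

CRIT = 7.815  # chi-square critical value, df = 3, alpha = .05

def run_batch(n, batches=100, seed=2):
    """Draw `batches` random samples of size n from the actual population
    and count how many yield a significant chi-square against the
    hypothesized proportions (.20, .30, .30, .20)."""
    random.seed(seed)
    expected = [n * p for p in [.20, .30, .30, .20]]
    significant = 0
    for _ in range(batches):
        picks = random.choices(range(4), weights=[.25] * 4, k=n)
        observed = [picks.count(i) for i in range(4)]
        stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        if stat >= CRIT:
            significant += 1
    return significant

for n in (30, 60, 100, 300):
    print(n, run_batch(n), "of 100 significant")
```

Comparing the counts across sample sizes makes the page's point concrete: only at much larger n does "significant" become the typical outcome.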

Simulation 1

[Interactive display: observed and expected frequencies for A, B, C, and D, with expected percentages 20%, 30%, 30%, and 20%; enter sample size n and click "Run Simulation 1".]

Simulation 2
Click the "Run Simulation 2" button to draw 100 random samples of size n. Enter a different value of n if you wish, but plan to be patient if n is large and your computer is slow.

