[First published March 15, 2006] Everyone who reads a lot of reports, studies, and the like is bound to run across “p < .05” or some other fraction, such as “p < .0005”. Or, instead, they will read that “the results are significant,” or “not significant.” What is going on here? I will try to explain this nontechnically, without all the details beloved of the statistician (no Type I and Type II error, no two-sided test versus one-sided, no normal distribution, no equations, etc.), and will even oversimplify for the purpose of clarity.
PROBABILITY, NULL HYPOTHESIS, AND SIGNIFICANCE
To begin, p stands for “probability.” Thus, p < .05 means that the probability is less than .05. For example, if there are 100 balls in a basket and 4 of them are red, the probability of blindly selecting a red ball is .04, so p < .05: less than 1 chance in 20 tries, or fewer than 5 in 100 tries. But, this understanding by itself can be misleading if a sample of some sort was analyzed.
For example, assume that a randomly selected sample has been analyzed, say of 100 American college students, in order to determine for the population (universe) of all American college students the correlation between getting drunk at least once a month and grades. Let us say the correlation between such drunkenness and grades is .17, p < .05. How to interpret this? Not as a straightforward probability of getting .17.
Rather, the idea is that one has implicitly tested what is called the null hypothesis that the true correlation for all American college students is r = 0, which means that hypothetically there is no correlation between getting drunk at least once a month and grades. The p < .05 then means that if one rejects the null hypothesis and accepts that .17 is true for the population of students, the probability of this choice being in error is less than 1 in 20. Although in research on samples the null hypothesis is usually not stated, it is there nonetheless (some classes in statistics require students to always state the null hypothesis). Regardless of whether the statistic being applied is a t-test, F-ratio, chi-square, or some other, the implicit assumption usually is that for the population the sample represents, the true statistic is zero. Then, the p indicates the chance of error if this hypothetical value for the population is rejected in favor of accepting the one actually found for the sample.
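One way to see where such a p comes from is to pretend the null hypothesis is true and let a computer draw many samples. The sketch below is only an illustration under stated assumptions: the two variables are drawn as independent standard normal numbers (so the true correlation is 0), the function names `pearson_r` and `simulated_p` are made up for this example, and the test is one-sided. It counts how often a sample of 100 produces a correlation of .17 or more by chance alone.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulated_p(r_observed, n, trials=5000, seed=0):
    """Share of null-hypothesis samples whose r is at least r_observed.

    Assumption: both variables are independent standard normals,
    so the population correlation is exactly 0.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        if pearson_r(xs, ys) >= r_observed:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # For r = .17 and N = 100 the share should come out somewhat under .05.
    print(simulated_p(0.17, 100))
```

The exact figure varies a little with the seed and the number of trials, which is why no single output is quoted here.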
As another example, in a regression analysis on a sample, the resulting regression coefficients may be given with associated t-tests and p-values. Assume, for example, a regression coefficient is 3.4 with a t-test of 2.0 and p < .03. The assumed null hypothesis is that the regression coefficient for the universe the sample represents really is 0, and if this is rejected in favor of the finding that it is 3.4 for the population, the chance of error in doing this is less than .03. That is, if the true coefficient really were 0 and this study were replicated 100 times, fewer than 3 of the replications would be expected to yield a coefficient as large as 3.4 just by chance.
I have made the null hypothesis equal to 0, which is generally the case. But, it can equal any number. Regardless, p still answers the question of how probable it is that the researcher will be in error if he rejects the number given for the population in the null hypothesis in favor of the number found by the research.
When a null hypothesis is rejected with little chance of error, the result is called significant. The acceptable probability of error (the significance level) among scientists is a matter of tradition, which is that if the chance of error is p equal to or less than .05, the result is significant. This is a convention, however, and a researcher may be conservative about error and define significance in his research as p < .01. Or, if the researcher believes there is much random error in his data, he may then raise the significance level to something like p < .1. In other words, when a study says its correlation is significant, it is saying in effect that its correlation is such that there is little chance of error in rejecting the possibility that it is zero (or some other number). If a study says it has conducted a significance test, it is saying that it calculated the p-value; and if it says the result was nonsignificant, it means that the chance of error in rejecting the null hypothesis was too great. But, without knowing the p-values, there is no way of knowing what chance of error the researcher found acceptable or unacceptable.
The danger in significance tests is that the p-value depends so heavily on the sample size (N). See the change in significance (p-value) for the very low correlation of .15 at different sample sizes:
N = 10, p = .34
N = 50, p = .15
N = 100, p = .07
N = 500, p = .0004
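These figures can be roughly reproduced with a few lines of code. The sketch below is an approximation, not necessarily the method used to produce the list above: it assumes a one-sided test and uses the Fisher z transformation with the normal distribution, and the function name `one_sided_p` is invented for this example.

```python
import math


def one_sided_p(r, n):
    """Approximate one-sided p-value for a sample correlation r with N = n.

    Assumption: Fisher z approximation, i.e. atanh(r) * sqrt(n - 3)
    is treated as a standard normal score under the null r = 0.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    # Upper-tail area of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


for n in (10, 50, 100, 500):
    print(f"N = {n:3d}, p = {one_sided_p(0.15, n):.4f}")
```

Rounded, the results agree with the values listed above (.34, .15, .07, and .0004), while the correlation itself never changes.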
All one needs to do, it seems, is to increase the sample size to get very significant results, although totally meaningless ones. What does this mean? To understand what the correlation coefficient means for the relationship between two variables, for example, square it and multiply by 100. The result will be the percent of variance (variation) in common, or shared between the two variables. So, if one does this for r = .50 (to make this easy), the result is 25%. To say that two variables have 25% of their variation in common is a lot more meaningful than saying that their correlation is .5. Thus, an r = .80 means 64% of the variation is in common; r = .90 means 81% in common, and so on. This is a way of getting at the true empirical meaning of a correlation, and one that is not dependent on sample size and the significance test.
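The arithmetic is simple enough to do by hand, but for completeness here is a tiny sketch of it. The function name `shared_variance_percent` is just a label chosen for this example.

```python
def shared_variance_percent(r):
    """Percent of variance two variables have in common: r squared times 100."""
    return r ** 2 * 100


for r in (0.15, 0.50, 0.80, 0.90):
    print(f"r = {r:.2f} -> {shared_variance_percent(r):.2f}% of variance in common")

# r = 0.15 -> 2.25% of variance in common
# r = 0.50 -> 25.00% of variance in common
# r = 0.80 -> 64.00% of variance in common
# r = 0.90 -> 81.00% of variance in common
```

Note that the sample size N appears nowhere in the calculation.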
And this displays a major problem with significance tests. Consider a correlation of .01. For a sample of 500 people it is nowhere near significant (p is about .41), but for a sample of roughly 30,000 people it reaches about p = .04 (one-sided). By convention, this itty-bitty correlation is then significant, and the unwary researcher might so report it. But, it is meaningless. For the variation in common between the two variables is an incredibly low .01%, or virtually a zero relationship. And yet, it is significant! Always consider the variance in common along with the significance test.
Another problem is that the null hypothesis and its significance test assume that the analysis is carried out on a sample selected in some appropriate way to reflect a population. But, the analysis may be of all nations, all American senators, all students at Yale, and so on. There is no sample. One might say, however, as some researchers have tried to do, that this is a sample of all nations, senators, or Yale students that have existed, or will exist. But, then the problem is that the sample is in no way a randomly selected representation of this population, which violates an assumption of the significance test.
So, the usual significance tests are inappropriate when analyzing a whole population. If, for example, r = .60 for the relationship between development and literacy for all 192 nations in 2005, then there can be no null hypothesis, since this correlation is truly .60 for all nations. Yet, as some readers may have noticed, I have p-values scattered throughout my research even though I am analyzing all nations.
There is another way to look at probability than for samples. One can, for example, calculate the probability of tossing a coin and getting five heads in a row; of not getting a seven in ten tosses of the dice; and of none of the 122 democracies among 192 nations having had any of the conflicts in that year among themselves. Such probabilities can also be given as p-values. The p of getting five heads in a row is the probability of getting one head in one toss ( = .5) to the fifth power, which is p = .03125, or p < .05. This would then be significant.
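The coin and dice probabilities can be worked out directly. In the sketch below, the coin is assumed to be fair, and "a seven" is assumed to mean the sum of two fair six-sided dice; neither assumption is spelled out in the text, and the variable names are chosen only for this example.

```python
# Five heads in a row with a fair coin: .5 to the fifth power.
p_five_heads = 0.5 ** 5
print(p_five_heads)  # 0.03125

# Two fair dice: 6 of the 36 equally likely outcomes sum to seven.
p_seven = 6 / 36

# Ten independent tosses with no seven at all.
p_no_seven_in_ten = (1 - p_seven) ** 10
print(round(p_no_seven_in_ten, 4))

# Compare each with the conventional .05 cutoff.
print(p_five_heads < 0.05, p_no_seven_in_ten < 0.05)
```

Under these assumptions the five heads fall below .05, while going ten tosses without a seven, at roughly .16, does not.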
So, in my post yesterday, my analysis of variance of the relationship between the terrorism/human rights scale and freedom yielded an F-statistic of 81.6, p < .0001. This is saying that for all the data on the two variables, the chance that they would line up for all nations such that one would get this F-statistic is less than .0001, almost a 0 probability, and thus very significant. There must be something causing such a near impossible pairing, and I say that it is the democratic nature of a regime.
Thus, in the case of analyzing the whole population, instead of its representative sample, the p now is the chance of getting any statistic, such as a correlation, multiple correlation, regression coefficient, chi-square, and so on, just by chance.