What exactly is a p-value?
P-values have been criticised because they are widely misunderstood and don't tell scientists what they want to know. Goodman (2008) has written a nice article on the misinterpretation of p-values, and here are a few examples:
- P = 0.05 does not mean there is only a 5% chance that the null hypothesis is true.
- P = 0.05 does not mean there is a 5% chance of a Type I error (i.e. false positive).
- P = 0.05 does not mean there is a 95% chance that the results would replicate if the study were repeated.
- P > 0.05 does not mean there is no difference between groups.
- P < 0.05 does not mean you have proved your experimental hypothesis.
A p-value means only one thing (although it can be phrased in a few different ways): the probability of getting the results you did (or more extreme results), given that the null hypothesis is true.
Let's look at this definition a bit closer. The null hypothesis is the hypothesis of no effect, no correlation, no association, etc., whatever the case may be. This qualitative statement needs to be converted into something more mathematical; for example, one might say that the difference between the mean of a drug group and the mean of a control group is zero. This is an improvement because we now have a numeric value that we can work with (i.e. zero), but what we actually need is a distribution of possible values that one might expect to get if the drug actually has no effect. This is referred to as the null distribution, and the key thing to remember is that the null distribution is the distribution of outcomes from an experiment when there is no effect. How do you actually come up with a null distribution? For standard statistical tests it is already "built in", derived from theory, and does not need to be specified directly. But let's make a null distribution from scratch to clarify what it is and how it can be used to make inferences.
Suppose we have a fair coin (meaning that it has an equal probability of coming up heads or tails) and another coin that we are not sure about (maybe it was given to us by someone of dubious character). We know that if a fair coin is tossed 20 times, we would expect to get 10 heads on average. Of course we wouldn't get exactly 10 heads every time; sometimes we would get 9, sometimes 12, etc. To get an idea of what the distribution of outcomes would look like, we could toss a fair coin 20 times and count the number of heads, and then repeat this 10000 times. An easier option is to simulate 20 tosses of a coin 10000 times, and this is shown in the figure below.
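The simulation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used to produce the figure: the function name `simulate_null` and the seed are assumptions, and because the random numbers differ, the exact counts will not match those quoted in the text.

```python
import random
from collections import Counter

def simulate_null(n_tosses=20, n_trials=10_000, p_heads=0.5, seed=1):
    """Toss a coin n_tosses times, count the heads, and repeat n_trials times.

    Returns a Counter mapping "number of heads" to "number of trials".
    The seed is arbitrary and only makes the run reproducible.
    """
    rng = random.Random(seed)
    heads_per_trial = [
        sum(rng.random() < p_heads for _ in range(n_tosses))
        for _ in range(n_trials)
    ]
    return Counter(heads_per_trial)

# Null distribution for a fair coin: how often each head count occurred.
null_counts = simulate_null()
for heads in range(21):
    print(f"{heads:2d} heads: {null_counts[heads]:5d} trials")
```

The printed table is the null distribution in tabular form; plotting the counts against the number of heads would give a histogram centred near 10.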
It can be seen that 10 heads out of 20 tosses was the most frequent outcome, occurring on 1753 out of the 10000 trials. This is the null distribution, and we can compare the results of tossing the unknown coin against it. Now suppose we toss the unknown coin and observe that it lands heads on 16 out of 20 tosses. We expect 10 heads but observe 16, so what do we make of this value? Is it very large, or unusual? This is where the null distribution helps: the fair coin landed with 16 heads on only 50 out of the 10000 trials, so the probability of exactly 16 heads can be estimated as 50/10000 = 0.005. However, recall that the above definition of a p-value refers to the probability of getting the observed results (16 heads) or more extreme results (17, 18, 19, or 20 heads). In our null distribution, 61 out of 10000 trials ended up with 16 or more heads, for a probability of p = 0.0061. This is the traditional p-value, and it tells us that if the unknown coin were fair, then one would expect to obtain 16 or more heads only 0.61% of the time.
This can mean one of two things: (1) that an unlikely event occurred (a fair coin landing heads 16 times), or (2) that it is not a fair coin. What we don't know, and what the p-value does not tell us, is which of these two options is correct! Of course as scientists we conclude that such an unlikely event suggests that the coin is not fair (i.e. we reject the null hypothesis of "the coin is fair"), but this is not logically entailed by the p-value, and this is where the misinterpretation comes from. What scientists usually want to know (and often think that the p-value gives them) is the probability that the coin is biased, given that 16 heads were observed. The p-value is a different quantity: the probability of observing 16 or more heads, given that the coin is fair.
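The tail count behind the simulated p-value can be sketched as follows. The function name `simulated_p_value` and the seed are assumptions made for illustration, and the test is one-sided (16 or more heads), as in the text. A different set of random tosses will give a slightly different answer from the 0.0061 quoted above, but it should land in the same neighbourhood.

```python
import random

def simulated_p_value(observed_heads=16, n_tosses=20, n_trials=10_000, seed=1):
    """Estimate P(heads >= observed_heads) for a fair coin by simulation."""
    rng = random.Random(seed)  # arbitrary seed, for reproducibility only
    as_extreme = 0
    for _ in range(n_trials):
        # One trial: toss a fair coin n_tosses times and count the heads.
        heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
        # Count trials at least as extreme as the observed result.
        if heads >= observed_heads:
            as_extreme += 1
    return as_extreme / n_trials

p_sim = simulated_p_value()
print(f"Simulated one-sided p-value: {p_sim:.4f}")
```

Increasing `n_trials` reduces the run-to-run variation of the estimate, at the cost of a longer simulation.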
It should be noted that in practice it is not necessary to simulate null distributions for standard problems, as there are "ready-made" distributions based on statistical theory. In this example we can calculate the probability of 16 or more heads from 20 tosses, when each toss has a 50% chance of landing heads, by using a binomial distribution. The result is p = 0.0059, which is very close to p = 0.0061 from the simulated distribution.
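As a rough check on the binomial calculation, the upper tail can be summed directly from the binomial formula. The helper name `binomial_upper_tail` is an assumption made for this sketch; statistical packages provide equivalent ready-made functions.

```python
from math import comb

def binomial_upper_tail(k_min, n, p=0.5):
    """P(X >= k_min) for X ~ Binomial(n, p), summed term by term."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )

# Probability of 16 or more heads in 20 tosses of a fair coin.
p_value = binomial_upper_tail(16, 20)
print(round(p_value, 4))  # 0.0059
```

With a fair coin every sequence of 20 tosses is equally likely, so this is simply the number of sequences with 16 or more heads (6196) divided by the total number of sequences (2^20 = 1048576).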
Goodman SN, Royall R (1988). Evidence and scientific research. American Journal of Public Health 78(12):1568–1574. [Pubmed]
Goodman S (2008). A Dirty Dozen: twelve p-value misconceptions. Semin Hematol 45(3):135–140. [Pubmed]