Combining results across experiments
It is not unusual for an experiment to be done more than once, or for part of an experiment to be replicated in a follow-up study. This is beneficial because an independent replication provides further evidence for the effect in question (although it is possible to replicate technical artifacts as well). How then does one combine results from two or more experiments? If both experiments produced statistically significant results, then the outcome is clear and there is less to be gained from formally combining the results, although this is still useful to obtain more accurate parameter estimates. But what if the results of the follow-up experiment are not statistically significant? This is when worlds come crashing down, and all sorts of creative explanations are conjectured for the "conflicting results". However, the second non-significant result might actually provide more evidence in favour of the effect, and this is illustrated below.
The data from Experiment 1 are taken from Lazic (2008). Here, rats were given 0, 80, 160, or 240 mg/L of fluoxetine (Prozac) in their drinking water, and immobility time on the forced swim test (FST) was measured. The FST is a standard test to screen compounds for antidepressant activity. Antidepressants tend to decrease immobility time, which is thought to relate to "behavioural despair" or depression. The slope of the regression line in Experiment 1 is -0.25 (p = 0.020), indicating that for every one mg/L increase in fluoxetine, immobility time decreased by 0.25 seconds (95% CI = 0.04 to 0.46 seconds). The data from Experiment 2 are simulated and were generated to illustrate the following concepts. For this experiment, the slope was -0.13 (p = 0.102). The conclusion often reached is that the second experiment did not replicate the results of the first, and therefore that either the first result was a false positive, or the results are inconclusive and a third study, perhaps with more animals, needs to be conducted.
Three methods for combining the results of two or more experiments are discussed: (1) combining all of the data into one big analysis, occasionally referred to as a "mega-analysis"; (2) calculating summary statistics, such as a mean difference between groups and its standard error, and then combining these in what is known as a meta-analysis; and (3) using a Bayesian analysis, where the results of the first experiment serve as the prior information, which is combined with the results of the second.
Combine all the data into a single analysis (mega-analysis)
Simply combining all of the data into one big analysis does not require any new concepts or methods. The only extra step is to include a variable called "Experiment" in the analysis. In the present example, "Experiment" is a categorical factor with two levels and dose is treated as a continuous variable, so the model is an analysis of covariance (ANCOVA). It should be noted that it is easy to get silly results from an ANCOVA unless you know what you are doing. Evans & Anastasio (1968), White (2003), and Engqvist (2005) provide some guidance.
The first thing to test is whether there is an experiment x dose interaction; this tests whether the slopes of the two regression lines in the above figure are significantly different from each other. If they are, then the effect of fluoxetine was not constant across experiments. The interaction effect was not significant in this case (F(1,36) = 1.03, p = 0.317). It is a good idea to remove a non-significant interaction effect from the model, as this greatly simplifies the interpretation of the results, and this should be an option in most statistics software.
After removing the interaction, we might be interested in whether there was an "effect of experiment". This is usually not of theoretical interest; it just means that the average value from the first experiment was different from the second. It is not unusual for the mean of one experiment to differ from another, especially if what you are measuring is in arbitrary units, such as optical density in Western blots. In this case, there was no effect of experiment (F(1,37) = 1.31, p = 0.259), so this was removed from the model as well. What we have now done is combine all of the data and ignore that they were generated from two different experiments (this could be done because there was no effect of experiment and no experiment x dose interaction).
The final question is the one we are interested in: what is the effect of fluoxetine? From the combined analysis we can conclude that for every one mg/L increase in fluoxetine, immobility time decreases by 0.19 seconds (p = 0.004; 95% CI = 0.06 to 0.31 seconds). You might be surprised that the p-value is smaller than in the first experiment. How can it be that combining a study with p = 0.020 with another study where p = 0.102 gives a value of p = 0.004? Shouldn't the combined p-value be somewhere in between the other two? This is an example of where intuitions about p-values may lead us astray.
A p-value is a function of (1) the size of the effect (the slope in this case), (2) the variability in the data, and (3) the sample size. Even though the effect was smaller in the second experiment, the combined analysis has twice the sample size of either experiment alone, leading to a smaller p-value and more precise estimates (narrower confidence intervals).
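The model-simplification steps described in this section can be sketched as a sequence of nested model comparisons. The following is a minimal illustration in plain Python, not the analysis actually run for this article: the data are made up (the intercepts, noise levels, and random seed are assumptions chosen only to give the same design of four doses and 20 animals per experiment), and the helper names `sums` and `rss_from_sums` are hypothetical. It computes the F statistics only; p-values would normally come from statistics software.

```python
import random

def sums(x, y):
    """Centred sums of squares and cross-products (Sxx, Sxy, Syy) for one data set."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    return sxx, sxy, syy

def rss_from_sums(sxx, sxy, syy):
    """Residual sum of squares of a least-squares line, from its centred sums."""
    return syy - sxy ** 2 / sxx

# HYPOTHETICAL data, not the data from either experiment: 5 animals per dose.
random.seed(1)
doses = [0, 80, 160, 240] * 5
exp1 = [200 - 0.25 * d + random.gauss(0, 40) for d in doses]
exp2 = [190 - 0.13 * d + random.gauss(0, 30) for d in doses]
n = len(exp1) + len(exp2)

s1 = sums(doses, exp1)
s2 = sums(doses, exp2)
s_all = sums(doses + doses, exp1 + exp2)

# Model A: separate intercept and slope per experiment (includes the interaction).
rss_a = rss_from_sums(*s1) + rss_from_sums(*s2)
df_a = n - 4
# Model B: separate intercepts, one common slope (interaction removed).
rss_b = rss_from_sums(s1[0] + s2[0], s1[1] + s2[1], s1[2] + s2[2])
df_b = n - 3
# Model C: a single regression line for all the data (experiment removed too).
rss_c = rss_from_sums(*s_all)

# Each F statistic compares a model with the next simpler one (1 numerator df).
f_interaction = (rss_b - rss_a) / (rss_a / df_a)  # refer to F(1, df_a)
f_experiment = (rss_c - rss_b) / (rss_b / df_b)   # refer to F(1, df_b)
pooled_slope = s_all[1] / s_all[0]

print(f"interaction: F(1,{df_a}) = {f_interaction:.2f}")
print(f"experiment:  F(1,{df_b}) = {f_experiment:.2f}")
print(f"pooled slope = {pooled_slope:.3f} seconds per mg/L")
```

With 40 animals the degrees of freedom are 36 and 37, matching the F tests reported above. If SciPy is available, `scipy.stats.f.sf(f_interaction, 1, df_a)` would convert the statistic to a p-value.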
Combine summary statistics (meta-analysis)
A second and more common method that can be used to combine results from multiple experiments is a meta-analysis. Here, the slope of the regression line is calculated for each study separately, along with the standard error of the estimate; these are given as standard output from a regression analysis. The slope of the combined data is a weighted average of the slopes of the original studies, with the slopes weighted according to their precision. The variability of the combined slope estimate is also calculated from the variability of the original estimates. Borenstein et al. (2009) provide a nice introduction to meta-analyses.
The figure below is a forest plot; these graphs are commonly used to present the results of a meta-analysis. This one shows the slopes for the two original studies and for the combined data, along with 95% confidence intervals. As can be seen, the combined slope lies between the values of the two original studies, but it is slightly closer to Experiment 2; this is because the estimate of the slope for the second experiment was more precise (as can be seen by the width of the error bars) and therefore the estimate of the combined data is pulled closer to Experiment 2's value.
The 95% CI for the combined data does not include the value of zero (vertical line), therefore we can conclude that there is a significant effect of fluoxetine. The numeric results for the combined data are: slope = -0.17, p = 0.006, 95% CI = -0.29 to -0.05; the interpretation is that for every one mg/L increase in fluoxetine, immobility time decreases by 0.17 seconds (95% CI = 0.05 to 0.29 seconds).
Summarising results in this way makes for a nice addition to the final chapter of a thesis or dissertation, as it quantitatively combines experimental results which were addressing the same question, obtained by the same person, often over several years.
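The precision-weighted average described above can be sketched in a few lines. This is a minimal fixed-effect (inverse-variance) calculation, assuming that this is the weighting used for the forest plot; the article does not state the exact method. The standard errors are not reported in the text: the values 0.10 and 0.076 below are rough back-calculations from the reported confidence interval and p-value, and should be read as assumptions. The p-value uses a normal approximation, so it may differ slightly from the reported one.

```python
import math

def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance weighted average of estimates and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    combined = sum(w * b for w, b in zip(weights, estimates)) / total
    combined_se = math.sqrt(1.0 / total)
    return combined, combined_se

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

slopes = [-0.25, -0.13]       # reported slopes for Experiments 1 and 2
std_errors = [0.10, 0.076]    # ASSUMED: back-calculated, not reported in the text

combined, combined_se = fixed_effect_meta(slopes, std_errors)
ci_low = combined - 1.96 * combined_se
ci_high = combined + 1.96 * combined_se
p_value = two_sided_p(combined / combined_se)

print(f"combined slope = {combined:.2f}")
print(f"95% CI = {ci_low:.2f} to {ci_high:.2f}")
print(f"p (normal approximation) = {p_value:.3f}")
```

Because Experiment 2 has the smaller standard error, it receives the larger weight, which is why the combined slope sits closer to -0.13 than to -0.25.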
Use the first experiment as the prior (Bayesian analysis)
The final method of combining data from the two experiments uses Bayesian methods, with the results of the first experiment being used as the prior. Bayesian methods are more complex, have more theory associated with them, and thus require a larger investment of time before they can be used. However, in this simple example there are relatively simple equations that can be used to update the results of the first experiment with the results of the second. These equations are implemented in the bayes.lin.reg function in the Bolstad R package. The results of the Bayesian analysis are identical to the meta-analysis (to the reported number of significant figures), and these are displayed in the figure below. Whereas the meta-analysis figure displayed only the mean and 95% CI, this figure shows the full distribution of the two individual experiments and the combined results, although the information in both figures is the same.
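One way to see why the Bayesian and meta-analytic results agree is the textbook normal-normal update: with a normal prior and a normal likelihood of known spread, the posterior precision is the sum of the two precisions, and the posterior mean is the precision-weighted average of the prior mean and the new estimate. The sketch below is a simplified illustration of that update, not a reimplementation of bayes.lin.reg. The function names are hypothetical, and the standard errors (0.10 and 0.076) are assumed values back-calculated from the reported results.

```python
import math

def update_normal(prior_mean, prior_sd, estimate, std_error):
    """Posterior mean and sd for a normal prior combined with a normal estimate."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / std_error ** 2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean + data_precision * estimate) / post_precision
    return post_mean, math.sqrt(1.0 / post_precision)

def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2))))

# Prior: slope from Experiment 1. Data: slope from Experiment 2.
# The standard deviations are ASSUMED values, not reported in the text.
post_mean, post_sd = update_normal(prior_mean=-0.25, prior_sd=0.10,
                                   estimate=-0.13, std_error=0.076)

# Posterior probability that fluoxetine decreases immobility time (slope < 0).
prob_negative = normal_cdf(0.0, post_mean, post_sd)

print(f"posterior slope = {post_mean:.2f} (sd = {post_sd:.3f})")
print(f"P(slope < 0) = {prob_negative:.3f}")
```

The posterior is narrower than either the prior or the new estimate on its own, which is the Bayesian counterpart of the narrower confidence interval obtained from the combined data.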
When can data be combined?
In general, data should be combined only when the studies are measuring the same thing. This is not always obvious and opinions will differ. For example, what if the second experiment used a different strain of rat: can we assume that the effect of fluoxetine is the same, or is there a strain-specific effect? What if the second study used a different behavioural test (it is possible to combine standardised effect sizes, for example, a percent change between control and treated groups)? Or what about using another compound with the same mechanism of action as fluoxetine? The research question may be different depending on which studies are included. Other meta-analytic methods exist which can deal with more heterogeneous data (Borenstein et al., 2009). You should also be aware that combining data can sometimes have surprising effects. Simpson's paradox is such an example, where the effect may be in one direction in each of the original studies, but then switch direction when the data are combined. Wikipedia has a good entry on Simpson's paradox.
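Simpson's paradox is easy to reproduce with a toy example. The numbers below are made up purely for illustration and have nothing to do with the fluoxetine experiments: each hypothetical study has a slope of exactly +1, but the two studies differ so much in their average x and y values that naively pooling them gives a negative slope.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

# HYPOTHETICAL data: (x values, y values) for two made-up studies.
study_a = ([1, 2, 3], [10, 11, 12])
study_b = ([6, 7, 8], [2, 3, 4])

slope_a = slope(*study_a)
slope_b = slope(*study_b)
# Naive pooling that ignores which study each observation came from.
pooled = slope(study_a[0] + study_b[0], study_a[1] + study_b[1])

print(f"study A slope = {slope_a:.2f}")
print(f"study B slope = {slope_b:.2f}")
print(f"pooled slope  = {pooled:.2f}")
```

Here the pooled slope is about -1.35 even though both within-study slopes are +1. This is one reason a factor such as "Experiment" is included in a combined analysis and only dropped once it is shown to be unnecessary.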
In summary, combining data across studies using any of the methods discussed is the way to understand the results of multiple experiments. In the example shown, the usual method of "vote counting", where the numbers of significant and non-significant studies are tallied up, will only lead to tears.
Borenstein M, Hedges LV, Higgins JPT, Rothstein HR (2009). Introduction to Meta-Analysis. Wiley: Chichester, UK.
Engqvist L (2005). The mistreatment of covariate interaction terms in linear model analyses of behavioural and evolutionary ecology studies. Animal Behaviour 70:967–971.
Evans SH, Anastasio EJ (1968). Misuse of analysis of covariance when treatment effect and covariate are confounded. Psychological Bulletin 69(4):225–234.
Lazic SE (2008). Why we should use simpler models if the data allow this: relevance for ANOVA designs in experimental biology. BMC Physiology 8:16.
White CR (2003). Allometric analysis beyond heterogeneous regression slopes: use of the Johnson-Neyman technique in comparative biology. Physiological and Biochemical Zoology 76(1):135–140.