Statistics for Experimental Biologists





Don't hack up biology to fit a statistical test


It is still common to see continuous variables converted into categorical variables, despite the many papers that argue against this practice (Cohen, 1983; Maxwell & Delaney, 1993; Streiner, 2002; MacCallum et al., 2002; Taylor & Yu, 2002; Irwin & McClelland, 2003; Owen & Froman, 2005; Royston et al., 2006; Chen et al., 2007; Lazic, 2008; van Walraven & Hart, 2008; Naggara et al., 2011). This often takes the form of a median split, which divides the data into two groups of equal size; sometimes a cutoff based on an arbitrary criterion is used instead to divide the data into high and low groups. Occasionally, the data are split into more groups, such as high, medium, and low. Often the motivation for mangling the data is to get them into a familiar form that can be analysed with a t-test (I know t-test, therefore must hack data into form suitable for t-test). There is a consensus regarding this practice: just don't do it.

The arguments against this practice mainly concern the (large) loss of statistical power and the biased estimates that result, and are described in detail in the references below. But the practice should also be offensive to biologists: if the phenomenon you are studying is continuous (e.g. gene expression, performance on a behavioural test, body weight), you are not being true to the biological reality if you chop it up to fit into preconceived boxes. The world is not divided into those who are obese and those who are anorexic; body mass is a continuous variable (most people are somewhere in the middle) and should be treated as such in an analysis. Labels can be given to various ranges (e.g. normal, overweight, obese) to make discussion and communication easier, but these artificial categories should not become the reality. Standard regression methods (or perhaps nonlinear regression) can easily handle this type of data.
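The loss of power can be seen in a small simulation. The sketch below is illustrative only and is not taken from any of the papers cited here: the sample size, slope, number of simulations, and all function names are assumptions chosen for the example. It generates a continuous predictor with a linear effect on the outcome, then tests the effect twice, once with the predictor left continuous (a test of the regression slope) and once after a median split (a two-group t-test, computed here through the equivalent point-biserial correlation). For a normally distributed predictor the correlation after a median split is expected to shrink to roughly sqrt(2/pi), or about 0.8, of its original value.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, n):
    """t statistic (n - 2 df) for a zero correlation or zero slope.

    With a 0/1 predictor this equals the pooled-variance two-sample t.
    """
    return r * math.sqrt((n - 2) / (1 - r * r))


def simulate(n=40, slope=0.4, n_sims=2000, t_crit=2.024, seed=1):
    """Return (power_continuous, power_split, mean_r_continuous, mean_r_split).

    All defaults are assumptions for illustration. t_crit is the two-sided
    5% critical value of t with n - 2 = 38 df; change it if you change n.
    """
    rng = random.Random(seed)
    hits_cont = hits_split = 0
    sum_r_cont = sum_r_split = 0.0
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [slope * xi + rng.gauss(0, 1) for xi in x]

        # Median split: 0 = "low" group, 1 = "high" group.
        med = statistics.median(x)
        group = [1.0 if xi > med else 0.0 for xi in x]

        r_cont = pearson_r(x, y)
        r_split = pearson_r(group, y)
        sum_r_cont += r_cont
        sum_r_split += r_split

        # Count simulations in which each test rejects at the 5% level.
        hits_cont += abs(t_from_r(r_cont, n)) > t_crit
        hits_split += abs(t_from_r(r_split, n)) > t_crit

    return (hits_cont / n_sims, hits_split / n_sims,
            sum_r_cont / n_sims, sum_r_split / n_sims)


if __name__ == "__main__":
    p_cont, p_split, r_cont, r_split = simulate()
    print(f"power, continuous predictor: {p_cont:.2f}")
    print(f"power, median split:         {p_split:.2f}")
    print(f"mean r, continuous:          {r_cont:.2f}")
    print(f"mean r, median split:        {r_split:.2f}")
```

Under these assumptions the same data, analysed with the predictor left alone, detect the effect more often than after the split. The exact numbers depend on the chosen sample size and effect size, so treat them as a demonstration of the direction of the effect and not as a general estimate of its magnitude.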



References


Chen H, Cohen P, Chen S (2007). Biased odds ratios from dichotomization of age. Statistics in Medicine 26:3487–3497.

Cohen J (1983). The cost of dichotomization. Applied Psychological Measurement 7(3):249–253.

Irwin JR, McClelland GH (2003). Negative consequences of dichotomizing continuous predictor variables. Journal of Marketing Research 40:366–371.

Lazic SE (2008). Why we should use simpler models if the data allow this: relevance for ANOVA designs in experimental biology. BMC Physiology 8:16.

MacCallum RC, Zhang S, Preacher KJ, Rucker DD (2002). On the practice of dichotomization of quantitative variables. Psychological Methods 7:19–40.

Maxwell SE, Delaney HD (1993). Bivariate median splits and spurious statistical significance. Psychological Bulletin 113(1):181–190.

Naggara O, Raymond J, Guilbert F, Roy D, Weill A, Altman DG (2011). Analysis by categorizing or dichotomizing continuous variables is inadvisable: an example from the natural history of unruptured aneurysms. American Journal of Neuroradiology 32:437–440.

Owen SV, Froman RD (2005). Why carve up your continuous data? Research in Nursing & Health 28(6):496–503.

Royston P, Altman DG, Sauerbrei W (2006). Dichotomizing continuous predictors in multiple regression: a bad idea. Statistics in Medicine 25:127–141.

Streiner DL (2002). Breaking up is hard to do: the heartbreak of dichotomizing continuous data. Canadian Journal of Psychiatry 47:262–266.

Taylor JM, Yu M (2002). Bias and efficiency loss due to categorizing an explanatory variable. Journal of Multivariate Analysis 83:248–263.

van Walraven C, Hart RG (2008). Leave 'em alone - why continuous variables should be analyzed as such. Neuroepidemiology 30:138–139.