Statistics for Experimental Biologists


Using computers to make life easier

Computers are a standard tool for experimental biologists and are used for writing documents (manuscripts, dissertations, etc.), managing references, storing and analysing data, performing routine calculations, creating figures, etc. Used well, computers can save a tremendous amount of time and effort, but many biologists are not aware of a number of useful tools that can increase productivity, reduce error, and make results more reproducible. After spending the morning repeatedly pipetting, the only thing worse than spending the afternoon repeatedly pipetting is to spend it repeatedly copying-and-pasting. This article isn't directly related to statistics, but time freed up from mundane tasks means more time for learning about statistics!

To make it on this list, the software has to be (1) free, (2) not widely used by biologists, and (3) pass the "I wish I had known about this earlier" test. It is a subjective list, and other software may exist that performs some of the same tasks equally well. As with many things in life, you sometimes have to take a step backwards in order to take many steps forward. If you are retiring next year, it may not be worth learning how to write a shell script. However, if you have many years of research ahead of you, then investing time in learning how to use your computer well will provide many returns.

Statistics: R

Most experiments require some type of data visualisation and statistical analysis, and the reasons for using R have been enumerated in a previous article. Visualising and interacting with data (and high-dimensional data in particular) can be done with GGobi. The authors of GGobi have also written a nice book entitled "Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and GGobi" [Amazon], which explains how to use the software and the importance of understanding your data by looking at it.

Document creation: LaTeX

LaTeX (pronounced more like LAY-tech) is used to create documents, such as a thesis, report, manuscript, etc. You can use it for many of the things that you currently use MS Word for. LaTeX separates the content of a document from the layout and formatting. You worry about the content, and let LaTeX worry about formatting. The result is a professional-looking document and less time wasted specifying fonts, making sure figure captions are on the same page as the figures, or making sure a heading is not at the bottom of one page while the first paragraph is at the top of the next. Life is too short to fiddle with these things. A table of contents, bibliography, list of figures and tables, index, etc. can all be created automatically. It makes creating big documents much easier, and is the standard for computer science dissertations (as well as other quantitative fields such as mathematics, statistics, physics, and economics). Many journals also accept manuscripts in LaTeX format and even provide templates. Most of the basic functionality can be found in the LaTeX Wikibook. You can also make presentations (using the beamer package) and posters in LaTeX. Google "why use latex" to find out more about the benefits.

Reference management: JabRef

A good reference manager should allow you to (1) automatically download references from the web (e.g. from PubMed), (2) easily insert references into your document, (3) add annotations, (4) structure references by grouping based on topic or author, and (5) perform standard tasks such as searching and sorting. JabRef has all of these features and integrates nicely with LaTeX. The commercially available Reference Manager does all of this as well, but it will cost you, and if the next lab you move to doesn't have this product, you're out of luck.

Image processing: GIMP

The GNU Image Manipulation Program (GIMP) is essentially a free version of Adobe Photoshop. If you already have Photoshop, there is probably little point in switching to GIMP.

Making diagrams: Inkscape

Adobe Illustrator is a standard program for drawing figures, and Inkscape is a free alternative. In addition, Dia can be used for simple figures and flow charts.

Extracting data from published papers: g3Data

You may want to extract data from published papers in order to double-check the results, perform an alternative analysis, get estimates for a power analysis/sample size calculation, or combine data across multiple studies. g3data, or the R package digitize, can be used to extract data accurately from published figures.

Sharing data and knowledge: HTML

Hypertext Markup Language (HTML) is the basic language that webpages are written in. You may want to create a website for your research group, or a personal but research-based website or blog, where a copy of recent publications and information on current research can be found. The Internet is the first place someone will go to learn more about you and your research, and having a professional-looking website will allow others to find out about you. Alternatively, it may be useful to create an internal site for lab members only, where protocols and other useful information are kept. In addition, it may be useful to create a report in HTML format, where genes, SNPs, or proteins are linked to their appropriate entries in Ensembl, HapMap, or PDB. You can do most of this without knowing HTML, but knowing how to do it from scratch means you have complete control over all aspects of your site or report. It is also not very difficult to learn.
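As a small sketch of the report idea, the following Python script writes an HTML page where each gene is linked to its Ensembl entry. The gene IDs and the Ensembl URL pattern here are illustrative assumptions; check the current linking format for your species before relying on it.

```python
# Hypothetical example: generate an HTML report with one link per gene.
# The IDs and URL pattern below are illustrative, not a definitive API.
genes = ["ENSG00000139618", "ENSG00000141510"]  # example Ensembl gene IDs

items = "\n".join(
    '<li><a href="http://www.ensembl.org/Homo_sapiens/'
    'Gene/Summary?g={0}">{0}</a></li>'.format(g)
    for g in genes
)
html = ("<html><head><title>Gene report</title></head>"
        "<body><ul>\n{0}\n</ul></body></html>".format(items))

with open("report.html", "w") as f:
    f.write(html)
```

Opening report.html in a browser then gives a clickable list; the same approach extends to tables of SNPs or proteins linked to HapMap or PDB.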

Automating tasks: Perl/Python/Shell scripts

Scripts, or macros, are a set of commands that tell the computer to perform a task, and thus provide an automated way to repeat routine tasks. For example, suppose you get the results from an assay in tab-delimited text format, and you want to do some analyses. But first you have to remove some header information (the first few lines in the file that are not data), remove a couple of columns that are irrelevant but take up a lot of space, replace spaces in the column names with underscores (maybe because your statistics software doesn't accept variable names with spaces), convert any values of 999 (an error code meaning "no data") to "NA", make all the text in the first column uppercase, and finally, save this modified file with a new name. This could be done in Excel, but a short script can be written in a few minutes and can be used many times, both by yourself and others.

Automation saves time and reduces the chance of human error. But just as important, the script is like a laboratory protocol in that it specifies what was done, and thus documents your research. If this was done in Excel, there would be no record of how you got from the original file to the modified one. Try remembering what you did a year later when you are writing up your results! There are three principles at work here: (1) if it's worth doing more than once, it's worth automating; (2) whatever you do to your data should be documented and reproducible; and (3) monotonous copying-and-pasting or formatting of data is not appropriate work for someone with a science degree, especially an advanced degree.

Perl and Python are two common programming languages that can be used for editing and formatting data, as well as a whole lot more. Simple text editing can also be done with a shell script, which is a set of commands (usually saved to a file) that runs on the command line. There are many online tutorials and books to help you get started. If you would rather not learn one of these languages, consider learning how to use macros in Excel to automate tasks.
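To make the example above concrete, here is a minimal Python sketch of such a cleaning script. The column names, the number of header lines, and the 999 error code are assumptions taken from the example, not from any particular assay format; adapt them to your own files.

```python
import csv

def clean_assay_file(in_path, out_path, header_lines=3):
    """Clean a tab-delimited assay file as described above:
    skip the header lines, drop irrelevant columns, replace spaces
    in column names with underscores, recode 999 as NA, and
    uppercase the first column."""
    with open(in_path, newline="") as f:
        for _ in range(header_lines):   # skip non-data header lines
            next(f)
        rows = list(csv.reader(f, delimiter="\t"))

    drop = {"Plate", "Well"}            # hypothetical irrelevant columns
    keep = [i for i, name in enumerate(rows[0]) if name not in drop]
    rows = [[row[i] for i in keep] for row in rows]

    rows[0] = [name.replace(" ", "_") for name in rows[0]]
    for row in rows[1:]:
        row[:] = ["NA" if v == "999" else v for v in row]  # 999 = "no data"
        row[0] = row[0].upper()

    with open(out_path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)
```

A call such as clean_assay_file("assay_raw.txt", "assay_clean.txt") then performs every step in one go, and the script itself serves as the record of exactly what was done to the data.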

ImageJ is commonly used to analyse images, but the macro functionality seems to be less well known. It is possible to automate much of the pointing-and-clicking that is often required to analyse a set of images. See the documentation for details.

Another important task to automate is searching the literature for new papers in your field. This can be easily done with PubCrawler (which is actually a Perl script). It is important to know how to write an informative query so that you get what is relevant and avoid what is not.

Text editing: Emacs

Analysing data in R and writing scripts, HTML, and LaTeX documents all involve creating and manipulating plain text. Emacs is a powerful and highly customisable text editor with many features to allow for efficient writing and coding. Think of it as Microsoft Notepad on steroids. There are numerous other text editors available, and another popular one is Vim. If you are new to LaTeX, then Kile is a great editor.

Operating system: Ubuntu

The Microsoft operating system is aimed at the mass market, and what you get is something that everyone can use, but isn't particularly good for anyone. The needs of secretaries, grandmothers, teenagers, financial engineers, and scientists cannot be simultaneously optimised. The UNIX operating system was developed by computer scientists for computer scientists, and is therefore optimised for the "power user". While biologists may not need all the flexibility and power available, they are closer to the computer scientist end of the spectrum. A product that is dumb enough to convert gene names to dates is not designed to be used by biologists (Zeeberg et al., 2004)! There are many UNIX-like operating systems (including Mac OS X), and a popular distribution is Ubuntu Linux. Google the phrase "why use Ubuntu" or "Linux versus Windows" to get more information. One key advantage of Ubuntu is that all of the software described above can be downloaded and installed with a few clicks from the Ubuntu repository. Ubuntu is also user-friendly, and you don't need to be a computer expert to use it.


Zeeberg BR, Riss J, Kane DW, Bussey KJ, Uchio E, Linehan WM, Barrett JC, Weinstein JN (2004). Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics 5:80. [PubMed]