Fun with stats: how big should my sample size be?

Recently my girlfriend and I gave a talk at her company about p-values and how all manner of silliness can ensue if they are used "willy-nilly" to draw conclusions about things like drug efficacy. At the end, I was asked a very instructive question: how do you figure out how much data you need to collect to detect if there's a difference between two groups? At first, this seemed pretty straightforward, but it turned out that there are some interesting subtleties. After I realized that, I decided to make a little article about it, which is exactly what you are reading now.

This article is not just about that exact question. It's more general and can also serve as a quick guide to figuring out how to choose the right statistical test to compare two groups (aka samples). Let's start with a more specific statement of the question.

Problem statement.

We will be performing measurements of a certain continuous variable (for example, printing time) under two conditions (for example, low and high temperature). We are interested in whether the temperature makes a difference for the average printing time. We want to be able to detect an effect of size as small as, say, 15 seconds, and we are asking: how many measurements should we plan to perform for such a difference to be statistically significant?

Simple case: standard deviations known.

In that case, assuming the two samples are independent, we know the variance of the difference between the two sample means:

$$Var(\bar{t}_1 - \bar{t}_2)=Var(\bar{t}_1)+Var(\bar{t}_2)=\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\stackrel{\text{def}}{=}S^2$$

where we have the (unknown) sample sizes in the denominators. If we then want to be able to detect differences of size $\delta$, for example in the sense that we want a measured difference of that size to be statistically significant at the 95% confidence level, then we just need to select the sample sizes to make

$$\delta=1.96S$$

(1.96 is the z-value: the number of standard deviations on either side of the mean that enclose the central 95% of the area under the bell curve.) If we want the two samples to be of equal size, then the above two equations give us that size:

$$n_1=n_2=\frac{\sigma_1^2+\sigma_2^2}{S^2}=\frac{1.96^2\,(\sigma_1^2+\sigma_2^2)}{\delta^2}$$
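As a quick illustration, the formula above can be turned into a few lines of Python. This is only a sketch: the function name `required_sample_size` and the example numbers (standard deviations of 40 seconds in both groups) are made up for illustration, and it implements only the significance criterion described above, not a power calculation.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma1, sigma2, delta, confidence=0.95):
    """Equal per-group sample size for which delta equals z * S."""
    # Two-sided z-value, roughly 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (sigma1**2 + sigma2**2) * z**2 / delta**2
    # Round up, since we cannot take a fraction of a measurement.
    return math.ceil(n)


# Hypothetical inputs: both standard deviations are 40 s, target difference 15 s.
print(required_sample_size(40, 40, 15))
```

With these made-up inputs the raw value is about 54.6, so we would plan for 55 measurements per group.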

More interesting case: standard deviations unknown.

In that case, the short answer is that there is no formula for the required sample size unless we make concrete assumptions about the standard deviations: if the standard deviations can be anything from tiny to huge, then the required sample sizes can also be anything from small to huge. We can see that from the formula we just derived. Conceptually, if the standard deviations are tiny, then our sample means will be very precise estimates of the true means, so we don’t need a lot of measurements to detect a true difference. If the standard deviations are huge, then even after many measurements our two sample means can be very different and we still wouldn’t be sure whether that represents a true difference.

The solution then is to either use an upper estimate of the standard deviations if we have that (this would overestimate the required sample size, possibly by quite a lot) or use an iterative approach, where we basically collect enough measurements to estimate the standard deviations and then use those estimates to get a reasonable estimate for the required sample size (and possibly refine it as we collect more measurements and get more precise values of the standard deviations). This document describes the latter approach in a more technical way (in the somewhat different context of having only one sample rather than two).
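The iterative approach could look roughly like the sketch below. Everything in it is an assumption made for illustration: the helper names `estimate_n` and `collect_iteratively`, the pilot size of 10, the safety cap `max_n`, and the simulated "measurements" drawn from normal distributions. It plugs the sample variances into the known-standard-deviation formula, which is itself an approximation when the samples are small.

```python
import math
import random
from statistics import NormalDist, variance


def estimate_n(sample1, sample2, delta, confidence=0.95):
    """Plug the sample variances into the known-sigma formula."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((variance(sample1) + variance(sample2)) * z**2 / delta**2)


def collect_iteratively(measure1, measure2, delta, pilot_size=10, max_n=10_000):
    """Collect a pilot sample, then keep extending until the estimate is met."""
    s1 = [measure1() for _ in range(pilot_size)]
    s2 = [measure2() for _ in range(pilot_size)]
    while True:
        needed = min(estimate_n(s1, s2, delta), max_n)
        if len(s1) >= needed:
            return s1, s2
        # Collect the shortfall, then re-estimate with the larger samples.
        extra = needed - len(s1)
        s1 += [measure1() for _ in range(extra)]
        s2 += [measure2() for _ in range(extra)]


# Simulated printing times in seconds (hypothetical numbers).
random.seed(0)
low, high = collect_iteratively(
    lambda: random.gauss(300, 40),
    lambda: random.gauss(315, 40),
    delta=15,
)
print(len(low), len(high))
```

In a real experiment, `measure1` and `measure2` would be replaced by whatever actually produces a measurement under each condition.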

Note also that after we collect our measurements we should use Welch's t-test rather than Student's t-test to test for equal means if we can’t safely make the assumption that the two standard deviations are roughly equal (how roughly? If the two sample sizes are equal, one of the standard deviations can be up to twice the other, but if the sample sizes are different even by 10% the standard for “roughly equal” needs to be stricter - source).
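For the curious, here is a minimal sketch of what Welch's test computes, using only the Python standard library. The function name `welch_t` and the toy data are made up for illustration. The sketch stops at the test statistic and the Welch-Satterthwaite degrees of freedom; getting a p-value requires the t-distribution, which is what, for example, `scipy.stats.ttest_ind(a, b, equal_var=False)` handles.

```python
import math
from statistics import mean, variance


def welch_t(sample1, sample2):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    # Estimated variance of each sample mean.
    v1 = variance(sample1) / n1
    v2 = variance(sample2) / n2
    t = (mean(sample1) - mean(sample2)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation; it does not assume equal variances.
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


# Toy data in which the second sample is clearly more spread out.
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 3), round(df, 2))
```

Note that the degrees of freedom come out below the 8 that Student's t-test would use for two samples of size 5, which is how Welch's test accounts for the unequal spreads.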

More exotic situations: we can’t even make the normality assumption.

The t-tests make the normality assumption, but it’s often misunderstood. Even the otherwise good explanations of the t-tests linked to above incorrectly state that the measurements need to be normally distributed. However, as explained, for example, here, the important factor is whether the sample means, rather than the individual measurements, are approximately normally distributed. The sample means can satisfy that assumption either (a) because the individual measurements themselves are approximately normally distributed or (b) because the sample in question contains enough measurements to overcome the deviation from normality. Depending on how non-normal the distribution of the individual measurements is, sometimes even a tiny sample size could be sufficient to produce approximately normally distributed sample means and for a t-test to be possible, while other times even the commonly cited figure of 30 may not be a sufficient sample size.

So what do we do if the normality assumption is not satisfied, i.e. what if, in addition to not knowing the two standard deviations, we also think that the distributions are so skewed that the anticipated sample sizes won’t be large enough to make the sample means approximately normally distributed? Then we could use the non-parametric counterpart of the two-sample t-test, known as the Mann-Whitney U test. Again, there is no formula to get the required sample size in advance, so we would again probably want to use an iterative approach.
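To give a feel for what the Mann-Whitney U test does, here is a bare-bones sketch using only the Python standard library. The function name `mann_whitney_u` and the toy data are made up for illustration, and the p-value uses the plain normal approximation with no tie or continuity correction, which is an assumption that becomes crude for very small samples. For real work a library routine such as `scipy.stats.mannwhitneyu` is the safer choice.

```python
import math
from statistics import NormalDist


def mann_whitney_u(sample1, sample2):
    """Return the U statistic for sample1 and an approximate two-sided p-value."""
    n1, n2 = len(sample1), len(sample2)
    # Count the pairs in which sample1 "wins"; ties count as half a win.
    u1 = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in sample1
        for y in sample2
    )
    # Under the null hypothesis, U is roughly normal with this mean and spread.
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p


# Toy data: every value in the first sample is below every value in the second.
u, p = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u, round(p, 3))
```

Because the test looks only at which value in each pair is larger, it does not care how skewed the underlying distributions are.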