“Powered to Detect Small Effect Sizes”: You keep saying that. I do not think it means what you think it means.

Last month Aisling Ni Chonaire and I published a new Working Paper through the Centre for Market and Public Organisation. The paper explores how researchers can choose a sample size large enough to detect an effect in a randomised controlled trial, but small enough to make the trial workable. We have focussed on education research in particular, but the issues discussed apply to many policy areas. The main arguments are summarised below.

Randomised trials – the gold standard?

While Randomised Controlled Trials (RCTs) have been accepted in medicine for a while now, it is only in recent years that they have risen to prominence in the social sciences and in policy circles. Leading the charge, to a large extent, has been the Education Endowment Foundation (EEF), which was set up in 2010 by the Department for Education with a mission to test innovative policies using RCTs. Since then, the EEF has funded scores of RCTs in education – an amazing contribution to what we know about this vital area of public policy.

Unfortunately, not all RCTs are equal. One of the first steps in running a trial is to work out how big a sample you will need. An important factor here is the size of the effect you are looking for. The bigger the effect, the easier it is to see, and hence the smaller the sample you need. The biggest problem with this is obvious: if you already know the effect size an intervention is going to have, you don’t need to conduct a trial. As a result, researchers must rely on rules of thumb. The most commonly cited of these is the one provided by Cohen (1988), which gives values for “small”, “medium”, and “large” effects. Analysing the results of over 100 interventions in education, we find that 87% of the reported findings in the papers we surveyed are smaller than “small”, and only about 2.5% of studies qualify as “large”. The average effect size is roughly half a “small” effect. More than half of studies find nothing statistically significant.
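To make the link between effect size and sample size concrete, here is a back-of-the-envelope sketch. It is an illustration only, not the calculation used in the paper. It assumes a simple two-arm trial with equal group sizes, individual-level randomisation, a two-sided test at the 5% level, 80% power, and the standard normal approximation. The function name is made up for this example, and the benchmark values of 0.2, 0.5 and 0.8 standard deviations are Cohen's conventional “small”, “medium” and “large”. Real education trials usually randomise whole classes or schools, which pushes the required sample up further.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(effect_size, alpha=0.05, power=0.80):
    """Approximate pupils needed in EACH arm of a two-arm trial.

    Uses the normal approximation n = 2 * (z_alpha + z_power)^2 / d^2,
    where d is the standardised effect size. Clustering is ignored,
    which is an assumption made for this sketch.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


benchmarks = [("large", 0.8), ("medium", 0.5), ("small", 0.2), ("half of small", 0.1)]
for label, d in benchmarks:
    print(f"{label:>13} (d = {d}): {sample_size_per_arm(d)} per arm")
```

Under these assumptions a “small” effect needs about 393 participants per arm, while an effect half that size needs about 1,570. Halving the effect you want to detect roughly quadruples the sample you need.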

“Null” Results

It is tempting to write these interventions off as failures. However, it could be that our concept of smallness is wrong. A study by William (2008) finds that the effect on grades of an extra year of schooling is roughly at the level we think of as “small”. An entire year of schooling seems like quite a meaty intervention to us, so should we really be designing studies that can only detect increases of that size? If an intervention is cheap and only as effective as, say, one month of schooling, that would still seem worth pursuing. Unfortunately, a lot of the research out there does not have large enough samples to let us differentiate between effects that are small and effects that are simply not there at all.
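The same logic can be run in reverse: given a sample size, what is the smallest effect a trial could reasonably expect to detect? The sketch below is illustrative only. It assumes a two-arm trial with equal arms, no clustering, a two-sided 5% test and 80% power. The function name and the trial sizes are hypothetical and are not taken from the studies we surveyed.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_per_arm, alpha=0.05, power=0.80):
    """Smallest standardised effect detectable with n_per_arm in each arm.

    Normal approximation: MDE = (z_alpha + z_power) * sqrt(2 / n).
    Individual randomisation is assumed for this sketch.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return (z_alpha + z_power) * sqrt(2 / n_per_arm)


# Hypothetical trial sizes, chosen only to show the pattern.
for n in (50, 100, 400, 1600):
    print(f"{n:>5} per arm -> minimum detectable effect of {minimum_detectable_effect(n):.2f}")
```

On these assumptions, a trial with 100 participants per arm can only reliably pick up effects of around 0.4 standard deviations, about twice Cohen's “small” benchmark of 0.2. A true effect of 0.1 would most likely show up as a “null” result in such a trial, even though it is real.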

It’s a broad problem

This problem is not limited to education research – far from it. We have chosen education for our first foray into this field simply because there are many studies available, because there are standard ways of measuring effects, and because, thanks to the EEF, the field is less mired than many others in issues like publication bias. There would certainly seem to be merit in a range of similar reviews in other fields, which would allow us to design better trials with a higher chance of detecting plausible effects, based on what we already know rather than on generic rules of thumb. Otherwise, we might find that our gold standard begins to tarnish.

Authors

Michael Sanders