For around ten years, I've had my introductory biology students perform experiments to attempt to determine the effectiveness of soaps and hand cleansing agents. This is really a great exercise to get students thinking about the importance of good experimental design, because it is very difficult to do an experiment that is good enough to show differences caused by their experimental treatments. The bacterial counts they measure are very variable and it's difficult to control the conditions of the experiment. Since there is no predetermined outcome, the students have to grapple with drawing appropriate conclusions from the statistical tests they conduct - they don't know what the "right" answer is for their experiment.
We are just about to start in on the project in my class again this year, so I was excited to discover that a new paper had just come out that purports to show that triclosan, the most common antibacterial agent in soap, has no effect under conditions similar to normal hand washing:
Kim, S.A., H. Moon, and M.S. Ree. 2015. Bactericidal effects of triclosan in soap both in vitro and in vivo. Journal of Antimicrobial Chemotherapy. http://dx.doi.org/10.1093/jac/dkv275 (the DOI doesn't currently dereference, but the paper is at http://m.jac.oxfordjournals.org/content/early/2015/09/14/jac.dkv275.short?rss=1)
The authors exposed 20 recommended bacterial strains to soap with and without triclosan at two different temperatures. They also exposed bacteria to regular and antibacterial soap for varying lengths of time. In a second experiment, the authors artificially contaminated the hands of volunteers who then washed with one of two kinds of soap. The bacteria remaining on the hands were then sampled.
The authors stated that there was no difference in the effect of soap with and without triclosan. They concluded that this was because the bacteria were not in contact with the triclosan long enough for it to have an effect. Based on what I've read and on the various experiments my students have run over the years, I think this conclusion is correct. So what's my problem with the paper?
Why do we like to show that things are different and not that they are the same?
My beginning students often wonder why experimental scientists are so intent on showing that things are significantly different. Why not show that they are the same? Sometimes that's what we actually want to know anyway.
When analyzing the results of an experiment statistically, we evaluate the results by calculating "P". P is the probability that we would get results at least this different by chance if the things we were comparing were actually the same. If P is high, then the differences could easily be due to random variation. If P is low, it's unlikely that the differences are due to chance variation alone, and more likely that they are caused by a real effect of the thing we are testing. The typical cutoff for statistical significance is P<0.05. If P<0.05, then we say that we have shown that the results are significantly different.
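To make the idea of P concrete, here is a minimal sketch of one way to estimate it, a permutation test. This is my own illustration, not the method used in any of the papers discussed here. The function name `permutation_p_value` and the colony counts are made up for the example.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate P: the fraction of random relabelings of the data that give
    a difference in means at least as large as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # If the treatments really are the same, the group labels are arbitrary,
        # so shuffling them shows what "chance alone" looks like.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles

# Hypothetical colony counts, invented purely for illustration.
plain_soap = [52, 61, 48, 70, 55, 66]
antibacterial_soap = [50, 58, 45, 72, 49, 63]
p = permutation_p_value(plain_soap, antibacterial_soap)
print(f"Estimated P = {p:.3f}")
```

With noisy counts and only six samples per group, a small difference in means is easily produced by shuffling alone, so P comes out well above 0.05. That is exactly the situation where the temptation is to conclude "no difference".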
The problem lies in our conclusion when P>0.05. A common (and wrong) conclusion is that when P>0.05 we have shown that the results are not different (i.e. the same). Actually, what has happened is that we have failed to show that the results are different. Isn't that the same thing?
Absolutely not. In simple terms, I put the answer this way: if P<0.05, that is probably because the things we are measuring are different. If P>0.05, that is either because the things we are measuring are the same OR it's because our experiment stinks! When differences are small, it may be very difficult to perform an experiment good enough to show that P<0.05. On the other hand, any bumbling idiot can do an experiment that produces P>0.05 through any number of poor practices: not enough samples, poorly controlled experimental conditions, or the wrong kind of statistical test.
So there is a special burden placed on a scientist who wants to show that two things are the same. It is not good enough to run a statistical test and get P>0.05. The scientist must also show that the experiment and analysis were capable of detecting differences of a certain size if they existed. This is called a "power analysis". A power analysis shows that the test has enough statistical power to uncover differences when they are actually there. Before claiming that there is no effect of the treatment (no significant difference), the scientist has to show that his or her experiment doesn't stink.
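A power analysis can be done with formulas, but a simulation shows the idea more directly. The sketch below is only an illustration under assumptions I have chosen myself: two groups, normally distributed data, and a true difference equal to one standard deviation. None of these numbers come from the Kim et al. data. The function name `estimated_power` is hypothetical.

```python
import math
import random
import statistics

# Two-sided 5% critical values of Student's t for the degrees of freedom
# used below (df = 2 * n_per_group - 2).
T_CRIT = {4: 2.776, 30: 2.042}

def estimated_power(n_per_group, true_diff, sd, n_trials=4000, seed=1):
    """Fraction of simulated experiments in which a two-sample t-test
    reports P<0.05 when the true difference between means is true_diff."""
    rng = random.Random(seed)
    t_crit = T_CRIT[2 * n_per_group - 2]
    significant = 0
    for _ in range(n_trials):
        a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        b = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
        pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
        std_err = math.sqrt(pooled_var * 2 / n_per_group)
        t = (statistics.mean(b) - statistics.mean(a)) / std_err
        if abs(t) > t_crit:
            significant += 1
    return significant / n_trials

# Assumed effect: the treatment shifts the mean by one standard deviation.
power_n3 = estimated_power(3, true_diff=1.0, sd=1.0)
power_n16 = estimated_power(16, true_diff=1.0, sd=1.0)
print(f"Power with N=3 per group:  {power_n3:.2f}")
print(f"Power with N=16 per group: {power_n16:.2f}")
```

Under these assumptions, the N=3 experiment misses a real difference most of the time, so getting P>0.05 from it says very little about whether the treatments are the same.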
So what's wrong with the Kim et al. 2015 paper???
The problem with the paper is that it doesn't actually provide evidence that supports its conclusions.
If we look at the Kim et al. paper, we can find the problem buried on the third page. Normally in a study, one reports "N", the sample size, a.k.a. the number of times you repeated the experiment. Repeating the experiment is the only way you can find out whether the differences you see are due to real differences or to bad luck in sampling. In the Kim et al. paper, with regard to the in vitro part of the study, all that is said is "All treatments were performed in triplicate." Are you joking?????!!! Three replicates is a terrible sample size for this kind of experiment, where results tend to be very variable. I guess N=2 would have been worse, but this is pretty bad.
My next gripe with the paper is in the graphs. It is a fairly typical practice in reporting results to show a bar graph where the height represents the mean value of the experimental treatment and the error bars show some kind of measure of how well that mean value is known. The amount of overlap (if any) provides a visual way of assessing how different the means are.
Typically, 95% confidence intervals or standard errors of the mean are used to set the size of the error bars. But Kim et al. used standard deviation. Standard deviation measures the variability of the data, but it does NOT provide an assessment of how well the mean value is known. Both 95% confidence intervals and standard error of the mean are influenced by the sample size as well as the variability of the data. They take into consideration all of the factors that affect how well we know our mean value. So the error bars on these graphs based on standard deviation really don't provide any useful information about how different the mean values are.*
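The relationship among the three kinds of error bars is simple enough to compute directly. The sketch below uses the rule of thumb from the footnote (95% confidence interval is roughly plus or minus two standard errors). The function names are my own, and the multiplier of 2 is an approximation: the exact multiplier comes from Student's t distribution and is larger for small samples (about 4.30 for N=3 and 2.13 for N=16).

```python
import math
import statistics

def error_bar_half_widths(values, ci_multiplier=2.0):
    """Return the half-width of each kind of error bar for one sample.

    ci_multiplier=2.0 is the rough rule of thumb; an exact 95% interval
    would use the t distribution's critical value for len(values) - 1 df.
    """
    n = len(values)
    sd = statistics.stdev(values)   # spread of the data
    sem = sd / math.sqrt(n)         # how well the mean is known
    return {"sd": sd, "sem": sem, "ci95": ci_multiplier * sem}

def ratio_sd_to_ci(n, ci_multiplier=2.0):
    """Length of an SD error bar relative to a 95% CI error bar.
    The ratio is sqrt(n) / multiplier, whatever the data are."""
    return math.sqrt(n) / ci_multiplier

for n in (3, 16):
    print(f"N={n}: SD bars are {ratio_sd_to_ci(n):.2f} times the length of CI bars")
```

The SD bar stays about the same length no matter how many samples are collected, while the SEM and confidence interval bars shrink as N grows. That is why only the latter two say anything about how well the mean is known.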
The in vivo experiment was better. In that experiment there were 16 volunteers who participated in the experiment. So that sample size is better than 3. But there are other problems.
First of all, it appears that all 16 volunteers washed their hands using all three treatments. There is nothing wrong with that, but apparently the data were analyzed using a one-factor ANOVA. In this case, the statistical test would have been much more powerful if it had been blocked by participant, since there may be variability that was caused by the participants themselves and not by the applied treatment.
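To see why blocking matters, here is a small simulation. It compares only two treatments, so the blocked analysis reduces to a paired t-test, which is the two-treatment special case of a randomized block ANOVA. All of the numbers (participant-to-participant spread, treatment effect, measurement noise) are assumptions I made up for illustration. They are not taken from the paper.

```python
import math
import random
import statistics

def unpaired_t(a, b):
    """t statistic that ignores which participant each value came from."""
    n = len(a)
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * 2 / n)

def paired_t(a, b):
    """t statistic computed on within-participant differences (blocked)."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

rng = random.Random(42)
n_participants = 16

# Assumed: participants differ a lot from one another (SD = 1.0 log units),
# the treatment effect is small (0.3), and measurement noise is small (0.2).
participant_level = [rng.gauss(5.0, 1.0) for _ in range(n_participants)]
plain = [level + rng.gauss(0.0, 0.2) for level in participant_level]
antibacterial = [level - 0.3 + rng.gauss(0.0, 0.2) for level in participant_level]

t_unblocked = unpaired_t(plain, antibacterial)
t_blocked = paired_t(plain, antibacterial)
print(f"t ignoring participants:    {t_unblocked:.2f}")
print(f"t blocking by participant:  {t_blocked:.2f}")
```

The same data give a much larger t statistic when each participant serves as his or her own control, because the variation among participants no longer drowns out the treatment effect.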
Secondly, the researchers applied an a posteriori Tukey's multiple range test to determine which pairwise comparisons were significantly different. Tukey's test is appropriate in cases where there is no a priori rationale for comparing particular pairs of treatments. However, in this case, it is perfectly clear which pair of treatments the researchers are interested in: the comparison of regular and antibacterial soap! Just look at the title of the paper! The comparison of the soap treatments with the baseline is irrelevant to the hypothesis that is being tested, so its presence does not create a requirement for a test for unplanned comparisons. Tukey's test adjusts the experiment-wise error rate to account for multiple unplanned comparisons, effectively raising the bar and making it harder to show that P<0.05; in this case jury-rigging the test to make it more likely that the effects will NOT be different.
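The arithmetic behind "raising the bar" can be sketched as follows. I use a Bonferroni-style correction here because it fits in one line. It is NOT Tukey's test, which uses the studentized range distribution, but it illustrates the same principle. The experiment-wise error formula assumes the comparisons are independent, which is only approximately true.

```python
from itertools import combinations

def pairwise_comparisons(treatments):
    """All unordered pairs of treatments that an unplanned test would compare."""
    return list(combinations(treatments, 2))

def experiment_wise_error(alpha, n_comparisons):
    """Chance of at least one false positive if every comparison is tested
    at alpha and the comparisons are (assumed) independent."""
    return 1 - (1 - alpha) ** n_comparisons

treatments = ["baseline", "plain soap", "antibacterial soap"]
pairs = pairwise_comparisons(treatments)
alpha = 0.05

family_error = experiment_wise_error(alpha, len(pairs))
bonferroni_alpha = alpha / len(pairs)  # stricter per-comparison cutoff

print(f"{len(pairs)} unplanned comparisons")
print(f"Experiment-wise error if each is tested at 0.05: {family_error:.3f}")
print(f"Bonferroni per-comparison cutoff: {bonferroni_alpha:.4f}")
```

A single planned comparison of plain versus antibacterial soap could be tested at the ordinary 0.05 cutoff. Treating it as one of three unplanned comparisons forces it to clear a stricter threshold, which lowers the power to detect a real difference.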
Both the failure to block by participant and the use of an inappropriate a posteriori test make the statistical analysis weaker, not stronger, and a stronger test is what you need if you want to show that the reason you failed to show differences was that they weren't there.
The graph is also misleading for the reasons I mentioned about the first graph. The error bars here apparently bracket the range within which the middle 80% of the data fall. Again, this is a measure of the dispersion of the data, not a measure of how well the mean values are known. We can draw no conclusions from the degree of overlap of the error bars, because the error bars represent the wrong thing. They should have been 95% confidence intervals if the authors wanted to have meaning in the amount of overlap.
Is N=16 an adequate sample size? We have no idea, because no power test was reported. This kind of sloppy experimental design and analysis seems to be par for the course in experiments involving hand cleansing. I usually suggest that my students read the scathing rebuke by Paulson (2005) of the Sickbert-Bennett et al. (2005) paper, which bears some similarities to the Kim et al. paper. Sickbert-Bennett et al. claimed that it made little difference what kind of hand cleansing agent one used or if one used any agent at all. However, Paulson pointed out that the sample size used by Sickbert-Bennett et al. (N=5) would have needed to be as much as 20 times larger (i.e. N=100) to make their results conclusive. Their experiment was way too weak to draw the conclusion that the factors had the same effect. This is probably also true for Kim et al., although to know for sure, somebody needs to run a power test on their data.
What is wrong here???
There are so many things wrong here, I hardly know where to start.
1. Scientists who plan to engage in experimental science need to have a basic understanding of experimental design and statistical analysis. Something is really wrong with our training of future scientists if we don't teach them to avoid basic mistakes like this.
2. Something is seriously wrong with the scientific review process if papers like this get published with really fundamental problems in their analyses and in the conclusions that are drawn from those analyses. The fact that this paper got published means not just that four co-authors don't know basic experimental design and analysis, but that two or more peer reviewers and an editor can't recognize problems with experimental design and analysis.
3. Something is seriously wrong with science reporting. This paper has been picked up and reported online by newsweek.com, cbsnews.com, webmd, time.com, huffingtonpost.com, theguardian.com, and probably more. Did any of these news outlets read the paper? Did any of them consult with somebody who knows how to assess the quality of research and get a second opinion on this paper? SHAME ON YOU, news media!!!!!
Steve Baskauf is a Senior Lecturer in the Biological Sciences Department at Vanderbilt University, where he introduces students to elementary statistical analysis in the context of the biology curriculum.
* Error bars that represent standard deviation will always span a larger range than those that represent standard error of the mean, since standard error of the mean is estimated by s/sqrt(N). The 95% confidence interval is + or - approximately two times the standard error of the mean. So when N=3, the 95% confidence interval will be approximately +/- 2s/sqrt(3) or +/-1.15s. So in the case where N=3, the standard deviation error bars span a range that is slightly smaller than the 95% confidence interval error bars would span. This makes it slightly easier to have error bars that don't overlap than it would be if they represented 95% confidence intervals. When N=16, the 95% confidence interval would be approximately +/- 2s/sqrt(16) or +/-s/2. In this case, the standard deviation error bars are twice the size of the 95% confidence intervals, making it much easier to have error bars that overlap than if the bars represented 95% confidence intervals. In a case like this where we are trying to show that things are the same, making the error bars twice as big as they should be makes the sample means look more similar than they actually are, which is misleading. The point here is that using standard deviations for error bars is the wrong thing to do when comparing means.
 Paulson, D.S. 2005. Response: comparative efficacy of hand hygiene agents. American Journal of Infection Control 33:431-434. http://dx.doi.org/10.1016/j.ajic.2005.04.248
 Sickbert-Bennett E.E., D.J. Weber, M.F. Gergen-Teague, M.D. Sobsey, G.P. Samsa, W.A. Rutala. 2005. Comparative efficacy of hand hygiene agents in the reduction of bacteria and viruses. American Journal of Infection Control 33:67-77. http://dx.doi.org/10.1016/j.ajic.2004.08.005