This weekend I find myself at an invitation-only event in Phoenix, Arizona, organized by the Broken Science Initiative and called *The Broken Science Epistemology Camp*. I flew here on Thursday and will be returning on Tuesday, so it’s a flying visit to the USA. I thank the organizers Greg Glassman and Emily Kaplan for inviting me. I wasn’t sure what to expect when I accepted the invitation to come but I welcomed the chance to attend an event that’s a bit different from the usual academic conference. There are some suggestions here for background reading which you may find interesting.

Yesterday we had a series of wide-ranging talks about subjects such as probability and statistics, the philosophy of science, the problems besetting academic research, and so on. One of the speakers was the eminent psychologist Gerd Gigerenzer, whose talk covered the use of p-values in statistics, the effects of bad statistical reasoning on the reporting of research results, and the wider issues this generates. You can find a paper covering many of the points raised by Gigerenzer here (PDF).

I’ve written about this before on this blog – see here for example – and I thought it might be useful to reiterate some of the points.

The p-value is a frequentist concept that corresponds to the probability of obtaining a value at least as large as that obtained for a test statistic under a “null hypothesis”. To give an example, the null hypothesis might be that two variates are uncorrelated; the test statistic might be the sample correlation coefficient *r* obtained from a set of bivariate data. If the data were uncorrelated then *r* would have a known probability distribution, and if the value measured from the sample were such that its numerical value would be exceeded with a probability of 0.05 then the p-value (or significance level) is 0.05.
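To make the correlation example concrete, here is a minimal sketch in Python. It assumes made-up bivariate data and, rather than using the known analytic distribution of *r*, it estimates the null distribution by shuffling one variate (a permutation test). The function names `pearson_r` and `permutation_p_value` are my own, chosen for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_perm=5000, seed=0):
    """One-sided p-value: how often does shuffled data give r >= observed r?"""
    rng = random.Random(seed)
    r_obs = pearson_r(xs, ys)
    shuffled = list(ys)
    count = 0
    for _ in range(n_perm):
        # Shuffling ys destroys any association with xs, which mimics the null.
        rng.shuffle(shuffled)
        if pearson_r(xs, shuffled) >= r_obs:
            count += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return r_obs, (count + 1) / (n_perm + 1)


# Hypothetical data: 30 points with a modest built-in correlation.
data_rng = random.Random(1)
xs = [data_rng.gauss(0, 1) for _ in range(30)]
ys = [0.5 * x + data_rng.gauss(0, 1) for x in xs]

r_obs, p = permutation_p_value(xs, ys)
print(f"r = {r_obs:.3f}, one-sided permutation p-value = {p:.4f}")
```

Note that the number printed is a statement about how often uncorrelated data would produce a value of *r* this large. It is not the probability that the variates are uncorrelated.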

Whatever the null hypothesis happens to be, the way a frequentist would proceed would be to calculate what the distribution of measurements would be if it were true. If the actual measurement is deemed to be unlikely (say that it is so high that only 1% of measurements would turn out that big under the null hypothesis) then you reject the null, in this case with a “level of significance” of 1%. If you don’t reject it then you tacitly accept it unless and until another experiment does persuade you to shift your allegiance.

But the p-value merely specifies the probability that you would reject the null hypothesis if it were correct. This is what you would call making a Type I error. It says nothing at all about the probability that the null hypothesis is *actually* a correct description of the data or that some other hypothesis is needed. To make that sort of statement you would need to specify an alternative hypothesis, calculate the distribution based on it, and determine the statistical power of the test, i.e. the probability that you would actually reject the null hypothesis when the alternative hypothesis, rather than the null, is correct. To fail to reject the null hypothesis when it’s actually incorrect is to make a Type II error.
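The two kinds of error can be illustrated with a small simulation. This is only a sketch under assumptions of my own: a one-sided z-test on the mean of 25 unit-variance Gaussian measurements, a null hypothesis of zero mean, and an alternative with mean 0.5. The function `rejection_rate` is a hypothetical helper, not part of any library.

```python
import math
import random
from statistics import NormalDist


def rejection_rate(true_mean, n=25, alpha=0.05, n_trials=20000, seed=0):
    """Fraction of simulated experiments in which the null (mean = 0) is rejected."""
    rng = random.Random(seed)
    # Critical value for a one-sided test at significance level alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(n_trials):
        sample_mean = sum(rng.gauss(true_mean, 1.0) for _ in range(n)) / n
        z = sample_mean * math.sqrt(n)  # test statistic, assuming known sigma = 1
        if z > z_crit:
            rejections += 1
    return rejections / n_trials


# Null is true: every rejection is a Type I error.
type_i_rate = rejection_rate(true_mean=0.0)

# Alternative is true: the rejection rate is the power of the test.
power = rejection_rate(true_mean=0.5)
type_ii_rate = 1 - power

print(f"Type I error rate  ~ {type_i_rate:.3f}")
print(f"Power              ~ {power:.3f}")
print(f"Type II error rate ~ {type_ii_rate:.3f}")
```

With these assumed numbers the Type I error rate should come out close to the chosen 5%, and the power close to its analytic value of about 0.80. Neither figure tells you how likely the null hypothesis is to be true.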

If all this stuff about p-values, significance, power and Type I and Type II errors seems a bit bizarre, I think that’s because it is. It’s so bizarre, in fact, that I think most people who quote p-values have absolutely no idea what they really mean. Gerd Gigerenzer gave plenty of examples of this in his talk.

A Nature piece published some time ago argues that results quoted with a p-value of 0.05 in fact turn out to be wrong about 25% of the time. There are a number of reasons why this could be the case, including that the p-value is being calculated incorrectly, perhaps because some assumption or other turns out not to be true. A widespread example is the assumption that the variates concerned are normally distributed. Unquestioning application of off-the-shelf statistical methods in inappropriate situations is a serious problem in many disciplines, but is particularly prevalent in the social sciences, where samples are also typically rather small.

The suggestion that this issue can be resolved simply by choosing stricter criteria, i.e. a p-value of 0.005 rather than 0.05, does not help. The p-value is an answer to a question about what the hypothesis says about the probability of the data, which is quite different from the question a scientist would really want to ask, namely what the data have to say about a given hypothesis. Frequentist hypothesis testing is intrinsically confusing compared to the logically clearer Bayesian approach, which does focus on the probability of a hypothesis being right given the data, rather than on properties that the data might have given the hypothesis. If I had my way I’d ban p-values altogether.
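The difference between the two questions can be seen with a few lines of arithmetic using Bayes’ theorem. The sketch below is a deliberately crude illustration, not the calculation from the Nature piece: it assumes only two hypotheses (null and alternative), a fixed power of 0.8, and a prior probability for the null that I have simply made up. The function name `prob_null_given_significant` is hypothetical.

```python
def prob_null_given_significant(prior_null, power, alpha):
    """P(null is true | result is 'significant'), via Bayes' theorem.

    alpha is P(significant | null) and power is P(significant | alternative).
    """
    sig_and_null = alpha * prior_null
    sig_and_alt = power * (1 - prior_null)
    return sig_and_null / (sig_and_null + sig_and_alt)


# Same significance threshold and power, different (assumed) priors.
for prior_null in (0.5, 0.8, 0.9):
    posterior = prob_null_given_significant(prior_null, power=0.8, alpha=0.05)
    print(f"prior P(null) = {prior_null:.1f} -> "
          f"P(null | significant at 0.05) = {posterior:.3f}")
```

The threshold on the data is the same in every row, yet the probability that the null hypothesis is true after a “significant” result varies a great deal with the prior. That dependence is exactly what a p-value, on its own, cannot express.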

The p-value is just one example of a statistical device that is too often applied mechanically without real understanding, as a black box, and which can be manipulated through data dredging (or “p-hacking”). Gerd Gigerenzer went on to bemoan the general use of “mindless statistics” and the prevalence of “statistical rituals”, and referred to much statistical reasoning as “a meaningless ordeal of pedantic computations”.

Bad statistics isn’t the only thing wrong with academic research, but it is a significant factor.