A while ago I found a very interesting paper by Leah R. Jager and Jeffrey T. Leek, via a post on the Simply Statistics blog, arguing that most published medical research is true, with a rate of false positives among reported results of 14% ± 1%. Their paper came as a response to an essay by John P. A. Ioannidis, and to several other authors, claiming that most published research findings are false.
After dealing with some criticisms Mr. Leek made a good point in his post:
“I also hope that by introducing a new estimator of the science-wise fdr we inspire more methodological development and that philosophical criticisms won’t prevent people from looking at the data in new ways.”
And thus, following this advice, I didn't let criticisms prevent me from looking at the data in a new way. For this problem I have devised a probability distribution for p-values, fitted it to the data via MLE, and inferred from there the rate of false positives.
This is my take: a 15.33% rate of false positives, with a worst-case scenario of 41.75% depending on how mischievous researchers are. But in any case, and contrary to what other authors claim, most medical research seems to be true.
Leah R. Jager and Jeffrey T. Leek's methodology has two main characteristics:
- It focuses on a method that can be applied only to reported p-values less than 0.05.
- It models the distribution of p-values from 0 to 0.05 as the following mixture of a Uniform component (null results) and a truncated Beta component (real effects):

f(p | π₀, a, b) = π₀ · (1/0.05) + (1 − π₀) · f_Beta(p; a, b) / F_Beta(0.05; a, b),  for 0 < p ≤ 0.05
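As a sketch of that mixture (my paraphrase, ignoring their handling of rounded p-values; π₀, a and b are the free parameters):

```python
from scipy.stats import beta

def jager_leek_density(p, pi0, a, b):
    """Mixture density over (0, 0.05]: truncated Uniform (null results)
    plus truncated Beta (real effects)."""
    uniform_part = 1.0 / 0.05                        # Uniform(0, 0.05) density
    beta_part = beta.pdf(p, a, b) / beta.cdf(0.05, a, b)  # Beta truncated to (0, 0.05)
    return pi0 * uniform_part + (1 - pi0) * beta_part
```

Both components integrate to one over (0, 0.05), so the mixture is a proper density on that interval for any π₀ in [0, 1].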
The first obvious issue the authors deal with is the conspicuous rounding of p-values in publications. We can also see why it makes sense to cut the data at the 0.05 threshold: there is a big jump in frequency there, most probably due to publication bias and to the several flavors of p-hacking that researchers use to pass editors' filters and get published.
Publication bias should not affect the density of p-values in the range [0, 0.05], but p-hacking is another story, because we can only guess what researchers are doing. This means that any calculation will have to be either robust to any kind of p-hacking or able to assess the p-hacking impact precisely, which I see as a difficult task.
Although Mr. Leek addresses this problem in his post, this is the first thing I will do differently in my approach: in order to avoid the p-hacking bias in the data as much as possible, instead of using the p-values in the range [0, 0.05] we will use the p-values in the range (0.05, 1]. This range also suffers from the “harmless” publication bias but, since there is no reason (at least not a straightforward one) to p-hack an experiment into a result above 0.05, we can fairly assume that these p-value densities are more “honest”.
Besides, if some of these honest p-values close to 0.05 were artificially pushed past the significance threshold by researchers, the altered densities would favor the pessimistic view of high rates of false positives, since they would increase the prior probability of a study not being statistically significant. Therefore, doing inference with these “honest” p-values guarantees conservative results.
Jager and Leek also use a mixture of a Beta and a Uniform distribution to model the p-values. This roughly splits experiments into those with some effect and those with no effect at all, which at first sight might seem a bit unrealistic for a model, since experiments with no effect at all virtually do not exist. On the other hand, the flexibility the Beta distribution offers might be seen as more of a problem than a virtue in this context, since it can adapt to anything p-value-like regardless of where it comes from. However, Leek also offers some simulations in his post showing the strength of their approach when p-hacking is present.
Nonetheless, instead of using a mixture, we will construct from scratch a truncated statistical distribution based on the p-values obtained from one-sided hypothesis test experiments where the Null Hypothesis follows a standard normal N(0, 1) and the possible Alternative Hypotheses spread following a N(θ, 1), with the effects θ drawn from a N(μ, σ). This might result in a more realistic distribution for the p-values, since it has less flexibility than the Beta distribution and needs no mixture with a Uniform distribution to account for non-significant results. So let's begin.
Next we can see the (0.05,1] “honest” p-values from Pubmed:
We can first observe that researchers usually round their p-values to two digits, so in order to even out the distribution we will round all the available p-values to two digits. This is the result:
Now the data looks good except for tiny peaks showing that some researchers round their p-values to one digit instead of two. The effect is so small, though, that we might as well ignore it, since it would make no sense to lose further information by rounding all the data to one digit. Instead, we can adjust the values at 0.10, 0.20, …, 0.90 by averaging the two previous and the two following rounded values and using that result. The final result is shown next:
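The rounding and smoothing steps can be sketched as follows (the Beta draw is only a hypothetical stand-in for the real Pubmed sample):

```python
import numpy as np

# Hypothetical stand-in for the mined Pubmed p-values (illustration only).
rng = np.random.default_rng(42)
pvals = rng.beta(0.7, 3.0, size=50_000)

# Round every p-value to two digits to even out the mixed reporting precision.
rounded = np.round(pvals, 2)
values, counts = np.unique(rounded, return_counts=True)
freq = dict(zip(values.tolist(), counts.tolist()))

# Smooth the one-digit rounding peaks at 0.10, 0.20, ..., 0.90 by replacing
# each peak with the average of the two previous and two following bins.
for peak in [round(0.1 * k, 2) for k in range(1, 10)]:
    neighbors = [round(peak + d, 2) for d in (-0.02, -0.01, 0.01, 0.02)]
    freq[peak] = np.mean([freq.get(v, 0) for v in neighbors])
```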
Once these rounding peaks have been smoothed out we have the data ready to work with. Now we just need a meaningful probability distribution to fit the data and do some inference.
The P-Value Distribution
Let’s consider a collection of experiments where the Null Hypothesis follows a N(0, 1) and the possible Alternative Hypotheses spread following a N(θ, 1), with θ ~ N(μ, σ). Such a scenario will look something like this:
So we will assume that each p-value in our data comes from one of these experiments. If we now consider the collection of experiments as a whole, we have that

X | θ ~ N(θ, 1),  θ ~ N(μ, σ)  ⟹  X ~ N(μ, √(1 + σ²))

which means we can collapse the collection of experiments into one unique normal distribution that fully characterizes it, as we can see in the next figure:
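A quick simulation illustrates this collapse (μ = 1.87 and σ = 1.88 are merely placeholder values here):

```python
import numpy as np

mu, sigma = 1.87, 1.88          # placeholder parameters for the spread of effects
rng = np.random.default_rng(0)

theta = rng.normal(mu, sigma, size=200_000)  # one true effect per experiment
x = rng.normal(theta, 1.0)                   # one test statistic per experiment

# The marginal distribution of x should match N(mu, sqrt(1 + sigma^2)).
print(x.mean(), x.std(), np.sqrt(1 + sigma**2))
```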
So now, instead of treating each p-value as coming from a different experiment, we can consider all of them as coming from repetitions of one unique experiment without losing generality. This insight will allow us to easily treat the p-value as a random variable and to produce a CDF for it based on this scenario.
P-Value CDF and PDF
To figure out the CDF and the PDF of a random variable P behaving as the p-values previously described, note that for our one-sided test P = 1 − Φ(X), with X ~ N(μ, √(1 + σ²)), so we can proceed as follows:

F_P(p) = Pr(P ≤ p) = Pr(1 − Φ(X) ≤ p) = Pr(X ≥ Φ⁻¹(1 − p)) = 1 − Φ( (Φ⁻¹(1 − p) − μ) / √(1 + σ²) )

Since the standard normal is symmetric, Φ⁻¹(1 − p) = −Φ⁻¹(p), so we can now simplify the CDF and differentiate it to obtain the PDF of P:

F_P(p) = Φ( (Φ⁻¹(p) + μ) / √(1 + σ²) )

f_P(p) = φ( (Φ⁻¹(p) + μ) / √(1 + σ²) ) / ( √(1 + σ²) · φ(Φ⁻¹(p)) )

where φ and Φ denote the standard normal PDF and CDF.
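The CDF and PDF of P can be transcribed directly into code, with a sanity check that μ = 0, σ = 0 (every alternative coincides with the null) recovers the uniform distribution:

```python
import numpy as np
from scipy.stats import norm

def pvalue_cdf(p, mu, sigma):
    """F_P(p) = Phi((Phi^-1(p) + mu) / sqrt(1 + sigma^2))."""
    s = np.sqrt(1.0 + sigma**2)
    return norm.cdf((norm.ppf(p) + mu) / s)

def pvalue_pdf(p, mu, sigma):
    """f_P(p) = phi((Phi^-1(p) + mu) / s) / (s * phi(Phi^-1(p)))."""
    s = np.sqrt(1.0 + sigma**2)
    z = norm.ppf(p)
    return norm.pdf((z + mu) / s) / (s * norm.pdf(z))
```

With μ = 0 and σ = 0 the PDF is flat at 1 and the CDF is the identity, as expected for p-values under a true null.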
The next plot shows how the p-value PDF behaves for a variety of scenarios:
Just as expected, once the Alternative Hypothesis behaves as the Null (μ = 0, σ = 0) we have a uniform distribution and, the bigger the effect size, the closer the PDF looks to the p-values we collected from Pubmed.
So now we need to figure out which among all the possible (μ, σ) scenarios best describes the p-values we have from Pubmed, and we will do so via MLE.
Likelihood Function for P-Values
Considering a collection of n i.i.d. random variables following the p-value distribution previously described, we have the following joint PDF:

f(p₁, …, pₙ | μ, σ) = ∏ᵢ f_P(pᵢ; μ, σ)

Thus the likelihood function will be:

L(μ, σ | p₁, …, pₙ) = ∏ᵢ f_P(pᵢ; μ, σ)

In order to maximize the likelihood for our data we will use the log-likelihood instead, for computational reasons:

ℓ(μ, σ) = ∑ᵢ log f_P(pᵢ; μ, σ)

So to obtain the MLE estimators for μ and σ we simply need to maximize the log-likelihood:

(μ̂, σ̂) = argmax_{(μ, σ)} ℓ(μ, σ)

Now, since we are only going to use honest p-values, that is, p-values in the range (0.05, 1], we need the truncated version of the PDF to estimate the parameters. Therefore, if p ≤ 0.05 the truncated PDF is zero; otherwise:

f_T(p; μ, σ) = f_P(p; μ, σ) / (1 − F_P(0.05; μ, σ))

And the log-likelihood function that we need to maximize is

ℓ_T(μ, σ) = ∑ᵢ [ log f_P(pᵢ; μ, σ) − log(1 − F_P(0.05; μ, σ)) ]
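The whole fitting procedure can be sketched on synthetic data as follows. This is only an illustration of the approach: scipy's Nelder–Mead is my choice of maximizer (not necessarily the one used for the post), and the true (μ, σ) = (1.87, 1.88) used to simulate is a placeholder:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def trunc_loglik(params, p):
    """Log-likelihood of p-values in (0.05, 1] under the truncated model."""
    mu, sigma = params
    s = np.sqrt(1.0 + sigma**2)
    z = norm.ppf(p)
    log_pdf = norm.logpdf((z + mu) / s) - np.log(s) - norm.logpdf(z)
    log_norm = np.log1p(-norm.cdf((norm.ppf(0.05) + mu) / s))  # log(1 - F_P(0.05))
    return np.sum(log_pdf - log_norm)

# Simulate the model, keep only the "honest" p-values, and fit.
rng = np.random.default_rng(1)
theta = rng.normal(1.87, 1.88, size=400_000)
p = 1.0 - norm.cdf(rng.normal(theta, 1.0))
p = p[(p > 0.05) & (p < 1.0)]  # drop the few that round to exactly 1.0 in floats

res = minimize(lambda q: -trunc_loglik(q, p), x0=(1.0, 1.0), method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], abs(res.x[1])  # sigma enters squared, so take |.|
print(mu_hat, sigma_hat)
```

The recovered parameters land close to the (1.87, 1.88) used to generate the data, even though more than half of the sample is thrown away by the truncation.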
And if we do so with the honest p-values from Pubmed we obtain the following result:
This means that the honest p-values coming from Pubmed suggest that medical experiments as a whole behave like multiple repetitions of one hypothesis test experiment where the Null Hypothesis follows a N(0, 1) and (μ̂, σ̂) = (1.870551, 1.883980). The next figure shows this scenario:
And this scenario is equivalent to having the p-values come from a collection of N(θ, 1) alternatives with θ ~ N(1.870551, 1.883980), as shown in the next plot:
The percentage of experiments with significant expected values at the α = 0.05 level can be inferred from the power of the test when the collection is considered as one unique experiment, or simply by calculating

Pr(θ ≥ Φ⁻¹(0.95)) = 1 − Φ( (Φ⁻¹(0.95) − μ̂) / σ̂ )

This percentage is ~54.768%.
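This number can be reproduced directly from the fitted parameters:

```python
from scipy.stats import norm

mu_hat, sigma_hat = 1.870551, 1.883980   # MLE fit reported above
crit = norm.ppf(0.95)                    # one-sided critical value, ~1.6449

# Share of experiments whose expected effect theta exceeds the critical value.
share = 1.0 - norm.cdf((crit - mu_hat) / sigma_hat)
print(round(share, 5))  # ~0.54768
```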
Rate of False Positives
The model fitted with the honest p-values suggests that they come from a framework where 55% of the hypotheses tested are statistically significant at α = 0.05. Therefore, if no p-hacking or publication bias took place, the percentage of statistically significant medical literature in Pubmed should be around 55%. However, the actual percentage of statistically significant literature in Pubmed is 80.7%… suspiciously close to the 80% power with which most hypothesis tests are designed.
This 25% gap might be explained by editors rejecting non-significant papers to achieve the suspicious 80% of statistically significant literature in their journals, by researchers torturing their experiments until they achieve significance or, most likely, a combination of both.
If all the bias were due to editors' choices then the rate of false positives would be unaffected. The total rate of false positives coming from the model can be calculated as the joint probability, under the fitted model, that an experiment whose expected value is not significant nonetheless produces a p-value below 0.05.
And since the model fitted with the honest p-values suggests that 55% of results are statistically significant, if we now consider that 8.4 percentage points of that 55% are false positives, we have a relative rate of false positives of 8.4 / 54.768 ≈ 15.33%.
On the other hand, if we consider the worst possible scenario, where the whole 25% gap is caused by p-hacking non-statistically-significant experiments, then we would have 25% + 8.4% false positives out of an 80% significance rate, which would mean a relative rate of 41.75% of false positives. So, even in the worst case, we would still have a majority of around 58% of correct statistically significant publications.
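The arithmetic behind both scenarios, spelled out:

```python
# Model-implied shares from the sections above.
significant = 0.54768   # share of tests that should come out significant
false_pos   = 0.084     # absolute share of false positives among all tests

# Optimistic scenario: no p-hacking, only editorial selection.
optimistic = false_pos / significant            # ~0.1534

# Worst case: the whole 25% gap up to the observed 80% is p-hacked,
# and every p-hacked result is a false positive.
worst_case = (false_pos + 0.25) / 0.80          # ~0.4175

print(round(optimistic, 4), round(worst_case, 4))
```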
The 15.33% false positive rate we have obtained for a non-p-hacking scenario falls quite close to the 14% ± 1% from Jager and Leek. However, since it is just as hard to believe that the 25% gap between the estimated 55% and the real 80% is all due to p-hacking as that there is no p-hacking at all, it is reasonable to assume that the real percentage lies somewhere between the mildly optimistic 15.33% and the overly pessimistic 41.75%.
In any case the approach presented in this post vindicates the main claim from Jager and Leek about most medical research being true.
Studies arguing that most research is false are based on research having a low prior probability of being statistically significant, which might not be a realistic scenario, since researchers do not randomly try substances to see what happens. I mean, it is not like one day a researcher decides to test whether Cocoa Puffs cures cancer.
UPDATE: I have reasons to believe that the honest p-values are not that honest after all. Revisiting the calculations for the MLE fitting, I removed the one-digit-rounded p-values (0.1, 0.2, …, 0.9) and, when fitting μ and σ again, the resulting PDF does not nicely fit the histogram in the higher p-value range:
So, if we assume that researchers might be tempted to publish results close to statistical significance (or editors tempted to choose them), it turns out that I will need to repeat the calculation with another set of honest p-values, namely those in the range (0.1, 0.9]. When doing the fitting in this range, this is what we see:
And this means that the (μ, σ) = (1.870551, 1.883980) obtained before turns into (0.6283843, 1.3576996), which means I need to recalculate the whole thing and update the results in the post… yaaay!