**I find the use of statistics in the justice system a thrilling subject, especially when you find out that people like Lucia de Berk have been handed life sentences based solely on flawed statistics coming from experts like Mr. Henk Elffers. So in this post I’ll talk about what he did wrong and how to avoid this kind of huge boo-boo in our statistical lives.**

The use of statistics in the justice system actually has a long history: the amazing mathematician / engineer / physicist / philosopher of science Henri Poincaré already had to correct the misuse of statistics in the infamous Dreyfus trial.

But it was in the Lucia de Berk trial where wrongly combined p-values handed her a life sentence. I won’t go into the details of the trial; for that there are many other places, like **Mr. Richard D. Gill**’s web page account of the trial and a video worth having a look at. Instead **I will focus on how to appropriately deal with a bunch of p-values to make sense of our data.**

Mr. Elffers, the expert used by the prosecution in Lucia’s case, stated in court *“1 in 342 million that such an extreme number of incidents would be concentrated in her shifts purely by chance”* and this pretty much sentenced her for life. But let’s see how he arrived at this number.

He had three small p-values *p1, p2, p3* corresponding to how significant the number of deaths in her shifts was compared to what is considered normal. He then proceeded to multiply these p-values, applied a Bonferroni correction by multiplying the result again by 27, and finally inverted the final “p-value” to calculate the odds. Voilà: “Guilty of all charges” with a 1 in 342 million chance… blah blah blah.

**The huge, gigantic mistake he made** (among others) **was to confuse the probability of something happening with its statistical significance.** Let us consider an example: imagine we are told to go around the block and bring back the sequence of males (1) and females (0) we meet on our journey. We might come back with a sequence like **100101010001111101000101010110**.

What’s the probability of such an event? Well, if we take the probability of meeting a male to be 1/2, and of meeting a female to be 1/2, then the chance of finding **that** exact sequence of 30 males and females is (1/2)^30, which is around **1 in a billion!** Every other sequence of 30 is exactly as improbable, so a tiny probability on its own says nothing about significance. So if our freedom depended on us telling the truth, these chances presented to a jury might give us a life sentence too, and this is pretty much what Mr. Elffers did… **And the defense team ignored it.**
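For anyone who wants to check the arithmetic, here is a tiny Python sketch. The 1/2 probability per person is the same simplifying assumption made above, and the variable names are mine.

```python
from fractions import Fraction

# The sequence we "observed" walking around the block (1 = male, 0 = female).
sequence = "100101010001111101000101010110"

# Assumption: each person is male or female with probability 1/2, independently.
p_single = Fraction(1, 2)

# Probability of this exact sequence: one factor of 1/2 per person met.
p_sequence = p_single ** len(sequence)

# Any other sequence of the same length is just as "unlikely",
# for instance thirty males in a row.
p_all_males = p_single ** len("1" * len(sequence))

print(f"length of the sequence: {len(sequence)}")
print(f"chance of this exact sequence: 1 in {p_sequence.denominator:,}")
print(f"same chance for any other sequence: {p_sequence == p_all_males}")
```

The number is tiny, but it would be equally tiny for whatever sequence we brought back, which is why it cannot be read as a p-value.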

# The p-value salad

So what do we do when we have p-values from all sorts of sources and we want to combine and interpret them sensibly? In fact there are many ways to combine p-values depending on what we actually want to know, and these methods are not mutually exclusive, since each might show a different aspect of our data. Let’s now consider the following flowchart.

A word of warning before continuing: this is the flowchart that **I made and use myself** when combining different p-values, but there are many other methods for combining significance levels that might be more appropriate for a particular problem. Nonetheless, for most situations this flowchart should be more than enough to arrive at sensible results. Let’s now comment on its steps.

**The first question** we have to ask ourselves is whether we expect high p-values to counteract small ones. This would be the case for p-values coming from experiments where, had we all the data from all the experiments, we could merge it into one single experiment.

Since we would like to have the p-value that would be returned if we used all the data in the same experiment, we have to account for the size of the experiments the p-values come from. If the p-values come from experiments having the same weight then the **Z-Method** is appropriate; otherwise the **Weighted Z-Method** will account for the different weights.
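As a rough illustration, here is a minimal sketch of the Z-method (also known as Stouffer’s method) and its weighted version using only the Python standard library. It assumes one-sided p-values strictly between 0 and 1; the function name `z_method` is mine, and taking the square root of each experiment’s sample size as its weight is one common choice, not the only one.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_method(p_values, weights=None):
    """Combine one-sided p-values with the (weighted) Z-method.

    With weights=None every p-value counts the same (plain Z-method).
    p-values must lie strictly between 0 and 1, otherwise inv_cdf raises.
    """
    if weights is None:
        weights = [1.0] * len(p_values)
    # Turn each p-value into a z-score: a small p gives a large positive z,
    # a large p gives a negative z, so they can cancel each other out.
    z_scores = [_STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    # Weighted sum of z-scores, rescaled so it is again standard normal.
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    combined_z = numerator / denominator
    # Back from the combined z-score to a one-sided p-value.
    return 1.0 - _STD_NORMAL.cdf(combined_z)


# Hypothetical example: two experiments with 100 and 25 subjects.
sample_sizes = [100, 25]
print(z_method([0.03, 0.40]))
print(z_method([0.03, 0.40], weights=[sqrt(n) for n in sample_sizes]))
```

Notice how a high p-value pulls the combined result up, which is exactly the “counteracting” behaviour we asked for in the first question.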

The **one tail** question comes immediately after a positive answer to the *same experiment* question. For two-tailed experiments we want to use a Z-method to counterbalance values on opposite sides of the distribution, whereas for one-tailed experiments Fisher is the way to go.

The **correlation** among the variables used to calculate the p-values is only relevant if the p-values do not come from the same experiment (e.g. two p-values from unrelated medical tests that belong to the same person).

If the variables happen to be correlated then we need a more sophisticated method for combining p-values, like **Brown’s Method** and others. Otherwise we can resort to the good old **Fisher’s Method** for calculating the significance of the product of independent probabilities.
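To make the difference between “the product of p-values” and “the significance of that product” concrete, here is a minimal sketch of Fisher’s method in plain Python. It assumes independent p-values in (0, 1]; the function name is mine, the three p-values in the example are made up for illustration (they are not the ones from the trial), and the chi-square tail is computed with the closed form that exists for an even number of degrees of freedom.

```python
from math import exp, log


def fisher_method(p_values):
    """Combined p-value for independent p-values using Fisher's method.

    The statistic -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom when all k null hypotheses are true.
    """
    k = len(p_values)
    half_statistic = -sum(log(p) for p in p_values)  # this is statistic / 2
    # Chi-square survival function for 2k degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_statistic / i
        total += term
    return exp(-half_statistic) * total


# Hypothetical p-values, for illustration only.
p_values = [0.01, 0.01, 0.01]
naive_product = p_values[0] * p_values[1] * p_values[2]

print(f"naive product:     {naive_product:.2e}")
print(f"Fisher's p-value:  {fisher_method(p_values):.2e}")
```

The raw product is always smaller than (or, for a single p-value, equal to) the combined p-value from Fisher’s method, so reading the product as if it were a p-value overstates the evidence.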

As I mentioned before, **these methods are not exclusive**. For example, we might want to know the merged result for a set of p-values but also how significant the independent probabilities are (e.g. 0.9, 0.1 and 0.8, 0.2 have the same “merged” p-value using the Z-method, but 0.9, 0.1 is more significant than 0.8, 0.2 when considered independently with Fisher’s Method).
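The example with those two pairs can be reproduced with a short sketch. It assumes one-sided, independent p-values and uses the two-p-value special cases of both methods (for two p-values, Fisher’s combined p-value reduces to P(1 − ln P), with P the product); the helper names `z_pair` and `fisher_pair` are mine.

```python
from math import log, sqrt
from statistics import NormalDist


def z_pair(p1, p2):
    """Unweighted Z-method for exactly two one-sided p-values."""
    normal = NormalDist()
    z = (normal.inv_cdf(1 - p1) + normal.inv_cdf(1 - p2)) / sqrt(2)
    return 1 - normal.cdf(z)


def fisher_pair(p1, p2):
    """Fisher's method for exactly two independent p-values."""
    product = p1 * p2
    return product * (1 - log(product))


for pair in [(0.9, 0.1), (0.8, 0.2)]:
    print(pair, "Z-method:", round(z_pair(*pair), 3),
          "Fisher:", round(fisher_pair(*pair), 3))
```

Both pairs land on 0.5 with the Z-method because the high and low values cancel out, while Fisher gives roughly 0.31 for 0.9, 0.1 and roughly 0.45 for 0.8, 0.2.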

Finally, once we have properly combined all our p-values, we might still have a bunch of p-values coming from different experiments and we might want to account for the family-wise error coming from multiple comparisons. In this case an appropriate method would be the **Holm-Bonferroni** correction, and if we can guarantee the independence of the p-values then **Hochberg’s Method** would give us more statistical power.
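Here is a minimal sketch of both corrections in plain Python, so the difference in power can be seen on a toy example. The function names, the 0.05 level and the two p-values are my own choices for illustration; each function returns, for every p-value in its original position, whether its null hypothesis is rejected.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Step-down procedure: start at the smallest p-value, stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is kept as well
    return reject


def hochberg(p_values, alpha=0.05):
    """Step-up procedure: start at the largest p-value, stop at the first success."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank in range(m - 1, -1, -1):
        if p_values[order[rank]] <= alpha / (m - rank):
            # This p-value and every smaller one are rejected.
            for idx in order[:rank + 1]:
                reject[idx] = True
            break
    return reject


# Hypothetical p-values from two separate experiments.
p_values = [0.04, 0.03]
print("Holm-Bonferroni:", holm_bonferroni(p_values))
print("Hochberg:       ", hochberg(p_values))
```

Both procedures use the same thresholds; the only difference is the direction in which they walk through the sorted p-values, and that is where Hochberg’s extra power comes from. In this toy case Holm-Bonferroni rejects nothing while Hochberg rejects both.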

# Philosophical Issues

Unfortunately, confusion about how to combine p-values is not rare. For example, we can find weighted versions of Fisher’s method, but some authors consider this approach senseless since the weight has already been accounted for in the test. We can also find authors like M. C. Whitlock, who considers the Weighted Z-method superior to Fisher’s method when, as explained before, the two approaches have different goals.

Mr. Whitlock shows his misunderstanding of Fisher’s method when he complains about the method not being symmetrical when, given its purpose, it cannot possibly be otherwise. Nonetheless, it is hard to blame anyone for this general confusion about combining p-values when even experts hand out life sentences in trials due to their lack of understanding.

I would say this is a scenario where everyone is responsible, and we should start by demanding that faculties giving statistics courses do a better job of training students since, after all, **students do as well as they are taught**, and if the expert who sentenced Lucia had had better training this would never have happened.

Maybe one way to address this problem would be to introduce **philosophy of science courses** in every major dealing with statistics so that students do not just learn a bunch of formulas but also acquire a deep understanding of what they mean and their limitations.

So I will end this post with a piece of advice: if you are a statistician or you work heavily with statistics and, just like myself, missed this kind of philosophical training in your student years, a great place to start filling the gaps would be Ms. Mayo’s blog. There you’ll find many resources to go beyond the formula… **And hopefully not to destroy anyone’s life with bad stats.**
