Objectivity is dead, long live Objectivity!

Are p-values an objective measure? Bayesian statistics are not as objective as Frequentist statistics for the simple reason that they need an extra assumption, namely a prior. This is why even talking about Objective Bayesian statistics is an oxymoron, and yet it seems to be the most popular Bayesian school out there. But anyhow, how about p-values then, can they be subjective? Is there such a thing as Objectivity in statistics?

For a time I thought p-values were an objective measure, but then a couple of blows put to rest my dream of having an objective procedure to deal with uncertainty. This is the story of the Subjectivity one-two combo that knocked my Objectivity dreams out flat…

The first blow against Objectivity

And thus one day I learned about the likelihood principle. Its claim that the same data should drive any researcher to the same inferential decisions seemed reasonable, and it made me doubt the objectivity of p-values for a moment.

So yeah, all right: if you want to show that Frequentist statistics do not satisfy the likelihood principle, you can design two experiments that use the same data and the same significance level and yet yield different inferential results.

So what? Admittedly, the likelihood principle example is a bit contrived and unconvincing, since it is hard to imagine a real situation where two scientists would disagree on the likelihood function to use when analyzing the undocumented data of an experiment left behind by a third scientist.

But even if this were the case, when looked at in detail, the purpose of Frequentist Statistics, or Error Statistics as philosopher of science Deborah Mayo would call it, is to establish a procedure which, when used over and over again, will guarantee an error rate defined by the significance level of our experiments.

So it would be reasonable, and quite objective, for both scientists to agree that neither of them knows how the third scientist designed the experiment and that, therefore, combining both p-values into a single p-value would be warranted.

However, to be honest, it was quite annoying that two frequentist scientists might reach different conclusions with the same data. So, though I found the example peculiar and it made my belief in objectivity stumble, it was still not enough to make me give up on my Objectivity dream.

Second blow & Objectivity Knock Out

However, let’s consider the following example, based on my experience with the analysis of historical data, which knocked out my belief in objectivity. Let’s say we are the team leaders of a group of two scientists and we ask them to investigate whether jelly beans cause acne. Let’s also imagine that our two scientists disagree on how to design the experiment.

John wants to check whether green jelly beans cause acne since, in his scientific opinion, green is the only color that makes sense to check. Mary agrees that green jelly beans might cause acne, but she also believes that there are nineteen other jelly bean colors that might cause acne too. So both scientists come to us and, as project leaders, we decide to give them the freedom to design their experiments independently.

Mary, however, does not want to waste money on a separate experiment for green jelly beans, and tells John that she will replicate his exact experimental design for the other colors so that she can easily reuse the data coming from his experiment with green jelly beans.

They both agree on running their experiments at a significance level (alpha) of 0.05, and they obtain the following p-values for each individual color experiment:


John:  0.01 < 0.05
Mary:  0.35, 0.01, 0.04, 0.12, …, 0.65, with 0.01 being the smallest p-value. With the Bonferroni correction, all her p-values are well above 0.05/20 = 0.0025 (the threshold that controls the family-wise error rate, FWER).

So John is quite excited about his results and rushes to present them to us, claiming they clearly show that green jelly beans are linked to acne (0.01 < 0.05).

Mary is also quite happy because she believes she has strong evidence against jelly beans, green or otherwise, being linked to acne (0.01 > 0.0025, FWER).
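To make the disagreement concrete, here is a minimal sketch of the two decision rules applied to the very same green jelly bean p-value. The variable names are mine and purely illustrative; the numbers are the ones quoted in the example.

```python
# Sketch of the two decision rules applied to the same green-jelly-bean p-value.
ALPHA = 0.05       # significance level both scientists agreed on
N_COLORS = 20      # green plus the nineteen other colors Mary tests

p_green = 0.01     # p-value of the green experiment, shared by both analyses

# John runs a single test, so he compares against alpha directly.
john_rejects = p_green < ALPHA

# Mary runs twenty tests, so Bonferroni divides alpha by the number of tests.
bonferroni_threshold = ALPHA / N_COLORS
mary_rejects = p_green < bonferroni_threshold

print(f"John: {p_green} < {ALPHA} -> reject null: {john_rejects}")
print(f"Mary: {p_green} < {bonferroni_threshold:.4f} -> reject null: {mary_rejects}")
```

Same number, same data, opposite decisions: the only thing that changed is how many questions each scientist decided to ask.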

So what do we do now? They are both right: John’s result is well below his significance level and Mary’s is well above hers, even though they are both looking at the same data for the green jelly beans.

The fact that this kind of situation is very common when analyzing already existing data, and is not just an artifact built to prove a point about likelihood, made my belief in Objectivity hit the ground… So that’s it.

Objectivity is dead, long live Objectivity!

Okay, both scientists will keep their respective, subjective error rates, no problem there. But, as project leaders, we now have to make decisions. We could do a number of things to keep our error rates as a team intact:

  1. We could combine the p-values from both experiments (My choice)
  2. We could flip a coin to choose between John & Mary
  3. We could favor one scientist’s approach over the other’s before we see the results

These options might not make John & Mary happy, since they both stand behind their results. However, any of them would keep our team error rates unaltered, even though all these options are as subjective as the choices John & Mary made.
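As a rough check of why the choice has to be made before seeing the results, the sketch below computes the team's false positive rate under a few policies. It assumes that no color has any effect (the global null) and that the twenty color experiments are independent; both are simplifying assumptions of mine, not part of the story above. Option 1 (combining p-values) is not modelled, since it depends on the combination method chosen.

```python
# Team false-positive rates under the global null, assuming independent tests.
ALPHA = 0.05
N_COLORS = 20
bonf = ALPHA / N_COLORS  # Mary's per-test Bonferroni threshold

# Under the null each p-value is uniform on [0, 1].
john_rate = ALPHA                        # P(green p-value < alpha)
mary_rate = 1 - (1 - bonf) ** N_COLORS   # P(any of the 20 p-values < alpha / 20)

# Option 2: a fair coin decides whose analysis counts, before seeing results.
coin_flip_rate = 0.5 * john_rate + 0.5 * mary_rate

# Post-hoc alternative: the team announces a finding if EITHER rule rejects.
# No rejection needs green p >= alpha and the other 19 p-values >= alpha / 20.
either_rate = 1 - (1 - ALPHA) * (1 - bonf) ** (N_COLORS - 1)

print(f"John alone:        {john_rate:.4f}")
print(f"Mary alone:        {mary_rate:.4f}")
print(f"Coin flip:         {coin_flip_rate:.4f}")
print(f"Either one counts: {either_rate:.4f}")
```

Under these assumptions, committing to John, to Mary, or to a coin flip keeps the team at or below 0.05, while letting whichever analysis "found something" speak for the team pushes the rate above it.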

Unfortunately, we do not always have the luxury of repeating an experiment over and over again to see whether the results for green jelly beans really hold. Sometimes all we have is one shot and we need to make the most of it (e.g. the cosmic microwave background), and it seems only one person is allowed to aim and pull the trigger.

Take another example: the “God particle” experiment at the LHC. Scientist A might ask to look at energies only in the range around 125 GeV because his theory says so, whereas Scientist B wants to look at the ranges around 125, 135 & 145 GeV because his theory says so too.

When it turns out that the particle is at around 125.09 GeV with 5 sigma, Scientist A has already achieved the proof he needs, whereas Scientist B needs to keep smashing particles to achieve the same significance, since he is checking a wider range of energies.
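To get a feeling for how much more Scientist B needs, here is a small sketch. It assumes, purely for illustration, that B splits his error budget evenly over his three energy ranges (a Bonferroni-style correction) and that "5 sigma" refers to a one-sided Gaussian tail; real analyses handle this look-elsewhere effect in more refined ways.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# One-sided tail probability of a 5 sigma excess (Scientist A, one energy range).
p_five_sigma = std_normal.cdf(-5.0)

# Assumption: Scientist B applies a Bonferroni-style split over three ranges.
n_ranges = 3
p_per_range = p_five_sigma / n_ranges

# Local significance B needs in any single range to match A's overall level.
sigma_needed_b = -std_normal.inv_cdf(p_per_range)

print(f"A's threshold p-value: {p_five_sigma:.3e}")
print(f"B's per-range p-value: {p_per_range:.3e}")
print(f"B needs about {sigma_needed_b:.2f} sigma locally")
```

Under these assumptions B needs a bit more than 5 sigma in a single range, which in practice means more collisions for the same claim.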

We could potentially recreate this situation with any other experiment so, yeah, Objectivity is so dead. No point in crying about it any longer, but this is just my subjective opinion…

11 thoughts on “Objectivity is dead, long live Objectivity!”

  1. Hi again Fran, your posts touch various things in such a manner that i like to post a couple of comments once in a while (if this is not a burden of course).

    Well, i could agree with the post, that’s why i tend to frequentist approaches!

    Having said that, one thing needs further clarification.

    That is uncertainty vs estimation. Or in other words, probability as measure vs probability as possibility (as already mentioned in another comment on your About page)

    Frequentist approach (as far as i can tell, or at least what i refer to as frequentist) deals with the first part. That is, probability as an (invariant) measure on a suitable space, related to symmetries of a process (or frequencies if you like, yes E. Jaynes i fully agree here).

    (“what exactly does it mean to be 1/3 pregnant?”, a usual Greek reply, when probabilities seem meaningless in context)

    i could say that bayesian statistics can take (and in some cases at least, indeed does take) the same approach, by an appropriate reinterpretation of the symbols used.

    (who said there is such a thing as a frequentist vs bayesian schism? 🙂 )

    The other cases (probability as possibility) are, to be precise, simply uncomputable (after all that is what uncertainty and true randomness mean in the final analysis). So in this case, whatever one would follow or call it, is simply useless (or meaningless if you like)

  2. Furthermore, after reading the post again, let me point to another thing. As mentioned, people sometimes (or many times), take the data out of context or in other words talk about different experiments. This is subtle but important.

    Some time ago i posted something on mathoverflow which was heavily downvoted, but actually stated the obvious, about “coin” experiments.

    Let me summarise here.

    The question (which was also mine) asked for a probability of the next throw, after having observed $N$ previous throws of the same “coin” and how fair or not it was. It also involved concepts of entropy and so on..

    Most people (“correctly” by their textbook education) noted that since throws are independent, the next throw should not be affected by having observed $N$ previous throws, the coin has “no memory” and so on..

    Not surprisingly, the question (and my own personal answer since i was dissatisfied with the other answers) were heavily downvoted (something that happens in these sites for various reasons least of which is correctness or not of the actual content).

    Yet these same people could not explain why entropy (meaning, for example assuming a fair “coin”, sequences which increasingly demonstrate unequal probabilities are NOT observed) is in-effect.
    Nor where exactly does this start to take effect or not (in fact also gave a small proof of that myself and then related to certain aspects of physics albeit quite summarily).
    The answer to this is that the students (and the textbook literature, as far as i can tell thus far) conflate $3$ different experiments:

    A. Throw one and the same “coin” $N$ times
    B. Throw $N$ similar “coins”, $1$ time each
    C. Repeat experiment B $N$ times

    The most usual conflation is between A and B (which are then related employing various LLN-like theorems through experiments C). Yet they do not see at least one simple thing! That throwing the SAME “coin” $N$ times, one can make an increasingly significant estimation of the fairness (or not) of that “coin”. While this is NOT the case for experiment B (an increasing sequence of $N$ tails, for example, is perfectly fine for $N$ fair “coins” thrown once each), it is severely problematic for one and the same fair “coin” of A. Whether one would like to call that a memory of the coin, a symmetry of the coin, a compatibility condition of the coin (for example entropy invariant), is NOT of the essence.

    (for “coin”, substitute anything you like, e.g. “dice”, or some other process)

    The takeaway (apart from my little story) is that the context and setup of experiments ARE part of the inference process

    • Not surprisingly, the question (and my own personal answer since i was dissatisfied with the other answers) were heavily downvoted

      Do not worry much about that, stackexchange and other similar sites can have quite aggressive / ignorant moderators and you can only see the world through their narrow minds. Long ago I decided not to participate in forums where downvoting or censorship is allowed, which pretty much means all of them :D. Funny enough, Facebook (not a fan though for other reasons) has it right; you can only upvote and if you don’t like someone’s answer you just write your own or move on.

      Truth is not a democracy 😉

      • To make a small correction, it was mathematics.stackexchange and not mathoverflow (there are two sites for math on the stackexchange network, i was in both). i just related it to stackoverflow which is more known/popular, but better correct that

        In one post of mine i also referenced your site since some content i was referring was here, so there is a chance you get some traffic from there as well.

        i’m not a member since some time now (after making serious criticism of the thing)
        i try to contribute as far as i can through various channels, that being one of them, and since i myself, in a number of cases, did find some good information on stackoverflow (mostly), it was only fair to contribute a bit. i guess after all, it wasn’t worth it! And i don’t mind downvotes or criticism as long as it is to the point and not speechless

        i left another comment here, discussing the actual experiments you describe, it was waiting for moderation, should i re-post it?

        • Oh no need to re-post, I just saw it now. It seems WP has some sort of moderation after N messages or something.

          Anyway, I believe that downvoting is a problem because it is too close to censorship and, in fact, a few downvotes in stackoverflow and similar sites allow admins to precisely do so by removing your contribution.

          But even if they don’t remove it, imagine you have 1 million upvotes and 1 million downvotes: obviously your contribution is hugely controversial and highly interesting, but according to these sites your contribution has a score of zero!… Sorry but nope, not playing the game.

  3. Discussing the examples used in the actual post.

    The apparent contradiction or apparent confusion can be resolved by understanding that we are talking about different experiments
    (which can make a difference).

    After a small lookup of what Bonferroni correction means, it becomes even clearer.
    The compensation factor is assigned uniformly, i.e. $\frac{1}{m}$ for a combination of $m$ experiments into one (i guess this is following Laplace’s maxim of assigning a-priori weights). But this is exactly what i mentioned in the very first comment in this post. That the principle of insufficient reason (uniform a-priori assignment), is NOT that. The assignment should be based on the actual symmetries of the process (a-la Jaynes), and in fact this is exactly where Laplace’s principle as usually stated (uniform distribution) gives correct results when the underlying symmetries support it.

    For example, falling from a tall building, there are roughly two outcomes, “live or die”. But the symmetries of the process do NOT lead to assigning $1/2$ probability to each one. This is one point.

    The second point here is the actual fact of a different experiment. The experiment of Mary is an experiment about the whole set of jelly beans (or at least a large part of it), while the experiment of John is about one kind of jelly bean. So what the results say is not the same. The first p-value is about one type of jelly beans, while the second is about the whole group of jelly beans and, of course, DOES NOT contradict the fact that although the group as a whole has low p-value, one type from the group CAN have a quite large p-value (similar to the “regression to the mean” phenomenon). They measure different things. Representative samples are meaningful to discuss.

    As for a policy for combining the apparently contradicting experiments, it is not needed! 🙂

  4. The compensation factor is assigned uniformly, i.e. $\frac{1}{m}$ for a combination of $m$ experiments into one (i guess this is following Laplace’s maxim of assigning a-priori weights). But this is exactly what i mentioned in the very first comment in this post. That the principle of insufficient reason (uniform a-priori assignment), is NOT that. The assignment should be based on the actual symmetries of the process (a-la Jaynes), and in fact this is exactly where Laplace’s principle as usually stated (uniform distribution) gives correct results when the underlying symmetries support it.

    Frequentist statistics is all about controlling error rates; the Bonferroni procedure simply guarantees a family-wise error rate for a given set of assumptions. You might have other non-linear approaches, like the Šidák correction, under a more restrictive set.
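For reference, a quick sketch comparing the two per-test thresholds for the numbers in the post (alpha = 0.05, twenty tests). The Šidák formula used here is the standard one, which assumes independent tests; the variable names are just illustrative.

```python
# Per-test thresholds that keep the family-wise error rate at ALPHA over M tests.
ALPHA = 0.05
M = 20

# Bonferroni: linear split of alpha, valid without assuming independence.
bonferroni = ALPHA / M

# Sidak: exact under independence, solves 1 - (1 - t) ** M = ALPHA for t.
sidak = 1 - (1 - ALPHA) ** (1 / M)

print(f"Bonferroni threshold: {bonferroni:.6f}")
print(f"Sidak threshold:      {sidak:.6f}")
```

The Šidák threshold is only slightly more lenient, so Mary's 0.01 would still fall short under either correction.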

    You made an interesting and intriguing connection here: that Bonferroni might ultimately be using a Bayesian principle. But in these philosophical matters about probability I tend to be careful, other than mocking my religious Bayesian friends 😉 , and especially so after reading Henri Poincaré’s “La Science et L’hypothèse”, where Poincaré, a giant of Mathematics and Philosophy of science, happens to be very modest and humble and declines to take a position in matters of probability other than simply listing a few problems he believes we need to consider when working with it.

    As for a policy for combining the apparently contradicting experiments, it is not needed!:)

    Good luck trying to convince the FDA about that 😉

    • Exactly, i do make such a connection, and i think maybe Bonferroni himself (or whoever suggested that) might have had this in mind as well.

      In any case, i make the following observation (following Jaynes), that what are called priors (bayesian framework) or compensation weights, SHOULD be based on objective conditions of the process under study, that is what Jaynes refers to as “symmetries”.

      These symmetries are part of the process and are invariant, relating probability to measure over invariant spaces (invariant measures). And it is a fact that given a process these symmetries are indeed known in advance, so they can be used as “objective” priors and this is the “true meaning” of the Laplacian principle of a-priori weights.

      In other words i follow Jaynes (who i highly admire) but into a frequentist framework

      • And what would Jaynes say if the problem has no symmetries or it is not fair to establish one? What prior or principle would Jaynes use?

        • Ha, ha, that’s a nice subject for a next post!

          In the meantime, i will just quote and reference a couple of papers on this

          Prior Probabilities, Edwin T. Jaynes


          […]However, an ambiguity remains in setting up a prior on a continuous parameter space because
          the results lack invariance under a change of parameter; thus a further principle is needed.

          It is shown that in many problems, including some of the most important in practice, this
          ambiguity can be removed by applying methods of group theoretical reasoning which have
          long been used in theoretical physics…

          The introduction of symmetry constraints within MaxEnt Jaynes’s methodology


          We provide a generalization of the approach to geometric probability advanced by the great
          mathematician Gian Carlo Rota, in order to apply it to generalized probabilistic physical theories.
          In particular, we use this generalization to provide an improvement of the Jaynes’ MaxEnt method.
          The improvement consists in providing a framework for the introduction of symmetry constrains.
          This allows us to include group theory within MaxEnt. Some examples are provided

        • For example, to relate to the current post. Consider what it would mean if the green jelly beans made up 98% of the whole population of jelly beans, while the remaining kinds of beans made up the other 2%. Or if each kind of jelly bean had an equal share, 1/N (N different kinds of jelly beans), of the whole population.
