Talk:Lindley's paradox
This article is rated Start-class on Wikipedia's content assessment scale.
question
Can someone give some concrete examples demonstrating this paradox? I find the article a bit difficult to understand. -- The preceding unsigned comment was added by 24.227.161.246 (talk • contribs) 2007-05-26T02:37:18.
- Probably because the article IS very difficult to understand. It's almost all formulas with very little English explanation of them. Statalyzer (talk) 18:35, 20 May 2012 (UTC)
- There are two different ways of interpreting the results of a statistical experiment, the so-called "frequentist" and "Bayesian" approaches. These two ordinarily produce similar results in practice. In some settings, however, the two approaches produce completely opposite results, to the point where one of them would give a strong answer in favor of the hypothesis while the other would give a strong answer against it. CMummert · talk 12:12, 26 May 2007 (UTC)
Better example?
I feel like the gravity/airplane example does not really demonstrate Lindley's paradox. It is stated that the prior weakly supports the null hypothesis, and that the evidence causes opposite Bayesian and frequentist conclusions. Neither of these occurs in the gravity example, where the prior on H0 is strong, and where the evidence reduces our belief in both paradigms (just to different extents -- in the example, the frequentist rejects it, while the Bayesian just barely reduces his belief). But that seems to miss the underlying paradox. Could we have a better example?
Maybe someone could flesh out an example like the following. Suppose theoretical physicists have two competing "theories" of unification - call them string theory and the loop model (I'm making this up - don't take this to be an accurate characterization of actual physics). Suppose the null hypothesis H0 is that the loop model is correct, with prior probability p(H0)=0.001 (i.e., the odds of it being correct are small). String theory, however, is actually a class of theories, rather than a single theory, and in fact, let's assume that it is an infinite class of theories (i.e., there are some parameters, and for a particular choice of parameters, you end up with one possible instantiation of a string theory). So we have a probability density over all these potential string theories, with no single instantiation having a strictly positive probability.
Now, we make some observation. It could be any observation with a non-zero likelihood under each theory, i.e., p(x|H)>0 for H=loop and for H=any string instantiation, but also one that is significant against H0 at the 5% level (written loosely as p(x|H0)<5%). The frequentist rejects H0 -- concludes that loop theory is false. Now, as the entropy of the prior (over the string theories) increases, the posterior p(H0|x) gets arbitrarily close to 1. So the Bayesian becomes more inclined to accept H0 (loop theory). I think this is the essence of the paradox -- the observation (and it doesn't really matter what the observation is) causes the Bayesian to become very confident the null hypothesis is true (even with a very small prior), while the frequentist sees the observation as a reason to reject it. Someone needs to flesh out this, or some other example, a little more cleanly and make it understandable to the wiki audience before it would be suitable for the wiki page, but I propose that an example along those lines would demonstrate the essence of the paradox better than the current example.
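The effect described above can be sketched numerically. The model below is an assumption chosen for illustration, not the physics example itself: a single standardized observation z with z | θ ~ N(θ, 1), the point null H0: θ = 0 with prior 0.001, and an alternative H1 that spreads θ as N(0, τ²). The names `posterior_h0` and `tau` are hypothetical.

```python
import math

def normal_pdf(x, var):
    """Density of a zero-mean normal with variance `var`, evaluated at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_h0(z, prior_h0, tau):
    """P(H0 | z) for H0: theta = 0 versus H1: theta ~ N(0, tau^2), with z | theta ~ N(theta, 1)."""
    like_h0 = normal_pdf(z, 1.0)
    # Under H1 the marginal distribution of z is N(0, 1 + tau^2).
    like_h1 = normal_pdf(z, 1.0 + tau * tau)
    numerator = prior_h0 * like_h0
    return numerator / (numerator + (1.0 - prior_h0) * like_h1)

z = 2.0                                       # "significant" observation
two_sided_p = math.erfc(z / math.sqrt(2.0))   # frequentist p-value under H0
taus = [10.0, 1e3, 1e5, 1e8]                  # increasingly diffuse alternatives
posteriors = [posterior_h0(z, 0.001, tau) for tau in taus]

print(f"two-sided p-value: {two_sided_p:.4f}")
for tau, post in zip(taus, posteriors):
    print(f"tau = {tau:g}: P(H0 | z) = {post:.5f}")
```

In this sketch the p-value stays fixed below 5% while the posterior of H0 climbs toward 1 as τ grows, even though H0 starts with prior 0.001.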
- The current example is also not great. It assumes that "We have no reason to believe that the proportion of male births should be different from 0.5", whereas actually any reproductive biologist worth his or her salt knows that the proportion of male births in humans is not 0.5. It's generally higher. Bondegezou (talk) 14:26, 29 November 2010 (UTC)
The paper www.dmi.unipg.it/mamone/sci-dem/nuocontri/lad2.pdf (Frank Lad) contains an example where the frequentist approach keeps H0, while the Bayesian approach rejects H0.
How do both results come together?
89.204.152.52 (talk) 01:11, 22 November 2011 (UTC)
Gobbledygook
"Lindley's paradox describes a counterintuitive situation in statistics in which the Bayesian and frequentist approaches to a hypothesis testing problem give opposite results for certain choices of the prior distribution." Unless you already understand it, this is singularly unhelpful. TREKphiler hit me ♠ 05:44, 22 October 2008 (UTC)
- I find the lede to be very well phrased, particularly given its links to statistics, Bayesian inference, frequentist, hypothesis testing, and prior distribution. -- 110.49.227.102 (talk) 16:16, 1 November 2011 (UTC)
Similarity to False Positive paradox?
Isn't this roughly the same thing as the false positive paradox, or possibly a generalization of it? As in:
A certain diagnostic test has 95% accuracy. Suppose you test positive. The result is statistically significant at the 5% level, and you would reject the null hypothesis that you do not have the disease.
However, suppose the incidence of the disease is 1%. (Here, disease incidence is the prior distribution.) Then the probability of a true positive (that you have the disease and test positive) is (.01 x .95)=.0095. Similarly, the probability of a false positive (that you don't have the disease and test positive) is (.99 x .05)=.0495. Thus, the conditional probability of having the disease given a positive result is only 16.1% (.0095/(.0095+.0495)=.161), not 95%. Thus the Bayesian analysis would say the posterior probability of the null hypothesis is 83.9%, indicating it is likely to be true.
I'm not familiar with Lindley's paradox, so I don't know if my 'translation' is apt, but the two at least seem characteristically similar. —Preceding unsigned comment added by 71.232.22.81 (talk) 20:57, 12 December 2009 (UTC)
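The arithmetic in the comment above can be checked with a few lines. This is a minimal sketch that reads "95% accuracy" as 95% sensitivity and a 5% false positive rate, which is an assumption; the variable names are illustrative.

```python
prevalence = 0.01            # prior probability of disease
sensitivity = 0.95           # P(positive | disease), assumed from "95% accuracy"
false_positive_rate = 0.05   # P(positive | no disease), assumed from "95% accuracy"

p_true_positive = prevalence * sensitivity
p_false_positive = (1.0 - prevalence) * false_positive_rate

# Bayes' theorem: condition on having observed a positive test.
p_disease_given_positive = p_true_positive / (p_true_positive + p_false_positive)
p_healthy_given_positive = 1.0 - p_disease_given_positive

print(f"P(disease | positive)    = {p_disease_given_positive:.3f}")
print(f"P(no disease | positive) = {p_healthy_given_positive:.3f}")
```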
>Isn't this roughly the same thing as the false positive paradox?
No. The paradox only applies when a hypothesis test is done on a sharp hypothesis, i.e. that some parameter equals a specific value exactly. (This is often an absurd thing to assume, but sometimes it isn't.) A sensible prior is then a single blob of probability on that value, plus a distribution on the other possible values. It's thus a prior with two parts, a point plus a continuous distribution. The Bayesian approach works fine, but the classical approach (which Bayesians would say is unrigorous anyway) goes gaga. It can be that the Bayesian approach is (correctly) telling you that the null has high posterior probability while the classical hypothesis test is telling you that the null is highly unlikely to be true. One more reason to be a Bayesian. :) Blaise (talk) 17:32, 30 January 2010 (UTC)
- That's not how I understand the paradox. In my mind, the problem is more in the Bayesian analysis in moving from "not wanting to favor H_0 or H_1" to the choice of a prior with P(H_0) = 0.5. In the case that H_0 is a point distribution for a parameter, and H_1 a more diffuse distribution, this ends up producing a strange prior on the parameter, with half the probability mass at a point, which obviously dramatically favors H_0 over any possible value of H_1.
- In the sex-ratio of births, the frequentist result from the so-called "paradox" is actually correct, in that there are significantly more boys born than girls. If one uses a reasonable prior in the Bayesian analysis, or a more careful phrasing in either framework, e.g., compare H_a = more boys than girls vs. H_b = more girls than boys, then the answers will almost always agree from the Bayesian and Frequentist approaches. 195.220.100.11 (talk) 09:56, 23 April 2012 (UTC)
- (I'm the author of the previous comment). I rewrote most of the article, and along the way realized that the real issue is that
- The frequentist tests H_0, finds it to be a poor explanation for the observation, and rejects.
- The Bayesian tests H_0 vs. H_1, and finds H_0 to be a better explanation than H_1, significantly so.
- So it's not really a paradox at all. The Frequentist says "H_0 is bad" and the Bayesian says "H_0 is better than H_1". They're not in disagreement. ThouisJones (talk) 08:01, 25 April 2012 (UTC)
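The two statements can be put side by side for the birth counts quoted elsewhere on this page (49,581 boys and 48,870 girls). This is a rough sketch, assuming equal prior weight on H_0 and H_1 and a uniform prior on θ under H_1; the p-value uses the normal approximation, and the helper name `binom_pmf_half` is hypothetical.

```python
import math

def binom_pmf_half(k, n):
    """P(K = k) for K ~ Binomial(n, 0.5), computed in log space for large n."""
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + n * math.log(0.5))
    return math.exp(log_p)

boys, girls = 49581, 48870
n, k = boys + girls, boys

# Frequentist view: how surprising is k under H0 alone? (two-sided, normal approximation)
mu = 0.5 * n
sigma = math.sqrt(0.25 * n)
z = (k - mu) / sigma
p_value = math.erfc(abs(z) / math.sqrt(2.0))

# Bayesian view: how well does H0 explain k compared with H1?
like_h0 = binom_pmf_half(k, n)
like_h1 = 1.0 / (n + 1)      # binomial likelihood averaged over a uniform prior on theta
posterior_h0 = like_h0 / (like_h0 + like_h1)   # equal prior weight on H0 and H1

print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")
print(f"P(H0 | k) = {posterior_h0:.4f}")
```

The p-value is small because H_0 alone fits poorly; the posterior is high because the diffuse H_1 fits even worse on average.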
error in description
User:Michael Hardy was correct in his original text in the "Description of the paradox" section; oddly, User:66.127.52.57 altered it so it was backwards. I've now restored the original text that Hardy had. 128.112.21.128 (talk) 20:10, 24 September 2010 (UTC)
The lack of an actual paradox - Section is Unclear
A lot of numbers are thrown around with little to back them up. I am having some issues understanding how these conclusions were reached.
"For example, this choice of hypotheses and prior probabilities implies the statement: "if θ > 0.49 and θ < 0.51, then the prior probability of θ being exactly 0.5 is 0.50/0.51 ≈ 98%." Given such a strong preference for θ=0.5, it is easy to see why the Bayesian approach favors H_0 in the face of x≈0.4964, even though the observed value of x lies 2.28σ away from 0.5."
Why choose "θ > 0.49 and θ < 0.51"? Why not use 0.25 - 0.75 as the range? Where did they get x = 0.4964; wasn't x = 0.5036? How did they get that this was 2.28σ away from 0.5? BYS2 (talk) 11:50, 23 July 2012 (UTC)
- You can choose any range you want. Range 0.49 .. 0.51 is useful to demonstrate that the prior distribution, though seemingly fair, actually has strong bias towards 0.50.
- 0.4964 looks like a mistake. Regardless, 0.5036 and 0.4964 are equidistant from 0.5.
- Using the notation currently in the article, μ = nθ = 49225.5, σ² = nθ(1−θ) = 24612.75, and σ = 156.88. The observed number of boys is 49581, which is (49581-49225.5)/156.88 = 2.27 sigma away from the count expected under θ = 0.5.--Itinerant1 (talk) 12:59, 15 September 2012 (UTC)
There's still a problem with the claim "prior probability of θ being exactly 0.5 is 0.50/0.51 ≈ 98%." How in any way does the ratio of a value of θ to the upper limit of its range equate to its prior probability? By this reasoning, the prior probability of θ being exactly 0.4999 is 0.4999/0.51 ≈ 98%, also. Etc. 131.225.23.168 (talk) 18:05, 24 March 2014 (UTC)
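One reading of the quoted 0.50/0.51 figure, sketched here as an assumption about what the article intended: it is a conditional prior probability, not a ratio of θ to the top of its range. The prior puts mass 0.5 on θ = 0.5 and spreads the other 0.5 uniformly over [0, 1], so:

```latex
\begin{align*}
P(0.49 < \theta < 0.51) &= 0.5 + 0.5 \times (0.51 - 0.49) = 0.51 \\
P(\theta = 0.5 \mid 0.49 < \theta < 0.51) &= \frac{0.5}{0.51} \approx 0.98
\end{align*}
```

Under the same prior, any other single value such as θ = 0.4999 has prior probability 0, so the 98% figure does not carry over to it.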
Another problem: "Looking at it another way, we can see that the prior is essentially flat with a delta function at theta = 0.5. Clearly this is dubious. In fact if you were to picture real numbers as being continuous, then it would be more logical to assume that it would impossible for any given number to be exactly the parameter value, i.e., we should assume P(theta = 0.5) = 0." If this is the case then the null hypothesis is a-priori false and no data can change our minds about it, so why are we doing a hypothesis test?
Continuing from the above: "For example, if we replace H_1 with H_2: θ = x, i.e., the maximum likelihood estimate for θ, the posterior probability of H_0 would be only 0.07 compared to 0.93 for H_2". If our priors are the same as the original prior this is just plain wrong: H_2 has a prior volume of zero so H_0 must always win. If the writer was proposing that H_2 be given half the initial prior then this is also ridiculous, because H_2, being the maximum likelihood estimate, cannot be known before looking at the data, so it obviously cannot have a prior probability of 0.5. This whole section needs serious changes. (Bjfar (talk) 23:22, 27 November 2012 (UTC))
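For reference, the 0.07 versus 0.93 figures in the quoted sentence can be reproduced if H_2 is given prior probability 0.5, which is exactly the assumption criticized above. A minimal sketch using the birth counts quoted on this page (49,581 boys out of 98,451 births); `binom_pmf` is a hypothetical helper.

```python
import math

def binom_pmf(k, n, theta):
    """Binomial probability of k successes in n trials, computed in log space."""
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + k * math.log(theta) + (n - k) * math.log(1.0 - theta))
    return math.exp(log_p)

n, k = 98451, 49581

like_h0 = binom_pmf(k, n, 0.5)      # H0: theta = 0.5
like_h2 = binom_pmf(k, n, k / n)    # H2: theta = k/n, the MLE (only known after seeing the data)

# Assumed for illustration only: equal prior weight on H0 and H2.
post_h0 = like_h0 / (like_h0 + like_h2)
post_h2 = like_h2 / (like_h0 + like_h2)

print(f"P(H0 | k) = {post_h0:.3f}, P(H2 | k) = {post_h2:.3f}")
```

The calculation is mechanical; the objection stands that a prior built from the MLE is not a prior at all.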
A Ghastly "Flat Priors" Bayesian Analysis
My take on this is that the so-called "flat" or "uninformative" or "ignorance" Bayesian prior, as a concept, is already somewhat absurd, in general (my hunch is that Bayes himself would, quite likely, have ridiculed it, or at least rejected all personal association with it). But here what purports to be a standard Bayesian "ignorance prior" analysis is even worse than usual: One is testing a "simple" hypothesis H_0 against a composite alternative H_1 (in this case being the logical complement of H_0). To assign prior probability 0.5 to each -- merely because we have decided to slice up the world of possibilities in this or that particular way in the process of formulating formal test hypotheses -- is to me intuitively very strange (though unfortunately not so unusual). But the further step of assuming that within the composite parameter region we should further apply this "flat" or "uninformative" paradigm a second time in a different way, i.e. now in order to further distribute the supposedly already "flatly" assigned 0.5 (i.e. to allocate the 0.5 amount of prior belief now "uniformly" within H_1) is insane. One could of course obtain any prior at all in this manner, by repeatedly repartitioning chunks of the universe and invoking varying concepts of flatness each time.
The main point that this silliness illustrates is that so called "flat" priors are always only "flat" w.r.t. a particular underlying measure on a particular formal "universal set" representation of the world of possibilities (atomic down to some fairly arbitrarily defined level of fine-ness of discrimination). If you chop and choose arbitrarily when selecting this set, and then this measure upon it w.r.t. which is defined the latest version of what constitutes flatness/indifference/ignorance, then you can find pretty much any conclusion you like, if you stay at it long enough. In doing this you have parted completely from Bayes' original intuition, in my view. — Preceding unsigned comment added by 138.40.68.40 (talk) 21:39, 13 September 2012 (UTC)
- The flat priors may not be great in this example, but the division between theta=0.5 and the rest is not unreasonable, if taken at face value. In the current case the priors used effectively claim that we have some a-priori reason to expect that theta=0.5 is special, and are indifferent about all other possible values of theta. This is probably a stupid prior because the fact that we are doing the test implies that we think any departure of theta from 0.5 is probably small -- i.e. probably we do not actually think a-priori that theta=0.51 is as likely as theta=0.99 if we place so much prior on theta=0.5 -- but the delta function bit is not such a serious problem, and indeed is necessary if the theta="exactly 0.5" point hypothesis is to have non-zero probability. If you argue that it does have zero probability then this whole null-hypothesis test makes no sense, since you are claiming that the null hypothesis is false to start with. (Bjfar (talk) 23:40, 27 November 2012 (UTC))
Flawed Null Hypothesis
It is fair to ask whether more boys than girls really are born, as the observation suggests. But no statistics can ever give us the answer whether theta=0.5 (exactly) or not. Not being an expert in statistics, I find it somewhat amusing to watch experts seriously discussing this obviously flawed null hypothesis. Clearly, we can only ask if theta lies within an interval [0.5-d, 0.5+d] for some value of d and some level of confidence. Similarly, we can never decide statistically whether a die is perfectly fair or not. Experiments will tell whether the tested die is fit for a certain use or not – probably... H.J. Martens 84.227.98.31 (talk) 09:47, 3 February 2019 (UTC)
- It is just a numerical example. And please add new topics at the end of the page, not somewhere in the middle. --mfb (talk) 11:59, 3 February 2019 (UTC)
Question about example
I'm confused about the worked example. It says that H_1 is going to be the hypothesis θ ≠ 0.5. But the calculations make it look like you're using a different hypothesis, "θ is uniformly chosen from [0,1]". Am I misunderstanding this, or is there an error in the example? -- Creidieki 02:38, 13 November 2012 (UTC)
- Do you refer to the statement "Under H_1, we choose θ randomly from anywhere within 0 to 1, and ask the same question."? If so, your confusion is well founded: this is a horrible way to explain what is going on in the Bayesian reasoning process. It is clearly a frequentist way of thinking about the problem, not a Bayesian way. The math works out the same, but the philosophy is not this at all. The text should be changed. To the Bayesian, it is our belief in the different hypotheses that is distributed uniformly (there is no random variable for θ) and we obtain the performance of the compound hypothesis as effectively an average across the sub-hypotheses, weighted according to how a-priori plausible each sub-hypothesis is, in accordance with the rules of probability. (Bjfar (talk) 02:45, 28 November 2012 (UTC))
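The "average across the sub-hypotheses" reading can be checked numerically on a small case. The sketch below uses hypothetical values n = 10 and k = 3, not the article's data, and averages the binomial likelihood over an evenly spaced grid of θ values, which corresponds to a uniform prior weight on each sub-hypothesis.

```python
import math

def binom_pmf(k, n, theta):
    """Probability of k successes in n trials if the success probability is theta."""
    return math.comb(n, k) * theta**k * (1.0 - theta)**(n - k)

def averaged_likelihood(k, n, grid_size=10000):
    """Average P(k | theta) over an even grid of theta in (0, 1), i.e. a midpoint-rule integral."""
    total = 0.0
    for i in range(grid_size):
        theta = (i + 0.5) / grid_size   # midpoint of the i-th cell
        total += binom_pmf(k, n, theta)
    return total / grid_size

n, k = 10, 3   # hypothetical small example
avg = averaged_likelihood(k, n)
print(f"averaged likelihood = {avg:.6f}, 1/(n+1) = {1.0 / (n + 1):.6f}")
```

No θ is drawn at random anywhere in this calculation; the grid only weights each candidate value of θ equally.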
Numerical example is unreasonable
The way the gender ratio at birth example is formulated, the Bayesian calculation is wrong. The probability of the parameter being 0.5 is exactly zero. The probability of the parameter being "not 0.5" is one. This is prior to the observation, and the observation cannot change this.
A priori, the parameter's value is a real number in the range [0, 1], endpoints included. This is an uncountable set. (As is typical with Wikipedia, the article about uncountable sets fails to tell at the outset that the concept is a second degree of infiniteness, "countable" being an ellipsis for "countably infinite".)
There is no reason to assume that the value of the parameter should be any particular real number. There is no meaningful way of distinguishing the hypothesis Ha, "The parameter is 0.5", from the hypothesis Hb, "The parameter is 0.5 + 10^−1000." Therefore, these two values must have the same probability, or probability values similarly close. Even in an utterly small (but not singleton) interval, there are an uncountable infinity of possible values of the parameter, all having similar probabilities. The sum of all these probabilities must be no more than one, so the probability of each value is "infinitely small". In the real numbers, there is no such thing as "infinitely small", apart from zero.
(Consider an utterly small interval [a,b]. Assume p is a lower bound on the probability of the hypothesis "The parameter has value x", for x ranging over the interval. But summing the probabilities over an infinite number of values from the interval gives ∞p plus the sum of the corresponding differences. The sum would have to be infinite unless p=0.)
It may appear to you that the sum of infinitely many terms that are all zero must be zero. But what really happens is that such a sum is ill defined. You can only define the probability of the parameter lying in a specific interval. Then there is an algebra of sets with associated probabilities. (Read about measure theory.)
From the observation, you may deduce a probability density. Or, you may assume some prior probability density, e.g. the constant density over the interval [0,1], and use Bayesian logic to assign a posterior probability density based on the observed evidence.
To keep the example as much as possible as it is, assume that you divide the interval [0,1] in N disjoint intervals of equal length, such that the value 0.5 unambiguously belongs to one particular of the intervals. For simplicity, assume N is an odd number, so that the value 0.5 is in the middle of interval number m=(N+1)/2. Now you may define the hypothesis H0, "The parameter lies in interval number m", and the complementary hypothesis H1, "The parameter does not lie in interval number m."
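The probability P(E|H0) of the observed sex ratio (k boys in n births) given this hypothesis must be computed as an integral over the interval that the parameter is supposed to be in:
<math>P(E \mid H_0) = N \int_{a_m}^{b_m} \binom{n}{k} \theta^k (1-\theta)^{n-k} \, d\theta</math>
where a_m and b_m are the boundaries of interval number m, and N = 1/(b_m − a_m) is the uniform density within that interval. If N is chosen large, P(E|H0) approaches the value of the computation done by the previous author. In the limit we are back at what the previous author wrote.
The probability P(E|H1) of the observed sex ratio given the alternative hypothesis is, of course,
<math>P(E \mid H_1) = \frac{N}{N-1} \left( \int_0^{a_m} + \int_{b_m}^1 \right) \binom{n}{k} \theta^k (1-\theta)^{n-k} \, d\theta</math>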
Again, in the limit of large N, we get the same result as the previous author. (I have not checked his computation of the integral.)
In other words, the error is not in the computation of the probabilities P(E|H0) and P(E|H1). The error is in assuming that H0 has a finite prior probability when it presupposes a single fixed value for the parameter θ, while the infinitely many other values that are infinitely close to the same value, each get an infinitely small (i.e. zero) prior probability.
If instead the prior probabilities are based on some continuous probability density, we should likely find that Bayesian logic prefers H1 if it covers a broad interval containing a value equal to the sex ratio k/n.
In the limit of large N, this is obviously so, since we then must set the prior probabilities to P(H0) = 0 and P(H1) = 1, and we get the posterior probabilities P(H0|E) = 0 and P(H1|E) = 1.
Cacadril (talk) 15:13, 15 July 2013 (UTC)
Dear Cacadril,
By construction the prior probability of theta=0.5 is 50% and the probability of it being different from 0.5 is the uniform distribution with mass 50%. This is based on Lindley's original paper. The point of the article is not whether this kind of prior is sensible or not, but that it leads to the paradox.
Best, 128.40.79.203 (talk) 15:47, 14 April 2014 (UTC)
Why no prior for Bayesian approach
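I am curious why the example does not use a Beta prior on the probabilities of the binomial model. Concretely, I am talking about
<math>P(k \mid H_1) = \int_0^1 \binom{n}{k} \theta^k (1-\theta)^{n-k} \, d\theta</math>
We are talking about the marginal likelihood here that is then used for the Bayes factor. Why go without a prior in that case?
I would rather go with:
<math>P(k \mid H_1) = \int_0^1 \binom{n}{k} \theta^k (1-\theta)^{n-k} \, \mathrm{Beta}(\theta; \alpha, \beta) \, d\theta</math>
With flat belief as we have in this case we can set α = β = 1. -- Preceding unsigned comment added by Ph singer (talk • contribs) 09:57, 13 January 2015 (UTC)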
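A Beta prior gives the marginal likelihood in closed form (the beta-binomial probability), so the suggestion above is easy to try. The sketch below is illustrative only: `marginal_likelihood` is a hypothetical helper, the data are the birth counts quoted on this page, and the Beta(500, 500) prior is an arbitrary example of a prior concentrated near 0.5, not a recommendation.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def marginal_likelihood(k, n, alpha, beta):
    """P(k | H1) when theta ~ Beta(alpha, beta): C(n, k) * B(k + alpha, n - k + beta) / B(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

n, k = 98451, 49581

flat = marginal_likelihood(k, n, 1.0, 1.0)             # Beta(1, 1) is the uniform prior
informative = marginal_likelihood(k, n, 500.0, 500.0)  # hypothetical prior centred on 0.5

print(f"flat prior:        P(k | H1) = {flat:.3e}  (1/(n+1) = {1.0 / (n + 1):.3e})")
print(f"Beta(500, 500):    P(k | H1) = {informative:.3e}")
```

With α = β = 1 the result matches the uniform-prior value 1/(n+1); a prior concentrated near 0.5 gives H_1 a larger marginal likelihood and therefore a smaller Bayes factor in favour of H_0.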
Erroneous statement
The statement
In fact if you were to picture real numbers as being continuous, then it would be more logical to assume that it would impossible for any given number to be exactly the parameter value, i.e., we should assume P(theta = 0.5) = 0.
is wrong. P is a probability density, not a probability, so it can be finite on an infinitesimal range. This subsection - under the "disputed" box - is out of character with what is otherwise a good explanation. — Preceding unsigned comment added by 149.217.40.222 (talk) 17:10, 29 April 2016 (UTC)
- You got the point. Sorry to say this, but this is no paradox, but just strange assumptions.
I will shortly explain what the problem is: This is Bayes' theorem for and k:
We here only have assumptions about , but no clear value. As the previous statement says, is 0 if you assume to have an infinite number of alternative . That means is also 0 for any k. We can see this also (and more useful) by using the law of total probability to rewrite to:
When we assume that the probability for all is the same, this cancels out to:
Now comes the point of no mistake: The denominator now increases to + infinity if you add more and more This is correct and the same as the previous statement of a vanishing chance for This is what you would usually do, when approaching that problem.
This is the question: What is the chance for any H with p≤0.5 to create male. This makes some sense if you want some information about the chance of p being larger .
Now we can use densities to answer that question: The integral correctly shows, that the probability to observe "k" for the sum of all possible hypotheses does not depend on . That is identical to the assumption that we know nothing about and means the chance for each (from k=0 to k=n) is . Make that clear for yourself, it is an interesting and if you once understood a very clear observation how the flatness in and fits together. The conflicting assumptions are that cannot be 0.5 if at the same time you know nothing about and . Now we use everything with changed and we do not assume a value for any (but that they are all the same a priori):
You can check that this is in our case ≈
which is exactly the same result as in the frequentist approach. Because this was just worked out by myself today, someone different should review this and then we can put this into the article.
Bay Siana (talk) 13:23, 14 February 2018 (UTC)
You can also ask the question if lies in a range of the e.g. 95% best possible . This is a bit more complicated, but let us further assume that "best " would be a range of ±s. The function you need is
For our case with this gives ≈. This is the chance that the value k comes from the "-area" from 0.5 to 0.507222. That means the chance that it is from outside this range is ≈ which is again extremely close to the frequentist approach.
(Bay Siana (talk) 15:15, 14 February 2018 (UTC))
You can also calculate like in the article, but then your assumption is really strange. You assume 50% chance for to be 0.5 and 50% chance to be "somewhere". This is a bad choice, because you would normally not think that a lower than 0.1 has the same chance to be the parameter as a e.g. between 0.49 and 0.51. So the "real" solution (between 0.49 and 0.51) is in this model "suppressed" by the wrong assumption that could be anywhere. I would not mix these two things, but if anyone wants to, a Gaussian should be used (with e.g. and ). (92.217.90.67 (talk) 15:13, 17 February 2018 (UTC))
R gives a different result for the https://en.wikipedia.org/wiki/Lindley%27s_paradox#Frequentist_approach
> dnorm(49581, mean=(0.5)*(49581+48870), sd=sqrt((1-0.5)*(0.5)*(49581+48870)))
[1] 0.0001951344
> dbinom(49581, size=49581+48870, prob=0.5)
[1] 0.0001951346
-- Preceding unsigned comment added by Ani1977 (talk • contribs) 14:38, 12 December 2019 (UTC)
Please explain last sentence
Hello,
Could someone please explain "(Of course, one cannot actually use the MLE as part of a prior distribution).", which is the second to last sentence in the article right now. If one cannot use it, why have we just used it? And why should this be obvious? I feel like this is VERY important, but not explained.
Hakunamatator (talk) 16:07, 3 April 2020 (UTC)
- The maximum likelihood estimate is obtained using the data—it's the value of the parameters that maximizes the likelihood of the data. The prior distribution, by definition, does not depend on the data, because it represents our beliefs before the data was collected. Thus, it would make no sense to use MLE to set the prior. I do agree with you that, since such an approach is invalid, giving it as an example doesn't make sense. Legendre17 (talk) 14:48, 13 October 2020 (UTC)
Alternative hypothesis in frequentist approach
The claim "the frequentist approach above tests H_0 without reference to H_1" (in the section "The lack of an actual paradox") seems wrong. The frequentist approach takes into account the alternative in at least two ways: the choice of a summary statistic, and the definition of what it means for values to be "at least as extreme". A similarly misleading statement is made below, "The frequentist finds that H_0 is a poor explanation for the observation". While it's true that the null and the alternative are treated asymmetrically, it's not true that the alternative is not used at all. Legendre17 (talk) 15:35, 13 October 2020 (UTC)
Bayesian Example integral needs clarification
I don't understand where the integral in the "Bayesian approach" section is coming from for P(H1|k). I'm not that familiar with Bayesian stats. Could we get some additional clarification or citation on why we are using this integral?
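A sketch of where such an integral comes from, assuming the setup discussed elsewhere on this page (H_0: θ = 0.5, H_1: θ uniform on [0, 1]); the symbols π_0 and π_1 for the prior probabilities of the two hypotheses are introduced here for illustration. Under H_1 the value of θ is unknown, so the probability of the data is the binomial likelihood averaged over the prior on θ, and Bayes' theorem then combines the two hypotheses:

```latex
\begin{align*}
P(k \mid H_1) &= \int_0^1 \binom{n}{k} \theta^k (1-\theta)^{n-k} \, d\theta
               = \binom{n}{k} B(k+1,\, n-k+1) = \frac{1}{n+1} \\
P(H_1 \mid k) &= \frac{P(k \mid H_1)\, \pi_1}{P(k \mid H_0)\, \pi_0 + P(k \mid H_1)\, \pi_1},
\qquad P(H_0 \mid k) = 1 - P(H_1 \mid k)
\end{align*}
```

Here B is the Beta function, and the last equality in the first line uses B(k+1, n−k+1) = k!(n−k)!/(n+1)!.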
Bayesian Approach point vs interval
It seems that the Bayesian approaches are looking at the probabilities of getting exactly k. But the frequentist approaches are looking at getting more than or equal to k. Is this observation relevant to the paradox?
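The two quantities can be computed side by side for the birth counts quoted on this page. This is a minimal sketch under H_0: θ = 0.5; the helper name and the recurrence-based tail sum are choices made here for illustration, not the article's method.

```python
import math

def binom_pmf_half(k, n):
    """P(K = k) for K ~ Binomial(n, 0.5), computed in log space for large n."""
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + n * math.log(0.5))
    return math.exp(log_p)

n, k = 98451, 49581

# Point probability: the likelihood of exactly k, which is what the Bayes factor uses.
point = binom_pmf_half(k, n)

# Upper tail P(K >= k): what a one-sided frequentist test uses.
# For theta = 0.5, successive terms satisfy pmf(j + 1) = pmf(j) * (n - j) / (j + 1).
tail = 0.0
term = point
for j in range(k, n + 1):
    tail += term
    term *= (n - j) / (j + 1)

print(f"P(K = k  | H0) = {point:.4e}")
print(f"P(K >= k | H0) = {tail:.4f}")
```

The point probability is small under every hypothesis simply because n is large, which is why the Bayesian calculation only ever uses it as a ratio between hypotheses, whereas the tail probability is compared directly with a significance level.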