Talk:Hypergeometric distribution
This article is rated B-class on Wikipedia's content assessment scale.
Example?
Could someone please add an example? - Someone who didn't sign
Does it really make sense to allow the parameter n to be 0, as the side bar suggests? 81.159.124.90 04:12, 4 December 2005 (UTC)
- No. Both N and n should be positive. The support is incorrect also. John Lawrence 21:05, 26 June 2007 (UTC)
To my mind, the table of mathematical characteristics of the hypergeometric distribution is much too wide. I suggest adding a line break in the kurtosis formula. Falk Lieder 16:00, 6 October 2006 (UTC)
Doesn't it make sense if n=0? In that case the probability of 0 successes is 1, and the probability of any other number of successes is 0. I mean the math doesn't break, and it gives the solution you would intuitively expect. Hwttdz 15:13, 22 October 2007 (UTC)
- Agree, I'll change it back. John Lawrence 14:35, 25 October 2007 (UTC)
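For anyone who wants to check the n=0 case numerically, here is a minimal sketch. The function name `hypergeom_pmf` and the values N=20, m=7 are only illustrative assumptions, not anything taken from the article.

```python
from math import comb

def hypergeom_pmf(k, N, m, n):
    """P(X = k) when n items are drawn without replacement from N items,
    m of which count as successes. (Hypothetical helper for illustration.)"""
    # Outside the support the probability is zero.
    if k < max(0, n + m - N) or k > min(n, m):
        return 0.0
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

# Illustrative values (assumed): 20 items, 7 successes, and n = 0 draws.
N, m = 20, 7
probs_n0 = [hypergeom_pmf(k, N, m, 0) for k in range(4)]
print(probs_n0)  # all mass sits on k = 0
```

With n=0 the formula reduces to C(m,0)·C(N-m,0)/C(N,0) = 1 at k=0, so nothing breaks.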
(was: helpme)
Formatting problem:
The two tables "Probability mass function" and "drawn / not drawn" are overlapping each other when viewed in Mozilla Firefox version 1 and 2. (There is no problem in Internet Explorer).
I have tried various HTML tags, putting headings between the tables, centering, etc. Nothing helps.
I think it has to do with the style sheets allowing text to float around tables. Do I have any access to change this? Arnold90 10:42, 17 June 2007 (UTC)
- I tried an align=right. If that isn't ok I suggest asking at the Help desk to get more opinions.--Commander Keane 10:55, 17 June 2007 (UTC)
- That helped. Thank you. 13:22, 17 June 2007 (UTC)
The fix has been undone?! Fixing it again. Arnold90
related distributions section
The statement in the related distributions section doesn't make sense to me. X is a random variable, not a sequence, so taking the limit as written doesn't make sense. Also, unless D goes to infinity, the limiting distribution will be degenerate. I would like to replace it with a statement that expresses the same idea in an informal way. I would also like to add comments showing the relationship to the Bernoulli and normal distributions. I will wait for comments about this change before changing anything. Here is how I would like the section to read:
Let X ~ Hypergeometric(N, D, n) and p = D/N.
- If n = 1, then X has a Bernoulli distribution with parameter p.
- If N and D are large compared to n and p is not close to 0 or 1, then P[X ≤ x] ≈ P[Y ≤ x], where Y has a binomial distribution with parameters n and p.
- If n is large, N and D are large compared to n, and p is not close to 0 or 1, then P[X ≤ x] ≈ Φ((x − np)/√(np(1 − p))), where Φ is the standard normal distribution function.
Johnlv12 14:13, 26 June 2007 (UTC)
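To give a feel for how close these approximations are, here is a rough numerical sketch. The parameter values, the symbol names (N for population size, D for successes in the population, n for sample size) and the helper names are all assumptions chosen for illustration. The normal approximation is used without a continuity correction.

```python
from math import comb, erf, sqrt

def hypergeom_cdf(x, N, D, n):
    """P(X <= x) by direct summation of the hypergeometric pmf."""
    lo = max(0, n + D - N)
    num = sum(comb(D, k) * comb(N - D, n - k) for k in range(lo, min(x, n, D) + 1))
    return num / comb(N, n)

def binom_cdf(x, n, p):
    """P(Y <= x) for Y ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(min(x, n) + 1))

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Assumed illustrative values: population much larger than the sample.
N, D, n = 100_000, 40_000, 50
p = D / N
x = 22

exact = hypergeom_cdf(x, N, D, n)
binom = binom_cdf(x, n, p)
normal = normal_cdf((x - n * p) / sqrt(n * p * (1 - p)))
print(exact, binom, normal)

# Bernoulli case: with a single draw, P(X = 0) should equal 1 - p.
bernoulli_p0 = hypergeom_cdf(0, N, D, 1)
```

In this setting the binomial value tracks the exact one much more closely than the normal value does, which is what one would expect for n = 50.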
application and example section
I would like to remove everything from this section after the first horizontal line. The example of how to use the calculator is not appropriate here. I would just leave the link to the website at the bottom of the page under "external links". Also, the relationship to the binomial distribution is already in the "related distribution" section, so it is not needed here. Johnlv12 14:21, 26 June 2007 (UTC)
various additions and modifications
I would like to change the name of parameter D to m. Any objections?
I am going to add another symmetry relation and recurrence relations, and a formula for the mode.
I want to add a section for the multivariate hypergeometric distribution. I think it is OK to put this into the same article in analogy with Fisher's noncentral hypergeometric distribution and Wallenius' noncentral hypergeometric distribution. It is convenient to have the formulas for more than two colors of marbles on the same page. Any arguments for putting the multivariate distribution in a separate entry?
Arnold90 15:59, 1 July 2007 (UTC)
Somebody anonymously interchanged white to be failure and black to be success in the example. But they only changed it in one place, which makes the remainder incorrect. I have changed it back to the way it was originally written. I have no problem with changing the colors or not referring to colors at all (although I think visualizing different colored balls helps). If someone wants to change the colors, they have to make all the changes needed to keep the example correct. John Lawrence 14:25, 19 October 2007 (UTC)
In the introductory illustration, it states "Suppose you are to draw "n" balls without replacement from an urn containing "N" balls in total, "m" of which are white." This implies that m cannot be larger than n. Yet the sidebox states that m is an element of 0, 1, 2, ... N. I believe this should read "m is an element of 0, 1, 2, ... n." 203.214.117.238 (talk) 04:29, 7 May 2010 (UTC) Never mind. Figured it out, the wording was a bit ambiguous. Perhaps this could be rewritten so it doesn't sound like m is the number of successes in the draw, but rather the total number of potential successes.
I changed the colors of the marbles from white(success)/black(failure) to green(success)/red(failure). I triple checked, but if I missed any instances, let me know or correct it. Thanks! --Cammy169 (talk) 15:09, 28 February 2014 (UTC)
Two new subsections
I just added two subsections to greatly expand the symmetries section, putting a stronger expository emphasis on non-sampling applications where the symmetries are more self-evident. I do research concerning stochastic algorithms which greatly involve hypergeometric distributions, but I'm not by any means a proper statistician, so my focus is more on alternate modes of conceptualization, rather than technical statements. I suspect I added perhaps more exposition than is appropriate for an article of mathematical bent. However, I do think that my material at least points out some perspectives that could be more forcefully treated in line with the rest of the exposition, if anyone wishes to adapt my contribution in that direction. I worked quite hard to distinguish sampling artifacts from the underlying distribution.
Coming from the symmetric perspective, I do have some technical concerns elsewhere in the article, but as a non-statistician I held back on making any edits to existing material.
The parameters of the distribution are expressed as m on 0..N and n on 1..N. Yet technically, that statement breaks the formal statements of symmetry. Under that statement of parameters, one can not interchange n with m when m==0.
Equations provided for the mean and mode exhibit obvious symmetry in n and m, but the variance equation does not. I feel it should. If I can still do basic algebra (highly debatable) a symmetric expression of the variance might look like this:
[nmN(N-n-m) + (nm)^2] / [N^2 (N-1)]
I happen to like grouping all the division terms on the denominator, as you now have only one place to look to figure out the values of N where variance is undefined. Note that none of the higher order terms indicate the conditions under which they are defined. Perhaps it is standard fare in the stats articles that higher order statistics are only defined for sufficient N. E.g. skewness is undefined for N==2, even though N==2 is viable under parameters. Interestingly the skewness is expressed with symmetry, but kurtosis is not. I grant that anyone who can reason correctly from kurtosis can establish the symmetric form themself.
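Since the algebra was flagged as uncertain, here is a small sketch that checks the symmetric variance expression against the variance computed directly from the pmf, using exact rational arithmetic. The function names and the range of N tested are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

def pmf(k, N, m, n):
    """Exact hypergeometric pmf as a Fraction."""
    return Fraction(comb(m, k) * comb(N - m, n - k), comb(N, n))

def variance_from_pmf(N, m, n):
    """Variance computed from the definition, summing over the support."""
    ks = range(max(0, n + m - N), min(n, m) + 1)
    mean = sum(k * pmf(k, N, m, n) for k in ks)
    return sum((k - mean) ** 2 * pmf(k, N, m, n) for k in ks)

def variance_symmetric(N, m, n):
    """The proposed form: [nmN(N-n-m) + (nm)^2] / [N^2 (N-1)]."""
    return Fraction(n * m * N * (N - n - m) + (n * m) ** 2, N**2 * (N - 1))

# Compare the two for every parameter combination with 2 <= N <= 12
# (N = 1 is skipped because the denominator N - 1 vanishes).
all_match = all(
    variance_from_pmf(N, m, n) == variance_symmetric(N, m, n)
    for N in range(2, 13)
    for m in range(N + 1)
    for n in range(N + 1)
)
print(all_match)
```

The identity behind it is (N-m)(N-n) = N(N-n-m) + nm, so the symmetric form is just the usual variance with that product expanded.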
Why is entropy not given with summation notation? It's a fairly simple sum, and I certainly use it a lot in my own work. Likewise, the median could be expressed easily enough to the nearest integer by averaging the endpoints of the support. I suppose technically the median is only defined when it happens to land on an integer. As a computer scientist, I would tend to view median as one end of a rank order, where rank is abs(left_side-right_side), so by my view, nearby integers adjoined by rounding would both be medians under best-rank selection. Midpoints don't necessarily exist, but non-empty ranks always have mins/maxs. Sometimes you require a true median, sometimes you just want something as close as possible.
As I said, I'm not wedded to the exposition I added. I offer it in the spirit that it will at least suggest useful perspectives to future editors regardless of whether my text stands as contributed. It goes without saying my technical claims should be checked by a competent checker. MaxEnt 08:17, 28 September 2007 (UTC)
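As a sketch of what an entropy-by-summation entry would amount to in practice, the following computes the entropy (in nats) directly from the pmf. The function name and the test values are assumptions for illustration only.

```python
from math import comb, log

def hypergeom_entropy(N, m, n):
    """Shannon entropy in nats, summed over the support of the distribution."""
    lo, hi = max(0, n + m - N), min(n, m)
    total = comb(N, n)
    h = 0.0
    for k in range(lo, hi + 1):
        # Every k inside the support has strictly positive probability,
        # so log(p) is always defined here.
        p = comb(m, k) * comb(N - m, n - k) / total
        h -= p * log(p)
    return h

# Assumed illustrative parameters.
h_mn = hypergeom_entropy(30, 8, 12)
h_nm = hypergeom_entropy(30, 12, 8)  # n and m interchanged
print(h_mn, h_nm)
```

The entropy is unchanged when n and m are swapped, which fits the symmetry theme of the sections discussed above.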
- Neither of these sections makes sense to me and I'm not sure if they belong in an encyclopedia type article. I feel like I would have to read them at least 3-4 times to understand what is written, then what would I know about the hypergeometric distribution afterwards? Too specialized and technical to be of general interest to most people who want to find out what the hypergeometric distribution is. John Lawrence 14:12, 19 October 2007 (UTC)
Notation
Quote: "There is a shipment of N objects in which m are defective. The hypergeometric distribution describes the probability that exactly k objects are defective in a sample of n distinct objects drawn from the shipment."
I prefer using K rather than m, such that upper case letters refer to the population and lower case letters refer to the sample.
"There is a shipment of N objects in which K are defective. The hypergeometric distribution describes the probability that in a sample of n distinct objects drawn from the shipment exactly k objects are defective."
| | drawn | total |
|---|---|---|
| defective | k | K |
| total | n | N |
Bo Jacoby (talk) 06:21, 26 June 2009 (UTC).
I agree. The proposed notation would improve readability. And there seems to be no convention for these variable names: for example, R and Mathematica use different notations (confusingly, they use the same symbols to mean different things).
If no one objects, I will implement the change.
AllenDowney (talk) 13:30, 7 June 2011 (UTC)
- I like having lower and upper case mean sample and population, but in handwriting (which many people will use to copy out the formula for their own application), k and K can easily be confused. Perhaps we could use g and G in honor of the green marbles? Numbersinstitute (talk) 16:19, 5 April 2018 (UTC)
Significant error in expression for excess kurtosis
I am still working on the algebra to make the correct expression pretty, but plan to edit it soon unless someone else would rather do it. If you have comments or suggestions, please advise. Fizzbowen (talk) 03:28, 15 November 2010 (UTC)
added proof
Just added another (simpler) proof to one of the theorems. I am new to editing Wikipedia pages and not yet up to speed on the customary protocol. So if I did something wrong, rude, etc., it was not intentional but out of ignorance, for which I apologize. Tips, comments, corrections, etc. are therefore welcome.
Gunungblau (talk) 16:50, 12 February 2011 (UTC)
Symmetry application: Doctor problem?
Does anyone know if there is a reference for the section entitled, "symmetry application" about the children with the bone marrow?
If there is no reference, I think it would be instructive for someone to put an example equation onto the page for each of those examples. —Preceding unsigned comment added by 131.194.104.136 (talk) 19:13, 24 May 2011 (UTC)
/* Simple Definition */ removed this confusing and incorrect subsection
An anonymous editor insists on including an incorrect subsection. Please stop it. Bo Jacoby (talk) 01:09, 24 December 2011 (UTC).
Texas Hold'Em Example
Is this example correct? It appears to ignore the cards held by the other players. The probability of seeing more clubs will depend on how many clubs are already held by the other players, surely?
Marchino61 (talk) 10:52, 4 April 2013 (UTC)
- Not sure if it's correct, and that needs addressing. Meanwhile I added an election example, which is simpler and has more practical meaning. Numbersinstitute (talk) 17:09, 6 April 2018 (UTC)
- if you know other players have clubs or don't have clubs then it affects the calculation of the probability, but in general the probability is based on not knowing where the cards are. — Preceding unsigned comment added by Friafternoon (talk • contribs) 15:27, 13 April 2018 (UTC)
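The point about unknown cards can be checked exactly. The sketch below assumes a hypothetical hand in which you hold two clubs, so 50 cards are unseen and 11 of them are clubs. It then averages over every possible number of clubs hidden in the opponents' hands and asks for the probability that the next card dealt is a club. The hand, the function names, and the numbers of opponent cards tried are all assumptions for illustration and are not taken from the article's example.

```python
from fractions import Fraction
from math import comb

def pmf(j, N, K, n):
    """Exact hypergeometric pmf: j successes in n draws from N items with K successes."""
    if j < max(0, n + K - N) or j > min(n, K):
        return Fraction(0)
    return Fraction(comb(K, j) * comb(N - K, n - j), comb(N, n))

UNSEEN, CLUBS_UNSEEN = 50, 11  # assumption: you hold two clubs

def prob_next_card_is_club(opponent_cards):
    """Law of total probability over the unknown number j of clubs held by opponents."""
    remaining = UNSEEN - opponent_cards
    return sum(
        pmf(j, UNSEEN, CLUBS_UNSEEN, opponent_cards) * Fraction(CLUBS_UNSEEN - j, remaining)
        for j in range(min(opponent_cards, CLUBS_UNSEEN) + 1)
    )

# 0, 1, 4 and 8 opponents, each holding two face-down cards.
results = {c: prob_next_card_is_club(c) for c in (0, 2, 8, 16)}
print(results)
```

Every entry comes out as 11/50: as long as the opponents' cards are unknown, they behave like any other unseen cards and the probability is unaffected.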
What are F1 and F2?
In the table on the right of the article, there is an F1 and F2. However, I can't find those linked-to or defined anywhere in the article. — Preceding unsigned comment added by 205.155.65.226 (talk) 19:46, 24 April 2013 (UTC)
- In fact you are looking at ₂F₁ and ₃F₂. These are generalized hypergeometric functions. --Rumping (talk) 21:40, 10 August 2013 (UTC)
Cumulative Multivariate
The PDF of the multivariate distribution is here, but what about the CDF? From this article, it would appear that one would just "sum" the probabilities into a CDF. Since summing probabilities is USUALLY a bad idea, it would be good to mention it here as the hypergeometric case is special.
http://stattrek.com/probability-distributions/hypergeometric.aspx
— Preceding unsigned comment added by 98.207.93.61 (talk) 10:37, 25 May 2013 (UTC)
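Summing is legitimate here because distinct count vectors are mutually exclusive outcomes, so their probabilities add. A rough sketch of a joint cumulative probability built that way is below. The function names, the three-color urn (5, 10, 15) and the sample size 6 are assumptions for illustration.

```python
from itertools import product
from math import comb, prod

def multivariate_hypergeom_pmf(ks, Ks):
    """P(X_1 = ks[0], ..., X_c = ks[c-1]) for an urn with Ks[i] items of color i."""
    n, N = sum(ks), sum(Ks)
    # comb(K, k) is 0 when k > K, so impossible count vectors get probability 0.
    return prod(comb(K, k) for K, k in zip(Ks, ks)) / comb(N, n)

def joint_cdf(limits, Ks, n):
    """P(X_1 <= limits[0], ..., X_{c-1} <= limits[c-2]) for a sample of size n.
    The count of the last color is determined by the others."""
    total = 0.0
    for head in product(*(range(limit + 1) for limit in limits)):
        last = n - sum(head)
        if last >= 0:
            total += multivariate_hypergeom_pmf((*head, last), Ks)
    return total

Ks, n = (5, 10, 15), 6                 # assumed urn composition and sample size
everything = joint_cdf((n, n), Ks, n)  # no real restriction: should be 1
marginal = joint_cdf((2, n), Ks, n)    # only the first color is restricted

# The same marginal probability from the univariate distribution of color 1.
univariate = sum(comb(5, k) * comb(25, n - k) for k in range(3)) / comb(30, n)
print(everything, marginal, univariate)
```

Restricting only one color reproduces the ordinary univariate hypergeometric CDF, which is a useful sanity check on the summation.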
Cumulant generating function.
Can someone please include the cumulant generating function for the hypergeometric distributions? And for the negative hypergeometric distributions? I can't find it in Wolfram Alpha. Thank you! Bo Jacoby (talk) 13:00, 17 February 2023 (UTC)
Tail bound condition for t appears incorrect
See Chvátal's proof. For the (upper) tail, the condition for t is
https://www.sciencedirect.com/science/article/pii/0012365X79900840 Ustcgcy (talk) 05:48, 2 June 2024 (UTC)
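While the exact condition is being sorted out, a numerical check may help. The sketch below compares the exact upper tail P[X ≥ (p+t)n] with the Hoeffding-type bound exp(-2t²n) over a grid of t values with p + t ≤ 1. The parameter values, the grid, and the function name are assumptions for illustration. The check says nothing about which range of t the article should state.

```python
from fractions import Fraction
from math import ceil, comb, exp

def upper_tail(N, K, n, threshold):
    """P(X >= threshold) by direct summation of the hypergeometric pmf."""
    start = max(ceil(threshold), 0, n + K - N)
    num = sum(comb(K, k) * comb(N - K, n - k) for k in range(start, min(n, K) + 1))
    return num / comb(N, n)

# Assumed illustrative parameters: p = K/N = 3/10.
N, K, n = 200, 60, 40
p = Fraction(K, N)

checks = []
for i in range(1, 15):          # t = 1/20, 2/20, ..., 14/20, so p + t <= 1
    t = Fraction(i, 20)         # exact rational t avoids rounding at the threshold
    tail = upper_tail(N, K, n, (p + t) * n)
    bound = exp(-2 * float(t) ** 2 * n)
    checks.append((t, tail, bound))

bound_holds = all(tail <= bound for _, tail, bound in checks)
print(bound_holds)
```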
Citations for formulas
Can anyone provide a citation for the CDF formula? I think I see how to verify it but my understanding is that these pages are not supposed to contain original research. It would be nice to have a concrete reference, and I have searched for some time with no luck. It's not on MathWorld (https://mathworld.wolfram.com/HypergeometricDistribution.html) or "An Introduction To Probability Theory and Its Applications" (Feller), or the cited "Mathematical Statistics and Data Analysis" (Rice) or the cited "HyperQuick algorithm for discrete hypergeometric distribution". This is of some importance since it is the formula for the p-value in the Fisher test, and the ecosystem of software implementations of the Fisher test is sort of a mess. Jimmymath (talk) 20:53, 24 January 2025 (UTC)
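This is not a citation, but for anyone verifying a closed-form CDF, a direct summation of the pmf in exact arithmetic gives an independent reference value. The sketch below does that and uses it for a one-sided Fisher p-value on a 2×2 table. The function names and the example table are assumptions for illustration.

```python
from fractions import Fraction
from math import comb

def hypergeom_cdf(x, N, K, n):
    """Exact P(X <= x) by summing the pmf over the support up to x."""
    lo = max(0, n + K - N)
    num = sum(comb(K, k) * comb(N - K, n - k) for k in range(lo, min(x, n, K) + 1))
    return Fraction(num, comb(N, n))

def fisher_one_sided_greater(a, b, c, d):
    """One-sided p-value P(X >= a) for the table [[a, b], [c, d]],
    with X hypergeometric given the table margins."""
    N, K, n = a + b + c + d, a + c, a + b
    return 1 - hypergeom_cdf(a - 1, N, K, n)

# Assumed example table: [[3, 1], [1, 3]].
p_value = fisher_one_sided_greater(3, 1, 1, 3)
print(p_value)  # 17/70
```

Any candidate closed form can be compared against `hypergeom_cdf` over a grid of parameters before it is trusted.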