Supplementary material to “Misapplication of a Statistical Test: Comment on 'Lies, Damned Lies, and Statistics (in Geology)'”
22 February 2011
Robert S. Weigel, Department of Computational and Data Sciences, George Mason University, Fairfax, Virginia
Citation:
Weigel, R. S. (2011), Misapplication of a statistical test: Comment on “Lies, damned lies, and statistics (in geology),” Eos Trans. AGU, 92(8), 65, doi:10.1029/2011EO080010.
In this supplement we provide details about the two problems with the analysis described in the comment. There may be other problems with the analysis involving assumptions about the physical process that generated the data (earthquakes), but an understanding of the physics involved is not required to recognize the statistical problems discussed here. The data from the earthquake catalog [USGS, 2010] used in the following analysis are included as part of this supplement (epic.cgi.html).
Problem 1:
In Vermeesch [2009], a 7-bin histogram, with one bin per day of the week and bin amplitudes equal to the total number of earthquakes with that day-of-week label, was used to test the null hypothesis that the occurrence of earthquakes does not depend on the day of the week. The hypothesis was rejected when all data were used because p was found to be $10^{-18}$ (corresponding to chi-squared=94), which is far less than the typical cut-off value of p=0.05. It was argued that if one-tenth of the data were used, chi-squared would be 9.4 and the hypothesis would not be rejected, because chi-squared=9.4 corresponds to p=0.15. This argument was used to support the claim that "the strong dependence of p values on sample size makes them uninterpretable."
Pearson's chi-squared test applies to a histogram generated from data from a very restricted type of experiment. There are many ways to create a uniform histogram for which the chi-squared test does not apply. Pearson's chi-squared test was derived assuming that a k-category distribution was created by rolling a k-sided die, in which case the sampling distribution of the amplitudes of each bin is Poisson. For a uniform distribution, the chi-squared statistic is $(1/E)\sum_{i=1}^{k} (n_i-E)^2$, where $n_i$ is the number observed in the $i$th bin and $E=(1/k)\sum_{i=1}^{k} n_i$ is the number expected in each bin.
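The statistic defined above, and the conversion from chi-squared to p for the 6 degrees of freedom of a 7-bin histogram, can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the comment: the function names are hypothetical, and the p value uses the closed-form upper tail of the chi-squared distribution that holds only for an even number of degrees of freedom.

```python
import math

def chi_square_uniform(counts):
    """Pearson chi-squared statistic for the null hypothesis of equal bin counts."""
    k = len(counts)
    expected = sum(counts) / k  # E = (1/k) * sum of n_i
    return sum((n - expected) ** 2 for n in counts) / expected

def p_value_even_dof(chi2, dof):
    """Upper-tail probability for a chi-squared variable with even dof.

    Uses the closed form exp(-x/2) * sum_{j < dof/2} (x/2)^j / j!,
    which is valid only when dof is a positive even integer.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form needs a positive even dof")
    half = chi2 / 2.0
    terms = (half ** j / math.factorial(j) for j in range(dof // 2))
    return math.exp(-half) * sum(terms)
```

With these definitions, `p_value_even_dof(94, 6)` is of order $10^{-18}$ and `p_value_even_dof(9.4, 6)` is about 0.15, in agreement with the values quoted from Vermeesch [2009].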
Not all uniform distributions have a sampling distribution of bin amplitudes that is Poisson. To see why this distribution is important in the chi-squared test, consider a uniform histogram created by summing bin values after filling each bin with a value drawn from a distribution with a large standard deviation (and large mean so that all values are positive). This histogram is expected to have a chi-squared value that is larger than one created by drawing numbers from a distribution with a very small standard deviation but identical mean. This follows directly from the chi-squared formula, which, for a fixed expectation value of the bin amplitudes, $E$, depends on the standard deviation. In contrast, a Poisson distribution is more restricted. Its mean and variance are both equal to a single parameter, $\lambda$, so its standard deviation is $\sqrt{\lambda}$. If the standard deviation is large, then $E$ is also large.
This can be shown numerically by creating a uniform 7-bin histogram in which the number of earthquakes measured on each of the 3654 days in the original data set is replaced with a number drawn from either a Poisson distribution with $\lambda$ of 32 or a Gaussian distribution with mean 32 and standard deviation of 9. Creating 100 histograms in this way resulted in an average chi-squared value of 6 for the Poisson distribution and 15 for the Gaussian distribution. The chi-squared value is always about 6 for the Poisson distribution with any $\lambda$ (as expected because a 6 degree-of-freedom chi-squared distribution has a mean of 6). However, the chi-squared value for the Gaussian case depends strongly on the standard deviation; the value of 15 is close to what is obtained by scaling 6 by the variance-to-mean ratio of the daily values, 81/32. As the standard deviation of the sampling distribution increases, the chi-squared value increases, as expected.
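The numerical experiment described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the original code: day-of-week labels are assigned as the day index modulo 7, the Poisson sampler is Knuth's multiplication method (the Python standard library has no Poisson generator), and the function names are hypothetical.

```python
import math
import random

DAYS = 3654  # number of days in the data set, from the text
BINS = 7     # day-of-week bins

def chi_square_uniform(counts):
    """Pearson chi-squared statistic for equal expected bin counts."""
    expected = sum(counts) / len(counts)
    return sum((n - expected) ** 2 for n in counts) / expected

def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for a moderate lam such as 32."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def mean_chi_square(draw, trials, rng):
    """Average chi-squared over `trials` synthetic 7-bin histograms."""
    total = 0.0
    for _ in range(trials):
        bins = [0.0] * BINS
        for day in range(DAYS):
            bins[day % BINS] += draw(rng)  # assumed labeling: index mod 7
        total += chi_square_uniform(bins)
    return total / trials

def compare(trials=100, seed=0):
    """Return (Poisson mean chi-squared, Gaussian mean chi-squared)."""
    rng = random.Random(seed)
    poisson = mean_chi_square(lambda r: poisson_draw(r, 32.0), trials, rng)
    gaussian = mean_chi_square(lambda r: r.gauss(32.0, 9.0), trials, rng)
    return poisson, gaussian
```

Calling `compare()` should give a Poisson average near 6 and a noticeably larger Gaussian average; the exact numbers vary with the seed and the number of trials.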
A randomization approach can be used to create a uniform distribution with the original data to determine if it may make sense to apply a chi-squared hypothesis test of a uniform distribution. In the case of the earthquake data, we do this by taking the list of 3654 {ordinal day number in list, number of earthquakes on day} pairs and repeating the chi-squared/p value calculation after shuffling the ordinal day numbers. The result is p = $8\times10^{-3}$ +/- $3\times10^{-2}$ (chi-squared=57 +/- 32) for 100,000 experiments. Using a standard cut-off value of p = 0.05 would lead to rejection of the hypothesis of day-of-week independence for data with no day-of-week dependence (because it was removed by shuffling the labels). The explanation for this is that the sampling distribution of the shuffled histogram's bins is not approximately Poisson and so the chi-squared test does not apply.
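A minimal sketch of the shuffling procedure is given below. It assumes the catalog has already been reduced to a list of daily earthquake counts in chronological order (the hypothetical argument `daily_counts`) and that the day-of-week label is the list index modulo 7. Shuffling the counts while keeping the positions fixed is equivalent to shuffling the ordinal day numbers.

```python
import random

def chi_square_uniform(counts):
    """Pearson chi-squared statistic for equal expected bin counts."""
    expected = sum(counts) / len(counts)
    return sum((n - expected) ** 2 for n in counts) / expected

def shuffled_chi_squares(daily_counts, trials, seed=0, bins=7):
    """Chi-squared values for histograms built after shuffling the day labels."""
    rng = random.Random(seed)
    counts = list(daily_counts)  # copy so the caller's list is not modified
    results = []
    for _ in range(trials):
        rng.shuffle(counts)  # removes any link between a count and its weekday
        hist = [0] * bins
        for day, n in enumerate(counts):
            hist[day % bins] += n  # assumed labeling: index mod 7
        results.append(chi_square_uniform(hist))
    return results
```

If the bin amplitudes were Poisson distributed, the returned values would scatter around 6; an average far above 6, as reported above for the earthquake data, indicates that the test's assumptions are not met.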
Problem 2:
If Pearson's chi-squared test is applied to data for which it is applicable, the associated p value should not depend strongly on sample size. This is implied by the equation for the chi-squared test, which makes no reference to the total number of values used to create the histogram (although empirically it has been shown that a minimum of 5 values in each bin is appropriate [Bulmer, 1979]). This can be shown numerically by simulating a draw of N = 118415 values from a probability distribution function with 7 equally probable day-of-week labels. Doing so results in chi-squared close to the expected value of 6.0 for N, N/10, N/100, and N/1000 draws.
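The sample-size experiment can be sketched as follows. The function names and the choice to average over repeated trials are assumptions made for illustration; a single draw fluctuates around 6 with a standard deviation of about 3.5, so averaging makes the comparison across sample sizes easier to read.

```python
import random

def chi_square_uniform(counts):
    """Pearson chi-squared statistic for equal expected bin counts."""
    expected = sum(counts) / len(counts)
    return sum((n - expected) ** 2 for n in counts) / expected

def mean_chi_square_for_draws(n_draws, trials, seed=0, bins=7):
    """Average chi-squared when n_draws labels are drawn uniformly from `bins`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        hist = [0] * bins
        for label in rng.choices(range(bins), k=n_draws):
            hist[label] += 1
        total += chi_square_uniform(hist)
    return total / trials

def sample_size_sweep(n_full=118415, trials=20, seed=0):
    """Mean chi-squared for N, N/10, N/100, and N/1000 draws."""
    sizes = [n_full, n_full // 10, n_full // 100, n_full // 1000]
    return {n: mean_chi_square_for_draws(n, trials, seed) for n in sizes}
```

Each value returned by `sample_size_sweep()` should be near 6 regardless of the number of draws, up to sampling noise that shrinks as `trials` increases.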
As discussed in the main text, the claim that 10 times fewer measurements would lead to a factor-of-ten reduction in chi-squared could be tested using the existing data. One could use a resampling approach in which a reduced histogram is created by keeping only a random selection of one-tenth of the days in each bin of the original histogram. Creating 1000 histograms in this way gave an average chi-squared value of 70 +/- 45.
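The resampling step can be sketched as follows, again assuming a chronological list of daily counts (the hypothetical `daily_counts`) with the day-of-week label taken as the list index modulo 7. The parameter `keep_one_in` is an illustrative name for the reduction factor.

```python
import random

def chi_square_uniform(counts):
    """Pearson chi-squared statistic for equal expected bin counts."""
    expected = sum(counts) / len(counts)
    return sum((n - expected) ** 2 for n in counts) / expected

def weekday_histogram(daily_counts, bins=7):
    """Total count per day-of-week bin (assumed labeling: index mod 7)."""
    hist = [0] * bins
    for day, n in enumerate(daily_counts):
        hist[day % bins] += n
    return hist

def reduced_histogram(daily_counts, rng, keep_one_in=10, bins=7):
    """Histogram built from a random 1/keep_one_in of the days in each bin."""
    by_bin = [[] for _ in range(bins)]
    for day, n in enumerate(daily_counts):
        by_bin[day % bins].append(n)
    # Sample days without replacement within each bin, then sum their counts.
    return [sum(rng.sample(days, len(days) // keep_one_in)) for days in by_bin]

def reduced_chi_squares(daily_counts, trials, seed=0):
    """Chi-squared values for `trials` independently reduced histograms."""
    rng = random.Random(seed)
    return [chi_square_uniform(reduced_histogram(daily_counts, rng))
            for _ in range(trials)]
```

The mean and standard deviation of the list returned by `reduced_chi_squares(daily_counts, 1000)` are the quantities to compare with the factor-of-ten prediction.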
Bulmer, M. G. (1979), Principles of Statistics, Dover Publications.
