
Supplementary material to “On the Correct Use of Statistical Tests: Comment on 'Lies, Damned Lies, and Statistics (in Geology)'”

22 February 2011

D. Sornette, Department of Management, Technology and Economics, Eidgenössische Technische Hochschule Zürich, Zurich, Switzerland

V. F. Pisarenko, International Institute of Earthquake Prediction Theory and Mathematical Geophysics, Russian Academy of Sciences, Moscow, Russia

Citation:

Sornette, D., and V. F. Pisarenko (2011), On the correct use of statistical tests: Comment on “Lies, damned lies, and statistics (in geology),” Eos Trans. AGU, 92(8), 64, doi:10.1029/2011EO080008. [Full Article (pdf)]

D. Sornette (1, 2, 3) and V. F. Pisarenko (4)

(1) ETH Zurich, D-MTEC, Kreuzplatz 5, CH-8032 Zurich, Switzerland
(2) ETH Zurich, Department of Earth Sciences
(3) Institute of Geophysics and Planetary Physics, University of California, Los Angeles, California 90095
(4) International Institute of Earthquake Prediction Theory and Mathematical Geophysics, Russian Ac. Sci., Profsoyuznaya 84/32, Moscow 117997, Russia
Emails: dsornette@ethz.ch and pisarenko@yasenevo.ru

In a recent Forum in Eos entitled «Lies, Damned Lies and Statistics (in Geology)», Vermeesch (2009) applied the standard chi-square test to a global catalog of earthquakes (USGS, http://earthquake.usgs.gov; 118,415 events, 4.0 ≤ m ≤ 9.0, 1999–2009) and found that the distribution of global seismicity over weekdays is not uniform. The chi-square sum S(D) is 94, which corresponds to an extremely small p-value of $4.5 \times 10^{-18}$ for the null hypothesis that «the occurrence of earthquakes does not depend on the day of the week». This makes the null hypothesis of a uniform distribution of seismicity over weekdays untenable.
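The two p-values discussed in this exchange can be recovered from the chi-square sums alone. The sketch below assumes 7 − 1 = 6 degrees of freedom (seven weekday bins, no fitted parameters) and uses the closed-form upper tail of the chi-square distribution, which exists for even degrees of freedom. The function name `chi2_sf_even_df` is ours and is used for illustration only.

```python
import math

def chi2_sf_even_df(s: float, df: int) -> float:
    """Upper-tail probability P(X >= s) for a chi-square variable.

    Uses the closed form exp(-s/2) * sum_{j < df/2} (s/2)^j / j!,
    which is valid only for positive even degrees of freedom.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form only holds for positive even df")
    half = s / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j          # builds (s/2)^j / j! incrementally
        total += term
    return math.exp(-half) * total

DF = 7 - 1  # assumption: seven weekday bins, no parameters estimated

p_full = chi2_sf_even_df(94.0, DF)    # chi-square sum of the full catalog
p_scaled = chi2_sf_even_df(9.4, DF)   # chi-square sum divided by 10

print(f"S = 94.0 -> p = {p_full:.2e}")
print(f"S =  9.4 -> p = {p_scaled:.2f}")
```

With these inputs the first value is of order 10^-18 and the second is close to 0.15, in line with the figures quoted in the text.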

Vermeesch then applied the following operation to the catalog: “Using the same proportion of earthquake occurrences but reducing the sample size by a factor 10 results in a 10 times smaller chi-square value (S(D) = 9.4), corresponding to a p-value of 0.15, which is greater than 0.05 and fails to reject the null hypothesis. In conclusion, the strong dependence of p-values on sample size makes them uninterpretable”. Vermeesch concluded that «statistical significance is not the same as geological significance».

In complete contradiction, we affirm that statistical tests, if they are used properly, are always informative.

The conclusions of Vermeesch are erroneous. The error lies in the inadmissible operation of simultaneously dividing the total sample size and the number of earthquake occurrences in each weekday by 10 (or by any other factor). This creates a biased sample through the forced equal reduction of the data in each bin. Instead, Vermeesch should have taken 10% of the original data set and then grouped it into 7 bins again. The chi-square sum is made of normalized squared deviations of the observed frequencies from the theoretical ones (all equal to 1/7 under the null hypothesis of uniform seismicity over the 7 weekdays). Dividing the sample size by a factor of 10 decreases the mean values of the squared deviations by a factor of $10^2$ and decreases the variances of the deviations by a factor of 10. Thus, their ratio is decreased by a factor of 10, as found by Vermeesch. It is essential to realize that such an operation does not correspond to an apparently harmless reduction of the sample size from 118,415 to 11,842. In particular, the ten-fold reduction of the chi-square sum S(D) is inadmissible. Indeed, if the sample is large enough, the observations are independent and identically distributed, and the null hypothesis holds, then a standard result states that the p-value practically does not depend on sample size, since the asymptotic chi-square distribution is valid for any sufficiently large sample that satisfies the condition of a minimum number of observations per bin (usually 10). When performing the correct procedure of taking 10% of the original data set and regrouping it into 7 bins, we find a p-value for the reduced sample of about $10^{-6}$, instead of 0.15 with Vermeesch's procedure. Thus, the sample reduced by a factor of 10 still rejects the null hypothesis that the occurrence of earthquakes does not depend on the day of the week.
This change of the p-value with sample size should thus signal a violation of the conditions under which the chi-square test is valid, and not that statistics lie. We proceed to investigate this issue.
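The two statements made in this paragraph can be illustrated numerically. The following is a minimal sketch, not the procedure applied to the actual catalog: the weekday counts are hypothetical (the per-day counts of the full catalog are not listed here), and the independent, uniformly distributed "events" are simulated with Python's standard random number generator. It shows (a) that dividing every bin count by 10 divides the chi-square sum by exactly 10, and (b) that for data which do satisfy the null hypothesis, the chi-square sum stays near its expected value of 6 whatever the sample size.

```python
import random
from collections import Counter

def chi_square_sum(counts):
    """Chi-square sum against a uniform distribution over the bins."""
    n = sum(counts)
    expected = n / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# (a) Scaling every bin by the same factor: hypothetical weekday counts.
counts = [1050, 980, 1010, 940, 1020, 990, 1010]
s_full = chi_square_sum(counts)
s_scaled = chi_square_sum([c / 10 for c in counts])
print(f"S = {s_full:.3f}, after dividing all bins by 10: S = {s_scaled:.3f}")

# (b) Independent events drawn uniformly over 7 weekdays (null hypothesis true).
def mean_null_s(n, repeats, rng):
    """Average chi-square sum over `repeats` simulated samples of size n."""
    total = 0.0
    for _ in range(repeats):
        tally = Counter(rng.choices(range(7), k=n))
        total += chi_square_sum([tally[d] for d in range(7)])
    return total / repeats

rng = random.Random(0)  # fixed seed so the run is repeatable
mean_large = mean_null_s(20000, 100, rng)
mean_small = mean_null_s(2000, 100, rng)
print(f"mean S under the null: n=20000 -> {mean_large:.2f}, n=2000 -> {mean_small:.2f}")
```

In part (b) both averages should fluctuate around 6, the mean of a chi-square variable with 6 degrees of freedom, which is the sense in which the p-value does not depend on sample size when the conditions of the test hold.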

In order to interpret the cause of the rejection of the null hypothesis, both for the initial sample and for the (correctly) reduced one, we need to recall that the chi-square test is based on the asymptotic distribution (as the sample size n tends to infinity) of the normalized sum of squared deviations of the observed frequencies from the theoretical ones. The chi-square test should be applied to independent, identically distributed random observations. In addition, the distribution of observations over bins should satisfy the condition that each bin contains at least 8 to 10 observations. As long as the last condition is satisfied, the p-value is almost independent of the sample size (a possible weak dependence vanishes asymptotically). Since geological causes of a heterogeneous distribution of earthquakes over the weekdays seem improbable, one can assume that some conditions of applicability of the chi-square test are violated in the earthquake catalog studied by Vermeesch (2009). We can enumerate at least five possible reasons for such a violation:

  • aftershocks;
  • so-called “swarms” of weak shocks (of vague tectonic nature);
  • artificial seismic events (quarry blasts, fluid-induced seismicity, and so on; see e.g. [Goldbach, 2009]);
  • lower background noise on weekends;
  • catalog incompleteness.

Of course, there may be other sources of non-stationarity or interdependence of events that prevent a justified use of the chi-square test.

We now remove aftershocks and other possible clusters from the catalog in order to test the null hypothesis on the remaining set of main shocks. The aim is to obtain a catalog that better satisfies one of the conditions for the application of the chi-square test, namely the independence between events. A standard procedure in seismology is to «decluster» earthquake catalogs by removing as many aftershocks as possible. Here, we apply the “aftershock cleaning” method described in detail in [Pisarenko et al., 2008]. We found 80616 aftershocks (constituting 68% of the total number of events), and 118415 − 80616 = 37799 main events. The main events are distributed as follows over the weekdays: Mon 5135; Tue 5423; Wed 5338; Thu 5615; Fri 5218; Sat 5485; Sun 5585. The chi-square sum S for the main shocks is

$$S = \sum_{k=1}^{7} \frac{\left[ n_k - \tfrac{1}{7} n \right]^2}{\tfrac{1}{7} n}, \tag{1}$$

where $n_k$ is the number of events on day $k$ of the week and $n = n_1 + n_2 + \dots + n_7$ is the total number of events. We find S = 36.19, with $p = 2.5 \times 10^{-6}$. Although the p-value for the main shocks is much larger than the p-value obtained with the aftershocks included, it is still small enough to reject the null hypothesis of a uniform distribution of events over the weekdays. While the “aftershock cleaning” method removes a substantial portion of the aftershocks, no declustering method is perfect. In addition, we have not addressed the possible effect of “swarms” of weak shocks and of explosions. These effects concern weak events, and we have little information on the nature of these events (if any). We know only that the Gutenberg-Richter law should hold (at least in the range of moderate events) for the catalog of earthquakes to be complete and representative. For this purpose, we truncate the catalog to remove weak events below a threshold that is higher than the threshold m = 4.0 of the catalog. This gets rid of possible artificial shocks and allows us to analyze the remaining events, which should satisfy the conditions for the application of the chi-square test even better.
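The value of the chi-square sum for the main shocks can be checked directly from the weekday counts quoted above. The sketch below evaluates expression (1) and converts S into a p-value, assuming 6 degrees of freedom and using the closed-form upper tail of the chi-square distribution for that case. The function names are ours.

```python
import math

def chi_square_sum(counts):
    """Expression (1): sum over bins of (n_k - n/7)^2 / (n/7) for 7 bins."""
    n = sum(counts)
    expected = n / len(counts)
    return sum((n_k - expected) ** 2 / expected for n_k in counts)

def p_value_df6(s):
    """Upper tail of the chi-square distribution with 6 degrees of freedom."""
    half = s / 2.0
    return math.exp(-half) * (1.0 + half + half ** 2 / 2.0)

# Weekday counts of main shocks as listed in the text.
main_shocks = {
    "Mon": 5135, "Tue": 5423, "Wed": 5338, "Thu": 5615,
    "Fri": 5218, "Sat": 5485, "Sun": 5585,
}

n = sum(main_shocks.values())
S = chi_square_sum(list(main_shocks.values()))
p = p_value_df6(S)

print(f"n = {n}, S = {S:.2f}, p = {p:.1e}")
```

The counts sum to 37799, and the resulting S and p agree with the values S = 36.19 and p ≈ 2.5 × 10^-6 given in the text.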

Figure S1 shows in the inset that the Gutenberg-Richter law for the distribution of earthquake magnitudes is fulfilled rather satisfactorily. Some deviations can be seen at lower magnitudes m < 5 and above m = 7.5. These features are documented extensively in the seismological literature. The histogram of earthquake magnitudes in Fig. S1 shows that the monotonic decrease of the distribution begins at 4.5, but the threshold of completeness should be somewhat larger, approximately m ≥ 5.0. Many seismologists believe that the completeness threshold for the Harvard global catalog is about m = 5.5 (since 1987) and m = 5.75 (since 1977), see e.g. [Molchan et al., 1996]. Being a bit less restrictive and selecting only earthquakes with m ≥ 5.0, but without removing the aftershocks, keeps 16308 events. The events with m ≥ 5.0 are distributed as follows over the weekdays: Mon 2374; Tue 2511; Wed 2291; Thu 2497; Fri 2153; Sat 2282; Sun 2360. The chi-square sum S given by expression (1) is S = 42.38, with $p = 1.55 \times 10^{-7}$.

Combining the declustering “aftershock cleaning” method in order to obtain approximately independent events and working with a more complete catalog with only events of magnitude m ≥ 5.0 leads us to identify 10672 aftershocks (65%), and 16308 – 10672 = 5636 main events with m ≥ 5.0. The main events with m ≥ 5.0 are distributed as follows in each weekday: Mon 780; Tue 847; Wed 793; Thu 831; Fri 785; Sat 821; Sun 779. The chi-square sum S given by expression (1) is S = 5.64, with p = 0.46. Thus, the hypothesis that “the occurrence of earthquakes does not depend on the day of the week” is not rejected for large main shocks.
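For this last case the arithmetic is short enough to write out. The worked evaluation below uses only the counts listed above; the closed-form expression for the p-value assumes 6 degrees of freedom (seven bins, no fitted parameters).

```latex
\begin{align*}
n &= 780 + 847 + 793 + 831 + 785 + 821 + 779 = 5636,
  \qquad \tfrac{1}{7}\,n \approx 805.14, \\[4pt]
\sum_{k=1}^{7} \left[ n_k - \tfrac{1}{7}\,n \right]^2
  &= \sum_{k=1}^{7} n_k^2 - \tfrac{1}{7}\,n^2
   \approx 4\,542\,326 - 4\,537\,785.14 = 4540.86, \\[4pt]
S &= \frac{4540.86}{805.14} \approx 5.64, \\[4pt]
p &= e^{-S/2} \left( 1 + \frac{S}{2} + \frac{S^2}{8} \right)
   \approx 0.0596 \times 7.796 \approx 0.46 .
\end{align*}
```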

We can thus affirm that the main earthquake shocks with m ≥ 5.0 are distributed uniformly over the seven weekdays, as expected from “seismological intuition.” We obtained this result, which corrects the conclusion of Vermeesch (2009), by taking into account the two most important properties of earthquake catalogs: the presence of numerous aftershocks and the problem of catalog incompleteness [Kagan, 2003].

The smaller earthquakes (main shocks included) might be distributed unevenly during the week, but elucidating the origin and nature of this phenomenon requires much more information about explosions, swarms of weak events, and so on [Atef et al., 2009].

When properly used and interpreted, statistical tests always reveal useful information. Ridiculously small p-values such as those found by Vermeesch (2009) should lead one to question, one by one, all the assumptions that underpin the statistical test being used. The null hypothesis ought to account for the facts that earthquakes are correlated and that the data carry uncertainties. A full model is needed to calculate appropriate significance levels correctly. Such an approach has been pursued for many years by professional statistical seismologists [Keilis-Borok and Soloviev, 2003; Schorlemmer et al., 2007; 2009].

Mathematics is not wrong; only its incorrect interpretation may lead to confusion and paradoxes.

We are thankful to Pieter Vermeesch for providing the catalog in question, which allowed exact one-to-one comparisons, and to Max Werner for useful discussions.

Bibliography

Atef, A.H., Liu, K.H., and S. S. Gao (2009), Apparent Weekly and Daily Earthquake Periodicities in the Western United States, Bulletin of the Seismological Society of America 99 (4), 2273-2279.

Kagan, Y.Y. (2003), Accuracy of modern global earthquake catalogs, Phys. Earth Planet. Inter. 135 (2-3), 173-209.

Keilis-Borok, V. I., and Soloviev, A. A. eds. (2003), Nonlinear Dynamics of the Lithosphere and Earthquake Prediction, Springer-Verlag, Heidelberg.

Short-Term Properties of Earthquake Catalogs and Models of Earthquake Source, Bulletin of the Seismological Society of America 94 (4), 1207-1228 (2004).

Molchan et al. (1996), Seismic risk oriented multiscale seismicity model: Italy, Computational Seismology, Iss. 28, pp. 193-224 (in Russian). See as well the English translation: Computational Seismology and Geodynamics, D.C.: American Geophysical Union, 1999, 200p.

Pisarenko V.F., Sornette A., Sornette D., and M.V. Rodkin (2008) New Approach to the Characterization of Mmax and of the Tail of the Distribution of Earthquake Magnitudes, Pure and Applied Geophysics 165, 847-888.

Schorlemmer, D., M. C. Gerstenberger, S. Wiemer, D. D. Jackson, and D. A. Rhoades (2007), Earthquake Likelihood Model Testing, Seismological Research Letters 78 (1), 17-29.

Schorlemmer, D., J. D. Zechar, M. Werner, D. D. Jackson, E. H. Field, T. H. Jordan, and the RELM Working Group (2009), First results of the Regional Earthquake Likelihood Models Experiment, Pure and Applied Geophysics, submitted (e-print at http://www.cseptesting.org/sites/default/files/zechar2007.pdf)

Vermeesch, P. (2009), Lies, Damned lies and statistics (in geology), EOS 90 (47), 24 Nov 2009, p. 443.

Vermeesch, P. (2011), Statistical significance does not equal geological significance: Reply to comments on "Lies, damned lies, and statistics (in geology)," Eos Trans. AGU, 92(8), 66, doi:10.1029/2011EO080013.

Goldbach, O.D. (2009), Flood-induced seismicity in mines, 11th SAGA Biennial Techn. Meeting and Exhibition, Swaziland, 16-18 Sept. 2009, pp. 391-401. See in particular Fig. 15 of this paper, which shows an uneven distribution of weak seismic shocks over the weekdays due to flood-induced seismicity.


Fig. S1: Histogram approximation of the distribution of earthquake magnitudes for all 118,415 events of magnitude 4 or greater occurring between Friday, 1 January 1999 and Thursday, 1 January 2009 (USGS, http://earthquake.usgs.gov). Inset: Decimal logarithm of the empirical complementary cumulative distribution function 1 – F(m) of earthquake magnitude.

Postscript: In his reply, Vermeesch [2011] presents Table 1 with p values for different sample sizes (n to n/20) extracted from the earthquake data set. As the p values exhibit a strong dependence on sample sizes, he concludes that this "strongly contradicts" our claim that the p value should not depend on sample size. Nothing can be further from the truth. The strong dependence of the p value on sample size shown in his Table 1 confirms that at least one of the conditions for the validity of the chi-square test is violated in the earthquake data set.
