Supplementary material to “A Closer Look at Data Independence: Comment on 'Lies, Damned Lies, and Statistics (in Geology)'”
22 February 2011
Sergey Kravtsov and Rolando Olivas Saunders, University of Wisconsin-Milwaukee
Citation:
Kravtsov, S., and R. O. Saunders (2011), A closer look at data independence: Comment on “Lies, damned lies, and statistics (in geology),” Eos Trans. AGU, 92(8), 65, doi:10.1029/2011EO080011.
In an Eos Forum article (90(47), 24 November 2009), Pieter Vermeesch suggested that statistical tests are not fit to interpret long data records. He makes it sound as if, for large enough data sets, any true null hypothesis will always be rejected because of an alleged oversensitivity of statistical tests to the sample size. This is certainly not the case! In the present note, we revisit the author's example of the weekly distribution of earthquakes and show that statistical results do support the common-sense assertion that seismic activity does not depend on the day of the week. The data set and the MATLAB script that details these analyses can be downloaded from Sergey Kravtsov's website.
We start processing the earthquake data (see footnote 1; earthquakes of magnitude 4 and greater occurring between 01/1999 and 01/2009) by computing the time series of the number of earthquakes for each day in the ten-year record; this gives rise to the overall distribution of daily earthquake occurrences shown in Fig. S1. Randomly shuffling the days in the daily earthquake occurrence time series allows one to compute synthetic histograms of cumulative earthquake occurrences tallied by weekday, analogous to the observed histogram based on the non-shuffled original data [Vermeesch, 2009]. Both the a priori and the a posteriori 95% confidence intervals based on the above procedure (test 1 in Table 1) contain the range of the observed cumulative daily earthquake occurrences (from 16349 occurrences on Friday to 17752 on Sunday); hence, the null hypothesis that the earthquake occurrences do not depend on weekday cannot be rejected, as expected. The a posteriori range is wider than the a priori one, since it corresponds to the probability that the whole seven-day histogram is within this a posteriori range; the latter range is thus given by the 0.95^(1/7) ≈ 0.99 a priori levels (see footnote 2). There is no geophysical reason to believe that the earthquakes behave differently during any particular day of the week. Therefore, one has to use the a posteriori levels to address the statistical significance of the observed spread of daily earthquake occurrences.
1 http://neic.usgs.gov/neis/epic/epic_global.html
2 The a posteriori range is numerically determined from 1000 synthetic 7-valued cumulative daily earthquake occurrence histograms by computing the 2.5th percentile of the minimum histogram values and the 97.5th percentile of the maximum histogram values, while the a priori ranges are effectively the 2.5th and 97.5th percentiles of 7000 cumulative synthetic daily earthquake occurrences (1000 synthetic realizations for each of the 7 days).
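The shuffling procedure of test 1 and the percentile definitions in footnote 2 can be sketched as follows. This is a minimal Python illustration, not the authors' MATLAB script; the function names, the linear-interpolation percentile, and the convention that day i falls on weekday i mod 7 are assumptions made for the example.

```python
import random

def weekday_totals(daily_counts):
    """Sum daily counts by weekday, assuming day i falls on weekday i % 7."""
    totals = [0] * 7
    for i, count in enumerate(daily_counts):
        totals[i % 7] += count
    return totals

def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def shuffle_ranges(daily_counts, n_synth=1000, seed=0):
    """Return (a_priori, a_posteriori) 95% ranges of weekday totals.

    Each synthetic histogram is built from a random permutation of the days.
    A priori: 2.5th and 97.5th percentiles of all pooled weekday totals.
    A posteriori: 2.5th percentile of the histogram minima and
    97.5th percentile of the histogram maxima.
    """
    rng = random.Random(seed)
    counts = list(daily_counts)
    pooled, minima, maxima = [], [], []
    for _ in range(n_synth):
        rng.shuffle(counts)
        totals = weekday_totals(counts)
        pooled.extend(totals)
        minima.append(min(totals))
        maxima.append(max(totals))
    a_priori = (percentile(pooled, 2.5), percentile(pooled, 97.5))
    a_posteriori = (percentile(minima, 2.5), percentile(maxima, 97.5))
    return a_priori, a_posteriori
```

With the actual record, `daily_counts` would hold the number of events on each day of the ten-year period, and the observed weekday totals would then be compared against the returned a posteriori range.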
Analogous results are obtained with bootstrap re-sampling (test 2 in Table 1), which differs from the above reshuffling in that certain days can be sampled more than once and others not sampled at all. They are also obtained “semi-theoretically” (test 3 in Table 1) by using the observed mean [~32 events/day] and variance [~315.5 (events/day)²] of the daily earthquake occurrences, together with the fact that the sum of many independent random variables tends to be Gaussian distributed, with the mean and variance equal to the sum of the individual means and variances, respectively.
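Tests 2 and 3 can be illustrated in the same spirit. The sketch below is an assumption-laden example rather than the authors' code: `bootstrap_weekday_totals` draws days with replacement, and `gaussian_ranges` treats each weekday total as a Gaussian sum of `n_days_per_weekday` daily counts, taking the a posteriori range as the central interval with coverage 0.95^(1/7). Both function names and the parameter `n_days_per_weekday` are hypothetical.

```python
import random
from statistics import NormalDist

def bootstrap_weekday_totals(daily_counts, rng):
    """One bootstrap histogram: days drawn with replacement, tallied by weekday."""
    resampled = rng.choices(daily_counts, k=len(daily_counts))
    totals = [0] * 7
    for i, count in enumerate(resampled):
        totals[i % 7] += count
    return totals

def gaussian_ranges(mean_per_day, var_per_day, n_days_per_weekday,
                    level=0.95, n_bins=7):
    """Return (a_priori, a_posteriori) ranges for a weekday total.

    The total is modeled as Gaussian with mean and variance equal to the
    sums of the daily means and variances.
    """
    mu = n_days_per_weekday * mean_per_day
    sigma = (n_days_per_weekday * var_per_day) ** 0.5

    def central(coverage):
        # Two-sided interval that contains the given probability mass.
        z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
        return (mu - z * sigma, mu + z * sigma)

    return central(level), central(level ** (1.0 / n_bins))
```

For the record discussed here one would call, for instance, `gaussian_ranges(32.0, 315.5, 522)`, where 522 is an assumed number of occurrences of each weekday in roughly ten years.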
Why do the above tests produce results different from those of the chi-square testing [Vermeesch, 2009]? We argue that the effective number of degrees of freedom (independent observations) N* in the earthquake record is less than the total number of earthquakes N = 118,414. Indeed, creating a surrogate daily earthquake time series by randomly shuffling the observed time intervals between consecutive earthquakes (which by construction produces independent earthquake occurrences) results in a nearly Gaussian distribution of daily earthquake occurrences with the same mean, but a substantially reduced variance of 41.3 (events/day)² (Fig. S1a, dashed line). The low-pass filtered time series of the actual and surrogate daily earthquake occurrences (Fig. S1b) have remarkably different characters, with large-magnitude low-frequency variations in the observed time series apparently caused by serial correlations among the time intervals between consecutive earthquakes (so that consecutive short or consecutive long intervals tend to cluster), which is a typical indication of statistical dependence. The periods of clustering probably correspond to aftershock sequences of strong earthquakes against the background of normal seismic activity.
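The surrogate construction described above can be sketched as follows, assuming the catalog is available as a list of event times measured in days from the start of the record. The function names are hypothetical, and the sketch only reproduces the idea (shuffle the inter-event intervals, rebuild the event times, re-bin by day), not the authors' implementation.

```python
import random
from itertools import accumulate
from statistics import pvariance

def daily_counts(event_times, n_days):
    """Count events per day; times are in days, and day d covers [d, d + 1)."""
    counts = [0] * n_days
    for t in event_times:
        d = int(t)
        if 0 <= d < n_days:
            counts[d] += 1
    return counts

def interval_shuffled_surrogate(event_times, seed=0):
    """Surrogate event times obtained by shuffling the inter-event intervals.

    The first event time, the number of events, and the total time span are
    preserved; only the order of the intervals is randomized.
    """
    times = sorted(event_times)
    if len(times) < 2:
        return times
    gaps = [b - a for a, b in zip(times, times[1:])]
    random.Random(seed).shuffle(gaps)
    return list(accumulate(gaps, initial=times[0]))
```

Comparing `pvariance(daily_counts(times, n_days))` with the same quantity for the surrogate gives the kind of variance contrast quoted in the text; clustered intervals inflate the variance of the original series relative to the shuffled one.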
Rigorous computation of the number of degrees of freedom N* in the earthquake time series under consideration is beyond the scope of the present comment. Instead, we estimate N* by requiring that the 95% a priori and a posteriori confidence ranges of daily earthquake occurrences, estimated using the binomial distribution with the probability of success p = 1/7 and N* independent trials, match the results of the non-parametric tests in Table 1. Note that these binomial ranges have to be rescaled by N/N* to obtain the actual accumulated values of weekday earthquake occurrences. The binomial ranges for N* = N/10 (test 4 in Table 1) match the non-parametric ranges fairly well. Naturally, the chi-square test for this value of N* once again fails to reject the null hypothesis of uniform earthquake occurrences throughout the week: χ²* < χ²(95).
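A rough numerical version of the binomial estimate is given below. It is a sketch under two stated assumptions: the binomial distribution is replaced by its normal approximation (reasonable for N* of order 10^4), and the a posteriori range is taken as the central interval with coverage 0.95^(1/n_bins). The function name and its parameters are hypothetical.

```python
from statistics import NormalDist

def binomial_ranges(n_total, n_eff, n_bins, level=0.95):
    """Return (a_priori, a_posteriori) ranges of per-bin counts.

    n_total : total number of events N
    n_eff   : effective number of independent events N*
    n_bins  : number of bins (7 weekdays, 12 two-hour slots, ...)
    Ranges are computed for n_eff trials with p = 1 / n_bins and then
    rescaled by N / N* to the scale of the accumulated counts.
    """
    p = 1.0 / n_bins
    mean = n_eff * p
    sd = (n_eff * p * (1.0 - p)) ** 0.5
    scale = n_total / n_eff

    def central(coverage):
        # Normal approximation to the binomial percentiles (assumption).
        z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
        return (scale * (mean - z * sd), scale * (mean + z * sd))

    return central(level), central(level ** (1.0 / n_bins))
```

For example, `binomial_ranges(118414, 118414 / 10, 7)` returns an a posteriori range that contains the observed weekday totals of 16349 to 17752. With `n_bins=12`, the same call lands within a few tens of events of the 9020–10740 range quoted for the bi-hourly case; the small difference presumably reflects the normal approximation and rounding.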
Similar results are obtained for bi-hourly binned earthquakes. The number of degrees of freedom (independent events) N* in a data set is a fundamental property of the data and does not depend on the way the data set is binned or sub-sampled. For example, more frequent sampling of a continuous time series characterized by a typical time scale much longer than the sampling interval does not introduce additional independent observations. Hence, the expected “binomial” 95% range of the bi-hourly earthquake occurrences (p = 1/12) can be computed in the same way as the range of daily earthquake occurrences above, using the identical N* estimated previously. This range (9020–10740), once again, contains the observed range of the cumulative bi-hourly earthquake occurrences (9284–10336). Not surprisingly, the effective chi-square value χ²* of the bi-hourly earthquake histogram is well within the expected confidence interval: χ²* ≈ 11.1 < χ²(95) ≈ 19.7, meaning that the chi-square test also fails to reject the null hypothesis of a uniform earthquake distribution throughout the day.
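The chi-square comparison can be sketched as follows. Two assumptions are made here and are not spelled out in the text: the effective statistic χ²* is taken to be the ordinary Pearson statistic computed on counts scaled by N*/N (which equals the raw statistic multiplied by N*/N), and the 95% critical value is obtained from the Wilson-Hilferty approximation rather than from exact chi-square tables. Function names are hypothetical.

```python
from statistics import NormalDist

def pearson_chi_square(observed):
    """Pearson statistic against a uniform expectation over the bins."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

def effective_chi_square(observed, n_eff_ratio):
    """Statistic for counts scaled by N*/N (assumed definition of chi2*).

    Scaling every count by a factor r scales the Pearson statistic by r.
    """
    return n_eff_ratio * pearson_chi_square(observed)

def chi_square_critical(dof, level=0.95):
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(level)
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * a ** 0.5) ** 3
```

For 12 bi-hourly bins there are 11 degrees of freedom, and `chi_square_critical(11)` is close to the value of about 19.7 quoted above; the null hypothesis is retained whenever `effective_chi_square(counts, 0.1)` falls below it.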
In summary, while large databases allow one to establish the statistical significance of small-magnitude phenomena, care should be taken to ensure that the implicit assumptions underlying the majority of statistical tests, such as data independence and, in some cases, stationarity of the data, are satisfied. Failing to do so may result in false rejection of correct null hypotheses.
SERGEY KRAVTSOV, ROLANDO OLIVAS SAUNDERS, University of Wisconsin-Milwaukee, Milwaukee, WI. E-mail: kravtsov@uwm.edu

Table 1. The observed and 95%-confidence ranges for the number of earthquakes that occurred, globally, on a certain day of the week during 1999–2009. Note that the observed range is within the 95% a posteriori confidence interval for each test.

Fig. S1. (a) Estimated probability density function (PDF) of daily earthquake occurrences for the actual data (solid line) and its surrogate realizations obtained by randomly shuffling the time intervals between the consecutive earthquakes (dashed line). (b) 100-day boxcar running mean of the daily earthquake occurrences for the actual data and a particular realization of the surrogate data. The variance of the actual earthquake daily occurrences is 315.5 (events/day)², while that for the surrogate data is 41.3 (events/day)².
