next up previous
Next: Time Series Analysis Up: Recent advances in nonparametric Previous: An Example of

Frequency Estimation

Hydrologists are often concerned with estimating the relative frequencies of rare floods or low flows. Given a univariate data set , i=1..., with the x presumed to be independent and identically distributed, the goal is to estimate the probability of exceedance of some x*. The estimation problem is difficult, since the interest is usually in frequencies that lie outside the range of available data; prior guidance as to the nature of the extrapolation needed is not available; and the ability to test the proposed model is limited. There has been a profusion of competing methodologies that can give markedly different results, attempts to legislate procedures, and unabashed advocacy of certain methods. Given this background, it is not surprising that nonparametric flood frequency estimation has been successfully introduced.

In practice, frequency estimation should (1) be robust with respect to model choice and outliers, (2) be consistent (i.e., converge to the correct law) in application across different sites, (3) be flexible enough to deal with the variety of data observed, (4) be efficient (get the maximum information from the sample), and (5) be appropriate for describing tail behavior that may depart from what fits best for the bulk of the data. The traditional parametric flood frequency estimators may not meet these criteria. Given that the set of candidate parametric models is large, the power of tests to choose between them is low [ Vogel and McMartin, 1991], evidence that annual floods may correspond to a mixture of generating mechanisms [ Webb and Betancourt, 1992], the ``best fit'' parametric model often departs from the empirical distribution function in the upper tail [ Adamowski, 1985], and the stage is set for looking at a nonparametric alternative.

Key historical developments are outlined below.

Fall 1983 AGU meeting: S. Yakowitz and K. Adamowski independently introduce kernel estimators for flood frequency analysis.

Adamowski [1985] and Adamowski and Labatiuk
[4] [1987
] propose a fixed bandwidth kernel estimator for the probability density function (p.d.f.) of annual floods, and illustrated its use with rectangular and Cauchy kernels. The bandwidth was chosen (method AC) by minimizing the sum of squared differences between the empirical cumulative distribution function (c.d.f.) estimated using a plotting position formula and the kernel estimate of the c.d.f. Quantiles x corresponding to an exceedance probability p were interpolated from the estimated c.d.f. Monte Carlo performance is shown to be comparable to that of selected parametric estimators.

Noting that kernel estimators do not directly allow incorporation of prior information, and may be inadmissible for flood frequency analysis since they put a negligible probability on flood values larger than the recorded maximum, Schuster and Yakowitz [1985]
[4] developed a nonparametric Bayes estimator defined
[4] through a weighted combination of a fixed bandwidth kernel density estimate (k.d.e.) and a parametric (or other site k.d.e. to include regional information) estimate. This is an important building block. The optimal weight and bandwidth were chosen by censored maximum likelihood cross validation (C-MLCV). The MLCV score function is evaluated only in the interior of the data, to avoid inconsistent density estimates with long tailed data (see Schuster [1985]). The tail behavior and convergence rate of the parametric component of the mixture is inherited in this semi-parametric approach. C-MLCV could also be used to choose better parametric models in the mixture. Monte Carlo results presented demonstrate successful use of C-MLCV for parameterization and improvement over the raw k.d.e.

Bardsley [1988; 1989] proposes the use of a Gumbel kernel for the estimation of the c.d.f., with a different bandwidth at each ranked flood, given by half the distance between that flood and the next smallest one. These bandwidths adapt automatically to the tails and modes of the data, but are highly variable. The Gumbel kernel is asymmetric. As a result, the upper tail ``sees'' all the data with lower weights for the smaller data values, while the lower tail is determined only by the observations smaller than the point of estimate.

Adamowski [1989] recognized the need for bandwidth variation (to deal with skewed data, and to improve extrapolation of the k.d.e) and adapted the Breiman et al. [1977] proposal (VK). The bandwidth at any data point is taken proportional to the distance to the nearest neighbor of that point. The number of nearest neighbors, k, is chosen using a scaling argument of Breiman et al. [1977]. The constant of proportionality is chosen by MLCV. Monte Carlo performance of the VK-MLCV estimates was comparable to some parametric estimators.

Wu and Woo [1989] propose a Fourier Series approximation of the p.d.f. with the number of terms chosen by a criteria similar to AC. This corresponds [ Devroye and Györfi, 1985] to the use of a trigonometric kernel, and works best for bounded data.

Bardsley [1989], Adamowski and Feluch [1990] and Guo [1991] present similar ideas for the inclusion of historical (paleo) information. A censoring threshold is presumed, and estimators for exceedances above threshold in the systematic and in the historical record are developed and combined. Bardsley [1989] extends his earlier estimator of the distribution function. Adamowski and Feluch [1990] work with the k.d.e with bandwidth estimated by Least Squares Cross Validation (LSCV) of the p.d.f. Guo [1991] extends the VK framework of Adamowski [1989] directly to this setting, with the MLCV function reformulated to recognize the joint likelihood of the historical and systematic data. Monte Carlo investigations into the value of additional information are presented by Adamowski and Feluch [1990].

In closely related papers, Gingras and Adamowski [1992; 1993; 1994, to appear], Gingras et al. [1994], and Alila et al. [1993] investigate the possibility of using the number of modes in the k.d.e. of annual floods at a site, and L-moments to identify homogeneous regions for flood estimation. The number of modes is used as a surrogate for the number of active flood mechanisms. Evidence of consistency between this statistical classification and physical mechanisms as well as geographical site distribution is presented. Regional regression equations based on the identified subregions lead to lower MSE's than those fitted using the whole region. While an interesting approach, it is problematic. Mixtures may still lead to a unimodal density. It is easy to see or obscure modes in a k.d.e. if the bandwidth chosen is not optimal. The LSCV/MLCV bandwidth selectors used are notorious for high variability in h, and undersmoothing [ Chiu, 1990]. The sample sizes available are too small for reliable hypothesis testing on the number of modes. Often the modes shown correspond to single data values, and reflect lack of adaptation of the bandwidth to sparsity. The role of changing sample size across sites is not examined. An improved cluster analysis and hypothesis testing approach [ Minnotte and Scott, 1993; Silverman, 1981] may address these concerns.

Lall et al. [1993a] investigated kernel choice, bandwidth selection and flood frequency analysis performance. A particular focus was whether one should estimate the p.d.f., or directly seek to estimate the c.d.f. Monte Carlo investigations into the performance of a number of kernels, bandwidth selectors, and bandwidth adaptation schemes relative to parametric estimators that are correctly specified, mildly mis-specified and grossly mis-specified are undertaken for a variety of parent populations, for 10 to 200 year floods, using 20 to 100 data points. Smoothing the p.d.f. and the c.d.f. are different problems. The c.d.f. corresponding to the optimal p.d.f. estimate is often quite different from the optimal c.d.f. estimator. The Cauchy (C) kernel, with VK and bandwidth and k chosen by AC was found to be superior for flood frequency analysis. It interpolates the empirical c.d.f. in the interior of the data, and allows reasonable extrapolation and smoothness in the tails (because of the increased bandwidth and the heavy tail of the Cauchy kernel). The VK-C-AC estimate is nearly useless for the p.d.f. On the other hand p.d.f. estimates using LSCV or MLCV are reasonable but may underperform the empirical c.d.f. for flood frequency analysis. The VK-C-AC estimates had larger Root Mean Square Error (RMSE), but comparable or lower bias than the correct parametric estimator, comparable performance under mild mis-specification of the parametric density and much better performance for mixture densities. Moon [1991] investigated the proposals by Guo [1991], Adamowski and Feluch [1990], and Schuster and Yakowitz [1985] for improving k.d.e.'s semiparametrically or by including historical information and found that while the resulting quantile estimates were improved relative to those from the raw k.d.e., they were still inferior to VK-C-AC. The importance of smoothing the c.d.f. directly is emphasized.

Moon et al. [1993] focused directly on estimation of the upper tail, and compared the performance of VK-C-AC to selected estimators of upper tail statistics. These estimators include the Extreme Value distribution, Exponential tail (ET) and Quadratic tail (QT) methods of Breiman and Stone [1985], and Pareto tail models of Hill [1975] (PT1) and Hosking and Wallis [1987] (PT2). Other methods due to Harrell and Davis [1982] and Kaigh and Lachenbruch [1982] were also investigated and dropped due to poor performance. These tail estimators assume that the generating mechanism for floods has a tail behavior that belongs to the ``domain of attraction'' of some parametric family (e.g., Exponential or Pareto) and seek to fit such a model to a selected number of upper order statistics. The Breiman models consider a Taylor series approximation of x about log(p), where p is the exceedance probability. A linear (ET) and a quadratic (QT) model are fit for x in terms of log(p), using the m upper order statistics x i=1 . . m. The Pareto tail models consider a power law dependence in the tails, with PT2 fitting a Generalized Pareto Distribution to the data. Monte Carlo results from VK-C-AC and QT were comparable and superior to the others.

Moon and Lall [1994] considered the direct kernel estimation (KQ) of the quantile function with special boundary kernels to extrapolate beyond the largest recorded flow. With the exception of the tail estimators discussed thus far, x was recovered from an estimate of the c.d.f. A kernel regression estimator due to Gasser and Müller [1979] is used to relate x and p. A modified version of a ``plug in'' estimator [ Gasser et al., 1991] is used to select a bandwidth that minimizes the Mean Integrated Square Error, and provides a convergence rate of O(n ) compared with O(n ) for cross validation. Bootstrap confidence limits and empirical sampling distributions for the quantile estimate are developed. Monte Carlo performance is found to be similar to that of QT and superior to the other tail estimators.

Vogel and Fennessey [1994] investigated nonparametric estimation of annual flow duration curves using daily flow data. They compared the use of the empirical quantile function, a linear interpolation of the empirical quantile function [ Parzen, 1979], and the Harrell and Davis [1982] estimator that forms a linear combination of all order statistics, with weights that depend on the quantile to be estimated and the order statistics. They find that the three quantile estimates are comparable in this context, and recommend Harrell and Davis [1982] because of its lower RMSE and higher smoothness. A nice discussion of how to interpret flow duration curves, and develop empirical confidence limits (sampling distributions of quantiles) is presented.

Nonparametric frequency analysis has evolved since the pioneering work of Adamowski and Yakowitz. With the recent, rapid growth in the related statistical literature, further improvements are to be anticipated. We can expect better parameterization, improved Bayesian estimators for regionalization, historical information or other prior beliefs, and consideration of multivariate extremes. On the other hand, one has to keep in mind the inherent futility of estimating extremes from small samples, particularly noting that both climatic and land surface factors controlling floods are nonstationary. Recognition and consideration of these factors in flood frequency analysis needs to evolve. Some preliminary ideas that may aid such thinking are presented in Brillinger [1994].

At this point viable nonparametric estimators are available. This author is partial to KQ and QT for at site application. Given their flexibility and adaptability to a variety of data, automated application across sites can be recommended. Their RMSE is typically higher than would result from a parametric estimator based on the true distribution. However, as the complexity of the parametric model increases, the better nonparametric methods become competitive even on this criteria. The larger, nonparametric quantile error bounds reflect the uncertainty induced by model choice. A formal study that uses real data to compare nonparametric methods with a procedure that chooses (site by site) between candidate parametric models in a split sample or jackknife mode may help resolve this issue.



next up previous
Next: Time Series Analysis Up: Recent advances in nonparametric Previous: An Example of



U.S. National Report to IUGG, 1991-1994
Rev. Geophys. Vol. 33 Suppl., © 1995 American Geophysical Union