Hydrologists are often concerned with estimating the relative frequencies of
rare floods or low flows. Given a univariate data set
, i=1...
,
with the x
presumed to be
independent and identically distributed, the goal is to estimate the probability
of exceedance of some x*. The estimation problem is difficult, since the
interest is usually in frequencies that lie outside the range of available data;
prior guidance as to the nature of the extrapolation needed is not available;
and the ability to test the proposed model is limited. There has been a
profusion of competing methodologies that can give markedly different results,
attempts to legislate procedures, and unabashed advocacy of certain methods.
Given this background, it is not surprising that nonparametric flood frequency
estimation has been successfully introduced.
In practice, frequency estimation should (1) be robust with respect to model choice and outliers, (2) be consistent (i.e., converge to the correct law) in application across different sites, (3) be flexible enough to deal with the variety of data observed, (4) be efficient (get the maximum information from the sample), and (5) be appropriate for describing tail behavior that may depart from what fits best for the bulk of the data. The traditional parametric flood frequency estimators may not meet these criteria. Given that the set of candidate parametric models is large, the power of tests to choose between them is low [ Vogel and McMartin, 1991], evidence that annual floods may correspond to a mixture of generating mechanisms [ Webb and Betancourt, 1992], the ``best fit'' parametric model often departs from the empirical distribution function in the upper tail [ Adamowski, 1985], and the stage is set for looking at a nonparametric alternative.
Key historical developments are outlined below.
Fall 1983 AGU meeting: S. Yakowitz and K. Adamowski
independently introduce kernel estimators for flood frequency analysis.
Adamowski [1985] and Adamowski and Labatiuk
[4]
[1987] propose a fixed bandwidth kernel estimator for the probability density
function (p.d.f.) of annual floods, and illustrated its use with rectangular and Cauchy
kernels. The bandwidth was chosen (method AC) by minimizing the sum of squared
differences between the empirical cumulative distribution function (c.d.f.)
estimated using a plotting position formula and the kernel estimate of the
c.d.f. Quantiles x
corresponding to an exceedance
probability p were interpolated from the estimated c.d.f. Monte
Carlo performance is shown to be comparable to that of selected
parametric estimators.
Noting that kernel estimators do not directly allow incorporation
of prior information, and may be inadmissible for flood frequency analysis
since they put a negligible probability on flood values larger than the
recorded maximum, Schuster and Yakowitz [1985]
[4]
developed a nonparametric Bayes estimator defined
[4]
through a weighted combination of a fixed
bandwidth kernel density estimate (k.d.e.) and a parametric (or other site
k.d.e. to include regional information) estimate. This is an important building
block. The optimal weight and bandwidth were chosen by censored maximum
likelihood cross validation (C-MLCV). The MLCV score function is evaluated only
in the interior of the data, to avoid inconsistent density estimates with long
tailed data (see Schuster [1985]). The tail behavior and convergence
rate of the parametric component of the mixture is inherited in this
semi-parametric approach. C-MLCV could also be used to choose better parametric
models in the mixture. Monte Carlo results presented demonstrate successful use
of C-MLCV for parameterization and improvement over the raw k.d.e.
Bardsley [1988; 1989] proposes the use of a Gumbel kernel for
the estimation of the c.d.f., with a different bandwidth at each ranked flood,
given by half the distance between that flood and the next smallest one. These
bandwidths adapt automatically to the tails and modes of the data, but are
highly variable. The Gumbel kernel is asymmetric. As a result, the upper tail
``sees'' all the data with lower weights for the smaller data values, while the
lower tail is determined only by the observations smaller than the point of
estimate.
Adamowski [1989] recognized the need for bandwidth variation (to
deal with skewed data, and to improve extrapolation of the k.d.e) and adapted
the Breiman et al. [1977] proposal (VK). The bandwidth at any data point
is taken proportional to the distance to the
nearest neighbor of that
point. The number of nearest neighbors, k, is chosen using a scaling argument
of Breiman et al. [1977]. The constant of proportionality is chosen by
MLCV. Monte Carlo performance of the VK-MLCV estimates was comparable to some
parametric estimators.
Wu and Woo [1989] propose a Fourier Series approximation of the
p.d.f. with the number of terms chosen by a criteria similar to AC. This
corresponds [ Devroye and Györfi, 1985] to the use of a
trigonometric kernel, and works best for bounded data.
Bardsley [1989], Adamowski and Feluch [1990]
and Guo [1991] present similar ideas for the inclusion of
historical (paleo) information. A censoring threshold is presumed,
and estimators for exceedances above threshold in the systematic and
in the historical record are developed and combined. Bardsley
[1989] extends his earlier estimator of the distribution function.
Adamowski and Feluch [1990] work with the k.d.e with bandwidth
estimated by Least Squares Cross Validation (LSCV) of the p.d.f.
Guo [1991] extends the VK framework of Adamowski [1989]
directly to this setting, with the MLCV function reformulated to recognize the
joint likelihood of the historical and systematic data. Monte Carlo
investigations into the value of additional information are presented by
Adamowski and Feluch [1990].
In closely related papers, Gingras and Adamowski
[1992; 1993; 1994, to appear], Gingras et al. [1994], and Alila
et al. [1993] investigate the possibility of using the number of modes in the
k.d.e. of annual floods at a site, and L-moments to identify homogeneous regions
for flood estimation. The number of modes is used as a surrogate for the number
of active flood mechanisms. Evidence of consistency between this statistical
classification and physical mechanisms as well as geographical site distribution
is presented. Regional regression equations based on the identified subregions
lead to lower MSE's than those fitted using the whole region. While an
interesting approach, it is problematic. Mixtures may still lead to a unimodal
density. It is easy to see or obscure modes in a k.d.e. if the bandwidth chosen
is not optimal. The LSCV/MLCV bandwidth selectors used are notorious for high
variability in h, and undersmoothing [ Chiu, 1990]. The sample sizes
available are too small for reliable hypothesis testing on the number of modes.
Often the modes shown correspond to single data values, and reflect lack of
adaptation of the bandwidth to sparsity. The role of changing sample size across
sites is not examined. An improved cluster analysis and hypothesis testing
approach [ Minnotte and Scott, 1993; Silverman, 1981] may
address these concerns.
Lall et al. [1993a] investigated kernel choice, bandwidth
selection and flood frequency analysis performance. A particular focus was
whether one should estimate the p.d.f., or directly seek to estimate the c.d.f.
Monte Carlo investigations into the performance of a number of kernels,
bandwidth selectors, and bandwidth adaptation schemes relative to parametric
estimators that are correctly specified, mildly mis-specified and grossly
mis-specified are undertaken for a variety of parent populations, for 10 to 200
year floods, using 20 to 100 data points. Smoothing the p.d.f. and the c.d.f.
are different problems. The c.d.f. corresponding to the optimal p.d.f. estimate
is often quite different from the optimal c.d.f. estimator. The Cauchy (C)
kernel, with VK and bandwidth and k chosen by AC was found to be superior for
flood frequency analysis. It interpolates the empirical c.d.f. in the interior
of the data, and allows reasonable extrapolation and smoothness in the tails
(because of the increased bandwidth and the heavy tail of the Cauchy kernel).
The VK-C-AC estimate is nearly useless for the p.d.f. On the other hand p.d.f.
estimates using LSCV or MLCV are reasonable but may underperform the empirical
c.d.f. for flood frequency analysis. The VK-C-AC estimates had larger Root Mean
Square Error (RMSE), but comparable or lower bias than the correct parametric
estimator, comparable performance under mild mis-specification of the parametric
density and much better performance for mixture densities. Moon [1991]
investigated the proposals by Guo [1991], Adamowski and Feluch
[1990], and Schuster and Yakowitz [1985] for improving k.d.e.'s
semiparametrically or by including historical information and found that while
the resulting quantile estimates were improved relative to those from the raw
k.d.e., they were still inferior to VK-C-AC. The importance of smoothing the
c.d.f. directly is emphasized.
Moon et al. [1993] focused directly on estimation of the upper
tail, and compared the performance of VK-C-AC to selected estimators of upper
tail statistics. These estimators include the Extreme Value distribution,
Exponential tail (ET) and Quadratic tail (QT) methods of Breiman and
Stone [1985], and Pareto tail models of Hill [1975] (PT1) and
Hosking and Wallis [1987] (PT2). Other methods due to Harrell and Davis
[1982] and Kaigh and Lachenbruch [1982] were also investigated and
dropped due to poor performance. These tail estimators assume that the
generating mechanism for floods has a tail behavior that belongs to the
``domain of attraction'' of some parametric family (e.g., Exponential or
Pareto) and seek
to fit such a model to a selected number of upper order statistics. The Breiman
models consider a Taylor series approximation of x
about
log(p), where p is the exceedance probability. A linear (ET) and a quadratic
(QT) model are fit for x
in terms of log(p), using the m
upper
order statistics x
i=1 . . m. The Pareto tail
models consider a power law dependence in the tails, with PT2 fitting a
Generalized Pareto Distribution to the data. Monte Carlo results from
VK-C-AC and QT were comparable and superior to the others.
Moon and Lall [1994] considered the direct kernel estimation
(KQ) of the quantile function with special boundary kernels to extrapolate
beyond the largest recorded flow. With the exception of the tail estimators
discussed thus far, x
was recovered from an estimate of the
c.d.f. A kernel regression estimator due to Gasser and Müller
[1979] is used to relate x
and p. A modified version of a
``plug in'' estimator [ Gasser et al., 1991] is used to select a
bandwidth that minimizes the Mean Integrated Square Error, and provides a
convergence rate of O(n
) compared with O(n
) for cross validation. Bootstrap confidence limits and empirical
sampling distributions for the quantile estimate are developed. Monte Carlo
performance is found to be similar to that of QT and superior to the other tail
estimators.
Vogel and Fennessey [1994] investigated nonparametric estimation
of annual flow duration curves using daily flow data. They compared the use of the empirical quantile function, a linear
interpolation of the empirical quantile function [ Parzen, 1979], and
the Harrell and Davis [1982] estimator that forms a linear combination
of all order statistics, with weights that depend on the quantile to be
estimated and the order statistics. They find that the three quantile estimates
are comparable in this context, and recommend Harrell and Davis [1982]
because of its lower RMSE and higher smoothness. A nice discussion of how to
interpret flow duration curves, and develop empirical confidence limits
(sampling distributions of quantiles) is presented.
Nonparametric frequency analysis has evolved since the pioneering work of Adamowski and Yakowitz. With the recent, rapid growth in the related statistical literature, further improvements are to be anticipated. We can expect better parameterization, improved Bayesian estimators for regionalization, historical information or other prior beliefs, and consideration of multivariate extremes. On the other hand, one has to keep in mind the inherent futility of estimating extremes from small samples, particularly noting that both climatic and land surface factors controlling floods are nonstationary. Recognition and consideration of these factors in flood frequency analysis needs to evolve. Some preliminary ideas that may aid such thinking are presented in Brillinger [1994].
At this point viable nonparametric estimators are available. This author is partial to KQ and QT for at site application. Given their flexibility and adaptability to a variety of data, automated application across sites can be recommended. Their RMSE is typically higher than would result from a parametric estimator based on the true distribution. However, as the complexity of the parametric model increases, the better nonparametric methods become competitive even on this criteria. The larger, nonparametric quantile error bounds reflect the uncertainty induced by model choice. A formal study that uses real data to compare nonparametric methods with a procedure that chooses (site by site) between candidate parametric models in a split sample or jackknife mode may help resolve this issue.