Despite their virtual dominance of the research literature, linear Autoregressive Moving Average (ARMA) models for hydrologic time series forecasting and simulation have gained limited acceptance with practitioners. Simple resampling schemes, such as the index sequential method [Kendall and Dracup, 1991], may be preferred. The ARMA framework has been successful for annual and perhaps monthly flows, largely because the ``structure'' and predictability of flows are lost at such lags. ARMA models are hard to justify when daily flows are of interest. They cannot easily model the persistence in such flows while also responding to the sudden bursts in hydrographs that follow a storm and the subsequent gradual decay of the hydrograph. Recognition of such factors motivated the nonparametric, Markovian thinking described in Yakowitz [1973, 1979a, 1979b, 1985a, 1985b]. Much of the subsequent nonparametric time series literature draws on the concepts developed in these papers.
In these papers, Yakowitz considers a finite order, continuous
parameter Markov chain as an appropriate model for hydrologic time series. He
observes that discretization of the state space can quickly lead to an
unmanageable number of parameters (the curse of dimensionality) or poor
approximation of the transition functions, while the ARMA approximations to such
a process call for restrictive distributional and structural assumptions. The
problem is cast in a general setting with a variety of measures (e.g.,
conditional probability of threshold crossings, or one step conditional
distribution functions or expectations) of interest, and a predictor space that
can include a d-tuple of past stream flows and other auxiliary variables. The
requisite transition functions are evaluated through empirical conditional
distribution functions, and through transition intensity functions, conditional
p.d.f.'s and regressions estimated using nearest neighbor (NN) or kernel
methods. Strategies for the simulation of daily flow sequences, one step ahead
prediction and the conditional probability of flooding (flow crossing a
threshold) are exemplified with river flows and shown to be superior to ARMA
models. Seasonality is accommodated by including the calendar date as one of the
predictors. Nonparametric Bayesian procedures for incorporating prior or
regional information (including parametric p.d.f.'s for extremes) are indicated.
Yakowitz indicates that this continuous parameter Markov chain approach can
reproduce any possible Hurst coefficient. He relates these ideas to hydrologic
decision problems, argues that the loss functions associated with hydrologic
decisions (e.g., declare a flood warning or not) are usually highly asymmetric,
and that the classical ARMA or Kalman filtering framework is suited for optimal
prediction only under squared error, and only for linear operations on the
observables. The nonparametric framework allows attention to be focused
directly on calculating these loss functions and evaluating the consequences.
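The link between a conditional threshold crossing probability and an asymmetric loss can be illustrated with a minimal sketch. It assumes a first order predictor space (the current flow only), a Gaussian kernel, and a two-outcome loss. The function names, the bandwidth and the two cost parameters are hypothetical choices made for illustration and are not taken from the papers cited above.

```python
import numpy as np

def exceedance_probability(flows, x_now, threshold, bandwidth):
    """Kernel estimate of P(next flow > threshold | current flow = x_now).

    Assumption: Gaussian kernel on the current flow only (order 1 predictor).
    """
    flows = np.asarray(flows, dtype=float)
    current, successor = flows[:-1], flows[1:]
    # Kernel weight of each historical day, by closeness to the current state.
    w = np.exp(-0.5 * ((current - x_now) / bandwidth) ** 2)
    # Weighted fraction of historical successors that crossed the threshold.
    return float(np.sum(w * (successor > threshold)) / np.sum(w))

def issue_warning(p_flood, cost_false_alarm, cost_missed_flood):
    """Warn when the expected loss of not warning exceeds that of warning."""
    return p_flood * cost_missed_flood > (1.0 - p_flood) * cost_false_alarm
```

With a missed flood costing ten times a false alarm, this rule warns whenever the estimated crossing probability exceeds 1/11, which a squared error criterion on the predicted flow would not deliver. The estimate is undefined if every kernel weight underflows to zero, so the bandwidth should be wide enough to cover some historical states.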
Tong [1990] provides motivation for nonlinear time series
analysis methodology and for nonparametric modeling and visualization of time
series. He uses a daily river flow example to illustrate that such data with
sudden jumps, time irreversibility, asymmetric joint distributions, persistence,
numerous high level crossings, and state dependent correlation between lagged
flows do not support the assumptions inherent in classical linear ARMA modeling.
Yakowitz [1987, 1993], Yakowitz and Karlsson [1987], and
Karlsson and Yakowitz [1987a, 1987b] motivate and provide a
theoretical basis for nearest neighbor (NN) regression for prediction of time
series and specifically for rainfall-runoff modeling. The practical idea is
simple. Given a ``feature vector'' of, say, a sequence of past flows and past
and current rainfall amounts, determine the conditional expectation of, say,
the next flow. This conditional expectation is evaluated by identifying the
successor flows to the k historical nearest neighbors of the current
feature vector, and averaging them. Importance weights may be assigned to each component of the
feature vector and optimized by cross validation as part of the estimation
process. They compared the one step NN predictions of daily flow on different
days with storms to a Unit Hydrograph model, and to an ARMAX model with data
from an Ohio basin and found that the NN model was superior. Galeati
[1990] shows that this simple NN predictor provides lower mean square error
predictions of daily mean inflow to an Italian reservoir relative to an
autoregressive model with exogenous inputs, that was coupled to physically
based, calibrated, rainfall-runoff and snow cover evolution models.
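The NN predictor described above can be sketched in a few lines. This is an illustrative version under stated assumptions, not the implementation used in the cited studies: it uses a weighted Euclidean distance, takes the component weights as given rather than optimizing them by cross validation, and the name `knn_forecast` and its arguments are hypothetical.

```python
import numpy as np

def knn_forecast(features, successors, query, k, weights=None):
    """Average the successors of the k historical feature vectors nearest to query.

    features   : (n, d) array of historical feature vectors
    successors : (n,) array, the flow that followed each feature vector
    query      : (d,) current feature vector
    weights    : optional (d,) importance weights on the components
    """
    features = np.asarray(features, dtype=float)
    successors = np.asarray(successors, dtype=float)
    query = np.asarray(query, dtype=float)
    if weights is None:
        weights = np.ones(features.shape[1])
    # Weighted Euclidean distance from the query to every historical vector.
    diff = (features - query) * np.asarray(weights, dtype=float)
    dist = np.sqrt(np.sum(diff ** 2, axis=1))
    # Indices of the k closest historical states (stable sort breaks ties by order).
    nearest = np.argsort(dist, kind="stable")[:k]
    return float(successors[nearest].mean())
```

In practice k and the component weights would be chosen by minimizing a cross validated prediction error, as the text indicates. Setting a weight to zero removes that component from the neighbor search.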
Smith [1991] and Smith et al. [1992] present some
interesting applications of Yakowitz's ideas that expose the flexibility of
nonparametric methods for seeking relationships between arbitrary functions of
possibly linked data sets. For example, they seek to predict directly (1)
accumulated daily flow over a future 1 to 4 month period, (2) the minimum daily
flow over the future period, (3) the time when future flow may drop below a
threshold, or (4) the total time during the future period when the daily flow
is below a threshold. As predictors, they consider measures of antecedent
conditions, the Southern Oscillation Index, and basin hydrologic and climatic
variables. Kernel methods and empirical conditional distribution functions are
used to develop such predictions. Relative importance of predictors is assessed,
and the state and seasonal dependence of the predictions is graphically
demonstrated. This work shows that the nonparametric framework allows one to
work directly with the statistics relevant for reservoir operation, rather than
worrying about successfully estimating them from a linear model designed to
reproduce a serial correlation structure.
Kember et al. [1993] connect the NN predictor to state space
reconstruction methods used to reconstruct nonlinear dynamics [Farmer and
Sidorowich, 1987] from time series. They consider a weighted neighborhood,
with weights decreasing exponentially with distance, and the L step ahead
forecast regressed on a vector of past flows that may be lagged at a rate
different from the sampling rate. Predictive error criteria are used for
choosing the model order, the lag time and the decay rate of the exponential
weighting scheme. Performance is found to be superior to multiplicative,
seasonal, ARIMA models for a 70 year record of daily streamflow.
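A rough sketch of this kind of forecast follows. It builds a delay embedding with a given model order and lag, then forms the L step ahead forecast as an average of historical successors weighted by exp(-decay * distance). The local averaging form, the Euclidean distance, and the names `delay_embed`, `exp_weighted_forecast`, `order`, `lag`, `lead` and `decay` are assumptions made for illustration. The cited paper may differ in its details.

```python
import numpy as np

def delay_embed(series, order, lag, lead):
    """Build delay vectors [x_t, x_{t-lag}, ..., x_{t-(order-1)*lag}] and targets x_{t+lead}."""
    x = np.asarray(series, dtype=float)
    start = (order - 1) * lag
    times = np.arange(start, len(x) - lead)
    X = np.column_stack([x[times - j * lag] for j in range(order)])
    y = x[times + lead]
    return X, y

def exp_weighted_forecast(series, order, lag, lead, decay):
    """Forecast x_{T+lead} from the last delay vector using exponential distance weights."""
    x = np.asarray(series, dtype=float)
    X, y = delay_embed(x, order, lag, lead)
    # Delay vector for the most recent observation.
    query = x[len(x) - 1 - lag * np.arange(order)]
    dist = np.linalg.norm(X - query, axis=1)
    # Weights decrease exponentially with distance from the current state.
    w = np.exp(-decay * dist)
    return float(np.sum(w * y) / np.sum(w))
```

The order, lag and decay rate would be selected by comparing predictive errors over held-out portions of the record, in the spirit of the criteria described above.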
Lall et al. [1994b] are motivated similarly, but use
Multivariate Adaptive Regression Splines (MARS) due to Friedman [1991],
to recover the map of the dynamical system. This is a higher order function
approximation scheme than NN regression. Parameters including model order,
delay, and spline parameters (number of knots, knot locations, linear or cubic
splines) are chosen using generalized cross validation (GCV). The time series analyzed is the 1848-1992
biweekly volume record of the Great Salt Lake. Blind predictions up to 4 years
ahead using only prior data are attempted at various points in time. Compared
to those from the best fit AR model, these predictions are dramatically superior
as the forecast horizon increases, and they predict the unprecedented
4 year rise and fall of the Great Salt Lake in the 1980's.
The strategy used for simulation in the following work is to develop a kernel density estimate (k.d.e.) for the target univariate, multivariate or conditional p.d.f., and then to sample from this k.d.e. This is tantamount to a smoothed bootstrap [Silverman, 1986] or smoothed conditional bootstrap. Markovian interpretations of such procedures as suggested by Yakowitz apply.
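For the univariate case with a Gaussian kernel, sampling from the k.d.e. is equivalent to resampling the data with replacement and adding kernel noise scaled by the bandwidth. The sketch below illustrates this equivalence. The fixed global bandwidth and the name `smoothed_bootstrap` are assumptions for illustration. The studies discussed below use other kernels, local bandwidths and conditional densities.

```python
import numpy as np

def smoothed_bootstrap(data, n_samples, bandwidth, rng=None):
    """Draw n_samples values from a Gaussian kernel density estimate of data.

    Each draw picks one observation at random (the ordinary bootstrap step)
    and perturbs it with Gaussian noise of standard deviation bandwidth
    (the smoothing step). With bandwidth = 0 this is the ordinary bootstrap.
    """
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    picks = rng.choice(data, size=n_samples, replace=True)
    return picks + bandwidth * rng.standard_normal(n_samples)
```

Unlike the ordinary bootstrap, the smoothed version can generate values that were never observed, and the simulated variance is inflated by the square of the bandwidth relative to the sample variance.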
Rajagopalan et al. [1993, 1994] and Lall et al. [1993b]
develop a seasonal nonparametric renewal model (NPR) for simulating daily
precipitation, where successive dry and wet spell lengths may be dependent or
independent. All requisite p.d.f.'s (for log transformed precipitation amount,
and wet/dry spell length) are estimated by kernel methods. Monte Carlo results
with real data show that spell characteristics as well as other statistics are
well reproduced. The development of a new k.d.e. [Balaji and Lall, 1994,
to appear] appropriate for discrete data complements this work.
Tarboton et al. [1993] develop a multivariate k.d.e. with local
bandwidths proportional to local covariance based on k nearest neighbors
(similar in spirit to Lall and Bosworth [1993]), as well as requisite
conditional k.d.e.'s for simulation of streamflow time series. Simulation
proceeds sequentially using appropriate, estimated conditional p.d.f.'s. Annual
and monthly applications to Colorado River basin flow preserve desired
statistics. This model is extended by Balaji et al. [1994] to consider a
multivariate vector of daily weather variables (solar radiation, maximum
temperature, minimum temperature, average wind speed and average dew point
temperature) and to integrate it with the NPR daily precipitation model
described above. Monte Carlo results with Western U.S. weather data demonstrate
the ability to reproduce not just the usual moments but also quartiles.
Tarboton [1994] visually evaluates the performance of Colorado
River annual stream flows (some based on tree rings) simulated by SPIGOT
[Grygier and Stedinger, 1990], through plots of k.d.e.'s of the marginal p.d.f.
of recorded and simulated traces.