\( \newcommand{\code}[1]{\texttt{#1}} \newcommand{\FGROTbs}{\code{FG_ROT}} \newcommand{\IROTbs}{\code{IROT}} \newcommand{\RSWROTbs}{\code{RSW_ROT}} \newcommand{\ROTbs}{\code{ROT}} \newcommand{\DPIbs}{\code{DPI}} \newcommand{\STEbs}{\code{STE}} \newcommand{\RSCbs}{\code{RSC}} \newcommand{\CVbs}{\code{CV}} \newcommand{\MSEbs}{\code{MSE}} \newcommand{\GCVbs}{\code{GCV}} \newcommand{\AICbs}{\code{AIC}} \newcommand{\SHIbs}{\code{SHI}} \newcommand{\Tbs}{\code{T}} \newcommand{\eY}{\hat{m}} \newcommand{\CVerr}{\mathrm{CV}} \newcommand{\ASEp}{P} % ASE Penalty \newcommand{\smooth}{H} \newcommand{\ASEerr}{\mathrm{ASE}} \newcommand{\IRSCerr}{\mathrm{IRSC}} \newcommand{\RSCerr}{\mathrm{RSC}} \newcommand{\esigma}{\hat{\sigma}} \newcommand{\RSCadj}{\mathrm{adj}} \newcommand{\xmat}{\mathbf{X}} \newcommand{\wmat}{\mathbf{W}} \newcommand{\ssa}{(\xmat' \wmat^2 \xmat)} \newcommand{\invss}{(\xmat' \wmat \xmat)^{-1}} \newcommand{\trace}{\mathrm{tr}} \newcommand{\kernal}{K} \newcommand{\IMSEerr}{\mathrm{IMSE}} \newcommand{\MSEerr}{\mathrm{MSE}} \newcommand{\ebeta}{\hat{\beta}} \newcommand{\weight}{w} \newcommand{\density}{f} \newcommand{\ekernal}{\kernal^*} \newcommand{\esm}{\hat{m}} \newcommand{\rss}{\mathrm{RSS}} \newcommand{\etheta}{\hat{\theta}} \newcommand{\df}{\nu} \newcommand{\noise}{\epsilon} \newcommand{\yvec}{\mathbf{y}} \newcommand{\ebvec}{\mathbf{\hat{\beta}}} \)

The \code{Bandwidth} property specifies the method used to calculate a bandwidth for local polynomial regression. The table below lists the currently supported methods.

Bandwidth Description
\(\CVbs\) Leave-one-out cross-validation selector
\(\GCVbs\) Generalized cross-validation selector
\(\SHIbs\) Shibata's model selector
\(\AICbs\) Akaike's information criterion selector
\(\code{FPE}\) Finite prediction error selector
\(\Tbs\) Rice's T selector
\(\RSCbs\) Fan-Gijbels residual squares criterion selector
\(\MSEbs\) Fan-Gijbels mean squared error selector
\(\FGROTbs\) Fan-Gijbels rule-of-thumb selector
\(\RSWROTbs\) Ruppert-Sheather-Wand rule-of-thumb selector
\(\DPIbs\) Ruppert-Sheather-Wand direct plug-in selector
\(\STEbs\) Ruppert-Sheather-Wand solve-the-equation selector
\(\ROTbs\) Generalized Ruppert-Sheather-Wand rule-of-thumb selector
\(\IROTbs\) Iterated rule-of-thumb selector

Overview

Recall that the aim is to find a smooth curve through a series of observations \((X_i,Y_i)\), \(i=1,\ldots,n\). Assuming the observations are drawn from a process \begin{equation} Y=m(X)+\sigma(X) \noise \end{equation} where \(\mathrm{E}(\noise)=0\) and \(\mathrm{Var}(\noise)=1\), a local polynomial approximation \begin{equation} m(x)\approx \beta_0 + \beta_1(x-x_0)+ \cdots + \beta_p (x-x_0)^p \end{equation} of order \(p\) at the point \(x_0\) leads to estimates for \(m(x)\) and its derivatives \begin{eqnarray} \eY(x_0) &=& \ebeta_0\\ \esm^{(v)}(x_0)&=&v!\,\ebeta_v,\quad v=1,\ldots,p \end{eqnarray} where \begin{equation} \ebvec=(\xmat' \wmat \xmat)^{-1}\xmat'\wmat \yvec \end{equation} and \(\yvec\) and \(\ebvec\) are the vectors \begin{equation} \yvec=\left[\begin{array}{c} Y_1\\ \vdots\\ Y_n \end{array}\right] \quad\mathrm{and}\quad \ebvec=\left[\begin{array}{c} \ebeta_0\\ \vdots\\ \ebeta_p \end{array}\right]. \end{equation} Here \(\xmat\) is the design matrix \begin{equation} \xmat=\left[ \begin{array}{ccccc} 1 & (X_1-x_0) & (X_1 -x_0)^2 & \cdots & (X_1-x_0)^p\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & (X_n-x_0) & (X_n -x_0)^2 & \cdots & (X_n-x_0)^p \end{array} \right] \end{equation} and \(\wmat\) is the diagonal matrix of kernel weights \begin{equation} \wmat=\left[ \begin{array}{cccc} \kernal_h(X_1-x_0) & & & \\ & \kernal_h(X_2-x_0) & & \\ & & \ddots & \\ & & & \kernal_h(X_n-x_0) \end{array} \right] \end{equation} where \begin{equation} \kernal_h(x-x_0) = \frac{1}{h}\kernal\left(\frac{x-x_0}{h}\right) \end{equation} and \(\kernal(u)\) is a kernel function and \(h\) is the bandwidth. The estimates are not sensitive to the choice of kernel function but are sensitive to the choice of order \(p\) and bandwidth \(h\). There is some guidance for choosing \(p\): to approximate \(m^{(v)}(x)\) one should choose \(p=v+a\) where \(a\) is odd, with \(a=1\) or \(a=3\) being considered sufficient for most cases.
Bandwidth selectors are methods for choosing an optimal \(h\) for a given \(p\). In the following, \(v=0\) unless explicitly stated otherwise.
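As a concrete illustration of the estimate \(\ebvec=(\xmat' \wmat \xmat)^{-1}\xmat'\wmat \yvec\), here is a minimal Python sketch assuming an Epanechnikov kernel; the function names are illustrative only, not part of any particular library:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def local_poly_fit(x, y, x0, h, p=1):
    """Weighted least-squares coefficients beta_hat at x0.

    beta_hat = (X' W X)^{-1} X' W y, where X is the local design matrix
    and W is the diagonal matrix of kernel weights K_h(X_i - x0).
    """
    d = x - x0
    X = np.vander(d, N=p + 1, increasing=True)  # columns 1, d, d^2, ..., d^p
    w = epanechnikov(d / h) / h                 # K_h(X_i - x0)
    XtW = X.T * w                               # X' W without forming W
    return np.linalg.solve(XtW @ X, XtW @ y)    # beta[0] = m_hat(x0)

# Noise-free check: a local linear fit reproduces a straight line exactly.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 + 3.0 * x
beta = local_poly_fit(x, y, x0=0.5, h=0.2, p=1)
print(round(beta[0], 6))  # → 3.5, i.e. m_hat(0.5) = 2 + 3 * 0.5
```

Note that the kernel normalization \(1/h\) cancels in \((\xmat' \wmat \xmat)^{-1}\xmat'\wmat \yvec\); it is kept here only to mirror the definition of \(\kernal_h\).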

Minimum Error Methods

These methods choose the bandwidth by minimizing an estimate of the fitting error [3].

\(\CVbs\)

The leave-one-out cross-validation bandwidth selector. \(h^{\mathrm{CV}}\) is determined by minimizing \begin{equation} \CVerr(h)=\frac{1}{n}\sum_{i=1}^n \left(Y_i-\eY_{-i}(X_i;h)\right)^2 \end{equation} where \(\eY_{-i}(X_i;h)\) is the local polynomial estimate at \(X_i\) computed from all observations except \((X_i,Y_i)\).
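A naive \(O(n^2)\) sketch of this selector over a bandwidth grid, reusing the assumed Epanechnikov-kernel setup (helper names are ours; a production implementation would use a smarter search than a fixed grid):

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def loo_estimate(x, y, i, h, p=1):
    """Local polynomial estimate at X_i computed without (X_i, Y_i)."""
    mask = np.arange(len(x)) != i
    d = x[mask] - x[i]
    X = np.vander(d, N=p + 1, increasing=True)
    w = epanechnikov(d / h) / h               # K_h(X_j - X_i), j != i
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y[mask])[0]

def cv_bandwidth(x, y, grid, p=1):
    """Return the h in `grid` minimizing CV(h) = mean squared LOO residual."""
    def cv(h):
        return np.mean([(y[i] - loo_estimate(x, y, i, h, p)) ** 2
                        for i in range(len(x))])
    return min(grid, key=cv)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=100)
h_cv = cv_bandwidth(x, y, grid=[0.1, 0.2, 0.4])
```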


\(\code{GCV},\code{SHI},\code{AIC},\code{FPE},\code{T}\)

For local polynomial regression the estimate \begin{equation} \mathbf{\esm}=\left[ \begin{array}{c} \esm(X_1)\\ \vdots\\ \esm(X_n) \end{array}\right] \end{equation} is a linear function of \(\yvec\), \begin{equation} \mathbf{\esm}=\smooth \cdot \yvec, \end{equation} where \(\smooth\) is the hat matrix with elements \begin{equation} \label{eqn:smooth} \smooth_{ij} = \left[(\xmat'_i \wmat_i \xmat_i)^{-1}\xmat'_i\wmat_i\right]_{1,j} \end{equation} and \(\xmat_i\) and \(\wmat_i\) are the design and weight matrices at \(X_i\). \(h^{\mathrm{ASE}}\) is determined by minimizing the penalized average squared error \begin{equation} \ASEerr(h)=\frac{1}{n}\sum_{i=1}^n \left(Y_i-\eY(X_i)\right)^2\ASEp\left(\smooth_{ii}\right) \end{equation} where \(\ASEp(u)\) is one of a range of penalty functions:

ASE Description \(\ASEp(u)\)
\(\code{GCV}\) Generalized Cross-Validation \((1-u)^{-2}\)
\(\code{SHI}\) Shibata's Model Selector \(1+2u\)
\(\code{AIC}\) Akaike's Information Criterion \(\exp(2u)\)
\(\code{FPE}\) Finite Prediction Error \((1+u)/(1-u)\)
\(\code{T}\) Rice's T \((1-2u)^{-1}\)
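These criteria share all of their machinery and differ only in \(\ASEp(u)\). A sketch under the same assumed Epanechnikov-kernel setup (illustrative names, not a library API):

```python
import numpy as np

# Penalty functions P(u) from the table above.
PENALTIES = {
    "GCV": lambda u: (1 - u) ** -2,
    "SHI": lambda u: 1 + 2 * u,
    "AIC": lambda u: np.exp(2 * u),
    "FPE": lambda u: (1 + u) / (1 - u),
    "T":   lambda u: (1 - 2 * u) ** -1,
}

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def hat_diag_and_fit(x, y, h, p=1):
    """Return (m_hat(X_i), H_ii) for every observation."""
    n = len(x)
    fits, hii = np.empty(n), np.empty(n)
    for i in range(n):
        d = x - x[i]
        X = np.vander(d, N=p + 1, increasing=True)
        w = epanechnikov(d / h) / h
        XtW = X.T * w
        row = np.linalg.solve(XtW @ X, XtW)[0]  # row i of the hat matrix
        fits[i] = row @ y
        hii[i] = row[i]
    return fits, hii

def ase(x, y, h, penalty="GCV", p=1):
    """Penalized average squared error ASE(h)."""
    fits, hii = hat_diag_and_fit(x, y, h, p)
    return np.mean((y - fits) ** 2 * PENALTIES[penalty](hii))

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=100)
grid = [0.1, 0.2, 0.4]
best = {name: min(grid, key=lambda h: ase(x, y, h, name)) for name in PENALTIES}
```

The different penalties typically agree closely for moderate \(n\), since \(\ASEp(u)\approx 1+2u\) for all of them as \(u\rightarrow 0\).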


\(\RSCbs\)
The Fan-Gijbels residual squares criterion bandwidth selector [1]. \(h_{v,p}^{\mathrm{RSC}}\) is given by \begin{equation} h_{v,p}^{\mathrm{RSC}} =\RSCadj_{v,p}(\kernal)\cdot h_p^{\mathrm{RSC}} \end{equation} where \(h_{v,p}\) is the optimal bandwidth for estimating \(m^{(v)}(x)\), \(\RSCadj_{v,p}(\kernal)\) is a kernel-dependent constant (see [1], (4.17)) and \(h_p^{\mathrm{RSC}}\) minimizes the integrated residual squares criterion \begin{equation} \IRSCerr_p(h)=\sum_{i=1}^n \RSCerr_p(X_i;h) \end{equation} where \begin{equation} \RSCerr_p(x_0;h)=\esigma^2(x_0) \left(1+(p+1)V(x_0)\right). \end{equation} Here \begin{equation} \esigma^2(x_0) = \frac{\sum_{i=1}^n\left(Y_i-\hat{Y}_i\right)^2\kernal_h(X_i-x_0)}{\trace\,\wmat - \trace\left\{\invss\ssa\right\}} \end{equation} where \begin{equation} \hat{Y}_i= \left[\xmat\ebvec\right]_i \end{equation} and \(V(x_0)\) is the first diagonal element of \(\invss \ssa\invss\), so that \(\esigma^2(x_0)V(x_0)\) is the first diagonal element of the variance matrix \begin{equation} \label{eqn:rscvar} V_p(x_0) := \invss \ssa\invss \esigma^2(x_0) . \end{equation}


\(\MSEbs\)
The Fan-Gijbels mean squared error bandwidth selector [1]. \(h_{v,p}^{\mathrm{MSE}}\) is determined by minimizing the approximate integrated mean squared error \begin{equation} \IMSEerr_{v,p}(h) = \sum_{i=1}^n \MSEerr_{v,p}(X_i;h) \end{equation} where \begin{equation} \MSEerr_{v,p}(x_0;h)= b_{v,p}^2(x_0)+V_{v,p}(x_0). \end{equation} Here the bias \(b_{v,p}(x_0)\) is the \((v+1)\)th element of the vector \begin{equation} b_p(x_0)=\invss \left( \begin{array}{c} \ebeta_{p+1}S_{p+1}+\cdots+\ebeta_{p+a}S_{p+a}\\ \vdots\\ \ebeta_{p+1}S_{2p+1}+\cdots+\ebeta_{p+a}S_{2p+a} \end{array}\right) \end{equation} where \begin{equation} S_{j}:=\sum_{i=1}^n \kernal_h(X_i-x_0) (X_i-x_0)^j , \end{equation} and the variance \(V_{v,p}(x_0)\) is the \((v+1)\)th diagonal element of \(V_p(x_0)\) in (\ref{eqn:rscvar}), with \(\ebeta_{p+1},\ldots,\ebeta_{p+a}\) and \(\esigma^2(x_0)\) obtained from an order \(p+a\) local polynomial fit using bandwidth \(h_{p+1,p+a}^{\mathrm{RSC}}\).

Plug-In Methods

Plug-in methods are based on the following asymptotic expansion, as \(n\rightarrow\infty\), of the optimal constant bandwidth for estimating \(m^{(v)}\) over the interval \([a,b]\): \begin{equation} \label{eqn:asym} h_{\mathrm{opt}}\sim C_{v,p}(\kernal)\left[\frac{\int_a^b \sigma^2(x)\weight(x)dx}{ n\int_a^b m^{(p+1)}(x)^2 \weight(x)\density(x)dx}\right]^{1/(2p+3)} \end{equation} Here \begin{equation} C_{v,p}(\kernal) = \left[\frac{\{(p+1)!\}^2(2v+1)\int\ekernal_v(t)^2dt}{% 2(p+1-v)\left\{\int t^{p+1}\ekernal_v(t)dt\right\}^2}\right]^{1/(2p+3)} \end{equation} is a constant that depends only on the equivalent kernel \(\kernal^*_v(t)\) (see [1], Section 3.2). The various selectors differ in how they approximate the two integrals in (\ref{eqn:asym}).
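For example, when estimating \(m\) itself (\(v=0\)) by local linear regression (\(p=1\)) with the Epanechnikov kernel \(\kernal(u)=\frac{3}{4}(1-u^2)\) on \([-1,1]\), the equivalent kernel at interior points coincides with \(\kernal\), so \(\int \ekernal_0(t)^2dt=3/5\) and \(\int t^{2}\ekernal_0(t)dt=1/5\), giving \begin{equation} C_{0,1}(\kernal)=\left[\frac{4\cdot\frac{3}{5}}{2\cdot 2\cdot\left(\frac{1}{5}\right)^2}\right]^{1/5}=15^{1/5}\approx 1.72. \end{equation}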

\(\FGROTbs\)
The Fan-Gijbels rule-of-thumb selector [1]. A single polynomial of order \(p+3\) is fitted to \((X_i,Y_i)\) without kernel weights. The integrals are approximated by \begin{equation} \int_a^b \sigma^2(x) w(x)dx \approx \frac{b-a}{n-p-4}\sum_{i=1}^n (Y_i-\eY(X_i))^2 \end{equation} and \begin{equation} n\int_a^b m^{(p+1)}(x)^2 \weight(x)\density(x)dx\approx \sum_{i=1}^n \esm^{(p+1)}(X_i)^2 \end{equation} where \(\eY(X_i)\) and \(\esm^{(p+1)}(X_i)\) are calculated from the order \(p+3\) polynomial fit (which has \(p+4\) coefficients, hence the \(n-p-4\) residual degrees of freedom).
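A sketch of this rule of thumb for local linear regression (\(p=1\)), assuming the Epanechnikov-kernel constant \(C_{0,1}\approx 1.719\) and \(n-p-4\) residual degrees of freedom for the \(p+4\) fitted coefficients; all names are illustrative:

```python
import numpy as np

def fg_rot_bandwidth(x, y, p=1, C=1.719):
    """Fan-Gijbels rule of thumb via a single global order p+3 polynomial fit."""
    n = len(x)
    a, b = x.min(), x.max()
    coef = np.polynomial.polynomial.polyfit(x, y, deg=p + 3)
    poly = np.polynomial.Polynomial(coef)
    resid = y - poly(x)
    # Approximates int sigma^2(x) w(x) dx.
    sigma2_int = (b - a) * np.sum(resid**2) / (n - p - 4)
    # Approximates n * int m^(p+1)(x)^2 w(x) f(x) dx via the fitted derivative.
    theta = np.sum(poly.deriv(p + 1)(x) ** 2)
    return C * (sigma2_int / theta) ** (1.0 / (2 * p + 3))

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=200)
h_rot = fg_rot_bandwidth(x, y)
```

Because the pilot fit is a single global polynomial, this selector is cheap but can misjudge curvature when \(m\) is not well approximated by an order \(p+3\) polynomial over \([a,b]\).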


\(\RSWROTbs\)
The Ruppert-Sheather-Wand rule-of-thumb bandwidth selector [2]. This selector only applies to local linear regression (\(p=1\)).

Step 1
Estimate \(\eY(X_i)\) and \(\eY^{(2)}(X_i)\) by fitting a quartic to each of \(N\) subintervals of \([a,b]\). Do this for a range of \(N\) from 1 to \(N_{\max}\) where \begin{equation} N_{\max}=\max\left(\min\left(\lfloor n/20\rfloor,N^*\right),1\right) \end{equation} and \(N^*=5\). For each \(N\) calculate the residual sum of squares \(\rss(N)\) and choose the \(N\) that minimizes Mallows' number \begin{equation} C_p(N)=\frac{\rss(N)}{\rss(N_{\max})}(n-5N_{\max})-(n-10N). \end{equation}

Step 2
Make the approximations \begin{equation} \int_a^b \sigma^2(x) w(x)dx \approx (b-a)\esigma^2(N) \end{equation} and \begin{equation} n\int_a^b m^{(2)}(x)^2 \weight(x)\density(x)dx\approx n\,\etheta_{22}(N) \end{equation} where \begin{equation} \esigma^2(N)=\frac{1}{n-5N}\sum_{i=1}^n (Y_i-\eY(X_i))^2 \end{equation} and \begin{equation} \etheta_{22}(N)=\frac{1}{n}\sum_{i=1}^n \esm^{(2)}(X_i)^2 \end{equation}
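The two steps above can be sketched as follows, assuming equal-width blocks and an illustrative kernel constant \(C\approx 1.719\) (the local linear Epanechnikov value; [2] define the exact constants); the helper names are ours:

```python
import numpy as np

def blocked_quartic(x, y, N):
    """Quartic fit on each of N blocks; return RSS, sigma2(N), theta22(N)."""
    n = len(x)
    edges = np.linspace(x.min(), x.max(), N + 1)
    block = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, N - 1)
    rss, theta22 = 0.0, 0.0
    for j in range(N):
        m = block == j
        coef = np.polynomial.polynomial.polyfit(x[m], y[m], deg=4)
        poly = np.polynomial.Polynomial(coef)
        rss += np.sum((y[m] - poly(x[m])) ** 2)
        theta22 += np.sum(poly.deriv(2)(x[m]) ** 2)   # m_hat''(X_i)^2
    return rss, rss / (n - 5 * N), theta22 / n

def rsw_rot_bandwidth(x, y, C=1.719, n_star=5):
    n = len(x)
    N_max = max(min(n // 20, n_star), 1)
    rss_max = blocked_quartic(x, y, N_max)[0]
    # Step 1: choose N by minimizing Mallows' number C_p(N).
    def mallows(N):
        return (blocked_quartic(x, y, N)[0] / rss_max * (n - 5 * N_max)
                - (n - 10 * N))
    N = min(range(1, N_max + 1), key=mallows)
    # Step 2: plug sigma2(N) and theta22(N) into the asymptotic formula.
    _, sigma2, theta22 = blocked_quartic(x, y, N)
    return C * ((x.max() - x.min()) * sigma2 / (n * theta22)) ** 0.2

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=200)
h = rsw_rot_bandwidth(x, y)
```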


\(\DPIbs\)
The Ruppert-Sheather-Wand direct plug-in bandwidth selector [2]. This selector only applies to local linear regression (\(p=1\)).

Step 1
Do a blocked quartic fit as for \(\RSWROTbs\) to find \(\esigma^2(N)\) and \begin{equation} \etheta_{24}(N) :=\frac{1}{n}\sum_{i=1}^n \esm^{(2)}(X_i)\esm^{(4)}(X_i) \end{equation}

Step 2
Use \(\esigma^2(N)\) and \(\etheta_{24}(N)\) to approximate the bandwidth \begin{equation} g\approx C_2(\kernal) \left[\frac{(b-a)\esigma^2(N)}{|\etheta_{24}(N)|n}\right]^{1/7} \end{equation} where \(C_2(\kernal)\) is a kernel-dependent constant defined in [2]. \(g\) is the asymptotically optimal bandwidth for a local cubic regression estimate of \begin{equation} \etheta_{22}(g) :=\frac{1}{n}\sum_{i=1}^n \esm^{(2)}(X_i;g)^2. \end{equation} Next use \(\esigma^2(N)\) and \(\etheta_{22}(g)\) to approximate the bandwidth \begin{equation} \lambda\approx C_3(\kernal)\left[\frac{(b-a)\esigma^2(N)}{\etheta_{22}(g)^2n^2}\right]^{1/9} \end{equation} where \(C_3(\kernal)\) is a kernel-dependent constant defined in [2]. \(\lambda\) is the asymptotically optimal bandwidth for a local linear regression estimate of \begin{equation} \esigma^2(\lambda):=\frac{1}{\nu}\sum_{i=1}^n (Y_i-\eY(X_i;\lambda))^2 \end{equation} where \begin{equation} \label{eqn:nu} \nu=n-2\sum_{i=1}^n \smooth_{ii} + \sum_{i,j=1}^n \smooth_{ij}^2 \end{equation} and \(\smooth\) is the hat matrix given by (\ref{eqn:smooth}).

Step 3
Make the approximations \begin{equation} \int_a^b \sigma^2(x) w(x)dx \approx (b-a)\esigma^2(\lambda) \end{equation} and \begin{equation} n\int_a^b m^{(2)}(x)^2 \weight(x)\density(x)dx\approx n\,\etheta_{22}(g) \end{equation}


\(\STEbs\)
The Ruppert-Sheather-Wand solve-the-equation bandwidth selector [2]. This selector only applies to local linear regression (\(p=1\)). The optimal bandwidth is found iteratively. First calculate \(h\) as for \(\DPIbs\). Next update \(g\) using \begin{equation} g\leftarrow C_4(\kernal) \left(\frac{\etheta_{22}(N)}{\etheta_{24}(N)}\right)^{1/7} h^{5/7} \end{equation} where \begin{equation} C_4(\kernal)=C_2(\kernal)C_1(\kernal)^{-5/7} \end{equation} and recalculate \(h\) using \(\DPIbs\) steps 2 and 3. Repeat until convergence.


\(\ROTbs\)
This is a generalization of \(\RSWROTbs\) to the case of arbitrary \(p\).

Step 1
Estimate \(\eY(X_i)\) and \(\esm^{(p+1)}(X_i)\) by fitting a polynomial of degree \(p+3\) to each of \(N\) subintervals of \([a,b]\). Do this for a range of \(N\) from 1 to \(N_{\max}\) where \begin{equation} N_{\max}=\max\left(\min\left(\lfloor n/20\rfloor,N^*\right),1\right) \end{equation} and \(N^*=5\). For each \(N\) calculate the residual sum of squares \(\rss(N)\) then choose the \(N\) that minimizes Mallows' number \begin{equation} C_p(N)=\frac{\rss(N)}{\rss(N_{\max})}(n-(p+4)N_{\max})-(n-2(p+4)N). \end{equation}

Step 2
Make the approximations \begin{equation} \int_a^b \sigma^2(x) w(x)dx \approx (b-a)\esigma^2(N) \end{equation} and \begin{equation} n\int_a^b m^{(p+1)}(x)^2 \weight(x)\density(x)dx\approx n\,\etheta_{p+1}(N) \end{equation} where \begin{equation} \esigma^2(N)=\frac{1}{n-(p+4)N}\sum_{i=1}^n (Y_i-\eY(X_i))^2 \end{equation} and \begin{equation} \etheta_{p+1}(N)=\frac{1}{n}\sum_{i=1}^n \esm^{(p+1)}(X_i)^2 \end{equation}


\(\IROTbs\)
There is an iterative quality to (\ref{eqn:asym}) which can be exploited to improve on \(\FGROTbs\). Let us agree to estimate \(m^{(v)}\) using local polynomial regressions of order \(p_v=v+a\) for a fixed approximation order \(a\); Fan and Gijbels argue that \(a\) should be odd and that \(a=1\) or \(a=3\) should suffice. In (\ref{eqn:asym}) replace both integrals by estimators to get \begin{equation} h_{v,p}=C_{v,p}(\kernal) \left[\frac{(b-a)\esigma^2_{p+1}}{n\etheta_{p+1}}\right]^{1/(2p+3)} \end{equation} (here \(b-a\) is the length of the estimation interval \([a,b]\), not the approximation order \(a\)) where \begin{equation} \label{eqn:sigmav} \esigma^2_v:=\frac{1}{\df}\sum_{i=1}^n(Y_i-\eY(X_i;h_v))^2 \end{equation} and \begin{equation} \label{eqn:thetav} \etheta_v := \frac{1}{n}\sum_{i=1}^n \esm^{(v)}(X_i;h_v)^2 . \end{equation} Here \(\df\) is given by (\ref{eqn:nu}) and \(h_v:=h_{v,v+a}\) is given by the optimal bandwidth formula \begin{equation} \label{eqn:hv} h_{v}=C_{v}(\kernal) \left[\frac{(b-a)\esigma^2_{v'}}{n\etheta_{v'}}\right]^{1/(2(v+a)+3)} \end{equation} where \(C_v(\kernal):=C_{v,v+a}(\kernal)\) and \begin{equation} v'=v+a+1 . \end{equation} Equations (\ref{eqn:sigmav}), (\ref{eqn:thetav}) and (\ref{eqn:hv}) can be iterated downwards starting from a guess for \(h_{v'}\) for suitably large \(v'\).

References

[1] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. CRC Press, 1996.
[2] D. Ruppert, S. J. Sheather, and M. P. Wand. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90(432), 1995.
[3] A. Schindler. Bandwidth Selection in Nonparametric Kernel Estimation. PhD thesis, Faculty of Economic Sciences, Georg-August-Universität Göttingen, 2011.