2023-05-01
This is a companion repository for the paper “Robust nonparametric regression: review and practical considerations” (In Press) Econometrics and Statistics, DOI: https://doi.org/10.1016/j.ecosta.2023.04.004.
The file robust-non-param-example-public.R contains an R script
reproducing the illustration example in that paper. Below is an
R markdown version of the same script.
We use the polarization data from the package ggcleveland. The
response variable is the scattering angle where polarization disappears
(babinet), and the explanatory variable is the cubic root of the
particulate concentration of a gas in the atmosphere (concentration).
To avoid numerical issues with repeated values, this variable is
perturbed by adding a small amount of random uniformly distributed noise
(via jitter):
data(polarization, package='ggcleveland')
pol <- polarization
set.seed(123)
pol$conc3 <- jitter( pol$concentration^(1/3), factor = 10 )
plot(babinet ~ conc3, data = pol, pch=19, col='gray70', cex=1.1)The classical smoothing spline estimator, computed with penalization selected via GCV, as described in Wood, S.N., (2006), Generalized Additive Models, Boca Raton: Chapman & Hall / CRC.
library(mgcv)
a <-gam(babinet ~ s(conc3), data=pol) M-smoothing splines as proposed in Kalogridis, I. and Van Aelst, S., (2021), M-type penalized splines with auxiliary scale estimation, Journal of Statistical Planning and Inference, 212, 97-113. The code is publicly available at https://github.com/ioanniskalogridis/Smoothing-splines.
source('https://raw.githubusercontent.com/ioanniskalogridis/Smoothing-splines/main/Huber')
y <- pol$babinet[order(pol$conc3)]
x <- sort(pol$conc3)
fit.huber1 <- huber.smsp(x = x, y = y, k = 1.345, interval = c(1e-16, 1)) M-smoothing splines as proposed in Oh, H-S., Nychka, D.W., Brown, T. and
Charbonneau, P., (2004), Period analysis of variable stars by robust
smoothing, Journal of the Royal Statistical Society, Series C, 53,
15-30. Code is available from the R package
fields. The first fit was
obtained by selecting the smoothing parameter as recommended by the
authors. The second one was obtained with a post-hoc subjective
selection as hinted in the help page of the function fields::qsreg.
# Oh et al (robust smoothing)
library(fields)
oh1 <- qsreg(x=x, y=y, maxit.cv=100)
pr <- predict(oh1, x)
si <- mad( y - pr )
oh2.0 <- qsreg(x=x, y=y, sc = si, maxit.cv = 100)
lam<- oh2.0$cv.grid[,1]
tr<- oh2.0$cv.grid[,2]
lambda.good<- max(lam[tr>=15])
oh2 <- qsreg(x=x, y=y, sc = si, lam = lambda.good) The M-local linear estimator as proposed in Boente, G., Martínez, A. and
Salibian-Barrera, M., (2017), Robust estimators for additive models
using backfitting, Journal of Nonparametric Statistics, 29(4), 744-767.
Code is available in the R package
RBF. We use a robust
leave-one-out cross-validation criterion to select an optimal bandwidth.
This loop can take a relatively long time to complete, so we do not run
it here and use the optimal value (0.2111111).
library(RBF)
y <- pol$babinet[order(pol$conc3)]
x <- sort(pol$conc3)
n <- length(y)
# nh <- 10
# hh <- seq(.1, .3, length=nh) # previous search over (.05, 1), narrowed here
# cvbest <- +Inf
# rmspe <- rep(NA, nh)
# for(i in 1:nh) {
# # leave-one-out CV loop
# preds <- rep(NA, n)
# for(j in 1:n) {
# tmp <- try( backf.rob(y ~ x, point = x[j],
# windows = hh[i], epsilon = 1e-6,
# degree = 1, type = 'Tukey', subset = c(-j) ))
# if (class(tmp)[1] != "try-error") {
# preds[j] <- rowSums(tmp$prediction) + tmp$alpha
# }
# }
# pred.res <- preds - y
# tmp.re <- RobStatTM::locScaleM(pred.res, na.rm=TRUE)
# rmspe[i] <- tmp.re$mu^2 + tmp.re$disper^2
# if( rmspe[i] < cvbest ) {
# jbest <- i
# cvbest <- rmspe[i]
# # print('Record')
# }
# # print(c(i, rmspe[i]))
# }
# bandw <- hh[jbest]
bandw <- 0.2111111
tmp <- backf.rob(y ~ x, windows=bandw, degree=1, type='Tukey', point=x)Robust penalized splines (based on S-estimators) as proposed in Tharmaratnam, K., Claeskens, G., Croux, C. and Salibian-Barrera, M., (2010), S-estimation for penalized regression splines, Journal of Computational and Graphical Statistics, 19(3), 609-625. The code is available at https://github.com/msalibian/PenalizedS.
source("https://raw.githubusercontent.com/msalibian/PenalizedS/master/pen-s-functions.R")
lambdas <- seq(1e-4, 10, length = 100)
p <- 3
y <- pol$babinet[order(pol$conc3)]
x <- sort(pol$conc3)
n <- length(x)
num.knots <- max(5, min(floor(length(unique(x))/4), 35))
knots <- quantile(unique(x), seq(0, 1, length = num.knots + 2))[-c(1, (num.knots + 2))]
xpoly <- rep(1, n)
for (j in 1:p) xpoly <- cbind(xpoly, x^j)
xspline <- outer(x, knots, "-")
xspline <- pmax(xspline, 0)^p
X <- cbind(xpoly, xspline)
# penalty matrix
D <- diag(c(rep(0, ncol(xpoly)), rep(1, ncol(xspline))))
NN <- 500 # max. no. of iterations for S-estimator
# NNN no. of initial candidates for the S-estimator
cc <- 1.54764
b <- 0.5
tmp.s <- pen.s.rgcv(y = y, X = X, D = D, lambdas = lambdas, num.knots = num.knots,
p = p, NN = NN, cc = cc, b = b, NNN = 50)The plots below compare the different robust fits computed above. Given the number of different fits considered, we separate these regression estimates into two groups: smoothing splines and others (penalized splines and local kernel estimators).
plot(babinet ~ conc3, data = pol, pch=19, col='gray70', cex=1.1,
xlab=expression(Conc^{1/3}), ylab='Babinet')
lines(pol$conc3, fitted(a), type='l', lwd=3, col='gray30')
lines(x, fit.huber1$fitted, lwd=3, lty=2, col='blue3')
lines(x, predict(oh2, x), lwd=5, lty=3, col='red3')
lines(x, predict(oh2.0, x), lwd=3, lty=4, col='green4')In the figure above the grey solid line is the classical fit, the blue dashed line is the M-smoothing spline (Kalogridis and Van Aelst, 2021), the robust-GCV-based M-smoothing spline (Oh et al, 2004) is the green dash-dot line, while the red dotted line corresponds to the fit obtained with a subjectively chosen value of the smoothing parameter.
plot(babinet ~ conc3, data = pol, pch=19, col='gray70', cex=1.1,
xlab=expression(Conc^{1/3}), ylab='Babinet')
lines(x, tmp.s$yhat, lwd=3, lty=1, col='red3')
lines(x, as.numeric(tmp$prediction + tmp$alpha), lwd=3, lty=2, col='blue3')In the figure above the red solid line is the S-penalized splines fit while the blue dashed line is the robust local linear M-estimator.


