# Linear Model Selection and Regularization

Recall the linear model (the same ideas also apply to GLMs):

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$

Later we will consider even more general non-linear models. Nonetheless:

  • the linear model has distinct advantages in terms of its interpretability and often shows good predictive performance
  • the simple linear model can be improved by replacing ordinary least squares fitting with some alternative fitting procedures
  • this is especially relevant when $n \approx p$ or $p > n$
Why consider alternatives to least squares?
  1. Prediction Accuracy: especially when $p \approx n$ or $p > n$, to control the variance
  2. Model Interpretability: By removing irrelevant features (X's that do not help explain Y) or redundant features (X's that do help, but whose contribution can be replaced by other variables) – that is, by setting the corresponding coefficient estimates to zero – we can obtain a model that is more easily interpreted.
  3. Speed up the training/inference
  4. Avoid the curse of dimensionality

# Three classes of methods

  1. Subset Selection: We identify a subset of the $p$ predictors that we believe to be related to the response, and then fit a model using least squares on the reduced set of variables.
  2. Shrinkage: We fit a model involving all $p$ predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection.
  3. Dimension Reduction: We project the $p$ predictors into an $M$-dimensional subspace, where $M < p$, by computing $M$ different linear combinations, or projections, of the variables. These $M$ projections are then used as predictors to fit a linear regression model by least squares.

# Subset Selection

  1. Let $M_0$ denote the null model, which contains no predictors.
  2. For $k = 1, 2, \ldots, p$:
    • Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors
    • Pick the best among these $\binom{p}{k}$ models, and call it $M_k$. Here best is defined as having the smallest RSS, or equivalently the largest $R^2$
  3. Select a single best model from among $M_0, \ldots, M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$
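As a concrete illustration of steps 1–2, here is a minimal Python sketch (not from the notes), assuming NumPy and scikit-learn; the helper name `best_subset` and the returned dictionary format are my own choices:

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """For each size k, keep the model with the smallest RSS (this is M_k)."""
    n, p = X.shape
    models = {0: ([], float(np.sum((y - y.mean()) ** 2)))}   # M_0: null model, RSS = TSS
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for subset in itertools.combinations(range(p), k):   # all C(p, k) models of size k
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        models[k] = (best_cols, best_rss)                     # M_k
    return models  # step 3 (choosing among M_0..M_p) uses CV, C_p, BIC, or adjusted R^2
```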
Extensions to other models
  1. The same ideas apply to other types of models, such as logistic regression.
  2. The deviance, which is $-2$ times the maximized log-likelihood, plays the role of RSS for a broader class of models.

# Stepwise Selection

  • For computational reasons, best subset selection cannot be applied with very large $p$
  • Best subset selection may also suffer from statistical problems when $p$ is large:
    the larger the search space, the higher the chance of finding models that look good on the training data, even though they may have no predictive power on future data
  • For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection

# Forward Stepwise Selection

  • Begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model
  • In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model
In detail
  1. Let $M_0$ denote the null model, which contains no predictors
  2. For $k = 0, \ldots, p-1$:
    • Consider all $p-k$ models that augment the predictors in $M_k$ with one additional predictor
    • Choose the best among these $p-k$ models, and call it $M_{k+1}$.
      Here best is defined as having the smallest RSS, or equivalently the largest $R^2$
  3. Select a single best model from among $M_0, \ldots, M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$

Although forward stepwise selection fits only $1 + p(p+1)/2$ models, it performs a guided search over model space, and so the effective model space considered contains substantially more than $1 + p(p+1)/2$ models
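For concreteness, a minimal sketch of the procedure (not from the notes), again assuming NumPy and scikit-learn; `forward_stepwise` is a hypothetical helper:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Greedily add, at each step, the predictor that most reduces RSS."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    models = {0: ([], float(np.sum((y - y.mean()) ** 2)))}    # M_0: null model
    for k in range(p):                                        # k = 0, ..., p-1
        best_rss, best_j = np.inf, None
        for j in remaining:                                   # the p - k candidate models
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
        models[k + 1] = (list(selected), best_rss)            # M_{k+1}
    return models  # choose among M_0..M_p with CV, C_p, BIC, or adjusted R^2
```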

# More on Forward Stepwise Selection

✅ The computational advantage over best subset selection is clear
❎ It is not guaranteed to find the best possible model (lowest training error) out of all $2^p$ models containing subsets of the $p$ predictors

# Backward Stepwise Selection

  • Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection
  • However, unlike forward stepwise selection, it begins with the full least squares model containing all $p$ predictors, and then iteratively removes the least useful predictor, one-at-a-time
In detail
  1. Let $M_p$ denote the full model, which contains all $p$ predictors
  2. For $k = p, p-1, \ldots, 1$:
    • Consider all $k$ models that contain all but one of the predictors in $M_k$, for a total of $k-1$ predictors
    • Choose the best among these $k$ models, and call it $M_{k-1}$.
      Here best is defined as having the smallest RSS, or equivalently the largest $R^2$
  3. Select a single best model from among $M_0, \ldots, M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$
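A matching sketch for the backward procedure (again hypothetical, assuming NumPy and scikit-learn, and requiring $n \geq p$ so that the full model can be fit):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def backward_stepwise(X, y):
    """Start from the full model; drop the predictor whose removal hurts RSS least."""
    n, p = X.shape
    current = list(range(p))
    full = LinearRegression().fit(X, y)
    models = {p: (list(current), float(np.sum((y - full.predict(X)) ** 2)))}  # M_p
    for k in range(p, 0, -1):                                 # k = p, p-1, ..., 1
        best_rss, drop = np.inf, None
        for j in current:                                     # the k candidate (k-1)-predictor models
            cols = [c for c in current if c != j]
            if cols:
                fit = LinearRegression().fit(X[:, cols], y)
                rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            else:
                rss = float(np.sum((y - y.mean()) ** 2))      # null model M_0
            if rss < best_rss:
                best_rss, drop = rss, j
        current.remove(drop)
        models[k - 1] = (list(current), best_rss)             # M_{k-1}
    return models
```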

# More on Backward Stepwise Selection

✅ Like forward stepwise selection, the backward selection approach searches through only $1 + p(p+1)/2$ models
❎ Backward selection requires that $n \geq p$ (so that the full model can be fit). In contrast, forward stepwise can be used even when $n < p$
❎ Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best (lowest training error) model containing a subset of the $p$ predictors

# Choosing the Optimal Model

  • The models $M_k$ are chosen to have the smallest RSS and the largest $R^2$, both of which are training-error quantities. What we actually want is the model with the lowest test error, and because training error is a poor estimate of test error, these two criteria are not suitable for choosing among models with different numbers of predictors.

Therefore we can

  • directly estimate the test error
    • validation set approach
    • cross validation approach
  • indirectly estimate test error
    • making an adjustment to the training error to account for the bias due to overfitting

# More details

# $C_p$

Mallow’s $C_p$ estimate of the test MSE (for a least squares model):

$$C_p = \frac{1}{n}\left(RSS + 2d\hat{\sigma}^2\right), \qquad \hat{\sigma}^2 = \frac{RSS_{\text{full}}}{n-p-1}$$

  • $d$ is the total number of parameters used
  • $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement, based on the full model containing all $p$ predictors

# AIC

The AIC (Akaike Information Criterion) is defined for a large class of models fit by maximum likelihood:

$$AIC = -2\log L + 2d$$

For a least squares model with Gaussian errors this is equivalent, up to irrelevant constants, to $\frac{1}{n}\left(RSS + 2d\hat{\sigma}^2\right)$, i.e. to $C_p$.

# BIC

BIC (Bayesian Information Criterion) arises in the Bayesian approach to model selection:

$$BIC = -2\log L + \log(n)\,d$$

which for least squares models is equivalent, up to irrelevant constants, to $\frac{1}{n}\left(RSS + \log(n)\,d\,\hat{\sigma}^2\right)$.

  • Like $C_p$, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value
  • Notice that BIC replaces the $2d\hat{\sigma}^2$ term in $C_p$ with $\log(n)\,d\,\hat{\sigma}^2$, where $n$ is the number of observations

Since $\log(n) > 2$ for any $n > 7$,

  • the BIC statistic places a heavier penalty on models with many variables,
  • and hence results in the selection of smaller models than $C_p$

# Adjusted $R^2$

For a least squares model with $d$ variables, the adjusted $R^2$ statistic is calculated as

$$\text{Adjusted } R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}, \qquad \left(R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}\right)$$

where $TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ is the total sum of squares

  • Maximizing the adjusted $R^2$ is equivalent to minimizing $\frac{RSS}{n-d-1}$
  • RSS always decreases as the number of variables in the model increases
  • $\frac{RSS}{n-d-1}$ may increase or decrease, due to the presence of $d$ in the denominator
  • Unlike the $R^2$ statistic, the adjusted $R^2$ statistic pays a price for the inclusion of unnecessary variables in the model
  • Unlike $C_p$, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted $R^2$ indicates a model with a small test error
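Putting these criteria together, a small sketch (not from the notes; the helper name `selection_criteria` is hypothetical) that computes $C_p$, BIC, and adjusted $R^2$ from the RSS of a $d$-predictor least squares fit, with AIC omitted since for least squares it matches $C_p$ up to constants:

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2):
    """C_p, BIC, and adjusted R^2 for a least squares model with d predictors.
    sigma2 is the error-variance estimate from the full model, RSS_full / (n - p - 1).
    Lower C_p / BIC is better; higher adjusted R^2 is better."""
    cp = (rss + 2 * d * sigma2) / n
    bic = (rss + np.log(n) * d * sigma2) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"Cp": cp, "BIC": bic, "adj_R2": adj_r2}
```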

# Validation and cross-validation

Each of the procedures above returns a sequence of models $M_k$ indexed by model size $k = 0, 1, 2, \ldots$
We compute the validation set error or the cross-validation error for each model $M_k$ under consideration, and then select the $k$ for which the resulting estimated test error is smallest.

✅ This procedure has an advantage relative to AIC, BIC, $C_p$, and adjusted $R^2$, in that

  • it provides a direct estimate of the test error
  • it can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance $\sigma^2$

❎ It is more computationally demanding

# One-standard-error rule

Estimated test errors are often nearly indistinguishable across several model sizes (e.g. the four-, five-, and six-variable models may be roughly equivalent). The one-standard-error rule resolves such near-ties in favour of the simplest model:

  1. We first calculate the standard error of the estimated test MSE for each model size,
  2. and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.

(Figure: selecting the number of model parameters using cross-validation and the one-standard-error rule)
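A possible implementation of this rule on top of 10-fold cross-validation (a sketch, not from the notes; `one_se_choice` and its input format are hypothetical and assume scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def one_se_choice(X, y, models, cv=10):
    """models maps model size k to the list of selected column indices
    (e.g. extracted from the stepwise sketches above). Returns the k chosen
    by the one-standard-error rule."""
    means, ses = {}, {}
    for k, cols in models.items():
        if not cols:
            continue                                   # skip the null model for simplicity
        mse = -cross_val_score(LinearRegression(), X[:, list(cols)], y,
                               scoring="neg_mean_squared_error", cv=cv)
        means[k] = mse.mean()                          # estimated test MSE for M_k
        ses[k] = mse.std(ddof=1) / np.sqrt(cv)         # its standard error across folds
    k_min = min(means, key=means.get)                  # size with the lowest estimate
    threshold = means[k_min] + ses[k_min]              # one SE above the minimum
    return min(k for k in means if means[k] <= threshold)  # smallest acceptable model
```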

# Shrinkage Methods

  • We can fit a model containing all $p$ predictors using a technique that constrains or regularizes the coefficient estimates
  • Shrinking the coefficient estimates can significantly reduce their variance

# Ridge regression

Recall that the least squares fitting procedure estimates $\beta_0, \beta_1, \ldots, \beta_p$ using the values that minimize

$$RSS = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2$$

In contrast, the ridge regression coefficient estimates $\hat{\beta}^R_\lambda$ are the values that minimize

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda\sum_{j=1}^{p}\beta_j^2$$

where $\lambda \geq 0$ is a tuning parameter.
By keeping RSS small, ridge regression seeks coefficient estimates that fit the data well.

  • $\lambda\sum_{j}\beta_j^2$, called a shrinkage penalty, is small when $\beta_1, \beta_2, \ldots, \beta_p$ are close to zero
  • The tuning parameter $\lambda$ controls the relative impact of the two terms on the regression coefficient estimates. Selecting a good $\lambda$ is crucial and is usually done via cross-validation
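In practice, ridge regression with a cross-validated $\lambda$ can be fit, for example, with scikit-learn (a sketch under that assumption; scikit-learn calls $\lambda$ `alpha`):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the predictors first, since the ridge penalty is not scale-invariant;
# RidgeCV then selects lambda over a grid by (leave-one-out) cross-validation.
alphas = np.logspace(-4, 4, 100)
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
# ridge.fit(X, y)                          # X, y: design matrix and response (not defined here)
# ridge.named_steps["ridgecv"].alpha_      # the selected lambda
# ridge.named_steps["ridgecv"].coef_       # the shrunken coefficient estimates
```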

# The Lasso (Least Absolute Shrinkage and Selection Operator)

The obvious disadvantage of ridge regression is that it includes all $p$ predictors in the final model.

The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, $\hat{\beta}^L_\lambda$, minimize the quantity

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda\sum_{j=1}^{p}|\beta_j|$$

In statistical parlance, the lasso uses an $\ell_1$ norm (pronounced “ell 1”) penalty instead of an $\ell_2$ penalty.

  • Like ridge regression, the lasso shrinks the coefficient estimates towards zero.
    However, in the case of the lasso, the $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large. Hence, the lasso performs variable selection
  • The lasso yields sparse models (models that involve only a subset of the variables).
    As in ridge regression, selecting a good value of $\lambda$ for the lasso is critical; cross-validation is again the method of choice

Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?

  • One can show that the lasso and ridge regression coefficient estimates solve, respectively, the following constrained problems (linked to the penalized forms via Lagrange multipliers):

    $$\min_{\beta}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \text{ subject to } \sum_{j=1}^{p}|\beta_j| \leq s$$

    and

    $$\min_{\beta}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \text{ subject to } \sum_{j=1}^{p}\beta_j^2 \leq s$$

  • Best subset selection can be viewed as (ESL, Ch. 3)

    $$\min_{\beta}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \text{ subject to } \sum_{j=1}^{p}I(\beta_j \neq 0) \leq s$$
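A corresponding scikit-learn sketch for the lasso (an illustration, not the notes' own code), showing how the coefficients shrunk exactly to zero give variable selection:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# LassoCV selects lambda (`alpha`) by k-fold cross-validation; coefficients that are
# shrunk exactly to zero correspond to variables dropped from the model (sparsity).
lasso = make_pipeline(StandardScaler(), LassoCV(cv=10))
# lasso.fit(X, y)                                   # X, y: data (not defined here)
# coef = lasso.named_steps["lassocv"].coef_
# selected = np.flatnonzero(coef)                   # indices of the retained variables
```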

# Comparing the Lasso and Ridge Regression

# Short Conclusions

# Dimension Reduction Methods

# Principal Components Regression (PCR)

# Partial Least Squares

# Consideration in high dimensions