# Linear Model Selection and Regularization

Recall the linear model (the same ideas also apply to GLMs):

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$$

Later we will consider even more general non-linear models. Nonetheless:

  • the linear model has distinct advantages in terms of its interpretability and often shows good predictive performance
  • the simple linear model can be improved by replacing ordinary least squares fitting with some alternative fitting procedures
  • this is especially relevant when $n \approx p$ or $p > n$
Why consider alternatives to least squares?
  1. Prediction Accuracy: especially when $p \approx n$ or $p > n$, to control the variance
  2. Model Interpretability: By removing irrelevant features (X's that do not help explain Y) or redundant features (X's that do help, but whose contribution can be replaced by other variables) – that is, by setting the corresponding coefficient estimates to zero – we can obtain a model that is more easily interpreted.
  3. Speed up the training/inference
  4. Avoid the curse of dimensionality

# Three classes of methods

  1. Subset Selection: We identify a subset of the $p$ predictors that we believe to be related to the response, and then fit a model using least squares on the reduced set of variables.
  2. Shrinkage: We fit a model involving all $p$ predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection.
  3. Dimension Reduction: We project the $p$ predictors into an $M$-dimensional subspace, where $M < p$, by computing $M$ different linear combinations, or projections, of the variables. These $M$ projections are then used as predictors to fit a linear regression model by least squares.

# Subset Selection

  1. Let $M_0$ denote the null model, which contains no predictors.
  2. For $k = 1, 2, \ldots, p$:
    • Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors
    • Pick the best among these $\binom{p}{k}$ models, and call it $M_k$. Here best is defined as having the smallest RSS, or equivalently the largest $R^2$
  3. Select a single best model from among $M_0, \ldots, M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$
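As a concrete illustration of steps 1–2, here is a minimal Python sketch (not from the notes), assuming NumPy and scikit-learn; the helper name `best_subset` and the returned dictionary format are my own choices:

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    """For each size k, keep the model with the smallest RSS (this is M_k)."""
    n, p = X.shape
    models = {0: ([], float(np.sum((y - y.mean()) ** 2)))}   # M_0: null model, RSS = TSS
    for k in range(1, p + 1):
        best_rss, best_cols = np.inf, None
        for subset in itertools.combinations(range(p), k):   # all C(p, k) models of size k
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        models[k] = (best_cols, best_rss)                     # M_k
    return models  # step 3 (choosing among M_0..M_p) uses CV, C_p, BIC, or adjusted R^2
```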
Extensions to other models
  1. The same ideas apply to other types of models, such as logistic regression.
  2. The deviance, which is $-2$ times the maximized log-likelihood, plays the role of RSS for a broader class of models.

# Stepwise Selection

  • For computational reasons, best subset selection cannot be applied with very large $p$
  • Best subset selection may also suffer from statistical problems when $p$ is large:
    the larger the search space, the higher the chance of finding models that look good on the training data, even though they may have no predictive power on future data
  • For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection

# Forward Stepwise Selection

  • Begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all of the predictors are in the model
  • In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model
In detail
  1. Let $M_0$ denote the null model, which contains no predictors
  2. For $k = 0, \ldots, p-1$:
    • Consider all $p-k$ models that augment the predictors in $M_k$ with one additional predictor
    • Choose the best among these $p-k$ models, and call it $M_{k+1}$.
      Here best is defined as having the smallest RSS, or equivalently the largest $R^2$
  3. Select a single best model from among $M_0, \ldots, M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$

Although forward stepwise selection fits only $1 + p(p+1)/2$ models, it performs a guided search over model space, and so the effective model space considered contains substantially more than $1 + p(p+1)/2$ models
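For concreteness, a minimal sketch of the procedure (not from the notes), again assuming NumPy and scikit-learn; `forward_stepwise` is a hypothetical helper:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    """Greedily add, at each step, the predictor that most reduces RSS."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    models = {0: ([], float(np.sum((y - y.mean()) ** 2)))}    # M_0: null model
    for k in range(p):                                        # k = 0, ..., p-1
        best_rss, best_j = np.inf, None
        for j in remaining:                                   # the p - k candidate models
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
        models[k + 1] = (list(selected), best_rss)            # M_{k+1}
    return models  # choose among M_0..M_p with CV, C_p, BIC, or adjusted R^2
```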

# More on Forward Stepwise Selection

✅ The computational advantage over best subset selection is clear
❎ It is not guaranteed to find the best possible model (lowest training error) out of all $2^p$ models containing subsets of the $p$ predictors

# Backward Stepwise Selection

  • Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection
  • However, unlike forward stepwise selection, it begins with the full least squares model containing all $p$ predictors, and then iteratively removes the least useful predictor, one-at-a-time
In detail
  1. Let $M_p$ denote the full model, which contains all $p$ predictors
  2. For $k = p, p-1, \ldots, 1$:
    • Consider all $k$ models that contain all but one of the predictors in $M_k$, for a total of $k-1$ predictors
    • Choose the best among these $k$ models, and call it $M_{k-1}$.
      Here best is defined as having the smallest RSS, or equivalently the largest $R^2$
  3. Select a single best model from among $M_0, \ldots, M_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$
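A matching sketch for the backward procedure (again hypothetical, assuming NumPy and scikit-learn, and requiring $n \geq p$ so that the full model can be fit):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def backward_stepwise(X, y):
    """Start from the full model; drop the predictor whose removal hurts RSS least."""
    n, p = X.shape
    current = list(range(p))
    full = LinearRegression().fit(X, y)
    models = {p: (list(current), float(np.sum((y - full.predict(X)) ** 2)))}  # M_p
    for k in range(p, 0, -1):                                 # k = p, p-1, ..., 1
        best_rss, drop = np.inf, None
        for j in current:                                     # the k candidate (k-1)-predictor models
            cols = [c for c in current if c != j]
            if cols:
                fit = LinearRegression().fit(X[:, cols], y)
                rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            else:
                rss = float(np.sum((y - y.mean()) ** 2))      # null model M_0
            if rss < best_rss:
                best_rss, drop = rss, j
        current.remove(drop)
        models[k - 1] = (list(current), best_rss)             # M_{k-1}
    return models
```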

# More on Backward Stepwise Selection

✅ Like forward stepwise selection, the backward selection approach searches through only $1 + p(p+1)/2$ models
❎ Backward selection requires that $n \geq p$ (so that the full model can be fit). In contrast, forward stepwise can be used even when $n < p$
❎ Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best (lowest training error) model containing a subset of the $p$ predictors

# Choosing the Optimal Model

  • The models $M_k$ are chosen to have the smallest RSS and the largest $R^2$, both of which are training-error quantities. What we actually want is the model with the lowest test error, and because training error is a poor estimate of test error, these two criteria are not suitable for choosing among models with different numbers of predictors.

Therefore we can

  • directly estimate the test error
    • validation set approach
    • cross validation approach
  • indirectly estimate test error
    • making an adjustment to the training error to account for the bias due to overfitting

# More details

# $C_p$

Mallow’s $C_p$ estimate of the test MSE (for a least squares model):

$$C_p = \frac{1}{n}\left(RSS + 2d\hat{\sigma}^2\right), \qquad \hat{\sigma}^2 = \frac{RSS_{\text{full}}}{n-p-1}$$

  • $d$ is the total number of parameters used
  • $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement, based on the full model containing all $p$ predictors

# AIC

The AIC (Akaike Information Criterion) is defined for a large class of models fit by maximum likelihood:

$$AIC = -2\log L + 2d$$

For a least squares model with Gaussian errors this is equivalent, up to irrelevant constants, to $\frac{1}{n}\left(RSS + 2d\hat{\sigma}^2\right)$, i.e. to $C_p$.

# BIC

BIC (Bayesian Information Criterion) arises in the Bayesian approach to model selection:

$$BIC = -2\log L + \log(n)\,d$$

which for least squares models is equivalent, up to irrelevant constants, to $\frac{1}{n}\left(RSS + \log(n)\,d\,\hat{\sigma}^2\right)$.

  • Like $C_p$, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value
  • Notice that BIC replaces the $2d\hat{\sigma}^2$ term in $C_p$ with $\log(n)\,d\,\hat{\sigma}^2$, where $n$ is the number of observations

Since $\log(n) > 2$ for any $n > 7$,

  • the BIC statistic places a heavier penalty on models with many variables,
  • and hence results in the selection of smaller models than $C_p$

# Adjusted $R^2$

For a least squares model with $d$ variables, the adjusted $R^2$ statistic is calculated as

$$\text{Adjusted } R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)}, \qquad \left(R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}\right)$$

where $TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ is the total sum of squares

  • Maximizing the adjusted $R^2$ is equivalent to minimizing $\frac{RSS}{n-d-1}$
  • RSS always decreases as the number of variables in the model increases
  • $\frac{RSS}{n-d-1}$ may increase or decrease, due to the presence of $d$ in the denominator
  • Unlike the $R^2$ statistic, the adjusted $R^2$ statistic pays a price for the inclusion of unnecessary variables in the model
  • Unlike $C_p$, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted $R^2$ indicates a model with a small test error
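Putting these criteria together, a small sketch (not from the notes; the helper name `selection_criteria` is hypothetical) that computes $C_p$, BIC, and adjusted $R^2$ from the RSS of a $d$-predictor least squares fit, with AIC omitted since for least squares it matches $C_p$ up to constants:

```python
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2):
    """C_p, BIC, and adjusted R^2 for a least squares model with d predictors.
    sigma2 is the error-variance estimate from the full model, RSS_full / (n - p - 1).
    Lower C_p / BIC is better; higher adjusted R^2 is better."""
    cp = (rss + 2 * d * sigma2) / n
    bic = (rss + np.log(n) * d * sigma2) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"Cp": cp, "BIC": bic, "adj_R2": adj_r2}
```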

# Validation and cross-validation

Each of the procedures above returns a sequence of models $M_k$ indexed by model size $k = 0, 1, 2, \ldots$
We compute the validation set error or the cross-validation error for each model $M_k$ under consideration, and then select the $k$ for which the resulting estimated test error is smallest.

✅ This procedure has an advantage relative to AIC, BIC, $C_p$, and adjusted $R^2$, in that

  • it provides a direct estimate of the test error
  • it can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance $\sigma^2$

❎ It is more computationally demanding

# One-standard-error rule

Estimated test errors are often nearly indistinguishable across several model sizes (e.g. the four-, five-, and six-variable models may be roughly equivalent). The one-standard-error rule resolves such near-ties in favour of the simplest model:

  1. We first calculate the standard error of the estimated test MSE for each model size,
  2. and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.

(Figure: selecting the number of model parameters using cross-validation and the one-standard-error rule)
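A possible implementation of this rule on top of 10-fold cross-validation (a sketch, not from the notes; `one_se_choice` and its input format are hypothetical and assume scikit-learn):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def one_se_choice(X, y, models, cv=10):
    """models maps model size k to the list of selected column indices
    (e.g. extracted from the stepwise sketches above). Returns the k chosen
    by the one-standard-error rule."""
    means, ses = {}, {}
    for k, cols in models.items():
        if not cols:
            continue                                   # skip the null model for simplicity
        mse = -cross_val_score(LinearRegression(), X[:, list(cols)], y,
                               scoring="neg_mean_squared_error", cv=cv)
        means[k] = mse.mean()                          # estimated test MSE for M_k
        ses[k] = mse.std(ddof=1) / np.sqrt(cv)         # its standard error across folds
    k_min = min(means, key=means.get)                  # size with the lowest estimate
    threshold = means[k_min] + ses[k_min]              # one SE above the minimum
    return min(k for k in means if means[k] <= threshold)  # smallest acceptable model
```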

# Shrinkage Methods

  • We can fit a model containing all $p$ predictors using a technique that constrains or regularizes the coefficient estimates
  • Shrinking the coefficient estimates can significantly reduce their variance

# Ridge regression

Recall that the least squares fitting procedure estimates $\beta_0, \beta_1, \ldots, \beta_p$ using the values that minimize

$$RSS = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2$$

In contrast, the ridge regression coefficient estimates $\hat{\beta}^R_\lambda$ are the values that minimize

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = RSS + \lambda\sum_{j=1}^{p}\beta_j^2$$

where $\lambda \geq 0$ is a tuning parameter.
By keeping RSS small, ridge regression seeks coefficient estimates that fit the data well.

  • $\lambda\sum_{j}\beta_j^2$, called a shrinkage penalty, is small when $\beta_1, \beta_2, \ldots, \beta_p$ are close to zero
  • The tuning parameter $\lambda$ controls the relative impact of the two terms on the regression coefficient estimates. Selecting a good $\lambda$ is crucial and is usually done via cross-validation
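In practice, ridge regression with a cross-validated $\lambda$ can be fit, for example, with scikit-learn (a sketch under that assumption; scikit-learn calls $\lambda$ `alpha`):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the predictors first, since the ridge penalty is not scale-invariant;
# RidgeCV then selects lambda over a grid by (leave-one-out) cross-validation.
alphas = np.logspace(-4, 4, 100)
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
# ridge.fit(X, y)                          # X, y: design matrix and response (not defined here)
# ridge.named_steps["ridgecv"].alpha_      # the selected lambda
# ridge.named_steps["ridgecv"].coef_       # the shrunken coefficient estimates
```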

# The Lasso (Least Absolute Shrinkage and Selection Operator)

The obvious disadvantage of ridge regression is that it includes all $p$ predictors in the final model.

The lasso is a relatively recent alternative to ridge regression that overcomes this disadvantage. The lasso coefficients, $\hat{\beta}^L_\lambda$, minimize the quantity

$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = RSS + \lambda\sum_{j=1}^{p}|\beta_j|$$

In statistical parlance, the lasso uses an $\ell_1$ norm (pronounced “ell 1”) penalty instead of an $\ell_2$ penalty.

  • Like ridge regression, the lasso shrinks the coefficient estimates towards zero.
    However, in the case of the lasso, the $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large. Hence, the lasso performs variable selection
  • The lasso yields sparse models (models that involve only a subset of the variables).
    As in ridge regression, selecting a good value of $\lambda$ for the lasso is critical; cross-validation is again the method of choice

Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?

  • One can show that the lasso and ridge regression coefficient estimates solve, respectively, the following constrained problems (linked to the penalized forms via Lagrange multipliers):

    $$\min_{\beta}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \text{ subject to } \sum_{j=1}^{p}|\beta_j| \leq s$$

    and

    $$\min_{\beta}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \text{ subject to } \sum_{j=1}^{p}\beta_j^2 \leq s$$

  • Best subset selection can be viewed as (ESL, Ch. 3)

    $$\min_{\beta}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \text{ subject to } \sum_{j=1}^{p}I(\beta_j \neq 0) \leq s$$
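A corresponding scikit-learn sketch for the lasso (an illustration, not the notes' own code), showing how the coefficients shrunk exactly to zero give variable selection:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# LassoCV selects lambda (`alpha`) by k-fold cross-validation; coefficients that are
# shrunk exactly to zero correspond to variables dropped from the model (sparsity).
lasso = make_pipeline(StandardScaler(), LassoCV(cv=10))
# lasso.fit(X, y)                                   # X, y: data (not defined here)
# coef = lasso.named_steps["lassocv"].coef_
# selected = np.flatnonzero(coef)                   # indices of the retained variables
```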

# Comparing the Lasso and Ridge Regression

# Short Conclusions

# Dimension Reduction Methods

# Principal Components Regression (PCR)

# Partial Least Squares

# Consideration in high dimensions