# Linear Model Selection and Regularization
Recall the linear model
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$$
(the same ideas can also be applied to GLMs). In later lectures we consider even more general non-linear models.
- The linear model has distinct advantages in terms of its interpretability and often shows good predictive performance
- The simple linear model can be improved by replacing ordinary least squares fitting with some alternative fitting procedures
- This often applies to the case when the number of predictors $p$ is large relative to the number of observations $n$ (e.g. $p \approx n$ or $p > n$)
Why consider alternatives to least squares?
- Prediction Accuracy: especially when $p \approx n$ or $p > n$, to control the variance
- Model Interpretability: By removing irrelevant features (predictors that do not help explain $y$) or redundant features (predictors that do help, but whose contribution can be replaced by other variables) – that is, by setting the corresponding coefficient estimates to zero – we can obtain a model that is more easily interpreted
- Speed up the training/inference
- Avoid the curse of dimensionality
# Three classes of methods
- Subset Selection
  - We identify a subset of the $p$ predictors that we believe to be related to the response
  - We then fit a model using least squares on the reduced set of variables
- Shrinkage (Regularization)
  - We fit a model involving all $p$ predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates
  - This shrinkage (also known as regularization) has the effect of reducing variance and can also perform variable selection
- Dimension Reduction
  - We project the $p$ predictors into an $M$-dimensional subspace, where $M < p$. This is achieved by computing $M$ different linear combinations, or projections, of the $p$ variables
  - These projections are then used as predictors to fit a linear regression model by least squares (a sketch of this idea follows below)
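As an illustration of the dimension-reduction class, here is a minimal sketch of principal components regression with scikit-learn: the $p$ predictors are projected onto $M < p$ principal components, and least squares is then fit on those projections. The dataset is synthetic and the choice $M = 5$ is arbitrary, purely for illustration.

```python
# Principal components regression as an example of dimension reduction: project the
# p predictors onto M < p principal components, then fit least squares on those.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

M = 5  # number of linear combinations (projections) to keep; M < p, chosen arbitrarily here
pcr = make_pipeline(PCA(n_components=M), LinearRegression()).fit(X, y)
print("training R^2 with", M, "components:", pcr.score(X, y))
```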
# Subset Selection
- Let $\mathcal{M}_0$ denote the null model, which contains no predictors.
- For $k = 1, 2, \ldots, p$:
  - Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors
  - Pick the best among these $\binom{p}{k}$ models, and call it $\mathcal{M}_k$. Here best is defined as having the smallest RSS, or equivalently the largest $R^2$
- Select a single best model from among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$ (a sketch of the full procedure follows below)
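A minimal sketch of the algorithm above, assuming a NumPy design matrix `X` of shape $(n, p)$ and a response vector `y`; it uses scikit-learn's `LinearRegression` for each least squares fit. Note the $2^p$ cost, which is why this is only feasible for small $p$.

```python
# Sketch of best subset selection. For each size k it fits all C(p, k) least squares
# models, keeps the one with the smallest RSS as M_k, and returns M_0, ..., M_p.
# Exponential in p, so only practical for small p.
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y):
    n, p = X.shape
    best_per_size = {0: ((), float(np.sum((y - y.mean()) ** 2)))}  # M_0: null model, RSS = TSS
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for subset in itertools.combinations(range(p), k):
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_vars = rss, subset
        best_per_size[k] = (best_vars, best_rss)  # M_k
    # the final choice among M_0, ..., M_p uses CV error, Cp, AIC, BIC, or adjusted R^2
    return best_per_size
```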
Extensions to other models
- the same ideas apply to other types of models, such as logistic regression
- The deviance (negative two times the maximized log-likelihood) plays the role of RSS for a broader class of models
# Stepwise Selection
- For computational reasons, best subset selection cannot be applied with very large $p$ (it requires fitting $2^p$ models)
- Best subset selection may also suffer from statistical problems when $p$ is large: the larger the search space, the higher the chance of finding models that look good on the training data, even though they might not have any predictive power on future data
- For both of these reasons, stepwise methods, which explore a far more restricted set of models, are attractive alternatives to best subset selection
# Forward Stepwise Selection
- Begins with a model containing no predictors, and then adds predictors to the model, one-at-a-time, until all $p$ predictors are in the model
- In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model
In detail
- Let $\mathcal{M}_0$ denote the null model, which contains no predictors
- For $k = 0, 1, \ldots, p - 1$:
  - Consider all $p - k$ models that augment the predictors in $\mathcal{M}_k$ with one additional predictor
  - Choose the best among these $p - k$ models, and call it $\mathcal{M}_{k+1}$. Here best is defined as having the smallest RSS, or equivalently the largest $R^2$
- Select a single best model from among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$
Though forward stepwise selection considers only $1 + p(p+1)/2$ models, it performs a guided search over model space, and so the effective model space considered contains substantially more than $1 + p(p+1)/2$ models.
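A minimal sketch of the greedy search described above, again assuming NumPy arrays `X` and `y`; at each step it adds the predictor that gives the largest drop in RSS. The final choice among the resulting $p + 1$ models would be made with cross-validation, $C_p$, AIC, BIC, or adjusted $R^2$.

```python
# Sketch of forward stepwise selection: start from the null model and, at each step,
# add the single remaining predictor that most reduces the RSS.
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    n, p = X.shape
    selected = []
    path = [((), float(np.sum((y - y.mean()) ** 2)))]  # M_0: null model
    remaining = list(range(p))
    for _ in range(p):
        scores = []
        for j in remaining:
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            scores.append((float(np.sum((y - fit.predict(X[:, cols])) ** 2)), j))
        rss, best_j = min(scores)            # candidate with the smallest RSS
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((tuple(selected), rss))  # M_{k+1}
    return path  # p + 1 nested models; pick one via CV, Cp, AIC, BIC, or adjusted R^2
```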
# More on Forward Stepwise Selection
✅ The computational advantage over best subset selection is clear
❎ It is not guaranteed to find the best possible model (lowest training error) out of all $2^p$ models containing subsets of the $p$ predictors
# Backward Stepwise Selection
- Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection
- However, unlike forward stepwise selection, it begins with the full least squares model containing all $p$ predictors, and then iteratively removes the least useful predictor, one-at-a-time
In detail
- Let $\mathcal{M}_p$ denote the full model, which contains all $p$ predictors
- For $k = p, p - 1, \ldots, 1$:
  - Consider all $k$ models that contain all but one of the predictors in $\mathcal{M}_k$, for a total of $k - 1$ predictors
  - Choose the best among these $k$ models, and call it $\mathcal{M}_{k-1}$. Here best is defined as having the smallest RSS, or equivalently the largest $R^2$
- Select a single best model from among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using cross-validated prediction error, $C_p$ (AIC), BIC, or adjusted $R^2$
# More on Backward Stepwise Selection
✅ Like forward stepwise selection, the backward selection approach searches through only $1 + p(p+1)/2$ models
❎ Backward selection requires that $n > p$ (so that the full model can be fit). In contrast, forward stepwise can be used even when $n < p$
❎ Like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best (lowest training error) model containing a subset of the predictors
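For a library-based alternative, scikit-learn's `SequentialFeatureSelector` performs the same kind of greedy forward or backward search, though it scores candidate models by cross-validation rather than by raw RSS; the synthetic data and the choice of five retained features below are just for illustration.

```python
# Library-based greedy search: scikit-learn's SequentialFeatureSelector adds (or drops)
# one predictor at a time, scoring candidates by cross-validation instead of raw RSS.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=15, n_informative=5,
                       noise=10.0, random_state=0)

# backward search needs n > p so that the full model can be fit
backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                     direction="backward", cv=5).fit(X, y)
print("kept predictors:", backward.get_support(indices=True))
```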
# Choosing the Optimal Model
- The model containing all of the predictors always has the smallest RSS and the largest $R^2$, since both quantities are related to the training error. We want to choose a model with low test error, but the training error is a poor estimate of the test error, so these two criteria are not suitable for selecting the best model among models with different numbers of predictors.
Therefore we can
- directly estimate the test error
- validation set approach
- cross validation approach
- indirectly estimate test error
- making an adjustment to the training error to account for the bias due to overfitting
# More details
# Mallow's $C_p$
Mallow's $C_p$ estimate of the test MSE (for a least squares model):
$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2\, d\, \hat{\sigma}^2\right)$$
- $d$ is the total number of parameters used
- $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement, based on the model containing all predictors
# AIC
The AIC (Akaike Information Criterion) is defined for a large class of models fit by maximum likelihood:
$$\mathrm{AIC} = -2 \log L + 2d$$
- $L$ is the maximized value of the likelihood function for the estimated model
- In the case of the linear model with Gaussian errors, maximum likelihood and least squares are the same thing, and $C_p$ and AIC are equivalent
# BIC
BIC (Bayesian information criterion) arises in the Bayesian approach to model selection:
$$\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d\, \hat{\sigma}^2\right)$$
- Like $C_p$, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value
- Notice that BIC replaces the $2\, d\, \hat{\sigma}^2$ term in $C_p$ with $\log(n)\, d\, \hat{\sigma}^2$, where $n$ is the number of observations.
Since $\log n > 2$ for any $n > 7$,
- the BIC statistic places a heavier penalty on models with many variables
- and hence results in the selection of smaller models than $C_p$
# Adjusted $R^2$
For a least squares model with $d$ variables, the adjusted $R^2$ statistic is calculated as
$$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}$$
where TSS is the total sum of squares
- Maximizing the adjusted $R^2$ is equivalent to minimizing $\mathrm{RSS}/(n - d - 1)$
- RSS always decreases as the number of variables in the model increases
- $\mathrm{RSS}/(n - d - 1)$ may increase or decrease, due to the presence of $d$ in the denominator
- Unlike the $R^2$ statistic, the adjusted $R^2$ statistic pays a price for the inclusion of unnecessary variables in the model
- Unlike $C_p$, AIC, and BIC, for which a small value indicates a model with a low test error, a large value of adjusted $R^2$ indicates a model with a small test error (a small helper computing all four criteria follows below)
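A small helper collecting the four criteria above, given the RSS of a fitted model with $d$ predictors, the total sum of squares, the sample size $n$, and an estimate $\hat{\sigma}^2$ of the error variance from the full model. The AIC line uses the Gaussian-linear-model form, which is proportional to $C_p$ and equivalent to $-2\log L + 2d$ up to constants for the purpose of ranking models.

```python
# Helper computing Cp, AIC, BIC, and adjusted R^2 for a least squares model with
# d predictors, given its RSS, the total sum of squares TSS, the sample size n,
# and sigma2, an estimate of Var(eps) from the full model.
import numpy as np

def selection_criteria(rss, tss, n, d, sigma2):
    cp = (rss + 2 * d * sigma2) / n
    aic = (rss + 2 * d * sigma2) / (n * sigma2)   # Gaussian case; ranks models like Cp
    bic = (rss + np.log(n) * d * sigma2) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return {"Cp": cp, "AIC": aic, "BIC": bic, "adj_R2": adj_r2}
```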
# Validation and cross-validation
Each of the procedures above returns a sequence of models $\mathcal{M}_k$ indexed by model size $k = 0, 1, 2, \ldots$
We compute the validation set error or the cross-validation error for each model $\mathcal{M}_k$ under consideration, and then select the $k$ for which the resulting estimated test error is smallest.
✅ This procedure has an advantage relative to $C_p$, AIC, BIC, and adjusted $R^2$, in that
- it provides a direct estimate of the test error
- It can also be used in a wider range of model selection tasks, even in cases where it is hard to pinpoint the model degrees of freedom (e.g. the number of predictors in the model) or hard to estimate the error variance
❎ It requires more computation than the indirect approaches
# One-standard-error rule
All three approaches suggest that the four-, five-, and six-variable models are roughly equivalent in terms of their test errors
- We first calculate the standard error of the estimated test MSE for each model size
- We then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve (a small sketch follows below)
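A minimal sketch of the rule, assuming a hypothetical array `cv_errors` of per-fold test MSEs with one row per model size (smallest model first):

```python
# One-standard-error rule: among the candidate model sizes, pick the smallest one whose
# estimated test error is within one standard error of the minimum of the CV curve.
import numpy as np

def one_se_choice(cv_errors):
    # cv_errors: array of shape (n_model_sizes, n_folds), rows ordered from the
    # smallest model to the largest (assumed layout for this sketch)
    mean_err = cv_errors.mean(axis=1)
    se_err = cv_errors.std(axis=1, ddof=1) / np.sqrt(cv_errors.shape[1])
    best = int(np.argmin(mean_err))               # lowest point on the curve
    threshold = mean_err[best] + se_err[best]
    return int(np.argmax(mean_err <= threshold))  # first (smallest) size within one SE
```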
[Figure: selecting the number of model parameters using cross-validation and the one-standard-error rule]
# Shrinkage Methods
- we can fit a model containing all $p$ predictors using a technique that constrains or regularizes the coefficient estimates
- shrinking the coefficient estimates can significantly reduce their variance
# Ridge regression
Recall that the least squares fitting procedure estimates $\beta_0, \beta_1, \ldots, \beta_p$ using the values that minimize
$$\mathrm{RSS} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2$$
In contrast, the ridge regression coefficient estimates $\hat{\beta}^R_\lambda$ are the values that minimize
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 = \mathrm{RSS} + \lambda \sum_{j=1}^{p}\beta_j^2$$
where $\lambda \geq 0$ is a tuning parameter
- By making RSS small, ridge regression seeks coefficient estimates that fit the data well
- $\lambda \sum_{j=1}^{p}\beta_j^2$, called a shrinkage penalty, is small when $\beta_1, \ldots, \beta_p$ are close to zero, so it has the effect of shrinking the estimates towards zero
- The tuning parameter $\lambda$ controls the relative impact of these two terms on the regression coefficient estimates. Selecting a good $\lambda$ is crucial, and is usually done via cross-validation (see the sketch below)
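A minimal ridge sketch with scikit-learn, where $\lambda$ (called `alpha` there) is chosen by cross-validation via `RidgeCV`; the predictors are standardized first because the shrinkage penalty is not scale-invariant. The data are synthetic.

```python
# Ridge regression with lambda (alpha in scikit-learn) chosen by cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=30, noise=10.0, random_state=0)

# standardize first: the shrinkage penalty is not invariant to the scale of the predictors
ridge = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5)).fit(X, y)
print("selected lambda:", ridge[-1].alpha_)
print("nonzero coefficients:", np.sum(ridge[-1].coef_ != 0), "of", ridge[-1].coef_.size)
```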
# The Lasso (Least Absolute Shrinkage and Selection Operator)
Since an obvious disadvantage of ridge regression is that it includes all $p$ predictors in the final model, the lasso was proposed to overcome this problem.
The lasso is a relatively recent alternative to ridge regression. The lasso coefficients, $\hat{\beta}^L_\lambda$, minimize the quantity
$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda \sum_{j=1}^{p}|\beta_j| = \mathrm{RSS} + \lambda \sum_{j=1}^{p}|\beta_j|$$
In statistical parlance, the lasso uses an $\ell_1$ norm (pronounced “ell 1”) penalty instead of an $\ell_2$ penalty; the $\ell_1$ norm of a coefficient vector $\beta$ is $\lVert\beta\rVert_1 = \sum_{j}|\beta_j|$.
- As with ridge regression, the lasso shrinks the coefficient estimates towards zero
- However, in the case of the lasso, the $\ell_1$ penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter $\lambda$ is sufficiently large. Hence, the lasso performs variable selection
- The lasso therefore yields sparse models, i.e. models that involve only a subset of the variables
As in ridge regression, selecting a good value of $\lambda$ for the lasso is critical; cross-validation is again the method of choice (see the sketch below).
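The corresponding lasso sketch with `LassoCV`: cross-validation picks $\lambda$, and the fitted coefficient vector is sparse, with some entries exactly zero. Again, the data are synthetic.

```python
# Lasso with lambda chosen by cross-validation: the l1 penalty zeroes out some
# coefficients, so the fitted model is sparse.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=30, n_informative=5,
                       noise=10.0, random_state=0)

lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0)).fit(X, y)
coef = lasso[-1].coef_
print("selected lambda:", lasso[-1].alpha_)
print("nonzero coefficients:", np.sum(coef != 0), "of", coef.size)
```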
Why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly equal to zero?
- One can show that the lasso and ridge regression coefficient estimates solve the problems
$$\min_{\beta}\ \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| \leq s$$
and
$$\min_{\beta}\ \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 \leq s,$$
respectively (the penalized and constrained forms are linked by a Lagrange multiplier)
- Best subset selection can be viewed as the analogous problem with the constraint $\sum_{j=1}^{p}\mathbb{1}(\beta_j \neq 0) \leq s$ (ESL, ch. 3)