# Why resampling?

  • These methods refit a model of interest by sampling from the training set, in order to obtain additional information about the fitted model
  • Can be computationally expensive, because they involve fitting the same statistical method multiple times using different subsets of the training data (though modern computing power makes this less of a limitation)

# Cross-validation and the bootstrap

We discuss two resampling methods: cross-validation and the bootstrap. They provide

  • estimates of test-set prediction error
  • cross-validation can be used to select the appropriate level of flexibility

    A hyperparameter is a parameter whose value is used to control the learning process and cannot be inferred while fitting the model to the training set

    • Model hyperparameters
    • Algorithm hyperparameters
  • The bootstrap helps us obtain the standard error of our parameter estimates; it is also used to conduct ensemble learning (e.g., bagging)

# Validation-set approach

We randomly divide the available set of samples into two parts: a training set and a validation (hold-out) set

  • The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set
  • The resulting validation-set error provides an estimate of the test error; this estimate can be used to select the best model and to give an idea of the test error of the final chosen model (see the sketch after this list)
    In plain terms: randomly split the sample into a training set and a validation set, fit candidate models on the training set, and predict the validation observations; the resulting error estimates the test error
  • Drawbacks
    • The validation estimate of the test error can be highly variable
    • only a subset of the observations (those included in the training set) are used to fit the model
    • the validation-set error may therefore overestimate the test error for the model fit on the entire data set, since statistical methods tend to perform worse when trained on fewer observations
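
A minimal sketch of the validation-set approach, using scikit-learn and simulated data (the linear model and the 50/50 split are illustrative assumptions, not part of the notes):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Simulated data; any (X, y) sample would do
X = rng.normal(size=(200, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=200)

# Randomly divide into a training set and a validation (hold-out) set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Fit on the training set, then predict the validation responses
model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(val_mse)  # validation-set estimate of the test error
```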

# Training versus testing-set performance

Recall the distinction between the test error and the training error:

  • Test error: the average error that results from using a statistical learning method to predict the response on a new observation
  • Training error: can be easily calculated by applying the statistical learning method to the observations used in its training; it typically underestimates the test error

# More on prediction-error estimates

The best solution: a large designated test set. Often not available

  1. make a mathematical adjustment to the training error rate in order to estimate the test error rate

    These include the $C_p$ statistic, AIC, and BIC

  2. a class of methods that estimate the test error by holding out a subset of the training observations from the fitting process and then applying the statistical learning method to those held-out observations

# Leave-One-Out Cross-Validation (LOOCV)

A single observation $(x_1, y_1)$ is used for the validation set, and the remaining observations make up the training set; this is done in turn for each observation $i = 1, \dots, n$

$CV_{(n)}=\frac{1}{n}\sum_{i=1}^n MSE_i$, where $MSE_i=(y_i-\hat{y}_i)^2$
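
A minimal LOOCV sketch with scikit-learn's LeaveOneOut splitter; the linear model and simulated data are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=100)

# Each of the n splits holds out a single observation (x_i, y_i)
scores = cross_val_score(
    LinearRegression(), X, y,
    cv=LeaveOneOut(), scoring="neg_mean_squared_error",
)
cv_n = -scores.mean()  # CV_(n): average of the n MSE_i values
print(cv_n)
```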

# K-fold cross-validation (Widely used approach for estimating test error)

The idea is to randomly divide the data into $k$ equal-sized parts.

  1. We leave out part $i$ and fit the model to the other $k-1$ parts
  2. We obtain predictions for the left-out $i$th part; this is done in turn for each part $i = 1, 2, \dots, k$
  3. The results are combined

# The details

Let the $k$ parts be $C_1, C_2, \dots, C_k$.
There are $n_i$ observations in part $i$.

$CV_{(k)}=\sum_{i=1}^k \frac{n_i}{n} MSE_i$

  • $MSE_i=\frac{1}{n_i}\sum_{j\in C_i}(y_j-\hat{y}_j)^2$, where $\hat{y}_j$ is the fit for observation $j$, obtained with part $i$ removed
  • Setting $k=n$ yields $n$-fold or leave-one-out cross-validation (LOOCV)

    One typically performs $k$-fold CV using $k=5$ or $k=10$; a sketch with $k=5$ follows
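
A sketch of $k$-fold CV with $k=5$; the model and data are again illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=100)

# Randomly divide the data into k = 5 parts; each part is held out
# once while the model is fit to the other k - 1 parts
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LinearRegression(), X, y, cv=kf, scoring="neg_mean_squared_error"
)
cv_k = -scores.mean()  # combine the per-fold MSE_i into CV_(k)
print(cv_k)
```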

# A nice special case

With least-squares linear or polynomial regression, a shortcut makes the cost of LOOCV the same as that of a single model fit:

$CV_{(n)}=\frac{1}{n}\sum_{i=1}^n \left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2$

Where

  • $\hat{y}_i$ is the $i$th fitted value from the original least-squares fit
  • $h_i$ is the leverage
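
A sketch verifying the shortcut against brute-force LOOCV for a simple least-squares fit (simulated data, numpy only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x])

# Single least-squares fit on the full data
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Leverages h_i: diagonal of the hat matrix H = X (X'X)^{-1} X'
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

# LOOCV via the shortcut formula: no refitting needed
cv_shortcut = np.mean((resid / (1.0 - h)) ** 2)

# Brute-force LOOCV for comparison: refit n times
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ b) ** 2)

print(cv_shortcut, np.mean(errs))  # the two estimates agree
```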

LOOCV is sometimes useful but typically doesn’t shake up the data enough

  • still time-consuming
  • The estimates from each fold are highly correlated and hence their average can have a high variance

# Bias-variance trade-off for cross-validation

  • Since each training set is only $(k-1)/k$ as big as the original training set, the estimates of prediction error will typically be biased upward
  • This bias is minimized when $k=n$ (LOOCV), but this estimate has high variance; $k=5$ or $k=10$ provides a good compromise

# The Bootstrap

The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method
For example, it can provide

  • an estimate of the standard error of a coefficient
  • a confidence interval for that coefficient

# Bootstrap of $\alpha$

  • Denoting the first bootstrap data set by $Z^{*1}$, we use $Z^{*1}$ to produce a new bootstrap estimate for $\alpha$, which we call $\hat{\alpha}^{*1}$
  • This procedure is repeated $B$ times for some large value of $B$ (say 1000), in order to produce $B$ different bootstrap data sets, $Z^{*1}, Z^{*2}, \dots, Z^{*B}$, and $B$ corresponding estimates,

    $\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, \dots, \hat{\alpha}^{*B}$

  • We estimate the standard error of these bootstrap estimates using the formula

    $SE_B(\hat{\alpha})=\sqrt{\frac{1}{B-1}\sum_{r=1}^B\left(\hat{\alpha}^{*r}-\frac{1}{B}\sum_{r'=1}^B\hat{\alpha}^{*r'}\right)^2}$
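
A minimal sketch of this procedure; since the notes don't define $\alpha$, the sample mean stands in as a hypothetical estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original data set Z
Z = rng.normal(loc=5.0, scale=2.0, size=100)

def alpha_hat(z):
    return z.mean()  # stand-in estimator; alpha is not defined in the notes

B = 1000  # number of bootstrap data sets
boot = np.empty(B)
for r in range(B):
    # Z*r: draw n observations from Z with replacement
    Z_star = rng.choice(Z, size=Z.size, replace=True)
    boot[r] = alpha_hat(Z_star)

# SE_B: sample standard deviation of the B bootstrap estimates
se_boot = boot.std(ddof=1)
print(se_boot)  # close to the true SE, 2 / sqrt(100) = 0.2
```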

# Other uses of the bootstrap

  • Primarily used to obtain standard errors of an estimate, the bootstrap also provides approximate confidence intervals for a population parameter

# Bootstrap for confidence interval

  1. Generate $n$ “bootstrap sample” data points $(x_i^*, y_i^*)$ by sampling with replacement
  2. Fit a linear regression using $(x_i^*, y_i^*)$
  3. Evaluate the regression line on a fixed $x$-grid
  4. Repeat steps 1-3 $B$ times and collect the values from step 3
  5. For each point in the $x$-grid, calculate the confidence interval from the collected values (e.g., using empirical percentiles); a sketch follows
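
A sketch of steps 1-5 with simulated data; the grid, $B$, and the 95% percentile interval are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy (x, y) data; any sample would serve
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=n)

x_grid = np.linspace(0, 10, 50)  # fixed x-grid for step 3
B = 1000
lines = np.empty((B, x_grid.size))

for b in range(B):
    # Step 1: resample n (x_i*, y_i*) pairs with replacement
    idx = rng.integers(0, n, size=n)
    # Step 2: fit a linear regression to the bootstrap sample
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    # Step 3: evaluate the fitted line on the grid
    lines[b] = intercept + slope * x_grid

# Step 5: pointwise 95% interval from the empirical percentiles
lower = np.percentile(lines, 2.5, axis=0)
upper = np.percentile(lines, 97.5, axis=0)
print(lower[:3], upper[:3])
```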

# The bootstrap in general

In more complex data situations, for example

  • if the data is a time series, we can’t simply sample the observations with replacement (they are not i.i.d.)
  • We can instead create blocks of consecutive observations and sample those with replacement; we then paste the sampled blocks together to obtain a bootstrap data set (see the sketch below)
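
A sketch of this block bootstrap; the series and the block length are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy time series (a random walk); real use would supply actual data
series = np.cumsum(rng.normal(size=120))

block_len = 10  # assumed block length; a tuning choice
n_blocks = series.size // block_len

# Create consecutive, non-overlapping blocks of observations
blocks = series[: n_blocks * block_len].reshape(n_blocks, block_len)

# Sample blocks with replacement, then paste them together
chosen = rng.integers(0, n_blocks, size=n_blocks)
boot_series = blocks[chosen].reshape(-1)
print(boot_series[:5])
```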

# Can the bootstrap estimate prediction error?

In cross-validation, each validation fold is distinct from the folds used for training: there is no overlap. This is crucial for its success.

  • To estimate prediction error using the bootstrap,
    • we could think about using each bootstrap data set as our training sample and the original sample as our validation sample
    • But each bootstrap sample has significant overlap with the original data: about two-thirds of the original data points appear in each bootstrap sample, causing the bootstrap to underestimate the true prediction error
  • Can partly fix this problem by only using predictions for those observations that did not (by chance) occur in the current bootstrap sample (a sketch of this out-of-bag idea follows this list)
  • In the end, cross-validation provides a simpler, more attractive approach for estimating prediction error
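
A sketch of the out-of-bag fix mentioned above: for each bootstrap fit, predictions are kept only for observations that the resampling happened to leave out (model and data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

B = 200
sq_errs = [[] for _ in range(n)]  # per-observation squared errors

for b in range(B):
    idx = rng.integers(0, n, size=n)       # bootstrap training sample
    oob = np.setdiff1d(np.arange(n), idx)  # observations left out by chance
    model = LinearRegression().fit(X[idx], y[idx])
    for i, p in zip(oob, model.predict(X[oob])):
        sq_errs[i].append((y[i] - p) ** 2)

# Average each observation's out-of-bag errors, then average over observations
oob_mse = np.mean([np.mean(e) for e in sq_errs if e])
print(oob_mse)
```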