# Notation

  • $X$: predictor variable (feature)
    • We can refer to the input vector collectively as $X = (X_1, X_2, ..., X_p)$
    • Vectors are represented as column vectors; e.g., for the first observation,

    $$X = \begin{pmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1p} \end{pmatrix}$$

  • $Y$: response variable (target)
    • Usually a scalar. If we have $n$ observations, we can represent $Y$ as a vector

      $$Y = \begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{pmatrix}$$

    • We write our model as

      $$Y = f(X) + \epsilon$$

      where $\epsilon$ captures measurement errors and other discrepancies, and has mean zero.

# How to estimate $f$?

The distribution $g$ that generated the data is unknown, so $f$ must be estimated from a sample.

  1. Choose a model for $f$
    • Parametric: Assumes a specific form for $f$, estimating a fixed set of parameters by fitting or training.
    • Non-parametric: Does not assume a specific form, but needs a larger amount of data.
  2. Choose a quality measure
    • Loss function: measures how well a model fits the data.
    • Common loss functions:
      • Mean Squared Error (MSE): for regression problems

        $$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

      • Cross-Entropy Loss: for classification problems

        $$L = -\sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$

  3. Optimize (fitting)
    • Find the best parameters $\theta$
    • Use calculus to find a closed-form solution, gradient descent, the expectation-maximization (EM) algorithm, etc.
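The two loss functions above can be sketched in NumPy; a minimal illustration (the function names and sample values are our own):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of squared residuals (regression)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y_onehot, p_hat, eps=1e-12):
    """Cross-entropy loss for classification.
    y_onehot: (n, C) one-hot labels; p_hat: (n, C) predicted probabilities."""
    p_hat = np.clip(p_hat, eps, 1.0)   # avoid log(0)
    return -np.sum(y_onehot * np.log(p_hat))

print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))   # ≈ 0.1667
```

Note that cross-entropy is zero only when every predicted probability on the true class equals 1.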

# Nearest Neighbor Averaging

Since there are typically few (if any) data points with $X$ exactly equal to $x$, we cannot compute $E(Y|X=x)$ directly.
Instead, we can let $\hat{f}(x) = \mathrm{Ave}(Y|X \in N(x))$, where $N(x)$ is some neighborhood of $x$.
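A minimal sketch of neighborhood averaging with a one-dimensional predictor, taking $N(x)$ to be the $k$ nearest training points (the `knn_average` helper and the toy data are illustrative):

```python
import numpy as np

def knn_average(x0, X, y, k=3):
    """Estimate f(x0) as the average of y over the k nearest training points,
    i.e. f_hat(x0) = Ave(y_i : x_i in N(x0))."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    dist = np.abs(X - x0)            # distances for a 1-D predictor
    nearest = np.argsort(dist)[:k]   # indices of the k nearest neighbors
    return y[nearest].mean()

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
print(knn_average(2.5, X, y, k=2))   # averages y at x = 2 and x = 3
```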

Curse of dimensionality: As the number of features increases, the volume of the feature space increases exponentially, making data points sparse. This sparsity makes it difficult to find nearby neighbors, reducing the effectiveness of nearest neighbor methods.

# Parametric and structured models

The linear model is an important example of a parametric model:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon$$

  • Specified in terms of $p+1$ parameters: $\beta_0, \beta_1, ..., \beta_p$
  • We estimate the parameters by fitting the model to the training data.
  • It often serves as a good and interpretable approximation to the true function f(X)f(X).
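As an illustration, the parameters can be estimated by ordinary least squares on simulated data (the coefficient values and noise level below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0])            # intercept and two slopes
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.5, size=n)

# Fit by ordinary least squares: prepend an intercept column and solve.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)   # close to beta_true
```

With 200 observations and modest noise, the estimates recover the true coefficients to within a few hundredths.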

# Some tradeoffs

  • Prediction accuracy vs. interpretability
    Linear models are more interpretable; thin-plate splines are not
  • Good fit vs. overfit
  • Parsimony vs. black box
    Prefer a simpler model over a black-box predictor

# Assessing model accuracy

Suppose we fit a model $\hat{f}$ using training data $Tr = \{x_i, y_i\},\ i = 1, ..., n_1$, and we have a separate test data set $Te = \{x_j, y_j\},\ j = 1, ..., n_2$.

  • Training MSE:

    $$MSE_{tr} = \frac{1}{n_1} \sum_{i=1}^{n_1} (y_i - \hat{f}(x_i))^2$$

  • Test MSE:

    $$MSE_{te} = \frac{1}{n_2} \sum_{j=1}^{n_2} (y_j - \hat{f}(x_j))^2$$

    Notes on the three simulated examples: a relatively complex, roughly linear model; a cubic relationship with small $\epsilon$, where even an overfit hugs the black curve, so a bend appears only at very high degree; and a linear relationship with moderate $\epsilon$.
  • Interpretation of the curves:

    • Black curve: true function
    • Red curve: $MSE_{te}$ (test MSE)
    • Grey curve: $MSE_{tr}$ (training MSE)
    • Orange, blue, green curves: fits of different flexibility
  • As model flexibility increases:

    • $MSE_{tr}$ always decreases
    • $MSE_{te}$ decreases initially, then increases (U-shape)
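This behavior can be reproduced with polynomial fits of increasing degree on simulated data (the true function and noise level are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * x)                       # assumed true function
n = 100
x_tr = rng.uniform(0, 3, n); y_tr = f(x_tr) + rng.normal(scale=0.2, size=n)
x_te = rng.uniform(0, 3, n); y_te = f(x_te) + rng.normal(scale=0.2, size=n)

results = {}
for deg in [1, 3, 10]:                            # increasing flexibility
    coef = np.polyfit(x_tr, y_tr, deg)            # least-squares polynomial fit
    results[deg] = (np.mean((y_tr - np.polyval(coef, x_tr)) ** 2),   # train MSE
                    np.mean((y_te - np.polyval(coef, x_te)) ** 2))   # test MSE
    print(f"degree {deg:2d}: train MSE {results[deg][0]:.3f}, "
          f"test MSE {results[deg][1]:.3f}")
```

Training MSE can only decrease as the degree grows, because the polynomial classes are nested; test MSE typically falls and then rises again.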

# Bias-Variance Tradeoff

Suppose we have fit a model $\hat{f}$ to some training data $Tr$, and let $(x_0, y_0)$ be a test observation drawn from the population.
The expected test MSE at a point $x_0$ can be decomposed into three fundamental components:

$$E[(y_0 - \hat{f}(x_0))^2] = Bias_{Tr}[\hat{f}(x_0, Tr)]^2 + Var_{Tr}[\hat{f}(x_0, Tr)] + Var(\epsilon)$$

Note that $Bias_{Tr}[\hat{f}(x_0, Tr)] = E_{Tr}[\hat{f}(x_0, Tr)] - f(x_0)$.

As model flexibility (complexity) increases, bias decreases and variance increases.
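The decomposition can be checked by Monte Carlo simulation: repeatedly redraw the training set, refit a deliberately rigid (linear) model, and compare the expected test MSE at $x_0$ with the three components (the true function and setup values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: x ** 2                 # assumed true f for illustration
sigma, x0, n, reps = 0.5, 1.0, 30, 2000

preds = np.empty(reps)
for r in range(reps):                # a fresh training set each repetition
    x = rng.uniform(-2, 2, n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    coef = np.polyfit(x, y, 1)       # deliberately too rigid: a linear fit
    preds[r] = np.polyval(coef, x0)  # prediction at the test point x0

bias2 = (preds.mean() - f(x0)) ** 2  # squared bias over training sets
var = preds.var()                    # variance over training sets
y0 = f(x0) + rng.normal(scale=sigma, size=reps)   # test responses at x0
emse = np.mean((y0 - preds) ** 2)    # expected test MSE at x0
print(bias2, var, sigma**2, emse)    # emse ≈ bias2 + var + sigma^2
```

Because the linear model cannot represent $x^2$, the bias term is clearly nonzero, and the three components sum to the expected test MSE up to Monte Carlo error.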

# Tradeoff for three examples

To obtain a good fitted model, select the one with the smallest test MSE.