# Notation

  • $X$: predictor variable (feature)
    • We can refer to the input vector collectively as $X = (X_1, X_2, ..., X_p)$
    • Vectors are represented as column vectors; e.g., for the first observation,

    $$X = \begin{pmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1p} \end{pmatrix}$$

  • $Y$: response variable (target)
    • Usually a scalar. If we have $n$ observations, we can represent $Y$ as a vector

      $$Y = \begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{pmatrix}$$

    • We write our model as

      $$Y = f(X) + \epsilon$$

      where $\epsilon$ captures measurement errors and other discrepancies, and has mean zero.

# How to estimate $f$?

The distribution $g$ that generated the data is unknown, so $f$ must be estimated from a sample.

  1. Choose a model for $f$
    • Parametric: Assumes a specific form for $f$, estimating a fixed set of parameters by fitting or training.
    • Non-parametric: Does not assume a specific form, but needs a larger amount of data.
  2. Choose a quality measure
    • Loss function: measures how well a model fits the data.
    • Common loss functions:
      • Mean Squared Error (MSE): for regression problems

        $$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

      • Cross-Entropy Loss: for classification problems

        $$L = -\sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$

  3. Optimize (fitting)
    • Find the best parameters $\theta$
    • Use calculus to find a closed-form solution, gradient descent, the expectation-maximization (EM) algorithm, etc.
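The two loss functions above can be sketched in NumPy; a minimal illustration (the function names and sample values are our own):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of squared residuals (regression)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y_onehot, p_hat, eps=1e-12):
    """Cross-entropy loss for classification.
    y_onehot: (n, C) one-hot labels; p_hat: (n, C) predicted probabilities."""
    p_hat = np.clip(p_hat, eps, 1.0)   # avoid log(0)
    return -np.sum(y_onehot * np.log(p_hat))

print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))   # ≈ 0.1667
```

Note that cross-entropy is zero only when every predicted probability on the true class equals 1.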

# Nearest Neighbor Averaging

Since there are typically few (if any) data points with $X$ exactly equal to $x$, we cannot compute $E(Y|X=x)$ directly.
Instead, we can let $\hat{f}(x) = \mathrm{Ave}(Y|X \in N(x))$, where $N(x)$ is some neighborhood of $x$.
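A minimal sketch of neighborhood averaging with a one-dimensional predictor, taking $N(x)$ to be the $k$ nearest training points (the `knn_average` helper and the toy data are illustrative):

```python
import numpy as np

def knn_average(x0, X, y, k=3):
    """Estimate f(x0) as the average of y over the k nearest training points,
    i.e. f_hat(x0) = Ave(y_i : x_i in N(x0))."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    dist = np.abs(X - x0)            # distances for a 1-D predictor
    nearest = np.argsort(dist)[:k]   # indices of the k nearest neighbors
    return y[nearest].mean()

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
print(knn_average(2.5, X, y, k=2))   # averages y at x = 2 and x = 3
```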

Curse of dimensionality: As the number of features increases, the volume of the feature space increases exponentially, making data points sparse. This sparsity makes it difficult to find nearby neighbors, reducing the effectiveness of nearest neighbor methods.

# Parametric and structured models

The linear model is an important example of a parametric model:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon$$

  • Specified in terms of $p+1$ parameters: $\beta_0, \beta_1, ..., \beta_p$
  • We estimate the parameters by fitting the model to the training data.
  • It often serves as a good and interpretable approximation to the true function f(X)f(X).
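As an illustration, the parameters can be estimated by ordinary least squares on simulated data (the coefficient values and noise level below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0])            # intercept and two slopes
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.5, size=n)

# Fit by ordinary least squares: prepend an intercept column and solve.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)   # close to beta_true
```

With 200 observations and modest noise, the estimates recover the true coefficients to within a few hundredths.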

# Some tradeoffs

  • Prediction accuracy vs. interpretability
    Linear models are more interpretable; thin-plate splines are not
  • Good fit vs. overfit
  • Parsimony vs. black box
    Prefer a simpler model over a black-box predictor

# Assessing model accuracy

Suppose we fit a model $\hat{f}$ using training data $Tr = \{x_i, y_i\},\ i = 1, ..., n_1$, and we have a separate test data set $Te = \{x_j, y_j\},\ j = 1, ..., n_2$.

  • Training MSE:

    $$MSE_{tr} = \frac{1}{n_1} \sum_{i=1}^{n_1} (y_i - \hat{f}(x_i))^2$$

  • Test MSE:

    $$MSE_{te} = \frac{1}{n_2} \sum_{j=1}^{n_2} (y_j - \hat{f}(x_j))^2$$

    Notes on the three simulated examples: a relatively complex, roughly linear model; a cubic relationship with small $\epsilon$, where even an overfit hugs the black curve, so a bend appears only at very high degree; and a linear relationship with moderate $\epsilon$.
  • Interpretation of the curves:

    • Black curve: true function
    • Red curve: $MSE_{te}$ (test MSE)
    • Grey curve: $MSE_{tr}$ (training MSE)
    • Orange, blue, green curves: fits of different flexibility
  • As model flexibility increases:

    • $MSE_{tr}$ always decreases
    • $MSE_{te}$ decreases initially, then increases (U-shape)
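This behavior can be reproduced with polynomial fits of increasing degree on simulated data (the true function and noise level are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * x)                       # assumed true function
n = 100
x_tr = rng.uniform(0, 3, n); y_tr = f(x_tr) + rng.normal(scale=0.2, size=n)
x_te = rng.uniform(0, 3, n); y_te = f(x_te) + rng.normal(scale=0.2, size=n)

results = {}
for deg in [1, 3, 10]:                            # increasing flexibility
    coef = np.polyfit(x_tr, y_tr, deg)            # least-squares polynomial fit
    results[deg] = (np.mean((y_tr - np.polyval(coef, x_tr)) ** 2),   # train MSE
                    np.mean((y_te - np.polyval(coef, x_te)) ** 2))   # test MSE
    print(f"degree {deg:2d}: train MSE {results[deg][0]:.3f}, "
          f"test MSE {results[deg][1]:.3f}")
```

Training MSE can only decrease as the degree grows, because the polynomial classes are nested; test MSE typically falls and then rises again.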

# Bias-Variance Tradeoff

Suppose we have fit a model $\hat{f}$ to some training data $Tr$, and let $(x_0, y_0)$ be a test observation drawn from the population.
The expected test MSE at a point $x_0$ can be decomposed into three fundamental components:

$$E[(y_0 - \hat{f}(x_0))^2] = Bias_{Tr}[\hat{f}(x_0, Tr)]^2 + Var_{Tr}[\hat{f}(x_0, Tr)] + Var(\epsilon)$$

Note that $Bias_{Tr}[\hat{f}(x_0, Tr)] = E_{Tr}[\hat{f}(x_0, Tr)] - f(x_0)$.

As model flexibility (complexity) increases, bias decreases and variance increases.
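The decomposition can be checked by Monte Carlo simulation: repeatedly redraw the training set, refit a deliberately rigid (linear) model, and compare the expected test MSE at $x_0$ with the three components (the true function and setup values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: x ** 2                 # assumed true f for illustration
sigma, x0, n, reps = 0.5, 1.0, 30, 2000

preds = np.empty(reps)
for r in range(reps):                # a fresh training set each repetition
    x = rng.uniform(-2, 2, n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    coef = np.polyfit(x, y, 1)       # deliberately too rigid: a linear fit
    preds[r] = np.polyval(coef, x0)  # prediction at the test point x0

bias2 = (preds.mean() - f(x0)) ** 2  # squared bias over training sets
var = preds.var()                    # variance over training sets
y0 = f(x0) + rng.normal(scale=sigma, size=reps)   # test responses at x0
emse = np.mean((y0 - preds) ** 2)    # expected test MSE at x0
print(bias2, var, sigma**2, emse)    # emse ≈ bias2 + var + sigma^2
```

Because the linear model cannot represent $x^2$, the bias term is clearly nonzero, and the three components sum to the expected test MSE up to Monte Carlo error.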

# Tradeoff for three examples

To obtain a good fitted model, select the one with the smallest test MSE.