# Linear Regression

Linear regression is a simple supervised learning approach that assumes the dependence of $Y$ on $X_1, X_2, \dots, X_p$ is linear.
Although it may seem overly simplistic, linear regression is extremely useful both conceptually and in practice.

# Simple Linear Regression

We assume a model $Y = \beta_0 + \beta_1 X + \epsilon$, where $\beta_0$ and $\beta_1$ are unknown constants representing the intercept and slope, and $\epsilon$ is the error term, assumed to be i.i.d. and normally distributed (the "LINE" assumptions: Linearity, Independent errors, Normal errors, Equal variance).

# Estimation of the Parameters

  • Let $\hat y_{i} = \hat \beta_0 + \hat \beta_1 x_i$ be the prediction of $Y$ when $X = x_i$. Then the residual is defined as $e_i = y_i - \hat y_i$.
  • Residual sum of squares: $RSS = e_1^2 + e_2^2 + \dots + e_n^2 = \sum_{i=1}^{n} (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2$
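
Minimizing RSS with respect to $\hat \beta_0$ and $\hat \beta_1$ yields the closed-form least-squares estimates:

$\displaystyle \hat \beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}, \qquad \hat \beta_0 = \bar y - \hat \beta_1 \bar x$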

# Assessing the Accuracy of the Coefficient Estimates

  1. Generate data points (black dots) from the red line $Y = 2 + 3X + \epsilon$
  2. Fit a linear regression to those data points to obtain the blue line
  3. Repeat steps 1 and 2 several times; the spread of the resulting fitted lines corresponds to the confidence interval

To quantify this sampling variability:
The standard error of an estimate reflects how much it varies under repeated sampling.

  • $\displaystyle SE(\hat \beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}$
  • $\displaystyle SE(\hat \beta_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}\right]$
    Where
    • $\sigma^2 = Var(\epsilon)$
    • $\displaystyle \hat \sigma^2 = \frac{RSS}{n-2} = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat y_i)^2$

These standard errors can be used to compute confidence intervals; for example, $\hat \beta_1 \pm 2 \cdot SE(\hat \beta_1)$ is approximately a 95% confidence interval for $\beta_1$.
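
The following Python sketch illustrates these formulas on simulated data, reusing the true line $Y = 2 + 3X + \epsilon$ from the simulation above (the sample size, noise level, and the 2-SE rule of thumb are choices made here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from the true line Y = 2 + 3X + eps (noise level chosen for illustration)
n = 100
x = rng.uniform(-2, 2, size=n)
y = 2 + 3 * x + rng.normal(scale=1.0, size=n)

# Least-squares estimates of the slope and intercept
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# sigma^2 is estimated by RSS / (n - 2)
resid = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)

# Standard error of beta1_hat and an approximate 95% CI (2-SE rule of thumb)
se_beta1 = np.sqrt(sigma2_hat / np.sum((x - x_bar) ** 2))
print(f"beta1_hat = {beta1_hat:.3f}, SE = {se_beta1:.3f}, "
      f"95% CI = ({beta1_hat - 2 * se_beta1:.3f}, {beta1_hat + 2 * se_beta1:.3f})")
```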

# Hypothesis Testing

Standard errors can also be used to perform hypothesis tests. The most common one tests
the null hypothesis $H_0$: there is no relationship between $X$ and $Y$ (i.e., $\beta_1 = 0$)
against the alternative hypothesis $H_a$: there is some relationship between $X$ and $Y$ (i.e., $\beta_1 \ne 0$).
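
To carry out this test, we compute the t-statistic

$\displaystyle t = \frac{\hat \beta_1 - 0}{SE(\hat \beta_1)},$

which follows a $t$-distribution with $n - 2$ degrees of freedom under $H_0$; a large $|t|$ (equivalently, a small $p$-value) leads us to reject $H_0$.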

# Overall Accuracy of the Model

  • The Residual Standard Error (RSE) (smaller is better):

    $\displaystyle RSE = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat y_i)^2}$

  • R-squared or fraction of variance explained (larger is better):

    $\displaystyle R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$

    Where $TSS = \sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2 + \sum_{i=1}^{n} (\hat y_i - \bar y)^2$ is the total sum of squares.
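
As a quick sketch, both quantities can be computed directly from observed responses and fitted values (the array names `y` and `y_hat` and these helper functions are ours, not from the text):

```python
import numpy as np

def rse(y: np.ndarray, y_hat: np.ndarray, p: int = 1) -> float:
    """Residual standard error; p is the number of predictors (p = 1 gives the n - 2 case above)."""
    rss = np.sum((y - y_hat) ** 2)
    return float(np.sqrt(rss / (len(y) - p - 1)))

def r_squared(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Fraction of variance explained: 1 - RSS / TSS."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return float(1 - rss / tss)
```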

# Multiple Linear Regression

Here our model is:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon$

We estimate the coefficients $\beta_j$ by minimizing the residual sum of squares:

$\displaystyle RSS = \sum_{i=1}^{n} (y_i - \hat \beta_0 - \hat \beta_1 x_{i1} - \hat \beta_2 x_{i2} - \dots - \hat \beta_p x_{ip})^2$
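
A minimal NumPy sketch of this least-squares fit in matrix form (the simulated design, true coefficients, and sample size are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated design: n observations, p predictors (sizes and coefficients chosen for illustration)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])      # intercept followed by the p slopes
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=1.0, size=n)

# Add a column of ones for the intercept and minimize RSS via least squares
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

rss = np.sum((y - X1 @ beta_hat) ** 2)
print("beta_hat:", np.round(beta_hat, 3), " RSS:", round(float(rss), 2))
```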

Is at least one of the predictors $X_1, X_2, \dots, X_p$ useful in predicting $Y$?

We can use the F-statistic:

$\displaystyle F = \frac{(TSS - RSS) / p}{RSS / (n - p - 1)} \sim F_{p,\, n - p - 1}$

Note that

  • if the linear model assumptions hold, $E\left[\frac{RSS}{n - p - 1}\right] = \sigma^2$
  • if $H_0$ holds, $E\left[\frac{TSS - RSS}{p}\right] = \sigma^2$, so an F-statistic much larger than 1 is evidence against $H_0$

Is only a subset of the predictors useful?

To test whether a particular subset of $q$ coefficients is zero, i.e., $H_0: \beta_{p - q + 1} = \beta_{p - q + 2} = \dots = \beta_p = 0$, we can use the following F-statistic:

$\displaystyle F = \frac{(RSS_{0} - RSS) / q}{RSS / (n - p - 1)}$

where $RSS_0$ comes from fitting a second model that uses all the variables except those last $q$.
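
Both F-tests above can be computed directly from residual sums of squares; here is a minimal sketch on simulated data (the choices $n = 150$, $p = 4$, $q = 2$, and the data-generating coefficients are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated data: p = 4 predictors, the last q = 2 truly irrelevant (all choices illustrative)
n, p, q = 150, 4, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def fit_rss(X_sub):
    """Fit OLS with an intercept on the given predictors and return the RSS."""
    X1 = np.column_stack([np.ones(n), X_sub])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

rss = fit_rss(X)                 # full model with all p predictors
rss0 = fit_rss(X[:, : p - q])    # reduced model dropping the last q predictors
tss = np.sum((y - y.mean()) ** 2)

# Overall F-test: H0 is beta_1 = ... = beta_p = 0
F_all = ((tss - rss) / p) / (rss / (n - p - 1))
# Partial F-test: H0 is that the last q coefficients are 0
F_sub = ((rss0 - rss) / q) / (rss / (n - p - 1))

print(f"overall F = {F_all:.2f}, p-value = {stats.f.sf(F_all, p, n - p - 1):.3g}")
print(f"partial F = {F_sub:.2f}, p-value = {stats.f.sf(F_sub, q, n - p - 1):.3g}")
```
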
How do we decide which variables are important?

  • If $p > n$: there is no unique least-squares solution, and the F-test cannot be used
  • If $p < n$ and the F-test is significant: at least one predictor is associated with the response, but we still need to determine which ones
  • The most direct approach is all subsets or best subset regression: try every possible combination of predictors and choose the one that minimizes RSS. However, since the number of combinations, $2^p$, grows exponentially with $p$, this quickly becomes computationally infeasible (a minimal sketch of the search follows this list)
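
A minimal sketch of that exhaustive best-subset search (the simulated data and the choice to compare subsets by RSS within each size are illustrative assumptions; choosing among sizes would use a criterion such as adjusted $R^2$, $C_p$, or BIC):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)

# Simulated data: p = 5 candidate predictors, only two truly useful (choices are illustrative)
n, p = 120, 5
X = rng.normal(size=(n, p))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=n)

def rss_of(subset):
    """RSS of the OLS fit (with intercept) using only the predictors in `subset`."""
    X1 = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

# For each subset size k, exhaustively search the C(p, k) combinations for the smallest RSS.
# RSS always decreases as predictors are added, so models of different sizes would then be
# compared with a criterion such as adjusted R^2, Cp, or BIC.
for k in range(1, p + 1):
    best = min(combinations(range(p), k), key=rss_of)
    print(f"size {k}: best subset {best}, RSS = {rss_of(best):.1f}")
```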

How well does the model fit the data?

  1. It can be shown that $R^2 = \mathrm{Cor}(Y, \hat Y)^2$ in this case
  2. Compute the RSE; when it is small compared to the range of $Y$, the model fits the data well
  3. How close $\hat Y$ is to $f(X)$ can be quantified with a confidence interval

Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

The prediction interval (PI) is always wider than the confidence interval (CI), because it also accounts for the error in an individual prediction.

CI: $\displaystyle \hat y \pm t_{\alpha/2}\,\hat \sigma \sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}}$

PI: $\displaystyle \hat y \pm t_{\alpha/2}\,\hat \sigma \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}}$
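
A minimal sketch computing both intervals at a new point $x^*$ for simple linear regression (the simulated data, the 95% level, and the value of `x_star` are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Simulated training data (sample size and noise level chosen for illustration)
n = 60
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(scale=2.0, size=n)

# Least-squares fit and sigma_hat
x_bar = x.mean()
b1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
b0 = y.mean() - b1 * x_bar
sigma_hat = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

# Confidence and prediction intervals at a new point x*
x_star = 7.5
y_hat = b0 + b1 * x_star
t_crit = stats.t.ppf(0.975, df=n - 2)            # t_{alpha/2} for a 95% interval
lev = 1 / n + (x_star - x_bar) ** 2 / np.sum((x - x_bar) ** 2)
ci_half = t_crit * sigma_hat * np.sqrt(lev)      # interval for the mean response
pi_half = t_crit * sigma_hat * np.sqrt(1 + lev)  # interval for an individual response
print(f"CI: {y_hat:.2f} +/- {ci_half:.2f}   PI: {y_hat:.2f} +/- {pi_half:.2f}")
```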


# Other Considerations in the Regression Model

# Qualitative Predictors

  • They take values from a discrete set
  • Also called categorical predictors or factor variables

# Potential Problems

When we fit a linear regression model to a particular data set, many problems may arise, including:

  • Model-related
    1. Non-linearity of the data -> L (the linearity assumption)
      • The linear regression model assumes that there is a straight-line relationship between the predictors and the response
      • Residual plots are a useful graphical tool for identifying non-linearity
      • The red line is a smooth fit to the residuals, which is displayed in order to make it easier to identify any trends

        The residuals exhibit a clear U-shape in the left panel, which provides a strong indication of non-linearity in the data

        The right-hand panel displays the residual plot that results from a model containing a quadratic term

        Conclusion: There appears to be little pattern in the residuals, suggesting that the quadratic term improves the fit to the data
    2. Correlated errors -> I (the independence assumption)

      An important assumption of the linear regression model is that the error terms, $\epsilon_1, \epsilon_2, \dots, \epsilon_n$, are uncorrelated. What does this mean?

      • $\epsilon_i$ being positive provides no information about the sign of $\epsilon_{i+1}$
      • As an extreme example, suppose we accidentally doubled our data and ignored the fact that we had done so.
        • Our standard errors: computed as if the sample size were $2n$
        • Our estimated parameters: the same
        • Our confidence intervals: narrower by a factor of $\sqrt{2}$!

      Plots of residuals from simulated time series data sets generated with differing levels of correlation $\rho$ between error terms at adjacent time points:

      • In the top panel, we see the residuals from a linear regression fit to data generated with uncorrelated errors
      • The center panel illustrates a moderate case in which the residuals had a correlation of 0.5
      • The residuals in the bottom panel show a clear pattern: adjacent residuals tend to take on similar values

    3. Non-constant variance of error terms -> E (the equal-variance assumption)

      It is often the case that the variances of the error terms are non-constant;
      e.g., as the value of the response increases, the variance of the error terms also increases.
      This phenomenon is called heteroscedasticity.

  • Data-related
    1. Outliers (unusual $y$)
      • An outlier is a point for which $y_i$ is far from the value predicted by the model
      • Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection
      • Observations whose studentized residuals are greater than 3 in absolute value are possible outliers
      • An outlier can cause other problems,
        e.g., inflating the RSE; since the RSE (the estimate of $\sigma$) is used to compute all confidence intervals and $p$-values, such a dramatic increase caused by a single data point can have implications for the interpretation of the final model
      • If we believe that an outlier has occurred, one solution is to simply remove the observation. However, care should be taken, since an outlier may instead indicate a deficiency with the model
    2. High leverage points (unusual $X$)

      For a simple linear regression,

      $\displaystyle h_{i} = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{j=1}^{n} (x_j - \bar x)^2}$

      • In general, the leverage statistic $h_i \in [1/n, 1]$, and the average leverage over all observations is $(p + 1)/n$
        • If a given observation has a leverage statistic substantially larger than $(p + 1)/n$, then that observation has high leverage
      • A point whose removal would greatly change the regression equation is an influential observation
        • An influential point typically has high leverage, but high-leverage points are not necessarily influential
        • Cook's distance can be used to measure the influence of an observation (see the diagnostics sketch after this list)

          $\displaystyle D_i = \frac{\sum_{j=1}^{n} (\hat y_j - \hat y_{j(i)})^2}{(p + 1)\, \hat \sigma^2} = \frac{1}{p + 1}\, t_i^2\, \frac{h_i}{1 - h_i}$

          where $\hat y_{j(i)}$ is the fitted value for observation $j$ obtained when observation $i$ is excluded from the fit, and $t_i$ is the studentized residual
    3. Collinearity

      Collinearity refers to the situation in which two or more predictor variables are closely related to one another

      • A better way to assess multicollinearity is the Variance Inflation Factor (VIF):

        $\displaystyle VIF(\hat \beta_j) = \frac{1}{1 - R_{X_{j}|X_{-j}}^2}$

        where $R_{X_{j}|X_{-j}}^2$ is the $R^2$ obtained by regressing $X_j$ on all of the other predictors (computed in the diagnostics sketch after this list)

        A VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity

      • When faced with the problem of collinearity, we can:
        • Drop one of the problematic variables since it is redundant
        • Combine the collinear variables into a single predictor
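
As referenced in the leverage and collinearity items above, here is a minimal NumPy sketch of these diagnostics (leverage, studentized residuals, Cook's distance, and VIF) computed directly from the design matrix; the simulated data, including the deliberately collinear pair of predictors, are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated design with two deliberately collinear predictors (all choices illustrative)
n, p = 100, 3
X = rng.normal(size=(n, p))
X[:, 2] = 0.95 * X[:, 1] + 0.05 * rng.normal(size=n)   # X3 nearly duplicates X2
y = 1 + X[:, 0] + X[:, 1] + rng.normal(size=n)
X1 = np.column_stack([np.ones(n), X])                  # design matrix with intercept

# Leverage: h_i is the i-th diagonal entry of the hat matrix H = X (X'X)^{-1} X'
H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
h = np.diag(H)

# Residuals, sigma_hat^2, and (internally) studentized residuals
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
e = y - X1 @ beta
sigma2 = np.sum(e ** 2) / (n - p - 1)
t = e / np.sqrt(sigma2 * (1 - h))

# Cook's distance via the leverage / studentized-residual form
D = t ** 2 * h / ((p + 1) * (1 - h))

# VIF for each predictor: regress X_j on the other predictors and take 1 / (1 - R^2)
def vif(j):
    others = np.column_stack([np.ones(n)] + [X[:, k] for k in range(p) if k != j])
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ coef
    r2 = 1 - np.sum(resid ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 / (1 - r2)

print(f"average leverage (p+1)/n = {(p + 1) / n:.3f}, max leverage = {h.max():.3f}")
print(f"largest |studentized residual| = {np.abs(t).max():.2f}, largest Cook's D = {D.max():.3f}")
print("VIFs:", [round(float(vif(j)), 1) for j in range(p)])
```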

# Linear Regression vs. K-Nearest Neighbors

We consider one of the simplest and best-known non-parametric methods: K-nearest neighbors regression (KNN regression).

  • Given a value for $K$ and a prediction point $x_0$, KNN regression first identifies the $K$ training observations that are closest to $x_0$, represented by $N_0$
  • It then estimates $f(x_0)$ using the average of all the training responses in $N_0$. In other words,

    $\displaystyle \hat f(x_0)=\frac{1}{K}\sum_{x_i\in N_0}y_i$
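
A minimal sketch of KNN regression with a single predictor (the simulated data, the sine-shaped truth, and $K = 5$ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated 1-D training data from a non-linear truth (choices are illustrative)
x_train = rng.uniform(-3, 3, size=200)
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=200)

def knn_predict(x0: float, K: int = 5) -> float:
    """Average the responses of the K training points closest to x0 (the neighbourhood N_0)."""
    idx = np.argsort(np.abs(x_train - x0))[:K]
    return float(y_train[idx].mean())

print(knn_predict(1.0), np.sin(1.0))   # KNN estimate of f(1.0) vs. the simulated truth
```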

# Generalization of the Linear Model

Methods that expand the scope of linear models and how they are fit:

  • Classification problems: logistic regression, support vector machines
  • Non-linearity: kernel smoothing, splines, and generalized additive models
  • Interactions: Tree-based methods, bagging, random forests and boosting (these also capture non-linearities)
  • Regularized fitting: Ridge regression and the lasso