# Linear Regression

Linear regression is a simple supervised learning approach that assumes the dependence of $Y$ on $X_1, X_2, \dots, X_p$ is linear.
Although it may seem overly simplistic, linear regression is extremely useful both conceptually and in practice.

# Simple Linear Regression

We assume a model $Y = \beta_0 + \beta_1 X + \epsilon$, where $\beta_0$ and $\beta_1$ are unknown constants representing the intercept and slope, and $\epsilon$ is the error term, assumed to be i.i.d. and normally distributed (the "LINE" assumptions: Linearity, Independent errors, Normal errors, Equal variance).

# Estimation of the Parameters

  • Let $\hat y_{i} = \hat \beta_0 + \hat \beta_1 x_i$ be the prediction of $Y$ when $X = x_i$. Then the residual is defined as $e_i = y_i - \hat y_i$.
  • Residual sum of squares: $RSS = e_1^2 + e_2^2 + \dots + e_n^2 = \sum_{i=1}^{n} (y_i - \hat \beta_0 - \hat \beta_1 x_i)^2$
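
Minimizing RSS with respect to $\hat \beta_0$ and $\hat \beta_1$ yields the closed-form least-squares estimates:

$\displaystyle \hat \beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}, \qquad \hat \beta_0 = \bar y - \hat \beta_1 \bar x$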

# Assessing the Accuracy of the Coefficient Estimates

  1. Generate data points (black dots) from the red line $Y = 2 + 3X + \epsilon$
  2. Fit a linear regression to those data points to obtain the blue line
  3. Repeat steps 1 and 2 several times; the spread of the resulting fitted lines corresponds to the confidence interval

To quantify this sampling variability:
The standard error of an estimate reflects how much it varies under repeated sampling.

  • $\displaystyle SE(\hat \beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}$
  • $\displaystyle SE(\hat \beta_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}\right]$
    Where
    • $\sigma^2 = Var(\epsilon)$
    • $\displaystyle \hat \sigma^2 = \frac{RSS}{n-2} = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat y_i)^2$

These standard errors can be used to compute confidence intervals; for example, $\hat \beta_1 \pm 2 \cdot SE(\hat \beta_1)$ is approximately a 95% confidence interval for $\beta_1$.
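
The following Python sketch illustrates these formulas on simulated data, reusing the true line $Y = 2 + 3X + \epsilon$ from the simulation above (the sample size, noise level, and the 2-SE rule of thumb are choices made here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from the true line Y = 2 + 3X + eps (noise level chosen for illustration)
n = 100
x = rng.uniform(-2, 2, size=n)
y = 2 + 3 * x + rng.normal(scale=1.0, size=n)

# Least-squares estimates of the slope and intercept
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# sigma^2 is estimated by RSS / (n - 2)
resid = y - (beta0_hat + beta1_hat * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)

# Standard error of beta1_hat and an approximate 95% CI (2-SE rule of thumb)
se_beta1 = np.sqrt(sigma2_hat / np.sum((x - x_bar) ** 2))
print(f"beta1_hat = {beta1_hat:.3f}, SE = {se_beta1:.3f}, "
      f"95% CI = ({beta1_hat - 2 * se_beta1:.3f}, {beta1_hat + 2 * se_beta1:.3f})")
```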

# Hypothesis Testing

Standard errors can also be used to perform hypothesis tests. The most common one tests
the null hypothesis $H_0$: there is no relationship between $X$ and $Y$ (i.e., $\beta_1 = 0$)
against the alternative hypothesis $H_a$: there is some relationship between $X$ and $Y$ (i.e., $\beta_1 \ne 0$).
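
To carry out this test, we compute the t-statistic

$\displaystyle t = \frac{\hat \beta_1 - 0}{SE(\hat \beta_1)},$

which follows a $t$-distribution with $n - 2$ degrees of freedom under $H_0$; a large $|t|$ (equivalently, a small $p$-value) leads us to reject $H_0$.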

# Overall Accuracy of the Model

  • The Residual Standard Error (RSE) (smaller is better):

    $\displaystyle RSE = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat y_i)^2}$

  • R-squared or fraction of variance explained (larger is better):

    $\displaystyle R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$

    Where $TSS = \sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2 + \sum_{i=1}^{n} (\hat y_i - \bar y)^2$ is the total sum of squares.
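
As a quick sketch, both quantities can be computed directly from observed responses and fitted values (the array names `y` and `y_hat` and these helper functions are ours, not from the text):

```python
import numpy as np

def rse(y: np.ndarray, y_hat: np.ndarray, p: int = 1) -> float:
    """Residual standard error; p is the number of predictors (p = 1 gives the n - 2 case above)."""
    rss = np.sum((y - y_hat) ** 2)
    return float(np.sqrt(rss / (len(y) - p - 1)))

def r_squared(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Fraction of variance explained: 1 - RSS / TSS."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return float(1 - rss / tss)
```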

# Multiple Linear Regression

Here our model is:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \epsilon$

We estimate the coefficients $\beta_j$ by minimizing the residual sum of squares:

$\displaystyle RSS = \sum_{i=1}^{n} (y_i - \hat \beta_0 - \hat \beta_1 x_{i1} - \hat \beta_2 x_{i2} - \dots - \hat \beta_p x_{ip})^2$
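
A minimal NumPy sketch of this least-squares fit in matrix form (the simulated design, true coefficients, and sample size are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated design: n observations, p predictors (sizes and coefficients chosen for illustration)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])      # intercept followed by the p slopes
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=1.0, size=n)

# Add a column of ones for the intercept and minimize RSS via least squares
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

rss = np.sum((y - X1 @ beta_hat) ** 2)
print("beta_hat:", np.round(beta_hat, 3), " RSS:", round(float(rss), 2))
```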

Is at least one of the predictors $X_1, X_2, \dots, X_p$ useful in predicting $Y$?

We can use the F-statistic:

$\displaystyle F = \frac{(TSS - RSS) / p}{RSS / (n - p - 1)} \sim F_{p,\, n - p - 1}$

Note that

  • if the linear model assumptions hold, $E\left[\frac{RSS}{n - p - 1}\right] = \sigma^2$
  • if $H_0$ holds, $E\left[\frac{TSS - RSS}{p}\right] = \sigma^2$, so an F-statistic much larger than 1 is evidence against $H_0$

Is only a subset of the predictors useful?

To test whether a particular subset of $q$ coefficients is zero, i.e., $H_0: \beta_{p - q + 1} = \beta_{p - q + 2} = \dots = \beta_p = 0$, we can use the following F-statistic:

$\displaystyle F = \frac{(RSS_{0} - RSS) / q}{RSS / (n - p - 1)}$

where $RSS_0$ comes from fitting a second model that uses all the variables except those last $q$.
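
Both F-tests above can be computed directly from residual sums of squares; here is a minimal sketch on simulated data (the choices $n = 150$, $p = 4$, $q = 2$, and the data-generating coefficients are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Simulated data: p = 4 predictors, the last q = 2 truly irrelevant (all choices illustrative)
n, p, q = 150, 4, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def fit_rss(X_sub):
    """Fit OLS with an intercept on the given predictors and return the RSS."""
    X1 = np.column_stack([np.ones(n), X_sub])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

rss = fit_rss(X)                 # full model with all p predictors
rss0 = fit_rss(X[:, : p - q])    # reduced model dropping the last q predictors
tss = np.sum((y - y.mean()) ** 2)

# Overall F-test: H0 is beta_1 = ... = beta_p = 0
F_all = ((tss - rss) / p) / (rss / (n - p - 1))
# Partial F-test: H0 is that the last q coefficients are 0
F_sub = ((rss0 - rss) / q) / (rss / (n - p - 1))

print(f"overall F = {F_all:.2f}, p-value = {stats.f.sf(F_all, p, n - p - 1):.3g}")
print(f"partial F = {F_sub:.2f}, p-value = {stats.f.sf(F_sub, q, n - p - 1):.3g}")
```
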
How do we decide which variables are important?

  • If $p > n$: there is no unique least-squares solution, and the F-test cannot be used
  • If $p < n$ and the F-test is significant: at least one predictor is associated with the response, but we still need to determine which ones
  • The most direct approach is all subsets or best subset regression: try every possible combination of predictors and choose the one that minimizes RSS. However, since the number of combinations, $2^p$, grows exponentially with $p$, this quickly becomes computationally infeasible (a minimal sketch of the search follows this list)
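
A minimal sketch of that exhaustive best-subset search (the simulated data and the choice to compare subsets by RSS within each size are illustrative assumptions; choosing among sizes would use a criterion such as adjusted $R^2$, $C_p$, or BIC):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)

# Simulated data: p = 5 candidate predictors, only two truly useful (choices are illustrative)
n, p = 120, 5
X = rng.normal(size=(n, p))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=n)

def rss_of(subset):
    """RSS of the OLS fit (with intercept) using only the predictors in `subset`."""
    X1 = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.sum((y - X1 @ beta) ** 2)

# For each subset size k, exhaustively search the C(p, k) combinations for the smallest RSS.
# RSS always decreases as predictors are added, so models of different sizes would then be
# compared with a criterion such as adjusted R^2, Cp, or BIC.
for k in range(1, p + 1):
    best = min(combinations(range(p), k), key=rss_of)
    print(f"size {k}: best subset {best}, RSS = {rss_of(best):.1f}")
```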

How well does the model fit the data?

  1. It can be shown that $R^2 = \mathrm{Cor}(Y, \hat Y)^2$ in this case
  2. Compute the RSE; when it is small compared to the range of $Y$, the model fits the data well
  3. How close $\hat Y$ is to $f(X)$ can be quantified with a confidence interval

Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

The prediction interval (PI) is always wider than the confidence interval (CI), because it also accounts for the error in an individual prediction.

CI: $\displaystyle \hat y \pm t_{\alpha/2}\,\hat \sigma \sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}}$

PI: $\displaystyle \hat y \pm t_{\alpha/2}\,\hat \sigma \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}}$
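
A minimal sketch computing both intervals at a new point $x^*$ for simple linear regression (the simulated data, the 95% level, and the value of `x_star` are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Simulated training data (sample size and noise level chosen for illustration)
n = 60
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(scale=2.0, size=n)

# Least-squares fit and sigma_hat
x_bar = x.mean()
b1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
b0 = y.mean() - b1 * x_bar
sigma_hat = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))

# Confidence and prediction intervals at a new point x*
x_star = 7.5
y_hat = b0 + b1 * x_star
t_crit = stats.t.ppf(0.975, df=n - 2)            # t_{alpha/2} for a 95% interval
lev = 1 / n + (x_star - x_bar) ** 2 / np.sum((x - x_bar) ** 2)
ci_half = t_crit * sigma_hat * np.sqrt(lev)      # interval for the mean response
pi_half = t_crit * sigma_hat * np.sqrt(1 + lev)  # interval for an individual response
print(f"CI: {y_hat:.2f} +/- {ci_half:.2f}   PI: {y_hat:.2f} +/- {pi_half:.2f}")
```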


# Other Considerations in the Regression Model

# Qualitative Predictors

  • They take values from a discrete set
  • Also called categorical predictors or factor variables

# Potential Problems

When we fit a linear regression model to a particular data set, many problems may arise, including:

  • Model-related
    1. Non-linearity of the data -> L (the linearity assumption)
      • The linear regression model assumes that there is a straight-line relationship between the predictors and the response
      • Residual plots are a useful graphical tool for identifying non-linearity
      • The red line is a smooth fit to the residuals, which is displayed in order to make it easier to identify any trends

        The residuals exhibit a clear U-shape in the left panel, which provides a strong indication of non-linearity in the data

        The right-hand panel displays the residual plot that results from a model containing a quadratic term

        Conclusion: There appears to be little pattern in the residuals, suggesting that the quadratic term improves the fit to the data
    2. Correlated errors -> I (the independence assumption)

      An important assumption of the linear regression model is that the error terms, $\epsilon_1, \epsilon_2, \dots, \epsilon_n$, are uncorrelated. What does this mean?

      • $\epsilon_i$ being positive provides no information about the sign of $\epsilon_{i+1}$
      • As an extreme example, suppose we accidentally doubled our data and ignored the fact that we had done so.
        • Our standard errors: computed as if the sample size were $2n$
        • Our estimated parameters: the same
        • Our confidence intervals: narrower by a factor of $\sqrt{2}$!

      Plots of residuals from simulated time series data sets generated with differing levels of correlation $\rho$ between error terms at adjacent time points:

      • In the top panel, we see the residuals from a linear regression fit to data generated with uncorrelated errors
      • The center panel illustrates a moderate case in which the residuals had a correlation of 0.5
      • The residuals in the bottom panel show a clear pattern: adjacent residuals tend to take on similar values

    3. Non-constant variance of error terms -> E (the equal-variance assumption)

      It is often the case that the variances of the error terms are non-constant;
      e.g., as the value of the response increases, the variance of the error terms also increases.
      This phenomenon is called heteroscedasticity.

  • Data-related
    1. Outliers (unusual $y$)
      • An outlier is a point for which $y_i$ is far from the value predicted by the model
      • Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection
      • Observations whose studentized residuals are greater than 3 in absolute value are possible outliers
      • An outlier can cause other problems,
        e.g., inflating the RSE; since the RSE (the estimate of $\sigma$) is used to compute all confidence intervals and $p$-values, such a dramatic increase caused by a single data point can have implications for the interpretation of the final model
      • If we believe that an outlier has occurred, one solution is to simply remove the observation. However, care should be taken, since an outlier may instead indicate a deficiency with the model
    2. High leverage points (unusual $X$)

      For a simple linear regression,

      $\displaystyle h_{i} = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{j=1}^{n} (x_j - \bar x)^2}$

      • In general, the leverage statistic $h_i \in [1/n, 1]$, and the average leverage over all observations is $(p + 1)/n$
        • If a given observation has a leverage statistic substantially larger than $(p + 1)/n$, then that observation has high leverage
      • A point whose removal would greatly change the regression equation is an influential observation
        • An influential point typically has high leverage, but high-leverage points are not necessarily influential
        • Cook's distance can be used to measure the influence of an observation (see the diagnostics sketch after this list)

          $\displaystyle D_i = \frac{\sum_{j=1}^{n} (\hat y_j - \hat y_{j(i)})^2}{(p + 1)\, \hat \sigma^2} = \frac{1}{p + 1}\, t_i^2\, \frac{h_i}{1 - h_i}$

          where $\hat y_{j(i)}$ is the fitted value for observation $j$ obtained when observation $i$ is excluded from the fit, and $t_i$ is the studentized residual
    3. Collinearity

      Collinearity refers to the situation in which two or more predictor variables are closely related to one another

      • A better way to assess multicollinearity is the Variance Inflation Factor (VIF):

        $\displaystyle VIF(\hat \beta_j) = \frac{1}{1 - R_{X_{j}|X_{-j}}^2}$

        where $R_{X_{j}|X_{-j}}^2$ is the $R^2$ obtained by regressing $X_j$ on all of the other predictors (computed in the diagnostics sketch after this list)

        A VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity

      • When faced with the problem of collinearity, we can:
        • Drop one of the problematic variables since it is redundant
        • Combine the collinear variables into a single predictor
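
As referenced in the leverage and collinearity items above, here is a minimal NumPy sketch of these diagnostics (leverage, studentized residuals, Cook's distance, and VIF) computed directly from the design matrix; the simulated data, including the deliberately collinear pair of predictors, are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated design with two deliberately collinear predictors (all choices illustrative)
n, p = 100, 3
X = rng.normal(size=(n, p))
X[:, 2] = 0.95 * X[:, 1] + 0.05 * rng.normal(size=n)   # X3 nearly duplicates X2
y = 1 + X[:, 0] + X[:, 1] + rng.normal(size=n)
X1 = np.column_stack([np.ones(n), X])                  # design matrix with intercept

# Leverage: h_i is the i-th diagonal entry of the hat matrix H = X (X'X)^{-1} X'
H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
h = np.diag(H)

# Residuals, sigma_hat^2, and (internally) studentized residuals
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
e = y - X1 @ beta
sigma2 = np.sum(e ** 2) / (n - p - 1)
t = e / np.sqrt(sigma2 * (1 - h))

# Cook's distance via the leverage / studentized-residual form
D = t ** 2 * h / ((p + 1) * (1 - h))

# VIF for each predictor: regress X_j on the other predictors and take 1 / (1 - R^2)
def vif(j):
    others = np.column_stack([np.ones(n)] + [X[:, k] for k in range(p) if k != j])
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ coef
    r2 = 1 - np.sum(resid ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 / (1 - r2)

print(f"average leverage (p+1)/n = {(p + 1) / n:.3f}, max leverage = {h.max():.3f}")
print(f"largest |studentized residual| = {np.abs(t).max():.2f}, largest Cook's D = {D.max():.3f}")
print("VIFs:", [round(float(vif(j)), 1) for j in range(p)])
```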

# Linear Regression vs. K-Nearest Neighbors

We consider one of the simplest and best-known non-parametric methods: K-nearest neighbors regression (KNN regression).

  • Given a value for $K$ and a prediction point $x_0$, KNN regression first identifies the $K$ training observations that are closest to $x_0$, represented by $N_0$
  • It then estimates $f(x_0)$ using the average of all the training responses in $N_0$. In other words,

    $\displaystyle \hat f(x_0)=\frac{1}{K}\sum_{x_i\in N_0}y_i$
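
A minimal sketch of KNN regression with a single predictor (the simulated data, the sine-shaped truth, and $K = 5$ are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated 1-D training data from a non-linear truth (choices are illustrative)
x_train = rng.uniform(-3, 3, size=200)
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=200)

def knn_predict(x0: float, K: int = 5) -> float:
    """Average the responses of the K training points closest to x0 (the neighbourhood N_0)."""
    idx = np.argsort(np.abs(x_train - x0))[:K]
    return float(y_train[idx].mean())

print(knn_predict(1.0), np.sin(1.0))   # KNN estimate of f(1.0) vs. the simulated truth
```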

# Generalization of the Linear Model

Methods that expand the scope of linear models and how they are fit:

  • Classification problems: logistic regression, support vector machines
  • Non-linearity: kernel smoothing, splines, and generalized additive models
  • Interactions: Tree-based methods, bagging, random forests and boosting (these also capture non-linearities)
  • Regularized fitting: Ridge regression and the lasso