# Logistic Regression

Write $p(X)=Pr(Y=1|X)$ for short and consider using balance to predict default.

Logistic regression uses the form:

$$E(Y|X)=p(X)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}},\quad Y|X \sim \text{Bernoulli}(p(X))$$

  • No matter what values $\beta_0, \beta_1$ take, $p(X)$ will have values between $0$ and $1$
  • A bit of rearrangement gives $\log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X$
    • This monotone transformation is called the log odds or logit transformation of $p(X)$
    • The decision boundary is still linear
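
A minimal sketch of fitting this model, assuming scikit-learn and a synthetic one-predictor dataset (not the Default data): the fitted probabilities reproduce the logistic form of $p(X)$ above.

```python
# Logistic regression sketch on synthetic data (assumed setup, not the Default data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))                  # a single predictor
p = 1 / (1 + np.exp(-(-1.0 + 2.0 * x[:, 0])))  # true p(X) with beta0 = -1, beta1 = 2
y = rng.binomial(1, p)                         # Y | X ~ Bernoulli(p(X))

model = LogisticRegression().fit(x, y)
beta0, beta1 = model.intercept_[0], model.coef_[0, 0]

# The predicted probability matches e^(b0 + b1 x) / (1 + e^(b0 + b1 x))
manual = np.exp(beta0 + beta1 * 0.5) / (1 + np.exp(beta0 + beta1 * 0.5))
print(manual, model.predict_proba([[0.5]])[0, 1])  # the two values agree
```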

# Multinomial logistic regression

It is easily generalized to more than two classes: $Y|X \sim \text{Categorical}(p(X))$

$$Pr(Y=k|X)=\frac{e^{\beta_{k0}+\beta_{k1}X_1+\beta_{k2}X_2+\dots+\beta_{kp}X_p}}{\sum_{l=1}^K e^{\beta_{l0}+\beta_{l1}X_1+\beta_{l2}X_2+\dots+\beta_{lp}X_p}}$$

$$\log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=k'|X=x)}\right)=(\beta_{k0}-\beta_{k'0})+(\beta_{k1}-\beta_{k'1})X_1+\dots+(\beta_{kp}-\beta_{k'p})X_p$$
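
A small sketch of the softmax calculation behind these class probabilities, with made-up coefficients for $K=3$ classes and $p=2$ predictors (fixing one class's coefficients at zero as a baseline):

```python
# Multinomial (softmax) probabilities with made-up coefficients.
import numpy as np

beta = np.array([[0.2, 1.0, -0.5],   # row k: [beta_k0, beta_k1, beta_k2]
                 [-0.3, 0.4, 0.8],
                 [0.0, 0.0, 0.0]])   # baseline class fixed at zero
x = np.array([1.0, 2.0])             # one observation (x1, x2)

scores = beta[:, 0] + beta[:, 1:] @ x          # beta_k0 + beta_k1*x1 + beta_k2*x2
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the K classes
print(probs, probs.sum())                      # probabilities summing to 1

# Log odds between two classes depend only on coefficient differences:
print(np.log(probs[0] / probs[2]), scores[0] - scores[2])
```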

# Why use the other approaches?

  1. When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable; linear discriminant analysis does not suffer from this problem.
  2. If $n$ is small and the distribution of the predictors $X$ is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
  3. Linear discriminant analysis is popular when we have more than two response classes, and it also provides low-dimensional views of the data.

# Bayes Theorem for Classification

According to Bayes' theorem:

$$Pr(Y=k|X=x)=\frac{Pr(Y=k)\times Pr(X=x|Y=k)}{Pr(X=x)}$$

For discriminant analysis, one writes this as:

$$p_k(x)=Pr(Y=k|X=x)=\frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}$$

  • $f_k(x)=Pr(X=x|Y=k)$ is the density for $X$ in class $k$. Here we will use normal densities for these, separately in each class
  • $\pi_k=Pr(Y=k)$ is the marginal or prior probability for class $k$
  • We discuss three classifiers that use different estimates of $f_k(x)$ to approximate the Bayes classifier: linear discriminant analysis, quadratic discriminant analysis, and naive Bayes
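
A toy numeric sketch of this posterior calculation, assuming one predictor, two Gaussian classes, and made-up priors and means (SciPy assumed available):

```python
# Posterior p_k(x) from priors pi_k and class densities f_k(x), via Bayes' theorem.
import numpy as np
from scipy.stats import norm

pi = np.array([0.7, 0.3])             # priors pi_k = Pr(Y = k)
mu = np.array([0.0, 2.0])             # class means (made up)
sigma = 1.0                           # shared standard deviation

x = 1.2
f = norm.pdf(x, loc=mu, scale=sigma)  # densities f_k(x)
posterior = pi * f / np.sum(pi * f)   # pi_k f_k(x) / sum_l pi_l f_l(x)
print(posterior)                      # Pr(Y = k | X = x), sums to 1
```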

# Discriminant functions

To classify at the value $X=x$, we need to see which of the $p_k(x)$ is largest. With a single predictor, Gaussian class densities, and a common variance $\sigma^2$, taking logs and discarding terms that do not depend on $k$, this is equivalent to choosing the class with the largest value of

$$\delta_k(x)=x\cdot\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log(\pi_k)$$

  • $\delta_k(x)$ is called the discriminant function; it is a linear function of $x$, and since $\arg\max_k p_k(x)=\arg\max_k \delta_k(x)$, the decision boundary is linear
  • If there are $K = 2$ classes and $\pi_1 = \pi_2 = 0.5$, then one can see that the decision boundary is at

$$x=\frac{\mu_1+\mu_2}{2}$$
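
A minimal sketch with made-up $\mu_k$, $\sigma^2$, and equal priors, confirming that the predicted class flips at $x=(\mu_1+\mu_2)/2$:

```python
# One-dimensional LDA discriminant with K = 2 classes (made-up parameters).
import numpy as np

mu = np.array([-1.0, 3.0])   # class means
sigma2 = 1.5                 # shared variance
pi = np.array([0.5, 0.5])    # equal priors

def delta(x):
    # delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
    return x * mu / sigma2 - mu**2 / (2 * sigma2) + np.log(pi)

boundary = (mu[0] + mu[1]) / 2            # = 1.0 here
print(np.argmax(delta(boundary - 0.01)))  # class 0 just below the boundary
print(np.argmax(delta(boundary + 0.01)))  # class 1 just above the boundary
```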

# Linear Discriminant Analysis when $p>1$

  • Density: $f_k(x)=\dfrac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}$
  • Discriminant function:
    $$\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\log(\pi_k)=-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)+\log(\pi_k)+C$$
  • Despite its complicated form,
    $\delta_k(x)=c_{k0}+c_{k1}x_1+c_{k2}x_2+\dots+c_{kp}x_p$ is a linear function of $x$, so the decision boundary is also linear
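
A short sketch of the multivariate discriminant with made-up parameters for $p=2$ (a hand-rolled illustration; in practice $\mu_k$, $\Sigma$, and $\pi_k$ are estimated from training data):

```python
# Multivariate LDA discriminant delta_k(x) with made-up parameters (p = 2, K = 2).
import numpy as np

Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
pis = [0.6, 0.4]

def delta_k(x, mu, pi):
    # x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log(pi_k)
    return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)

x = np.array([1.5, 0.5])
deltas = [delta_k(x, mu, pi) for mu, pi in zip(mus, pis)]
print(np.argmax(deltas))  # predicted class: the k with the largest delta_k(x)
```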

# Other forms of Discriminant Analysis

$$Pr(Y=k|X=x)=\frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}$$

When the $f_k(x)$ are Gaussian densities with the same covariance matrix $\Sigma$ in each class, this leads to linear discriminant analysis.

More generally, we can assume that an observation from the $k$th class is of the form $X|Y=k \sim N(\mu_k, \Sigma_k)$.

  • By altering the forms for $f_k(x)$, we get different classifiers:
    • With Gaussians but different $\Sigma_k$ in each class, we get quadratic discriminant analysis (QDA).
    • With $f_k(x)=\prod_{j=1}^p f_{jk}(x_j)$ (a conditional independence model) in each class, we get naive Bayes. If the densities are also Gaussian, this means the $\Sigma_k$ are diagonal.
    • Many other forms are possible, by proposing specific density models for $f_k(x)$, including nonparametric approaches.

# Quadratic Discriminant Analysis

The Bayes classifier assigns an observation $X=x$ to the class for which the following quantity is largest:

$$\delta_k(x)=-\frac{1}{2}x^T\Sigma_k^{-1}x+x^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\log|\Sigma_k|+\log(\pi_k)$$

The quantity $x$ now appears as a quadratic function.


Figure: the Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem.
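
A corresponding sketch of the QDA rule with made-up class-specific covariances $\Sigma_k$:

```python
# QDA discriminant with class-specific covariance matrices (made-up parameters).
import numpy as np

def delta_qda(x, mu, Sigma, pi):
    Sinv = np.linalg.inv(Sigma)
    return (-0.5 * x @ Sinv @ x + x @ Sinv @ mu
            - 0.5 * mu @ Sinv @ mu
            - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(pi))

params = [  # (mu_k, Sigma_k, pi_k) for each class
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([1.0, 1.0]), np.array([[2.0, 0.8], [0.8, 1.5]]), 0.5),
]

x = np.array([0.8, 0.2])
print(np.argmax([delta_qda(x, mu, S, pi) for mu, S, pi in params]))
```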

# Naive Bayes

Assume the features are independent in each class:

$$f_k(x)=f_{k1}(x_1)\times f_{k2}(x_2)\times \dots \times f_{kp}(x_p)$$

It often leads to decent results, especially in settings where $n$ is not large enough relative to $p$ for us to effectively estimate the joint distribution of the predictors within each class.

  • $X_j$ quantitative
    • We can assume that $X_j|Y=k \sim N(\mu_{kj}, \sigma_{kj}^2)$, which amounts to QDA with the assumption that the class-specific covariance matrix is diagonal.
    • We can also replace $f_{kj}(x_j)$ with a nonparametric estimate, such as a histogram of the $j$th predictor within each class.
  • $X_j$ qualitative
    • Then we can simply count the proportion of training observations for the $j$th predictor corresponding to each class.
    • The posterior probability is:

      $$Pr(Y=k|X=x)=\frac{\pi_k\times f_{k1}(x_1)\times f_{k2}(x_2)\times \dots \times f_{kp}(x_p)}{\sum_{l=1}^K \pi_l\times f_{l1}(x_1)\times f_{l2}(x_2)\times \dots \times f_{lp}(x_p)}$$
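
A minimal Gaussian naive Bayes sketch on synthetic data: estimate $\mu_{kj}$ and $\sigma_{kj}$ per class and per feature, then multiply the one-dimensional densities (NumPy and SciPy assumed available):

```python
# Gaussian naive Bayes: f_k(x) = prod_j f_kj(x_j), with per-feature normal densities.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
X0 = rng.normal([0, 0], [1.0, 2.0], size=(100, 2))  # class 0 training data
X1 = rng.normal([2, 1], [1.5, 1.0], size=(100, 2))  # class 1 training data
pi = np.array([0.5, 0.5])                           # priors

# Estimate mu_kj and sigma_kj for each class k and feature j
mus = np.stack([X0.mean(axis=0), X1.mean(axis=0)])
sds = np.stack([X0.std(axis=0), X1.std(axis=0)])

x = np.array([1.0, 0.5])
f = norm.pdf(x, loc=mus, scale=sds).prod(axis=1)  # f_k(x) = prod_j f_kj(x_j)
posterior = pi * f / np.sum(pi * f)
print(posterior)                                  # Pr(Y = k | X = x)
```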

# Analytical Comparison of Different Methods

We assign an observation to the class that maximizes $Pr(Y=k|X=x)$; to compare the methods, we can examine the log odds of the posterior probabilities relative to a baseline class $K$.

  • For LDA we have
    $$\log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right)=a_k+\sum_{j=1}^p b_{kj}x_j$$
    So LDA, like logistic regression, assumes that the log odds of the posterior probabilities is linear in $x$
  • For QDA we have
    $$\log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right)=a_k+\sum_{j=1}^p b_{kj}x_j+\sum_{j=1}^p\sum_{l=1}^p c_{kjl}x_jx_l$$
    QDA assumes that the log odds of the posterior probabilities is quadratic in $x$
  • For naive Bayes we have
    $$\log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right)=a_k+\sum_{j=1}^p g_{kj}(x_j)$$
    which takes the form of a generalized additive model in $x$

  1. LDA is a special case of QDA (with $c_{kjl}=0$ for all $j=1,\dots,p$, $l=1,\dots,p$, and $k=1,\dots,K$)
  2. Any classifier with a linear decision boundary is a special case of naive Bayes (with $g_{kj}(x_j)=b_{kj}x_j$)
  3. QDA and naive Bayes can produce more flexible fits
  • Under the normality assumption, LDA would do better than logistic regression
  • KNN is completely non-parametric: No assumptions are made about the shape of the decision boundary
  • QDA is a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches. If the true decision boundary is:
    • Linear: LDA and logistic regression outperform
    • Moderately Non-linear: QDA outperforms
    • More complicated: KNN is superior

# ROC (Receiver Operating Characteristic) Curve

  • Precision: $\frac{\text{TP}}{\text{TP}+\text{FP}}$
  • Recall: $\frac{\text{TP}}{\text{TP}+\text{FN}}$ = Sensitivity = True positive rate = Power
  • Specificity: $\frac{\text{TN}}{\text{TN}+\text{FP}}$
  • False positive rate: $\frac{\text{FP}}{\text{TN}+\text{FP}} = 1-\text{Specificity}$ (Type I error rate)
  • False negative rate: $\frac{\text{FN}}{\text{TP}+\text{FN}} = 1-\text{Sensitivity}$ (Type II error rate)
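
The ROC curve plots the true positive rate against the false positive rate as the classification threshold varies, and is commonly summarized by the area under the curve (AUC). A sketch computing these quantities from hypothetical labels and scores, assuming scikit-learn is available:

```python
# Confusion-matrix rates and an ROC curve from hypothetical labels and scores.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.6, 0.3])
y_pred = (scores >= 0.5).astype(int)           # threshold the scores at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", tp / (tp + fp))
print("recall / sensitivity / TPR:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("false positive rate:", fp / (tn + fp))  # = 1 - specificity

# ROC curve: TPR vs FPR as the threshold varies, summarized by the AUC
fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC:", roc_auc_score(y_true, scores))
```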

# Poisson Regression

Suppose that a random variable $Y$ takes on nonnegative integer values. If $Y$ follows the Poisson distribution, then

$$Pr(Y=k)=\frac{e^{-\lambda}\lambda^k}{k!} \text{ for } k=0,1,2,\dots$$

where $\lambda>0$ and $\lambda=E(Y)=Var(Y)$.

  • We consider the following model for the mean $\lambda=E(Y|X)$:
    $$\log(\lambda(X_1,\dots,X_p))=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p$$
  • Given $n$ independent observations from the Poisson regression model, the likelihood takes the form

$$\ell(\beta_0, \beta_1, \dots, \beta_p)=\prod_{i=1}^n\frac{e^{-\lambda(x_i)}\lambda(x_i)^{y_i}}{y_i!}$$
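
A sketch that simulates counts from this model and recovers the coefficients by fitting a Poisson GLM (statsmodels assumed available; the coefficient values are made up):

```python
# Poisson regression: simulate y with log-linear mean, then fit by maximum likelihood.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=500)
lam = np.exp(0.5 + 0.8 * x)   # log(lambda) = beta0 + beta1 * x
y = rng.poisson(lam)

X = sm.add_constant(x)        # design matrix with an intercept column
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)             # approximately [0.5, 0.8]
```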

# Generalized Linear Models (GLM) in Greater Generality

These approaches share some common characteristics:

  1. We assume that, conditional on $X_1,\dots,X_p$, $Y$ belongs to a certain family of distributions.
  2. Each approach models the mean of $Y$ as a function of the predictors.
  3. Each can be expressed using a link function $\eta$, which applies a transformation to $E(Y|X_1,\dots,X_p)$ so that the transformed mean is a linear function of the predictors. That is,

$$\eta(E(Y|X_1,\dots,X_p))=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p$$

The link functions $\eta(\mu)$ and distributional assumptions for the three GLMs covered so far are:

| Model | Assumed distribution of $Y \mid X$ | Mean $E(Y \mid X_1,\dots,X_p)$ | Link function $\eta(\mu)$ |
| --- | --- | --- | --- |
| Linear regression | Gaussian (normal) | $\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p$ | $\eta(\mu)=\mu$ |
| Logistic regression | Bernoulli | $\frac{e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}}$ | $\eta(\mu)=\log(\mu/(1-\mu))$ |
| Poisson regression | Poisson | $e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}$ | $\eta(\mu)=\log(\mu)$ |
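
A sketch fitting all three GLMs from the table with their canonical links on synthetic one-predictor data (statsmodels assumed available):

```python
# Linear, logistic, and Poisson regression fitted as GLMs with canonical links.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=300)
X = sm.add_constant(x)

y_gauss = 1.0 + 2.0 * x + rng.normal(size=300)                # Gaussian response
y_bern = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 1.5 * x))))  # Bernoulli response
y_pois = rng.poisson(np.exp(0.2 + 0.6 * x))                   # Poisson response

for y, family in [(y_gauss, sm.families.Gaussian()),
                  (y_bern, sm.families.Binomial()),
                  (y_pois, sm.families.Poisson())]:
    fit = sm.GLM(y, X, family=family).fit()
    print(type(family).__name__, fit.params)  # eta(E[Y|X]) = beta0 + beta1 * x
```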