# Logistic Regression

Write $p(X)=Pr(Y=1|X)$ for short and consider using balance to predict default.

Logistic regression uses the form:

$$E(Y|X)=p(X)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}},\quad Y|X \sim \text{Bernoulli}(p(X))$$

  • No matter what values $\beta_0, \beta_1$ take, $p(X)$ will have values between $0$ and $1$
  • A bit of rearrangement gives $\log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X$
    • This monotone transformation is called the log odds or logit transformation of $p(X)$
    • The decision boundary is still linear
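
A minimal sketch of fitting this model, assuming scikit-learn and a synthetic one-predictor dataset (not the Default data): the fitted probabilities reproduce the logistic form of $p(X)$ above.

```python
# Logistic regression sketch on synthetic data (assumed setup, not the Default data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))                  # a single predictor
p = 1 / (1 + np.exp(-(-1.0 + 2.0 * x[:, 0])))  # true p(X) with beta0 = -1, beta1 = 2
y = rng.binomial(1, p)                         # Y | X ~ Bernoulli(p(X))

model = LogisticRegression().fit(x, y)
beta0, beta1 = model.intercept_[0], model.coef_[0, 0]

# The predicted probability matches e^(b0 + b1 x) / (1 + e^(b0 + b1 x))
manual = np.exp(beta0 + beta1 * 0.5) / (1 + np.exp(beta0 + beta1 * 0.5))
print(manual, model.predict_proba([[0.5]])[0, 1])  # the two values agree
```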

# Multinomial logistic regression

It is easily generalized to more than two classes: $Y|X \sim \text{Categorical}(p(X))$

$$Pr(Y=k|X)=\frac{e^{\beta_{k0}+\beta_{k1}X_1+\beta_{k2}X_2+\dots+\beta_{kp}X_p}}{\sum_{l=1}^K e^{\beta_{l0}+\beta_{l1}X_1+\beta_{l2}X_2+\dots+\beta_{lp}X_p}}$$

$$\log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=k'|X=x)}\right)=(\beta_{k0}-\beta_{k'0})+(\beta_{k1}-\beta_{k'1})X_1+\dots+(\beta_{kp}-\beta_{k'p})X_p$$
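
A small sketch of the softmax calculation behind these class probabilities, with made-up coefficients for $K=3$ classes and $p=2$ predictors (fixing one class's coefficients at zero as a baseline):

```python
# Multinomial (softmax) probabilities with made-up coefficients.
import numpy as np

beta = np.array([[0.2, 1.0, -0.5],   # row k: [beta_k0, beta_k1, beta_k2]
                 [-0.3, 0.4, 0.8],
                 [0.0, 0.0, 0.0]])   # baseline class fixed at zero
x = np.array([1.0, 2.0])             # one observation (x1, x2)

scores = beta[:, 0] + beta[:, 1:] @ x          # beta_k0 + beta_k1*x1 + beta_k2*x2
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the K classes
print(probs, probs.sum())                      # probabilities summing to 1

# Log odds between two classes depend only on coefficient differences:
print(np.log(probs[0] / probs[2]), scores[0] - scores[2])
```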

# Why use the other approaches?

  1. When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable; linear discriminant analysis does not suffer from this problem.
  2. If $n$ is small and the distribution of the predictors $X$ is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
  3. Linear discriminant analysis is popular when we have more than two response classes, and it also provides low-dimensional views of the data.

# Bayes Theorem for Classification

According to Bayes' theorem:

$$Pr(Y=k|X=x)=\frac{Pr(Y=k)\times Pr(X=x|Y=k)}{Pr(X=x)}$$

For discriminant analysis, one writes this as:

$$p_k(x)=Pr(Y=k|X=x)=\frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}$$

  • $f_k(x)=Pr(X=x|Y=k)$ is the density for $X$ in class $k$. Here we will use normal densities for these, separately in each class
  • $\pi_k=Pr(Y=k)$ is the marginal or prior probability for class $k$
  • We discuss three classifiers that use different estimates of $f_k(x)$ to approximate the Bayes classifier: linear discriminant analysis, quadratic discriminant analysis, and naive Bayes
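
A toy numeric sketch of this posterior calculation, assuming one predictor, two Gaussian classes, and made-up priors and means (SciPy assumed available):

```python
# Posterior p_k(x) from priors pi_k and class densities f_k(x), via Bayes' theorem.
import numpy as np
from scipy.stats import norm

pi = np.array([0.7, 0.3])             # priors pi_k = Pr(Y = k)
mu = np.array([0.0, 2.0])             # class means (made up)
sigma = 1.0                           # shared standard deviation

x = 1.2
f = norm.pdf(x, loc=mu, scale=sigma)  # densities f_k(x)
posterior = pi * f / np.sum(pi * f)   # pi_k f_k(x) / sum_l pi_l f_l(x)
print(posterior)                      # Pr(Y = k | X = x), sums to 1
```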

# Discriminant functions

To classify at the value $X=x$, we need to see which of the $p_k(x)$ is largest. With a single predictor, Gaussian class densities, and a common variance $\sigma^2$, taking logs and discarding terms that do not depend on $k$, this is equivalent to choosing the class with the largest value of

$$\delta_k(x)=x\cdot\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log(\pi_k)$$

  • $\delta_k(x)$ is called the discriminant function; it is a linear function of $x$, and since $\arg\max_k p_k(x)=\arg\max_k \delta_k(x)$, the decision boundary is linear
  • If there are $K = 2$ classes and $\pi_1 = \pi_2 = 0.5$, then one can see that the decision boundary is at

$$x=\frac{\mu_1+\mu_2}{2}$$
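
A minimal sketch with made-up $\mu_k$, $\sigma^2$, and equal priors, confirming that the predicted class flips at $x=(\mu_1+\mu_2)/2$:

```python
# One-dimensional LDA discriminant with K = 2 classes (made-up parameters).
import numpy as np

mu = np.array([-1.0, 3.0])   # class means
sigma2 = 1.5                 # shared variance
pi = np.array([0.5, 0.5])    # equal priors

def delta(x):
    # delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)
    return x * mu / sigma2 - mu**2 / (2 * sigma2) + np.log(pi)

boundary = (mu[0] + mu[1]) / 2            # = 1.0 here
print(np.argmax(delta(boundary - 0.01)))  # class 0 just below the boundary
print(np.argmax(delta(boundary + 0.01)))  # class 1 just above the boundary
```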

# Linear Discriminant Analysis when $p>1$

  • Density: $f_k(x)=\dfrac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}$
  • Discriminant function:
    $$\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\log(\pi_k)=-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)+\log(\pi_k)+C$$
  • Despite its complicated form,
    $\delta_k(x)=c_{k0}+c_{k1}x_1+c_{k2}x_2+\dots+c_{kp}x_p$ is a linear function of $x$, so the decision boundary is also linear
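
A short sketch of the multivariate discriminant with made-up parameters for $p=2$ (a hand-rolled illustration; in practice $\mu_k$, $\Sigma$, and $\pi_k$ are estimated from training data):

```python
# Multivariate LDA discriminant delta_k(x) with made-up parameters (p = 2, K = 2).
import numpy as np

Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
pis = [0.6, 0.4]

def delta_k(x, mu, pi):
    # x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log(pi_k)
    return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)

x = np.array([1.5, 0.5])
deltas = [delta_k(x, mu, pi) for mu, pi in zip(mus, pis)]
print(np.argmax(deltas))  # predicted class: the k with the largest delta_k(x)
```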

# Other forms of Discriminant Analysis

$$Pr(Y=k|X=x)=\frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}$$

When the $f_k(x)$ are Gaussian densities with the same covariance matrix $\Sigma$ in each class, this leads to linear discriminant analysis.

More generally, we can assume that an observation from the $k$th class is of the form $X|Y=k \sim N(\mu_k, \Sigma_k)$.

  • By altering the forms for $f_k(x)$, we get different classifiers:
    • With Gaussians but different $\Sigma_k$ in each class, we get quadratic discriminant analysis (QDA).
    • With $f_k(x)=\prod_{j=1}^p f_{jk}(x_j)$ (a conditional independence model) in each class, we get naive Bayes. If the densities are also Gaussian, this means the $\Sigma_k$ are diagonal.
    • Many other forms are possible, by proposing specific density models for $f_k(x)$, including nonparametric approaches.

# Quadratic Discriminant Analysis

The Bayes classifier assigns an observation $X=x$ to the class for which the following quantity is largest:

$$\delta_k(x)=-\frac{1}{2}x^T\Sigma_k^{-1}x+x^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\log|\Sigma_k|+\log(\pi_k)$$

The quantity $x$ now appears as a quadratic function.


Figure: the Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem.
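
A corresponding sketch of the QDA rule with made-up class-specific covariances $\Sigma_k$:

```python
# QDA discriminant with class-specific covariance matrices (made-up parameters).
import numpy as np

def delta_qda(x, mu, Sigma, pi):
    Sinv = np.linalg.inv(Sigma)
    return (-0.5 * x @ Sinv @ x + x @ Sinv @ mu
            - 0.5 * mu @ Sinv @ mu
            - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(pi))

params = [  # (mu_k, Sigma_k, pi_k) for each class
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([1.0, 1.0]), np.array([[2.0, 0.8], [0.8, 1.5]]), 0.5),
]

x = np.array([0.8, 0.2])
print(np.argmax([delta_qda(x, mu, S, pi) for mu, S, pi in params]))
```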

# Naive Bayes

Assume the features are independent in each class:

$$f_k(x)=f_{k1}(x_1)\times f_{k2}(x_2)\times \dots \times f_{kp}(x_p)$$

It often leads to decent results, especially in settings where $n$ is not large enough relative to $p$ for us to effectively estimate the joint distribution of the predictors within each class.

  • $X_j$ quantitative
    • We can assume that $X_j|Y=k \sim N(\mu_{kj}, \sigma_{kj}^2)$, which amounts to QDA with the assumption that the class-specific covariance matrix is diagonal.
    • We can also replace $f_{kj}(x_j)$ with a nonparametric estimate, such as a histogram of the $j$th predictor within each class.
  • $X_j$ qualitative
    • Then we can simply count the proportion of training observations for the $j$th predictor corresponding to each class.
    • The posterior probability is:

      $$Pr(Y=k|X=x)=\frac{\pi_k\times f_{k1}(x_1)\times f_{k2}(x_2)\times \dots \times f_{kp}(x_p)}{\sum_{l=1}^K \pi_l\times f_{l1}(x_1)\times f_{l2}(x_2)\times \dots \times f_{lp}(x_p)}$$
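
A minimal Gaussian naive Bayes sketch on synthetic data: estimate $\mu_{kj}$ and $\sigma_{kj}$ per class and per feature, then multiply the one-dimensional densities (NumPy and SciPy assumed available):

```python
# Gaussian naive Bayes: f_k(x) = prod_j f_kj(x_j), with per-feature normal densities.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
X0 = rng.normal([0, 0], [1.0, 2.0], size=(100, 2))  # class 0 training data
X1 = rng.normal([2, 1], [1.5, 1.0], size=(100, 2))  # class 1 training data
pi = np.array([0.5, 0.5])                           # priors

# Estimate mu_kj and sigma_kj for each class k and feature j
mus = np.stack([X0.mean(axis=0), X1.mean(axis=0)])
sds = np.stack([X0.std(axis=0), X1.std(axis=0)])

x = np.array([1.0, 0.5])
f = norm.pdf(x, loc=mus, scale=sds).prod(axis=1)  # f_k(x) = prod_j f_kj(x_j)
posterior = pi * f / np.sum(pi * f)
print(posterior)                                  # Pr(Y = k | X = x)
```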

# Analytical Comparison of Different Methods

We assign an observation to the class that maximizes $Pr(Y=k|X=x)$; to compare the methods, we can examine the log odds of the posterior probabilities relative to a baseline class $K$.

  • For LDA we have
    $$\log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right)=a_k+\sum_{j=1}^p b_{kj}x_j$$
    So LDA, like logistic regression, assumes that the log odds of the posterior probabilities is linear in $x$
  • For QDA we have
    $$\log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right)=a_k+\sum_{j=1}^p b_{kj}x_j+\sum_{j=1}^p\sum_{l=1}^p c_{kjl}x_jx_l$$
    QDA assumes that the log odds of the posterior probabilities is quadratic in $x$
  • For naive Bayes we have
    $$\log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right)=a_k+\sum_{j=1}^p g_{kj}(x_j)$$
    which takes the form of a generalized additive model in $x$

  1. LDA is a special case of QDA (with $c_{kjl}=0$ for all $j=1,\dots,p$, $l=1,\dots,p$, and $k=1,\dots,K$)
  2. Any classifier with a linear decision boundary is a special case of naive Bayes (with $g_{kj}(x_j)=b_{kj}x_j$)
  3. QDA and naive Bayes can produce more flexible fits
  • Under the normality assumption, LDA would do better than logistic regression
  • KNN is completely non-parametric: No assumptions are made about the shape of the decision boundary
  • QDA is a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches. If the true decision boundary is:
    • Linear: LDA and logistic regression outperform
    • Moderately Non-linear: QDA outperforms
    • More complicated: KNN is superior

# ROC (Receiver Operating Characteristic) Curve

  • Precision: $\frac{\text{TP}}{\text{TP}+\text{FP}}$
  • Recall: $\frac{\text{TP}}{\text{TP}+\text{FN}}$ = Sensitivity = True positive rate = Power
  • Specificity: $\frac{\text{TN}}{\text{TN}+\text{FP}}$
  • False positive rate: $\frac{\text{FP}}{\text{TN}+\text{FP}} = 1-\text{Specificity}$ (Type I error rate)
  • False negative rate: $\frac{\text{FN}}{\text{TP}+\text{FN}} = 1-\text{Sensitivity}$ (Type II error rate)
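
The ROC curve plots the true positive rate against the false positive rate as the classification threshold varies, and is commonly summarized by the area under the curve (AUC). A sketch computing these quantities from hypothetical labels and scores, assuming scikit-learn is available:

```python
# Confusion-matrix rates and an ROC curve from hypothetical labels and scores.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.6, 0.3])
y_pred = (scores >= 0.5).astype(int)           # threshold the scores at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", tp / (tp + fp))
print("recall / sensitivity / TPR:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("false positive rate:", fp / (tn + fp))  # = 1 - specificity

# ROC curve: TPR vs FPR as the threshold varies, summarized by the AUC
fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC:", roc_auc_score(y_true, scores))
```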

# Poisson Regression

Suppose that a random variable $Y$ takes on nonnegative integer values. If $Y$ follows the Poisson distribution, then

$$Pr(Y=k)=\frac{e^{-\lambda}\lambda^k}{k!} \text{ for } k=0,1,2,\dots$$

where $\lambda>0$ and $\lambda=E(Y)=Var(Y)$.

  • We consider the following model for the mean $\lambda=E(Y|X)$:
    $$\log(\lambda(X_1,\dots,X_p))=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p$$
  • Given $n$ independent observations from the Poisson regression model, the likelihood takes the form

$$\ell(\beta_0, \beta_1, \dots, \beta_p)=\prod_{i=1}^n\frac{e^{-\lambda(x_i)}\lambda(x_i)^{y_i}}{y_i!}$$
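
A sketch that simulates counts from this model and recovers the coefficients by fitting a Poisson GLM (statsmodels assumed available; the coefficient values are made up):

```python
# Poisson regression: simulate y with log-linear mean, then fit by maximum likelihood.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=500)
lam = np.exp(0.5 + 0.8 * x)   # log(lambda) = beta0 + beta1 * x
y = rng.poisson(lam)

X = sm.add_constant(x)        # design matrix with an intercept column
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)             # approximately [0.5, 0.8]
```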

# Generalized Linear Models (GLM) in Greater Generality

These approaches share some common characteristics:

  1. We assume that, conditional on $X_1,\dots,X_p$, $Y$ belongs to a certain family of distributions.
  2. Each approach models the mean of $Y$ as a function of the predictors.
  3. Each can be expressed using a link function $\eta$, which applies a transformation to $E(Y|X_1,\dots,X_p)$ so that the transformed mean is a linear function of the predictors. That is,

$$\eta(E(Y|X_1,\dots,X_p))=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p$$

The link functions $\eta(\mu)$ and distributional assumptions for the three GLMs covered so far are:

| Model | Assumed distribution of $Y \mid X$ | Mean $E(Y \mid X_1,\dots,X_p)$ | Link function $\eta(\mu)$ |
| --- | --- | --- | --- |
| Linear regression | Gaussian (normal) | $\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p$ | $\eta(\mu)=\mu$ |
| Logistic regression | Bernoulli | $\frac{e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}}$ | $\eta(\mu)=\log(\mu/(1-\mu))$ |
| Poisson regression | Poisson | $e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}$ | $\eta(\mu)=\log(\mu)$ |
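
A sketch fitting all three GLMs from the table with their canonical links on synthetic one-predictor data (statsmodels assumed available):

```python
# Linear, logistic, and Poisson regression fitted as GLMs with canonical links.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=300)
X = sm.add_constant(x)

y_gauss = 1.0 + 2.0 * x + rng.normal(size=300)                # Gaussian response
y_bern = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 1.5 * x))))  # Bernoulli response
y_pois = rng.poisson(np.exp(0.2 + 0.6 * x))                   # Poisson response

for y, family in [(y_gauss, sm.families.Gaussian()),
                  (y_bern, sm.families.Binomial()),
                  (y_pois, sm.families.Poisson())]:
    fit = sm.GLM(y, X, family=family).fit()
    print(type(family).__name__, fit.params)  # eta(E[Y|X]) = beta0 + beta1 * x
```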