# Logistic Regression
Write p(X)=Pr(Y=1∣X) for short and consider using balance to predict default.
Logistic regression uses the form:
$$E(Y\mid X)=p(X)=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}},\qquad Y\mid X\sim \mathrm{Bernoulli}(p(X))$$
- No matter what values β0, β1 take, p(X) will lie between 0 and 1
- A bit of rearrangement gives $\log\left(\frac{p(X)}{1-p(X)}\right)=\beta_0+\beta_1X$
- This monotone transformation is called the log odds or logit transformation of p(X)
- The decision boundary is still linear
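A minimal sketch of fitting this model by maximum likelihood with scikit-learn (not part of the original notes); the simulated `balance`/`default` data and the "true" coefficients below are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated stand-in for the Default data: balance (predictor), default (0/1 response).
rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=1000)
p_true = 1 / (1 + np.exp(-(-10 + 0.0055 * balance)))  # assumed "true" beta_0, beta_1
default = rng.binomial(1, p_true)

# Fit p(X) = e^{b0 + b1 X} / (1 + e^{b0 + b1 X}); a very large C effectively
# switches off scikit-learn's default ridge penalty.
clf = LogisticRegression(C=1e6, max_iter=1000).fit(balance.reshape(-1, 1), default)
print(clf.intercept_, clf.coef_)             # estimated beta_0, beta_1
print(clf.predict_proba([[1500.0]])[:, 1])   # estimated Pr(default = 1 | balance = 1500)
```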
# Multinomial logistic regression
It is easily generalized to more than two classes: Y∣X ∼ Categorical(p(X))
$$\Pr(Y=k\mid X)=\frac{e^{\beta_{k0}+\beta_{k1}X_1+\beta_{k2}X_2+\dots+\beta_{kp}X_p}}{\sum_{l=1}^{K}e^{\beta_{l0}+\beta_{l1}X_1+\beta_{l2}X_2+\dots+\beta_{lp}X_p}}$$
$$\log\left(\frac{\Pr(Y=k\mid X=x)}{\Pr(Y=k'\mid X=x)}\right)=(\beta_{k0}-\beta_{k'0})+(\beta_{k1}-\beta_{k'1})x_1+\dots+(\beta_{kp}-\beta_{k'p})x_p$$
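A quick sketch of the multinomial model (the three-class data are simulated here purely for illustration): with more than two classes, scikit-learn's lbfgs solver fits the softmax form shown above, and the predicted probabilities for each observation sum to 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 3-class data; class labels are derived from a noisy linear score.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
score = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)
y = np.digitize(score, bins=[-0.5, 0.5])     # classes 0, 1, 2

clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X[:3])             # one column per class: Pr(Y = k | X)
print(probs)
print(probs.sum(axis=1))                     # each row sums to 1
```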
# Why use the other approaches?
- When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable; linear discriminant analysis does not suffer from this problem
- If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant analysis model is again more stable than logistic regression
- Linear discriminant analysis is popular when we have more than two response classes, and it also provides low-dimensional views of the data
# Bayes theorem for classification
According to Bayes' theorem:
$$\Pr(Y=k\mid X=x)=\frac{\Pr(Y=k)\times\Pr(X=x\mid Y=k)}{\Pr(X=x)}$$
For discriminant analysis, one writes this as:
$$p_k(x)=\Pr(Y=k\mid X=x)=\frac{\pi_k f_k(x)}{\sum_{l=1}^{K}\pi_l f_l(x)}$$
- fk(x)=Pr(X=x∣Y=k) is the density for X in class k. Here we will use normal densities for these, separately in each class
- πk=Pr(Y=k) is the marginal or prior probability for class k
- We discuss three classifiers that use different estimates of fk(x) to approximate the Bayes classifier: linear discriminant analysis, quadratic discriminant analysis, and naive Bayes
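As a toy numerical illustration of the posterior formula above (all numbers invented for this sketch), p_k(x) can be computed directly from class priors and class densities:

```python
from scipy.stats import norm

# Two classes with priors pi_k and one-dimensional Gaussian densities f_k(x).
priors = [0.3, 0.7]
means, sds = [0.0, 2.0], [1.0, 1.0]

x = 1.2
densities = [norm.pdf(x, mu, sd) for mu, sd in zip(means, sds)]   # f_k(x)
numerators = [pi * f for pi, f in zip(priors, densities)]         # pi_k * f_k(x)
posteriors = [num / sum(numerators) for num in numerators]        # p_k(x)
print(posteriors)   # sums to 1; the Bayes classifier picks the class with the largest p_k(x)
```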
# Discriminant functions
To classify at the value X=x, we need to see which of the pk(x) is largest.
$$\delta_k(x)=x\cdot\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log(\pi_k)$$
- δk(x) is called the discriminant function; note that it is a linear function of x, and since $\arg\max_k p_k(x)=\arg\max_k \delta_k(x)$, this results in a linear decision boundary
- If there are K=2 classes and π1=π2=0.5, then one can see that the decision boundary is at
$$x=\frac{\mu_1+\mu_2}{2}$$
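Filling in the algebra behind this boundary (a short check, not in the original notes): set δ1(x)=δ2(x) and use π1=π2 so the log-prior terms cancel.

```latex
\begin{aligned}
x\frac{\mu_1}{\sigma^2}-\frac{\mu_1^2}{2\sigma^2}+\log(\pi_1)
  &= x\frac{\mu_2}{\sigma^2}-\frac{\mu_2^2}{2\sigma^2}+\log(\pi_2)
  && \text{(equal priors: log terms cancel)}\\
x(\mu_1-\mu_2) &= \frac{\mu_1^2-\mu_2^2}{2}
  = \frac{(\mu_1-\mu_2)(\mu_1+\mu_2)}{2}\\
x &= \frac{\mu_1+\mu_2}{2} && (\mu_1 \neq \mu_2)
\end{aligned}
```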
# Linear Discriminant Analysis when p>1
- Density: $f_k(x)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}$
- Discriminant function:
$$\delta_k(x)=x^T\Sigma^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k+\log(\pi_k)=-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)+\log(\pi_k)+C,$$
where $C=\frac{1}{2}x^T\Sigma^{-1}x$ does not depend on k
- Despite its complicated form, $\delta_k(x)=c_{k0}+c_{k1}x_1+c_{k2}x_2+\dots+c_{kp}x_p$ is a linear function of x, so the decision boundary is also linear
$$\Pr(Y=k\mid X=x)=\frac{\pi_k f_k(x)}{\sum_{l=1}^{K}\pi_l f_l(x)}$$
We assume that an observation from the kth class is of the form $X\mid Y=k\sim N(\mu_k,\Sigma_k)$. When the $f_k(x)$ are Gaussian densities with the same covariance matrix $\Sigma$ in each class ($\Sigma_k=\Sigma$), this leads to linear discriminant analysis.
- By altering the forms for fk(x), we get different classifiers
- With $f_k(x)=\prod_{j=1}^{p}f_{kj}(x_j)$ (a conditional independence model) in each class, we get naive Bayes
- With Gaussians but a different $\Sigma_k$ in each class, we get quadratic discriminant analysis (QDA)
- If a Gaussian assumption is also imposed in naive Bayes, this means the $\Sigma_k$ are diagonal
- Many other forms are possible, obtained by proposing specific density models for $f_k(x)$, including nonparametric approaches
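A minimal sketch of LDA with scikit-learn (not from the original notes); the two-class Gaussian data with a shared covariance matrix are simulated here purely to match the model's assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two Gaussian classes sharing the same covariance matrix, as LDA assumes.
rng = np.random.default_rng(2)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
X0 = rng.multivariate_normal([0, 0], cov, size=200)
X1 = rng.multivariate_normal([2, 1], cov, size=200)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 200)

lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(lda.priors_)               # estimated pi_k
print(lda.means_)                # estimated mu_k
print(lda.covariance_)           # pooled estimate of Sigma
print(lda.predict_proba(X[:2]))  # posterior p_k(x) for two observations
```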
# Quadratic Discriminant Analysis
The Bayes classifier assigns an observation X=x to the class for which the following quantity is largest:
$$\delta_k(x)=-\frac{1}{2}x^T\Sigma_k^{-1}x+x^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\log|\Sigma_k|+\log(\pi_k)$$
The quantity x now appears in a quadratic function, so the decision boundaries are quadratic.
*Figure: the Bayes (purple dashed), LDA (black dotted), and QDA (green solid) decision boundaries for a two-class problem.*
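A companion sketch for QDA (again on invented data), this time with a different covariance matrix in each class, as QDA assumes.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Two Gaussian classes with *different* covariance matrices, as QDA assumes.
rng = np.random.default_rng(3)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=200)
X1 = rng.multivariate_normal([2, 1], [[1.5, -0.4], [-0.4, 0.8]], size=200)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 200)

qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(qda.covariance_)      # one estimated Sigma_k per class
print(qda.predict(X[:5]))   # class with the largest delta_k(x)
```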
# Naive Bayes
Assume the features are independent within each class:
$$f_k(x)=f_{k1}(x_1)\times f_{k2}(x_2)\times\dots\times f_{kp}(x_p)$$
It often leads to decent results, especially in settings where n is not large enough relative to p for us to effectively estimate the joint distribution of the predictors within each class
- Xj quantitative:
  - We can assume that $X_j\mid Y=k\sim N(\mu_{kj},\sigma_{kj}^2)$, which amounts to QDA with the additional assumption that the class-specific covariance matrices are diagonal
  - We can also replace $f_{kj}(x_j)$ with a nonparametric estimate, e.g. a histogram-based estimate of the density of Xj within each class
- Xj qualitative: we can simply count the proportion of training observations of the jth predictor corresponding to each class
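A short sketch of the Gaussian choice of f_{kj} using scikit-learn's GaussianNB; the simulated data and settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Gaussian naive Bayes: each feature is modeled as N(mu_kj, sigma^2_kj) within each class,
# i.e. the class-specific covariance matrices are assumed diagonal.
rng = np.random.default_rng(4)
X0 = rng.normal(loc=[0, 0], scale=[1.0, 2.0], size=(200, 2))
X1 = rng.normal(loc=[2, 1], scale=[1.5, 0.5], size=(200, 2))
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 200)

nb = GaussianNB().fit(X, y)
print(nb.theta_)               # estimated mu_kj (per class, per feature)
print(nb.var_)                 # estimated sigma^2_kj (named sigma_ in older scikit-learn)
print(nb.predict_proba(X[:3]))
```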
# Analytical Comparison of Different Methods
For each method, we assign an observation X=x to the class that maximizes $p_k(x)=\Pr(Y=k\mid X=x)$; equivalently, we can compare the log odds of the posterior probabilities relative to a baseline class K.
- For LDA we have
$$\log\left(\frac{p_k(x)}{p_K(x)}\right)=a_k+\sum_{j=1}^{p}b_{kj}x_j,$$
so LDA, like logistic regression, assumes that the log odds of the posterior probabilities is linear in $x$
- For QDA we have
$$\log\left(\frac{p_k(x)}{p_K(x)}\right)=a_k+\sum_{j=1}^{p}b_{kj}x_j+\sum_{j=1}^{p}\sum_{l=1}^{p}c_{kjl}x_jx_l,$$
so QDA assumes that the log odds of the posterior probabilities is quadratic in $x$
- For naive Bayes we have
$$\log\left(\frac{p_k(x)}{p_K(x)}\right)=a_k+\sum_{j=1}^{p}g_{kj}(x_j),$$
so the log odds is additive in the predictors
- LDA is a special case of QDA (with $c_{kjl}=0$ for all j=1,...,p, l=1,...,p, and k=1,...,K)
- Any classifier with a linear decision boundary is a special case of naive Bayes (with $g_{kj}(x_j)=b_{kj}x_j$)
- QDA and naive Bayes can produce more flexible fits
- When the normality assumption holds, LDA would be expected to do better than logistic regression
- KNN is completely non-parametric: No assumptions are made about the shape of the decision boundary
- QDA is a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches. If the true decision boundary is:
  - Linear: LDA and logistic regression outperform
  - Moderately non-linear: QDA outperforms
  - More complicated: KNN is superior
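An illustrative sketch, under invented settings, comparing the four approaches on simulated data whose true boundary is quadratic, where QDA would be expected to do relatively well:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Simulated data whose true boundary is quadratic in the predictors.
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(n_neighbors=15),
}
for name, model in models.items():
    print(name, model.fit(X_train, y_train).score(X_test, y_test))  # test accuracy
```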
# ROC (Receiver Operating Characteristic) Curve
- Precision: TP/(TP+FP)
- Recall: TP/(TP+FN) = Sensitivity = True positive rate = Power
- Specificity: TN/(TN+FP)
- False positive rate: FP/(TN+FP) = 1 − Specificity (Type I error)
- False negative rate: FN/(TP+FN) = 1 − Sensitivity (Type II error)
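A small sketch computing these quantities and the ROC curve with scikit-learn; the labels and scores below are invented purely for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

# Invented true labels and predicted scores, only to illustrate the definitions above.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.65, 0.2, 0.85, 0.3])
y_pred = (scores >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("precision:", tp / (tp + fp))
print("recall / sensitivity / TPR:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("false positive rate:", fp / (tn + fp))

# The ROC curve traces TPR against FPR as the classification threshold varies.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC:", roc_auc_score(y_true, scores))
```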
# Poisson Regression
Suppose that a random variable Y takes on nonnegative integer values. If Y follows the Poisson distribution, then
$$\Pr(Y=k)=\frac{e^{-\lambda}\lambda^k}{k!}\quad\text{for }k=0,1,2,\dots$$
where λ>0 and λ=E(Y)=Var(Y)
- We consider the following model for the mean λ=E(Y∣X)
$$\log(\lambda(X_1,\dots,X_p))=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p$$
- Given n independent observations from the Poisson regression model, the likelihood takes the form
$$\ell(\beta_0,\beta_1,\dots,\beta_p)=\prod_{i=1}^{n}\frac{e^{-\lambda(x_i)}\lambda(x_i)^{y_i}}{y_i!}$$
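A minimal sketch of fitting this likelihood with statsmodels' GLM; the simulated data and "true" coefficients are assumptions for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Simulated count data following log(lambda) = beta_0 + beta_1 * x.
rng = np.random.default_rng(6)
x = rng.uniform(0, 2, size=500)
lam = np.exp(0.5 + 1.2 * x)          # assumed "true" coefficients
y = rng.poisson(lam)

# A GLM with Poisson family and the (default) log link maximizes the likelihood above.
X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)                    # estimated beta_0, beta_1
```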
# Generalized Linear Models (GLM) in Greater Generality
These approaches share some common characteristics:
- We assume that, conditional on X1,...,Xp, Y belongs to a certain family of distributions.
- Each approach models the mean of Y as a function of the predictors.
- They can be expressed using a link function η, which applies a transformation to E(Y∣X1,...,Xp) so that the transformed mean is a linear function of the predictors. That is,
$$\eta(E(Y\mid X_1,\dots,X_p))=\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p$$
For linear, logistic, and Poisson regression, the assumed distribution of Y∣X, the form of the mean, and the link function η(μ) are:

| Model | Distribution of Y∣X | $E(Y\mid X_1,\dots,X_p)$ | Link function $\eta(\mu)$ |
|---|---|---|---|
| Linear regression | Gaussian (normal) | $\beta_0+\beta_1X_1+\beta_2X_2+\dots+\beta_pX_p$ | $\eta(\mu)=\mu$ |
| Logistic regression | Bernoulli | $e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}\,/\,\big(1+e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}\big)$ | $\eta(\mu)=\log(\mu/(1-\mu))$ |
| Poisson regression | Poisson | $e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}$ | $\eta(\mu)=\log(\mu)$ |
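A brief sketch, on invented data, showing the three models fit as GLMs in statsmodels, each with its family and the (default) canonical link from the table above:

```python
import numpy as np
import statsmodels.api as sm

# One simulated predictor; responses generated under each of the three models.
rng = np.random.default_rng(7)
x = rng.normal(size=400)
X = sm.add_constant(x)

y_gauss = 1.0 + 2.0 * x + rng.normal(size=400)                  # linear regression
y_bern = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.5 * x))))    # logistic regression
y_pois = rng.poisson(np.exp(0.2 + 0.8 * x))                     # Poisson regression

# Each family uses its canonical link by default:
# identity for Gaussian, logit for Binomial (Bernoulli), log for Poisson.
for y, family in [(y_gauss, sm.families.Gaussian()),
                  (y_bern, sm.families.Binomial()),
                  (y_pois, sm.families.Poisson())]:
    print(sm.GLM(y, X, family=family).fit().params)             # estimated beta_0, beta_1
```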