Logistic Regression

Lecture 19

Dr. Elijah Meyer + Konnie Huang

Duke University
STA 199 - Fall 2022

November 2nd, 2022

Checklist

– Clone ae-18

Announcements

– You have a final-project repo. Clone it before lab tomorrow.

– HW4 extended to Thursday - Check Sakai

Goals

– Testing and Training data

– The What, Why, and How of Logistic Regression

Warm Up

– What is a testing data set?

Held in reserve to test one or two chosen models. Use it to evaluate model performance.

– What is a training data set?

“Sandbox” for model building. Build the model on these data.
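A minimal sketch of splitting data into training and testing sets, assuming a plain list of observations. The function name `train_test_split`, the 80/20 split, and the seed are hypothetical choices for illustration, not something stated in the slides.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=199):
    """Shuffle rows and split them into (training, testing) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows are held in reserve; the rest are for model building.
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))             # hypothetical stand-in for 100 observations
train, test = train_test_split(rows)
print(len(train), len(test))
```

Every observation lands in exactly one of the two sets, so the testing data play no part in building the model.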

What is Logistic Regression?

Similar to linear regression, but
a modeling tool for when our response is categorical

What we will do today

Start from the beginning

Terms

– Bernoulli Distribution

2 outcomes: Success (with probability p) or Failure (with probability 1-p)
\(y_i \sim \text{Bern}(p)\)
We can use our explanatory variable(s) to model p
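To make the Bernoulli idea concrete, here is a small simulation sketch. The helper `bernoulli`, the value p = 0.3, and the seed are assumptions made for this example only.

```python
import random

def bernoulli(p, rng):
    """Return 1 (success) with probability p, otherwise 0 (failure)."""
    return 1 if rng.random() < p else 0

rng = random.Random(19)             # fixed seed so the run is repeatable
p = 0.3                             # hypothetical success probability
draws = [bernoulli(p, rng) for _ in range(10_000)]
proportion = sum(draws) / len(draws)
print(proportion)                   # sample proportion of successes, should land near p
```

Each draw is either 0 or 1, and the long-run share of 1s approaches p. Logistic regression lets p differ from observation to observation depending on the explanatory variables.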

2 Steps

– 1: Define a linear model

– 2: Define a link function

A linear model

\(p_i = \beta_0 + \beta_1 X_{1i} + \dots\)

But we can’t stop here: the right-hand side can take any real value, while \(p_i\) must stay between 0 and 1.
Next, we need a link function that relates the linear model to the parameter of the outcome distribution, i.e., one that transforms the linear model to have an appropriate range
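A quick sketch of why the linear model alone is not enough. The coefficients `b0` and `b1` and the x values are made up for illustration.

```python
# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -1.5, 0.8

xs = [-2, 0, 2, 4]
linear = [b0 + b1 * x for x in xs]  # value of the linear model at each x

for x, value in zip(xs, linear):
    in_range = 0 <= value <= 1
    label = "valid probability" if in_range else "not a probability"
    print(x, round(value, 2), label)
```

Some of the values fall below 0 or above 1, so they cannot be read as probabilities without a transformation.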

Generalized linear model

Goal

– Take values between negative and positive infinity and map them to probabilities

Logit Link function

– A logit link function transforms the probabilities of the levels of a categorical response variable to a continuous scale that is unbounded

– Note: log refers to the natural log

What does this look like?

Takes a probability between 0 and 1 and maps it to the log odds (\(-\infty\) to \(\infty\))
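A short sketch of the logit function, using the definition log(p / (1 - p)) with the natural log. The function name `logit` and the example probabilities are choices made for this illustration. The function is only finite for p strictly between 0 and 1.

```python
import math

def logit(p):
    """Log-odds of a probability p, for 0 < p < 1 (natural log)."""
    return math.log(p / (1 - p))

for p in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(p, round(logit(p), 3))
```

Probabilities below 0.5 give negative log odds, 0.5 gives zero, and probabilities above 0.5 give positive log odds, growing without bound as p approaches 0 or 1.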

This isn’t exactly what we need, though

– Recall, the goal is to take values between \(-\infty\) and \(\infty\) and map them to probabilities. We need the opposite of the link function, i.e., the inverse

– How do we take the inverse of a natural log? We exponentiate: \(e^{\log(x)} = x\)

Taking the inverse of the logit function will map arbitrary real values back to the range [0, 1]
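A sketch of the inverse logit, written as e^x / (1 + e^x). The name `inv_logit` is an assumption for this example, and the two-branch form is an implementation choice to avoid overflow for large |x|. Both branches are algebraically the same function.

```python
import math

def inv_logit(x):
    """Map any real number x to a probability in [0, 1]."""
    if x >= 0:
        # Equivalent to e^x / (1 + e^x), but avoids computing a huge e^x.
        return 1 / (1 + math.exp(-x))
    z = math.exp(x)
    return z / (1 + z)

for x in (-5, -1, 0, 1, 5):
    print(x, round(inv_logit(x), 3))
```

An input of 0 maps to 0.5, large negative inputs map close to 0, and large positive inputs map close to 1.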

Generalized linear model

logit(p) is also known as the log-odds
\(\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\)
\(\text{logit}(p_i) = \beta_0 + \beta_1 X_{1i} + \dots\)

So

\(\text{logit}(p_i) = \beta_0 + \beta_1 X_{1i} + \dots\)
\(\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 X_{1i} + \dots\)
Let’s take the inverse of the logit function

\(p_i = \frac{e^{\beta_0 + \beta_1 X_{1i} + \dots}}{1 + e^{\beta_0 + \beta_1 X_{1i} + \dots}}\)
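The inversion can be written out step by step. In this sketch, \(\eta_i\) is shorthand introduced here (it does not appear in the slides) for the linear model \(\beta_0 + \beta_1 X_{1i} + \dots\)

```latex
\begin{align*}
\log\left(\frac{p_i}{1-p_i}\right) &= \eta_i \\
\frac{p_i}{1-p_i} &= e^{\eta_i} && \text{(exponentiate both sides)} \\
p_i &= e^{\eta_i}\,(1 - p_i) \\
p_i\,(1 + e^{\eta_i}) &= e^{\eta_i} \\
p_i &= \frac{e^{\eta_i}}{1 + e^{\eta_i}}
\end{align*}
```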

Example Figure:

Takeaways

– We cannot model a binary response using the tools we currently have (linear regression)

– We can overcome some of the shortcomings of linear regression by fitting a generalized linear model

– We can model binary data by using an inverse logit function to model the probability of success
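To tie the pieces together, here is a rough sketch that simulates binary data from known coefficients and then estimates them. The names, the "true" coefficients, the sample size, and the choice of plain gradient ascent on the Bernoulli log-likelihood are all assumptions for illustration. This is not the course's method, and in practice a statistical package's fitting routine would be used instead.

```python
import math
import random

def inv_logit(x):
    """Map any real number x to a probability in [0, 1]."""
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    z = math.exp(x)
    return z / (1 + z)

def fit_logistic(xs, ys, steps=2000, rate=0.5):
    """Estimate (b0, b1) by gradient ascent on the Bernoulli log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals y_i - p_i drive the gradient of the log-likelihood.
        resid = [y - inv_logit(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += rate * sum(resid) / n
        b1 += rate * sum(r * x for r, x in zip(resid, xs)) / n
    return b0, b1

rng = random.Random(2022)
true_b0, true_b1 = -0.5, 1.2        # hypothetical "true" coefficients
xs = [rng.uniform(-3, 3) for _ in range(300)]
# Each y_i is a Bernoulli draw with success probability inv_logit(b0 + b1 * x_i).
ys = [1 if rng.random() < inv_logit(true_b0 + true_b1 * x) else 0 for x in xs]

b0_hat, b1_hat = fit_logistic(xs, ys)
print(round(b0_hat, 2), round(b1_hat, 2))
```

The estimates should land in the neighborhood of the coefficients used to simulate the data, and the fitted probabilities `inv_logit(b0_hat + b1_hat * x)` always stay between 0 and 1.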