Lecture 19
Dr. Elijah Meyer + Konnie Huang
Duke University
STA 199 - Fall 2022
November 2nd, 2022
– Clone ae-18
– You have a final-project repo. Clone it before lab tomorrow.
– HW4 extended to Thursday - Check Sakai
– Testing and Training data
– The What, Why, and How of Logistic Regression
– What is a training data set?
“Sandbox” for model building. Build the model on these data.
– What is a testing data set?
Held in reserve to test one or two chosen models.
Evaluate the performance
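As a rough illustration of the split described above, here is a minimal Python sketch using only the standard library. The 80/20 proportion, the seed, the toy data, and the function name `train_test_split` are assumptions made for this example; the lecture does not specify them.

```python
import random

def train_test_split(rows, train_prop=0.8, seed=199):
    """Randomly split rows into a training set and a testing set."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_prop)
    # First part is for building the model, the rest is held in reserve
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical data: 10 observation ids
observations = list(range(10))
train, test = train_test_split(observations)
print(len(train), len(test))
```

The model would be built on `train` only; `test` is touched once, at the end, to evaluate the chosen model.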
Similar to linear regression, but it is the modeling tool we use when our response is categorical
– Bernoulli Distribution
Two outcomes: success (with probability p) or failure (with probability 1-p)
\(y_i \sim \text{Bern}(p)\)
We can use our explanatory variable(s) to model p
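To make the Bernoulli distribution concrete, the following sketch simulates binary outcomes for a fixed success probability. The value p = 0.3, the sample size, the seed, and the helper name `simulate_bernoulli` are all assumptions chosen for illustration.

```python
import random

def simulate_bernoulli(p, n, seed=199):
    """Draw n Bernoulli(p) outcomes: 1 = success, 0 = failure."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so it falls below p with probability p
    return [1 if rng.random() < p else 0 for _ in range(n)]

draws = simulate_bernoulli(p=0.3, n=10_000)
prop_success = sum(draws) / len(draws)
print(prop_success)  # the observed proportion of successes, an estimate of p
```

In logistic regression, p is no longer a single fixed number: it is allowed to change with the explanatory variable(s).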
– 1: Define a linear model
– 2: Define a link function
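\(p_i = \beta_0 + \beta_1 X_{1i} + \dots\)
But we can’t stop here: the right-hand side can take any value between \(-\infty\) and \(\infty\), while \(p_i\) must lie between 0 and 1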
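A small numeric check shows why the linear model alone is not enough. The coefficients and x values below are hypothetical, picked only so the problem is easy to see.

```python
# Hypothetical coefficients, chosen only for illustration
beta_0 = 0.2
beta_1 = 0.1

def linear_model(x):
    """Plain linear predictor: nothing keeps its output inside [0, 1]."""
    return beta_0 + beta_1 * x

xs = [-5, 0, 5, 10]
values = [linear_model(x) for x in xs]

# Collect the fitted values that cannot be valid probabilities
out_of_range = [v for v in values if v < 0 or v > 1]
print(values)
print(out_of_range)
```

For the smallest and largest x, the "probability" falls below 0 or above 1, which is the problem a link function is meant to solve.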
Next, we need a link function that relates the linear model to the parameter of the outcome distribution, i.e., one that transforms the linear model to have an appropriate range
– Or, put the other way around: we want to take values between negative and positive infinity and map them to probabilities
– A logit link function transforms the probabilities of the levels of a categorical response variable to a continuous scale that is unbounded
– Note: log refers to the natural log
Takes a probability between 0 and 1 and maps it to log-odds (\(-\infty\) to \(\infty\))
– Recall, the goal is to take values between \(-\infty\) and \(\infty\) and map them to probabilities. We need the opposite of the link function, that is, its inverse
– How do we take the inverse of a natural log?
logit(p) is also known as the log-odds
\(\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\)
\(\text{logit}(p_i) = \beta_0 + \beta_1 X_{1i} + \dots\)
\(\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 X_{1i} + \dots\)
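The logit definition above can be sketched directly in Python. The function name `logit` and the example probabilities are choices made for this illustration; the input check reflects the fact that the log-odds are undefined at exactly 0 or 1.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1 (natural log)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1 - p))

for p in [0.1, 0.5, 0.9]:
    print(p, logit(p))
```

Probabilities below 0.5 give negative log-odds, 0.5 gives exactly 0, and probabilities above 0.5 give positive log-odds, so the output is not confined to any bounded interval.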
Let’s take the inverse of the logit function
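One way to carry out this step is sketched below. Here \(\eta_i\) is shorthand introduced only for this derivation (it does not appear in the slides) for the linear model on the right-hand side of the logit equation. Since log is the natural log, we undo it by exponentiating both sides and then solve for \(p_i\):

\[
\begin{aligned}
\log\left(\frac{p_i}{1-p_i}\right) &= \eta_i \\
\frac{p_i}{1-p_i} &= e^{\eta_i} \\
p_i &= e^{\eta_i}(1 - p_i) \\
p_i\left(1 + e^{\eta_i}\right) &= e^{\eta_i} \\
p_i &= \frac{e^{\eta_i}}{1 + e^{\eta_i}}
\end{aligned}
\]

Because \(e^{\eta_i}\) is always positive, the result is always strictly between 0 and 1, no matter what value the linear model takes.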
Example Figure:
– We cannot model these data using the tools we currently have
– We can overcome some of the shortcomings of linear regression by fitting a generalized linear model
– We can model binary data using an inverse logit function to model probabilities of success
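The last point can be illustrated with a short sketch of the inverse logit applied to a linear model. The coefficients and x values are hypothetical and are not estimates from any data set in the course. The form \(1 / (1 + e^{-\eta})\) used in the code is algebraically the same as \(e^{\eta} / (1 + e^{\eta})\).

```python
import math

def inv_logit(eta):
    """Map a log-odds value eta (any real number) to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients, for illustration only
beta_0 = -1.0
beta_1 = 0.5

xs = [-4, 0, 2, 8]
# Linear model on the log-odds scale, then inverse logit to get probabilities
probs = [inv_logit(beta_0 + beta_1 * x) for x in xs]

for x, p in zip(xs, probs):
    print(x, round(p, 3))
```

Every predicted value is a valid probability of success, and with a positive slope the probability rises as x increases.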