Our familiar approach to regression requires nearly normal residuals. However, there are circumstances when this is impossible. An important example is when the response variable is categorical with two levels. This video accompanies OpenIntro Statistics' discussion of "Logistic Regression," which is a helpful tool when the response variable is binary. First we will be introduced to the *email* data set, for which our goal will be to build a spam filter. Then we will learn the fundamentals of logistic regression and of modeling the probability of an event. Finally, we will discuss diagnostics for the
email classifier. Throughout this video we will focus on the email data set. These data represent incoming emails during the first three months of 2012 for a single email account, and our goal
will be to develop a basic spam filter using these data. The response variable, *spam*, has been encoded to take the value one when the message is spam and zero when it is not. The data set also includes additional variables such as *cc*, which indicates whether someone was copied, or cc'ed, on the email, and *dollar*,
which indicates whether a dollar symbol appeared in the email. Remember, regular multiple regression requires the residuals to be approximately normally distributed. This is impossible in our current example,
because the response variable is binary. Therefore, we will explore a new tool: logistic regression. Logistic regression is a type of generalized linear model for binary response variables, a setting where regular multiple regression does not work well. The outcome Y_i takes the value one (in our application, this represents a spam message) with probability p_i and the value zero with probability one minus p_i. It is the probability p_i that we model in relation to the predictor variables. Fundamentally, the logistic model relates the probability an email is spam to the predictors through a framework much like that of multiple regression. The key is to transform the response variable, and a common transformation for p_i is the logit transformation: logit(p_i) = log( p_i / (1 - p_i) ), which we set equal to a linear combination of the predictors. You may be wondering why we picked this particular transformation. Well, let's start with the right hand side
of the equation. Notice the coefficients beta one, beta two,
and all the rest of them, could be any number: positive, negative, or zero. Therefore, when the right-hand side is summed up, it too can take any real value. In addition to other useful properties, the logit transformation maps probabilities between zero and one onto the entire real line, so the left side can take values over the same interval that is possible on the right side. There are two key conditions for fitting a logistic regression model. First, the model relating the parameter p_i to the predictors must closely resemble the true relationship between the parameter and the predictors. Here, x_1i, x_2i, and so on represent the values of the predictors corresponding to the i-th observation. And second, the outcome of each case must be independent of the outcomes of the other cases. In other words, it must be reasonable to model the emails as approximately independent of each other once we account for the influence of the predictors. Let's consider how we might check the condition regarding the structure of the model. Among emails modeled as having a 10% chance of being spam, are about 10% of them actually spam? To help us out, we've borrowed an advanced statistical method called *natural splines* that smoothly estimates the local probability of being spam across the range of fitted values. For this video, all we need to know about natural splines is that they fit flexible curves rather than straight lines. The curve fit using natural splines is shown as a solid black line. If the logistic model fits well, this curve should closely follow the dashed line. We have added shading to represent a confidence bound for the curved line, to clarify which fluctuations might plausibly be due to chance. Even with this confidence bound, there are weaknesses in the first model assumption: the solid curve and its confidence bound dip below the dashed line from about 0.1 to 0.3, and then drift above the dashed line from about 0.35 to 0.55.
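The model just described can be sketched in code. The following Python example is a minimal illustration and is not part of the video: it simulates a toy data set using the same variable names as the email example (the *cc* and *dollar* indicators and the *spam* outcome here are made-up stand-ins, not the real openintro email data), and fits the logit model by maximum likelihood using Newton's method, a standard fitting procedure for logistic regression.

```python
import numpy as np

# Hypothetical illustration: simulate emails with two indicator
# predictors (cc, dollar) and a binary outcome (spam), generated
# from the logistic model
#     logit(p_i) = log(p_i / (1 - p_i)) = b0 + b1*cc_i + b2*dollar_i
# The coefficients below are invented for this sketch.
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    np.ones(n),                 # intercept column
    rng.integers(0, 2, n),      # cc indicator (0 or 1)
    rng.integers(0, 2, n),      # dollar indicator (0 or 1)
])
true_beta = np.array([-2.0, 0.5, 1.5])        # assumed, for simulation only
p = 1.0 / (1.0 + np.exp(-X @ true_beta))      # inverse logit gives p_i
y = rng.binomial(1, p)                        # spam = 1, not spam = 0

# Fit by maximum likelihood with Newton-Raphson iterations.
beta = np.zeros(3)
for _ in range(25):
    p_hat = 1.0 / (1.0 + np.exp(-X @ beta))   # current fitted probabilities
    W = p_hat * (1.0 - p_hat)                 # working weights
    grad = X.T @ (y - p_hat)                  # score (gradient of log-likelihood)
    hess = X.T @ (X * W[:, None])             # information matrix
    beta = beta + np.linalg.solve(hess, grad)

# Residuals: observed outcome minus fitted probability for each email.
p_hat = 1.0 / (1.0 + np.exp(-X @ beta))
resid = y - p_hat

print(beta)  # estimates should land near true_beta
```

Each residual here is the observed zero-or-one outcome minus the fitted probability, which is the quantity used in the residual diagnostics for this model.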
These deviations indicate that the model relating the parameter to the predictors does not closely resemble the true relationship. Continuing with our investigation, we might evaluate the independence assumption using the model residuals. We can use the same approach as in regression, where each residual equals the observed outcome minus the expected outcome. As we saw earlier in this video, for logistic regression the expected value of the outcome is the fitted probability for the observation. We could plot these residuals against a variety of variables, or in their order of collection, as we did with the residuals in multiple regression. Our goal in this video was to build a model for determining whether an email was spam or not. Along the way, we were introduced to logistic regression, which is a helpful tool when the outcome variable is binary. We also reviewed the necessary assumptions for logistic regression and discussed possible diagnostic approaches. If you learned something you found interesting, share this video with a friend and visit openintro.org for more resources.