Logistic Regression

Introduction

Regression in statistics refers to a set of methods for estimating relationships between a dependent variable (often called "y") and one or more independent variables (often called "x"). The term regression was initially used in studies of heredity to describe the biological phenomenon that offspring tended to move towards the average (to regress) in terms of their physical characteristics.

Logistic Regression is a statistical model used for binary classification tasks, that is, situations where you want to predict one of two possible outcomes. For example, whether an email is spam or not spam, whether a transaction is fraudulent or not, whether a patient has a disease or not, etc.

Logistic regression uses input features to predict an output. However, unlike linear regression, which uses the same input features to predict a continuous output, logistic regression uses the input features to predict a probability. That probability is then transformed into a binary output.

The logistic function, or the sigmoid function, takes any input value from negative infinity to positive infinity and outputs a value between 0 and 1.

Logistic regression works as follows:

  1. Each input feature is multiplied by a corresponding weight (just like in linear regression), and a bias term is added.

  2. This sum is passed through the logistic function.

  3. If the output of the logistic function is greater than 0.5, the model predicts the positive class (often labeled as 1). If it's less than 0.5, it predicts the negative class (often labeled as 0).

Many details have been left out about how the weights are learned, how the model is trained, and so on, but this gives a rough idea of how logistic regression works.
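
To make those three steps concrete, here is a minimal sketch in Python using NumPy. The weights, bias, and input values are invented purely for illustration; in a real model they would be learned from training data.

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number to a value in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical, already-learned parameters for a model with two input features.
weights = np.array([0.8, -1.2])  # one weight per feature
bias = 0.5                       # bias (intercept) term

def predict(x):
    z = np.dot(weights, x) + bias  # step 1: weighted sum plus bias
    p = sigmoid(z)                 # step 2: squash into (0, 1)
    return p, int(p > 0.5)         # step 3: threshold at 0.5

p, label = predict(np.array([2.0, 1.0]))
print(f"probability = {p:.3f}, predicted class = {label}")  # probability ≈ 0.711, class 1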


Comparison with linear regression

Linear regression is used for regression tasks, which involve predicting a continuous output. For example, predicting house prices, predicting temperatures, predicting stock prices, etc. Linear regression uses the input features (X values) and multiplies each one by a corresponding coefficient (these are the weights or parameters you're trying to learn), adds a bias term (also called the intercept), and gives you a real-valued output (Y value).

Here's the formula for linear regression:

Y = β0 + β1*X1 + β2*X2 + ... + βn*Xn + e

Where:

  • Y is the dependent variable (that's what you're trying to predict),

  • β0 is the Y-intercept (value of Y when all X are 0),

  • β1 to βn are the coefficients of the independent variables (X1 to Xn),

  • X1 to Xn are the independent variables (input features), and

  • e is the error term.

This formula is used to draw a straight line (hence "linear") that best fits the data. However, this isn't suitable for classification tasks where you want to predict one of two possible outcomes.
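
As a quick illustration, here is what that weighted-sum prediction looks like in Python with NumPy. The coefficients are made-up values for a hypothetical house-price model with two features (size in square metres and number of rooms).

import numpy as np

# Made-up coefficients for a hypothetical house-price model.
beta0 = 50_000.0                      # intercept (β0)
betas = np.array([1_200.0, 8_000.0])  # β1 (per m²), β2 (per room)

def predict_y(x):
    # Y = β0 + β1*X1 + β2*X2 (the error term e is the part the model cannot explain)
    return beta0 + np.dot(betas, x)

print(predict_y(np.array([90.0, 3.0])))  # 90 m², 3 rooms -> 182000.0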

That's where logistic regression comes in. Logistic regression is used for binary classification tasks. Instead of predicting a continuous output, it predicts the probability that a given input point belongs to a certain class. To do this, it uses the same idea of weighted sums as linear regression, but it then applies a logistic (or sigmoid) function to this sum to get a value between 0 and 1, which can be interpreted as a probability.

Here's the formula for logistic regression:

p = 1 / (1 + e^(-z))
z = β0 + β1*X1 + β2*X2 + ... + βn*Xn

where:

  • p is the probability of the positive class (the thing we're trying to predict),

  • β0 to βn are the coefficients (similar to linear regression),

  • X1 to Xn are the input features, and

  • e is the base of natural logarithms (about equal to 2.71828).

This formula ensures that the output is between 0 and 1, which makes it suitable for interpreting as a probability. Typically, if the output is greater than 0.5, the model predicts the positive class. If it's less than 0.5, it predicts the negative class.

In terms of learning the weights, logistic regression typically uses a method called maximum likelihood estimation (MLE) to find the best values. The details of MLE are beyond the scope of this writing, but the basic idea is to find the set of weights that makes the observed data most likely under the model.
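
In practice you rarely implement MLE by hand. As a sketch, scikit-learn's LogisticRegression fits the coefficients by maximizing a (by default L2-regularized) likelihood; the toy data below is invented for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy dataset: one feature, where larger values tend to belong to class 1.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()  # fits weights via (regularized) maximum likelihood
model.fit(X, y)

print("weights:", model.coef_, "bias:", model.intercept_)
print("P(class = 1 | x = 2.2):", model.predict_proba([[2.2]])[0, 1])
print("predicted class for x = 2.2:", model.predict([[2.2]])[0])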


Reference: Formulations of the linear and logistic regression models

  • Linear regression:

      Y = β0 + β1*X1 + β2*X2 + ... + βn*Xn + e

    The formula calculates a weighted sum of the input features (X1, X2, ..., Xn). Each feature is multiplied by a corresponding weight (β1, β2, ..., βn), and then all these products are added together along with the intercept (β0). The error term (e) accounts for the randomness or the difference between the actual and predicted values. The output (Y) is a continuous variable.

    The error term is a measure of the difference between the prediction and the actual value. It can be positive or negative, and its magnitude shows how far off the prediction was: if the model's prediction is lower than the actual value, the error is positive, and vice versa. The goal is to find the line that makes these errors as small as possible overall, typically by minimizing the sum of their squares.

    A continuous variable means that Y can take on any value within a specified range, in contrast to a discrete variable, which can only take on certain specific values. For example, when predicting the price of a house from features like its size, number of rooms, and location, the price (Y) can take any non-negative value within a certain range and changes continuously rather than in discrete jumps.

    In terms of how it's represented in code, Y can be a single value (for a single prediction) or an array of values (for multiple predictions). In the context of a regression model, Y typically refers to an array of actual output values that we're trying to predict.

  • Logistic regression:

      p = 1 / (1 + e^(-z))
      z = β0 + β1*X1 + β2*X2 + ... + βn*Xn

    This formula first calculates a weighted sum of the input features (z) in the same way as linear regression. The negative of that sum is then used as the exponent of the constant e. The output (p) is a value between 0 and 1, representing the probability of the positive class.

    Weights represent the relative importance of each feature in the prediction and define the relationship between them. For instance, in a house price prediction model, the number of rooms might have a larger weight than the age of the house, indicating that the number of rooms contributes more to the price of a house than its age does. The weights are learned by iteratively adjusting them to minimize the difference between the predictions and the actual values (optimization).

    Raising e to any power results in a positive number, so the denominator 1 + e^-z of the logistic function 1 / (1 + e^-z) is always greater than 1 (where z is the weighted sum of the inputs). The function therefore always returns a value strictly between 0 and 1, regardless of the value of z, making it suitable for interpreting as a probability.

    The negative of the weighted sum (-z) is the exponent of the base e. The larger z is, the smaller e^-z becomes, so 1 + e^-z shrinks towards 1 and p approaches 1; conversely, a very negative z pushes p towards 0. A quick numerical check of this behaviour follows this list.
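
To verify that bound numerically (the z values below are arbitrary), note how the output always stays strictly between 0 and 1 and increases with z:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# However extreme z is, the output stays strictly between 0 and 1,
# and it increases monotonically with z.
for z in [-100.0, -5.0, -1.0, 0.0, 1.0, 5.0, 100.0]:
    print(f"z = {z:7.1f}  ->  p = {sigmoid(z):.6f}")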

In both cases, the goal of training the model is to find the best set of weights (β0, β1, ..., βn) that minimizes the difference between the predicted and actual output values. In linear regression, we use a method called least squares to find these weights. In logistic regression, we use maximum likelihood estimation.

Both of these techniques involve a bit of calculus and optimization, but the basic idea is to adjust the weights iteratively to find the values that make the model's predictions as accurate as possible.
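
To give a flavour of that iterative adjustment, here is a minimal gradient-descent sketch for logistic regression on synthetic data. Minimizing the negative log-likelihood (cross-entropy) this way is one common route to the maximum likelihood solution; the data, learning rate, and iteration count are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one feature, class 1 becomes more likely as x grows.
X = rng.normal(size=(200, 1))
true_w, true_b = 2.0, -0.5
probs = 1 / (1 + np.exp(-(X[:, 0] * true_w + true_b)))
y = (rng.random(200) < probs).astype(float)

w = np.zeros(1)   # weight (β1), initialized at zero
b = 0.0           # bias (β0)
lr = 0.1          # learning rate

for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)      # gradient of the loss w.r.t. the weight
    grad_b = np.mean(p - y)              # gradient w.r.t. the bias
    w -= lr * grad_w                     # nudge parameters to make the data more likely
    b -= lr * grad_b

print("learned weight:", w, "learned bias:", b)  # should land roughly near 2.0 and -0.5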