Generalized Linear Models
- Gianluca Turcatel
- Mar 26
- 6 min read
Updated: Mar 29

Introduction
Generalized Linear Models (GLMs) were introduced by John Nelder and Robert Wedderburn in 1972 and provide a unified framework for modeling data originating from the exponential family of distributions, which includes the Gaussian, binomial, and Poisson distributions, among others. Furthermore, thanks to the link function, GLMs do not require a linear relationship between the dependent and independent variables.
Each GLM consists of three components: a link function, a linear predictor, and a probability distribution from the exponential family.

The linear predictor is the linear combination of input variables (predictors) and their corresponding coefficients. The link function establishes the relationship between the linear predictor and the expected value of the response variable. Lastly, the probability distribution describes the assumed distribution of the response variable. In GLMs, the response variable's probability distribution belongs to the exponential family. This family includes many common distributions such as the normal, binomial, Poisson, and gamma distributions. The choice of the probability distribution for the response variable is based on the nature of the data being modeled.
From statsmodels.org, the distribution families currently implemented are Binomial, Gamma, Gaussian, Inverse Gaussian, Negative Binomial, and Poisson. The table below will help the reader choose the most appropriate link function based on the distribution of the dependent variable.
Distribution | Usage / Domain of the Distribution | Common Link Function
--- | --- | ---
Normal | Directly relates the linear predictor to the response variable. Real: (-∞, +∞) | Identity
Poisson | Counts of events in a fixed amount of time/space. Integers: ≥ 0 | Log
Gamma | Waiting times; time until N events occur. Real: > 0 | Log
Bernoulli | Modeling binary outcomes. Integers: {0, 1} | Logit
Exponential | Waiting times; time until failure. Real: ≥ 0 | Log
The assumptions of GLMs are the following:
Linearity: although the original relationship between the response and explanatory variables may not be linear, the relationship between the transformed response (via the link function) and the explanatory variables must be linear in GLMs.
Independence: all observations are assumed to be independent, and errors must be independent. The responses yᵢ are assumed to be independent of each other given the explanatory variables Xᵢ (the outcome for one observation does not influence the outcome of another observation).
Appropriate probability distribution: the error distribution must be appropriate to the nature of the response variable. Unlike classical regression models, GLMs do not require the assumption of homoscedasticity. The dependent variable and errors do not need to follow a normal distribution, as GLMs typically assume a distribution from the exponential family.
Correct link function: the link function used in the GLM must correctly relate the linear predictor to the expected value of the response variable.
Fitting GLMs
To fit a Generalized Linear Model in Python, it's crucial to understand the syntax for creating the formula argument. Below is an example of the R-style formula syntax used in the statsmodels Python library:

The formula must be enclosed in quotation marks, with each explanatory variable separated by a plus sign. Categorical variables should be wrapped in C(). To remove the intercept, include "-1" in the formula. Interaction terms can be represented in different ways based on specific needs. A colon between two variables indicates that only the interaction term of the variables is used, while the multiplication symbol (*) includes both the individual variables and their interaction term. Transformations of the variables can also be directly added into the formula.
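As a sketch of how these rules translate into design matrices, the patsy library (which statsmodels uses under the hood to parse formulas) can be queried directly; the variable names below are made up for illustration:

```python
import pandas as pd
from patsy import dmatrices

# Toy data; the column names are hypothetical
df = pd.DataFrame({
    "y":     [1.0, 2.0, 3.0, 4.0],
    "x1":    [0.5, 1.5, 2.5, 3.5],
    "x2":    [1.0, 0.0, 1.0, 0.0],
    "group": ["a", "b", "a", "b"],
})

# Additive terms; an intercept is added automatically
_, X_add = dmatrices("y ~ x1 + x2", df)
print(X_add.design_info.column_names)      # ['Intercept', 'x1', 'x2']

# C() marks a categorical variable; "-1" removes the intercept
_, X_cat = dmatrices("y ~ C(group) - 1", df)

# ":" keeps only the interaction term; "*" also keeps the main effects
_, X_inter = dmatrices("y ~ x1:x2", df)    # Intercept, x1:x2
_, X_full  = dmatrices("y ~ x1*x2", df)    # Intercept, x1, x2, x1:x2
```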
1. Linear regression
Let's start with the well-known linear regression model. Linear regression is a special case of a GLM where the link function is the identity function and the probability distribution is the normal distribution:
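With predictors x_{i1}, …, x_{ik}, this model can be written as:

```latex
y_i \sim \mathcal{N}(\mu_i,\ \sigma^2), \qquad
\mu_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} \tag{i}
```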

In the simplest univariate case, (i) is simplified to:

And the data would look something like this:

The red dotted line represents µi, while the blue dots are yi. Note the normal density distribution centered on 4 values of µ (green crosses, picked randomly). The data above was generated with β0=10, β1=2 and σ²=2.
Let’s proceed to fit a linear regression model:

Both β0 (Intercept) and the β1 (slope X) are correct and statistically significant.
Next, let’s verify some assumptions of the model:
1) Linear relationship between dependent and independent variable. This is clearly visible from the data itself.

2) Homoscedasticity: constant variance of the residuals with respect to the fitted values

3) No autocorrelation of the residuals. This can be verified by plotting the residuals in the order that the data were collected. Randomness in the plot will confirm lack of autocorrelation:

4) Residuals are normally distributed. This can be confirmed by plotting QQ plot or the histogram of the residuals:

5) Residuals are independent from X:

2. Logistic regression
Binary Logistic Regression (BLR) models the probability of success of a binary response variable y that depends on one or more explanatory variables. In BLR, the link function is the logit function and the probability distribution is the Bernoulli distribution:
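In symbols, with success probability p_i:

```latex
y_i \sim \mathrm{Bernoulli}(p_i), \qquad
\operatorname{logit}(p_i) = \ln\!\frac{p_i}{1-p_i} = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} \tag{i}
```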

When only one explanatory variable is present (i) is simplified to:

And the data would look something like this:

The blue dots represent µi which follow a sigmoidal shape. The data above was generated with β0= -1.5, β1= 2.
Let’s proceed to fit a BLR model:

The predicted β0 (Intercept) and the β1 (slope X) are pretty close to the true values and statistically significant.
3. Poisson Linear Regression
Poisson Linear Regression (PLR) is typically used to model count data. It's particularly suitable when the response variable represents the number of occurrences of an event in a fixed period of time or space. In PLR, the link function is the log function and the probability distribution is the Poisson distribution:
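In symbols, with expected count µ_i:

```latex
y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\ln(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} \tag{i}
```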

When only one explanatory variable is present (i) is simplified to

And the data would look something like this:

The red dotted line represents µi. Note the Poisson density distribution centered on 5 values of µ (green crosses, picked randomly). For high values of X, y's distribution is symmetric and resembles a normal distribution, as expected. The data above was generated with β0=0, β1=0.5.
Let’s proceed to fit a PLR model.

The predicted β0 (Intercept) and the β1 (slope X) are pretty close to the true values and statistically significant. Next, let’s confirm that the variance of the residuals increases as mean increases (fitted values):

The residuals are fanning out as fitted values increase, suggesting that PLR model was appropriate for the data.
4. Gamma Linear Regression
The gamma distribution is used when the response variable is continuous and strictly positive, such as modeling waiting times, insurance claim amounts, or durations until failure. It's particularly useful when the data exhibit right skewness and heteroscedasticity, meaning the spread of the data changes as the predictor variables change. Furthermore, the gamma distribution has two parameters, shape (α) and scale (β), providing flexibility in capturing different shapes of distributions, including exponential, Erlang, and chi-squared distributions as special cases. In Gamma Linear Regression (GLR), the link function is the log function and the probability distribution is the Gamma distribution:
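In symbols, with mean µ_i and shape α (so the scale is µ_i/α):

```latex
y_i \sim \mathrm{Gamma}\!\left(\alpha,\ \mu_i/\alpha\right), \qquad
\ln(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} \tag{i}
```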

When only one explanatory variable is present (i) is simplified to

And the data would look something like this:

The red dotted line represents µi. Note the Gamma density distribution centered on 5 values of µ (green crosses, picked randomly). For high values of X, y's distribution is symmetric and resembles a normal distribution, as expected. The data above was generated with β0=0.5, β1=0.5.
Let’s proceed to fit a GLR model.

The predicted β0 (Intercept) and the β1 (slope X) are pretty close to the true values and statistically significant. Next, let’s confirm that the variance of the residuals increases as mean increases (fitted values):

Conclusion
In conclusion, this article has provided a comprehensive overview of Generalized Linear Models through the lens of linear regression, logistic regression, Poisson regression, and Gamma regression, demonstrating their practical implementation in Python. We began by exploring linear regression, a fundamental technique for modeling continuous outcomes, followed by logistic regression, which is invaluable for binary classification problems. We then delved into Poisson regression, an essential tool for analyzing count data. Finally, we fitted a Gamma regression, which is used when the response variable is continuous and strictly positive. Throughout the article, we showcased how to fit these models using the statsmodels Python library, as well as how to validate the assumptions of each model. The code to generate the data and fit the various models can be found HERE.