GLM Regression

Example Variables (Open Data):

The Fennema-Sherman Mathematics Attitude Scales (FSMAS) are among the most popular instruments used in studies of attitudes toward mathematics. FSMAS contains 36 items, and its scales include Confidence, Effectance Motivation, and Anxiety. The sample includes 425 teachers who answered all 36 items. Other characteristics of the teachers, such as their age, are also included in the data.
You can select your data as follows:
1-File
2-Open data
(See Open Data)

The data is stored under the name FSMAS-T (you can download this data from here).
You can edit the imported data via the following path:
1-File
2-Edit Data
(See Edit Data)

Example Variables (Compute Variable):

The three variables Confidence, Effectance Motivation, and Anxiety can be calculated through the following path:
1-Transform
2-Compute Variable
Items starting with the letters C, M, and A are related to the variables Confidence, Effectance Motivation, and Anxiety, respectively.
(See Compute Variable)

Introduction to GLM Regression:

A generalized linear model (or GLM) consists of three components:
1-A random component, specifying the conditional distribution of the dependent variable, $Y_i$ (for the i-th of n independently sampled observations), given the values of the independent variables in the model. In the initial formulation of GLMs, the distribution of $Y_i$ was a member of an exponential family, such as the Gaussian, binomial, Poisson, gamma, or inverse-Gaussian families of distributions.

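Most commonly used statistical distributions are members of the exponential family of distributions, whose densities can be written in the form
$$f(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{\phi} + c(y, \phi) \right\}$$
where $\phi$ is the dispersion parameter and $\theta$ is the canonical parameter.

It can be shown that
$$E(Y) = b'(\theta) = \mu, \qquad \mathrm{Var}(Y) = \phi\, b''(\theta) = \phi\, V(\mu)$$
where $V(\mu)$ is the variance function.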
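As a worked check that the mean is $b'(\theta)$ and the variance is $\phi\, b''(\theta)$, the Poisson distribution can be put in exponential-family form. This is a sketch that assumes the parameterization $f(y;\theta,\phi)=\exp\{[y\theta-b(\theta)]/\phi+c(y,\phi)\}$; the symbols follow that convention and are not taken from the software.

```latex
\begin{aligned}
p(y) &= \frac{\mu^{y} e^{-\mu}}{y!} = \exp\{\, y\log\mu - \mu - \log y! \,\} \\
\theta &= \log\mu, \quad b(\theta) = e^{\theta}, \quad \phi = 1, \quad c(y,\phi) = -\log y! \\
E(Y) &= b'(\theta) = e^{\theta} = \mu, \qquad
\mathrm{Var}(Y) = \phi\, b''(\theta) = e^{\theta} = \mu
\end{aligned}
```

The variance function is therefore $V(\mu) = \mu$, and the canonical link for the Poisson family is the log.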




2-A linear predictor, that is, a linear function of the regressors:
$$\eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik}$$

3-A smooth and invertible linearizing link function $g(\cdot)$, which transforms the expectation of the response variable, $\mu_i = E(Y_i)$, to the linear predictor: $g(\mu_i) = \eta_i$. Because the link function is invertible, we can also write $\mu_i = g^{-1}(\eta_i)$. Thus, the GLM may be thought of as a linear model for a transformation of the expected response or as a nonlinear regression model for the response.

Commonly employed link functions are shown in Table 1.

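Table 1: Some Common Link Functions

Link | $\eta_i = g(\mu_i)$ | $\mu_i = g^{-1}(\eta_i)$
Identity | $\mu_i$ | $\eta_i$
Log | $\log \mu_i$ | $e^{\eta_i}$
Inverse | $\mu_i^{-1}$ | $\eta_i^{-1}$
Inverse-square | $\mu_i^{-2}$ | $\eta_i^{-1/2}$
Square-root | $\sqrt{\mu_i}$ | $\eta_i^{2}$
Logit | $\log[\mu_i / (1 - \mu_i)]$ | $1 / (1 + e^{-\eta_i})$
Probit | $\Phi^{-1}(\mu_i)$ | $\Phi(\eta_i)$
Complementary log-log | $\log[-\log(1 - \mu_i)]$ | $1 - \exp[-\exp(\eta_i)]$
Cauchit | $\tan[\pi(\mu_i - 1/2)]$ | $1/2 + \arctan(\eta_i)/\pi$

In Table 1, $\Phi$ is the cumulative distribution function of the standard normal distribution.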
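The link functions listed in Table 1 can be sketched in a few lines of Python. This is only an illustration using the standard textbook definitions of each link; the dictionary name `LINKS` and the key spellings are assumptions made for this example, not identifiers from the software. Each entry pairs a link g with its inverse, and the last line checks that applying g and then its inverse returns the original mean.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the probit link

# name -> (g, g_inverse); names are hypothetical labels for this sketch
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "inverse-square": (lambda mu: mu ** -2, lambda eta: eta ** -0.5),
    "square-root": (math.sqrt, lambda eta: eta ** 2),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
    "cloglog": (lambda mu: math.log(-math.log(1.0 - mu)),
                lambda eta: 1.0 - math.exp(-math.exp(eta))),
    "cauchit": (lambda mu: math.tan(math.pi * (mu - 0.5)),
                lambda eta: 0.5 + math.atan(eta) / math.pi),
}

mu = 0.3  # a mean that is valid for every link above
# Map mu to the linear-predictor scale and back again for each link.
round_trip = {name: inv(g(mu)) for name, (g, inv) in LINKS.items()}
```

Because every link is invertible, each value in `round_trip` should equal 0.3 up to floating-point error.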
Commonly employed distributions are:

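*Gaussian (or Normal):
The Gaussian distribution with mean $\mu$ and variance $\sigma^2$ has density function:
$$p(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right]$$

*Inverse Gaussian:
The inverse-Gaussian distributions are another continuous family indexed by two parameters, $\mu$ and $\lambda$, with density function
$$p(y) = \sqrt{\frac{\lambda}{2\pi y^3}} \exp\left[-\frac{\lambda (y-\mu)^2}{2 y \mu^2}\right]$$
where $y > 0$, $\mu$ is the mean, and $\lambda > 0$.

*Gamma:
The gamma distributions are a continuous family with density function:
$$p(y) = \left(\frac{y}{\omega}\right)^{\psi-1} \frac{\exp(-y/\omega)}{\omega\,\Gamma(\psi)}$$
where $y > 0$, $\omega > 0$ is the scale parameter, $\psi > 0$ is the shape parameter, and $\Gamma(\cdot)$ is the gamma function. For all positive integers $n$, $\Gamma(n) = (n-1)!$.

*Binomial:
The binomial distribution for the proportion $Y$ of successes in n independent binary trials with probability of success $\mu$ has probability function
$$p(y) = \binom{n}{ny} \mu^{ny} (1-\mu)^{n(1-y)}$$
where $y = 0, 1/n, 2/n, \dots, 1$.

*Poisson:
The Poisson distributions are a discrete family with probability function
$$p(y) = \frac{\mu^{y} e^{-\mu}}{y!}$$
where $y = 0, 1, 2, \dots$ and the mean is $\mu > 0$.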
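The five families can be evaluated directly with the Python standard library. The sketch below uses one common textbook parameterization for each family (mean `mu` and standard deviation `sigma` for the Gaussian, mean `mu` and `lam` for the inverse Gaussian, scale `omega` and shape `psi` for the gamma); these parameter names are assumptions for the example and may differ from what the software uses internally.

```python
import math

def gaussian_pdf(y, mu, sigma):
    return math.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def inverse_gaussian_pdf(y, mu, lam):
    return math.sqrt(lam / (2 * math.pi * y ** 3)) * math.exp(
        -lam * (y - mu) ** 2 / (2 * y * mu ** 2))

def gamma_pdf(y, omega, psi):
    # omega: scale parameter, psi: shape parameter
    return (y / omega) ** (psi - 1) * math.exp(-y / omega) / (omega * math.gamma(psi))

def binomial_pmf(y, n, mu):
    # y is the proportion of successes, so n*y is the success count
    k = round(n * y)
    return math.comb(n, k) * mu ** k * (1 - mu) ** (n - k)

def poisson_pmf(y, mu):
    return mu ** y * math.exp(-mu) / math.factorial(y)

# Sanity checks: probabilities sum to one and the Poisson mean equals mu.
poisson_total = sum(poisson_pmf(k, 2.5) for k in range(60))
poisson_mean = sum(k * poisson_pmf(k, 2.5) for k in range(60))
binomial_total = sum(binomial_pmf(k / 10, 10, 0.3) for k in range(11))
```

With `psi = 1` the gamma density reduces to the exponential density, which gives a quick way to check the scale and shape roles.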

*Estimation Algorithm:

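A single algorithm can be used to estimate the parameters of an exponential family GLM using maximum likelihood.
The log-likelihood for the sample $y_1, \dots, y_n$ is
$$l = \sum_{i=1}^{n} \left[ \frac{y_i\theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i) \right]$$
For the parameters $\beta_j$, the maximum likelihood estimates are obtained by solving the score equations
$$s(\beta_j) = \frac{\partial l}{\partial \beta_j} = \sum_{i=1}^{n} \frac{y_i - \mu_i}{\phi_i V(\mu_i)} \cdot \frac{x_{ij}}{g'(\mu_i)} = 0$$
where $V(\mu_i)$ is the variance function and $g$ is the link function.
We assume that $\phi_i = \phi / a_i$,
where $\phi$ is a single dispersion parameter and $a_i$ are known prior weights.
A general method of solving score equations is the iterative algorithm known as Fisher's Method of Scoring. In the r-th iteration, the new estimate $\beta^{(r+1)}$ is obtained from the previous estimate $\beta^{(r)}$. After some calculation, $\beta^{(r+1)}$ is obtained from the following relation:
$$\beta^{(r+1)} = (X^T W^{(r)} X)^{-1} X^T W^{(r)} z^{(r)}$$
i.e. the score equations for a weighted least squares regression of $z^{(r)}$ on $X$ with
weights $W^{(r)} = \mathrm{diag}(w_i^{(r)})$, where
$$z_i^{(r)} = \eta_i^{(r)} + (y_i - \mu_i^{(r)})\, g'(\mu_i^{(r)}), \qquad w_i^{(r)} = \frac{a_i}{V(\mu_i^{(r)})\,[g'(\mu_i^{(r)})]^2}$$

Hence the estimates can be found using an Iteratively Weighted Least Squares (IWLS) algorithm:

1-Start with initial estimates $\mu_i^{(r)}$
2-Calculate working responses $z_i^{(r)}$ and working weights $w_i^{(r)}$
3-Calculate $\beta^{(r+1)}$ by weighted least squares
4-Repeat 2 and 3 until convergence

For models with the canonical link, this is simply the Newton-Raphson method.
The estimates $\hat\beta$ have the usual properties of maximum likelihood estimators. In particular, $\hat\beta$ is asymptotically $N(\beta, i^{-1})$, where $i(\beta) = \phi^{-1} X^T W X$ is the Fisher information.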
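The four IWLS steps can be sketched in plain Python for one concrete case. The example assumes a Poisson response with the log link and unit prior weights, so that the variance function is V(mu) = mu and g'(mu) = 1/mu; the function names `solve` and `iwls_poisson` and the small data set are hypothetical and are not taken from the software.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def iwls_poisson(X, y, tol=1e-10, max_iter=50):
    n, p = len(X), len(X[0])
    mu = [yi + 0.1 for yi in y]          # step 1: initial estimates of the means
    eta = [math.log(m) for m in mu]
    beta = [0.0] * p
    for _ in range(max_iter):
        # step 2: working responses and weights for the log link
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu                           # w = 1 / (V(mu) * g'(mu)^2) = mu
        # step 3: weighted least squares, (X'WX) beta = X'Wz
        XtWX = [[sum(w[i] * X[i][j] * X[i][k] for i in range(n))
                 for k in range(p)] for j in range(p)]
        XtWz = [sum(w[i] * X[i][j] * z[i] for i in range(n)) for j in range(p)]
        new_beta = solve(XtWX, XtWz)
        eta = [sum(b * x for b, x in zip(new_beta, row)) for row in X]
        mu = [math.exp(e) for e in eta]
        converged = max(abs(a - b) for a, b in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:                    # step 4: stop once beta is stable
            break
    return beta, mu

# Hypothetical data: an intercept column plus one predictor.
X = [[1.0, float(x)] for x in range(6)]
y = [1, 1, 3, 4, 8, 12]
beta, mu = iwls_poisson(X, y)
```

At convergence the score equations hold, so with the log link the residuals `y - mu` are orthogonal to every column of `X`.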

Path of GLM Regression:

You can perform GLM regression by the following path:
1-Exploratory Analysis
2-Regression
3-GLM

A. GLM Regression window:

The GLM Regression panel includes two tabs, Regression and Residuals.

B. Regression

B1. Select Dependent Variable:

Click this button to choose the dependent variable. In the window that opens, select the desired variable.
For example, the variable Confidence is selected in this data.

B2. Select Independent Variable:

Click this button to choose the independent variables. In the window that opens, select the desired variables.
For example, the variables Effectance Motivation and Anxiety are selected in this data.

B3. Select Distribution:

After selecting the dependent and independent variables, you must select a distribution that is appropriate for the type of the dependent variable. You can select one of the following distributions:
-Gaussian (or Normal)
-Inverse Gaussian
-Gamma
-Binomial
-Poisson

B4. Select Link Function:

After selecting the distribution, you must choose one of the suggested link functions.

B4.1. Gaussian distribution:

The following link functions are suggested for Gaussian distribution:

Link
Identity
Inverse
Log

B4.2. Inverse Gaussian distribution:

The following link functions are suggested for Inverse Gaussian distribution:

Link
Identity
Inverse
Log
Inverse-square

B4.3. Gamma distribution:

The following link functions are suggested for Gamma distribution:

Link
Identity
Inverse
Log

B4.4. Error:

The selected distribution must be compatible with the type of the dependent variable.
For example, to select the binomial distribution, the dependent variable needs to be a factor. If you select a dependent variable of type factor (such as gender), you can choose either the Binomial or the Poisson option.

B4.5. Binomial distribution:

The following link functions are suggested for Binomial distribution:

Link
Logit
Probit
Complementary log-log
Cauchit
Log

B4.6. Poisson distribution:

The following link functions are suggested for Poisson distribution:

Link
Identity
Log
Inverse

B5. Run Regression:

You can see the results of the GLM regression in the results panel by clicking this button.
Results include the following:
-Deviance Residuals
-Coefficients
-AIC
-VIF

B5.1. Results:

Results include the following:

*Deviance Residuals:
A five-number summary of the deviance residuals is given: Min (minimum of the residuals), 1Q (first quartile of the residuals), Median (median of the residuals), 3Q (third quartile of the residuals), and Max (maximum of the residuals).

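*Coefficients:
The coefficients presented in this table are obtained from the following relationships:

*Beta:
$$\hat\beta = (X^T W X)^{-1} X^T W z$$
evaluated at convergence of the IWLS algorithm.

*Std. Error:
$$\mathrm{se}(\hat\beta_j) = \sqrt{\hat\phi\,[(X^T W X)^{-1}]_{jj}}$$

*z value:
For non-Normal data, we can use the fact that asymptotically
$$\frac{\hat\beta_j - \beta_j}{\mathrm{se}(\hat\beta_j)} \sim N(0, 1)$$
so that, under the hypothesis $\beta_j = 0$, the statistic is $z = \hat\beta_j / \mathrm{se}(\hat\beta_j)$.

*Pr(>|z|):
P-value = $\Pr(|Z| > |z|)$, where $Z$ is a standard normal variable and $z$ is the observed z value.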
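The z value and its two-sided p-value can be reproduced by hand from an estimate and its standard error. This is a minimal sketch; the function name `wald_test` and the numbers passed to it are hypothetical and are not output from the FSMAS-T data.

```python
from statistics import NormalDist

def wald_test(estimate, std_error):
    """Return the z value and the two-sided p-value Pr(|Z| > |z|)."""
    z = estimate / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical coefficient and standard error.
z, p = wald_test(0.49, 0.25)
```

A z value near 1.96 in absolute size corresponds to a p-value near 0.05, the conventional two-sided threshold.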

*Different model summaries are reported for GLMs. First, we have the deviance of two models: the null deviance and the residual deviance.
The first refers to the null model, in which all of the terms are excluded except the intercept, if present. The degrees of freedom for this model are the number of observations n minus 1 if an intercept is fitted. The second refers to the fitted model, which has n-p degrees of freedom, where p is the number of parameters, including any intercept.
The deviance of a model is defined as
$$D = 2\phi\,(l_{sat} - l_{mod})$$
where $l_{mod}$ is the log-likelihood of the fitted model and $l_{sat}$ is the log-likelihood of the saturated model.

*AIC:
The AIC is a measure of fit that penalizes for the number of parameters p:
$$\mathrm{AIC} = -2\,l_{mod} + 2p$$
where $l_{mod}$ is the log-likelihood of the fitted model.
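For a Poisson model (dispersion equal to 1), the deviances and the AIC can be computed directly from the observed counts and the fitted means. The sketch below assumes the standard Poisson log-likelihood; the fitted means in `mu_fitted` are made-up illustrative values rather than output from the software.

```python
import math

def poisson_loglik(y, mu):
    # log p(y) = y*log(mu) - mu - log(y!)
    return sum(yi * math.log(m) - m - math.lgamma(yi + 1) for yi, m in zip(y, mu))

def poisson_deviance(y, mu):
    # 2 * (saturated log-likelihood - model log-likelihood), with 0*log(0) taken as 0
    total = 0.0
    for yi, m in zip(y, mu):
        term = yi * math.log(yi / m) if yi > 0 else 0.0
        total += term - (yi - m)
    return 2.0 * total

y = [1, 0, 3, 4, 8]
mu_fitted = [0.8, 1.4, 2.5, 4.4, 7.9]       # hypothetical fitted means
mu_null = [sum(y) / len(y)] * len(y)        # intercept-only model: common mean
p = 2                                       # intercept plus one slope (assumed)

residual_deviance = poisson_deviance(y, mu_fitted)
null_deviance = poisson_deviance(y, mu_null)
aic = -2.0 * poisson_loglik(y, mu_fitted) + 2 * p
```

The deviance is zero when the fitted means reproduce the data exactly, which is the saturated model.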


*VIF:
The Collinearity Diagnostics table reports the degree of multicollinearity among the independent variables.
For each independent variable $x_i$, the VIF index is calculated in two steps:
STEP1:
First, we run an ordinary least squares regression that has $x_i$ as a function of all the other explanatory variables.
STEP2:
Then, calculate the VIF factor with the following equation:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is the coefficient of determination of the regression equation in step one. Also, $\mathrm{VIF}_i \ge 1$, since $0 \le R_i^2 < 1$.
A rule of thumb is that if $\mathrm{VIF}_i > 10$, then multicollinearity is high (a cutoff of 5 is also commonly used).
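The two steps can be sketched without any external library. The helper names `solve`, `r_squared`, and `vif` and the two small predictor columns are hypothetical; the sketch simply regresses each predictor on the others by ordinary least squares and applies 1 / (1 - R^2).

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def r_squared(target, predictors):
    """Step 1: R^2 of an OLS regression of target on the predictors plus an intercept."""
    n = len(target)
    X = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
    Xty = [sum(X[i][j] * target[i] for i in range(n)) for j in range(p)]
    coef = solve(XtX, Xty)
    fitted = [sum(c * x for c, x in zip(coef, row)) for row in X]
    mean = sum(target) / n
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum((t - mean) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot

def vif(columns):
    """Step 2: VIF_i = 1 / (1 - R_i^2) for every predictor column."""
    return [1.0 / (1.0 - r_squared(col, columns[:i] + columns[i + 1:]))
            for i, col in enumerate(columns)]

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]   # correlation with x1 is 0.8
vifs = vif([x1, x2])
```

With only two predictors, R^2 is the squared correlation between them, so both VIF values here equal 1 / (1 - 0.64).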

B6. Save Regression:

By clicking this button, you can save the regression results. After opening the save results window, you can save the results in “text” or “Microsoft Word” format.

B7. Bootstrap:

This option is located in the Regression tab. It uses random x resampling, which works as follows:
Assume that we want to fit a regression model with dependent variable y and predictors x1, x2,..., xp. We have a sample of n observations zi = (yi, xi1, xi2,..., xip), where i = 1,...,n. In random x resampling, we simply select B (Replication) bootstrap samples of the zi, fitting the model and saving the coefficient estimates from each bootstrap sample. Suppose the statistic t (here, a regression coefficient) is normally distributed, which is often approximately the case in sufficiently large samples. If $t_b^{*}$ is the corresponding estimate for the b-th bootstrap replication and $\bar{t}^{*}$ is the mean of the $t_b^{*}$, then the bootstrap estimate and the bootstrap standard error are
$$\bar{t}^{*} = \frac{1}{B}\sum_{b=1}^{B} t_b^{*}, \qquad \mathrm{SE}^{*}(t) = \sqrt{\frac{\sum_{b=1}^{B}\left(t_b^{*} - \bar{t}^{*}\right)^2}{B - 1}}$$

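Thus, the z value is the bootstrap estimate divided by the bootstrap standard error:
$$z = \frac{\bar{t}^{*}}{\mathrm{SE}^{*}(t)}$$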

*Bootstrap Method:
Enabling this option performs the regression using the bootstrap method with the following parameters:

*Replication: The number of replications (B in the equations).
*Set Seed: An arbitrary number. Keeping it fixed makes the bootstrap results reproducible from run to run.
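The roles of Replication and Set Seed can be illustrated with a small random x resampling sketch. To keep the example short, it bootstraps the slope of a one-predictor Gaussian model with identity link (ordinary least squares in closed form) rather than a general GLM; the function names and the data are hypothetical.

```python
import random
import statistics

def ols_slope(pairs):
    """Closed-form least-squares slope for (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy / sxx

def bootstrap_slope(pairs, replications, seed):
    rng = random.Random(seed)            # fixed seed gives reproducible resamples
    n = len(pairs)
    estimates = []
    while len(estimates) < replications:
        # Resample whole observations (y together with x) with replacement.
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        if len({x for x, _ in sample}) < 2:
            continue                     # slope undefined when all x are equal
        estimates.append(ols_slope(sample))
    boot_coef = statistics.fmean(estimates)
    boot_se = statistics.stdev(estimates)    # uses the B - 1 denominator
    return boot_coef, boot_se

# Hypothetical data with a slope close to 2.
pairs = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8),
         (5, 10.1), (6, 12.2), (7, 13.8), (8, 16.1)]
boot_coef, boot_se = bootstrap_slope(pairs, replications=200, seed=123)
```

Calling `bootstrap_slope` twice with the same seed returns identical results, while a different seed gives slightly different estimates.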

B7.1. Results(Bootstrap):

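Running regression with the Bootstrap option enabled provides the following results:
Original: the coefficient estimated from the original sample
bootCoefficient: the bootstrap estimate of the coefficient (the mean over the replications)
bootSE: the bootstrap standard error of the coefficient
z value: the z value computed from the bootstrap results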

C. Residual

C1. Residual Type:

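In the Residual tab, the Residual Type panel offers several kinds of residuals. The following residuals can be defined for GLMs:

*Quantile: The quantile residual for observation i has the same cumulative probability on a standard normal distribution as $y_i$ does for the fitted distribution.

*Partial: Partial residual plots are similar to plotting residuals against $x_j$, but with the linear trend with respect to $x_j$ added back into the plot.

*working: from the working response in the IWLS algorithm, $r_i^{W} = z_i - \hat\eta_i = (y_i - \hat\mu_i)\, g'(\hat\mu_i)$

*response: $r_i^{R} = y_i - \hat\mu_i$

*Deviance: $r_i^{D} = \mathrm{sign}(y_i - \hat\mu_i)\sqrt{d_i}$, where $d_i$ is the contribution of observation i to the deviance, so that $\sum_i (r_i^{D})^2$ equals the deviance.

*Pearson: $r_i^{P} = (y_i - \hat\mu_i) / \sqrt{V(\hat\mu_i)}$, where $V(\cdot)$ is the variance function.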
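Four of these residual types have simple closed forms for a Poisson model with the log link, where the variance function is V(mu) = mu and g'(mu) = 1/mu. The sketch below uses the standard textbook definitions; the function name `poisson_residuals` and the fitted means are hypothetical.

```python
import math

def poisson_residuals(y, mu):
    out = {"response": [], "pearson": [], "deviance": [], "working": []}
    for yi, m in zip(y, mu):
        diff = yi - m
        out["response"].append(diff)
        out["pearson"].append(diff / math.sqrt(m))       # V(mu) = mu
        # Contribution of this observation to the deviance (0*log(0) taken as 0).
        d = 2.0 * ((yi * math.log(yi / m) if yi > 0 else 0.0) - diff)
        out["deviance"].append(math.copysign(math.sqrt(max(d, 0.0)), diff))
        out["working"].append(diff / m)                  # (y - mu) * g'(mu)
    return out

y = [1, 0, 3, 4, 8]
mu = [0.8, 1.4, 2.5, 4.4, 7.9]    # hypothetical fitted means
res = poisson_residuals(y, mu)
deviance = sum(r ** 2 for r in res["deviance"])
```

Summing the squared deviance residuals recovers the model deviance, which is why they are the default choice in many GLM summaries.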

C2. QQ Plot of Residuals:

After selecting the Residual Type, you can assess the normality of the residuals by clicking the QQ Plot of Residuals button. If the residuals follow a normal distribution with mean $\mu$ and variance $\sigma^2$, then a plot of the theoretical percentiles of the normal distribution (Theoretical Quantiles) versus the observed sample percentiles of the residuals (Sample Quantiles) should be approximately linear. If the Normal QQ Plot is approximately linear, it is reasonable to assume that the error terms are normally distributed.
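The points of a normal QQ plot can be computed as follows. This sketch assumes the plotting positions (i - 0.5) / n, which is one common convention; the software may use a different one. The function name `qq_points` and the residual values are hypothetical.

```python
from statistics import NormalDist

def qq_points(residuals):
    """Pair each sorted residual with the matching standard normal quantile."""
    n = len(residuals)
    sample = sorted(residuals)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))

points = qq_points([0.3, -1.2, 0.8, -0.1, 1.5])
```

Plotting the first element of each pair on the horizontal axis and the second on the vertical axis gives the QQ plot; points close to a straight line suggest approximately normal residuals.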

C3. Add Residuals & Name of Residual Variable:

By clicking the “Add Residuals” button, you can save the residuals of the regression model under the desired name (Name of Residual Variable). The software's default name for the residual variable is “GLMResid”.
This message will appear if the residuals are saved successfully:
“Name of Residual Variable Residuals added in Data Table.”
For example, “GLMResid Residuals added in Data Table.”