Generalized linear models, abridged.

Project repository: https://github.com/bwlewis/GLM

Mike Kane and Bryan W. Lewis

Introduction

Generalized linear models (GLMs) are indispensable tools in the data science toolbox. They are applicable to many real-world problems involving continuous, yes/no, count and survival data (and more). The models themselves are intuitive and can be used for inference and prediction. A few very high quality free and open source software implementations are available (in particular within R[11]), as are several first-rate commercial ones (Revolution Analytics[12], SAS, Stata).

Despite the wide applicability of GLMs and the availability of high-quality reference implementations, we've found few good high-performance free and open source implementations geared to solving large problems. Moreover, we've had trouble finding succinct references on the core implementation details that could help us build our own high-performance implementations.

This note grew out of our own desire to better understand the numerics of generalized linear models. We highlight aspects of GLM implementations that we find particularly interesting. We present some reference implementations stripped down to illuminate core ideas, often with just a few lines of code. Our focus on numerics gives short shrift to the statistical motivation behind GLMs; in particular, the IRLS method appears below with little derivation. We're working on a new introductory section that briefly summarizes the derivation of that method and ties it in to our main investigation of its numerics. We hope to finish it this summer, along with details on distributed implementations. --Bryan and Mike

We develop our ideas in our favorite language, R, but they are easily adapted to other languages. Python and Boost/C++ are particularly well-equipped for good GLM implementations in our opinion.

Disclaimer

This is not a formal introduction to GLMs. We discuss a few key implementation ideas, focusing on numerics, performance and scalability. We refer the reader to the references [3,5,8,10] for complete introductions to linear models.

These are working notes. We'll continue to tweak and revise them.

Linear models

We begin by defining what we mean when we say "linear model" before moving on to GLMs. Our presentation and notation generally follow Björck[2]. Let $b\in\mathbb{R}^m, b\ne 0$ be a vector of measurements (called a response vector) and let $A\in\mathbb{R}^{m\times n}$, $m\ge n$, $\mathrm{rank}(A)=n$, be a matrix formed from data observations (called a model matrix). Then \[ Ax = b + e \] is what we mean when we say "linear model," where $x\in\mathbb{R}^n$ is a vector of model coefficients and the entries of the residual error term $e\in\mathbb{R}^m$ are independent and identically distributed random variables. Given a vector $x$ of model coefficients, we call the product $Ax$ a predictor of the response vector $b$. We're leaving important definitions of "random variable" and "independent and identically distributed" to the references.

Given a model matrix $A$ and response vector $b$, there are many ways we might go about solving for a vector of model coefficients $x$. We describe a particularly natural one next.

Least squares

One natural approach to estimating the model coefficients is to compute the ordinary least squares solution $x$ of \[ \min_x ||Ax - b||^2, \] where $||\cdot||$ indicates the Euclidean norm. The condition that $\mathrm{rank}(A)=n$ implies that there exists a unique least squares solution. It turns out that the least squares solution of linear models has important statistical properties shown in the seminal work of Gauss[4] and later rediscovered by Markoff[7]. (The least squares solution defines a minimum variance linear unbiased estimator, the technical details of which we leave to the references, in particular see [5,8].)
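
As a small, self-contained illustration (simulated data, not part of the development above), the least squares solution can be computed in R via a QR factorization:

# Minimal sketch: ordinary least squares on simulated data.
set.seed(1)
m = 100; n = 3
A = matrix(rnorm(m * n), m, n)                 # model matrix
b = as.vector(A %*% c(1, -2, 0.5)) + rnorm(m)  # response with i.i.d. errors
x = qr.solve(A, b)                             # least squares solution of min ||Ax - b||^2
range(x - lm.fit(A, b)$coefficients)           # agrees with R's own least squares fit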

Generalized linear models

Relaxing conditions on linear models by allowing the entries of $e$ to be correlated or to be distributed with differing variances gives us generalized linear models. We need a few technical definitions to shore up this idea.

Let $\epsilon$ be a scalar-valued random variable with distribution function $F(\epsilon)$. The expected value $\mu$ and the variance $\sigma^2$ of $\epsilon$ are defined as: \[ E(\epsilon) = \mu = \int_{-\infty}^{\infty} \epsilon dF(\epsilon), \qquad \sigma^2 = \int_{-\infty}^{\infty}(\epsilon-\mu)^2dF(\epsilon). \] When $e$ is a vector of random variables, we write $E(e)$ to indicate the expected value of each entry, and similarly we define the expected value of a matrix of random variables to be the expected value of each entry. Define the variance-covariance matrix of a vector of random variables $e$ with mean vector $\mu$ to be \[ V(e) = E(ee^T) - \mu \mu^T. \]

The assumption of independence and identical distribution among the random variables in the error term $e$ of our earlier linear model required that $V(e) = \sigma^2 I$. General Gauss-Markoff linear models relax this requirement. When we say "generalized linear model" we mean a linear model of the form: \[ Ax = b + e,\qquad V(e)=\sigma^2W^{-1}, \] where $A$ is an $m\times n$ matrix of $\mathrm{rank}(A)=n$, $e$ is a vector of random variables with covariance matrix $\sigma^2 W^{-1}$ and $W$ is a symmetric positive definite matrix. See Björck[2, pp. 160-165] for a concise introduction to generalized linear models that we largely follow here. The solution of the generalized linear model can be formulated as \[ \min_x(Ax-b)^TW(Ax -b), \] with solution $x=(A^T W A)^{-1}A^T W b$. The numerical solution of model problems of this form was carefully analyzed by Paige[10]. We focus on simplified generalized models here and assume that $W$ is a diagonal matrix with positive entries along its diagonal.
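
For illustration, here is a small sketch (with simulated data and an assumed diagonal weight matrix) of the solution $x=(A^T W A)^{-1}A^T W b$, computed two equivalent ways:

# Weighted least squares with a diagonal weight matrix W (stored as a vector w).
set.seed(1)
m = 100; n = 3
A = matrix(rnorm(m * n), m, n)
b = as.vector(A %*% c(1, -2, 0.5)) + rnorm(m)
w = runif(m, 0.5, 2)                                   # positive diagonal entries of W
x1 = solve(crossprod(A, w * A), crossprod(A, w * b))   # normal equations
x2 = qr.solve(sqrt(w) * A, sqrt(w) * b)                # QR of W^(1/2) A, more stable
range(x1 - x2)                                         # the two agree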

Generalized linear models in practice

Generalized linear models typically arise when we constrain the values that the predictor $Ax$ can take on and assume a specific distribution function for the entries of $e$. Constraints are important for many practical problems. For example, when the entries of the response $b$ are binary yes/no (0/1) data, we typically want to constrain the predictor $Ax$ to take values in the interval $(0,1)$; when the entries of $b$ are counts, we may want to constrain the predictor to be nonnegative.

A standard approach to solving the constrained problem is to transform it into an unconstrained one by way of a link function. A link function is a bijection chosen to map the constraint interval onto the real line (the inverse link function maps the real line onto the constraint interval). For example, the logistic function \[ g(a) = \frac{1}{1 + \exp(-a)} \] maps arbitrary values $a\in\mathbb{R}$ into the interval $(0,1)$.
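
For instance, a quick sketch of the mapping in R (these two small helper functions are ours, defined just for illustration):

logistic = function(a) 1 / (1 + exp(-a))   # inverse link: maps R onto (0,1)
logit    = function(p) log(p / (1 - p))    # link: maps (0,1) back onto R
a = c(-5, 0, 5)
logistic(a)            # values strictly between 0 and 1
logit(logistic(a))     # recovers a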

Consider the binary 0/1 data example cited above. Let $A\in\mathbb{R}^{m\times n}$ be a given model matrix with $\mathrm{rank}(A)=n$, and let $b\in\mathbb{R}^m$, $b\ne 0$, be a given response vector. The generalized linear model problem is to compute an $x$ that satisfies: \[ Ax = b + e,\qquad \mathrm{such\, that}\,\, Ax\in(0,1). \] We can formulate a related unconstrained problem that solves for $x$ in \[ g(Ax) = b + e, \] where $g$ is the logistic function applied elementwise to $Ax$. The problem is no longer constrained because $g$ maps any value to the interval $(0,1)$. But it's no longer linear either!

The choice of error distribution defines the norm-weighting matrix $W$ discussed in the previous section. If we assume that the entries of $e$ consist of uncorrelated random variables with identical variance, then $W=\sigma^{-2}I$ and the problem reduces to simple nonlinear least squares: \[ \min_x ||g(Ax) - b||^2. \] That problem can be solved in R by several optimization methods, including the BFGS quasi-Newton method as follows:

# Input: A is an m by n model matrix, b a response vector of length m,
#        g is the inverse link function.
# Output: The model coefficients x.

nlls = function(A, b, g)
{
  # Objective: squared Euclidean norm of the residual b - g(Ax)
  f = function(x) drop(crossprod(b - g(A %*% x)))
  optim(rep(0, ncol(A)), fn=f, method="BFGS")$par
}

Alternatively, see R's nls function. If the entries of $e$ consist of uncorrelated random variables but with different variances, then $W\ne I$ but $W$ is still a diagonal matrix. This situation is common to many generalized linear model problems and results in a weighted nonlinear least squares problem (compare to the generalized linear model definition in the previous section): \[ \min_x(g(Ax)-b)^TW(g(Ax) -b), \] typically solved by the iteratively reweighted least squares method described next.
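
Before turning to IRLS, a weighted variant of the nlls sketch above might look like the following (the weights argument w, holding the diagonal of $W$, is our own illustrative addition; in IRLS the weights are instead recomputed at every iteration):

# Weighted nonlinear least squares with fixed, user-supplied weights w.
wnlls = function(A, b, g, w)
{
  f = function(x)
  {
    r = b - g(A %*% x)
    drop(crossprod(r, w * r))     # (g(Ax) - b)' W (g(Ax) - b)
  }
  optim(rep(0, ncol(A)), fn=f, method="BFGS")$par
}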

Algorithm IRLS: Iteratively reweighted least squares estimation

Input

Model matrix $A\in\mathbb{R}^{m\times n}$, response vector $b\in\mathbb{R}^m$, inverse link function $g$ and its derivative $g'$, the variance $\mathrm{var}$ as a function of the mean associated with the distribution function of $e$, termination tolerance $\mathrm{tol}$, maximum number of iterations itmax. Here and below, "$/$" indicates element-wise division and the inverse link function $g$ and its derivative and $\mathrm{var}$ are applied element-wise to their vector arguments.

  1. Let $x_1=0$.
  2. For $j=1,2,\ldots,\mathrm{itmax}$, do:
    1. Let $\eta = Ax_j$.
    2. Let $z = \eta + \frac{b - g(\eta)}{g'(\eta)},\qquad$
    3. Let $W = \mathrm{diag}(g'(\eta)^2/\mathrm{var}(g(\eta)))$.
    4. Let $x_{j+1} = (A^T W A)^{-1} A^T W z$.
    5. Stop if $||x_{j+1} - x_j|| \lt \mathrm{tol}$.

Output

Model coefficients $x_{j+1}$.

It's remarkable how rapidly this simple algorithm typically converges. Before we dive in to details, let's look at one simple R implementation of this algorithm and compare it to R's glm function. The minimalist GLM function below (compare to R's glm.fit function) takes as input a model matrix $A$, a response vector $b$, an R distribution family object, and optional maximum number of iterations and tolerance values. (This is the shortest GLM implementation we could come up with! But it's got some numerical issues, so read on...)

irls =
function(A, b, family=binomial, maxit=25, tol=1e-08)
{
  x = rep(0,ncol(A))
  for(j in 1:maxit)
  {
    eta    = A %*% x
    g      = family()$linkinv(eta)
    gprime = family()$mu.eta(eta)
    z      = eta + (b - g) / gprime
    W      = as.vector(gprime^2 / family()$variance(g))
    xold   = x
    x      = solve(crossprod(A,W*A), crossprod(A,W*z), tol=2*.Machine$double.eps)
    if(sqrt(crossprod(x-xold)) < tol) break
  }
  list(coefficients=x,iterations=j)
}

Let's compare our super-minimalist IRLS estimator to R's glm function. We use a real-world data example from the "mlmRev" package, which you may need to install with install.packages("mlmRev"). The example is described in Doug Bates' beautifully concise notes on GLMs[3] (you should read them!).

data("Contraception",package="mlmRev")

# Model estimated with R's glm function, returning model matrix and response
# in $x and $y, respectively:
R_GLM = glm(formula = use ~ age + I(age^2) + urban + livch,
            family = binomial, x=TRUE, data=Contraception)

# Model estimated with our radically stripped-down minimalist implementation:
mini = irls(R_GLM$x, R_GLM$y, family=binomial)

print(data.frame(R_GLM=coef(R_GLM), minimalist=coef(mini)))

                   R_GLM   minimalist
(Intercept) -0.949952124 -0.949952124
age          0.004583726  0.004583726
I(age^2)    -0.004286455 -0.004286455
urbanY       0.768097459  0.768097459
livch1       0.783112821  0.783112821
livch2       0.854904050  0.854904050
livch3+      0.806025052  0.806025052

The results agree to displayed accuracy. Our minimalist implementation of IRLS converged after five iterations in this example.

IRLS Notes

IRLS solves a generalized linear model in each iteration: \[ (A^T W A)^{-1} A^T W z. \] When $W$ is symmetric positive definite this is equivalent to solving the weighted least squares problem \[ \min_x||W^{1/2}(Ax - z)||^2. \] Note that in the logistic regression case $g'(a) = g(a)(1-g(a))$ is never zero (and so the diagonal entries of $W$ are never zero), but other link functions might lead to degenerate cases with singular, positive semi-definite weighting matrices. We'll discuss approaches to dealing with some exceptional cases like this in the Edge cases section below.

Numerical difficulties can and do arise. What if $g'(\eta) \approx 0$ at some point? Or similarly, what if $\mathrm{var}(g(\eta))\approx 0$ or $W$ is such that $A^TWA$ becomes extremely ill-conditioned? A more subtle problem occurs when the maximum ratio of any two entries along the diagonal of $W$ is a big number. Such problems are called stiff and bring their own set of numerical difficulties.

Björck[2, pp. 165-171] presents an excellent survey of solution approaches for the weighted least squares problem. Direct solution of the normal equations (as illustrated in our simple R implementation of the IRLS algorithm above) is potentially numerically unstable and not generally advised. The brief survey in Björck is really fascinating reading. His cited references show us that even standard Householder QR decomposition can give poor accuracy for stiff problems unless it's modified to include column and row pivoting. It turns out that perhaps the simplest numerically stable implementation for solving the linear system within the IRLS method is the Givens QR algorithm, as shown by Anda and Park[1].

In practice, with judicious monitoring of the entries of $W$ (in particular the stiffness $\max_j(W_{j,j})/\min_j(W_{j,j})$) and the quantities $g'(\eta)$ and $\mathrm{var}(g(\eta))$, we can choose among the most stable and convenient solution methods and simply warn the user when the algorithm gets into trouble numerically. R achieves this in its own glm.fit implementation by using a Householder QR decomposition of $WA$ with column pivoting, modified to reveal the rank of the matrix at each step. See Edge cases section below for a discussion of this topic.
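
As a sketch of this kind of monitoring (the stiffness threshold here is an arbitrary choice of ours, not a standard value):

# Simple diagnostics on the vector W of diagonal IRLS weights.
check_weights = function(W, stiffness_limit=1e10)
{
  if(min(W) < .Machine$double.eps) warning("Nearly zero weights encountered")
  stiffness = max(W) / min(W)
  if(stiffness > stiffness_limit)  warning("Stiff weighted least squares problem")
  stiffness
}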

Connection to ordinary least squares

When the link function is the identity function and the error distribution function of $e$ is Gaussian with mean zero and variance one, then the IRLS algorithm becomes:

  1. $x_1=0$
  2. $\eta = Ax_1 = 0$
  3. $z = \eta + \frac{b - g(\eta)}{g'(\eta)} = 0 + \frac{b - 0}{1} = b$
  4. $W = \mathrm{diag}(g'(\eta)^2/\mathrm{var}(g(\eta))) = \mathrm{diag}(1^2 /1)= I$
  5. $x_{2} = (A^T W A)^{-1} A^T W z = (A^T A)^{-1} A^T b$,
which is just the ordinary least squares solution.
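
We can check this numerically with the minimalist irls function above (a quick sketch on simulated data with the gaussian family):

# With the identity link and Gaussian errors, IRLS reproduces ordinary least squares.
set.seed(1)
A = cbind(1, matrix(rnorm(200), 100, 2))
b = as.vector(A %*% c(0.5, 1, -1)) + rnorm(100)
x_irls = irls(A, b, family=gaussian)$coefficients
x_ols  = qr.solve(A, b)
range(x_irls - x_ols)     # differences near machine precision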

Performance and stability

We already sketched a simple implementation of the IRLS algorithm above (see IRLS). Recall that this method can suffer from numerical instability. In this section we consider variations of the method that focus on numerical stability and performance.

Perhaps the most important performance considerations for a good GLM/IRLS implementation are:

  1. Take advantage of fast basic linear algebra subroutine (BLAS) and LAPACK libraries whenever possible.
  2. Take advantage of fast sparse matrix vector products.
We will see that #2 is somewhat at odds with numerical stability. We seek to develop a flexible method that can reasonably accommodate stability and performance. See these slides http://goo.gl/gcPezs for a brief discussion on using high performance BLAS libraries with R.

Algorithm IRLS (QR Newton variant)

This method was defined by O'Leary[9]. The idea is to compute a QR factorization of the model matrix $A$ once, possibly by a rank-revealing QR method to identify problematic model matrices at the start. The factorization is then used in the IRLS iteration.

Input

Model matrix $A\in\mathbb{R}^{m\times n}$, response vector $b\in\mathbb{R}^m$, inverse link function $g$ and its derivative $g'$, the variance $\mathrm{var}$ as a function of the mean associated with the distribution function of $e$, termination tolerance $\mathrm{tol}$, maximum number of iterations itmax. We also require a sense of "too small" for the numerical checks indicated in the algorithm.

  1. Let $t=0\cdot b$ (or possibly some other initial guess).
  2. Let $s=0\cdot x$.
  3. Compute $A = QR$.
  4. For $j=1,2,\ldots,\mathrm{itmax}$, do:
    1. Let $z = t + \frac{b - g(t)}{g'(t)},\qquad$
    2. Let $W = \mathrm{diag}(g'(t)^2/\mathrm{var}(g(t)))$.
    3. If $\min(\mathrm{diag}(W))$ is too small warn the user or see the Edge cases section below.
    4. Let $s_{old} = s$.
    5. Let $s = (Q^T W Q)^{-1}Q^T W z$ (solve with a Cholesky factorization of $Q^T W Q$.)
    6. Let $t = A(A^T W A)^{-1} A^T W z = Q s.$
    7. Exit for loop if $||s_{old} - s|| \lt \mathrm{tol}$.
  5. Solve $R x = Q^T t$ for $x$ (simple backsolve).

Output

Model coefficients $x$.

R implementation

irls_qrnewton =
function(A, b, family=binomial, maxit=25, tol=1e-08)
{
  s = t = 0
  QR = qr(A)      # one QR factorization of the model matrix, reused in every iteration
  Q  = qr.Q(QR)
  R  = qr.R(QR)
  for(j in 1:maxit)
  {
    g      = family()$linkinv(t)
    gprime = family()$mu.eta(t)
    z      = t + (b - g) / gprime                         # working response
    W      = as.vector(gprime^2 / family()$variance(g))   # working weights
    wmin   = min(W)
    if(wmin < sqrt(.Machine$double.eps))
      warning("Tiny weights encountered")
    s_old  = s
    C      = chol(crossprod(Q, W*Q))                      # Cholesky of Q'WQ
    s      = forwardsolve(t(C), crossprod(Q, W*z))
    s      = backsolve(C, s)
    t      = Q %*% s                                      # current predictor
    if(sqrt(crossprod(s - s_old)) < tol) break
  }
  x = backsolve(R, crossprod(Q, t))                       # recover x from R x = Q't
  list(coefficients=x, iterations=j)
}

This is a pretty cool IRLS implementation. It only uses the orthonormal $Q$ matrix and the current set of weights in each step. That results in potentially much better numerical stability than our earlier simple implementation IRLS. Positive-definiteness for the Cholesky factorization is easily checked in each iteration by examining the diagonal entries of the weight matrix.

The computations rely on the QR and Cholesky factorizations and matrix multiplication, all available to R from high-performance tuned BLAS and LAPACK libraries like the Intel MKL, AMD ACML, OpenBLAS (GotoBLAS), or ATLAS. That means that if you have one of those BLAS libraries installed and linked to R, then this method will take advantage of available vectorized CPU instructions and multithreading and run fast.

R's native IRLS method embedded in its glm.fit function computes a QR decomposition of the weighted matrix $WA$ in each iteration step. O'Leary's QR Newton method shown above instead only computes one QR factorization of $A$ at the start. It then monitors possible numerical problems by checking the size of the smallest diagonal element of the weight matrix in each step.

Let's at least verify the plausibility of this method by repeating an earlier example from Doug Bates[3]:

data("Contraception",package="mlmRev")

# Model estimated with R's glm function, returning model matrix and response
# in $x and $y, respectively:
R_GLM = glm(formula = use ~ age + I(age^2) + urban + livch,
            family = binomial, x=TRUE, data=Contraception)

# Model estimated by IRLS/QR Newton:
iqrn = irls_qrnewton(R_GLM$x, R_GLM$y, family=binomial)

print(data.frame(R_GLM=coef(R_GLM), qr_newton=coef(iqrn)))

                   R_GLM    qr_newton
(Intercept) -0.949952124 -0.949952124
age          0.004583726  0.004583726
I(age^2)    -0.004286455 -0.004286455
urbanY       0.768097459  0.768097459
livch1       0.783112821  0.783112821
livch2       0.854904050  0.854904050
livch3+      0.806025052  0.806025052

No surprises here for this modest-sized and well conditioned data set. The IRLS/QR Newton method produces coefficients that match those of R's native GLM method to displayed accuracy.

Algorithm IRLS (SVD Newton variant)

If you're really concerned about rank deficiency of the model matrix and you don't have a rank-revealing QR method available and you're willing to pay a few extra flops on the initial matrix decomposition, then the IRLS QR Newton variation can be easily adapted to an SVD-based method that definitively determines the rank of the model matrix.

Input

Model matrix $A\in\mathbb{R}^{m\times n},\,\,m\ge n$, response vector $b\in\mathbb{R}^m$, inverse link function $g$ and its derivative $g'$, the variance $\mathrm{var}$ as a function of the mean associated with the distribution function of $e$, termination tolerance $\mathrm{tol}$, maximum number of iterations $\mathrm{itmax}$.

  1. Let $t=0\cdot b$ (or possibly some other initial guess).
  2. Let $s=0\cdot x$.
  3. Compute the SVD $AV = U\Sigma,\,\,\,\, U^TU = I,\,\,\, V^TV = I, \,\,\,\Sigma=\mathrm{diag}(\sigma_1\ge\sigma_2\ge\cdots\ge\sigma_n\ge 0)$.
  4. (If any $\sigma_j=0$ then $A$ is rank-deficient, stop and warn the user or take other action—see the Edge cases section below.)
  5. For $j=1,2,\ldots,\mathrm{itmax}$, do:
    1. Let $z = t + \frac{b - g(t)}{g'(t)},\qquad$
    2. Let $W = \mathrm{diag}(g'(t)^2/\mathrm{var}(g(t)))$.
    3. If $\min(\mathrm{diag}(W))$ is too small warn the user or see the Edge cases section below.
    4. Let $s_{old} = s$.
    5. Let $s = (U^T W U)^{-1}U^T W z$ (solve with a Cholesky factorization of $U^T W U$.)
    6. Let $t = U s.$
    7. Exit for loop if $||s_{old} - s|| \lt \mathrm{tol}$.
  6. Let $x = V\Sigma^{-1}U^T t$.

Output

Model coefficients $x$.

Similarly to the QR Newton method listed above, the SVD-based method only uses the orthonormal $U$ matrix and the current weighting matrix in each iteration. A side-benefit of the SVD is that the final solution for model coefficients no longer requires a backsolve.

This is Bryan's personal favorite IRLS implementation. It takes advantage of available fast numeric libraries and provides a straightforward path for dealing with rank-deficient problems. An R implementation of this algorithm is shown in the following section.

Edge cases

We have so far required:
  1. The model matrix to be of full rank (and have at least as many rows as columns).
  2. The weight matrix $W$ to be a diagonal matrix with positive diagonal entries.
Both requirements are conservative; we can solve problems where one or both requirements are not satisfied. But we have to make some decisions about how to proceed in such cases. This section outlines possible approaches to solving degenerate problems. We consider how R handles them and some alternative approaches.

What does R do?

Consulting the R language is always a good idea when faced with questions about algorithm implementations. R is a vibrant open-source software project of exceptional numeric quality that has been written and actively used and reviewed over decades. R's core algorithm implementations are carefully thought out and equipped with useful examples and superb literature references.

How R handles rank deficiency in a model matrix

R's core glm.fit function handles rank-deficient model matrices by selecting a subset of columns that are linearly independent. Model coefficients corresponding to model matrix columns that have been omitted from the model are marked missing (NA). That's a good convention because it clearly alerts users to rank-deficiency in their models.

Column subset selection is achieved by a pivoting strategy defined in R's internal dqrdc2 Fortran subroutine. The strategy applies a Householder QR method with column pivoting to the weighted model matrix $WA$ in each step of the IRLS iteration. The strategy proceeds columnwise and checks if the norm of the subdiagonal of a QR-transformed column falls below a tolerance factor times the norm of the subdiagonal of the corresponding original model matrix column. If the transformed sub-column norm is too small then the column is pivoted to the right-most edge of the matrix and omitted from the model. The tolerance is min(1e-7, epsilon/1000), where epsilon is defined by the control argument to the glm.fit function (with a default value of epsilon=1e-8).

A consequence of the sequential nature of R's column subset selection strategy is that model results depend on the ordering of model matrix columns, in addition to possible pivoting choices. In general, up to pivoting, model matrix columns to the left are preferred for inclusion. Let's investigate with an illuminating example:

# Create a rank-deficient matrix A (rank 2) and a test observation vector b:
# (In addition to low rank, observe that the 2nd column is close to the 1st column.)

  options(digits=8)
  b = c(1,2,3)
( A = matrix(c(1,1,1, 1+1e-7,1,1, 1,0,0), ncol=3) )

     [,1]      [,2] [,3]
[1,]    1 1.0000001    1
[2,]    1 1.0000000    0
[3,]    1 1.0000000    0

( glm.fit(A, b)$coef )
[1]  15000002 -15000000        NA

# Now permute the columns of A and solve again (very different results!):

( A_permuted = A[, c(1,3,2)] )

     [,1] [,2]      [,3]
[1,]    1    1 1.0000001
[2,]    1    0 1.0000000
[3,]    1    0 1.0000000

( glm.fit(A_permuted, b)$coef )
[1]  2.5 -1.5   NA


# We can also see the effect of the pivoting strategy column selection on this
# problem by setting the tolerance to its maximum (setting epsilon large
# results in the maximum tolerance of 1e-7):

( glm.fit(A, b, control=list(epsilon=1))$coef )
[1]  2.5   NA -1.5 
Note that when the NA values are replaced by zero all the solutions found in the previous example are perfectly reasonable least squares solutions—it's just that the problem itself is not well-defined.

R users should be aware however that R makes a choice of particular solution based on the ordering of the model matrix columns. The appearance of NA in the solution alerts users that the model matrix is rank deficient.

In our opinion, when users see a result like this they should step back and think about the problem carefully. We recommend a systematic variable selection process like the elastic net [13]. Another reasonable alternative adds constraints that make the solution unique, for example requiring that the least squares solution be of minimal norm. That alternative is available in the SVD-based routine outlined later in this section.

How R handles zero and infinite diagonal entries in the weight matrix

When the diagonal weight matrix $W$ contains only positive diagonal entries then the vector inner product given by \[ v^TWv, \qquad v\in\mathbb{R}^m \] defines a vector norm. However, if the matrix has one or more zero values along its diagonal it only defines a semi-norm and could lead to ambiguity in the solution of the weighted least squares problem. We encounter similarly bad numerical problems when the matrix has one or more infinite values along its diagonal (arising from zero-variance variables for example).

Note that we've seen that at least in the logistic regression case, the diagonal elements of the weight matrix $W$ cannot be exactly zero. But they could fall to zero for other distribution families.

R simply omits rows in the model matrix (data observations) corresponding to zero or infinite weights. That approach can also be applied to any of the decomposition methods discussed so far.
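
A small sketch of that row-dropping idea applied to a single weighted least squares solve (W is a vector of diagonal weights; this is our own illustration, not R's code):

# Drop observations with zero or non-finite weights before solving A'WA x = A'Wz.
wls_drop = function(A, z, W)
{
  good = is.finite(W) & (W > 0)
  Ag   = A[good, , drop=FALSE]
  solve(crossprod(Ag, W[good] * Ag), crossprod(Ag, W[good] * z[good]))
}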

We've noted in previous sections that a large ratio of maximum value to minimum value along the diagonal of the weight matrix defines so-called stiff problems. When $W$ is a diagonal matrix (covering most applications of GLM) this is simply equivalent to ill-conditioning. In general, however, even reasonably well-conditioned problems may exhibit stiffness and in such cases the Householder QR decomposition may be unstable without full pivoting. Several alternatives for handling stiff problems are possible: We could check for tiny diagonal entries in the weight matrix and error out or warn the user (perhaps a warning is a good idea in any case!). Or, we could use Householder QR with full-pivoting or a Givens-based QR decomposition, or an alternative decomposition strategy like the SVD shown above.

Too few observations

If the model matrix has fewer rows than columns, R terminates the modeling process with an error.

SVD Newton R implementation

We outline an R implementation of our favorite GLM solution method, the SVD Newton variant of O'Leary's QR Newton method discussed above. The implementation is equipped to handle the pitfalls discussed in the last section in numerically stable ways with a variety of user-controlled options, and it relies on standard BLAS and LAPACK routines. The method decomposes the model matrix at the start; compare to R's glm.fit function that computes a Householder QR decomposition of the weighted model matrix in each IRLS iteration.

The rank_deficiency parameter of the SVD Newton implementation below selects among several strategies for handling rank-deficient model matrices. We include options for producing minimum-norm solutions and an SVD subset selection method of Golub and his colleagues, outlined in greater detail here: http://bwlewis.github.io/GLM/svdss.html.

Aside from added machinery to handle rank-deficient model matrices and zero weights, the method follows the SVD Newton method outlined above.

# See http://bwlewis.github.io/GLM/svdss.html for the SVD subset selection
# method used here.

irls_svdnewton =
function(A, b, family=binomial, maxit=25, tol=1e-08, weights=rep(1,nrow(A)),
         rank_deficiency=c("select columns","minimum norm","error"))
{
  rank_deficiency = match.arg(rank_deficiency)
  m = nrow(A)
  n = ncol(A)
  select = 1:n
  zw = weights==0
  if(any(zw)) A[zw,]=0
  S = svd(A)
  tiny_singular_values = S$d/S$d[1] < tol
  k = sum(tiny_singular_values)
  if(k>0) 
  {
    if(rank_deficiency=="select columns")
    {
      warning("Numerically rank-deficient model matrix")
# NB This is a different selection method than R's default glm.fit uses.
# See https://bwlewis.github.io/GLM and https://bwlewis.github.io/GLM/svdss.html
      select = svdsubsel(A,n-k)
      S = svd(A[,select,drop=FALSE])
    } else if(rank_deficiency=="error")
    {
      stop("Near rank-deficient model matrix")
    }
  }
  t = rep(0,m)
  s = rep(0,length(select))
  good = weights > 0
  for(j in 1:maxit)
  {
    g       = family()$linkinv(t[good])
    varg    = family()$variance(g)
    if(any(is.na(varg))) stop("NAs in variance of the inverse link function")
    if(any(varg==0)) stop("Zero value in variance of the inverse link function")
    gprime  = family()$mu.eta(t[good])
    if(any(is.na(gprime))) stop("NAs in the inverse link function derivative")
    z       = rep(0,m)
    W       = rep(0,m)
    z[good] = t[good] + (b[good] - g) / gprime
    W[good] = weights[good] * as.vector(gprime^2 / varg)

    good   = W > .Machine$double.eps*2
    if(sum(good)<m) warning("Tiny weights encountered")
    s_old   = s
    C   = chol(crossprod(S$u[good,,drop=FALSE], W[good]*S$u[good,,drop=FALSE]))
    s   = forwardsolve(t(C), crossprod(S$u[good,,drop=FALSE],W[good]*z[good]))
    s   = backsolve(C,s)
    t   = rep(0,m)
    t[good] = S$u[good,,drop=FALSE] %*% s
    if(sqrt(crossprod(s - s_old)) < tol) break
  }
  x = rep(NA, n)
  if(rank_deficiency=="minimum norm") S$d[tiny_singular_values] = Inf
  x[select] = S$v %*% ((1/S$d) * crossprod(S$u[good,],t[good]))
  list(coefficients=x,iterations=j)
}
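
As a quick plausibility check along the lines of the earlier comparisons (output omitted; the model matrix here has full rank, so the rank-deficiency machinery is not exercised):

data("Contraception", package="mlmRev")

R_GLM = glm(formula = use ~ age + I(age^2) + urban + livch,
            family = binomial, x=TRUE, data=Contraception)

# Model estimated by IRLS/SVD Newton; coefficients should match coef(R_GLM)
# to displayed accuracy, as in the earlier examples:
svdn = irls_svdnewton(R_GLM$x, R_GLM$y, family=binomial)
print(data.frame(R_GLM=coef(R_GLM), svd_newton=coef(svdn)))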

Sparse model matrices

Sparse model matrices arise in many practical problems, in particular when categorical variables are encoded by a unit basis to form so-called treatment contrasts. (The categories are contrasted against a baseline category.) We would like to take advantage of sparsity in the model matrix when it occurs, both to reduce its storage requirements and to improve performance of the IRLS method. This is particularly important for large problems.

The QR- and SVD-based IRLS methods described above destroy sparsity in the model matrix by replacing it with dense factors $Q$ and $U$, respectively. Although high-performance BLAS and LAPACK implementations are available for each method, neither takes any advantage of sparse model matrices.

In practice we're likely to encounter problems with a mixture of continuous and categorical variables that lead to model matrices with some dense columns and some sparse columns. For example, the model matrix from the mlmRev package used in the previous examples has two columns with real-valued (dense) data and five columns containing only ones and zeros (sparse).

Let's set aside numerical stability concerns for a moment and investigate how we might implement our original simple IRLS method to take advantage of sparse model matrices.

A large portion of the computational work in the IRLS method occurs in the matrix product $A^TWA$. We want to compute this product as efficiently as possible. That means using fast optimized BLAS methods for the dense part, and fast sparse matrix multiplication methods for the sparse part, suggesting that we permute the columns of the model matrix and split it into dense and sparse parts. See the sparse_n_dense.R file available in this project's Github repository for an example microbenchmark and related numerical experiments that investigate this.
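
For instance, here is one way we might split the model matrix from the earlier example into dense and sparse parts (a sketch; the split-by-fraction-of-zeros rule is our own arbitrary choice, and sparse_n_dense.R uses randomly generated data instead):

library(Matrix)
data("Contraception", package="mlmRev")
A = model.matrix(use ~ age + I(age^2) + urban + livch, data=Contraception)

# Treat mostly-zero columns as sparse, the rest (intercept, age, age^2) as dense:
sparse_cols = colMeans(A == 0) > 0.5
A_dense  = A[, !sparse_cols, drop=FALSE]
A_sparse = Matrix(A[, sparse_cols, drop=FALSE], sparse=TRUE)

# A'WA can now be assembled from a dense BLAS crossprod and sparse products
# (see the sp_wt_cross helper function below).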

The performance change realized from splitting the matrix product into sparse and dense components compared to the usual dense computation is sensitive to many factors, including the model matrix size and ratio of sparse to dense columns, the CPU architecture and number of cores, and BLAS library used. We list some results computed on our workstation using the sparse_n_dense.R experiment below for reference (don't take the results too seriously, instead experiment on your own!).

Our test workstation consisted of one AMD A10-7850K CPU with 4 physical cores and 16 GB DDR3 RAM, running Ubuntu 12.04 and R version 3.0.2. We could only get Ubuntu's default libopenblas library to use two cores, even after trying many suggested environment variable settings. But the AMD ACML BLAS library we used was version 5.3.1 with fused multiply/add and multicore support—probably the highest performance BLAS library available for this CPU. The test matrix consisted of 100000 rows and 1000 columns, 100 of them dense and 900 of them sparse with about 1% fill-in, all randomly generated. See the sparse_n_dense.R file for details.

BLAS library         Dense crossprod time (s)   Split crossprod time (s)
ACML FMA4_MP BLAS    3.0                        1.6
OpenBLAS             4.7                        1.6
Remember that this experiment is very sensitive to many factors. We encourage you to experiment on your own.

Numerical stability and sparse problems

We've seen one way to adapt our basic IRLS method to take advantage of sparse model matrices for performance, but what about numerical stability? Although we can't match the numerical properties of the QR- or SVD-based methods, all is not lost. We can apply a rank-revealing Cholesky decomposition to the product $A^TWA$ inside each IRLS iteration, for example using the LAPACK dpstrf routine. That Cholesky decomposition uses column pivoting and can detect rank deficiency at each step of the algorithm. That way we can at least detect possible problems during the iteration (although we are still subject to loss of numerical accuracy from very ill-conditioned or stiff problems). The LAPACK dpstrf routine is directly available from R's chol function.
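
Here's a tiny sketch of the rank check (chol with pivot=TRUE uses LAPACK's dpstrf for dense matrices):

# A 3 x 3 positive semi-definite matrix of rank 2:
X = matrix(c(1, 0, 1,
             0, 1, 1), nrow=2, byrow=TRUE)
M = crossprod(X)
C = chol(M, pivot=TRUE)     # R may warn about rank deficiency here
attr(C, "rank")             # reports 2
attr(C, "pivot")            # the column permutation used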

Algorithm IRLS (split dense/sparse model matrix variant)

Let's put this long-winded sparse discussion together into a practical algorithm. Note the use of R's chol function pivoting option for rank detection below. The dense matrix partition may be an ordinary dense matrix (or a dgeMatrix), and the sparse partition should use a sparse matrix class from the Matrix package such as dgCMatrix. See the compare.R example script in this project for an example use of the following functions.

# Example IRLS implementation that can take advantage of sparse model matrices.
# Here we assume that the model matrix A is already permuted and partitioned
# into A = [A_dense, A_sparse] dense and sparse columns.  The response vector b
# is also assumed to be permuted conformably with the matrix partitioning.

irls_sparse =
function(A_dense, A_sparse, b, family=binomial, maxit=25, tol=1e-08)
{
  nd = ncol(A_dense)
  ns = ncol(A_sparse)
  n  = nd + ns
  x = rep(0, n)
  for(j in 1:maxit)
  {
    eta   = as.vector(A_dense %*% x[1:nd] + A_sparse %*% x[(nd+1):n])
    g     = family()$linkinv(eta)
    gprime = family()$mu.eta(eta)
    z     = eta + (b - g) / gprime
    W     = as.vector(gprime^2 / family()$variance(g))
    xold  = x
    ATWA  = sp_wt_cross(A_dense,A_sparse,W)
    wz    = W*z
    ATWz  = c(as.vector(crossprod(A_dense, wz)), as.vector(crossprod(A_sparse, wz)))

    C   = chol(ATWA, pivot=TRUE)
    if(attr(C,"rank") < ncol(ATWA)) stop("Rank-deficiency detected.")
    p   = attr(C, "pivot")
    s   = forwardsolve(t(C), ATWz[p])
    x   = backsolve(C, s)[order(p)]   # invert the pivot permutation

    if(sqrt(crossprod(x-xold)) < tol) break
  }
  list(coefficients=x,iterations=j)
}

We use the following helper function that computes the sparse weighted matrix cross product:

# Sparse weighted cross product helper function
# Input: Dense Matrix A_dense, sparse Matrix A_sparse, weights W,
# where A = [A_dense, A_sparse] and length(W) = nrow(A).
# Output: Dense representation of crossprod(A,W*A)
sp_wt_cross = function(A_dense, A_sparse, W)
{
  nd = ncol(A_dense)
  ns = ncol(A_sparse)
  n  = nd + ns
  ATWA = matrix(0, nrow=n, ncol=n)
  WA_dense  = W*A_dense
  WA_sparse = W*A_sparse
  ATWA[1:nd,1:nd] = as.matrix(crossprod(A_dense,WA_dense))
  ATWA[1:nd,(nd+1):n] = as.matrix(crossprod(A_dense,WA_sparse))
  ATWA[(nd+1):n,1:nd] = as.matrix(crossprod(A_sparse,WA_dense))
  ATWA[(nd+1):n,(nd+1):n] = as.matrix(crossprod(A_sparse,WA_sparse))
  ATWA
}
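
Putting the pieces together, a usage sketch on the Contraception data (reusing the dense/sparse column split from the earlier sketch; output omitted):

library(Matrix)
data("Contraception", package="mlmRev")
R_GLM = glm(use ~ age + I(age^2) + urban + livch, family=binomial,
            x=TRUE, data=Contraception)

sparse_cols = colMeans(R_GLM$x == 0) > 0.5
A_dense  = R_GLM$x[, !sparse_cols, drop=FALSE]
A_sparse = Matrix(R_GLM$x[, sparse_cols, drop=FALSE], sparse=TRUE)

fit = irls_sparse(A_dense, A_sparse, R_GLM$y, family=binomial)
# fit$coefficients are ordered [dense columns, sparse columns]; they should
# agree with coef(R_GLM) up to that reordering.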

Breaking large GLMs into a series of smaller problems

When problems have lots of observations (rows of the model matrix) it may not be possible to fit all the data in RAM. This section outlines strategies for breaking the GLM problem up into chunks, each one small enough to fit into RAM, incrementally building up a solution.

We don't discuss what to do when problems have so many variables that the model matrix cross product $A^TA$ is too big to work with in memory. In that case it's possible that the columns of $A$ contain lots of redundant information. Variable selection techniques like the lasso are generally indicated for such problems to reduce the number of variables.

We assume in this section and the next that although the model matrix $A$ itself may not fit into memory, its cross product $A^TA$ does. A dense double precision-valued matrix of size $10000 \times 10000$ consumes about 800 MB of RAM. That's close to a practical working limit for typical desktop computers (in May, 2014), taking into consideration data copies, etc. More powerful systems with lots of RAM might handle larger problems with up to 30000 columns in the model matrix or so.

The model coefficients are computed from a weighted matrix cross product in all of the methods shown so far; either from $A^TWA$ in the basic and sparse IRLS implementations, or from $Q^TWQ$ in the more numerically stable QR based implementation. Note however that if we can't load all of $A$ into memory, we are unlikely to be able to easily form its QR decomposition, at least without resorting to a distributed or out of core algorithm (we discuss both options later). Therefore we restrict our attention to the simple IRLS algorithm, even though the following discussion applies more generally.

It's easy to split up computation of $A^TWA$. For example, let \[ A = \left[ \begin{array}{c} A_1\\ A_2 \end{array}\right],\qquad W = \left[ \begin{array}{cc} W_1 & 0\\ 0 & W_2 \end{array}\right], \] where the subscripts on $A$ indicates a disjoint row partitioning and on $W$ the corresponding diagonal blocks associated with the rows in the partitions. Then, \[ A^T W A = A_1^T W_1 A_1 + A_2^T W_2 A_2. \] Each partial product on the right-hand side above is of the size of the (small) output cross product and only involves a subset of the rows of $A$ and $W$. That observation leads to a simple incremental algorithm that only loads a portion of the rows of the model matrix into memory at a time.
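
A quick numerical check of that identity (simulated data):

# The weighted cross product accumulates over a disjoint row partition.
set.seed(1)
A = matrix(rnorm(60), 10, 6)
w = runif(10)                      # diagonal of W
i = 1:4; j = 5:10                  # two row chunks
full  = crossprod(A, w * A)
parts = crossprod(A[i, ], w[i] * A[i, ]) + crossprod(A[j, ], w[j] * A[j, ])
range(full - parts)                # differences near machine precision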

The biglm package by Thomas Lumley[6] proceeds along these lines to implement a bounded-memory GLM solver. We can similarly modify any of our earlier implementations. The example below adapts the basic IRLS method using the simple data file iterator function closure shown next. The iterator steps through a delimited text file nrows rows at a time, returning NULL at the end of the file. Use the init=TRUE argument to initialize or reset the iterator (see the example below). The iterator assumes that the model matrix data are saved without row or column headers.

Basic file iterator function

iterator = function(filename, nrows=100, sep=",")
{
  function(init=FALSE)
  {
    if(init)
    {
      f <<- file(filename)
      open(f)
      return(NULL)
    }
    tryCatch(
      as.matrix(read.table(f, sep=sep, nrows=nrows)),
      error=function(e)
      {
        close(f)
        NULL
      })
  }
}
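
For example, the iterator can be driven like this (a sketch assuming a headerless comma-separated file data.csv such as the one written in the example further below):

chunk = iterator("data.csv", nrows=500)   # create the iterator closure
chunk(init=TRUE)                          # open the file / reset the iterator
while(!is.null(A <- chunk()))             # read up to 500 rows at a time
{
  cat("read", nrow(A), "rows\n")
}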

For clarity of presentation we leave out the complication of partitioning sparse and dense columns below (using only dense matrix arithmetic). The working memory of the method is bounded by reading at most chunksize rows of the model matrix at a time.

Algorithm IRLS (incremental variant)

irls_incremental =
function(filename, chunksize, b, family=binomial, maxit=25, tol=1e-08)
{
  x     = NULL
  chunk = iterator(filename, nrows=chunksize) # a basic data file iterator
  for(j in 1:maxit)
  {
    k = 1                                     # Track the rows
    chunk(init=TRUE)                          # initialize the iterator
    A     = chunk()                           # get first chunk of model matrix
# Initialize first time through (after ascertaining ncol(A)):
    if(is.null(x)) x = rep(0,ncol(A))
    ATWA = matrix(0,ncol(A),ncol(A))
    ATWz = rep(0,ncol(A))
    while(!is.null(A))                        # iterate
    {
      eta    = A %*% x
      g      = family()$linkinv(eta)
      gprime = family()$mu.eta(eta)
      z      = eta + (b[k:(k+nrow(A)-1)] - g) / gprime
      k      = k + nrow(A)
      W      = as.vector(gprime^2 / family()$variance(g))
      ATWz   = ATWz + crossprod(A,W*z)
      ATWA   = ATWA + crossprod(A,W*A)
      A      = chunk()    # Next chunk
    }
    xold  = x
    C     = chol(ATWA, pivot=TRUE)
    if(attr(C,"rank") < ncol(C)) stop("Rank-deficiency detected.")
    p     = attr(C, "pivot")
    x     = backsolve(C, forwardsolve(t(C), ATWz[p]))[order(p)]  # invert the pivot permutation
    if(sqrt(crossprod(x-xold)) < tol) break
  }
  list(coefficients=x,iterations=j)
}

Let's verify the plausibility of this by recomputing our earlier test example with the incremental method, first saving the model matrix to a data file to iterate over:

data("Contraception",package="mlmRev")

# Model estimated with R's glm function, returning model matrix and response
# in $x and $y, respectively:
R_GLM = glm(formula = use ~ age + I(age^2) + urban + livch,
            family = binomial, x=TRUE, data=Contraception)

# Save the model matrix to a file:
write.table(R_GLM$x, file="data.csv", sep=",", col.names=FALSE, row.names=FALSE)

# Model estimated by our incremental method, 500 rows at a time:
inc = irls_incremental("data.csv", chunksize=500, b=R_GLM$y, family=binomial)

print(data.frame(R_GLM=coef(R_GLM), incremental=coef(inc)))

                   R_GLM  incremental
(Intercept) -0.949952124 -0.949952124
age          0.004583726  0.004583726
I(age^2)    -0.004286455 -0.004286455
urbanY       0.768097459  0.768097459
livch1       0.783112821  0.783112821
livch2       0.854904050  0.854904050
livch3+      0.806025052  0.806025052

Agreement with R again, at least to displayed accuracy for our test problem.

Distributed parallel computation of IRLS I

WRITE ME!

References

  1. Anda, A. and Park, H., Self-scaling fast rotations for stiff least squares problems, Lin. Alg. Appl., 234, 1996, pp. 137-162.
  2. Björck, Å., Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996.
  3. Bates, D., http://www.stat.wisc.edu/courses/st849-bates/lectures/GLMH.pdf.
  4. Gauss, C. F., Theoria combinationis observationum erroribus minimis obnoxiae, Pars prior, 1863 (1st written 1821).
  5. Hastie, T. J. and Pregibon, D., Generalized linear models, Chapter 6 of Statistical Models in S, eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole, 1992.
  6. Lumley, T, http://cran.r-project.org/web/packages/biglm.
  7. Markoff, A., Wahrscheinlichkeitsrechnung, Leipzig, 1912.
  8. McCullagh P. and Nelder, J. A., Generalized Linear Models, Chapman and Hall, London 1989.
  9. O'Leary, D., Robust regression computation using iteratively reweighted least squares, SIAM J. Matrix Anal. Appl., Vol. 11, No. 3, 1990, pp. 466-480.
  10. Paige, C. C., Fast numerically stable computations for generalized least squares problems, SIAM J. Numer. Anal., 16, 1979, pp. 165-171.
  11. The R project http://www.r-project.org.
  12. Revolution Analytics http://revolutionanalytics.com.
  13. Zou, H. and Hastie, T., Regularization and variable selection via the elastic net, J. Royal Statistical Society, Series B, 2005, pp. 301-320.

Copyright

Copyright (C) 2014 Michael Kane and Bryan W. Lewis

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the Github project files.