View the Project on GitHub bwlewis/GLM

Mike Kane and Bryan W. LewisGeneralized linear models (GLMs) are indispensable tools in the data science toolbox. They are applicable to many real-world problems involving continuous, yes/no, count and survival data (and more). The models themselves are intuitive and can be used for inference and prediction. A few very high quality free and open source software implementations are available (in particular within R[11]), as are several first-rate commercial ones (Revolution Analytics[12], SAS, Stata).

Despite the wide applicability of GLMs and the availability of high-quality reference implementations, we've found it hard to find good high-performance free and open source software implementations geared to solving large problems. Moreover, we've had trouble finding succinct references that deal with core implementation details that could help us build our own high-performance implementations.

This note grew out of our own desire to better understand the numerics of generalized linear models. We highlight aspects of GLM implementations that we find particularly interesting. We present some reference implementations stripped down to illuminate core ideas; often with just a few lines of code. Our focus on numerics really gives short shrift to the statistical motivation behind GLMs--in particular the IRLS method comes out of nowhere below. We're working on a new introductory section that very briefly summarizes the derivation of that method and ties it in to our main investigation of its numerics better. Will hopefully be done this summer, along with finishing up details on distributed implementations. --Bryan and Mike

We develop our ideas in our favorite language, R, but they are easily adapted to other languages. Python and Boost/C++ are particularly well-equipped for good GLM implementations in our opinion.

This is *not* a formal introduction to GLMs. We discuss
a few key implementation ideas, focusing on numerics, performance and
scalability. We refer the reader to the references [3,5,8,10] for complete
introductions to linear models.

These are working notes. We'll continue to tweak and revise them.

We begin by defining what we mean when we say "linear model" before moving on to GLMs. Our presentation and notation generally follow Björck[2]. Let $b\in\mathbb{R}^m, b\ne 0$ be a vector of measurements (called a response vector) and let $A\in\mathbb{R}^{m\times n}$, $m\ge n$, $\mathrm{rank}(A)=n$, be a matrix formed from data observations (called a model matrix). Then \[ Ax = b + e \] is what we mean when we say "linear model," where $x\in\mathbb{R}^n$ is a vector of model coefficients and the entries of the residual error term $e\in\mathbb{R}^m$ are independent and identically distributed random variables. Given a vector $x$ of model coefficients, we call the product $Ax$ a predictor of the response vector $b$. We're leaving important definitions of "random variable" and "independent and identically distributed" to the references.

Given a model matrix $A$ and response vector $b$, there are many ways we might go about solving for a vector of model coefficients $x$. We describe a particularly natural one next.

Relaxing conditions on linear models by allowing the entries of $e$ to be correlated or to be distributed with differing variances gives us generalized linear models. We need a few technical definitions to shore up this idea.

Let $\epsilon$ be a scalar-valued random variable with distribution function $F(\epsilon)$. The expected value $\mu$ and the variance $\sigma^2$ of $\epsilon$ are defined as: \[ E(\epsilon) = \mu = \int_{-\infty}^{\infty} \epsilon dF(\epsilon), \qquad \sigma^2 = \int_{-\infty}^{\infty}(\epsilon-\mu)^2dF(\epsilon). \] When $e$ is a vector of random variables, we write $E(e)$ to indicate the expected value of each entry, and similarly we define the expected value of a matrix of random variables to be the expected value of each entry. Define the variance-covariance matrix of a vector of random variables $e$ with mean vector $\mu$ to be \[ V(e) = E(ee^T) - \mu \mu^T. \]

The assumption of independence and identical distribution among the random variables in the error term $e$ of our earlier linear model required that $V(e) = \sigma^2 I$. General Gauss-Markoff linear models relax this requirement. When we say "generalized linear model" we mean a linear model of the form: \[ Ax = b + e,\qquad V(e)=\sigma^2W^{-1}, \] where $A$ is an $m\times n$ matrix of $\mathrm{rank}(A)=n$, $e$ is a vector of random variables with covariance matrix $\sigma^2 W^{-1}$ and $W$ is a symmetric positive definite matrix. See Björck[2, pp. 160-165] for a concise introduction to generalized linear models that we largely follow here. The solution of the generalized linear model can be formulated as \[ \min_x(Ax-b)^TW(Ax -b), \] with solution $x=(A^T W A)^{-1}A^T W b$. The numerical solution of model problems of this form was carefully analyzed by Paige[10]. We focus on simplified generalized models here and assume that $W$ is a diagonal matrix with positive entries along its diagonal.

Generalized linear models typically arise when we constrain the values that the predictor $Ax$ can take on and assume a specific distribution function for the entries of $e$. Constraints are important for many practical problems. For example:

- The response $b$ may represent binary 0/1 data and we want the predictor $Ax$ to at least be constrained to the interval $[0,1]$, in which case we might assume, for example, binomial distributions for the entries of $e$.
- $b$ might represent counts and we desire the predictor to be nonnegative since negative counts might not make any sense. In this case we might assume that the entries of $e$ follow Poisson distributions.

Consider the binary 0/1 data example cited above. Let $A\in\mathbb{R}^{m\times n}$ be a given model matrix with $\mathrm{rank}(A)=n$, and let $b\in\mathbb{R}^m$, $b\ne 0$, be a given response vector. The generalized linear model problem is to compute an $x$ that satisfies: \[ Ax = b + e,\qquad \mathrm{such\, that}\,\, Ax\in(0,1). \] We can formulate a related unconstrained problem that solves for $x$ in \[ g(Ax) = b + e, \] where $g$ is the logistic function applied elementwise to $Ax$. The problem is no longer constrained because $g$ maps any value to the interval $(0,1)$. But it's no longer linear either!

The choice of the nature of the error distribution defines the norm-weighting matrix $W$ discussed in the previous section. If we assume that the entries of $e$ consist of uncorrelated random variables with identical variance, then $W=\sigma^{-2}I$ and the problem reduces to simple nonlinear least squares: \[ \min_x ||g(Ax) - b||^2. \] That problem can be solved in R by several optimization methods, including Broyden's quasi-Newton method as follows:

# Input: A is an m by n model matrix, b a response vector of length m, # g is the inverse link function. # Output: The model coefficients x. nlls = function(A,b,g) { f = function(x) crossprod(b - g(A %*% x))[] optim(rep(0,ncol(A)), fn=f, method="BFGS")$par }Alternatively see R's nls function. If the entries of $e$ consist of uncorrelated random variables but with different variances, then $W\ne I$ but $W$ is still a diagonal matrix. This is common to many generalized linear model problems and results in a weighted nonlinear least squares problem (compare to the generalized linear model definition in the previous section): \[ \min_x(g(Ax)-b)^TW(g(Ax) -b), \] typically solved by the iteratively reweighted least square method described next.

- Let $x_1=0$.
- For $j=1,2,\ldots,\mathrm{itmax}$, do:
- Let $\eta = Ax_j$.
- Let $z = \eta + \frac{b - g(\eta)}{g'(\eta)},\qquad$
- Let $W = \mathrm{diag}(g'(\eta)^2/\mathrm{var}(g(\eta)))$.
- Let $x_{j+1} = (A^T W A)^{-1} A^T W z$.
- Stop if $||x_{j+1} - x_j|| \lt \mathrm{tol}$.

It's remarkable how rapidly this simple algorithm typically converges.
Before we dive in to
details, let's look at one simple R implementation of this algorithm and
compare it to R's `glm` function. The minimalist GLM function below
(compare to R's `glm.fit` function)
takes as input a model matrix $A$, a response vector $b$, an R distribution family
object, and optional maximum number of iterations and tolerance values.
(This is the shortest GLM implementation we could come up with! But it's got
some numerical issues, so read on...)

irls = function(A, b, family=binomial, maxit=25, tol=1e-08) { x = rep(0,ncol(A)) for(j in 1:maxit) { eta = A %*% x g = family()$linkinv(eta) gprime = family()$mu.eta(eta) z = eta + (b - g) / gprime W = as.vector(gprime^2 / family()$variance(g)) xold = x x = solve(crossprod(A,W*A), crossprod(A,W*z), tol=2*.Machine$double.eps) if(sqrt(crossprod(x-xold)) < tol) break } list(coefficients=x,iterations=j) }

Let's compare our super-minimalist IRLS estimator to R's `glm` function.
We use a real-world data example from the "mlmRev" package, which you may
need to install with `install.packages("mlmRev")`. The example
is described in Doug Bates' beautifully concise notes on GLMs[3] (you
should read them!).

data("Contraception",package="mlmRev") # Model estimated with R's glm function, returning model matrix and response # in $x and $y, respectively: R_GLM = glm(formula = use ~ age + I(age^2) + urban + livch, family = binomial, x=TRUE, data=Contraception) # Model estimated with our radically stripped-down minimalist implementation: mini = irls(R_GLM$x, R_GLM$y, family=binomial) print(data.frame(R_GLM=coef(R_GLM), minimalist=coef(mini))) R_GLM minimalist (Intercept) -0.949952124 -0.949952124 age 0.004583726 0.004583726 I(age^2) -0.004286455 -0.004286455 urbanY 0.768097459 0.768097459 livch1 0.783112821 0.783112821 livch2 0.854904050 0.854904050 livch3+ 0.806025052 0.806025052The results agree to displayed accuracy. Our minimalist implementation of IRLS converged after five iterations in this example.

IRLS solves a generalized linear model in each iteration:
\[
(A^T W A)^{-1} A^T W z.
\]
When $W$ is symmetric positive definite
this is equivalent to the solution of the
weighted least squares problem:
\[
\min_x||W^{1/2}(Ax - b)||^2.
\]
Note that in the logistic regression case $g'(a) = g(a)(1-g(a))$
is *never* zero—and so the diagonal enties of $W$ are
never zero—but other link functions might
lead to degenerate cases with indefinite weighting matrices.
We'll discuss approaches to dealing with some exceptional cases like this
in the Edge cases section below.

Numerical difficulties can and do arise. What if
$g'(\eta) \approx 0$ at some point? Or similarly, what if
$\mathrm{var}(g(\eta))\approx 0$
or $W$ is such that $A^TWA$ becomes extremely ill-conditioned? A more
subtle problem occurs when the maximum ratio of any two entries along the
diagonal of $W$ is a big number. Such problems are called *stiff* and
bring their own set of numerical difficulties.

Björck[2, (pp. 165—171)] presents an excellent survey of solution
approaches for the weighted least squares problem. Direct solution of the
normal equations (as illustrated in our simple R implementation of the IRLS
algorithm above) is potentially **numerically unstable and not generally
advised**. The brief survey in Björck is really fascinating reading. His
cited references show us that even standard Householder QR decomposition can
give poor accuracy for stiff problems unless it's modified to include column
*and* row pivoting. It turns out that the perhaps the simplest numerically
stable implementation for solving the linear system within the IRLS method is
the Givens QR algorithm, shown by Anda and Park[1].

In practice, with judicious monitoring of the entries of $W$ (in particular the
stiffness $\max_j(W_{j,j})/\min_j(W_{j,j})$) and the quantities $g'(\eta)$ and
$\mathrm{var}(g(\eta))$, we can choose among the most stable and convenient
solution methods and simply warn the user when the algorithm gets into trouble
numerically. R achieves this in its own `glm.fit` implementation by
using a Householder QR decomposition of $WA$ with column pivoting, modified to
reveal the rank of the matrix at each step.
See Edge cases section below for a discussion of this topic.

When the link function is the identity function and the error distribution function of $e$ is Gaussian with mean zero and variance one, then the IRLS algorithm becomes:

- $x_1=0$
- $\eta = Ax_1 = 0$
- $z = \eta + \frac{b - g(\eta)}{g'(\eta)} = 0 + \frac{b - 0}{1} = b$
- $W = \mathrm{diag}(g'(\eta)^2/\mathrm{var}(g(\eta))) = \mathrm{diag}(1^2 /1)= I$
- $x_{2} = 0 + (A^T A)^{-1} A^T b$,

We already sketched a simple implementation of the IRLS algorithm above (see IRLS). Recall that that method can suffer from potential numerical instability. We consider variations of that method in this section that focus on numerical stability and performance.

Perhaps the most important performance considerations for a good GLM/IRLS implementation are:

- Take advantage of fast basic linear algebra subroutine (BLAS) and LAPACK libraries whenever possible.
- Take advantage of fast sparse matrix vector products.

This method was defined by O'Leary[9]. The idea is to compute a QR factorization of the model matrix $A$ once, possibly by a rank-revealing QR method to identify problematic model matrices at the start. The factorization is then used in the IRLS iteration.

- Let $t=0\cdot b$ (or possibly some other initial guess).
- Let $s=0\cdot x$.
- Compute $A = QR$.
- For $j=1,2,\ldots,\mathrm{itmax}$, do:
- Let $z = t + \frac{b - g(t)}{g'(t)},\qquad$
- Let $W = \mathrm{diag}(g'(t)^2/\mathrm{var}(g(t)))$.
- If $\min(\mathrm{diag}(W))$ is too small warn the user or see the Edge cases section below.
- Let $s_{old} = s$.
- Let $s = (Q^T W Q)^{-1}Q^T W z$ (solve with a Cholesky factorization of $Q^T W Q$.)
- Let $t = A(A^T W A)^{-1} A^T W z = Q s.$
- Exit for loop if $||s_{old} - s|| \lt \mathrm{tol}$.

- Solve $R x = Q^T t$ for $x$ (simple backsolve).

irls_qrnewton = function(A, b, family=binomial, maxit=25, tol=1e-08) { s = t = 0 QR = qr(A) Q = qr.Q(QR) R = qr.R(QR) for(j in 1:maxit) { g = family()$linkinv(t) gprime = family()$mu.eta(t) z = t + (b - g) / gprime W = as.vector(gprime^2 / family()$variance(g)) wmin = min(W) if(wmin < sqrt(.Machine$double.eps)) warning("Tiny weights encountered") s_old = s C = chol(crossprod(Q, W*Q)) s = forwardsolve(t(C), crossprod(Q,W*z)) s = backsolve(C,s) t = Q %*% s if(sqrt(crossprod(s - s_old)) < tol) break } x = backsolve(R, crossprod(Q,t)) list(coefficients=x,iterations=j) }

This is a pretty cool IRLS implementation. It only uses the orthonormal $Q$ matrix and the current set of weights in each step. That results in potentially much better numerical stability than our earlier simple implementation IRLS. Positive-definiteness for the Cholesky factorization is easily checked in each iteration by examining the diagonal entries of the weight matrix.

The computations rely on the QR and Cholesky factorizations and matrix multiplication, all available to R from high-performance tuned BLAS and LAPACK libraries like the Intel MKL, AMD ACML, OpenBlas (Goto), or ATLAS. That means that if you have one of those BLAS libraries installed and linked to R, then this method will take advantage of available CPU vectorized instrucions and multithreading and run fast.

R's native IRLS method embedded in its `glm.fit` function computes
a QR decomposition of the weighted matrix $WA$ in each iteration step.
O'Leary's QR Newton method shown above instead
only computes one QR factorization of $A$ at the start. It then monitors
possible numerical problems by checking the size of the smallest diagonal
element of the weight matrix in each step.

Let's at least verify the plausibility of this method by repeating an earlier example from Doug Bates[3]:

data("Contraception",package="mlmRev") # Model estimated with R's glm function, returning model matrix and response # in $x and $y, respectively: R_GLM = glm(formula = use ~ age + I(age^2) + urban + livch, family = binomial, x=TRUE, data=Contraception) # Model estimated by IRLS/QR Newton: iqrn = irls_qrnewton(R_GLM$x, R_GLM$y, family=binomial) print(data.frame(R_GLM=coef(R_GLM), qr_newton=coef(iqrn))) R_GLM qr_newton (Intercept) -0.949952124 -0.949952124 age 0.004583726 0.004583726 I(age^2) -0.004286455 -0.004286455 urbanY 0.768097459 0.768097459 livch1 0.783112821 0.783112821 livch2 0.854904050 0.854904050 livch3+ 0.806025052 0.806025052No surprises here for this modest-sized and well conditioned data set. The IRLS/QR Newton method produces coefficients that match those of R's native GLM method to displayed accuracy.

If you're really concerned about rank deficiency of the model matrix and you don't have a rank-revealing QR method available and you're willing to pay a few extra flops on the initial matrix decomposition, then the IRLS QR Newton variation can be easily adapted to an SVD-based method that definitively determines the rank of the model matrix.

- Let $t=0\cdot b$ (or possibly some other initial guess).
- Let $s=0\cdot x$.
- Compute the SVD $AV = U\Sigma,\,\,\,\, U^TU = I,\,\,\, V^TV = I, \,\,\,\Sigma=\mathrm{diag}(\sigma_1\ge\sigma_2\ge\cdots\sigma_n\ge 0)$.
- (If any $\sigma_j=0$ then $A$ is rank-deficient, stop and warn the user or take other action—see the Edge cases section below.)
- For $j=1,2,\ldots,\mathrm{itmax}$, do:
- Let $z = t + \frac{b - g(t)}{g'(t)},\qquad$
- Let $W = \mathrm{diag}(g'(t)^2/\mathrm{var}(g(t)))$.
- If $\min(\mathrm{diag}(W))$ is too small warn the user or see the Edge cases section below.
- Let $s_{old} = s$.
- Let $s = (U^T W U)^{-1}U^T W z$ (solve with a Cholesky factorization of $U^T W U$.)
- Let $t = U s.$
- Exit for loop if $||s_{old} - s|| \lt \mathrm{tol}$.

- Let $x = V\Sigma^{-1}U^T t$.

Similarly to the QR Newton method listed above, the SVD-based method only uses the orthonormal $U$ matrix and the current weighting matrix in each iteration. A side-benefit of the SVD is that the final solution for model coefficients no longer requires a backsolve.

This is Bryan's personal favorite IRLS implementation. It takes advantage of available fast numeric libraries and provides a straightforward path for dealing with rank-deficient problems. An R implementation of this algorithm is shown in the following section.

- The model matrix to be of full rank (and have at least as many rows as columns).
- The weight matrix $W$ to be a diagonal matrix with positive diagonal entries.

Column subset selection is achieved by a pivoting strategey defined in the R's
internal `dqrdc2` Fortran subroutine. The strategy applies a
Householder QR method with column pivoting to the weighted model matrix $WA$ in
each step of the IRLS iteration. The strategy proceeds columnwise and
checks if the norm of the subdiagonal of a QR-transformed column falls
below a tolerance factor times the norm of the subdiagonal of the corresponding
original model matrix column. If the transformed sub-column norm is too small
then the column is pivoted to the right-most edge of the matrix and omitted
from the model. The tolerance is `min(1e-7,
epsilon/1000)`, where epsilon is defined by the `control` argument
to the `glm.fit` function (with a default value of
`epsilon=1e-8`).

A consequence of the sequential nature of R's column subset selection strategy is
that model results *depend on the ordering of model matrix columns*, in
addition to possible pivoting choices. In general, up to pivoting, model
matrix columns to the left are preferred for inclusion. Let's investigate with
an illuminating example:

# Create a rank-deficient matrix A (rank 2) and a test observation vector b: # (In addition to low rank, observe that the 2nd column is close to the 1st column.) options(digits=8) b = c(1,2,3) ( A = matrix(c(1,1,1, 1+1e-7,1,1, 1,0,0), ncol=3) ) [,1] [,2] [,3] [1,] 1 1.0000001 1 [2,] 1 1.0000000 0 [3,] 1 1.0000000 0 ( glm.fit(A, b)$coef ) [1] 15000002 -15000000 NA # Now permute the columns of A and solve again (very different results!): ( A_permuted = A[, c(1,3,2)] ) [,1] [,2] [,3] [1,] 1 1 1.0000001 [2,] 1 0 1.0000000 [3,] 1 0 1.0000000 ( glm.fit(A_permuted, b)$coef ) [1] 2.5 -1.5 NA # We can also see the effect of the pivoting strategy column selection on this # problem by setting the tolerance to its maximum (setting epsilon large # results in the maximum tolerance of 1e-7): ( glm.fit(A, b, control=list(epsilon=1))$coef ) [1] 2.5 NA -1.5Note that when the

R users should be aware however that R makes a choice of particular solution
based on the ordering of the model matrix columns. The appearance
of `NA` in the solution alerts users that the model matrix is rank
deficient.

In our opinion, when users see a result like this they should step back and think about the problem carefully. We recommend a systematic variable selection process like the elastic net [12]. Another reasonable alternative approach adds additional constraints that make the problem convex—for example, requiring that the least squares solution be of minimal norm. That alternative is available in the SVD-based routine outlined later in this section.

Note that we've seen that at least in the logistic regression case, the diagonal elements of the weight matrix $W$ cannot be exactly zero. But they could fall to zero for other distribution families.

R simply omits rows in the model matrix (data obserations) corresponding to zero or infinite weights. That approach can also be applied to any of the decomposition methods discussed so far.

We've noted in previous sections that a large ratio of maximum value to minimum value along the diagonal of the weight matrix defines so-called stiff problems. When $W$ is a diagonal matrix (covering most applications of GLM) this is simply equivalent to ill-conditioning. In general, however, even reasonably well-conditioned problems may exhibit stiffness and in such cases the Householder QR decomposition may be unstable without full pivoting. Several alternatives for handling stiff problems are possible: We could check for tiny diagonal entries in the weight matrix and error out or warn the user (perhaps a warning is a good idea in any case!). Or, we could use Householder QR with full-pivoting or a Givens-based QR decomposition, or an alternative decomposition strategy like the SVD shown above.

Choose from among several strategies for handling rank-deficient model matrices
in the SVD Newton method below with the `rank_deficiency` function
parameter. We include options for producing minimum-norm solutions and an SVD
subset selection method of Golub and his colleagues, outlined in greater detail
here:
http://bwlewis.github.io/GLM/svdss.html.

Aside from added machinery to handle rank-deficient model matrices and zero weights, the method follows the SVD Newton method outlined above.

# See http://bwlewis.github.io/GLM/svdss.html for the SVD subset selection # method used here. irls_svdnewton = function(A, b, family=binomial, maxit=25, tol=1e-08, weights=rep(1,nrow(A)), rank_deficiency=c("select columns","minimum norm","error")) { rank_deficiency = match.arg(rank_deficiency) m = nrow(A) n = ncol(A) select = 1:n zw = weights==0 if(any(zw)) A[zw,]=0 S = svd(A) tiny_singular_values = S$d/S$d[1] < tol k = sum(tiny_singular_values) if(k>0) { if(rank_deficiency=="select columns") { warning("Numerically rank-deficient model matrix") # NB This is a different selection method than R's default glm.fit uses. # See https://bwlewis.github.io/GLM and https://bwlewis/github.io/GLM/svdss.html select = svdsubsel(A,n-k) S = svd(A[,select,drop=FALSE]) } else if(rank_deficiency=="error") { stop("Near rank-deficient model matrix") } } t = rep(0,m) s = rep(0,length(select)) good = weights > 0 for(j in 1:maxit) { g = family()$linkinv(t[good]) varg = family()$variance(g) if(any(is.na(varg))) stop("NAs in variance of the inverse link function") if(any(varg==0)) stop("Zero value in variance of the inverse link function") gprime = family()$mu.eta(t[good]) if(any(is.na(gprime))) stop("NAs in the inverse link function derivative") z = rep(0,m) W = rep(0,m) z[good] = t[good] + (b[good] - g) / gprime W[good] = weights[good] * as.vector(gprime^2 / varg) good = W > .Machine$double.eps*2 if(sum(good)<m) warning("Tiny weights encountered") s_old = s C = chol(crossprod(S$u[good,,drop=FALSE], W[good]*S$u[good,,drop=FALSE])) s = forwardsolve(t(C), crossprod(S$u[good,,drop=FALSE],W[good]*z[good])) s = backsolve(C,s) t = rep(0,m) t[good] = S$u[good,,drop=FALSE] %*% s if(sqrt(crossprod(s - s_old)) < tol) break } x = rep(NA, n) if(rank_deficiency=="minimum norm") S$d[tiny_singular_values] = Inf x[select] = S$v %*% ((1/S$d) * crossprod(S$u[good,],t[good])) list(coefficients=x,iterations=j) }

Sparse model matrices arise in many practical problems, in particular when categorical variables are encoded by a unit basis to form so-called treatment contrasts. (The categories are contrasted against a baseline category.) We would like to take advantage of sparsity in the model matrix when it occurs, both to reduce its storage requirements and to improve performance of the IRLS method. This is particularly important for large problems.

The QR- and SVD-based IRLS methods described above destroy sparsity in the model matrix by replacing it with dense factors $Q$ and $U$, respectively. Although high-performance BLAS and LAPACK implementations are available for each method, neither takes any advantage of sparse model matrices.

In practice we're likely to encounter problems with a mixture of continuous and categorial variables that lead to model matrices with some dense columns and some sparse columns. For example, the model matrix from the mlmRev package used in the previous examples has two columns with real-valued (dense) data and five columns containing only ones and zeros (sparse).

Let's set aside numerical stability concerns for a moment and investigate how we might implement our original simple IRLS method to take advantage of sparse model matrices.

A large portion of the computational work in the IRLS
method occurs in the matrix product $A^TWA$. We want to compute this product as
efficiently as possible. That means using fast optimized BLAS methods for the
dense part, and fast sparse matrix multiplication methods for the sparse part,
suggesting that we permute the columns of the model matrix and split it into
dense and sparse parts. See the `sparse_n_dense.R` file available in
this project's Github repository for an example microbenchmark and related
numerical experiments that investigates this.

The performance change realized from splitting the matrix product into sparse
and dense components compared to the usual dense computation is sensitive to
many factors, including the model matrix size and ratio of sparse to dense
columns, the CPU architecture and number of cores, and BLAS library used. We
list some results computed on our workstation using the
`sparse_n_dense.R` experiment below for reference (don't take the
results too seriously, instead experiment on your own!).

Our test workstation consisted of one AMD A10-7850K CPU with 4 physical cores
and 16 GB DDR3 RAM, running Ubuntu 12.04 and R version 3.0.2. We could only get
Ubuntu's default `libopenblas` library to use two cores, even after
trying many suggested environment variable settings. But the AMD ACML BLAS
library we used was version 5.3.1 with fused multiply/add and multicore
support—probably the highest performance BLAS library available for
this CPU. The test matrix consisted of 100000 rows and 1000 columns, 100 of
them dense and 900 of them sparse with about 1% fill-in, all randomly
generated. See the `sparse_n_dense.R` file for details.

BLAS library | Dense crossprod time in seconds | Split crossprod time in seconds |
---|---|---|

ACML FMA4_MP BLAS | 3.0 | 1.6 |

OpenBLAS | 4.7 | 1.6 |

We've seen one way to adapt our basic IRLS method to take
advantage of sparse model matrices for performance, but what about numerical
stability? Although we can't match the numerical properties of the QR- or
SVD-based methods, all is not lost. We can apply a rank-revealing Cholesky
decomposition to the product $A^TWA$ inside each IRLS iteration, for example
using the LAPACK `dpstrf` routine. That Cholesky decomposition uses
column pivoting and can detect rank deficiency at each step of the algorithm.
That way we can at least detect possible problems during the iteration
(although we are still subject to loss of numerical accuracy from very
ill-conditioned or stiff problems). The LAPACK `dpstrf` routine is
directly available from R's `chol` function.

Let's put this long-winded sparse discussion together into a practical
algorithm. Note the use of R's `chol` function pivoting option for
rank detection below. The input dense and sparse matrix partitions should
belong to the R class `dgeMatrix` from the `Matrix` package. See
the `compare.R` example script in this project for an example use of the
following functions.

# Example IRLS implementation that can take advantage of sparse model matrices. # Here we assume that the model matrix A is already permuted and partitioned # into A = [A_dense, A_sparse] dense and sparse columns. The response vector b # is also assumed to be permuted conformably with the matrix partitioning. irls_sparse = function(A_dense, A_sparse, b, family=binomial, maxit=25, tol=1e-08) { nd = ncol(A_dense) ns = ncol(A_sparse) n = nd + ns x = rep(0, n) for(j in 1:maxit) { eta = as.vector(A_dense %*% x[1:nd] + A_sparse %*% x[(nd+1):n]) g = family()$linkinv(eta) gprime = family()$mu.eta(eta) z = eta + (b - g) / gprime W = as.vector(gprime^2 / family()$variance(g)) xold = x ATWA = sp_wt_cross(A_dense,A_sparse,W) wz = W*z ATWz = c(as.vector(crossprod(A_dense, wz)), as.vector(crossprod(A_sparse, wz))) C = chol(ATWA, pivot=TRUE) if(attr(C,"rank") < ncol(ATWA)) stop("Rank-deficiency detected.") p = attr(C, "pivot") s = forwardsolve(t(C), ATWz[p]) x = backsolve(C,s)[p] if(sqrt(crossprod(x-xold)) < tol) break } list(coefficients=x,iterations=j) }

We use the following helper function that computes the sparse weighted matrix cross product:

# Sparse weighted cross product helper function # Input: Dense Matrix A_dense, sparse Matrix A_sparse, weights W, # where A = [A_dense, A_sparse] and length W=ncol(A). # Output: Dense representation of crossprod(A,W*A) sp_wt_cross = function(A_dense, A_sparse, W) { nd = ncol(A_dense) ns = ncol(A_sparse) n = nd + ns ATWA = matrix(0, nrow=n, ncol=n) WA_dense = W*A_dense WA_sparse = W*A_sparse ATWA[1:nd,1:nd] = as.matrix(crossprod(A_dense,WA_dense)) ATWA[1:nd,(nd+1):n] = as.matrix(crossprod(A_dense,WA_sparse)) ATWA[(nd+1):n,1:nd] = as.matrix(crossprod(A_sparse,WA_dense)) ATWA[(nd+1):n,(nd+1):n] = as.matrix(crossprod(A_sparse,WA_sparse)) ATWA }

When problems have lots of observations (rows of the model matrix) it may not be possible to fit all the data in RAM memory. This section outlines strategies for breaking the GLM problem up into chunks, each one small enough to fit into RAM, incrementally building up a solution.

We don't discuss what to do when problems have so many variables that the model matrix cross product $A^TA$ is too big to work with in memory. In that case it's possible that the columns of $A$ contain lots of redundant information. Variable selection techniques like the lasso are generally indicated for such problems to reduce the number of variables.

We assume in this section and the next that although the model matrix $A$ itself may not fit into memory, its crossproduct $A^TA$ does. A dense double precision-valued matrix of size $10000 \times 10000$ consumes about 800 MB RAM. That's close to a practical working limit for typical desktop computers (in May, 2014), taking in to consideration data copies, etc. More powerful systems with lots of RAM might handle larger problems with up to 30000 coulmns in the model matrix or so.

The model coefficients are computed from a weighted matrix cross product in all of the methods shown so far; either from $A^TWA$ in the basic and sparse IRLS implementations, or from $Q^TWQ$ in the more numerically stable QR based implementation. Note however that if we can't load all of $A$ into memory, we are unlikely to be able to easily form its QR decomposition, at least without resorting to a distributed or out of core algorithm (we discuss both options later). Therefore we restrict our attention to the simple IRLS algorithm, even though the following discussion applies more generally.

It's easy to split up computation of $A^TWA$. For example, let \[ A = \left[ \begin{array}{c} A_1\\ A_2 \end{array}\right],\qquad W = \left[ \begin{array}{cc} W_1 & 0\\ 0 & W_2 \end{array}\right], \] where the subscripts on $A$ indicates a disjoint row partitioning and on $W$ the corresponding diagonal blocks associated with the rows in the partitions. Then, \[ A^T W A = A_1^T W_1 A_1 + A_2^T W_2 A_2. \] Each partial product on the right-hand side above is of the size of the (small) output cross product and only involves a subset of the rows of $A$ and $W$. That observation leads to a simple incremental algorithm that only loads a portion of the rows of the model matrix into memory at a time.

The biglm package by Thomas Lumley[6] proceeds along these lines to implement a
bounded-memory GLM solver. We can similarly modify any of our earlier listed
implementations. The example below adapts the basic
IRLS method using a simple data file iterator function
closure shown next. The function iterates by rows through a delimited text
file nrows at a time, returning NULL at the end. Use `init=TRUE`
argument to initialize/reset iterator (example below). The iterator assumes
that the model matrix data are saved without row or column headers.

iterator = function(filename, nrows=100, sep=",") { function(init=FALSE) { if(init) { f <<- file(filename) open(f) return(NULL) } tryCatch( as.matrix(read.table(f, sep=sep, nrows=nrows)), error=function(e) { close(f) NULL }) } }

We leave out the complication of partitioning of sparse and dense columns below
for clarity of presentation (using only dense matrix arithmetic).
You can bound the working memory of
the method to at most `chunksize` rows at a time.

irls_incremental = function(filename, chunksize, b, family=binomial, maxit=25, tol=1e-08) { x = NULL chunk = iterator(filename, nrows=chunksize) # a basic data file iterator for(j in 1:maxit) { k = 1 # Track the rows chunk(init=TRUE) # initialize the iterator A = chunk() # get first chunk of model matrix # Initialize first time through (after ascertaining ncol(A)): if(is.null(x)) x = rep(0,ncol(A)) ATWA = matrix(0,ncol(A),ncol(A)) ATWz = rep(0,ncol(A)) while(!is.null(A)) # iterate { eta = A %*% x g = family()$linkinv(eta) gprime = family()$mu.eta(eta) z = eta + (b[k:(k+nrow(A)-1)] - g) / gprime k = k + nrow(A) W = as.vector(gprime^2 / family()$variance(g)) ATWz = ATWz + crossprod(A,W*z) ATWA = ATWA + crossprod(A,W*A) A = chunk() # Next chunk } xold = x C = chol(ATWA, pivot=TRUE) if(attr(C,"rank") < ncol(C)) stop("Rank-deficiency detected.") p = attr(C, "pivot") x = backsolve(C,forwardsolve(t(C),ATWz[p]))[p] if(sqrt(crossprod(x-xold)) < tol) break } list(coefficients=x,iterations=j) }

Let's verify the plausibility of this by recomputing our earlier test example with the incremental method, first saving the model matrix to a data file to iterate over:

data("Contraception",package="mlmRev") # Model estimated with R's glm function, returning model matrix and response # in $x and $y, respectively: R_GLM = glm(formula = use ~ age + I(age^2) + urban + livch, family = binomial, x=TRUE, data=Contraception) # Save the model matrix to a file: write.table(R_GLM$x, file="data.csv", sep=",", col.names=FALSE, row.names=FALSE) # Model estimated by our incremental method, 500 rows at a time: inc = irls_incremental("data.csv", chunksize=500, b=R_GLM$y, family=binomial) print(data.frame(R_GLM=coef(R_GLM), incremental=coef(inc))) R_GLM incremental (Intercept) -0.949952124 -0.949952124 age 0.004583726 0.004583726 I(age^2) -0.004286455 -0.004286455 urbanY 0.768097459 0.768097459 livch1 0.783112821 0.783112821 livch2 0.854904050 0.854904050 livch3+ 0.806025052 0.806025052Agreement with R again, at least to displayed accuracy for our test problem.

- Anda, A. and Park, H., Self-scaling fast rotations for stiff least squares problems, Lin. Alg. Appl., 234, 1996, pp. 137-162.
- Björck, Å., Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996.
- Bates, D., http://www.stat.wisc.edu/courses/st849-bates/lectures/GLMH.pdf.
- Gauss, C. F., Theoria combinationis observationum erroribus minimis obnoxiae, Pars prior, 1863 (1st written 1821).
- Hastie, T. J. and Pregibon, D., Generalized linear models, Chapter 6 of Statistical Models in S, eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole, 1992.
- Lumley, T, http://cran.r-project.org/web/packages/biglm.
- Markoff, A., Wahrscheinlichheitsrechnug, Leipzig, 1912.
- McCullagh P. and Nelder, J. A., Generalized Linear Models, Chapman and Hall, London 1989.
- O'Leary, D., Robust regression computation using iteratively reweighted least squares, Siam J. Mat. Anal. Appl., Vol. 11 No. 3, 1990, pp. 466-480.
- Paige, C. C., Fast numerically stable computations for generalized least squares problems, Siam J. Num. Anal., 16, 1979, pp. 165-171.
- The R project http://www.r-project.org.
- Revolution Analytics http://revolutionanalytics.com.
- Zhou, H. and Hastie, T., Regularization and Variable Selection via the Elastic Net, J. Royal Statistical Society, B, 2005, pp. 301-320.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the Github project files.