The augmented implicitly restarted Lanczos bidiagonalization algorithm (IRLBA) finds a few approximate largest singular values and corresponding singular vectors of a sparse or dense matrix using a method of Baglama and Reichel.
J. Baglama and L. Reichel, SIAM J. Sci. Comput. (2005). paper14.pdf
It is a fast and memory-efficient way to compute a partial SVD, principal components, and some specialized partial eigenvalue decompositions. I introduced Baglama and Reichel’s irlba algorithm to the R world in a talk at the useR! conference in Dortmund way back in 2008.
The package provides the following functions (see help on each for details and examples).
irlba()
partial SVD functionssvd()
l1-penalized matrix decompoisition for sparse PCA (based on Shen and Huang’s algorithm)–see https://bwlewis.github.io/irlba/ssvd.html for more detailsprcomp_irlba()
principal components function similar to the prcomp
function in stats package for computing the first few principal components of large matricessvdr()
alternate partial SVD function based on randomized SVDpartial_eigen()
a very limited partial eigenvalue decomposition for symmetric matrices (see the RSpectra package for more comprehensive truncated eigenvalue decomposition); see also https://bwlewis.github.io/irlba/comparison.html for more notes on RSpectra.For your reading pleasure…
And, some interesting related recent work by Musco and colleagues:
This is a bugfix release.
prcomp_irlba()
discovered by Xiaojie Qiu, see https://github.com/bwlewis/irlba/issues/25, and other related problems reported in https://github.com/bwlewis/irlba/issues/32.ssvd()
found by Alex Poliakov.irlba()
bug associated with centering (PCA), see https://github.com/bwlewis/irlba/issues/21.irlba()
scaling to conform to scale
, see https://github.com/bwlewis/irlba/issues/22.prcomp_irlba()
from a suggestion by N. Benjamin Erichson, see https://github.com/bwlewis/irlba/issues/23.svdr()
convergence criterion.ssvd()
).library(irlba)
set.seed(1)
A <- matrix(rnorm(100), 10)
# Define a custom matrix multiplication function that scales the columns of A
# to have unit norm (cf the scale option).
# ------------------ the new way ------------------------------------------
# Simply using an S4 class
setClass("scaled_matrix", contains="matrix", slots=c(scale="numeric"))
setMethod("%*%", signature(x="scaled_matrix", y="numeric"), function(x ,y) x@.Data %*% (y / x@scale))
setMethod("%*%", signature(x="numeric", y="scaled_matrix"), function(x ,y) (x %*% y@.Data) / y@scale)
a <- new("scaled_matrix", A, scale=col_scale)
irlba(a, 3)$d
## [1] 1.820227 1.622988 1.067185
# ------------------ the old way ------------------------------------------
col_scale <- sqrt(apply(A, 2, crossprod))
mult <- function(x, y)
{
# check if x is a vector
if (is.vector(x))
{
return((x %*% y) / col_scale)
}
# else x is the matrix
x %*% (y / col_scale)
}
irlba(A, 3, mult=mult)$d
## [1] 1.820227 1.622988 1.067185
# ------------------ compare ----------------------------------------------
irlba(A, 3, scale=col_scale)$d
## [1] 1.820227 1.622988 1.067185
# Or,
svd(sweep(A, 2, col_scale, FUN=`/`))$d[1:3]
## [1] 1.820227 1.622988 1.067185
The package vignette (PDF): https://cran.r-project.org/web/packages/irlba/irlba.pdf.
This example uses a special one-sided basis option and a custom matrix product to compute the top three principal components of the entire 1000 Genomes Project variant data set (whole genome variants for 2,504 people): http://bwlewis.github.io/1000_genomes_examples/PCA_whole_genome.html. The example optionally works in parallel and finishes in only 8 minutes on a 4 node Linux cluster.
https://bwlewis.github.io/irlba/comparison.html compares the irlba package with the RSpectra package, a high-quality eigenvalue solver for R.
The newe tcor
algorithm
http://arxiv.org/abs/1512.07246
and R package
(https://github.com/bwlewis/tcor – not yet on CRAN) use irlba to compute
thresholded correlation matrices very quickly. This example computes the most
highly correlated gene expressions from the Cancer Genome Atlas RNASeq gene
expression data for breast cancer:
https://github.com/bwlewis/tcor/blob/master/vignettes/brca.Rmd.
Many more applications will appear here soon, check back!
The mult
argument is deprecated and will be removed in a future version. We
now recommend simply defining a custom class with a custom multiplcation
operator. The example below illustrates the old and new approaches.
library(irlba)
set.seed(1)
A <- matrix(rnorm(100), 10)
# ------------------ old way ----------------------------------------------
# A custom matrix multiplication function that scales the columns of A
# (cf the scale option). This function scales the columns of A to unit norm.
col_scale <- sqrt(apply(A, 2, crossprod))
mult <- function(x, y)
{
# check if x is a vector
if (is.vector(x))
{
return((x %*% y) / col_scale)
}
# else x is the matrix
x %*% (y / col_scale)
}
irlba(A, 3, mult=mult)$d
# Compare with:
irlba(A, 3, scale=col_scale)$d
# Compare with:
svd(sweep(A, 2, col_scale, FUN=`/`))$d[1:3]
# ------------------ new way ----------------------------------------------
setClass("scaled_matrix", contains="matrix", slots=c(scale="numeric"))
setMethod("%*%", signature(x="scaled_matrix", y="numeric"), function(x ,y) x@.Data %*% (y / x@scale))
setMethod("%*%", signature(x="numeric", y="scaled_matrix"), function(x ,y) (x %*% y@.Data) / y@scale)
a <- new("scaled_matrix", A, scale=col_scale)
irlba(a, 3)$d
We have learned that using R’s existing S4 system is simpler, easier, and more flexible than using custom arguments with idiosyncratic syntax and behavior. We’ve even used the new approach to implement distributed parallel matrix products for very large problems with amazingly little code.