Picking a few of the most linearly independent columns of a matrix is a hard problem in general. SVD subset selection is a simple heuristic method that picks a subset of k columns from a matrix that estimates the best-conditioned subset of columns of size k. The method works best when the matrix is rank-deficient and there is a clear indication of numerical rank (a gap in the singular values)—see the references [3,4] for more details. We find the method to work reasonably well with many ill-conditioned matrices. It does not do a great job of estimating the most linearly independent columns when the matrix is well-conditioned, but in that case the subset selection question is perhaps not the right one to ask about the data.
The method's R implementation shown below is short and sweet. It produces a set of column indices that estimate the k most linearly independent columns of the input matrix.
# svdsubsel # # Input m*p matrix A, m >= p. # Number of output columns k<=p. # Returns an index subset of columns of A that *estimates* the k most linearly # independent columns of A. svdsubsel <- function(A,k=ncol(A)) { S <- svd(scale(A,center=FALSE,scale=TRUE), k) n <- which(svd(A)$d < 2*.Machine$double.eps)[1] if(!is.na(n) && k>=n) { k <- n - 1 warning("k was reduced to match the rank of A") } Q <- qr( t(S$v[,1:k]) ,LAPACK=TRUE) sort(Q$pivot[1:k],decreasing=FALSE) }SVD subset selection works on a matrix in isolation, independently of other aspects of a larger problem that may only employ subset selection in one step, for example in generalized linear models (GLMs). Superior subset selection can be achieved by taking account of additional available information (information provided by the response vector, for example). In particular for the GLM case we recommend the glmnet[2] implementation of the elastic net method.
On the other hand, SVD subset selection's sole reliance on a data matrix can also be an advantage, for example in unsupervised machine learning applications.
We construct in the sample R program below an ill-conditioned 10 x 10 matrix that is numerically singular. We then use SVD subset selection to select three columns that we hope are as linearly independent as possible. We then compare that selection with all 120 possible three-column subsets that can be formed from the matrix. Our basis for comparison is the condition number of the three column submatrix computed by R's kappa function.
You can change the seed value to see the result for different random-valued matrices. Lower the exponent value to improve the matrix condition number (and reduce the effectiveness of the method).
# Construct a badly conditioned 10x10 matrix set.seed(5) # Change the seed for different examples... exponent <- 2 # Lower the exponent to improve the matrix condition number... U <- qr.Q(qr(matrix(rnorm(100),10))) V <- qr.Q(qr(matrix(rnorm(100),10))) d <- exp(-(1:10)^exponent) # In particular, note that the method does not work as well for well-conditioned # matrices (by making the exponent < 1 for example). A <- U %*% (t(V) * d) # Use SVD subset selection to pick three columns idx <- svdsubsel(A,3) # *All* possible 3-column subsets of A: all <- combn(10,3) # Which combinations were chosen by SVD subset selection? idx <- which(apply(all,2, function(x) all(x==idx))) # Finally, let's plot the logarithm of the condition numbers of all possible # 3-column subsets of A and highlight the ones chosen by SVD subset selection: condition_numbers <- apply(all, 2, function(i) log(kappa(A[,i]))) p <- order(condition_numbers) # All the condition numbers: plot(condition_numbers[p], xaxt='n', xlab='Subset (sorted by condition number of submatrix)', ylab='log(condition number)', main='Logarithm of condition numbers of all 3-column subsets of A') # Higlight the SVD subset selection's condition number: set <- which(p==idx) points(set,condition_numbers[p[set]], col=4,pch=20) l <- sprintf("SVD selection of columns %s.",paste(all[,idx],collapse=",")) legend("topleft",legend=l, fill=4, col=4, bty='n')The example plots the condition numbers of all possible three-column submatrices from lowest to highest, and highlights and lists the selection made by the SVD subset selection method. The output typically looks like:
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the Github project files.