Project and Conquer!



Bryan W Lewis
rstor.io
http://illposed.net




Point of view matters




Data Shadows

The purple and pink dots are at least this far apart.

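The shadow picture's point can be stated precisely: an orthogonal projection never increases distances, so the distance between two shadows is a lower bound on the distance between the original points. A minimal sketch (NumPy used here for illustration; the points, dimension, and subspace are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=10)        # two points in R^10
b = rng.normal(size=10)

# a random 2-D subspace; QR yields an orthonormal basis Q
Q, _ = np.linalg.qr(rng.normal(size=(10, 2)))

shadow_dist = np.linalg.norm(Q.T @ (a - b))   # distance between shadows
true_dist = np.linalg.norm(a - b)

# the original points are *at least* shadow_dist apart
assert shadow_dist <= true_dist
```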




The tcor algorithm: fast thresholded distances and correlations.


https://github.com/bwlewis/tcor


http://arxiv.org/abs/1512.07246

Example



Given a 1,000 × 20,522 matrix, find all pairs of columns with Pearson's correlation coefficient ≥ 0.99.

  • brute force needs > 200 GFlops
  • tcor solves exactly the same problem using about 1 GFlop¹ (in seconds on a laptop)
  • easily extends to distributed and arbitrarily large data
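The pruning idea behind tcor can be sketched in a few lines (a NumPy illustration of the projection bound, not the package's R implementation). After centering and scaling columns to unit norm, corr(a, b) ≥ t exactly when ||a − b|| ≤ √(2(1 − t)); projecting onto a few singular vectors only shrinks distances, so pairs whose projected distance already exceeds that threshold can be discarded without ever computing their correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, t = 100, 50, 0.99
X = rng.normal(size=(n, m))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # plant one correlated pair

# center and scale columns to unit norm: corr(a, b) = a.b,
# and ||a - b||^2 = 2 * (1 - corr(a, b))
Z = X - X.mean(axis=0)
Z /= np.linalg.norm(Z, axis=0)

# project onto the top p left singular vectors; projection shrinks distances
p = 5
U = np.linalg.svd(Z, full_matrices=False)[0][:, :p]
Y = U.T @ Z                                     # p x m projected columns

thr = np.sqrt(2 * (1 - t))
pairs = []
for i in range(m):
    d = np.linalg.norm(Y[:, i + 1:] - Y[:, [i]], axis=0)
    for j in np.nonzero(d <= thr)[0] + i + 1:
        if Z[:, i] @ Z[:, j] >= t:              # exact check on survivors only
            pairs.append((i, j))
```

Only the few surviving candidate pairs pay for an exact dot product, which is where the ~200x flop reduction comes from.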


¹For the data matrix described in the package vignette

Recall, point of view matters



What subspaces to project into?

What kinds of projections?

I prefer Krylov subspaces



\[ \mathrm{span}(\beta, X\beta, X^2\beta, \ldots, X^{k-1}\beta), \]


\[ \mathrm{span}(\beta, X^TX\beta, (X^TX)^2\beta, \ldots, (X^TX)^{k-1}\beta), \qquad \ldots \]
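For concreteness, a basis for the first subspace above can be built with an Arnoldi-style recurrence (a NumPy sketch; the X, β, and k below are arbitrary placeholders):

```python
import numpy as np

def krylov_basis(X, beta, k):
    """Orthonormal basis for span(beta, X beta, ..., X^(k-1) beta),
    built by modified Gram-Schmidt (Arnoldi without the Hessenberg part)."""
    Q = np.empty((beta.size, k))
    Q[:, 0] = beta / np.linalg.norm(beta)
    for j in range(1, k):
        v = X @ Q[:, j - 1]          # next Krylov direction
        for i in range(j):           # orthogonalize against earlier vectors
            v -= (Q[:, i] @ v) * Q[:, i]
        Q[:, j] = v / np.linalg.norm(v)
    return Q
```

Replacing `X` with XᵀX (applied as two matrix-vector products) gives the second family of subspaces.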



Random β for data exploration,
measured β for supervised models

What other applications?

…focus on the information content (PCA, SVD, etc.)…

    


(https://cran.r-project.org/package=irlba)

…or perhaps cluster similar features…


(also https://cran.r-project.org/package=irlba)

…or maybe bi-cluster similar rows and columns…


https://CRAN.R-project.org/package=s4vd (Kaiser & Sill)

Sparse biclust algorithm by Lee, Shen, Huang, and Marron



1,000 \(\times\) 10,000 example matrix timed on a cheap laptop

library(s4vd)
set.seed(1)
# 1,000 x 10,000 Gaussian noise matrix
x <- matrix(rnorm(1000 * 10000), nrow=1000)
# plant an elevated-mean bicluster on the odd rows and columns
i <- seq(from=1, to=nrow(x), by=2)
j <- seq(from=1, to=ncol(x), by=2)
x[i, j] <- rnorm(length(i) * length(j), mean=2, sd=0.5)
# extract one sparse rank-1 bicluster with the Lee et al. SSVD method
cl <- biclust(x, method=BCssvd, K=1)
  • Original routines: 3,522 s

  • Reformulated routines: 20 s
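Most of that speedup comes from replacing elementwise loops with a few dense matrix products; the heart of the Lee, Shen, Huang, and Marron method is alternating power iterations with soft-thresholding. A hedged NumPy sketch of the rank-1 case (the penalty parameters `lam_u`, `lam_v` are illustrative, not the package's defaults):

```python
import numpy as np

def ssvd_rank1(X, lam_u=0.1, lam_v=0.1, iters=50):
    """Rank-1 sparse SVD sketch: alternate u- and v- power-iteration
    updates, soft-thresholding each singular vector to induce sparsity."""
    soft = lambda a, t: np.sign(a) * np.maximum(np.abs(a) - t, 0.0)
    u = np.linalg.svd(X, full_matrices=False)[0][:, 0]   # warm start
    for _ in range(iters):
        v = soft(X.T @ u, lam_v)
        v /= max(np.linalg.norm(v), 1e-12)
        u = soft(X @ v, lam_u)
        u /= max(np.linalg.norm(u), 1e-12)
    return u, v
```

The thresholded `u` and `v` supports identify the bicluster's rows and columns.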

…maybe the data represent a network…

…we might desire the most central vertices,

or perhaps the most communicable paths, etc….

Kleinberg hub-authority centrality

https://github.com/bwlewis/rfinance-2017/topm.r

Compare to Padé approximant for 1000 × 1000 directed-network adjacency matrix

# expm() as provided by the Matrix (or expm) package
system.time(ex <- diag(expm(X) + expm(-X)) / 2)
# user    system elapsed
# 151.080 0.220  151.552

head(order(ex, decreasing=TRUE), 5)
#[1] 11 25 27 29 74
system.time(top <- topm(X, type="centrality"))
# user  system elapsed
# 0.555 0.010  0.565

top$hubs
#[1] 11 25 27 29 74
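Hub and authority scores themselves need no matrix exponential: they are the leading left and right singular vectors of the adjacency matrix, reachable by a short power iteration. A NumPy sketch of Kleinberg's HITS (topm additionally bounds matrix-function quantities via Lanczos, which this does not show):

```python
import numpy as np

def hits(A, iters=200):
    """Kleinberg's HITS scores: hubs are the leading left singular
    vector of A, authorities the leading right one, via power iteration."""
    h = np.ones(A.shape[0])
    for _ in range(iters):
        a = A.T @ h                 # authorities: pointed to by good hubs
        a /= np.linalg.norm(a)
        h = A @ a                   # hubs: point to good authorities
        h /= np.linalg.norm(h)
    return h, a
```

The slide's `head(order(ex, decreasing=TRUE), 5)` corresponds to `np.argsort(-h)[:5]` here.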

Matrix completion



Cool algorithm by Cai, Candès, and Shen; prototype: https://github.com/bwlewis/rfinance-2017/blob/master/svt.r



Input matrix M with missing values everywhere except on an index set P.


\[ \min\,\,\mathrm{rank}(X) \,\,\mathrm{s. t.}\,\, X_P = M_P \]



The rank objective is non-convex; the standard relaxation replaces it with the nuclear norm (the sum of singular values):


\[ \min\,\,||X||_* \,\,\mathrm{s. t.}\,\, X_P = M_P \]
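The linked prototype implements singular value thresholding (SVT), which attacks the nuclear-norm problem by alternately shrinking singular values and pushing the iterate back toward the observed entries. A hedged NumPy sketch; `tau` and `delta` follow the rules of thumb in the Cai, Candès, and Shen paper and are not tuned:

```python
import numpy as np

def svt(M, mask, tau, delta, iters=1000):
    """Singular value thresholding: X = shrink_tau(Y), then push Y back
    toward the observed entries M_P; converges to a nuclear-norm solution."""
    Y = np.zeros_like(M)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt    # soft-threshold the spectrum
        Y += delta * mask * (M - X)                # re-impose X_P ~= M_P
    return X
```

Each iteration needs only a (possibly truncated) SVD, which is what makes the method practical at scale.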


…or, maybe we want to select interesting features (columns)…



Use generalized Krylov subspaces (Voss, Lanza, Morigi, Reichel, Sgallari, …) to solve



\[ \min \frac{1}{p}||X\beta - y||^p_p + \frac{\mu}{q}||\Phi(\beta)||^q_q \]


q = 1 gives the lasso; 0 < q < 1 gets closer to subset selection!
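For the q = 1 case this reduces to the lasso, which the simplest proximal-gradient method (ISTA) already solves. A self-contained NumPy sketch, not the generalized-Krylov solver above (which works in a small projected subspace instead of the full coordinate space):

```python
import numpy as np

def ista(X, y, mu, iters=1000):
    """ISTA for min 0.5*||X b - y||^2 + mu*||b||_1: a gradient step on the
    smooth term, then soft-thresholding (the prox of the l1 penalty)."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        z = b - X.T @ (X @ b - y) / L
        b = np.sign(z) * np.maximum(np.abs(z) - mu / L, 0.0)
    return b
```

Larger `mu` zeroes more coefficients, mimicking the feature-selection behavior described above.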

Projection methods turn big data into data



https://bwlewis.github.io/rfinance-2017


https://github.com/bwlewis/rfinance-2017



Thanks, Diethelm!