Project and Conquer!



Bryan W Lewis
rstor.io
http://illposed.net




Point of view matters




Data Shadows

The purple and pink dots are at least this far apart.

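The shadow picture's point can be stated precisely: an orthogonal projection never increases distances, so the distance between two shadows is a lower bound on the distance between the original points. A minimal sketch (NumPy used here for illustration; the points, dimension, and subspace are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=10)        # two points in R^10
b = rng.normal(size=10)

# a random 2-D subspace; QR yields an orthonormal basis Q
Q, _ = np.linalg.qr(rng.normal(size=(10, 2)))

shadow_dist = np.linalg.norm(Q.T @ (a - b))   # distance between shadows
true_dist = np.linalg.norm(a - b)

# the original points are *at least* shadow_dist apart
assert shadow_dist <= true_dist
```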




The tcor algorithm: fast thresholded distances and correlations.


https://github.com/bwlewis/tcor


http://arxiv.org/abs/1512.07246

Example



Given a 1,000 × 20,522 matrix, find all pairs of columns with Pearson's correlation coefficient ≥ 0.99.

  • brute force needs > 200 GFlops
  • tcor solves exactly the same problem using about 1 GFlop¹ (in seconds on a laptop)
  • easily extends to distributed and arbitrarily large data
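The pruning idea behind tcor can be sketched in a few lines (a NumPy illustration of the projection bound, not the package's R implementation). After centering and scaling columns to unit norm, corr(a, b) ≥ t exactly when ||a − b|| ≤ √(2(1 − t)); projecting onto a few singular vectors only shrinks distances, so pairs whose projected distance already exceeds that threshold can be discarded without ever computing their correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, t = 100, 50, 0.99
X = rng.normal(size=(n, m))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # plant one correlated pair

# center and scale columns to unit norm: corr(a, b) = a.b,
# and ||a - b||^2 = 2 * (1 - corr(a, b))
Z = X - X.mean(axis=0)
Z /= np.linalg.norm(Z, axis=0)

# project onto the top p left singular vectors; projection shrinks distances
p = 5
U = np.linalg.svd(Z, full_matrices=False)[0][:, :p]
Y = U.T @ Z                                     # p x m projected columns

thr = np.sqrt(2 * (1 - t))
pairs = []
for i in range(m):
    d = np.linalg.norm(Y[:, i + 1:] - Y[:, [i]], axis=0)
    for j in np.nonzero(d <= thr)[0] + i + 1:
        if Z[:, i] @ Z[:, j] >= t:              # exact check on survivors only
            pairs.append((i, j))
```

Only the few surviving candidate pairs pay for an exact dot product, which is where the ~200x flop reduction comes from.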


¹For the data matrix described in the package vignette

Recall, point of view matters



What subspaces to project into?

What kinds of projections?

I prefer Krylov subspaces



\[ \mathrm{span}(\beta, X\beta, X^2\beta, \ldots, X^{k-1}\beta), \]


\[ \mathrm{span}(\beta, X^TX\beta, (X^TX)^2\beta, \ldots, (X^TX)^{k-1}\beta), \qquad \ldots \]
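For concreteness, a basis for the first subspace above can be built with an Arnoldi-style recurrence (a NumPy sketch; the X, β, and k below are arbitrary placeholders):

```python
import numpy as np

def krylov_basis(X, beta, k):
    """Orthonormal basis for span(beta, X beta, ..., X^(k-1) beta),
    built by modified Gram-Schmidt (Arnoldi without the Hessenberg part)."""
    Q = np.empty((beta.size, k))
    Q[:, 0] = beta / np.linalg.norm(beta)
    for j in range(1, k):
        v = X @ Q[:, j - 1]          # next Krylov direction
        for i in range(j):           # orthogonalize against earlier vectors
            v -= (Q[:, i] @ v) * Q[:, i]
        Q[:, j] = v / np.linalg.norm(v)
    return Q
```

Replacing `X` with XᵀX (applied as two matrix-vector products) gives the second family of subspaces.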



Random β for data exploration,
measured β for supervised models

What other applications?

…focus on the information content (PCA, SVD, etc.)…

    


(https://cran.r-project.org/package=irlba)

…or perhaps cluster similar features…


(also https://cran.r-project.org/package=irlba)

…or maybe bi-cluster similar rows and columns…


https://CRAN.R-project.org/package=s4vd (Kaiser & Sill)

Sparse biclust algorithm by Lee, Shen, Huang, and Marron



1,000 \(\times\) 10,000 example matrix timed on a cheap laptop

library(s4vd)
set.seed(1)
# 1,000 x 10,000 Gaussian noise matrix
x <- matrix(rnorm(1000 * 10000), nrow=1000)
# plant an elevated-mean bicluster on the odd rows and columns
i <- seq(from=1, to=nrow(x), by=2)
j <- seq(from=1, to=ncol(x), by=2)
x[i, j] <- rnorm(length(i) * length(j), mean=2, sd=0.5)
# extract one sparse rank-1 bicluster with the Lee et al. SSVD method
cl <- biclust(x, method=BCssvd, K=1)
  • Original routines: 3,522 s

  • Reformulated routines: 20 s
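Most of that speedup comes from replacing elementwise loops with a few dense matrix products; the heart of the Lee, Shen, Huang, and Marron method is alternating power iterations with soft-thresholding. A hedged NumPy sketch of the rank-1 case (the penalty parameters `lam_u`, `lam_v` are illustrative, not the package's defaults):

```python
import numpy as np

def ssvd_rank1(X, lam_u=0.1, lam_v=0.1, iters=50):
    """Rank-1 sparse SVD sketch: alternate u- and v- power-iteration
    updates, soft-thresholding each singular vector to induce sparsity."""
    soft = lambda a, t: np.sign(a) * np.maximum(np.abs(a) - t, 0.0)
    u = np.linalg.svd(X, full_matrices=False)[0][:, 0]   # warm start
    for _ in range(iters):
        v = soft(X.T @ u, lam_v)
        v /= max(np.linalg.norm(v), 1e-12)
        u = soft(X @ v, lam_u)
        u /= max(np.linalg.norm(u), 1e-12)
    return u, v
```

The thresholded `u` and `v` supports identify the bicluster's rows and columns.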

…maybe the data represent a network…

…we might desire the most central vertices,

or perhaps the most communicable paths, etc….

Kleinberg hub-authority centrality

https://github.com/bwlewis/rfinance-2017/topm.r

Compare to Padé approximant for 1000 × 1000 directed-network adjacency matrix

# expm() as provided by the Matrix (or expm) package
system.time(ex <- diag(expm(X) + expm(-X)) / 2)
# user    system elapsed
# 151.080 0.220  151.552

head(order(ex, decreasing=TRUE), 5)
#[1] 11 25 27 29 74
system.time(top <- topm(X, type="centrality"))
# user  system elapsed
# 0.555 0.010  0.565

top$hubs
#[1] 11 25 27 29 74
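Hub and authority scores themselves need no matrix exponential: they are the leading left and right singular vectors of the adjacency matrix, reachable by a short power iteration. A NumPy sketch of Kleinberg's HITS (topm additionally bounds matrix-function quantities via Lanczos, which this does not show):

```python
import numpy as np

def hits(A, iters=200):
    """Kleinberg's HITS scores: hubs are the leading left singular
    vector of A, authorities the leading right one, via power iteration."""
    h = np.ones(A.shape[0])
    for _ in range(iters):
        a = A.T @ h                 # authorities: pointed to by good hubs
        a /= np.linalg.norm(a)
        h = A @ a                   # hubs: point to good authorities
        h /= np.linalg.norm(h)
    return h, a
```

The slide's `head(order(ex, decreasing=TRUE), 5)` corresponds to `np.argsort(-h)[:5]` here.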

Matrix completion



Cool algorithm by Cai, Candès, and Shen; prototype: https://github.com/bwlewis/rfinance-2017/blob/master/svt.r



Input matrix M with missing values everywhere except on an index set P.


\[ \min\,\,\mathrm{rank}(X) \,\,\mathrm{s. t.}\,\, X_P = M_P \]



The rank objective is non-convex; the standard relaxation replaces it with the nuclear norm (the sum of singular values):


\[ \min\,\,||X||_* \,\,\mathrm{s. t.}\,\, X_P = M_P \]
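The linked prototype implements singular value thresholding (SVT), which attacks the nuclear-norm problem by alternately shrinking singular values and pushing the iterate back toward the observed entries. A hedged NumPy sketch; `tau` and `delta` follow the rules of thumb in the Cai, Candès, and Shen paper and are not tuned:

```python
import numpy as np

def svt(M, mask, tau, delta, iters=1000):
    """Singular value thresholding: X = shrink_tau(Y), then push Y back
    toward the observed entries M_P; converges to a nuclear-norm solution."""
    Y = np.zeros_like(M)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt    # soft-threshold the spectrum
        Y += delta * mask * (M - X)                # re-impose X_P ~= M_P
    return X
```

Each iteration needs only a (possibly truncated) SVD, which is what makes the method practical at scale.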


…or, maybe we want to select interesting features (columns)…



Use generalized Krylov subspaces (Voss, Lanza, Morigi, Reichel, Sgallari, …) to solve



\[ \min \frac{1}{p}||X\beta - y||^p_p + \frac{\mu}{q}||\Phi(\beta)||^q_q \]


q = 1 gives the lasso; 0 < q < 1 gets closer to subset selection!
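For the q = 1 case this reduces to the lasso, which the simplest proximal-gradient method (ISTA) already solves. A self-contained NumPy sketch, not the generalized-Krylov solver above (which works in a small projected subspace instead of the full coordinate space):

```python
import numpy as np

def ista(X, y, mu, iters=1000):
    """ISTA for min 0.5*||X b - y||^2 + mu*||b||_1: a gradient step on the
    smooth term, then soft-thresholding (the prox of the l1 penalty)."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        z = b - X.T @ (X @ b - y) / L
        b = np.sign(z) * np.maximum(np.abs(z) - mu / L, 0.0)
    return b
```

Larger `mu` zeroes more coefficients, mimicking the feature-selection behavior described above.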

Projection methods turn big data into data



https://bwlewis.github.io/rfinance-2017


https://github.com/bwlewis/rfinance-2017



Thanks, Diethelm!