Project and Conquer!
Bryan W Lewis
rstor.io
http://illposed.net
Point of view matters
Data Shadows
The purple and pink dots are at least this far apart.
The purple and pink dots are at least this far apart.
The tcor algorithm: fast thresholded distances and correlations.
1,000 × 20,522 matrix, find all pairs of columns with Pearson’s correlation coefficient ≥ 0.99.
1For the data matrix described in the package vingette
What subspaces to project into?
What kinds of projections?
\[ \mathrm{span}(\beta, X\beta, X^2\beta, \ldots, X^{k-1}\beta), \]
\[ \mathrm{span}(\beta, X^TX\beta, (X^TX)^2\beta, \ldots, (X^TX)^{k-1}\beta), \qquad \ldots \]
Random β for data exploration,
measured β for supervised models
…focus on the information content (PCA, SVD, etc.)…
   Â
…or perhaps cluster similar features…
…or maybe bi-cluster similar rows and columns…
https://CRAN.R-project.org/package=s4vd (Kaiser & Sill)
Sparse biclust algorithm by Lee, Shen, Huang, and Marron
1,000 \(\times\) 10,000 example matrix timed on a cheap laptop
library(s4vd)
set.seed(1)
x = matrix(rnorm(1000 * 10000), nrow=1000)
i = seq(from=1, to=nrow(x), by=2)
j = seq(from=1, to=ncol(x), by=2)
x[i,j] = rnorm(length(i)*length(j), mean=2, sd=0.5)
cl = biclust(x, method=BCssvd, K=1)
- Original routines 3,522s
- Reformulated routines 20s
…maybe the data represent a network…
…we might desire the most central vertices,
or perhaps the most communicable paths, etc….
https://github.com/bwlewis/rfinance-2017/topm.r
Compare to Padé approximant for 1000 × 1000 directed-network adjacency matrix
system.time(ex <- diag(expm(X) + expm(-X)) / 2)
# user system elapsed
# 151.080 0.220 151.552
head(order(ex, decreasing=TRUE), 5)
#[1] 11 25 27 29 74
system.time(top <- topm(X, type="centrality"))
# user system elapsed
# 0.555 0.010 0.565
top$hubs
#[1] 11 25 27 29 74
Cool algorithm by Cai, Candès, Shen, prototype: https://github.com/bwlewis/rfinance-2017/blob/master/svt.r
Input matrix M with missing values every except index set P.
\[ \min\,\,\mathrm{rank}(X) \,\,\mathrm{s. t.}\,\, X_P = M_P \]
Input matrix M with missing values every except index set P.
\[ \min\,\,\mathrm{rank}(X) \,\,\mathrm{s. t.}\,\, X_P = M_P \]
\[ \min\,\,||X||_* \,\,\mathrm{s. t.}\,\, X_P = M_P \]
…or, maybe we want to select interesting features (columns)…
Use generalized Krylov subspaces (Voss, Lanza, Morigi, Reichel, Sgallari, …) to solve
\[ \min \frac{1}{p}||X\beta - y||^p_p + \frac{\mu}{q}||\Phi(\beta)||^q_q \]
q = 1 (Lasso) and 0 < q < 1 (closer to subset selection!)
https://bwlewis.github.io/rfinance-2017
https://github.com/bwlewis/rfinance-2017
Thanks, Diethelm!