This example groups stocks together in a network that highlights associations within and between the groups using only historical price data. The result is far from ground-breaking: you can already guess the output. For the most part, the stocks get grouped together into pretty obvious business sectors.

Despite the obvious result, the process of teasing out latent groupings from historic price data is interesting. That’s the focus of this example. A central idea of the approach taken here comes from the great paper of Ledoit and Wolf, “Honey, I Shrunk the Sample Covariance Matrix” (http://www.ledoit.net/honey.pdf). This example employs an alternative approach based on a matrix eigenvalue decomposition, but it’s the same general idea.

This note follows an informal, how-to format. Rather than focus on mathematical analysis, which is well-detailed in the references, I try to spell out the how’s and why’s: how to do things step by step (using R) and a somewhat non-rigorous rationale for each step that’s hopefully at least convincing and intuitive.

For emphasis, allow me to restate the first sentence as an objective:

    Group stocks together in a network that highlights associations within and between the groups, using only historical price data.

That’s what the rest of this example will do, hopefully illuminating some key ideas about regularization along the way.

Software used in the example

The example uses R of course, and the following R packages, all available on CRAN (some of the packages themselves have dependencies): quantmod, igraph, threejs, crosstalk, and htmltools. The optional corpcor package (discussed below) is also on CRAN, and the appendix additionally uses the experimental crosstool package from GitHub.

Getting data

NOTE: You can skip ahead to the Sample correlation matrix section by simply downloading a sample copy of processed log(return) data as follows:

library(quantmod)
load(url("http://illposed.net/logreturns.rdata"))

Otherwise, follow the next two sections to download the raw stock daily price data and process those data into log(returns).

Download daily closing price data from Google Finance

The quantmod package (Ulrich and Ryan, http://www.quantmod.com/) makes it ridiculously easy to download (and visualize) financial time series data. The following code uses quantmod to download daily stock price data for about 100 companies with the largest market capitalizations listed on the Standard & Poor’s 500 index at the time of this writing. The code downloads daily closing prices from 2012 until the present. Modify the code to experiment with different time periods or stocks as desired!

Because stock symbol names may change and companies may come and go, it’s possible that some of the data for some time periods are not available. The tryCatch() block in the code checks for a download error and flags problems by returning NA, which is later removed from the result. The upshot is that the number of columns of stock price time series in the output may be smaller than the number of input stock symbols.

The output of the following code is an xts time series matrix of stock prices called prices whose rows correspond to days and columns to stock symbols.

library(quantmod)
from="2012-05-17"
sym = c("AAPL", "ABBV", "ABT", "ACN", "AGN", "AIG", "ALL", "AMGN", "AMZN", "AXP",
        "BA", "BAC", "BIIB", "BK", "BLK", "BMY", "BRK.B", "C", "CAT", "CELG", "CL",
        "CMCSA", "COF", "COP", "COST", "CSCO", "CVS", "CVX", "DD", "DHR", "DIS", "DOW",
        "DUK", "EMR", "EXC", "F", "FB", "FDX", "FOX", "FOXA", "GD", "GE", "GILD", "GM",
        "GOOG", "GOOGL", "GS", "HAL", "HD", "HON", "IBM", "INTC", "JNJ", "JPM", "KHC",
        "KMI", "KO", "LLY", "LMT", "LOW", "MA", "MCD", "MDLZ", "MDT", "MET", "MMM",
        "MO", "MON", "MRK", "MS", "MSFT", "NEE", "NKE", "ORCL", "OXY", "PCLN", "PEP",
        "PFE", "PG", "PM", "PYPL", "QCOM", "RTN", "SBUX", "SLB", "SO", "SPG", "T",
        "TGT", "TWX", "TXN", "UNH", "UNP", "UPS", "USB", "UTX", "V", "VZ", "WBA",
        "WFC", "WMT", "XOM")

prices = Map(function(n)
             {
               print(n)
               tryCatch(getSymbols(n, env=NULL, from=from)[, 4], error = function(e) NA)
             }, sym)
N = length(prices)
# identify symbols returning valid data
i = ! unlist(Map(function(i) is.na(prices[i]), seq(N)))
# combine returned prices list into a matrix, one column for each symbol with valid data
prices = Reduce(cbind, prices[i])
colnames(prices) = sym[i]

Clean up and transform data

Not every stock symbol may have prices available for every day. Trading can be suspended for some reason, companies get acquired or go private, new companies form, etc.

Let’s fill in missing values going forward in time using the last reported price (piecewise constant interpolation), a reasonable approach for stock price time series. After that, if there are still missing values, simply remove the symbols that contain them, possibly further reducing the universe of stock symbols we’re working with.

for(j in 1:ncol(prices)) prices[, j] = na.locf(prices[, j])       # fill in
prices = prices[, apply(prices, 2, function(x) ! any(is.na(x)))]  # omit stocks with missing data

Now that we have a universe of stocks with valid price data, convert those prices to log(returns) for the remaining analysis (by returns I mean the ratio of each day’s price to the previous day’s price).

Why log(returns) instead of prices?

The log(returns) are closer to normally distributed than prices, especially in the long run. Pat Burns wrote a note about this (with a Tom Waits soundtrack): http://www.portfolioprobe.com/2012/01/23/the-distribution-of-financial-returns-made-simple/.
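For a quick visual check, here’s a minimal sketch (it assumes the “AAPL” column survived the download and cleanup steps above; substitute any symbol you like): a normal quantile-quantile plot of the daily log(returns) stays much closer to a straight line than the same plot of the raw prices.

# Compare normal Q-Q plots of raw prices and daily log(returns) for one stock
# (assumes the "AAPL" column is present; pick another symbol if it is not)
p = as.numeric(prices[, "AAPL"])
r = diff(log(p))
par(mfrow=c(1, 2))
qqnorm(p, main="AAPL prices"); qqline(p)
qqnorm(r, main="AAPL log(returns)"); qqline(r)
par(mfrow=c(1, 1))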

But why care about getting data closer to normally distributed?

That turns out to be important to us because later we’ll use a technique called partial correlation, which generally works better for normally distributed data than otherwise; see, for example, a nice technical discussion of this point by Baba, Shibata, and Sibuya: https://doi.org/10.1111%2Fj.1467-842X.2004.00360.x

The following simple code converts our prices matrix into a matrix of log(returns):

log_returns = apply(prices, 2, function(x) diff(log(x)))

Sample correlation matrix

It’s easy to convert the downloaded log(returns) data into a Pearson sample correlation matrix X:

X = cor(log_returns)

The (i, j)th entry of the sample correlation matrix X above is a measurement of the degree of linear dependence between the log(return) series for the stocks in columns i and j.

There exist at least two issues that can lead to serious problems with the interpretation of the sample correlation values:

  1. As Ledoit and Wolf point out, it’s well-known that empirical correlation estimates may contain lots of error.
  2. Correlation estimates between two stock log(return) series can be misleading for many reasons, including spurious correlation or existence of confounding variables related to both series (http://www.tylervigen.com/spurious-correlations).

A Nobel Prize-winning approach to dealing with the second problem considers cointegration between series instead of correlation; see, for example, notes by Eric Zivot (https://faculty.washington.edu/ezivot/econ584/notes/cointegrationslides.pdf), Bernhard Pfaff’s lovely book “Analysis of Integrated and Cointegrated Time Series with R” (http://www.springer.com/us/book/9780387759661), or Wikipedia (https://en.wikipedia.org/wiki/Cointegration). (I also have some weird technical notes on the numerics of cointegration at http://illposed.net/cointegration.html.)

Cointegration is a wonderful but fairly technical topic. Instead, let’s try a simpler approach.

We can try to address issue 2 above by at least partially controlling for confounding variables. One approach considers partial correlation instead of correlation (see for example the nice description in Wikipedia: https://en.wikipedia.org/wiki/Partial_correlation). That approach works best in practice with approximately normal data, which is one reason for the switch to log(returns) instead of prices.

It’s worth stating that our simple approach treats the log(returns) series as plain vectors rather than bona fide time series, and it can’t handle as many of the pathologies that might occur as cointegration can. But as we will see, this simple technique is still pretty effective at finding structure in our data. (And indeed, related methods discussed by Ledoit and Wolf and elsewhere are widely used in portfolio and risk analyses in practice.)

The partial correlation coefficients between the stock log(returns) series are, up to sign and scaling, the entries of the inverse of the sample correlation matrix (also called the precision matrix, https://www.statlect.com/glossary/precision-matrix). We will treat the entries of the precision matrix as measures of association in a network of stocks below.
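To make the connection concrete, here is a tiny self-contained sketch (using simulated data, not the stock data): up to a sign flip and scaling by its diagonal, an off-diagonal entry of the inverse correlation matrix gives the partial correlation between the corresponding pair of variables.

# Small simulated example relating the precision matrix to partial correlation
set.seed(1)
Z = matrix(rnorm(300), ncol=3)   # 100 observations of 3 variables
R = cor(Z)                       # sample correlation matrix
W = solve(R)                     # precision matrix (inverse correlation)
# partial correlation of variables 1 and 2, controlling for variable 3:
-W[1, 2] / sqrt(W[1, 1] * W[2, 2])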

Market trading of our universe of companies, with myriad known and unknown associations between them and the larger economy, produced the stock prices we downloaded. Our objective is a kind of inverse problem: given a bunch of historical stock prices, produce a network of associations.

You may recall from some long-ago class that, numerically speaking, inverting matrices is generally a bad idea. Even worse, issue 1 above says that our estimated correlation coefficients contain error (noise). Even a tiny amount of noise can be hugely amplified when we invert the matrix. That’s because, as we will soon see, the sample correlation matrix contains tiny eigenvalues, and matrix inversion effectively divides the noise by those tiny values. Simply stated, dividing by a tiny number returns a big number; that is, matrix inversion tends to blow the noise up. This is a fundamental issue (in a sense, the fundamental issue) common to many inverse problems.
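Here’s a tiny, self-contained illustration of the problem (not using the stock data): a matrix with one small eigenvalue, perturbed by a little “noise”, has an inverse that changes enormously.

# Inverting a matrix with a tiny eigenvalue amplifies small perturbations
A = diag(c(1, 1e-6))               # eigenvalues 1 and 0.000001
E = matrix(1e-4, 2, 2)             # a small perturbation ("noise")
max(abs(solve(A + E) - solve(A)))  # a 1e-4 perturbation changes the inverse by roughly 1e6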

Ledoit and Wolf’s sensible answer to reducing the influence of noise is regularization. Regularization replaces models with different, but related, models designed to reduce the influence of noise on their output. Ledoit and Wolf use a form of regularization related to ridge regression (a.k.a. Tikhonov regularization) with a peculiar regularization operator based on a highly structured estimate of the covariance. We will use a simpler kind of regularization based on an eigenvalue decomposition of the sample correlation matrix X.

Regularization

Here is an eigenvalue decomposition of the sample correlation matrix:

L = eigen(X, symmetric=TRUE)

Note that R’s eigen() function takes care to return the (real-valued) eigenvalues of a symmetric matrix in decreasing order for us. (Technically, the correlation matrix is symmetric positive semi-definite, and will have only nonnegative real eigenvalues.)
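A couple of quick sanity checks on the decomposition (a minimal sketch): the eigenvalues should come back sorted in decreasing order, and none should be meaningfully negative (tiny negative values can appear from floating-point roundoff).

# Sanity checks on the eigenvalue decomposition of X
all(diff(L$values) <= 0)   # eigenvalues are in decreasing order
range(L$values)            # all (approximately) nonnegative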

Each eigenvector represents an orthogonal projection of the sample correlation matrix onto a line (a 1-d shadow of the data); the first two eigenvectors define a projection onto a plane (2-d), and so on. The eigenvalues estimate the proportion of information (or variability, if you prefer) from the original sample correlation matrix contained in each eigenvector. Because the eigenvectors are orthogonal, these measurements of projected information are additive.

Here is a plot of all the sample correlation matrix eigenvalues (along with a vertical line that will be explained in a moment):

plot(L$values, ylab="eigenvalues")
abline(v=10)

The eigenvalues fall off rather quickly in our example! That means that a lot of the information in the sample correlation matrix is contained in the first few eigenvectors.
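We can quantify that claim with the cumulative proportion of the total eigenvalue sum captured by the leading eigenvectors (a quick sketch; the exact numbers depend on your data and time period):

# Cumulative proportion of variability captured by the first 20 eigenvectors
head(cumsum(L$values) / sum(L$values), 20)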

Let’s assume, perhaps unreasonably, that the errors in our estimate of the correlation matrix are equally likely to occur in any direction (that the errors are white noise, basically). As we can see above, most of the information is concentrated in the subspace corresponding to the first few eigenvectors. But white noise will have information content in all the dimensions more or less equally.

One regularization technique replaces the sample correlation matrix with an approximation defined by only its first few eigenvectors. Because they represent a large amount of the information content, the approximation can be pretty good. More importantly, because we assumed noise to be more or less equally represented across the eigenvector directions and we’re cutting most of those off, this approximation tends to damp the noise more than the underlying information. Most importantly, we’re cutting off the subspace associated with tiny eigenvalues, avoiding the problem of division by tiny values and significantly reducing amplified noise in the inverse of the sample correlation matrix (the precision matrix).
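Here is a minimal sketch of that idea (the choice k = 10 is just for illustration): build the rank-k approximation of X from its first k eigenvectors and check how close it is to X in the Frobenius norm.

# Rank-k approximation of the sample correlation matrix
k = 10
Xk = L$vectors[, 1:k] %*% (L$values[1:k] * t(L$vectors[, 1:k]))
norm(X - Xk, "F") / norm(X, "F")   # relative approximation error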

The upshot is that we regularize the sample correlation matrix by approximating it with a low-rank matrix, which substantially reduces the influence of noise on the precision matrix. See Per Christian Hansen’s classic paperback “Rank-Deficient and Discrete Ill-Posed Problems” (http://epubs.siam.org/doi/book/10.1137/1.9780898719697) for insight into related topics.

But how to choose a cut-off rank?

There is a substantial mathematical literature on just this topic (regularization parameter selection), complete with deep theory as well as lots of heuristics. Let’s keep things simple for this example and form our approximation by cutting off eigenvectors beyond the point where the eigenvalue plot starts to flatten out, close to the vertical line in the plot above.

Alternatively, consider the lovely short 2004 paper by Chris Ding and Xiaofeng He (http://dl.acm.org/citation.cfm?id=1015408) that illuminates connections (which I happen to find fascinating) between k-means clustering and projections like truncated eigenvalue expansions. Although we aren’t interested in k-means clustering per se, our objective is connected to clustering. Ding and He show that we can find at least k (k-means) clusters using the first k - 1 eigenvectors above. This gives us another heuristic way to choose a projection dimension, at least if we have an idea about the number of clusters to look for.
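As a rough illustration of that connection (my own sketch, not a recipe from Ding and He), we can run k-means directly on the rows of the first k - 1 eigenvectors; with k = 10 as a guess, the groupings tend to be broadly similar to the network clusters found below.

# k-means on the first k - 1 eigenvector coordinates of the stocks (k = 10)
set.seed(1)
cl = kmeans(L$vectors[, 1:9], centers=10, nstart=50)
split(colnames(X), cl$cluster)   # stocks grouped by cluster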

A precision matrix, finally

Finally, we form the precision matrix P from the regularized sample correlation matrix. The inversion is less numerically problematic now because of regularization. Feel free to experiment with the projected rank N below!

N = 10  # (use 1st 10 eigenvectors, set N larger to reduce regularization)
P = L$vectors[, 1:N] %*% ((1 / L$values[1:N]) * t(L$vectors[, 1:N]))  # inverse of the rank-N approximation of X
P = P / tcrossprod(sqrt(diag(P)))                                     # rescale to unit diagonal

Other approaches

I’m not qualified to write about them, but you should be aware that Bayesian approaches to solving problems like this are also effectively (and effective!) regularization methods. I hope to someday better understand the connections between classical inverse problem solution methods that I know a little bit about, and Bayesian methods that I know substantially less about.

Put a package on it

There is a carefully written R package to construct regularized correlation and precision matrices: the corpcor package (https://cran.r-project.org/package=corpcor, and also see http://strimmerlab.org/software/corpcor/) by Juliane Schafer, Rainer Opgen-Rhein, Verena Zuber, Miika Ahdesmaki, A. Pedro Duarte Silva, and Korbinian Strimmer. Their package includes the original Ledoit-Wolf-like regularization method, as well as refinements to it and many other methods. The corpcor package, like Ledoit and Wolf’s approach, includes ways to use sophisticated regularization operators, and it applies more broadly than the simple approach taken in this note.

You can use the corpcor package to form a Ledoit-Wolf-like regularized precision matrix P, and you should try it! The result is pretty similar to what we get from our simple truncated eigenvalue decomposition regularization in this example.
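For example, here is a minimal sketch using corpcor’s shrinkage estimator of the partial correlation matrix (see the corpcor documentation for pcor.shrink and its options; the comparison below is just an eyeball check):

# Shrinkage (Ledoit-Wolf-style) estimate of the partial correlation matrix
library(corpcor)
P_shrink = pcor.shrink(log_returns)
# compare a corner of it with our truncated-eigenvalue version P
P_shrink[1:5, 1:5]
P[1:5, 1:5]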

Networks and clustering

The (i, j)th entry of the precision matrix P is a measure of association between the log(return) time series for the stocks in columns i and j, with larger values corresponding to more association.

An interesting way to group related stocks together is to think of the precision matrix as an adjacency matrix defining a weighted, undirected network of stock associations. Thresholding the entries of the precision matrix to keep, say, only the top ten percent yields a network of only the most strongly associated stocks.

Thinking in terms of networks opens up a huge and useful toolbox: graph theory. We gain access to all kinds of nifty ways to analyze and visualize data, including methods for clustering and community detection.

R’s comprehensive igraph package by Gábor Csárdi (https://cran.r-project.org/package=igraph) includes many network cluster detection algorithms. The example below uses Blondel and co-authors’ fast community detection algorithm implemented by igraph’s cluster_louvain() function to segment the thresholded precision matrix of stocks into groups. The code produces an igraph graph object g, with vertices colored by group membership.

suppressMessages(library(igraph))

threshold = 0.90
Q = P * (P > quantile(P, probs=threshold))                           # thresholded precision matrix
g = graph.adjacency(Q, mode="undirected", weighted=TRUE, diag=FALSE) # ...expressed as a graph

# The rest of the code lumps any singletons lacking edges into a single 'unassociated' group shown in gray
# (also assigning distinct colors to the other groups).
x = groups(cluster_louvain(g))           # cluster membership (list of vertex groups)
i = unlist(lapply(x, length))            # size of each group
d = order(i, decreasing=TRUE)
x = x[d]                                 # sort groups from largest to smallest
i = i[d]
j = i > 1                                # groups with more than one member
s = sum(j)                               # number of multi-member groups
names(x)[j] = seq(1, s)                  # label the multi-member groups 1, 2, ..., s
names(x)[! j] = s + 1                    # lump all singletons into group s + 1
grp = as.integer(rep(names(x), i))       # group label for each vertex, in group order
clrs = c(rainbow(s), "gray")[grp[order(unlist(x))]]  # colors per vertex (gray = unassociated)
g = set_vertex_attr(g, "color", value=clrs)

Use the latest threejs package to make a nice interactive visualization of the network (you can use your mouse/trackpad to rotate, zoom and pan the visualization).

library(threejs)
graphjs(g, vertex.size=0.2, vertex.shape=colnames(X), edge.alpha=0.5)

The stock groups identified by this method are uncanny, but hardly all that surprising really. Look closely and you will see clusters made up of bank-like companies (AIG, BAC, BK, C, COF, GS, JPM, MET, MS, USB, WFC), pharmaceutical companies (ABT, AMGN, BIIB, BMY, CELG, GILD, JNJ, LLY, MRK, PFE), computer/technology-driven companies (AAPL, ACN, CSCO, IBM, INTC, MSFT, ORCL, QCOM, T, TXN, VZ), and so on.

The groups more or less correspond to what we already know!

The FB, GOOG, AMZN, PCLN (Facebook, Alphabet/Google, Amazon, Priceline) group is interesting: it also includes the credit card companies V (Visa) and MA (Mastercard). Perhaps the returns of FB, GOOG and AMZN are more closely connected to consumer spending than to technology!

This way of looking at things also nicely highlights connections between groups. For instance, the pharma group is connected to the consumer products group through JNJ and PG (Johnson & Johnson and Procter & Gamble). See the appendix below for a visualization that explores different precision matrix threshold values, including lower values with far greater network connectivity.

Review

We downloaded daily closing stock prices for 100 stocks from the S&P 500, and, using basic tools of statistics and analysis like correlation and regularization, we grouped the stocks together in a network that highlights associations within and between the groups. The structure teased out of the stock price data is reasonably intuitive.


Appendix: threejs tricks

The following self-contained example shows how the network changes with threshold value. It performs the same steps as we did above, but uses some tricks in threejs and an experimental extension to the crosstalk package and a few additional R packages to present an interactive animation. Enjoy!

suppressMessages({
library(quantmod)
library(igraph)
library(threejs)
library(crosstalk)
library(htmltools)
# using an experimental extension to crosstalk:
library(crosstool) # devtools::install_github('bwlewis/crosstool')
})

# Download the processed log(returns) data:
suppressMessages(load(url("http://illposed.net/logreturns.rdata")))

X = cor(log_returns)
L = eigen(X, symmetric=TRUE)
N = 10  # (use 1st 10 eigenvectors, set N larger to reduce regularization)
P = L$vectors[, 1:N] %*% ((1 / L$values[1:N]) * t(L$vectors[, 1:N]))
P = P / tcrossprod(sqrt(diag(P)))
colnames(P) = colnames(X)

# A function that creates a network for a given threshold and precision matrix
f = function(threshold, P)
{
  Q = P * (P > quantile(P, probs=threshold))                           # thresholded precision matrix
  g = graph.adjacency(Q, mode="undirected", weighted=TRUE, diag=FALSE) # ...expressed as a graph

  # Same singleton-lumping and coloring steps as in the main example above:
  x = groups(cluster_louvain(g))
  i = unlist(lapply(x, length))
  d = order(i, decreasing=TRUE)
  x = x[d]
  i = i[d]
  j = i > 1
  s = sum(j)
  names(x)[j] = seq(1, s)
  names(x)[! j] = s + 1
  grp = as.integer(rep(names(x), i))
  clrs = c(rainbow(s), "gray")[grp[order(unlist(x))]]
  g = set_vertex_attr(g, "color", value=clrs)
  set_vertex_attr(g, "shape", value=colnames(P))
}

threshold = c(0.97, 0.95, 0.90, 0.85, 0.8)
g = Map(f, threshold, MoreArgs=list(P=P)) # list of graphs, one for each threshold

# Compute force-directed network layouts for each threshold value
# A bit expensive to compute, so run in parallel!
library(parallel)
l = mcMap(function(x) layout_with_fr(x, dim=3, niter=150), g, mc.cores=detectCores())

sdf = SharedData$new(data.frame(key=paste(seq(0, length(threshold) - 1))), key=~key)
slider = crosstool(sdf, "transmitter",
                sprintf("<input type='range' min='0' max='%d' value='0'/>",
                length(threshold) - 1), width="100%", height=20, channel="filter")
vis = graphjs(g, l, vertex.size=0.2, main=as.list(threshold), defer=TRUE, edge.alpha=0.5, deferfps=30,
        crosstalk=sdf, width="100%", height=900)

browsable(div(list(HTML("<center>"), tags$h3("Precision matrix quantile threshold (adjust slider to change)"), slider, vis)))
