This document presents a fun example of graph-based clustering using stock market time series data. The example consists of three steps:

  1. Download daily adjusted price data for 58 stocks and convert the prices to log returns.
  2. Compute a shrunk precision matrix using the celebrated Ledoit-Wolf method and threshold its values to retain only the most strongly conditionally correlated returns.
  3. Plot the result as an adjacency graph.

We don’t form explicit clusters in the graph; we simply visualize it. A natural next step might be to run community or clique detection algorithms on the graph, for example from the igraph package (a sketch appears at the end of step 3).

See below for more detail. This example was largely inspired by this excellent Python Scikit-learn example:

http://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html

1. Download

We use the quantmod package to obtain daily adjusted stock price data for 58 stocks from 2007 to the present. If you don’t have the quantmod package installed, you’ll need to install it to perform this step.

In case you’re wondering, Walgreens recently replaced its 87-year-old ticker symbol “WAG” with “WBA.”

In addition to the stock ticker symbols, we assign each stock to its sector in the sector_assignment vector below. That’s used in a visualization later.

symbols <- c("AAPL", "AMZN", "AXP", "BA", "BAC", "CAJ", "CAT", "CL", "CMCSA",
              "COP", "CSCO", "CVC", "CVS", "CVX", "DD", "F", "GD", "GE", "GS",
              "GSK", "HD", "HMC", "HPQ", "IBM", "JPM", "K", "KMB", "KO", "LMT",
              "MAR", "MCD", "MDLZ", "MMM", "MSFT", "MTU", "NAV", "NOC", "NVS",
              "PEP", "PFE", "PG", "R", "RTN", "SAP", "SNE", "SNY", "TM", "TOT",
              "TWX", "TXN", "UN", "VLO", "WBA", "WFC", "WMT", "XOM", "XRX", "YHOO")

sectors           <- c("consumer", "energy", "finance", "industrial", "pharma", "tech")
sector_assignment <- factor(sectors[c(6,1,3,3,3,6,4,1,1,2,6,1,1,2,4,1,4,4,3,5,1,1,6,6,3,1,1,1,4,1,
                                      1,1,4,6,3,4,4,5,1,5,1,4,4,6,6,5,1,2,1,6,1,2,1,3,1,2,6,6)])

Download data for these stocks using the quantmod package and convert the daily adjusted prices to log returns.

library(quantmod)
# Download each symbol and keep its daily adjusted closing price (column 6 of
# the object returned by getSymbols).
p       <- lapply(symbols, function(n) {print(n); getSymbols(n, auto.assign=FALSE)[,6]})
x       <- Reduce(cbind, p)                        # one price column per symbol
returns <- apply(x, 2, function(z) diff(log(z)))   # convert prices to log returns
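Depending on when you run this, some symbols may have shorter histories or missing observations, which show up as NAs after the cbind above. One blunt but simple option before the shrinkage step is to keep only the days with complete data:

returns <- returns[complete.cases(returns), ]   # drop days with a missing return for any symbol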

2. Shrink

Compute a shrunk sample correlation matrix, invert it to obtain a precision matrix, and threshold the precision matrix to keep only its largest entries. The precision matrix is an estimate of the conditional correlation between stocks. We use the Ledoit-Wolf shrinkage method provided by R’s corpcor package. If you don’t have the corpcor package, you’ll need to install it from CRAN.

See below for very brief comments on shrinking correlation matrices.

Two user-supplied parameters are used in this step, a shrinkage parameter and a threshold. You should be suspicious of this! How are these values chosen? See below for more comments on this and an example shiny app that visualizes the effect of the parameters on the graph.

library("corpcor")
Sr <- cor.shrink(returns,lambda=0.5)             # shrink
## Specified shrinkage intensity lambda (correlation matrix): 0.5
Pr <- solve(Sr,diag(rep(1,nrow(Sr))))            # invert
Qr <- Pr*(abs(Pr)>quantile(abs(Pr),probs=0.9))   # threshold
colnames(Qr) <- rownames(Qr) <- symbols
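To get a feel for how aggressive the 0.9 quantile threshold is, it can help to count how many edges survive it, for example:

sum(Qr != 0 & upper.tri(Qr))   # number of retained edges (each pair counted once)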

3. Visualize

Finally we plot the thresholded precision matrix as an adjacency graph, color coding the vertices by (known) stock sector. The plot shows that conditional correlation of the return series for these financial instruments does a remarkably good job of clustering them!

We use a fork of Christopher Gandrud’s nifty networkD3 package for R to draw the plot. You can install that package directly from GitHub using the devtools package:

devtools::install_github("bwlewis/networkD3")

Or, you can use the networkD3 package currently on CRAN, but you won’t get the vertices colored by their stock sector.

library(networkD3)
edges <- which(Qr!=0, arr.ind=TRUE) # Adjacency graph edge list
links <- data.frame(source=symbols[edges[,2]], target=symbols[edges[,1]])

# Let's color the vertices by stock sector.
names(sector_assignment) <- symbols
N <- length(levels(sector_assignment))
sector_palette <- substr(rainbow(N), 1, 7)   # drop the alpha channel from the hex color codes
# One color per vertex, listed in the order the vertex names first appear in the edge list:
vertex_colors <- sector_palette[as.integer(sector_assignment[unique(Reduce(c,t(links)))])]

simpleNetwork(links, fontSize=16, textColour="#000011",
              linkColour="#bbbbbb", nodeColour=vertex_colors,
              charge=-250, nodeClickColour=NULL)

This does a pretty good job of finding connections between related stocks! Note in particular the MTU, TM, CAJ, HMC, SNE cluster; although those stocks span many sectors, they are all Japanese corporations.
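As mentioned in the introduction, a natural next step is to run a community detection algorithm from the igraph package on this adjacency graph. Here is a minimal sketch using the Qr matrix computed in step 2; cluster_walktrap is just one of several community detection methods igraph provides:

library(igraph)
g  <- graph_from_adjacency_matrix(1 * (Qr != 0), mode="undirected", diag=FALSE)
wc <- cluster_walktrap(g)           # detect communities in the adjacency graph
split(V(g)$name, membership(wc))    # stocks grouped by detected community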

A brief discussion of covariance matrix shrinkage

A sample covariance matrix (computed from observed data) may not yield a good estimate of the underlying population covariance structure for a variety of reasons, including noisy measurements and missing or incomplete data. Sample covariance matrices can also exhibit poor numerical conditioning, whether because of noisy or imperfect data or otherwise. These errors are magnified when computing the precision matrix (the inverse of the covariance matrix).

Regularization, a.k.a. shrinkage, is a broad topic that has been applied in this example to help address these issues. The basic idea is to replace the sample covariance matrix with a better numerically conditioned one; this has the effect of “shrinking” the covariance matrix towards a multiple of the identity matrix. The gist of the shrinkage approach is described next.
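One common linear form of this idea writes the regularized estimate as a convex combination of the sample covariance matrix \(S\) and a scaled identity target,

\[
\hat{S}(\lambda) = (1 - \lambda)\, S + \lambda\, \mu I, \qquad 0 \le \lambda \le 1,
\]

where \(\mu\) is, for example, the average of the sample variances and \(\lambda\) is the shrinkage intensity. The corpcor package applies an analogous formula to the correlation matrix, shrinking its off-diagonal entries toward zero.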

Recall that if the columns of a data matrix \(X\) all have mean zero, then the sample covariance matrix \(S\) is a multiple of \(X^T X\). In particular, \(S\) is symmetric and has no negative eigenvalues. That means that simply adding a positive multiple of the identity matrix, forming \(S + \lambda I\), shifts every eigenvalue of \(S\) up by \(\lambda\), away from zero, yielding a better numerically conditioned matrix than \(S\). This shift reduces the influence of noise on the computed precision matrix values (which are related to the inverse of the sample covariance matrix).
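Here is a tiny numeric check of that eigenvalue shift; the data matrix and \(\lambda\) below are made up purely for illustration:

set.seed(1)
X      <- scale(matrix(rnorm(200), 20, 10), center=TRUE, scale=FALSE)  # zero-mean columns
S      <- crossprod(X) / (nrow(X) - 1)   # sample covariance, a multiple of t(X) %*% X
lambda <- 0.1
# The eigenvalues of S + lambda*I are exactly the eigenvalues of S plus lambda:
max(abs(eigen(S + lambda * diag(ncol(S)))$values - (eigen(S)$values + lambda)))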

The Ledoit-Wolf method is a more sophisticated regularization scheme, with added finesse, for shrinking covariance matrices. See http://www.ledoit.net/honey.pdf for an enjoyable paper.
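Note that if you omit the lambda argument, corpcor will estimate a shrinkage intensity from the data itself in the Ledoit-Wolf spirit. A quick sketch, assuming the returns matrix from step 1:

Sr_auto <- corpcor::cor.shrink(returns)   # shrinkage intensity estimated from the data
attr(Sr_auto, "lambda")                   # inspect the estimated intensity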

A brief note on the user-supplied parameters

I used two user-supplied parameters above, one for the shrinkage amount and another for thresholding. The visualization and graph clusters that fall out of this are sensitive to both of these parameters. Too little shrinkage or too little thresholding and everything globs together. Too much and we separate all the stocks into isolated islands.

I’m always suspicious of parameters. In this case, I did not select them using an analytic process, but rather I chose values that made the plot look good!

Note that if I quantified the phrase “look good” more carefully, then I could cook up an analytic process for parameter selection. For example, “look good” might mean breaking the graph up into a small number of connected components with large membership.
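Here is a rough sketch of that idea for the thresholding parameter: reuse the shrunk precision matrix Pr from step 2 and count, with the igraph package, how many connected components each threshold produces (the grid of thresholds below is arbitrary):

library(igraph)
thresholds   <- seq(0.5, 0.99, by=0.01)
n_components <- sapply(thresholds, function(p) {
  Q <- Pr * (abs(Pr) > quantile(abs(Pr), probs=p))
  g <- graph_from_adjacency_matrix(1 * (Q != 0), mode="undirected", diag=FALSE)
  count_components(g)
})
plot(thresholds, n_components, type="l",
     xlab="threshold quantile", ylab="connected components")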

Seeing how the parameters affect the output is interesting, and in this case easy because there are only two. The following code block presents a shiny app that you can run to see how the parameters change the graph. Have fun!

library("shiny")
library("quantmod")
library("networkD3")
library("corpcor")

# Pull down the stock return data from my web site. Replace this
# with the quantmod-based data collection procedure if you like.
# Note: the app below also uses the symbols and sector_assignment
# objects defined in step 1.
con <- url("http://illposed.net/returns.rdata")
load(con)
close(con)


runApp(list(

  ui = pageWithSidebar( # See ?pageWithSidebar for help
    headerPanel("Stock return series clustering"),
    sidebarPanel(
      sliderInput("lambda", div(HTML("&lambda;")), min=0.0, max=1, value=0.2, step=0.01),
      sliderInput("threshold", "threshold", min=0.0, max=1, value=0.9, step=0.01)
    ),
    mainPanel(
      simpleNetworkOutput("network")
    )
  ),

  server = function(input, output, session)
  {

    output$network <- renderSimpleNetwork({
      Sr <- cor.shrink(returns,lambda=input$lambda)
      Pr <- solve(Sr,diag(rep(1,nrow(Sr))))
      Qr <- Pr*(abs(Pr)>quantile(abs(Pr),probs=input$threshold))
      colnames(Qr) <- rownames(Qr) <- symbols
      edges <- which(Qr!=0, arr.ind=TRUE) # Adjacency graph edge list
      links <- data.frame(source=symbols[edges[,2]], target=symbols[edges[,1]])
      names(sector_assignment) <- symbols
      N <- length(levels(sector_assignment))
      sector_palette <- substr(rainbow(N), 1, 7)
      vertex_colors <- sector_palette[as.integer(sector_assignment[unique(Reduce(c,t(links)))])]
      simpleNetwork(links, fontSize=16, textColour="#000011",
              linkColour="#bbbbbb", nodeColour=vertex_colors,
              charge=-250, nodeClickColour=NULL)
    })

  }
))