1000 Genomes Whole Genome PCA Example



Upshot    It’s pretty easy to compute principal components of variants across thousands (or many more) of whole genomes.

This example illustrates principal component analysis (PCA) of genomic variant data from 2,504 people in the 1000 Genomes Project [1]. The example projects genome-wide variant data into a three-dimensional subspace. See https://github.com/bwlewis/1000_genomes_examples for the source files and additional notes associated with this and other examples using 1000 Genomes Project data.

Specifically, arrange the variant data into a sparse matrix \(A\) whose 2,504 rows represent people and whose columns represent genomic variants. A one in row i, column j of the matrix means that variant j occurred in person i; the matrix is otherwise filled with zeros. Details on parsing the raw variant data appear below. Let \(\hat{A}\) represent the centered matrix obtained by subtracting each column's mean from that column. Then this example computes the truncated singular value decomposition (SVD):

\[ \hat{A} V = U \Sigma, \]

where \(U\) is a 2,504 by 3 principal component matrix with orthonormal columns, \(\Sigma\) is a diagonal 3 by 3 matrix of singular values, and \(V\) is a matrix with three orthonormal columns and as many rows as \(A\) has columns. We’re not that interested in the \(V\) matrix in this example, and because it can be big (\(A\) has a lot of columns), we avoid computing it explicitly when possible.
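Here is a quick toy sketch of that decomposition (a small random 0/1 matrix standing in for the real variant data, not part of the analysis below):

set.seed(1)
a = matrix(rbinom(20 * 200, 1, 0.05), 20, 200)   # 20 toy "people" by 200 toy "variants"
ahat = sweep(a, 2, colMeans(a))                  # centered matrix
sv = svd(ahat, nu=3, nv=3)                       # rank-3 truncated SVD
U = sv$u                                         # 20 x 3, orthonormal columns
Sigma = diag(sv$d[1:3])                          # 3 x 3 diagonal matrix of singular values
V = sv$v                                         # 200 x 3, orthonormal columns
max(abs(ahat %*% V - U %*% Sigma))               # essentially zero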

Examples like this are often used to illustrate “big data” analysis in genomics, even though the data are not particularly big. The point of this example is not to say that PCA on genomic variants is profound, but rather that it’s relatively easy.

The example uses:

  • the irlba package [2] for the truncated PCA/SVD computation,
  • R's Matrix and parallel packages,
  • a small C program to parse the raw VCF variant files,
  • and the threejs and igraph packages for visualization.

We compare two approaches below: a method based on the eigenvalue decomposition of \(\hat{A} \hat{A}^T\) (the cross product method), and a method using the IRLBA algorithm. Each method has benefits, drawbacks, and idiosyncrasies described below.

Partitioning the work

I want scalable solution approaches; that is, solutions that work reasonably well on small computers and also on big ones or even clusters of computers. I primarily worked on solving this problem on my quad-core home PC, equipped with 16 GB RAM. The raw data size of the problem is approximately 123 GB, which means that I needed to break the problem up into pieces small enough to fit in RAM.

Solving the problem by breaking it up into manageable pieces has the added advantage of promoting scalability: the pieces can be run on other CPU cores, or even on networked computers, relatively easily.

The cross product method

The variant data are represented as a very sparse matrix of 2,504 rows (people) by 81,271,844 columns (genomic variants), but with only about 9.8 billion nonzero elements, that is, a fill-in of roughly 5%. In other words, not every person exhibits every variant. It’s hard to make good use of available CPU floating point resources with sparse data because a lot of time is spent simply addressing and wrangling the data into CPU registers. Breaking the problem up explicitly into smaller pieces, as described in the next section, can help CPU utilization through explicit coarse-grained parallelism over the pieces.

Let’s formalize some notation for future reference.

  • Let \(A\in R^{m\times n}\) be the 2,504 by 81,271,844 variant data matrix.
  • Let \(z\) be the 81,271,844 element vector of column means of \(A\).
  • Let \(e\) represent a vector of ones of length determined by the context.
  • Let \(\hat{A} = A - ez^T\) be the centered matrix.

The fact that the data matrix is very “fat”, with many more columns than rows, has some interesting consequences. For example, one naive approach to computing all the columns of the \(U\) matrix is to compute a symmetric eigenvalue decomposition of the relatively small 2,504 by 2,504 matrix \(\hat{A} \hat{A}^T = U \Sigma^2 U^T\) (since \(\Sigma\) is diagonal, \(\Sigma^2\) simply squares its diagonal entries). This approach has at least two potential issues, but they are relatively easy to deal with.

First, the matrix \(\hat{A}\) is dense and large (more than 200 billion elements, about 1.6 TB in double precision), which means we can’t explicitly form it. However, we can implicitly compute the matrix product \(\hat{A} \hat{A}^T\) without ever forming \(\hat{A}\) as follows:

  1. Let \(z=\)column means\((A)\) be an 81,271,844-element vector.
  2. Let \(e = (1,1, \ldots, 1)\) be a 2,504 element vector of all ones.
  3. Let \(B = (Az)e^T\) be their 2,504 by 2,504 product.
  4. Then \[ \hat{A} \hat{A}^T = A A^T - B - B^T + (z^T z) e e^T \]

We can then compute the symmetric eigenvalue decomposition of the small \(\hat{A}\hat{A}^T\) matrix after step 4 to obtain the desired \(U\) matrix. The computation in step 4 avoids explicitly forming a huge dense centered matrix. The biggest part of the work in step 4 is the computation of \(A A^T\), which as a whole could require a large amount of memory, but this matrix product is very easy to break up into smaller chunks. Because the data are so sparse, the product \(A A^T\) only requires about 2.3 trillion floating point operations (Tflop) for this example. This approach also only requires a single pass through the data, a potentially significant advantage for I/O-bound, low-memory systems.
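Here is a small sketch (with random toy data, not the real matrix) verifying the implicit centering identity in step 4 and the eigenvalue decomposition that follows it:

library(Matrix)
set.seed(1)
A = rsparsematrix(10, 50, density=0.2, rand.x=function(n) rep(1, n))
z = colMeans(A)                          # step 1: column means
e = rep(1, nrow(A))                      # step 2: vector of ones
B = tcrossprod(as.vector(A %*% z), e)    # step 3: B = (A z) e^T
C = as.matrix(tcrossprod(A)) - B - t(B) + drop(crossprod(z)) * tcrossprod(e)  # step 4
Ahat = sweep(as.matrix(A), 2, z)         # explicit centering, for comparison only
max(abs(C - tcrossprod(Ahat)))           # essentially zero
U = eigen(C, symmetric=TRUE)$vectors[, 1:3]   # the desired principal components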

A second issue is that the cross product matrix \(\hat{A} \hat{A}^T\) is much worse conditioned than \(\hat{A}\) (its condition number is the square of that of \(\hat{A}\)), and we would expect the numerical accuracy of its eigenvalue decomposition to suffer as a result. However, since we are only interested in the three eigenvectors associated with the three largest eigenvalues of that matrix, the effect of poor conditioning is not as pronounced as it would be for, say, the eigenvectors corresponding to small eigenvalues. See these notes https://bwlewis.github.io/irlba/comparison.html for examples of numerical instability associated with this approach.
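A tiny illustration of the conditioning issue (using a deliberately ill-conditioned random matrix, not the variant data): forming the cross product squares the condition number.

set.seed(1)
X = matrix(rnorm(100 * 3), 100, 3) %*% diag(c(1, 1e-3, 1e-6))
kappa(X)             # roughly 1e6
kappa(crossprod(X))  # roughly 1e12, about the square of the above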

Note that if the matrix were to have many more rows so that \(A A^T\) becomes large then the cross product method might not work out so well! In those cases, the IRLBA method described below might be the only good solution approach.

The IRLBA method

The fat matrix leads to some interesting problems for IRLBA. Normally, a 3-d IRLBA PCA computes the decomposition \(\hat{A} V = U \Sigma\) for matrices \(U\) and \(V\) with three columns each, while never explicitly forming the centered matrix \(\hat{A}\). The problem is that the output matrix \(V\) can be very large in our example and require a large amount of working memory, even though we are not interested in that output! For instance, on my home computer equipped with 16 GB RAM this straightforward approach runs out of memory, even when the problem is broken up into pieces, simply because of the storage required for the output.

There is a little-used IRLBA option right_only=TRUE that can help in this case, but it requires some extra setup effort. When right_only=TRUE is specified, IRLBA only returns \(V\) and \(\Sigma\), and uses substantially less working memory during the computation. Note, however, that we want the \(U\) matrix, not the \(V\) matrix, so this option isn’t a perfect fit for this problem. A workaround is to compute the SVD of \(\hat{A}^T\) instead, since \[ \hat{A} V = U \Sigma \\ \hat{A}^T U = V \Sigma, \] and then the right_only=TRUE option gives us the quantities we’re interested in.
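A small sketch (random toy data again) checking that applying right_only=TRUE to the transposed, centered matrix returns the \(U\) we want in the \(V\) slot of the result:

library(irlba)
set.seed(1)
A = matrix(rbinom(50 * 500, 1, 0.05), 50, 500)
Ahat = sweep(A, 2, colMeans(A))
ref = svd(Ahat, nu=3, nv=3)
fit = irlba(t(Ahat), nv=3, right_only=TRUE)
max(abs(abs(fit$v) - abs(ref$u)))   # small; columns agree up to sign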

Unfortunately, this approach introduces one additional complexity: we can’t simply use the centered=TRUE option to compute principal components of \(A\). Instead we need to supply a custom matrix vector product that implicitly uses the centered matrix \(\hat{A}\), similar to what we did with the cross product method above:

  • instead of \(\hat{A} x\), compute \(Ax - (z^T x)e\).

In practice this added complication is not too burdensome since we need to write a custom matrix vector product anyway to process the data in pieces.
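A quick check of that identity on a small random matrix:

set.seed(1)
A = matrix(rbinom(20 * 100, 1, 0.1), 20, 100)
z = colMeans(A)
e = rep(1, nrow(A))
x = rnorm(ncol(A))
Ahat = sweep(A, 2, z)
max(abs(Ahat %*% x - (A %*% x - drop(crossprod(z, x)) * e)))   # essentially zero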

Let p be the number of nonzero elements of \(A\), and n the number of columns of \(A\) (81,271,844). Assuming that the IRLBA method takes about 28 matrix vector products to converge (a value I found in practice for this problem), the IRLBA method requires roughly 28 * (p + n) ≈ 280 Gflop, or about one eighth of the total flop count required by the cross product method. However, 28 iterations of IRLBA require 28 sweeps over the data, and if the data can’t fit into main memory then this incurs additional I/O expense over the cross product method for this problem.

Test systems

The following experiments were run on my home PC, a quad-core machine with 16 GB RAM (as described above).

I also ran a few examples on an Amazon r3.8xlarge instance; see https://aws.amazon.com/about-aws/whats-new/2014/04/10/r3-announcing-the-next-generation-of-amazon-ec2-memory-optimized-instances/ for details.

The r3.8xlarge machine ran Ubuntu 14.10 and R version 3.3.1 with OpenBLAS and OMP_NUM_THREADS=1.

Data prep

This section is common to both the IRLBA and cross product methods. We download the data, read it into R sparse matrices partitioned into submatrices with about 1e8 nonzero elements per submatrix (the chunksize value in the code below), and serialize the sparse matrices to files. This particular partitioning is well-suited to systems with roughly 4 GB of RAM per CPU core.

The code below stores the sparse data matrix partitions in transposed form, for more direct use with the IRLBA code below. The column means of the variant matrix (row means of the transposed partitions) are stored along with the data because they are required by both methods.

The steps below use a command line shell or R and common Unix system utilities like zcat, sed and cut.

Download and compile the simple VCF parser

wget https://raw.githubusercontent.com/bwlewis/1000_genomes_examples/master/parse.c
cc -O2 parse.c

We could use R alone to read and parse the VCF file; it would just take a while longer.
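For example, a rough sketch of that all-R approach (hypothetical, not the code used in this example) might stream a file through zcat and mark any non-reference genotype as a variant occurrence:

# hypothetical all-R parsing sketch; the example below uses the C parser instead
f = "ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
p = pipe(sprintf("zcat %s | sed /^#/d | cut -f 10-", f), open="r")
lines = readLines(p, n=1000)              # a small batch of variant rows
close(p)
gt = strsplit(lines, "\t", fixed=TRUE)    # genotype fields, one row per variant
present = lapply(gt, function(g) which(g != "0|0"))   # people carrying each variant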

Downloading and processing the data into R sparse matrices

The script below shows a large parallel download, suitable for systems with lots of network bandwidth (like Amazon). Remove the ampersand at the end of the wget line for a sequential download instead.

# Download the variant files
j=1
while test $j -lt 23; do
  wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr${j}.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz &
  j=$(( $j + 1 ))
done
wait

# Download 1000 genomes phenotype data file
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

Process and partition the downloaded files into R sparse matrices

The remaining data prep steps run from R. The following R program processes the variant files into R sparse matrix partitions with chunksize=1e8 nonzero elements per partition, resulting in many output files that each represent a portion of the whole-genome variant matrix. We store the output chunk matrices in transposed form for convenience in the IRLBA method. Use of the transposed form does not affect the cross product method.

Note that we don’t compress the serialized R objects stored in the files corresponding to the matrix partitions. If you have a computer with slow I/O then compression makes a lot of sense (whenever the decompression time is more than made up for by the reduced I/O). In that case, you might also want to store the data in non-transposed form, which in this example achieves much higher compression, and then transpose the data as needed in RAM after it’s loaded.
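For instance, a compressed variant of the serialization step in the code below (an assumed alternative, not what is used here) would write each partition through a gzip connection:

# write the partition through a gzip connection instead of a plain file
cf = gzfile(cfn, open="wb")
serialize(x, connection=cf, xdr=FALSE)
close(cf)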

This takes almost 3 hours (9,970 seconds) to run on my home PC. The pipelined parallelism in the code uses about two CPU cores reasonably well. I engage the remaining two cores on my home PC using explicit parallelism and R’s mcMap() function. If you have more cores, increase the mc.cores value below. For instance, on an Amazon r3.8xlarge machine with mc.cores=16 this takes about 2,000 seconds.

library(Matrix)
library(parallel)
t0 = proc.time()
chunksize = 100000000
meta = Reduce(rbind, mcMap(function(f)
{
  name = gsub("\\.gz", "", f); message(name)
  chunk = 1
  # stream the gzipped VCF through the C parser, keeping (variant index, person index) pairs
  p = pipe(sprintf("zcat %s  | sed /^#/d | cut  -f '10-' | ./a.out | cut -f '1-2'", f), open="r")
  meta = data.frame()
  while(chunk > 0)
  {
    x = tryCatch(read.table(p, colClasses=c("integer", "integer"), fill=TRUE, row.names=NULL, nrows=chunksize),
                 error=function(e) data.frame())
    if(nrow(x) < 1) chunk = 0
    else
    {
      # build the transposed sparse partition (variants x people) and cache its row means
      x = sparseMatrix(i=x[, 1] - x[1, 1] + 1, j=x[, 2], x=1.0)
      attr(x, "rowmeans") = rowMeans(x)
      cfn = sprintf("%s-%d.rdata", name, chunk)
      cf = file(cfn, open="wb")
      serialize(x, connection=cf, xdr=FALSE)
      close(cf)
      meta = rbind(meta, data.frame(file=cfn, nrow=nrow(x), ncol=ncol(x), stringsAsFactors=FALSE))
      chunk = chunk + 1
    }
    rm(x)
    gc()
  }
  close(p)
  meta
}, dir(pattern="ALL.*\\.gz"), mc.cores=2))
print(proc.time() - t0)
meta$end = cumsum(meta$nrow)
meta$start = c(1, meta$end[-length(meta$end)] + 1)
saveRDS(meta, file="meta.rdata")

The “meta.rdata” file and the meta variable above store the positions of each partition within the (virtual) full data matrix. That information is required later by the IRLBA method but not by the cross product method.

Cross product method implementation

This is the simpler method to implement in R code, but as pointed out above I expect it to take longer. The code computes the cross matrix product of the implicitly centered data matrix \(\hat{A}\) incrementally over the submatrices stored in the data prep step above. Once this (small) matrix is formed, we can simply compute its SVD to obtain the desired \(U\) vectors.

If you have more than 4 available CPU cores, adjust the mc.cores value appropriately.

library(Matrix)
library(parallel)

files = dir(pattern="ALL.*\\.rdata")
t1 = proc.time()
# accumulate the 2,504 x 2,504 cross product of the implicitly centered matrix,
# one stored partition at a time
cross = Reduce(`+`,
  mcMap(function(i)
  {
    f = file(i, open="rb")
    A = unserialize(f)   # transposed partition (variants x people)
    close(f)
    e = rep(1, ncol(A))
    B = tcrossprod((attr(A, "rowmeans") %*% A)@x, e)   # (A z) e^T for this partition
    as.matrix(t(A) %*% A) - B - t(B) + drop(crossprod(attr(A, "rowmeans"))) * tcrossprod(e)
  }, files, mc.cores=4)
)
s = svd(cross)
dt = proc.time() - t1

The cross product method takes about 6 hours (351 minutes) to run on my home PC. The same code with mc.cores=16 finishes in about 4,427 seconds (74 minutes) on the Amazon r3.8xlarge AMI test machine. The r3.8xlarge system has 4x the number of CPU cores as my home PC and lower I/O cost.

IRLBA implementation

The following example shows a reasonably efficient IRLBA implementation for this problem, following the notes on IRLBA above. The gist is to work on the transposed problem in order to take advantage of the right_only=TRUE option and cut down on the working memory required. The complication introduced by that approach is that we can’t use the usual centered=TRUE option to compute PCA. Instead we need a custom matrix vector product that implicitly centers the (transposed) matrix.

But we need a custom matrix vector product anyway to work with the data in chunks. The irlba() function has an argument for explicitly supplying a matrix vector product function, but the code below takes a different approach that I generally prefer these days. We define a simple lightweight partitioned matrix object called “pmat” below, and supply a few basic methods including matrix times vector, vector times matrix, dims, nrow, and ncol.

The reason I like the lightweight partitioned matrix object approach is that it makes it easy to experiment with distributed matrix vector products interactively (say, for debugging) without having to run the irlba() function at all.

library(irlba)
library(Matrix)
library(parallel)
meta = readRDS("meta.rdata")
meta$file = sprintf("%s/%s", getwd(), meta$file)

setClass("pmat", contains="list", S3methods=TRUE, slots=c(dims="numeric"))
setMethod("%*%", signature(x="pmat", y="numeric"), function(x ,y)
  {
    ans = rep(0.0, nrow(x))
    p = mcMap(function(i)
    { 
      f = file(x$file[i], open="rb")
      a = unserialize(f)
      close(f)
      r = attr(a, "rowmeans")   # cached row means of the transposed partition
      # implicitly centered product: a %*% y - r * sum(y)
      drop(a %*% y - r * drop(crossprod(rep(1, length(y)), y)))
    }, 1:length(x$file), mc.cores=4)
    i = 1
    for(j in 1:length(p))
    { 
      k = length(p[[j]])
      ans[i:(i + k - 1)] = p[[j]]
      i = i + k
    }
    gc()
    ans
  })
setMethod("%*%", signature(x="numeric", y="pmat"), function(x ,y)
  {
    ans = Reduce(`+`, mcMap(function(i)
    { 
      f = file(y$file[i], open="rb")
      a = unserialize(f)
      close(f)
      j = seq(from=y$start[i], to=y$end[i])   # rows of the virtual full matrix in this partition
      drop(x[j] %*% a - drop(crossprod(x[j], attr(a, "rowmeans"))))
    }, 1:length(y$file), mc.cores=4))
    gc()
    ans
  })
A = new("pmat", as.list(meta), dims=c(tail(meta$end, 1), meta$ncol[1]))
dim.pmat = function(x) x@dims
nrow.pmat = function(x) x@dims[1]
ncol.pmat = function(x) x@dims[2]

t1 = proc.time()
L  = irlba(A, nv=3, tol=1e-5, right_only=TRUE, work=4)
dt = proc.time() - t1

save(dt, L, file="chunked.out")
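As mentioned above, the pmat object makes it easy to exercise the distributed, implicitly centered products interactively, without invoking irlba() at all. A sketch of that kind of check (note that x below is a large vector, roughly 650 MB):

y = rnorm(ncol(A))   # length 2,504
v = A %*% y          # implicitly centered (transposed) matrix times vector
x = rnorm(nrow(A))   # length 81,271,844
w = x %*% A          # vector times matrix
c(length(v), length(w))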

With the default tolerance tol=1e-5 and work=4, this method required about 28 matrix vector products and took just under 2 hours (108 minutes) on my home PC, only about 3.25 times faster than the cross product method despite using roughly one eighth as many flops. As mentioned in the IRLBA notes above, the method requires fewer flops but more I/O than the cross product method.

The compute-to-I/O balance changes significantly for different problem sizes. IRLBA should have a larger performance advantage for bigger problems with more than 2,504 people.

The same problem finishes in only 521 seconds on the r3.8xlarge Amazon test system with mc.cores=16. That’s about 8 times faster than the cross product method on that system, and also about the ratio of computational work in flops between the two methods.

The better performance on the Amazon system compared to my home PC is largely due to the large amount of RAM on that machine, which mitigates the I/O cost of iterating over the data 28 times: the whole problem fits in RAM on the Amazon machine, so the data files remain in a fast buffer cache.

Minor technical notes

The use of gc() just after each parallel section is important for the multicore parallelism used by mcMap(): it keeps the process image as small as possible for the next iteration on limited-memory systems (like my home PC). Otherwise each fork can bloat in memory and eventually spill to swap, which can really slow things down. I’m investigating non-forked parallelism, which might be faster in this case for that reason.

Note that the pmat %*% numeric matrix vector product explicitly allocates a result and fills in the values in a for loop after the parallel section. Compare this to the much more elegant Reduce(mcMap(...)) approach in the numeric %*% pmat product. One could use a reduction function like c() in the first case too, but that incurs lots of extra memory allocation and copying that is avoided by the less elegant-looking (but more efficient) approach used above.

The two methods return nearly the same output (as expected). Here is a comparison of the first three singular values returned by the IRLBA method and by the cross product method:

L$d - sqrt(s$d[1:3])
#[1] -9.094947e-11 -5.638867e-11  0.000000e+00

And a comparison of the largest principal component vectors computed by each method (Euclidean norm of their difference):

drop(sqrt(crossprod(L$v[,1] - s$u[,1])))
#[1] 1.795767e-14

Visualizing the results

First, obtain the list of individual subject IDs from the header of one of the chromosome files to match with the auxiliary pedigree and population data, which we also load. In particular, we’ll use the reported populations as a kind of cluster validation. See http://www.internationalgenome.org/category/population/ for details.

f = dir(pattern="\\.vcf\\.gz$")[[1]]
cmd = sprintf("zcat %s |head -n 500 | grep '^#CHROM' |cut  -f 10-", f)
p = pipe(cmd)
ids = scan(p, what=character(), sep="\t")

ped = read.csv("20130606_g1k.ped", sep="\t", header=TRUE, stringsAsFactors=FALSE)
row.names(ped) = ped[["Individual.ID"]]
ped = ped[ids, ]
ped$Population = factor(ped$Population, levels=c("CHB", "JPT", "CHS", "CDX", "KHV", "CEU", "TSI", "FIN", "GBR", "IBS", "YRI", "LWK", "GWD", "MSL", "ESN", "ASW", "ACB", "MXL", "PUR", "CLM", "PEL", "GIH", "PJL", "BEB", "STU", "ITU"))
levels(ped$Population) = c("Han Chinese in Beijing, China", "Japanese in Tokyo, Japan", "Southern Han Chinese", "Chinese Dai in Xishuangbanna, China", "Kinh in Ho Chi Minh City, Vietnam", "Utah Residents (CEPH) with Northern and Western European Ancestry", "Toscani in Italia", "Finnish in Finland", "British in England and Scotland", "Iberian Population in Spain", "Yoruba in Ibadan, Nigeria", "Luhya in Webuye, Kenya", "Gambian in Western Divisions in the Gambia", "Mende in Sierra Leone", "Esan in Nigeria", "Americans of African Ancestry in SW USA", "African Caribbeans in Barbados", "Mexican Ancestry from Los Angeles USA", "Puerto Ricans from Puerto Rico", "Colombians from Medellin, Colombia", "Peruvians from Lima, Peru", "Gujarati Indian from Houston, Texas", "Punjabi from Lahore, Pakistan", "Bengali from Bangladesh", "Sri Lankan Tamil from the UK", "Indian Telugu from the UK")

3D Scatter plot

The first three principal components represent points in 3D space for each individual and can be directly visualized as a scatter plot. The scatter plot is colored using the above reported populations for each individual.

library(threejs)
clrs = rainbow(length(levels(ped$Population)))[ped$Population]
scatterplot3js(L$v[, 1:3], size=0.2, color=clrs, labels=ped$Population, width=1000)

You can interact with the plot by rotating, zooming, etc. Mousing over each point shows the associated individual’s population. You can see that dimensionality reduction by PCA does a good job of clustering these data by population.

Correlation network visualization

Either IRLBA or the cross product method can compute more than three principal components; for instance, the cross product method yields all 2,504 components (subject to potential accuracy issues for the components associated with the smallest singular values).

A common trick for visualizing higher-dimensional data is to represent the data as a weighted, undirected graph (one vertex for each of the 2,504 individuals in our case). For example, edge weights can be derived from Euclidean distances or Pearson correlation coefficients between the individuals’ principal component coordinates.

The plot below shows a density plot of the correlation coefficients between the first 10 principal component coordinates of every pair of the 2,504 individuals. We see a large number of essentially uncorrelated pairs and a smaller, highly correlated group with coefficients above about 0.8 or 0.9.

P = cor(t(s$u[, 1:10]))   # correlation between individuals across the first 10 principal components
plot(density(P))

The following picture shows an interactive network visualization of the thresholded correlation of the 10-d PCA data, cutting off edges with a correlation below 0.85. Like the scatter plot, you can interact with the visualization. It does a somewhat better job of discriminating between populations.

library(igraph)
P[P < 0.85] = 0  # correlation threshold
g = graph.adjacency(P, mode="undirected", weighted=TRUE, diag=FALSE)
g = set_vertex_attr(g, "color", value=clrs)
graphjs(g, vertex.size=0.2, bg="black", vertex.label=ped$Population, width=1000, height=600)

As above, you can interact with the plot by rotating, zooming, etc. Mousing over each vertex shows the associated individual’s population. Edges indicate variant correlation between vertices exceeding the threshold of 0.85. Vertices are colored according to the reported population data file 20130606_g1k.ped. This approach is somewhat ad-hoc in that we arbitrarily chose a dimension of 10 and then a correlation threshold for visualization. See our paper http://arxiv.org/abs/1512.07246 for a more systematic approach to thresholded correlation.


  1. http://www.1000genomes.org/

  2. Jim Baglama and Lothar Reichel (2015). irlba: Fast Truncated SVD, PCA and Symmetric Eigendecomposition for Large Dense and Sparse Matrices. R package version 2.0.1. Development version at https://github.com/bwlewis/irlba.