Thoughts on DuckDB and R, Part 2

A specialized case: overlap joins

Finding overlapping regions between two sets of coordinate values is a common problem in computational genomics and other disciplines. At least two R packages can very rapidly perform this computation: the IRanges package by Pages, Aboyoun and Lawrence in the bioconductor¹ and the widely popular data.table package by Dowle, Short, Lianoglou, Srinivasan, and others².

The Overlap Join

The overlap join finds overlapping regions between two tables of coordinates. The following figure, adapted from Alex Poliakov at Paradigm4, illustrates a simple example. The right to left extent of the blue and green boxes depicts two sets of ranges, respectively.

Two sets of ranges

The overlap join operation finds ranges with overlapping extents, for example the green box labeled “B” overlaps with the blue boxes “2” and “3”, and the green box labeled “C” overlaps with the blue box labeled “4.” In practice, overlap joins may be performed on tables with millions of ranges.

Computed Example

The example presented below finds overlaps between genomic variant coordinates from the 1000 genomes project³, and a list of human gene coordinates from https://www.pharmgkb.org/ ⁴. The example proceeds as follows:

Find variant and gene overlaps
Count the number of variants that overlap each gene by gene symbol
Order the genes by number of variant overlaps

The R packages include functions that combine steps 1 and 2 into a single step, gaining some efficiency along the way. I’m not using those more efficient functions in this example.

You can copy and paste the examples and run them on your own hardware. The timings mentioned below ran on my cheap laptop with an 8-core AMD Ryzen 7 4700U CPU running at a max frequency of 2.5 GHz, 8GB of DDR4 RAM and a SAMSUNG MZVLQ 512GB NVME SSD flash drive. Software versions used include Devuan Beowulf GNU/Linux, R version 4.0.2, IRanges version 2.24.1, data.table 1.14.0, and duckdb 0.2.6.

Download the data files

wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr7.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr8.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 
wget -O genes.tsv "https://github.com/bwlewis/1000_genomes_examples/blob/master/genes.tsv?raw=true"

Parsing the data, or, pipelined parallelism is the best!

The following code parses the input variant files into R data frames with position, reference allele, and alternate allele columns. It results in a list called variants with two data frames, one for chromosome 7 and one for chromosome 8.

library(parallel)
library(data.table)

f = function(file)
{
  cmd = sprintf("zcat %s | sed '/^#/d;/IMPRECISE/d' |  cut -f 2,4,5",file)
  name = gsub(".phase.*","",file)
  print(name)
  x = read.table(pipe(cmd), stringsAsFactors=FALSE)
  names(x) = c("start","ref","alt")

  # Process extra comma-separated alleles (would be much easier if these were
  # on separate rows!)
  idx = grepl(",",x$alt)
  s = strsplit(x[idx, "alt"],",")
  # replace these rows with 1st alternate allele
  x[idx,"alt"] = vapply(s, function(z) z[1], "")
  # Add new rows corresponding to other listed alternates
  ref = x[idx,1:2]
  N   = vapply(s, length, 1) # numbers of alternates by row
  alt = Map(function(i)
    {
      j = which(N==i)
      cbind(ref[j, ], alt=vapply(s[j], function(z)z[i], ""))
    }, seq(2, max(N)))
  rbindlist(c(list(x), alt))
}

files=dir(pattern="*.vcf.gz")
print(system.time({
  variants = mcMap(f, files)
}))
# name by chromosome number:
names(variants) = gsub("^ALL.chr", "", gsub(".phase.*","", names(variants)))

This step takes about 160 seconds on my test PC (to parse 9,362,139 variants on both chromosomes 7 and 8).

The parsing program is complicated by the doubly-delimited data file format. In particular more than one alternate allele may be listed per line separated by commas (the remainder of the file is tab separated). R handles the extra format complexity with relative ease and efficiency. Parsing multiple alternate alleles adds only about 12 lines to the R program and runs very fast. I personally think that languages like R should be used most of the time partly because of this superb flexibility and ease of handling complicated work flows.

I use shell utilities here to help uncompress and process the TSV files. R’s pipe connection object makes it easy to operate on pipelines of shell commands conveniently from R. (The data.table package includes similar functionality but, somewhat unfortunately, using a different syntax).

Because the above processing pipeline includes several stages (zcat, sed, cut, read.table), there is already a fair amount of parallel computation going on. The script additionally uses the explicitly parallel mcMap function which may lead to some over-committing of CPU resources, but still generally improves performance.

Note that the script omits imprecisely specified variants, further described here: http://goo.gl/PFsIsP. Handling imprecise variants would add a lot of complexity to this simple example.

The genes.tsv file is small and fast to parse by comparison. The following script loads the listed gene ranges corresponding to chromosomes 7 and 8 into a data frame. Note that some listed genes contain incomplete information (NA) and are removed and that we’re using the GRCh37 reference genome positions.

cmd = "cat genes.tsv | cut -f 6,13,14,15 | grep 'chr[7,8]' | sed -e 's/chr7/7/;s/chr8/8/'"
p = pipe( cmd, open="r")
genes = na.omit(read.table(p, stringsAsFactors=FALSE, header=FALSE, sep="\t"))
close(p)
names(genes) = c("gene","chromosome","start","end")
genes = genes[genes$start > 0 & genes$end > 0, ]

R data.table

The popular R data.table package includes a fast overlap join function called foverlaps. If you don’t already have it, you can install the data.table package from CRAN with:

install.packages("data.table")

The following code uses the foverlaps function to compute the overlap join and then ranks genes by the by the count of overlapping variants per gene. The code runs in parallel over each chromosome using the mcMap function.

library(data.table)
library(parallel)

overlap = function(chromosome)
{
  x = variants[[as.character(chromosome)]]
  x$end = x$start + nchar(x$ref)
  g = genes[genes$chromosome==chromosome, c(1,3,4)]
  setDT(g)
  setkey(g, start, end)
  setDT(x)
  foverlaps(x,g,nomatch=0)[, .N, by=gene]
}
print(system.time(ans.DT <- rbindlist(mcMap(overlap, c(7,8)))))
ans.DT = ans.DT[order(ans.DT[["N"]], decreasing = TRUE), ]

Example ouput of this step looks like:

head(ans.DT, n=10)
#  CSMD1 CNTNAP2    SGCZ    SDK1   MAGI2  PTPRN2  DLGAP2   CSMD3    NRG1   AUTS2 
# 147459   76073   52174   45067   43984   41662   40362   34397   33001   31275

The data.table takes about 1.7 seconds on my laptop to identify 2,026 overlapping variants.

R IRanges

We can similarly compute counts of overlapping variants and genes using the IRanges package from the bioconductor.

You can install the IRanges package with:

source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")

The example program is almost identical to the data.table version above:

library(IRanges)
library(parallel)

overlap = function(chromosome)
{
  x = variants[[as.character(chromosome)]]
  ir1 = IRanges(start=x$start, end=x$start + nchar(x$ref))
  g = genes[genes$chromosome==chromosome, c(1,3,4)]
  ir2 = IRanges(start=g$start, end=g$end)
  ans = findOverlaps(ir1, ir2)
  data.frame(gene=g$gene, count=countRnodeHits(ans))
}
# Compute the overlaps in parallel for chromosomes 7 and 8
print(system.time({ans.IR = rbindlist(mcMap(overlap, 7:8))}))
ans.IR = ans.IR[order(ans.IR[["count"]], decreasing = TRUE), ]

head(ans.IR, n=10)
#  CSMD1 CNTNAP2    SGCZ    SDK1   MAGI2  PTPRN2  DLGAP2   CSMD3    NRG1   AUTS2 
# 147459   76073   52174   45067   43984   41662   40362   34397   33001   31275

The IRanges example runs in about 1 second on my test PC.

Duck DB

Let’s try DuckDB on this problem. I’m not a SQL expert, but the query I wrote below is a pretty intuitive implementation of the overlap join. I wonder if that query can be improved?

Note that the example below combines the variants data frames into a single ‘variants’ table in DuckDB, adding a ‘chromosome’ number column to idenfity each sub-table. This makes for a single, simple query.

library(duckdb)
con = dbConnect(duckdb::duckdb(), dbdir=":memory:", read_only=FALSE)

variants[[1]]$chromosome = 7
variants[[2]]$chromosome = 8

variants = rbindlist(variants)

duckdb_register(con, "genes", genes)
duckdb_register(con, "variants", variants)

Q = 'select G.gene, count(V.ref) as "count"
from genes G join variants V
on G.chromosome = V.chromosome
and (G.start <= (V.start + length(ref)) and G.end >= V.start)
group by 1 order by 2 desc'

print(system.time({ans.DB = dbGetQuery(con, Q)}))

The output counts are the same as those produced by IRanges and data.table. But the query takes about 37s on my laptop. Maybe it would be faster with native tables inside DuckDB instead of data frame referencs?

duckdb_unregister(con, "genes")
duckdb_unregister(con, "variants")
dbWriteTable(con, "genes", genes)
dbWriteTable(con, "variants", variants)
print(system.time({ans.DB = dbGetQuery(con, Q)}))

Nope, still about 37s. So in this specialized case of the overlap join, several R packages are substantially faster than DuckDB.

Gentleman R.C., Carey V.J., Bates D.M., Bolstad B., Dettling M., Dudoit S., Ellis B., Gautier L., Ge Y., Gentry J., Hornik K., Hothorn T., Huber W., Iacus S., Irizarry R., Leisch F., Li C., Maechler M., Rossini A.J., Sawitzki ., Smith C., Smyth G., Tierney L., Yang J.Y. and Zhang J. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5(10): R80.↩
https://github.com/Rdatatable/data.table/wiki ↩
An integrated map of genetic variation from 1,092 human genomes, McVean et Al, Nature 491, 56-65 (01 November 2012) doi:10.1038/nature11632.↩
M. Whirl-Carrillo, E.M. McDonagh, J. M. Hebert, L. Gong, K. Sangkuhl, C.F. Thorn, R.B. Altman and T.E. Klein. “Pharmacogenomics Knowledge for Personalized Medicine” Clinical Pharmacology & Therapeutics (2012) 92(4): 414-417.↩