On DuckDB -- B. W. Lewis

On DuckDB and R

Bryan W. Lewis for the Cleveland R User Group, June 2021

Who is DuckDB?

Mark Rassveldt and Hannes Mühleisen
and 62 other committers so far...

https://www.cwi.nl/
https://duckdb.org/

https://youtu.be/PFUZlNQIndo
(a nice talk by Mark on DuckDB)

What is DuckDB?
"SQLite for analytics"

Embedded relational, SQL database library
- Easy to install: install.packages("duckdb")
- Zero-config, data in memory or in single file DBs
Column-oriented tabular data model (unlike SQLite, like R!)
Very impressive R and Python integration

Easy SQL example to get started
library(duckdb) data("flights", package = "nycflights13") con <- dbConnect(duckdb()) duckdb_register(con, "flights", flights) dbGetQuery(con, "SELECT origin, COUNT(*) AS n FROM flights GROUP BY origin") origin n 1 EWR 120835 2 LGA 104662 3 JFK 111279

Easy SQL example to get started


library(duckdb)
data("flights", package = "nycflights13")

con <- dbConnect(duckdb())
duckdb_register(con, "flights", flights)

dbGetQuery(con,
  "SELECT origin, COUNT(*) AS n FROM flights GROUP BY origin")

  origin      n
1    EWR 120835
2    LGA 104662
3    JFK 111279

From easy to merely simple

Name the top 3 destinations for each origin airport.

OK, still easy!

One base R way
Map(function(x) head(names(sort(table(x), decreasing = TRUE)), 3), split(flights[["dest"]], flights[["origin"]])) $EWR [1] "ORD" "BOS" "SFO" $JFK [1] "LAX" "SFO" "BOS" $LGA [1] "ATL" "ORD" "CLT"

One base R way


Map(function(x) head(names(sort(table(x), decreasing = TRUE)), 3),
    split(flights[["dest"]], flights[["origin"]]))

$EWR
[1] "ORD" "BOS" "SFO"

$JFK
[1] "LAX" "SFO" "BOS"

$LGA
[1] "ATL" "ORD" "CLT"

One base R way


Map(function(x) head(names(sort(table(x), decreasing = TRUE)), 3),
    split(flights[["dest"]], flights[["origin"]]))

$EWR
[1] "ORD" "BOS" "SFO"

$JFK
[1] "LAX" "SFO" "BOS"

$LGA
[1] "ATL" "ORD" "CLT"

Dplyr

library(dplyr) flights %>% group_by(origin) %>% count(dest, sort = TRUE) %>% slice_head(n = 3) %>% select(origin, dest)

origin dest 1 EWR ORD 2 EWR BOS 3 EWR SFO 4 JFK LAX 5 JFK SFO 6 JFK BOS 7 LGA ATL 8 LGA ORD 9 LGA CLT

SQL (and DuckDB)

dbGetQuery(con, "SELECT origin, dest FROM (SELECT origin, dest, n FROM (SELECT origin, dest, n, RANK() OVER ( PARTITION BY origin ORDER BY n DESC) AS h FROM (SELECT origin, dest, COUNT(*) AS n FROM flights GROUP BY origin, dest ) AS curly ) AS moe WHERE (h <= 3) ) AS shemp ORDER BY origin;")

origin dest 1 EWR ORD 2 EWR BOS 3 EWR SFO 4 JFK LAX 5 JFK SFO 6 JFK BOS 7 LGA ATL 8 LGA ORD 9 LGA CLT

Simple data manipulation in R (and Python) can be hard in SQL.

The converse is rarely true.

Fortunately, DuckDB works with dplyr!

tbl(con, "flights") %>%
group_by(origin) %>%
count(dest, sort = TRUE, name = "N") %>%
slice_max(order_by = N, n = 3) %>%
select(origin, dest)

origin dest 1 JFK LAX 2 JFK SFO 3 JFK BOS 4 LGA ATL 5 LGA ORD 6 LGA CLT 7 EWR ORD 8 EWR BOS 9 EWR SFO

When and how is a tool like DuckDB really useful?

One way to think about this:

syntax (how), and

performance (when).

Syntax advice

Data manipulation in R is easier than SQL.

Start with R, add dplyr, data.table, etc. as needed.

With dplyr, R works with your database anyway, so one less language to deal with!

Another simple example: "as of" joins
Basic idea:

calendar
date
2020-01-01
2020-02-01
2020-03-01
2020-04-01
2020-05-01
2020-06-01

data
date	value
2019-11-18	0.6870228
2020-01-05	0.0617863
2020-01-10	0.1765568
2020-02-01	0.5000000
2020-02-12	0.2059746
2020-04-13	0.6291140
2020-05-08	0.3841037

‘as of’ desired output
date	value
2020-01-01	0.6870228
2020-02-01	0.5000000
2020-03-01	0.2059746
2020-04-01	0.2059746
2020-05-01	0.6291140
2020-06-01	0.3841037

For each 'calendar' date, find the most recent 'data' value
as of that date.

Many easy ways in R and Python

data[calendar, on = "date", roll = TRUE] # using R data.table pandas.merge_asof(calendar, data, on = "date") # using Python Pandas merge(calendar, na.locf(merge(calendar, data)), join = "left") # R xts (more...)

See https://bwlewis.github.io/duckdb_and_r/asof/asof.html for full details.

SQL
WITH z AS ( SELECT date, (NULL) AS value FROM calendar UNION SELECT date, value FROM data ORDER BY date ), a AS ( SELECT date, value, ROW_NUMBER() OVER ( ORDER BY date RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) * (CASE WHEN value IS NULL THEN 0 ELSE 1 END) AS i FROM z ), b AS ( SELECT date, MAX(i) OVER ( ORDER BY date RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS j FROM a ), c AS ( SELECT b.date, value FROM a, b WHERE a.i > 0 AND a.i = b.j ), d AS ( SELECT calendar.date, value FROM calendar, c WHERE calendar.date = c.date ORDER BY c.date ) SELECT * FROM d UNION SELECT * FROM d ORDER BY date

How about performance of that "as of" join on my laptop?

It's a small example: about 5 million data rows, 250K calendar rows!

Performance advice

R outperforms DuckDB for in-memory problems.

DuckDB's advantage is that queries can work without modification on larger-than-RAM data.

Use databases for problems that don't easily fit in RAM. And when you do, use them with dplyr!

Except...

Larger-than-memory one off problems may run fastest with Jim Hester's vroom package.

See this link for an example using the NYC taxi data:
https://bwlewis.github.io/duckdb_and_r/taxi/taxi.html

As Don Quixote said, "it takes all kinds of tools to solve data science problems."

Summary

DuckDB makes using a database in R *really* easy.

DuckDB's key advantage is that queries can work without modification on larger-than-main-memory data.

Use it with dplyr.

See this link for much more:
https://bwlewis.github.io/duckdb_and_r/thoughts_on_duckdb.html

Bonus performance testing tip

Modern computers have all kinds of thermal and power saving CPU scaling going on that can really mess with performance measurements.

This can help mitigate that: echo "performance" | \ sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor