SQL is billed as a language of declarative data manipulation and analysis. But even for some simple tasks, SQL really seems more imperative to me than alternatives like R. Consider the following quite simple example using a large table that starts out like:
date | company | value |
---|---|---|
2020-09-23 | Zaman Research & Consulting Services | 0.1512763 |
2021-07-08 | Ackerman Health Care Equipment | 0.7323856 |
2021-04-30 | Zaman Research & Consulting Services | 0.8713198 |
2021-07-30 | Groves Marine | 0.4544761 |
2021-07-06 | Xavier Wireless Telecommunication Services | 0.5160909 |
2020-09-10 | Bernard Mortgage REITs | 0.6476043 |
with many more, maybe millions more, rows like that. You can see a much longer discussion of this example here: https://bwlewis.github.io/duckdb_and_r/last/last.html . Here is a very easily stated task:
For each company in the example table, return the most recent value. If the most recent dates are not unique within a company, return any one last value.
Pretty simple?
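For reference in the sketches below, the rows shown above can be typed in as a small R data frame that stands in for the full table:

```r
# The first few rows of the example table, entered directly so the code
# sketches below can be tried on a small stand-in for the full data.
example <- data.frame(
  date = as.Date(c("2020-09-23", "2021-07-08", "2021-04-30",
                   "2021-07-30", "2021-07-06", "2020-09-10")),
  company = c("Zaman Research & Consulting Services",
              "Ackerman Health Care Equipment",
              "Zaman Research & Consulting Services",
              "Groves Marine",
              "Xavier Wireless Telecommunication Services",
              "Bernard Mortgage REITs"),
  value = c(0.1512763, 0.7323856, 0.8713198,
            0.4544761, 0.5160909, 0.6476043)
)
```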
I am indebted to Richard Wesley at DuckDB Labs for teaching me a better SQL approach to this problem (the join approach below) than my original one, and to Hadley Wickham for showing me how to use dplyr to generate queries like this. Please add your name here if you find other ways to improve these notes!
An R dplyr approach to performing the task hews closely to the simple statement above. The solution is composed of a few short and very declarative functions:
example %>% group_by(company) %>% summarize(answer = last(value, order_by = date))
Wow, you can almost put a period at the end and read it like a sentence.
Now let’s imagine we are working with a very low-level imperative programming language.
An iterative approach that unnecessarily makes multiple passes over the data might look like this (the “window” approach):
groups = unique(company)
for each group g
    let row.i = integer index that reverse orders date in group g
end for
for each row in example table
    if row.i = 1 then keep row
end for
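For concreteness, here is a minimal base R sketch of that pseudocode (using the small stand-in data frame from above); it deliberately makes the same two passes over the data:

```r
# Pass 1: within each company, rank rows by date, most recent first
i <- ave(as.numeric(example$date), example$company,
         FUN = function(d) rank(-d, ties.method = "first"))
# Pass 2: keep only the rows ranked 1, that is, the most recent row per company
answer <- example[i == 1, c("company", "value")]
```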
Alternatively, we could perform several conceptual loops (the “join” approach):
1. For each company, find the max(date) for that company and build a table 'm' of all such (company, max(date)) pairs.
2. For each row in m, find the value(s) that correspond to that company and date.
3. Take (for instance) the max(value) by company of the output to get down to one row per company.
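Spelled out in dplyr, a rough sketch of those three steps (again using the stand-in data frame, and distinct from the one-liner above) might look like:

```r
library(dplyr)

# Step 1: per-company most recent date
m <- example %>% group_by(company) %>% summarize(date = max(date))
# Steps 2 and 3: join back to recover the value(s) on that date,
# then collapse any ties by taking, for instance, the max value
answer <- example %>%
  inner_join(m, by = c("company", "date")) %>%
  group_by(company) %>%
  summarize(answer = max(value))
```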
The SQL approaches below correspond to the “window” and “join” pseudocode above. The join approach was contributed by Richard Wesley at DuckDB; in DuckDB, and apparently in SQL databases generally, it is much more efficient than the window approach.
WITH ans AS (
  SELECT e.company, value FROM example AS e,
    (SELECT company, MAX(date) AS date FROM example GROUP BY company) AS m
  WHERE e.company = m.company AND e.date = m.date)
SELECT company, MAX(value) FROM ans GROUP BY company
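To try that query from R, something along these lines should work, assuming the DBI and duckdb packages and the stand-in data frame from above:

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
# Expose the R data frame to DuckDB as a (virtual) table named 'example'
duckdb_register(con, "example", example)
ans <- dbGetQuery(con, "
  WITH ans AS (
    SELECT e.company, value FROM example AS e,
      (SELECT company, MAX(date) AS date FROM example GROUP BY company) AS m
    WHERE e.company = m.company AND e.date = m.date)
  SELECT company, MAX(value) AS value FROM ans GROUP BY company")
dbDisconnect(con, shutdown = TRUE)
```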
A standard SQL window approach to this task looks remarkably like the first pseudocode above, only harder to read. The gist is to number the rows within each company in descending date order, then pick out the rows numbered 1:
SELECT company, value FROM
  (SELECT company, date, value,
          ROW_NUMBER() OVER (PARTITION BY company ORDER BY date DESC) AS i
   FROM example) AS cazart
WHERE i = 1
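As an aside, roughly the same row-numbering idea can be sketched in dplyr; this is just an illustration, not the one-liner from earlier:

```r
library(dplyr)

# Number rows within each company by descending date, then keep row number 1
example %>%
  group_by(company) %>%
  filter(row_number(desc(date)) == 1) %>%
  ungroup() %>%
  select(company, value)
```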
What’s worse, query optimizers for at least a few SQL database engines I tried quite literally interpret the SQL code and actually loop over the data multiple times. See https://bwlewis.github.io/duckdb_and_r/last/last.html for representative, and disastrously bad, SQL database engine timings.
Even worse, some database engines apparently leave it to users themselves to work around inefficient implementations with highly idiosyncratic tricks. See, for instance, the illuminating discussions of this problem for PostgreSQL and MySQL.
I wonder whether SQL is simply a bad way to express a solution to this kind of simple problem.
SQL can be really nice for simple, basic aggregation and filtering. It's also really good at imposing constraints on data. But my advice for most people and most problems is this: just use dplyr if you can. Don't use SQL. Try to think about your data munging problems the dplyr way. For anything other than the most basic operations, I think dplyr is a better thought-out grammar of data manipulation than SQL. And besides, dplyr can usually front your database anyway.
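As a rough illustration of dplyr fronting a database, a sketch like the following uses the dbplyr and duckdb packages to print the SQL that dplyr generates (it uses a grouped filter rather than the one-liner above, so ties on the most recent date are kept):

```r
library(dplyr)
library(dbplyr)
library(duckdb)

con <- DBI::dbConnect(duckdb::duckdb())
example_db <- copy_to(con, example, "example")   # copy the data frame into DuckDB

example_db %>%
  group_by(company) %>%
  filter(date == max(date, na.rm = TRUE)) %>%
  show_query()   # print the SQL dplyr generates instead of running it
```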
Remember, nothing (not dplyr, not SQL, and certainly not R) is always a perfect solution to every problem.
Indeed, my first attempt found that dplyr failed to translate even this simple problem for back-end databases. (Update: Hadley Wickham indicates that dplyr can in fact generate appropriate SQL here.) The dplyr approach also brings with it considerable complexity for some tasks; just look at the large number of functions in the dplyr package:
suppressPackageStartupMessages(library(dplyr))
length(ls("package:dplyr"))
## [1] 290
That's a lot of stuff to remember. And don't forget that plain old base R often works really well in many cases, including for this problem (a base R sketch follows below). Even SQL, despite my misgivings about it, has many admirable qualities, plus the advantage that tons of folks already know it very well (surely way better than I do) and use it productively.
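Here is one such base R sketch for this problem, assuming the stand-in data frame from above and using no packages at all:

```r
# Sort so the most recent rows come first, then keep the first row per company
sorted <- example[order(example$date, decreasing = TRUE), ]
answer <- sorted[!duplicated(sorted$company), c("company", "value")]
```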
For more thoughts and examples along these lines, see my notes on this topic here:
https://bwlewis.github.io/duckdb_and_r
Some, but not all, SQL database systems require that I label (alias) the sub-query for reasons unknown to me. I’m not sure what the SQL ISO/IEC 9075-2:2016 standard says about this, and I don’t really want to pay $250 to find out. For me, it’s just an example of variation in SQL implementations across database systems.