SQL is billed as a language of declarative data manipulation and analysis. But even for some simple tasks, SQL really seems more imperative to me than alternatives like R. Consider the following quite simple example using a large table that starts out like:
|2020-07-27||Zaman Research & Consulting Services||0.1512763|
|2021-05-11||Ackerman Health Care Equipment||0.7323856|
|2021-03-03||Zaman Research & Consulting Services||0.8713198|
|2021-05-09||Xavier Wireless Telecommunication Services||0.5160909|
|2020-07-14||Bernard Mortgage REITs||0.6476043|
with many more, maybe millions more, rows like that. You can see a much longer discussion of this example here: https://bwlewis.github.io/duckdb_and_r/last/last.html . Here is a very easily stated task:
For each company in the example table, return the most recent value.
How simple is that?
An R dplyr approach to performing the task hews closely to the simple statement above. The solution is composed of a few short and very declarative functions:
example %>% group_by(company) %>% summarize(answer = last(value, order_by = date))
Wow, you can almost put a period at the end and read it like a sentence.
Now let’s imagine we are working with a very low-level imperative programming language. One approach might go something like:
groups = unique(company) for each group g let i = integer index that reverse orders date in group g keep row associated with i = 1 end for
Or, if you felt like being less efficient about things you could make several passes over the data like so:
groups = unique(company) for each group g let row.i = integer index that reverse orders date in group g end for for each row in example table if row.i = 1 then keep row end for
As far as I can tell, a standard SQL approach to solving this task looks remarkably like the latter, less efficient, pseudocode above except harder to read. The gist is number each row using date order within group and then go through and pick out rows with the right number:
SELECT company, value FROM (SELECT company, date_numeric, value, ROW_NUMBER() OVER (PARTITION BY company ORDER BY date_numeric DESC) AS i FROM example) AS cazart WHERE i = 1
What’s worse, query optimizers for at least a few SQL database engines I tried quite literally interpret the SQL code and actually loop over the data multiple times. See https://bwlewis.github.io/duckdb_and_r/last/last.html for representative, and disastrously bad, SQL database engine timings.
Even worse yet, apparently some database engines ask the user themselves to work around inefficient implementations by using highly idiosyncratic tricks. See, for instance illuminating discussions on this problem for PostgreSQL and MySQL:
I argue that SQL is simply a bad way to solve this kind of simple problem.
SQL can be really nice for simple, basic aggregation and filtering. It’s also really good at imposing constraints on data. But hey, I used to work at a database company and I’ve seen SQL queries that would make Mike Stonebraker smash his banjo like Pete Townshend. SQL can be a mess, is hard to compose, and sometimes makes it difficult to express even seemingly simple objectives like the one above.
My advice for most people and most problems is this: just use dplyr if you can. Don’t use SQL. Try to think about your data munging problems the dplyr way. For anything other than the most basic operations, I think dplyr is a better through-out grammar of data manipulation than SQL. And besides, dplyr can usually front your database anyway.
Remember, nothing–not dplyr, SQL, certainly not R–is always a perfect solution to every problem.
Indeed, dplyr fails to figure out how to translate even this simple problem for back-end databases. (Update: Hadley Wickham indicates that, indeed, dplyr can in fact generate appropriate SQL here.) The dplyr approach also brings with it considerable complexity for some tasks–just look at the (ridiculously) large number of functions in the dplyr package:
##  290
That’s a lot of stuff to remember. And, don’t forget that plain old base R often works really well in many cases, including for this problem. Even SQL, despite my misgivings about it, has many admirable qualities and the advantage that tons of folks already know it very well (surely way better than I do) and use it productively.
For more thoughts and examples along these lines, see my notes on this topic here:
Some, but not all, SQL database systems require that I label (alias) the sub-query for reasons unknown to me. I’m not sure what the SQL ISO/IEC 9075-2:2016 standard says about this, and I don’t really want to pay $250 to find out. For me, it’s just an example of variation in SQL implementations across database systems.