Sunday, February 1
why aren't relational databases everywhere, like filesystems?
Background reading for this post:
- Why filesystems have loose coupling and your protocol doesn’t
- The Evolution of the Unix Time-sharing System
A couple of months ago, I went from working on one big software project to working on a series of smaller, limited-scope projects. Of course a lot changes when you do that. One of the things I’ve really been noticing is that I’m not using relational databases for much right now, and probably haven’t written any SQL since November or thereabouts.
I should note here that I think filesystems are a really powerful abstraction,
frequently underrated by people who ought to know better, and I usually argue
for simple tools. I am often that guy who wonders why you didn’t just use
grep
, and even when I have a robust database at my disposal, I often explore
data by hacking together silly ad hoc pipelines in the shell.
With that out of the way, I just spent years treating first MySQL/MariaDB and then PostgreSQL as default storage for most problems, and I’ve arrived at a healthy respect for the tools, particularly PostgreSQL. It is nice to accumulate collections of structured data. It’s nice to have defined schema with a rich set of types, and to expect type constraints to be enforced and the semantics of conversion between types made explicit. It’s really nice to have a regular language for querying these collections.
Thinking about all of that, I thought “I should really just stand up psql on my primary machine and populate it with interesting things”. SQL is, to be sure, imperfect, but as a nerd working in the guts of a money-making enterprise, it quickly becomes one of your basic tools for turning a store of facts into answers to the questions that you didn’t know you were going to have to answer. It’s one of the technologies that help render empirical questions answerable, or at least help define the boundaries of which ones are answerable.
And thinking about that, I started to wonder why I hadn’t set up a personal database years ago — why stores of relational data weren’t as much a part of my personal kit as the hierarchical filesystems I’ve been accumulating for 20+ years. And it seems to me like there’s something important in the answers to that question.
In no particular order, some thoughts:
No operating system in wide use offers a relational DB to the user out of the box. There’s a concerted effort underway to obscure them, but filesystems have been user-level interfaces for most of the history of computing and the idea of a “file” is pretty well embedded in the general awareness by now.
Moving files between machines and environments is (relatively) easy: The protocols, hardware, and services are widely available. Moving relational data is almost universally unpleasant, despite piggybacking on the file stuff. It’s unforgiving in interface, slow, error prone, and implementation specific.
There are plenty of tolerable GUI file managers, and desktop systems offer them out of the box. The same can’t really be said for databases. GUI clients exist (even some fairly good ones), but they’re mostly erratic, require installation, demand configuration, and/or cost money.
text : vim / emacs :: sql : ? — there is not, so far as I am aware, any good answer to this question. The statement generalizes quite a bit, and yet.
While most programming languages offer library support for the widely-used relational DBs, making them accessible for programmers who care to invest a little time, there’s no comparable level of access at the level of the shell, the windowing system, the web browser, etc.
People accomplish a lot with applications like Access, but while this speaks to the utility of databases, forms, and scripting, it is not anything we should feel especially good about.
A relational data store requires configuration to install, and configuration (or at least another set of credentials) to access from a client.
Like programming languages and text editors, a full-featured database is generally designed as a robust environment for tool use, offering lots of abstractions and utility functions, and quite extensible within its own space. What it’s not is much of a first-class citizen in a user-level tool environment. Databases are for people who build infrastructure, but they aren’t part of the infrastructure from the perspective of an end-user in the same way that things like filesystems, HTTP clients, and simple utility software are. Databases are earth-moving equipment, not highways and on-ramps.
Writing queries is fairly cheap, once you learn a little conceptual machinery, but defining or altering schema is usually slow and painful. (I’m pretty sure that the friction here generates a lot of the incentive to prefer NoSQL tools, and avoid rigorous schema definitions generally, which leads to a lot of probably-unnecessary pain.)
It should be trivial to pipe things in and out of queries/resultsets, but I don’t think it really is.
Of course, there’s a whole history here, and though I don’t know it well I do have a rough sense of the conditions and accidents that brought us to the status quo. It’d be a mistake to believe that the dominance of hierarchical filesystems as we know them was inevitable, but filesystems as a base layer appear to satisfy a lot of evolutionary constraints.
Which is to say that I’m not really interested in arguments that the filesystem is a dead-end or whatever. On a long enough timescale, I suppose it might be, but life is short and I have problems to solve. I’m just interested in whether you can make something approximately as good as all the good parts of PostgreSQL legible and accessible in the day-to-day nerd environment, transparent to the network and to the broader ecosystem of tools. What would it actually take?