Chiron1991 4 years ago

Friendly reminder that this post is published on the PlanetScale blog, a company that sells a database SaaS. Beware of the bias.

I personally would argue with every single point this article makes, except scalability.

  • pm90 4 years ago

    Agreed. The recent splurge in VC money for dev tools startups is going to lead to a lot more articles like this one. I hope developers read it with an eye towards that bias.

  • hodgesrm 4 years ago

    > I personally would argue with every single point this article makes, except scalability.

    Maybe you could be more explicit about what you don't like about their ideas? I personally do like a lot of their ideas, such as the following:

    > In the future, I’d expect to see a tighter coupling between the frameworks we’re using for reactive frontends – React, Vue, etc. – and the database, via hooks or otherwise.

    This builds on the behavior that made MongoDB so phenomenally popular, as the article points out. Data management is pervasive in modern applications and anything that makes it easier for devs to implement is goodness.

gagejustins 4 years ago

~hello everyone, author here~

I know posts with ThOuGhT LeaDeRshIp titles like this are usually annoying, but I thought it would be interesting to write down some of the lessons I've been gathering as I've spent more time covering and using specific databases. My background is in data science / analytics with a couple of years of more traditional full stack here and there.

Broadly we've seen this pattern with infrastructure in general – it's a lot easier to set up a server than it used to be, all things considered. Now obviously if you're a tiny startup, you're more comfortable outsourcing everything to Heroku, and if you're a hyperscale enterprise, you probably want more control on exactly what your database is doing.

The thesis here is that on the tail end (hyper scale), things are getting more under control and predictable, and developers there want the same "nice things" you get with platforms like Heroku. Elsewhere in the ecosystem, more and more parts of the stack are getting turned into "simple APIs" (Stripe for payments, Twilio for comms, etc.). And perhaps most interestingly, as serverless for compute seems kind of stuck (maybe?), it may be the case that serverless for databases – whatever that ends up meaning – is actually an easier paradigm for application developers to work with.

  • jasonwatkinspdx 4 years ago

    I think the article's thesis is a false and misleading dichotomy.

    It's absolutely true that a low friction developer experience is necessary for a database product to be successful. But this in no way implies that database internals are being commoditized or relegated to minor importance.

    Snowflake is a particularly bad example as taking a clean sheet and novel approach to internals is the very fulcrum that creates the easy developer experience.

    Admittedly it's been a while since I looked at vitess, but my recollection is that its cross-shard functionality is so limited that claiming internals no longer matter seems a bit dubious.

    The reason there's only a handful of spanner style systems is exactly because the internals both matter and are quite daunting to get right.

    • felixhuttmann 4 years ago

      I agree. It is also amazing how different the database systems are that are competing against each other today:

      Partitioning: 1) DynamoDb: Partitioning is explicit and one of the most important parts of schema design 2) Spanner, Cockroach: Database automatically partitions the key ranges. 3) Postgres: You will probably never reach the scale where you need to partition your dataset!

      Transactions: 1) Spanner, firestore - no stored procedures, client-side transactions are important 2) Dynamodb: No stored procedures, no client-side transactions, only transactions where all items involved are known by primary key in advance. 3) Fauna, Supabase: Stored procedures are the way to go! You do not need application code, access your database from the client! 4) Postgres: We have everything, use what fits your particular use-case!

      If database internals did not matter, why are they all doing something different and are sometimes quite opinionated about it?

  • redwood 4 years ago

    I think you bring up some good points, but it's factually incorrect to state that MongoDB is not ACID compliant (it has been since its 4.0 release). The way to win is not through hyperbole and lies; it's, to your point, by focusing on developer experience.

TekMol 4 years ago

As a developer, I have to say that sqlite gives me the best experience.

Everything else pales in comparison.

Create a database?

    sqlite3 mydata.db

Where is the database?

    In the current directory

How is it structured on disk?

    It's a single file

How do I backup the DB?

    cp mydata.db /my/backups/mydata.db

Do I have to install a server?

    No

Do I have to configure anything?

    No

During setup and deployment I usually dabble a while with the whole GRANT *.* ON localhost IDENTIFIED BY PASSWORD or something. How do I do that with sqlite?

    It just works

Do I have to close / protect any specific ports?

    No, it's just a file

Which field types should I use for ... ?

    None. It just works.

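
The zero-setup workflow above can be sketched end to end with Python's stdlib sqlite3 module (the file name and schema are just examples):

```python
import sqlite3

# No server, no auth, no config: connecting creates the file if needed.
conn = sqlite3.connect("mydata.db")
conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
conn.commit()

rows = conn.execute("SELECT body FROM notes").fetchall()
print(rows[0])  # → ('hello',)
conn.close()
```
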
  • waynesonfire 4 years ago

    calibre uses sqlite and it doesn't support hosting the db on a network drive.

    https://manual.calibre-ebook.com/faq.html#i-am-getting-error...

    i guess that's a bad -dev- user experience?

    • TekMol 4 years ago

      You can certainly host it on a network drive if the network filesystem has the right features and behaviour.

      The same goes for a local filesystem. Sqlite has certain features it requires the filesystem to have. That is independent of how that filesystem stores the data physically.

      • fmakunbound 4 years ago

        > if the network filesystem has the right features and behaviour

        Which network filesystems are those?

        • dogma1138 4 years ago

          SQLite actually works just fine on Windows files shares (and yes with multiple clients since Windows file shares do support file locks) but I wouldn’t recommend it as a remote DB/multi client solution.

  • void_mint 4 years ago

    This feels like kind of a silly comparison. SQLite can't do most of what makes modern databases difficult/valuable. As an example, if you replace your above example with a CSV file, you'd get the same output, right?

    • TekMol 4 years ago

      I describe the differences between sqlite and mysql, postgres, oracle etc that are the reason why I like sqlite best.

      If, for you, there are specific differences that make you choose some other db, let us know.

      • munk-a 4 years ago

        I remember that back in the day getting Postgres up and running was an enormous chore - SQLite was definitely the easiest, but even MySQL was a walk in the park compared to Postgres. But Postgres is simply amazing - it is entirely transparent about the metadata it's storing to assist querying and has mature tools available to help you optimize queries. The only "big" one that I haven't used is Oracle, mostly because the price makes it insanely unreasonable for small companies. Out of MySQL, SQLite, MSSQL and Postgres I'd have to say my favourite is Postgres, for causing the fewest hair-pulling frustrations and having absolutely pristine documentation - followed closely by MSSQL, though MSSQL sometimes seems to go out of its way to have an incompatible dialect.

    • the_gipsy 4 years ago

      > with a CSV file, you'd get the same output, right?

      Not when you have threads.

  • DelightOne 4 years ago

    Is there a wrapper to treat a database like it's an SQLite database (a file), to simplify things to the same level as SQLite? That would at least simplify some issues. It doesn't matter whether it's backed by a local database running in a container or by a remote one.

    • munk-a 4 years ago

      The biggest database unification effort tends to be focused around ODBC compatibility from what I've seen - and the operations laid out by ODBC are quite trivial and easy to comprehend I think.

      • DelightOne 4 years ago

        I was thinking more of a CLI wrapper to enable simple file-like management of databases, with operations like copy and list, and not having to set up auth, but rather just providing the local path to the database. However, ODBC looks interesting too.

        ODBC might enable the creation of such a cli wrapper in a database-agnostic way. An authentication library retrieving the correct credentials based on the local file path may make the local use as seamless as with Sqlite. To get the best from both worlds.

        • munk-a 4 years ago

          Yea I think that'd be an interesting approach - and ODBC takes care of all the annoying bits for you (providing you with a single unified API to interact with instead of tailoring the commands to each specific driver).

  • tyingq 4 years ago

    Fully agree, but that comes with tradeoffs around concurrent writes, strong typing (though I hear that's coming), some other functionality, etc.

  • johncolanduoni 4 years ago

    Unless you need to do something crazy, like run two nodes of your application. But only Google needs to do that, right?

    • dragontamer 4 years ago

      You don't even need to reach two nodes before SQLite becomes grossly inadequate. Even on a single node: SQLite's paradigm of global locks leads to poor performance when multiple threads write to the same table.

      You could be on a single-node 4-core $5/month VPS instance and run into this issue. SQLite requires "exclusive" access to the database to handle writes (in the default journal mode, no other connection can read it while a write is being committed). Especially if your transactions start to become complex.

      In contrast, MySQL and PostgreSQL allow for simultaneous reads while writes are occurring.

      • TekMol 4 years ago

        The question is if simultaneous operations really speed up your application.

        It is not as if a 4-core machine can do 4 times the DB work if you only allow it.

        Memory access, disk access .. they all have their specific behaviour when you try to do things simultaneously. In the worst case, things will just get serialized on a lower level, even if multiple CPU cores send and/or request data simultaneously.

        • dragontamer 4 years ago

          > It is not as if a 4-core machine can do 4 times the DB work if you only allow it.

          If Thread#1 goes "lock. write. unlock", then that means Thread#2, #3, and #4 all have to wait until Thread#1 is done with the write.

          Even if Thread#2/#3/#4 have the data in their CPU-cache, they have to wait for Thread#1 to be complete.

          --------

          If Thread#1 is writing to a __hard drive__, that means Thread#2, #3, and #4 are waiting on a hard drive read, when they could have instead been reading from L3 cache.

          SQLite's model scales extremely poorly in practice. You run into all sorts of problems like this.
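
The serialization described above is easy to reproduce: in SQLite's default rollback-journal mode, only one write transaction can be open at a time, so a second connection's write attempt fails with "database is locked". A minimal sketch, using a throwaway temp file:

```python
import os
import sqlite3
import tempfile

db = os.path.join(tempfile.mkdtemp(), "demo.db")

# Writer opens a transaction and holds the write lock...
w = sqlite3.connect(db, isolation_level=None)
w.execute("CREATE TABLE t (x INTEGER)")
w.execute("BEGIN IMMEDIATE")
w.execute("INSERT INTO t VALUES (1)")

# ...so a second connection cannot start its own write; with
# timeout=0 it fails immediately instead of queueing behind the lock.
other = sqlite3.connect(db, timeout=0, isolation_level=None)
try:
    other.execute("BEGIN IMMEDIATE")
    blocked = False
except sqlite3.OperationalError:  # "database is locked"
    blocked = True
print(blocked)  # → True
```
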

          • TekMol 4 years ago

            That is why I mentioned the "worst case".

            You are describing the best case. Where everything is in a cache close enough to the CPU that it is not impacted by the other CPUs data access.

            • dragontamer 4 years ago

              I'm talking about relatively simple cases like spinning up a phpbb3 (web forum) instance. The minute you have multiple users writing comments at the same time from different apache/php instances is the minute your SQLite database starts to fall over.

              Every write (aka: every comment post) is an exclusive write on SQLite. A full lock that prevents other processes from reading. (Ex: User#1 writes a comment, but User#2 hits refresh. These two items will be sequential if you use SQLite... when normally they'd be concurrent in most other databases)

              ------

              SQLite is an exceptionally good database for many purposes. But it has its weaknesses. There's a reason why MySQL and PostgreSQL exist after all.

        • johncolanduoni 4 years ago

          We don’t have to talk in abstract about this, you can benchmark these things or just look at the many ones run by others. For example, some benchmarks around some Postgres scaling improvements: https://wiki.postgresql.org/images/e/e8/FOSDEM2012-Multi-CPU....

          If you don’t need anything Postgres offers, by all means stick with sqlite. But your pessimism about being stuck with operations “serialized on a lower level” is not empirically justified for most of the databases that support multi-threading.

      • ColonelPhantom 4 years ago

        I believe that SQLite in Write-Ahead Log (WAL) mode does not have this issue.
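
For reference, enabling WAL is a one-line pragma; readers then see a consistent snapshot while a writer proceeds. A sketch (the temp path is only for illustration; WAL needs a real file, not :memory:):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "mydata.db")
conn = sqlite3.connect(path)

# Write-Ahead Logging: writers append to a log instead of locking
# readers out; the pragma returns the journal mode now in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # → wal
conn.close()
```
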

  • mamcx 4 years ago

    Firebird is also close, with the extra capability that it can run as a server (and that is only a setup concern: the code stays the same).

    Also, it supports more stuff (like stored procedures!) that I miss a lot in sqlite.

    My only complaint with Firebird is that it is not easy to embed into iOS/Android, or it would be my main db.

  • bob1029 4 years ago

    We've been using sqlite as our only datastore for the last 5+ years.

    There are a lot of fun little tricks you learn along the way.

    The most important one is to ignore all the naysayers who claim it won't scale. I can easily saturate any IO subsystem by way of SQLite.

    At first you may think it's impossible, but then you start to go outside the lines a bit. You start by turning on WAL. That helps a lot... But then you want another 100x? One thing you can do is heavily abuse the fact that any SQLite db is "just a file" by opening as many as you need to in parallel. 9/10 times you can break down a typical business app into grains roughly approximating the current state of each user.

    Might make regulatory compliance easier if each user is stored in their own database too. Removing someone's entire presence from your system becomes a single file deletion.
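
A minimal sketch of that one-file-per-user layout (the directory, schema, and helper name are made up for illustration):

```python
import os
import sqlite3
import tempfile

# data_dir stands in for wherever the per-user files would live.
data_dir = tempfile.mkdtemp()

def user_db(user_id):
    # Each user's entire state lives in its own SQLite file.
    conn = sqlite3.connect(os.path.join(data_dir, f"user_{user_id}.db"))
    conn.execute("CREATE TABLE IF NOT EXISTS events (ts INTEGER, payload TEXT)")
    return conn

conn = user_db(42)
conn.execute("INSERT INTO events VALUES (1, 'signed_up')")
conn.commit()
conn.close()

# "Removing someone's entire presence" really is one file deletion:
os.remove(os.path.join(data_dir, "user_42.db"))
```
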

    • Tenoke 4 years ago

      How do you aggregate the data and query among all those databases? Doesn't it incur huge costs? Does sqlite support this natively or are you basically treating it as text files of data per user and don't need complex queries in the first place?

      • bob1029 4 years ago

        We don't have any use cases for querying across these datastores right now. Hypothetically, if we did need something like this it would probably be an offline ETL analysis tool. I have written a few of these and they can be incredibly performant if everything lives on the same box.

      • TekMol 4 years ago

        Sqlite has it natively. After connecting to user1.db, just execute this query:

        ATTACH DATABASE 'user2.db' AS user2;

        Now you can use user2.tablename in any query, just as if it was part of user1.db
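
For example, a cross-database aggregate over two per-user files looks like this (file names and the orders schema are illustrative):

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()

# Two per-user databases with the same schema.
for name, amount in [("user1.db", 10), ("user2.db", 32)]:
    c = sqlite3.connect(os.path.join(tmp, name))
    c.execute("CREATE TABLE orders (amount INTEGER)")
    c.execute("INSERT INTO orders VALUES (?)", (amount,))
    c.commit()
    c.close()

conn = sqlite3.connect(os.path.join(tmp, "user1.db"))
conn.execute("ATTACH DATABASE ? AS user2", (os.path.join(tmp, "user2.db"),))

# Tables from both files can now be mixed in a single query.
total = conn.execute(
    "SELECT (SELECT SUM(amount) FROM orders)"
    " + (SELECT SUM(amount) FROM user2.orders)"
).fetchone()[0]
print(total)  # → 42
```
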

jdblair 4 years ago

The author skips the first decade of database systems in the 1960s. The oldest databases were not relational! They were hierarchical or navigational.

Hierarchical databases were much like a filesystem, but for records instead of files.

Navigational databases allowed data to be linked in a network. Look up CODASYL for details.

The relational database design was first proposed in the 1970s.

cl0ckt0wer 4 years ago

I would think that having bad internals would create a bad dev experience.

  • wmf 4 years ago

    Unfortunately MongoDB proved the opposite. You don't notice the data loss and schema problems until later.

    • cl0ckt0wer 4 years ago

      That is a pretty bad experience in the long run.

      • ketralnis 4 years ago

        Right, but in the long run you're more locked in than you were when you didn't know that.

      • dilyevsky 4 years ago

        In the long run most devs will be at faang and someone else will inherit the mongodb goodness

    • DemocracyFTW 4 years ago

      > You don't notice the data loss

      Wait, is that a feature or a bug?

seibelj 4 years ago

> NoSQL databases are maturing, for sure – we’re starting to see support for transactions (on some timeframe of consistency) and generally more stability. After years of working with “flexible” databases though, it has become clearer that rigidity up front (defining a schema) can end up meaning flexibility later on.

So funny to me that NoSQL boosters have only recently understood that designing sane schemas and knowing what order your data is inserted is important for data integrity. It's like an entire generation of highly paid software devs never learned fundamental computer science principles.

  • blacktriangle 4 years ago

    That's exactly what it is. "Self-taught coder" really isn't the right term for what many are, as it implies some form of intentional individual study. More like "self-learned to duct-tape shit together thanks to Stack Overflow", but we don't have a catchy term for that.

  • dragontamer 4 years ago

    To be fair, relational algebra is hard.

    That being said: going back to 1970 to read the original "A Relational Model of Data for Large Shared Data Banks" by Codd (the paper which started the relational-database + normalization model) is incredibly useful.

    But yeah, all of this debate about "how data should be modeled" was the same in 1970 as it is today.

    -----

    SQL doesn't quite fit 100% into the relational model, but it's certainly inspired by Codd's relational model and designed to work with the principles from that paper.

    And strangely enough, legions of authors and teachers and courses do a worse job at explaining relational databases than Codd's original 11 page paper.

    • munk-a 4 years ago

      I had a specific class on relational algebra in uni and it is up there with algorithm design and analysis in the realm of classes that actually provided me the most long term value.

      Relational algebra is a lot easier once you start viewing it as relational algebra - a declarative expression of intent that can be manipulated and re-expressed similar to other purely mathematical statements. Then, when performance tuning becomes the watchword, you take that flexible expression and slice and dice it according to how the DBMS you're working with requires to align it with performance. You always want to think of your queries as complex summoning spells that draw in different necessary resources in some particular patterns and then impose an expression form on that blob of data - then you'll skate through all things SQL.

      • dragontamer 4 years ago

        My feel for relational algebra comes from studying constraint programming / optimization (a field closely related to 3SAT solvers).

        From this perspective, the study of relations is more about solving these NP-hard problems. For example, coloring a graph. You can solve things "locally", such as:

            Texas | New Mexico | Oklahoma
            -----------------------------
            Red   | Blue       | Green
            Red   | Blue       | Yellow
            Red   | Green      | Blue
            Red   | Green      | Yellow
            Red   | Yellow     | Blue
            Red   | Yellow     | Green
            Blue  | Red        | Green
            Blue  | Red        | Yellow
            Blue  | Green      | Red
            Blue  | Green      | Yellow
            ...
            (etc. etc. for all other valid combinations)
        

        And so on for each "locally valid graph coloring" (looking only at a limited number of states). You then combine all of these relations together to find a solution to the 4-coloring problem.

        We can see that "solving" a 4-coloring problem is as simple as a cross-join over all these relations (!!!). That is: Texas_NewMexico_Oklahoma cross-join Texas_Oklahoma_Louisiana cross-join Louisiana_Mississippi_Arkansas cross-join ...

        We can see immediately that "Texas_NewMexico_Oklahoma cross-join Texas_Oklahoma_Louisiana" will create a new relation, a "Texas_NewMexico_Oklahoma_Louisiana" table, though with the information from just two earlier tables, this new table is "incomplete" so to speak (you'll need to join this table with many other Texas, NewMexico, Oklahoma, and Louisiana tables to ensure global consistency).

        We can imagine a big 48-wide table consisting of the 48 states of USA as the final solution to a graph coloring problem. Successful calculation of this massive table will enumerate all possible solutions (!!) of the graph coloring problem.

        ----------------

        Somehow, I find it more obvious to understand relations from this perspective. If anything, learning about constraint programming has made my relational algebra better (and as a result, has made my database skills probably better too)

        It's also a funny corner where, if you "study relations" hard enough, you eventually reach NP-complete problems. 3SAT is easily written in the form of database relations, after all :-) (But using a database as a 3SAT solver is probably a bad idea, despite being mathematically connected.)
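
A toy version of this runs fine in SQLite: each table below is one "locally valid" relation (adjacent regions must get different colors), and joining them on the shared columns enumerates every globally consistent coloring of the triangle. Table and column names are made up for the sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
colors = ["red", "green", "blue"]

# One relation per local constraint: adjacent regions must differ.
conn.execute("CREATE TABLE r_tx_nm (texas TEXT, new_mexico TEXT)")
conn.execute("CREATE TABLE r_tx_ok (texas TEXT, oklahoma TEXT)")
conn.execute("CREATE TABLE r_nm_ok (new_mexico TEXT, oklahoma TEXT)")
pairs = [(a, b) for a in colors for b in colors if a != b]
for table in ("r_tx_nm", "r_tx_ok", "r_nm_ok"):
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", pairs)

# Joining the local relations on their shared columns yields
# exactly the globally consistent assignments.
solutions = conn.execute(
    """SELECT r_tx_nm.texas, r_tx_nm.new_mexico, r_tx_ok.oklahoma
       FROM r_tx_nm
       JOIN r_tx_ok ON r_tx_ok.texas = r_tx_nm.texas
       JOIN r_nm_ok ON r_nm_ok.new_mexico = r_tx_nm.new_mexico
                   AND r_nm_ok.oklahoma = r_tx_ok.oklahoma"""
).fetchall()
print(len(solutions))  # → 6  (3 * 2 * 1 colorings of a triangle)
```
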

fabian2k 4 years ago

Understanding a limited amount of database internals has been very useful to me. There is one aspect of using databases that you simply cannot abstract away, and that is performance. If you ask your database a question in a way it is not suited to answer, or one that isn't supported by indexes, performance is not going to be good.

And these performance differences are not small once your database has a decent size. And if your tables are really large, it's not a question of fast or slow, but fast enough or so slow it's indistinguishable from the database being down.

Of course to some extent you can simply throw hardware or money at the problem. This certainly works for smaller inefficiencies, but sometimes knowing the database will give you orders of magnitude better performance. Hardware and money also don't scale indefinitely.
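
SQLite makes this visible with EXPLAIN QUERY PLAN: the same query flips from a full scan to an index lookup once the index exists (the users schema here is purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")

# Without an index the planner can only do a full table scan.
scan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@b.c",)
).fetchall()
print(scan[0][-1])  # a SCAN over users

# With an index, the same query becomes a direct lookup.
conn.execute("CREATE INDEX idx_email ON users(email)")
search = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@b.c",)
).fetchall()
print(search[0][-1])  # a SEARCH using idx_email
```
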

  • abanayev 4 years ago

    At some point, DBaaS systems should be able to understand and make inferences about your use cases, to the point where indexes and other performance optimizations are automated whenever you register a new query or something. This would be the new era of database systems, and as the article points out is increasingly true about all “infrastructure” concerns.

    • pgwhalen 4 years ago

      One of the problems with this is that there will always be trade offs. It’s hard to imagine a database understanding the appropriate compromise between read and write speeds for your specific application, for example.

orlovs 4 years ago

Call me old, but the title's theme for me deeply resonates with the Foundation chapters about losing important knowledge.

  • waynesonfire 4 years ago

    nothing to do with age. this is published by "PlanetScale is a MySQL compatible, serverless database platform powered by Vitess." -- they have a vested interest in promoting the complexities of DBs.

    In my view the author has absolutely zero basis to make such a claim.

eatonphil 4 years ago

> MySQL, MongoDB, Firebase, Spanner; there has literally never been a better time to be a database user at any level of complexity or scale. But there’s still one common thread (ha!) among them – the focus is on infrastructure, not developer experience.

It was my impression that everyone picked (and still picks) MySQL, MongoDB, and Firebase _because_ they were the easiest to use. It seemed like developer experience was by far the most important thing to them (compared to sane behavior initially in the case of Mongo and MySQL, some of which has since evolved).

  • IncRnd 4 years ago

    > It was my impression that everyone picked (and still picks) MySQL, MongoDB, and Firebase _because_ they were the easiest to use.

    I've found that to be the case, except for enterprise development, which has different concerns than how quickly code gets written to use a database.

    • eatonphil 4 years ago

      Are you saying that enterprise developers pick MySQL, MongoDB, and Firebase for other reasons?

      • fnord123 4 years ago

        I think GP was pulling apart "everyone".

        • eatonphil 4 years ago

          Ah. To clarify I meant that it was my impression that most people that pick one of those dbs pick it because they are easy to use.

          • fnord123 4 years ago

            Yes but obviously not "everyone picks [those DBS]"

  • jeffbee 4 years ago

    I can't imagine any scenario under which a reasonable person of at least median intelligence would perceive MongoDB as "easier to use" than Spanner, unless their entire experience with Mongo was they put one trivial JSON doc into an M0 cluster and got it back out later. Every practical aspect of MongoDB is a complete shitshow, from sharding to backup to failover.

    • eatonphil 4 years ago

      With the understanding that Mongo _is_ popular I think you're making my point. Though I have no clue if most people who choose Mongo evaluate it against Spanner.

    • 015a 4 years ago

      "Sharding" "Backups" and "Failovers" are NOT "practical" aspects of any database. They're theoretical. Most databases are not big enough to need sharding. Most backups go unused. Most failover happens automatically, totally managed by your hosting provider.

      You know what is practical? Schema design. Query language. That's what made MongoDB super popular; no schemas to worry about. A query is just '{ firstName: "John" }'.

      I cannot emphasize this enough: I cannot summon even a milliliter of desire to care about whether Mongo's way of doing these things is actually "better" or "worse". But it is what made it popular.

      • johncolanduoni 4 years ago

        If the database itself can’t handle failover properly, there’s nothing your hosting provider can do except rewrite it (which is exactly what Amazon ended up doing with MongoDB). Also if you’ve spent any significant time being responsible for operating a database and came away with “backups aren’t practical” you’re insanely lucky (or I and everyone I know is insanely unlucky).

      • munk-a 4 years ago

        Backups are most definitely a necessary part of any persistent datastore that is accumulating value for your company. Sharding is something that becomes necessary at scale - having it in your toolbox is a really good idea - but it doesn't come up for most folks. Failovers are the least valuable of what you've mentioned IMO since downtime management is something that even very mature companies have a lot of flexibility with.

        SQL is a pretty easy language to learn - you can get pretty much everything working right with some SELECT, FROM, WHERE, GROUP BY and subqueries alone - the more arcane dialect components of SQL (like HAVING - gosh I hate how easy to misunderstand HAVING is) are things you can grow into. But basic SQL - it works and has worked fine for decades - I strongly dislike tooling and languages that go on crusades to make a better SQL.

      • jeffbee 4 years ago

        > failover happens automatically, totally managed by your hosting provider.

        I see you've never used Mongo Atlas.

      • hodgesrm 4 years ago

        I work for a company that supports ClickHouse. Our focus is analytic systems, which tend to run large compared to OLTP systems.

        * Sharding is part of schema design for any analytic app whose data exceeds the capacity of a single host. This is very common for use cases like web analytics, observability, network management, intrusion detection, to name just a few. Automatic resharding is one of the top asks in the ClickHouse community. (We're working on it.)

        * How do I backup ClickHouse is one of our top 3 support questions in order of frequency. I just taught a ClickHouse class yesterday--part of a regular series--and it was the first question out of the gate from the audience. It has come up at some point in almost every customer engagement I can think of.

        In my experience, your comment is only correct for relatively small applications that are not critical to the business.

fmakunbound 4 years ago

> Database internals will eventually just not matter

Of course you need to know the internals of your database. If you've ever come across a project where the team treated a key/value, or document database as a relational one (probably because the query syntax looks similar), then you will know just how important database internals are.

tyingq 4 years ago

That does make some sense, especially for databases that have been around a while. Lots of internals were written with hard drives in mind rather than SSDs, much lower amounts of memory, and so on. On the other hand, it's nice when your database works well in even a very limited or old environment.

_jezell_ 4 years ago

Looks like they let the marketing folks write tech articles again.

klysm 4 years ago

Given that’s pretty much the only way you can differentiate in this market, it makes sense for planetscale to believe that.

withinboredom 4 years ago

> databases will win based on superior developer experience.

I guess rethinkdb really was ahead of its time.