* Postgres still has the same problem with vacuum horizon, when a long-running query can block vacuuming of a quick-churning table. (The author uses a benchmark from 2015 when the problem was already well-understood.)
* Stock Postgres still has no tools good enough against it.
* The author's company special version of Postgres does have such tools; a few polite promotions of it are strewn across the article.
My conclusion: it's still not wise to mix long (OLAP-style) loads and quick-churning (queue-style) loads on the same Postgres instance. Maybe running 0MQ or even RMQ may be an easier solution, depending on the requirements to the queue.
Queues aside, mixing these loads will probably always be a bad idea unless your database gives you really fine control over cache/buffer pools, so the tables you run analytics on can't dirty the entire cache.
1) It seems these two statements conflict with each other:
> The oldest such transaction sets the cutoff—referred to as the "MVCC horizon." Until that transaction completes, every dead tuple newer than its snapshot is retained.
and
> For example, imagine three analytics queries, each running for 40 seconds, staggered 20 seconds apart. No individual query would trigger a timeout for running too long. But because one is always active, the horizon never advances, and the effect on vacuum is the same as one transaction that never ends.
If the three analytics *transactions* (it's transactions that matter, not queries, although there is some subtlety around deferred transactions not acquiring a snapshot until the first query) are started
at different times, they will have staggered snapshots and so once the first completes, this should allow the vacuum to advance.
2) Although the problem about this query:
SELECT * FROM jobs
WHERE status = 'pending'
ORDER BY run_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
having to consider dead tuples is a genuine concern and performance problem,
this can also be mitigated by adding a monotonically increasing column
and adding a `WHERE column < ?` clause, provided you have also added an
index to make that pagination efficient.
This way you don't need to consider dead tuples and they 'only' waste space
whilst waiting to be vacuumed, rather than also bogging down read perf.
There is a little subtlety around how you guarantee that the column is monotonically
increasing, given concurrent writers, but the answer to that depends on what tricks
you can fit into your application.
3) I almost want to say that the one-line summary is 'Don't combine (very) long-running transactions with (very) high transaction rates in Postgres'
For (2): the problem is that the index would still need to keep the dead tuples, until Postgres is positively certain that no transaction holds them, and runs vacuum over them. It may speed up things a bit, but would still overflow the disk storage eventually. It may still prevent other tables from being vacuumed, too!
For disk usage, yes this doesn't address anything.
But for read performance (which is IMO what the section in the article was motivated by), it doesn't actually matter to have a bunch of entries corresponding to dead tuples in your index, provided Postgres doesn't need to actually consider the dead tuples as part of your query.
So if you have a monotonically increasing `job_id` and that's indexed, then
so long as you process your jobs in increasing `job_id` order, you can use the index and guarantee you don't have to keep reconsidering the dead tuples corresponding to jobs that already completed (if that makes sense).
[This is because the index is a b-tree, which supports efficient (O(log n) page reads for n entries) seeking on (any prefix of) the columns in the index.]
This is fair! This should as you descripbe work with a partial index, and with picking the lowest ID that has status = pending (via that index) which is not locked (via select ... for update skip locked). The query plan should be triple-checked though to actually use the index.
Did you test with fillfactor < 100 on the queue table? With HOT updates, status changes can reuse dead space without creating new index entries, which seems like it could significantly delay the onset of the death spiral?
If the status column changes, and an index depends on the contents of that status column (be it by referencing it in its columns, or in the index's WHERE filter) then an update of the status column will prevent the HOT optimization from being applied.
If I understood correctly, the queue implementation in the blog post holds a transaction while an operation is in progress.
I see the advice to make it as short as possible, but why can’t we update the status column to, say, “processing” and avoid potentially long transactions at all?
This works well for jobs that are long-ish. You need another process to sweep for orphaned jobs and requeue or fail them. Add a timestamp of when it got picked up to keep track
How does this solution compare with using pg_squeeze? We ran into this situation with a queuing solution and we've been testing it out... seems to work.
“Use Postgres for everything” is a great philosophy at low/medium scale to keep things simple, but there comes a scaling point where I want my SQL database doing as little possible.
It’s basically always the bottleneck/problem source in a lot of systems.
Yes. For example you'll typically have a "budget" of 1-10k writes/sec. And a single heavy join can essentially take you offline. Even relatively modest enterprises typically need to shift some query patterns to OLAP/nosql/redis/etc. before very long.
can share our work setup we've been tinkering with at a mid size org. iceberg datalake + snowflake for our warehouse, iceberg tables live in s3, that is now shareable to postgres via the pg_lake extension which automagically context switches using duckdb under the hood to do olap queries acrossed the vast iceberg data. we keep the postgres db as an application db so apps can retrieve the broader data they want to surface in the app from the iceberg tables, but still have spicy native postgres tables to do their high volume writes.
very cool shit, it's certainly blurred the whole olap vs oltp thing a smidge but not quite. more or less makes olap and oltp available through the same db connection. writing back to iceberg is possible, we have a couple apps doing it. though one should probably batch/queue writes back as iceberg definitely doesnt have the fast-writes story. its just nice that the data warehouse analytics nerds have access to the apps data and they can do their thing in the environment they work with back on the snowflake side.
this is definitely an "i only get to play with these techs cause the company pays for it" thing. no one wants to front the cost of iceberg datalake sized mountains of data on some s3 storage somewhere, and it doesn't solve for any sort of native-postgres'ing. it just solves for companies that are doing ridic stuff under enormous sla contracts to pay for all manners of cloud services that joe developer the home guy isn't going to be tinkering with anytime soon. but definitely an interesting time to work near data, so much "sql" has been commercialized over the years and it's really great to see postgres being the peoples champ and helping us break away from the dumb attempts to lock us in under sql servers and informix dbs etc. but we still havent reached a one database for everything yet, but postgres is by and large the one carrying the torch though in my head cannon. if any of them will get there someday, it's postgres.
Of course. The flip side is that many, many more people are in the "low/medium scale" zone than would self report. Everyone thinks they're a scale outlier because people tend to think in relative terms based on their experience. Just because something is larger scale than one is used to, doesn't mean it's high scale.
SQS is dead simple, and if your in AWS (forever) it is "in the stack" with some easy to use features that may make sense to you (delay queue is a great one).
Kafka is... a lot. If you need what it provides, then it's great. You just have to be able to support it, and thats non trivial.
I can point to more than a hand full of Kafka project that exist because it was clear that someone wanted it on their resume. I dont think any one is doing that with SQS, it is just (a fairly good utility). However if you want to leave (or branch out from) AWS and you're reliant on it, good luck.
MySQL’s concurrency control is a limit on the total number of threads that can be active at once, across all queries.
Traffic Control limits concurrency and resource use according to configurable metadata like the username, remote address, or the contents of any sqlcommenter tags included in the query. So you can say things like “the batch processing role can’t run more than four queries at a time.” The finer granularity is key.
In short:
* Postgres still has the same problem with vacuum horizon, when a long-running query can block vacuuming of a quick-churning table. (The author uses a benchmark from 2015 when the problem was already well-understood.)
* Stock Postgres still has no tools good enough against it.
* The author's company special version of Postgres does have such tools; a few polite promotions of it are strewn across the article.
My conclusion: it's still not wise to mix long (OLAP-style) loads and quick-churning (queue-style) loads on the same Postgres instance. Maybe running 0MQ or even RMQ may be an easier solution, depending on the requirements to the queue.
Queues aside, mixing these loads will probably always be a bad idea unless your database gives you really fine control over cache/buffer pools, so the tables you run analytics on can't dirty the entire cache.
Decent article, but some remarks:
1) It seems these two statements conflict with each other:
> The oldest such transaction sets the cutoff—referred to as the "MVCC horizon." Until that transaction completes, every dead tuple newer than its snapshot is retained.
and
> For example, imagine three analytics queries, each running for 40 seconds, staggered 20 seconds apart. No individual query would trigger a timeout for running too long. But because one is always active, the horizon never advances, and the effect on vacuum is the same as one transaction that never ends.
If the three analytics *transactions* (it's transactions that matter, not queries, although there is some subtlety around deferred transactions not acquiring a snapshot until the first query) are started at different times, they will have staggered snapshots and so once the first completes, this should allow the vacuum to advance.
2) Although the problem about this query:
having to consider dead tuples is a genuine concern and performance problem, this can also be mitigated by adding a monotonically increasing column and adding a `WHERE column < ?` clause, provided you have also added an index to make that pagination efficient. This way you don't need to consider dead tuples and they 'only' waste space whilst waiting to be vacuumed, rather than also bogging down read perf.
There is a little subtlety around how you guarantee that the column is monotonically increasing, given concurrent writers, but the answer to that depends on what tricks you can fit into your application.
3) I almost want to say that the one-line summary is 'Don't combine (very) long-running transactions with (very) high transaction rates in Postgres'
(Is this a fair representation?)
For (2): the problem is that the index would still need to keep the dead tuples, until Postgres is positively certain that no transaction holds them, and runs vacuum over them. It may speed up things a bit, but would still overflow the disk storage eventually. It may still prevent other tables from being vacuumed, too!
For disk usage, yes this doesn't address anything.
But for read performance (which is IMO what the section in the article was motivated by), it doesn't actually matter to have a bunch of entries corresponding to dead tuples in your index, provided Postgres doesn't need to actually consider the dead tuples as part of your query.
So if you have a monotonically increasing `job_id` and that's indexed, then so long as you process your jobs in increasing `job_id` order, you can use the index and guarantee you don't have to keep reconsidering the dead tuples corresponding to jobs that already completed (if that makes sense).
[This is because the index is a b-tree, which supports efficient (O(log n) page reads for n entries) seeking on (any prefix of) the columns in the index.]
This is fair! This should as you descripbe work with a partial index, and with picking the lowest ID that has status = pending (via that index) which is not locked (via select ... for update skip locked). The query plan should be triple-checked though to actually use the index.
The article is sparse on what pending means, but I would guess that that where condition would be enough?
Yo! Author here, I’ll be around if anyone’s got questions!
Did you test with fillfactor < 100 on the queue table? With HOT updates, status changes can reuse dead space without creating new index entries, which seems like it could significantly delay the onset of the death spiral?
If the status column changes, and an index depends on the contents of that status column (be it by referencing it in its columns, or in the index's WHERE filter) then an update of the status column will prevent the HOT optimization from being applied.
If I understood correctly, the queue implementation in the blog post holds a transaction while an operation is in progress.
I see the advice to make it as short as possible, but why can’t we update the status column to, say, “processing” and avoid potentially long transactions at all?
This works well for jobs that are long-ish. You need another process to sweep for orphaned jobs and requeue or fail them. Add a timestamp of when it got picked up to keep track
How does this solution compare with using pg_squeeze? We ran into this situation with a queuing solution and we've been testing it out... seems to work.
The problem is way worse if you update rows, if you stick with insert and delete you can get quite far.
Postgres can do so much. I see people choose Kafka and SQS for things that Graphile Worker could do all day long.
“Use Postgres for everything” is a great philosophy at low/medium scale to keep things simple, but there comes a scaling point where I want my SQL database doing as little possible.
It’s basically always the bottleneck/problem source in a lot of systems.
Yes. For example you'll typically have a "budget" of 1-10k writes/sec. And a single heavy join can essentially take you offline. Even relatively modest enterprises typically need to shift some query patterns to OLAP/nosql/redis/etc. before very long.
can share our work setup we've been tinkering with at a mid size org. iceberg datalake + snowflake for our warehouse, iceberg tables live in s3, that is now shareable to postgres via the pg_lake extension which automagically context switches using duckdb under the hood to do olap queries acrossed the vast iceberg data. we keep the postgres db as an application db so apps can retrieve the broader data they want to surface in the app from the iceberg tables, but still have spicy native postgres tables to do their high volume writes.
very cool shit, it's certainly blurred the whole olap vs oltp thing a smidge but not quite. more or less makes olap and oltp available through the same db connection. writing back to iceberg is possible, we have a couple apps doing it. though one should probably batch/queue writes back as iceberg definitely doesnt have the fast-writes story. its just nice that the data warehouse analytics nerds have access to the apps data and they can do their thing in the environment they work with back on the snowflake side.
this is definitely an "i only get to play with these techs cause the company pays for it" thing. no one wants to front the cost of iceberg datalake sized mountains of data on some s3 storage somewhere, and it doesn't solve for any sort of native-postgres'ing. it just solves for companies that are doing ridic stuff under enormous sla contracts to pay for all manners of cloud services that joe developer the home guy isn't going to be tinkering with anytime soon. but definitely an interesting time to work near data, so much "sql" has been commercialized over the years and it's really great to see postgres being the peoples champ and helping us break away from the dumb attempts to lock us in under sql servers and informix dbs etc. but we still havent reached a one database for everything yet, but postgres is by and large the one carrying the torch though in my head cannon. if any of them will get there someday, it's postgres.
Of course. The flip side is that many, many more people are in the "low/medium scale" zone than would self report. Everyone thinks they're a scale outlier because people tend to think in relative terms based on their experience. Just because something is larger scale than one is used to, doesn't mean it's high scale.
> I see people choose Kafka and SQS
SQS is dead simple, and if your in AWS (forever) it is "in the stack" with some easy to use features that may make sense to you (delay queue is a great one).
Kafka is... a lot. If you need what it provides, then it's great. You just have to be able to support it, and thats non trivial.
I can point to more than a hand full of Kafka project that exist because it was clear that someone wanted it on their resume. I dont think any one is doing that with SQS, it is just (a fairly good utility). However if you want to leave (or branch out from) AWS and you're reliant on it, good luck.
It would be nice if this ad at least explained a little bit of the technical side of the solution.
It sounds vaguely like InnoDB’s concurrency control solution which uses tokens [0] as a unit of maximum work a query can perform.
0: https://dev.mysql.com/doc/refman/8.4/en/innodb-performance-t...
tickets, not tokens, but yeah
MySQL’s concurrency control is a limit on the total number of threads that can be active at once, across all queries.
Traffic Control limits concurrency and resource use according to configurable metadata like the username, remote address, or the contents of any sqlcommenter tags included in the query. So you can say things like “the batch processing role can’t run more than four queries at a time.” The finer granularity is key.
We’ve got a separate blog post all about how traffic control works: https://planetscale.com/blog/behind-the-scenes-how-traffic-c...