ryantownsend 3 years ago

This will make complex upserts so much more simple, fantastic addition.

I really hope `RETURNING` support gets added to `MERGE` asap though (I believe it's been noted as a fairly trivial addition to come in future), then it'll be super powerful for doing bulk upserts that require post-processing.

  • ei8ths 3 years ago

    returning is the best.

baq 3 years ago

> prior to Postgres 15, you could use the "upsert" method with the ON CONFLICTclause.

> Now, MERGE can be used instead!

No mention of deadlocks in the article has me worried about thoroughness of the analysis.

  • lfittl 3 years ago

    Agreed - I think MERGE is great, but this post is missing the "Why not" part of the analysis.

    In the Postgres community, MERGE has been talked about for a long time, but in my understanding, part of the reason why the Postgres team initially shipped INSERT ... ON CONFLICT (instead of straight up MERGE) is that it lets you have guarantees about the outcome of the statement (i.e. either INSERT or UPDATE, by use of speculative insertion handling), vs MERGE can cause unique constraint violations and other issues.

    AFAIK, the generic syntax of MERGE does not allow for stricter guarantees, and therefore there will always be cases where one is better than the other.

alexchamberlain 3 years ago

This sounds really cool; maybe slightly cleaner than an INSERT ... ON CONFLICT statement. However, is that the only advantage?

throw03172019 3 years ago

Great to see!

My biggest gripe with ON CONFLICT upserts are the IDs (sequences) having gaps in them. Any good ways to prevent that?

  • loloquwowndueo 3 years ago

    Don’t rely on sequences having no gaps?

    Why do you even need that?

    • leetbulb 3 years ago

      for i smell jank code

    • throw03172019 3 years ago

      I don’t rely on them having no gaps. If the sequence is an integer, it’s much quicker to wrap if the table gets a lot of upserts.

    • btgeekboy 3 years ago

      If they're not accounting for it, integer overflow could be a real problem. (And it's not clear they would account for it in a table that normally has 10s of thousands of entries.) Someone else mentioned it here too: https://news.ycombinator.com/item?id=33242669

      • eusto 3 years ago

        bigint is the default data type for sequences and it's highly unlikely that somebody would reach an overflow with it.

        I've worked on various database solutions, both rdbms and analytical and I find sequences to be one of the most misunderstood features in the industry. The only guarantee they make is that they generate unique values. Some of the newer distributed rdbms don't even guarantee they'll be monotonic.

        Relying on them generating consecutive values is a sure way to get vendor lock-in to whatever database has made that guarantee.

        • richbell 3 years ago

          The problem isn't that values aren't consecutive, it's that the sequence is exhausted quicker than you'd expect. Someone could define a PK as `int4 GENERATED ALWAYS AS IDENTITY` because their expected number of entries is below the cap, only for that cap to be exhausted due to the behavior (in fact, that's been mentioned a few times).

          It 'violates' the principle of least surprise. Intuitively you'd expect `ON CONFLICT .... DO NOTHING` to do nothing, but it will increment the PK every time.

        • btgeekboy 3 years ago

          I wouldn’t be worried about the sequence itself running out, but the column it’s stored in in the table may not be large enough to handle those numbers.

  • richbell 3 years ago

    This is my gripe as well. We used `ON CONFLICT ... DO NOTHING` for a table with ~10k entries that are frequently upserted.

    By the time we reached a few thousand entries, we had primary keys in the millions. I personally don't care that there are gaps in the sequence, but gaps of hundreds of thousands definitely leaves a lot to he desired. I think you can circumvent this behaviour by changing from `DO NOTHING` to `DO UPDATE` and doing a dummy write, but that too leaves much to be desired.

    I also discovered that `ON CONFLICT` doesn't really work with a high number of concurrent writes. We had to implement our own upsert logic using advisory locks.

    • singron 3 years ago

      re concurrent writes: The issue we ran into was deadlocks when we upserted batches of records within a transaction. E.g. transaction A updates row 1 then row 2 and transaction B updates row 2 then row 1. If you sort so that you update rows in the same order across all transactions, then it won't deadlock. E.g. sort records by primary key and then upsert. If you can't do that, then advisory locks seems like a good idea.

    • SPBS 3 years ago

      If you're inserting a lot of data with a high likelihood of conflicts you're probably better off using `INSERT ... SELECT .. WHERE NOT EXISTS` instead, that way you can have gapless sequences.

          -- Will leave gaps
          INSERT INTO t (val) VALUES ('abc'), ('def') ON CONFLICT DO NOTHING;
      
          -- Won't leave gaps
          INSERT INTO t (val)
          SELECT * FROM (VALUES ('abc'), ('def')) AS tmp (val)
          WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.val = tmp.val);
      

      https://www.db-fiddle.com/f/6s4uYA987owJxu3CEvfu8t/0

    • btown 3 years ago

      One of the weirdest outages I've ever had to debug was when one of our tables with millions of rows (but in the low millions, at least so we thought) overran the maximum value for a 32 bit signed integer, due to it being primarily updated with INSERT ON CONFLICT on a nightly basis. At the time we had no idea why the primary keys were climbing so high! That was a tough one to figure out.

  • Merad 3 years ago

    Based on some quick testing MERGE does not appear to cause id gaps, though that does make me wonder if it's able to do so by providing less strict guarantees than ON CONFLICT.

  • dewey 3 years ago

    This only seems to be an issue from an ideological point of view and not a technical one. On small projects it bothered me that I can't immediately see if something is an old entry or a new one just by glancing at the ID.

    In any bigger project it's just noise and IDs are not helpful information anyway, not even at a glance. It's better to just ignore them, use bigint and move on.

albertopv 3 years ago

In SqlServer you have to use MERGE with HOLDLOCK to avoid issues, is it possible in Postgres? Is it so by default?

ramraj07 3 years ago

This is one of my favorite features in snowflake (I’m assuming they didn’t invent it) and am happy to see it in Postgres now.

datalopers 3 years ago

Awesome. How the hell did it take 20 years to finally land?

  • hn_throwaway_99 3 years ago

    > How the hell did it take 20 years to finally land?

    Perhaps because people working for free can decide what they want to work on?

    I think merge is cool, but it also easily replicated with what most of us do now for upserts in postgres using ON CONFLICT.

    • tpxl 3 years ago

      Lots of people are paid to work on Postgres.

    • petesergeant 3 years ago

      > people working for free

      I don’t think that is an accurate characterisation of the Postgres core team