ralferoo 1 day ago

Surprised the author didn't even think about the logical conclusion of his closing paragraph: "128 bits is the ideal sweet spot, collision safety effectively forever, and it happens to match the size of a UUID, which means every database, every language, and every protocol already knows how to handle it."

UUIDs are already generated randomly for exactly the same reason. Rather than inventing something new, they should have just used a UUID.

  • benmmurphy 1 day ago

    Generating 16 random bytes is simpler than generating a random UUID

    • drdaeman 1 day ago

      And there’s a good reason for that, because UUIDs have additional properties. I don’t know if versioning, partial ordering, or stable references are useful for traces or not, but with UUIDs those could’ve been a possibility.

    • ralferoo 1 day ago

      It basically makes no odds, unless you consider applying a constant AND and a constant OR complicated - UUID v4 is just 122 random bits plus 6 fixed bits.

      UUID v7 is a 48-bit timestamp, 74 random bits, and 6 fixed bits. Sure, this is a little more complicated, but it's often worth it because the IDs sort by creation time, so keys will be approximately monotonically increasing.
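
      For illustration, a minimal Python sketch of exactly that masking, using only the stdlib (the function names are mine) - both versions are a fill of random bytes plus those constant AND/OR operations:

        import os
        import time
        import uuid

        def uuid4_from_bytes():
            b = bytearray(os.urandom(16))    # 128 random bits
            b[6] = (b[6] & 0x0F) | 0x40      # constant AND/OR: version nibble -> 4
            b[8] = (b[8] & 0x3F) | 0x80      # constant AND/OR: variant bits -> binary 10
            return uuid.UUID(bytes=bytes(b))

        def uuid7_from_bytes():
            b = bytearray(os.urandom(16))
            ms = time.time_ns() // 1_000_000  # milliseconds since the Unix epoch
            b[0:6] = ms.to_bytes(6, "big")    # 48-bit timestamp in the first 6 bytes
            b[6] = (b[6] & 0x0F) | 0x70       # version nibble -> 7
            b[8] = (b[8] & 0x3F) | 0x80       # variant bits -> binary 10
            return uuid.UUID(bytes=bytes(b))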

      • benmmurphy 1 day ago

        I think UUIDv7 could make sense, but I suspect the recommendations in the spec predate UUIDv7. Also, if you want sortable IDs, there are slightly more efficient schemes than UUIDv7. With UUIDs you are always sacrificing some bits to distinguish between the UUID versions, which I guess does not really matter in practice, but it seems unnecessary.

        • ralferoo 1 day ago

          Yeah, in practice the 6 bits you lose aren't important. Redoing his calculations with 122 bits and a quadrillion generated IDs (that's a million billion), the probability of a collision using UUID v4 is 9.4 x 10^-8, or about one in 10 million.

          In my opinion, UUID v7 is useful because per millisecond you still have 74 bits, split between user-defined data (up to 12 bits) and randomness (minimum 62 bits). If you use 64 bits of randomness, you can read the numbers straight from the article - 1 million UUIDs per millisecond with less than a one in a million chance of collision - and you still have 10 bits left for additional data, such as which machine generated the ID.

          If you stick with just time and use the full 74 bits of randomness, you can generate a million (10^6) UUIDs per millisecond with about a one in 40 billion chance of collision (2.6 x 10^-11) using UUID v7.

          I think the k^2 term in the formula (k^2/2N) actually shows that having a time component makes better use of the bits than a purely randomised space. In this example, we have a lower chance of collision generating a million (10^6) UUIDs in any given millisecond than a quadrillion (10^15) UUIDs across all time.
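
          Plugging those numbers into the k^2/2N approximation, as a quick Python sketch (the helper name is mine):

            def collision_prob(k, bits):
                """Birthday approximation: p ~= k^2 / 2N for k IDs in an N = 2^bits space."""
                return k * k / (2.0 * 2**bits)

            print(collision_prob(1e15, 122))  # ~9.4e-08: a quadrillion UUID v4 IDs across all time
            print(collision_prob(1e6, 64))    # ~2.7e-08: a million per millisecond with 64 random bits
            print(collision_prob(1e6, 74))    # ~2.6e-11: a million per millisecond with the full 74 random bits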

        • ralferoo 1 day ago

          BTW, I realised I didn't address why those bits are necessary. While it might seem that more random bits means more randomness and so a lower risk of collisions, that's not necessarily true.

          The old schemes generated numbers that weren't uniformly distributed across the 128-bit space: they were intentionally biased in certain ways, such as by time [0] and MAC address [1]. Most of the IDs generated under the previous schemes therefore had many bits in common, so the UUIDs that already existed were not spread uniformly across that 128-bit space [2]. If you then used the whole 128 bits for random data, without reserving the fixed bits to avoid conflicts with the previous schemes, random IDs that happened to look like valid IDs from those schemes would be more likely to collide.

          Of course, this only matters if global uniqueness matters to you. For a closed system with a guaranteed scope, sure, who cares? But given that extra randomness adds no useful value beyond a certain threshold, you might as well use a UUID: you don't know what that identifier might end up being used for in the future, and you can use off-the-shelf systems to generate them.

          [0] Ironically, future-proofed time fields with many bits are more likely to be non-uniformly distributed - e.g. the original time-based UUID (version 1) supported timestamps from 1582 AD to 5236 AD, but was only used from around 1987 for about a decade.

          [1] With certain manufacturers of network cards massively more popular than others, their MAC address prefixes showed up significantly more frequently. There were also privacy concerns, where you could correlate UUIDs generated on a single machine, and sometimes infer which machines might be on the same network because they had similar MAC addresses, suggesting the cards all came from the same manufacturing batch.

          [2] Which is fine within the scope of UUIDs, as they are still very likely to be globally unique, so it doesn't really matter if a few bits are wasted in this scheme.

devin 1 day ago

From a practical standpoint, isn't it usually the case that there are retention periods for traces given how numerous they can be?

I bring this up because this article starts with "I asked Claude", but it doesn't explore at all the length of time over which you're generating IDs, which is an important aspect to consider when selecting a size.

  • singron 1 day ago

    Yes. The original Dapper used 64-bit trace IDs and collisions were rarely a problem.

    If you don't drop any spans from a trace, you can completely disambiguate a collision since the trace will have two distinct root spans. If you are missing spans, you might have a break in the parent-child links.

    Even with infinite retention, your analysis will bucket by time somehow, so a collision might have no effect if the collision doesn't happen at a proximate time. If you are manually looking at traces, it will be very obvious there is a collision unless they happen at the same time.

    Also, the birthday paradox only expresses the probability that there is a collision somewhere; if you are filtering or looking at single spans, the probability that you actually see a collision is greatly reduced.

    I think for basically all systems an additional 64 bits has insignificant cost, so you may as well prevent collisions, but I think the smaller ID could be a reasonable tradeoff if the cost mattered.
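
    A rough Python sketch of that last point (helper names and numbers are mine, assuming 64-bit trace IDs as in Dapper): the chance that some pair among k random IDs collides is ~k^2/2N, but the chance that one specific trace you pull up is affected is only ~k/N.

      def any_collision(k, bits):
          return k * k / (2.0 * 2**bits)   # birthday bound: some pair collides

      def this_id_collides(k, bits):
          return k / 2**bits               # a specific ID matches one of the other k (approx)

      k, bits = 1e9, 64                    # a billion 64-bit trace IDs
      print(any_collision(k, bits))        # ~0.027: a collision probably exists somewhere
      print(this_id_collides(k, bits))     # ~5.4e-11: the trace you're looking at is almost surely fine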

    • devin 1 day ago

      *nod* Adding this to my growing list of "things experienced engineers would discuss that are conspicuously missing in this case".

      The future is going to be filled with "best practices" trendslop decision-making.

_trampeltier 1 day ago

Why not 256? "Because of bandwidth costs". An adblocker does save bandwidth costs, but a handful of bytes from an ID doesn't.

gpderetta 1 day ago

TL;DR Birthday Paradox.

  • qbane 1 day ago

    tl;dr we reinvented UUID and it works well

    • drdaeman 1 day ago

      Certainly not true. UUIDs have structure to them, and variants. Trace IDs are just 128-bit numbers, with any further semantics (almost) completely non-standardized (some systems encode timestamps, etc.). A "last 56 bits are random" flag was added later (not in the ID itself but as external metadata, so not like a UUID at all), giving IDs just a bit of semantics, but that's not a reinvention of UUIDs.

      • qbane 1 day ago

        Okay, slightly more bits than UUID v4. The whole article is merely reasoning about why at least 128 bits are required, and if you smuggle non-random data into those bits the entropy can only drop, making the IDs more vulnerable to collision, i.e. inferior to UUID v4.

        • ralferoo 1 day ago

          I kind of addressed this in https://news.ycombinator.com/item?id=48060549

            Actually, because the birthday paradox has k^2 as a term, this is less true than you might think. Having a time component reduces the chance of collisions over the long run, albeit at the cost of reducing the number of IDs that can safely be generated in any given quantum.

          If you consider a 128-bit random number, you effectively have 64 bits of allocation space before you are likely to get a collision.

            If you devote 48 bits to time, which provides millisecond accuracy for around 9,000 years, you then have 80 bits of randomness, effectively giving 40 bits of allocation space per millisecond before you are likely to get a collision.

            Instead of approximately 2^64 allocations across all time before a collision, you have 2^40 (about 1 trillion) per millisecond. That sounds like a poor deal, until you realise the factor between the two is only 2^24 - 16,777,216 ms, or under 280 minutes.

            So in reality, shrinking the random space in exchange for bits that are guaranteed unique is actually a great trade.
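
            A quick Python check of the arithmetic above:

              MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25
              print(2**48 / MS_PER_YEAR)   # ~8900 years of millisecond timestamps fit in 48 bits
              print(2**(128 // 2))         # 2^64: likely-collision threshold of a 128-bit space
              print(2**(80 // 2))          # 2^40: per-millisecond threshold of an 80-bit space
              print(2**24 / 1000 / 60)     # ~280: minutes to allocate 2^64 IDs at 2^40 per ms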

          • qbane 16 hours ago

            I realized that mentioning UUID without specifying v4 was misleading.