tialaramex 1 day ago

Very often if you have text, which this does, you can make huge savings by being intelligent with the text.

Rust intentionally provides the simplest possible growable string buffer String, which is literally (under the hood, you can't poke this legitimately) Vec<u8> plus the promise that this is UTF-8 text.

But you might find your needs better served by one (or several) of:

Box<str> -- you don't need capacity, so, don't store it => length == capacity

CompactString -- use the entire 24 bytes for SSO, up to 24 bytes of UTF-8 inline, obviously doesn't make sense if all or the vast majority of your strings are 25 bytes or longer

ColdString -- same idea but for 8 bytes, and also not storing capacity, this only makes sense over Box<str> if you have plenty of <= 8 byte strings

  • ben-schaaf 1 day ago

    There's really an endless list of these optimizations. A few I've used (though not necessarily in rust):

    Atoms: Each string can be referenced with a single u32 or even u16, and they're inherently deduplicated.

    Bump allocator: your strings are &str, allocation is super fast with limited fragmentation.

    Single pointer strings (this has a name, I can't think of it right now): you store the length inside the allocation instead of in each reference, so your strings are a single pointer.

    • MrBuddyCasino 1 day ago

      Atoms: is this similar to interned strings?

      • ben-schaaf 23 hours ago

        The names of these things are hazy and inconsistent. In Java and C# an interned string is the same type as other strings. Others describe atoms as interned strings, some call them symbols. At my work we call the u16/u32 atoms and interned strings are the single pointer strings described above.

    • locknitpicker 1 day ago

      > There's really an endless list of these optimizations.

      These aren't really optimizations. They are specialized implementations that introduce design and architectural tradeoffs.

      For example, Rust's Atom represents a string that has been interned, and it's actually an implementation of a design pattern popular in the likes of Erlang/Elixir. This is essentially a specialized implementations of the old Flyweight design pattern, where managing N independent instances of an expensive read-only object is replaced with a singleton instance that's referenced through a key handle.

      I would hardly call this an optimization. It actually represents a significant change to a system's architecture. You have to introduce a set of significant architectural constraints into your system to leverage a specific tradeoff. This isn't just a tweak that makes everything run magically leaner and faster.

      • ben-schaaf 1 day ago

        You might want to refresh your understanding of the word optimisation. Changing a system to be more effective/efficient is optimisation, how big that change is makes no difference.

      • dryarzeg 1 day ago

        > everything run magically leaner and faster

        In my opinion, there's no magic in the software engineering. Everything (or almost everything) is a system that can be described, explained, modified and so on. Applications, libraries, operating systems, kernels, CPUs/RAM/GPU/NPU/xPU/whatever silicon there is, ALUs/etc, transistors, electricity, physics... That's nowhere near "magic". There's always some trade-offs, it's just that you may not be aware of them initially.

    • tialaramex 21 hours ago

      ColdString is both your "Single pointer string" and a Small String Optimisation on top.

      First, on the heap we have a self-indicating length prefix, basically we use the bottom 7 bits of each byte to indicate 7 bits of length and the top bit indicates there are more bits in the next byte. So "ben-schaaf" would be 0x0A then the ASCII for "ben-schaaf"

      But, we avoid even having a heap allocation if we have 8 or fewer UTF-8 bytes to encode the text, that's our Small String Optimisation.

      To pull this off we specify that our heap allocations will have 4 byte alignment even though they don't need it. This shouldn't be a problem, in fact many allocators never actually deliver smaller alignments anyway.

      This means our pointer now has two spare bits, the least significant bits are now always zero for a valid heap pointer. We rotate these bits to the top of the first byte (this varies depending on whether the target is big-endian or little-endian) and we mask them so that for these valid pointers they are 0b10xxxxxx

      So, now we can look at the "single pointer" and figure out

      If it begins 0b10xxxxxx it really will be a valid pointer, rotate it, mask out that flag bit and dereference the pointer to find the length-prefixed text.

      If it begins 0b11111AAA there's a short string here but it didn't need all 8 bytes, the next AAA bytes of the "pointer" are just UTF-8 and conveniently AAA is enough binary for 0 through 7 to be signalled, the exact length we have

      If it has any other value the entire 8 bytes of "pointer" is a UTF-8 encoded string

      • VorpalWay 21 hours ago

        Storing the length in the allocation has potential performance tradeoffs too! The most obvious is that taking a substring/string view will need a copy (or use a different type that store the length outside the allocation).

        But it also means the CPU has to follow the pointer (and potentially get a cache miss or pipeline stall) to find the length. Having a fat pointer of ptr+length makes a lot of sense for string views, and for owned string buffers with capacity it can mean avoiding a cache miss when appending to the buffer.

        It's complicated in other words.

    • Rendello 16 hours ago

      > Bump allocator

      You can build an ad-hoc bump allocator by using a String and indexing into it. You can't use &str references though, as a growing String may reallocate elsewhere and invalidate your references (Rust won't even let you try this), so you have to use your own indices. This is the same thing that bump allocator libraries usually do, too. It can be tricky but have great performance gains.

      I recently 100x-d the speed of an XML/HTML builder I use internally by rewriting it to only have one thing on the heap, a single String. Every push happens right at the call site linearly, and by passing data through closures the formatting (indentation, etc.) is controllable. My first iteration was written in the least efficient way possible and had thousands of tiny allocations in nested heap objects, it was painfully slow.

  • eldenring 1 day ago

    CompactStr doesnt have any additional runtime overhead iirc right? So in theory you can drop it in everywhere even when you expect > 25 chars. Maybe an extra branch in the >25 char case?

    • kibwen 1 day ago

      SSO does have overhead. Firstly, on every access you have a branch. Secondly, and more severely, the "most general" umbrella type that all string methods are defined on is a string slice, and whereas conversion from `String` to `&str` is literally a no-op, SSO strings require work to be done to convert them to string slices. Furthermore, note that in the (surprisingly common) case where the string is zero-length, String already skips the allocation, same as an SSO string.

  • SkiFire13 1 day ago

    > String, which is literally (under the hood, you can't poke this legitimately) Vec<u8>

    `String::as_vec_mut` kinda implies that, since it gives you access to that underlying `Vec` which must then exist somewhere.

    • _flux 1 day ago

      I looked it up: https://doc.rust-lang.org/std/string/struct.String.html#meth...

      In case anyone else was wondering it, yes, it's "unsafe".

      • tialaramex 1 day ago

        The thing they were gesturing at, correctly, is the naming. This is of course a convention and not a promise, but by convention Goose::as_crow would be a function that is cheap and gets you say &Crow instead of the &Goose you might have now, whereas Goose::to_donkey suggests that although we can have a Donkey instead of this Goose it's expensive to do that.

        Commonly as... conversions are actually no-ops at runtime (the type changes but the data does not, no CPU instructions are emitted) whereas to... conversions might do quite a lot, especially if they bring into existence an actual thing at runtime -- maybe Goose::to_donkey actually needs to go allocate memory for a Donkey and destroy the Goose.

        Yes it's unsafe because the Vec doesn't enforce the promise we made about this being UTF-8 text whereas String did, so now that promise is ours to keep and `unsafe` is how we signify that you the programmer took on the responsibility for safety here.

        • SkiFire13 1 day ago

          Yes, naming does play a role here, but the biggest hint is `as_vec_mut` returning a reference. For that to work a `Vec<u8>` needs to exist somewhere, and continue to exist after this function returns. For comparison, `to_` conversions generally just return the new data, so this reasoning doesn't apply to them.

  • lenkite 1 day ago

    I wish Box and Option got specialized specialized shorthand syntax in Rust say `^`/ `? or something like that.

       Option<Box<SmithyTraits> -> SmithyTrait^?
    • ninkendo 1 day ago

      Box<T> used to be ~T early on in rust… (then it became a `box` keyword, before being removed entirely.) They got rid of it because they wanted to move more things into libraries and have a less opinionated compiler.

      I think I agree though, especially with Option. Swift’s option syntax (and kotlin’s which is similar) is so much better, a simple question mark in the type. Options are important enough that dedicated syntax makes so much sense. Rust blew their chance here with ? meaning “maybe early return”, it would have been a lot more useful as an Option indicator.

      • kibwen 1 day ago

        I used to think this too, but nowadays I think this would have been a mistake. Result is actually the more important and fundamental type, not Option, and giving Option special syntax would cause people to want to favor it over Result even when it shouldn't be. If anything, I think that `?` should have been reserved as a suffix for function names (not types) that return Result, so that instead of `fn try_foo()` and `let x = try_foo();` we could have `fn foo?()` and `let x = foo?();`, and then the current `?` operator could just be spelled `.try`, akin to `.await`. (And then we maybe could go further and reserve `!` as a suffix for functions that can panic...)

      • foresterre 1 day ago

        The Try trait (representing the ? the operation) is super cool though! I wish it was marked stable so you could implement it for types without using the nightly compiler.

        Note that both Option and Result implement that same trait.

        Perhaps if try blocks ever become a thing... we can finally use it for our own types ;)

        https://doc.rust-lang.org/std/ops/trait.Try.html

        • tialaramex 23 hours ago

          The modern Try trait (try_v2) is indeed wonderful and I hope to see it stabilized one day.

          AIUI a key innovation is ControlFlow, reifying the Break/ Continue choice as a sum type in the type system. This is already stable and is a useful piece of vocabulary even without its contribution to understanding the Try trait.

          Knowing that Bob's CircusPerformance trait and Sarah's SeaLion type both use ControlFlow to decide whether we should keep going or halt ASAP means you don't have to write fraught adaptor code because Bob thought obviously the boolean "true" means keep going while Sarah's understanding was that it's a signal about being finished, so "true" means stop.

          For Try what ControlFlow did was unlock the difference between "Success / Failure" as encoded by Result::Ok and Result::Err and the "Halt / Carry on" distinction ControlFlow::Break and ControlFlow::Continue. Often we want to stop when there's an error, but sometimes we mean the exact opposite, carry on trying things until one of them succeeds.

  • arijun 23 hours ago

    What does Box<str> give you that &str doesn’t?

    • h4x0rr 23 hours ago

      Ownership

_alphageek 1 day ago

If anyone's doing this kind of optimization, dhat-rs is worth a look, it shows you exactly which fields and call sites are eating memory, instead of just a total. Saves a lot of guessing about where to start.

Groxx 1 day ago

tbh "trait" feels like a very problematic name for that type, for this kind of educational purpose - `trait` is already an established concept and keyword: https://doc.rust-lang.org/book/ch10-02-traits.html

It's especially problematic because traits don't have memory behaviors like this article in most cases - by default they're unsized, because it's a description of behavior, not data, and you can't even use them as a struct field without extra work.

Like, replace "trait" in here with "box" and see how confusing it would be to be describing how you saved memory by boxing your box, because option doesn't box like many other languages do.

mstange 1 day ago

Are there any tools that help finding these kinds of things? Like a profiler that says "80% of the allocated bytes are objects of this type, with 95% of those having that field set to None"

  • gizmo686 1 day ago

    The closest I am aware of is clippy (`cargo clippy` in a standard Rust project will run it with default configurations).

    Clippy is essentially a linter; and one of its checks catches cases where different enum variants have a significantly different size; with a suggestion to Box the larger variant.

    Since this is just a linter, it doesn't actually have any knowledge of how frequently each variant is actually used. It also doesn't address the situation in the article at all.

  • majormajor 1 day ago

    It would be super useful since I think this is pretty likely to be surprising to many users. But the profiler would need to be a particularly-specific refinement of even that: you need to make it obvious that it's not "95% of your Option<Thing>s are None, and your Option<Things> are using X bytes", but that "95% of the bytes used for your Option<Thing>s are used for None versions." Otherwise you could just assume that your non-None ones are just that chunky, or you have that many of them... I haven't seen a profiler with that level of insight, unfortunately.

    Perhaps because this feels like a fairly rust-specific gotcha. Especially if you're coming from languages where there's often not much syntactical distinction made between "this is a pointer because I don't want to be copying it" and "this is a pointer because it's optional."

    For instance, it's not until now that I actually understood what the sibling comment about the Enum type size discrepancy lint meant: "This lint obviously cannot take the distribution of variants in your running program into account. It is possible that the smaller variants make up less than 1% of all instances, in which case the overhead is negligible and the boxing is counter-productive. Always measure the change this lint suggests." I had always accidentally read this backwards, thinking it meant something more to the effect of "if most of the instances are actually small, then it's not a problem here, but be aware that some of them are much larger so some of your calls to things with this could end up passing much larger types."

    • Groxx 1 day ago

      "You have 400 megabytes of zeros in <this type>" is probably a pretty easy heuristic to add.

      • gizmo686 1 day ago

        That may be surprisingly difficult in Rust. We generally think of Option<T> using O to represent None. However, it can actually use any invalid value of T

        • tialaramex 1 day ago

          For example None of Option<OwnedFd> is the bit pattern for the integer -1, the invalid Unix file descriptor

          And in this particular context None of Option<CompactString> is the bit pattern for a carefully chosen impossible 24 byte slice, all zeroes is of course a completely valid way to spell 24 of the ASCII NUL U+0000 character so we can't use that to signify None, but many 24 byte slices are not valid UTF-8 encodings.

          When some day I get to make my own BalancedI8 in stable Rust (the 8-bit signed integer except without the slightly annoying and rarely needed most negative value -128) then None of Option<BalancedI8> will occupy the bit pattern for -128 which is 0x80

          • kawogi 1 day ago

            I often wish this type existed, but didn't find a way to define the niche value for the compiler to optimize the None-case.

            Do you know a solution to this?

            • tialaramex 1 day ago

              I suppose you are thinking of BalancedI8 ?

              So, the long answer with lots of reading material is that you want "Pattern Types" and who knows when that could land in Rust. Pattern types are a simple Refinement Type, you can go absolutely wild with refinement in theory and some niche languages go much deeper but all we want here is to say "Only these values of the type are allowed" and by implication all bit patterns not used for those values are a niche and Rust would optimise accordingly.

              Rust doesn't (even unstably) have Pattern Types so you might wonder how NonZeroU32 works for example, or indeed OwnedFd, and similar types. For these types there's a permanently unstable "compiler only" feature flag which allows you to explicitly specify the niche, that's how they're made. So, if you're comfortable writing unstable nightly-only software you can do this today, go look at the guts of NonZeroU32 for example.

              If you want to write Rust software for ordinary people who use stable Rust you have two options today and for the indefinite future in which Pattern Types are not stable:

              1. Enumerations: An enumeration which has say 5 values clearly doesn't need byte values 0x5 through 0xFF, so Rust stably promises this is a niche. For BalancedI8 that's kinda messy with 255 values but tolerable, some real crates use this trick or one related to it and they work fine - for a hypothetical BalancedI16 it's imaginable to create 65535 values but silly, and for BalancedI32 clearly we should not expect the compiler to accept an enumeration with 4 billion+ named values so no...

              2. XOR trick: Rust provides NonZeroI8 for example. We can "make" BalancedI8 by providing accessors which always XOR -128 (0x80) with the value and then actually internally we store the NonZeroI8 instead, at runtime now operations incur an extra XOR which is a very cheap ALU operation. This works for any "Not this" value because of the properties of the XOR operation, so unlike with enums this is practical for BalancedI64 for example.

            • wongarsu 1 day ago

              There is nook which does it with unstable features [1] and nonany [2] which uses xor operations to map your custom niche value to 0 so it can use the NonZero* types to achieve the same in stable rust.

              Eventually rust will likely gain pattern types as a more generally useful version of the features used in nook. There is even some actively ongoing work on this [3]

              1: https://github.com/tialaramex/nook/blob/main/src/balanced.rs

              2: https://github.com/rick-de-water/nonany

              3: https://github.com/rust-lang/rust/issues/135996

        • Groxx 10 hours ago

          Niche values: fair! You can't just scan memory looking for 0000000000... (... for some types at least)

          But if a debugger can show you all values of type X, and the compiler defines null-y values of X, surely there's some way? That still doesn't seem complicated, just not as overwhelmingly-trivial as some languages.

  • groundzeros2015 1 day ago

    I think the number of instances should be a clue that you need to look at the layout.

  • JavierFlores09 1 day ago

    I've personally found heaptrack[1] pretty good for this task, very straightforward to use and the info is detailed enough. Though, it'll only tell you where they are happening (e.j. allocation rate for Box::new), but not exactly what type they are given that info isn't available at runtime. Usually that kind of thing would be reserved to GC-based languages where they keep track of counts for each object.

    1: https://github.com/kde/heaptrack

vlovich123 1 day ago

Small correction:

> a lot of boxes means a fragmented heap. In such case it's not a problem but this might be worth keeping in mind.

A good malloc will be able to handle this without issue due to various optimizations specifically that inherently fight fragmentation. Default Linux malloc (glibc) may have issues but I did say good malloc (and even glibc generally shouldn’t struggle with the pattern described I think).

el_pollo_diablo 1 day ago

So there are now two ways to represent the same state: None or Some(struct whose fields are all None). Even though one of these representations is never produced by the deserialization routine, anyone could construct it if the constructor is public. And even if they don't, the different representations will show up in pattern matching as separate paths for every access to the field. This looks like a good opportunity to make these types (optimized for storage) private, and to define public view objects/accessors (optimized for usage) on top of them that merge equivalent representations.

mattstir 22 hours ago

This is an interesting concept. I was initially thinking that this would only lower the memory usage of the stack by moving everything to the heap, but of course you can skip heap allocation if the struct you'd be instantiating is "empty". The stack doesn't really have that luxury since the compiler needs to guarantee that there's space on the stack for the value, if it were to be initialized.

krautsauer 1 day ago

[Edit:deleted]

  • stebalien 1 day ago

    Box<str> is still two words (length and pointer). That's better than the 3 words (length, pointer, capacity) for strings, but Box<String> is one word (not including the heap allocation).

squirrellous 1 day ago

I wonder from time to time whether you can decide the best “schema shape” beforehand, ie before you can run real workloads that stress the memory implications of such things. This can be very useful if you are trying to decide the boundary of some public facing API, but for whatever reason can’t run benchmarks (lack of impl, data, time, etc).

Without that, if you try to suggest a transformation like this when the schema is first conceived, it will likely be considered premature optimization.

WhyNotHugo 1 day ago

TLDR: use a nullable pointer instead of fields in nested structs to save memory.