I have written a lot of SIMD for both x86 and ARM over many years and many microarchitectures. Every abstraction, including autovectorization, is universally pretty poor outside of narrow cases because they don’t (and mostly can’t) capture what is possible with intrinsics and their rather extreme variation across microarchitectures. If I want good results, I have to write intrinsics. No library can optimally generate non-trivial SIMD code. Neither can the compiler. Portability just amplifies this gap.
I think a legitimate criticism is that it is unclear who std::simd is for. People that don’t use SIMD today are unlikely to use std::simd tomorrow. At the same time, this does nothing for people that use SIMD for serious work. Who is expected to use this?
The intrinsics are not difficult but you do have to learn how the hardware works. This is true even if you are using a library. A good software engineer should have a rough understanding of this regardless.
For me the main issue is that if you're serious about SIMD, you need to use a state-of-the-art library and can't rely on some standard library whose quality is variable, unreliable, and which is by design always behind.
For some algorithms you have to compromise the data layout for compatibility across the widest number of microarchitectures by nerfing the performance on advanced SIMD microarchitectures working on the same data structures. There really isn’t a way to square that circle. You can make it portable or you can make it optimal, and the performance gap across those two implementations can be vast.
In the 15-20 years I’ve been doing it, I’ve seen zero evidence that there is a solution to this tradeoff. And people that are using SIMD are people that care about state-of-the-art performance, so portability takes a distant back seat.
NumPy has a whole dispatch mechanism to deal with the tradeoffs. The main problem is code bloat: how many microarchitectures are you going to support with dispatch at runtime?
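To make the bloat concern concrete, here is a minimal sketch of runtime dispatch. It is not NumPy's mechanism; the function names (`sum_baseline`, `sum_avx2`, `select_sum`, `dispatched_sum`) are hypothetical, and the x86 path assumes GCC or Clang, whose `__builtin_cpu_supports` and `target` attribute are used here. On other targets only the baseline is built.

```cpp
#include <cstddef>
#include <cstdint>

// Baseline version: compiled for whatever the default target is.
std::int64_t sum_baseline(const std::int32_t* p, std::size_t n) {
    std::int64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) total += p[i];
    return total;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
// Same source, but the compiler is allowed to use AVX2 for this copy.
__attribute__((target("avx2")))
std::int64_t sum_avx2(const std::int32_t* p, std::size_t n) {
    std::int64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) total += p[i];
    return total;
}
#endif

using SumFn = std::int64_t (*)(const std::int32_t*, std::size_t);

// Pick an implementation based on what the running CPU reports.
SumFn select_sum() {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_baseline;
}

// The selection runs once; later calls pay one indirect call.
std::int64_t dispatched_sum(const std::int32_t* p, std::size_t n) {
    static const SumFn impl = select_sum();
    return impl(p, n);
}
```

Every additional target means another compiled copy of every kernel plus another branch in the selector, which is where the code size goes.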
NumPy is interesting in that regard since its dispatch mechanism adds up to a lot of overhead. There are a lot of problems where a naive list comprehension is faster, even when SIMD could be used to great effect.
For Boost.SIMD (which is what became Eve), a large part of what we did to tackle those problems was building an overload dispatching system so that we could easily inject increasingly specialized implementations depending on the types and instruction set available, in such a way that operations could combine efficiently.
That, however, performed quite poorly at compile-time, and was not really ODR-safe (forceinline was used as a workaround). At least one of the forks moved to using a dedicated meta-language and a custom compiler to generate the code instead. There are better ways to do that in modern C++ now.
We also focused on higher-level constructs trying to capture the intent rather than trying to abstract away too low-level features; some of the features were explicitly provided as kernels or algorithms instead of plain vector operations.
Don't let the best be the enemy of the good. I got amazing performance from swapping for-loops with some simple SIMD patterns. Moreover, by doing this, I noticed that the codebase started to become better shaped for performance as well. By writing SIMD patterns, you get into the mindset of tight, hot loops.
The problem is that you're better off by defining SIMD friendly data structures and letting the compiler figure it out than by hand coding the actual SIMD operations.
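As a rough illustration of what "SIMD friendly data structures" can mean, here is a sketch contrasting array-of-structures with structure-of-arrays. The types and the `advance_x` function are hypothetical, and whether a given compiler actually vectorizes the loop depends on flags and target.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures: x, y, z of one particle sit next to each other,
// so a loop over only x strides through memory.
struct ParticleAoS { float x, y, z; };

// Structure-of-arrays: each field is contiguous, which is the shape
// autovectorizers tend to handle best.
struct ParticlesSoA {
    std::vector<float> x, y, z;
};

// Plain scalar loop, no intrinsics. vx must hold at least p.x.size() values.
void advance_x(ParticlesSoA& p, const std::vector<float>& vx, float dt) {
    const std::size_t n = p.x.size();
    float* x = p.x.data();
    const float* v = vx.data();
    for (std::size_t i = 0; i < n; ++i) x[i] += v[i] * dt;
}
```

The loop body is unit-stride loads, a multiply-add, and a unit-stride store, which is about the easiest case there is for the compiler to figure out.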
If you wanted to explicitly opt into bundling/batching of operations, you wouldn't actually want to define a fixed register size. You'd want a data type that represents an arbitrarily sized register and exposes some across batch operations. Then the compiler can make use of this mini DSL to optimize your SIMD code to actual instructions.
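One possible reading of that idea, sketched in plain C++: the kernel is written against a batch type whose width is a parameter, not a constant baked into the algorithm. `Batch`, `reduce_add`, and `dot` are hypothetical names, and a real design would leave the width to the compiler instead of to a template argument.

```cpp
#include <array>
#include <cstddef>

// Hypothetical batch type: N is a tuning knob, not part of the algorithm.
template <class T, std::size_t N>
struct Batch {
    std::array<T, N> lane;

    static Batch load(const T* p) {
        Batch b;
        for (std::size_t i = 0; i < N; ++i) b.lane[i] = p[i];
        return b;
    }
    friend Batch operator*(const Batch& a, const Batch& b) {
        Batch r;
        for (std::size_t i = 0; i < N; ++i) r.lane[i] = a.lane[i] * b.lane[i];
        return r;
    }
    // An across-batch operation: horizontal sum of all lanes.
    T reduce_add() const {
        T s{};
        for (std::size_t i = 0; i < N; ++i) s += lane[i];
        return s;
    }
};

// Dot product written once; the width N is chosen at the call site.
template <std::size_t N, class T>
T dot(const T* a, const T* b, std::size_t n) {
    T total{};
    std::size_t i = 0;
    for (; i + N <= n; i += N)
        total += (Batch<T, N>::load(a + i) * Batch<T, N>::load(b + i)).reduce_add();
    for (; i < n; ++i) total += a[i] * b[i];  // scalar tail
    return total;
}
```

Calling `dot<4>(a, b, n)` or `dot<8>(a, b, n)` gives the same result for integer inputs; only the batching changes.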
The problem is solvable, but it requires cooperation from all parties. CPU vendors must offer a basic set of vector instructions that is supported on all architectures. The language committee must be willing to support function local variable size data types that are never exposed in the ABI. The compiler developers must increase the quality of their auto vectorizers.
> The problem is that you're better off by defining SIMD friendly data structures and letting the compiler figure it out than by hand coding the actual SIMD operations.
This will work only for the most basic SIMD usages.
> CPU vendors must offer a basic set of vector instructions that is supported on all architectures.
This will take decades because you cannot change existing architectures/processors.
Yeah, AVX-512 is basically dead as a universal target for x86; the future is now AVX10. But I believe there is a reasonable subset that will work on both.
It's a little dramatic to say AVX-512 is dead versus AVX10; rather, I would say that AVX10 finalizes a universally available set of AVX-512 extensions. For AVX10.1, there's essentially no difference after Intel backed out of reducing the vector length.
For at least the next decade AVX-512 will be the high-performance target, reaching all of the Zen 4/5/6 CPUs as well as whatever AVX10-enabled CPUs Intel produces.
This works today :) Highway provides such an abstraction for arbitrary vector lengths and maps them to intrinsics. All on the library level, no need to wait years for compiler or language updates.
> I think a legitimate criticism is that it is unclear who std::simd is for.
I think it's for people like me, who recognize that, depending on the dataset, a lot of performance is left on the table when you don't take advantage of SIMD, but who are not interested in becoming experts on intrinsics for a multitude of processor combinations.
Having a way to be able to say "flag bytes in this buffer matching one of these five characters, choose the appropriate stride for the actual CPU" and then "OR those flags together and do a popcount" (as I needed to do writing my own wc(1) as an exercise), and have that at least come close to optimal performance with intrinsics would be great.
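For reference, the scalar shape of that "flag and count" step might look like the sketch below. The five characters are an assumption (the comment does not say which ones were used), `count_flagged` is a hypothetical name, and this version leaves the stride choice to the autovectorizer.

```cpp
#include <cstddef>

// Count bytes equal to any of five whitespace characters.
std::size_t count_flagged(const unsigned char* buf, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned char c = buf[i];
        // Bitwise | keeps the body branch-free: five compares, OR, then add.
        const unsigned flag = unsigned(c == ' ') | unsigned(c == '\n') |
                              unsigned(c == '\t') | unsigned(c == '\r') |
                              unsigned(c == '\f');
        count += flag;
    }
    return count;
}
```

A hand-written SIMD version would do the same five compares on a whole register of bytes at once, OR the masks, and popcount the result.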
Just like I'd rather use a ranged-for than to hand count an index vs. a size.
> People that don’t use SIMD today are unlikely to use std::simd tomorrow.
I mean, why not? That's exactly my use case. I don't use SIMD today as it's a PITA to do properly despite advancements in glibc and binutils to make it easier to load in CPU-specific codes. And it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions. But it is legitimately important for improving performance for many workloads, so I don't want to miss it where it will help.
And even gaining 60 or 70% of the "optimal" SIMD still puts you much closer to the highest performance than the alternative.
In the end I did end up having to write some direct SIMD intrinsics, I forget what issue I'd run into starting off with std::simd, but std::simd was what had made that problem seem approachable for the first time.
You raise some good points. I think a lot about how to make SIMD more accessible, and spend an inordinate amount of time experimenting with abstractions, because I’ve experienced its many inadequacies.
The design of the intrinsics libraries does itself no favors and there are many inconsistencies. Basic things could be made more accessible but are somewhat limited by a requirement for C compatibility. This is something a C++ standard can actually address: it can be C++ native, which can hide many things. Hell, I have my own libraries that clean this up by thinly wrapping the existing intrinsics, improving their conciseness and expressiveness for common use cases. It significantly improves the ergonomics.
An argument I would make though is that the lowest common denominator cases that are actually portable are almost exactly the cases that auto-vectorization should be able to address. Auto-vectorization may not be good enough to consistently address all of those cases today, but you can see a future where std::simd is essentially vestigial: auto-vectorization subsumes what it can do, yet it can’t be leveled up to express more than what auto-vectorization can see, due to limitations imposed by portability requirements.
The other argument is that SIMD is the wrong level of abstraction for a library. Depending on the microarchitecture, the optimal code using SIMD may be an entirely different data structure and algorithm, so you are swapping out SIMD details at a very abstract macro level, not at the level of abstraction that intrinsics and auto-vectorization provide. You miss a lot of optimization if you don’t work a couple levels up.
SIMD abstraction and optimization is deeply challenging within programming languages designed around scalar ALU operators. We can’t even fully abstract the expressiveness of modern scalar ALUs across microarchitectures because programming languages don’t define a concept that maps to the capabilities of some modern ALUs.
That said, I love that silicon has become so much more expressive.
IMO what's needed is ISPC-like guided autovec with a lot of hinting support to control codegen (e.g. hint for generating an unrolled version only or an unrolled and non-unrolled version).
Basically something like #pragma omp simd, but actually designed for the SIMD model, not the parallel one, that errors when vectorization isn't possible.
Ideally it would support things like reductions, scans, reference of elements from other iterations (e.g. out[i] = in[i-1]+in[i+1]), full gather scatter, early break, conditional execution control (masking or also a fast-path, when no active elements), latency vs throughput sensitive (don't unroll or unroll to max without spilling), data dependent termination (fault-only-first load or page aligned for things like strlen), ...
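For comparison, this is roughly what the neighbor-reference example looks like with today's OpenMP pragma. It is only a sketch: `neighbor_sum` is a hypothetical name, the pragma is a hint that takes effect with `-fopenmp` or `-fopenmp-simd` and is otherwise ignored, and it does not turn a failed vectorization into an error, which is the gap described above.

```cpp
#include <cstddef>

// out[i] = in[i-1] + in[i+1] for interior elements; the two ends are left
// untouched. in and out must not overlap.
void neighbor_sum(const float* in, float* out, std::size_t n) {
    if (n < 3) return;
    // Tells the compiler the iterations are independent.
    #pragma omp simd
    for (std::size_t i = 1; i < n - 1; ++i)
        out[i] = in[i - 1] + in[i + 1];
}
```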
> it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions
This is one complaint I toss back at Intel and AMD.
If an instruction/intrinsic is universally worse than another set of intrinsics for the P90/P95/P99 use case where it's going to be used, then it shouldn't exist. Stop wasting the die space and instruction decode on it, not to mention the developer time wasted finding out that your dot product instruction is useless.
There are a lot of smart people that have worked on compilers, optimized subroutines for LAPACK/BLAS, and designed the decoders and hardware. A lot of that effort is wasted because no one knows how to program these weird little machines. A little manual on "here's how to program SIMD, starting from linear algebra basics" would be worth more to Intel than all the money they've wasted trying to improve autovectorization passes in ICC and now, LLVM.
Have you considered our Highway library? Runtime dispatch need not be a PITA :) It's basically portable intrinsics, and a much more complete set (>300) than the ~50 in std.
Yes, the EMU128 target is scalar only, with for loops. This is a fun way to see how well autovectorization works, with the same source code.
That works on any CPU. Curious which projects have such concerns, any link?
People reported challenges building V8 (whether upstream or the Node.js variant) on s390x with z13 support. I don't know if it was discussed on the porters mailing list because it's not public: https://groups.google.com/g/v8-s390-ports
Thanks for sharing. The first link seems non public indeed.
I can imagine there is some compile issue we could reasonably fix, with the help of someone who has Z13 access. Please encourage them to raise an issue. I will be back on May 26.
After that, it should at least be able to use the scalar fallback.
The issue with Z13 is that it lacks fp32 support. Would their usage be integer only?
I hadn't but it would make sense for doing my own personal programming challenges.
Given the ongoing disasters around the software supply chains I've been fighting the creeping NPM-ism that people are trying to introduce to C++, where you just FetchContent 20 different libraries to build your own app upon.
I do use gtest, fmt and a few others though, so something as broadly used as Highway would probably be fine by that standard as well. But I'd still like it better if there was a Good Enough solution that was part of C++ stdlib to reduce the number of external integrations that are deemed required for a modern C++ program.
Fair point. If it helps, our security team has called Highway critical infrastructure and helped to harden the repo.
The flip side of standardization is that it would be much harder and slower to add ops as the need arises, which we do regularly.
> I think a legitimate criticism is that it is unclear who std::simd is for
It's for people that don't use SIMD today.
SIMD is hard, or at least nuanced and platform-dependent. To say that std::simd doesn't lower the learning curve is intellectually dishonest.
---
Despite the title, the primary criticism of the article is that the compilers' auto-vectorizers have improved more than the currently shipped stdlib version.
My criticism could mostly be summarized similarly. The scope of what a portable std::simd can do is almost exactly the scope that you would expect auto-vectorization to subsume over time. SIMD, to the extent it is covered by std::simd, is the part of SIMD that should be pretty simple to learn.
There isn’t an obvious path to elevate it above what auto-vectorization should theoretically be capable of in a portable way. This leads to a potential long-term outcome where std::simd is essentially a no-op because scalar code is automagically converted into the equivalent and it is incapable of supporting more sophisticated SIMD code.
The full scope of what SIMD is used for is much larger than parallelizing evaluation of numeric types and algorithms.
For example, it is used for parallel evaluation of complex constraints on unrelated types simultaneously while packed into a single vector. Think a WHERE clause on an arbitrary SQL schema evaluated in full parallel in a handful of clock cycles. SIMD turns out to be brilliant for this but it looks nothing like auto-vectorization.
None of the SIMD libraries like Google Highway cover this case.
Almost literally what I stated. Consider a row in Postgres table or similar. Convert the entire WHERE clause across all columns in that table into a very short sequence of SIMD instructions against the same memory. All of the columns, regardless of type, are evaluated simultaneously using SIMD. For many complex constraints you can match rows in single digit clock cycles even across many unrelated types. This is much faster than using secondary indexes in many cases.
It isn’t hypothetical, I’ve shipped systems that worked this way. You can match search patterns across a random dozen columns across a schema of hundreds of columns at essentially full memory bandwidth.
OK, I thought it couldn't be that, because that should be doable with std::simd or a SIMD abstraction. Well, unless you JIT it, in which case intrinsics wouldn't help either.
> You can match search patterns across a random dozen columns across a schema of hundreds of columns at essentially full memory bandwidth
Do I understand it correctly that this would only work if you have multiple of the same comparisons (e.g. equality check with same sized data) in the WHERE clause and the relevant columns are within one multiple of the SIMD width of each other?
Every column has its own independent constraint: equality, order, range intersection, bit sets, etc that is evaluated concurrently in single operations. Independent per column in parallel. It does require handling the representation of columns to enable it but that isn’t onerous in practice.
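A minimal sketch of one way this can be expressed, under heavy assumptions: every column is widened to a 32-bit lane and every constraint is an inclusive range (equality is `lo == hi`, an unconstrained column gets the full range). `Row`, `Predicate`, `match_all`, and `row_matches` are hypothetical, and this is not the representation used in the system described above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <limits>

constexpr std::size_t kLanes = 8;  // hypothetical: 8 x 32-bit columns per row
using Row = std::array<std::int32_t, kLanes>;

// One inclusive range per column.
struct Predicate {
    Row lo;
    Row hi;
};

// A predicate that accepts every row; callers then tighten single lanes.
Predicate match_all() {
    Predicate p;
    p.lo.fill(std::numeric_limits<std::int32_t>::min());
    p.hi.fill(std::numeric_limits<std::int32_t>::max());
    return p;
}

// Each lane is tested against its own bounds with no per-column branch,
// so the loop can map to two vector compares plus a reduction.
bool row_matches(const Row& row, const Predicate& p) {
    unsigned ok = 1;
    for (std::size_t i = 0; i < kLanes; ++i)
        ok &= unsigned(row[i] >= p.lo[i]) & unsigned(row[i] <= p.hi[i]);
    return ok != 0;
}
```

The point of the sketch is that the predicate is data, not code: changing the WHERE clause changes the contents of `lo` and `hi`, not the instruction sequence.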
It isn’t intuitive but it is one of those things that is obvious in hindsight once you see how it works. The gap is that people struggle to understand how to make this something SIMD native, especially in high-performance systems.
Ah, so you're just doing SoA or AoSoA layout? It sounded like you were doing something more special than the standard SIMD use case.
This does easily work with SIMD abstractions and even length-agnostic vector ISAs, unless you're doing AoSoA and your storage format has to match your memory format and it has to be the same on all machines. In that case you probably want to do something like 4K blocks anyway, which makes it agnostic to every vector length anybody reasonably cares about for this type of application.
Autovectorisation is the main way SIMD hardware gets put into use, whether you think it's pretty poor or not.
SIMD went mainstream with the Pentium MMX in 1997 and has proven rather difficult for compilers to target, but after nearly 30 years it is doing a bit better despite PLT conspiring against it (see e.g. CUDA, Futhark, etc.).
In my limited experience with looking at autovectorisation compiler output, gcc is quite bad unless you hold its hand, and clang tries to autovectorise everything it sees.
Compilers have definitely got better, though. Another issue in the past (maybe it still is to a degree, although compilers have got a lot better at this in the past 15 years; it used to be one of the things only Intel's ICC actually got right) was that if you wrapped the base-level '__m128' or 'float32x4_t' in a struct/union in order to provide some abstraction, the compiler would often lose track of this when passing the struct/union through functions (either by value or const ref). It would often end up 'spilling' the variable from registers (not entirely the correct terminology in this context, but...), producing asm that uselessly reloaded the variable from a stack address further up the call stack when it didn't actually need to. So that was the situation even when using intrinsics within custom wrappers.
From 2011 to around 2013 ICC seemed to be the only compiler on amd64 which wouldn't do this. If you passed the actual '__m128' down the function call chain instead, clang and gcc would then do the right thing.
Part of that could be ABI constraints. There are some surprising calling convention differences between a vector and a struct or union with vectors in it, and they vary platform to platform. E.g. on ARM a struct with two 128-bit vectors will pass in two registers where on x86 it must pass via the stack.
Using __attribute__ to tweak calling conventions can often really clean this up, but that's just as obscure and non-portable as the problem it fixes. So you either end up writing weird non-portable code one way or weird non-portable code another... Code working with these types doesn't get to benefit from zero-cost abstraction to the degree we're used to with normal scalar code.
In such discussions, whenever you mention abstractions are universally "pretty poor", to the extent anyone is listening, I think this hyperbole can do real damage. Maybe it prevents people from getting relevant performance gains, even if not 100% of the optimum, which is anyway unattainable. And what is the alternative? Not many projects can afford to hand write intrinsics for all platforms. And are you aware that Highway is basically a thin wrapper over intrinsics, which you can still drop down to where it helps?
Not who you asked but I think the meaning is that since intrinsics for simd are different in each platform, being able to have something that is portable and sometimes works faster is something, while writing for Intel, ARM and a zoo of instruction sets is not an option for some.
:) I figure there is always something left to improve. For some kernels which really want to keep 30+ live registers, the compiler might not do as good a job as careful manual tuning, so intrinsics can have a bit of a cost. But I also figure optimization time is limited, so better to get 90% of several kernels rather than one to 99%.
Besides Spolsky's law of leaky abstractions, "abstractions" can also result in "lowest common denominator" situations, which are the opposite of performance optimization. Talking negatively about abstractions is not what deals damage; you are shooting the messenger here. It's the abstractions themselves that deal damage when misplaced. "Zero-cost abstractions" is the true hyperbole.
Is this a good faith reply? The particular abstraction we built, and is being discussed, is manifestly and obviously not a lowest common denominator.
Looks like you are deploying a second straw man, that of zero cost. In other comments here I acknowledge a cost to intrinsics.
I am aware of Highway. It doesn’t add much value for the kind of SIMD code I write. I have better abstractions because I don’t have to consider portability nearly as much. Some useful constructions don’t have a good expression on weaker SIMD architectures.
Do you say that from the perspective of compiled languages? I hear good things about .net core wrt SIMD, but that has the advantage it can decide at JIT.
I'm not the person you're asking, but I share that opinion for both compiled languages and JIT solutions, including .net core specifically. All but the most trivial use cases can't be autovectorized, by JIT or otherwise. One of the recent things I worked on (Reed-Solomon decoding) offers basically zero opportunities for autovectorization unless the compiler reinterprets certain scalar loops as dedicated Galois instructions on AVX512F hardware, but that optimization isn't implemented, it wouldn't help other architectures anyway, and it's still 10x slower than a well thought out vectorized approach.
I think what a SIMD library does, above all else, is get the programmer to write code in a way that can be directly translated into SIMD instructions. A big issue that compilers have to contend with is that they aren't allowed (unless you enable -ffast-math) to rearrange floating point operations. Putting an add or a multiply in the wrong place can spoil SIMD optimizations that the compiler could otherwise pull off.
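A small sketch of the reassociation problem, with hypothetical function names. The first loop forces a strict left-to-right chain of float adds; the second writes the regrouping into the source, so the compiler does not need permission to reorder anything. For general inputs the two can differ in the last bits, which is exactly why the compiler will not make the change on its own.

```cpp
#include <cstddef>

// Strict left-to-right sum: each add depends on the previous one, and the
// compiler may not reorder float adds without flags such as -ffast-math.
float sum_sequential(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Four independent accumulators: the regrouping is explicit in the source,
// so they can be mapped onto one 4-lane vector without extra flags.
float sum_four_lanes(const float* p, std::size_t n) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (std::size_t k = 0; k < 4; ++k) acc[k] += p[i + k];
    float s = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (; i < n; ++i) s += p[i];  // scalar tail
    return s;
}
```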
But the problem is as you state. For people that really care about that sort of thing, they are likely going to have the exact SIMD sequence they want to execute in mind anyways. That leaves you with a definition that is doomed to be both not low level enough and too low level.
I think what this is useful for is a fallback description of the desired SIMD operations. It won't be ideal on non-targeted platforms, but it will be something.
> People that don’t use SIMD today are unlikely to use std::simd tomorrow.
Why? At least for me, I will start with std::simd in my pet projects. If that is not enough, I will move on to intrinsics. But I think starting with std::simd would be much simpler for a beginner.
> I think a legitimate criticism is that it is unclear who std::simd is for. People that don’t use SIMD today are unlikely to use std::simd tomorrow. At the same time, this does nothing for people that use SIMD for serious work. Who is expected to use this?
There are plenty of vectorization tasks that are simple enough to be done with std::simd today and that will still bring any autovectorizer to its knees for various reasons.
As an anecdote, I recently got an 8x speedup with std::simd (AVX2 & SVE2) on a rather trivial parser of mine that the autovectorizer failed miserably to handle.
Would I have gotten better results using intrinsics? Likely, yes.
Did I want to suffer the maintainability and portability pain associated with it for a simple parser? Certainly not.
For these use cases, std::simd does the job.
And it will probably do a better and wider job with time as it gets enriched by the committee.
The blog brings some valid criticism but really looks like a flame war trying to break down an already open door.
(1) Are there more performant solutions than std::simd for vectorization?
Yes, of course.
The STL evolves slowly; its main goal is to provide generic and portable implementations of a set of algorithms, not to provide the best implementation in existence.
The best implementation of most algorithms (including SIMD patterns) evolves every 6 months; you cannot expect a standard library with 3 different implementations to keep up with that.
(2) Is the future of vectorization ISPC ?
Nope. ISPC has been around for > 10y and is still niche.
There are very good reasons for that: yes, it can generate better code, but in most use cases, adding a massive dependency on a compiler + an arbitrary LLVM version + a DSL to your project is not worth it.
Especially considering that it is an Intel project and that Intel (almost) abandoned the project multiple times (in pure Intel fashion).
So yes, criticism is easy, and yes std::simd is full of problems.
But I am glad it exists, and thanks to the people that made it happen... Because it is useful, even in the current state.
The point about the optimizer only seeing "opaque templates and function calls" makes little sense.
First off, templates are the opposite of opaque due to the fundamental requirement that the implementation be visible to every translation unit using a template. This makes any function calls trivially inlinable.
Second, and the reason for the above requirement, templates are compiled by monomorphization – making a distinct, separately optimizable copy of each concrete instantiation of a template. By the time the compiler backend sees the intermediate representation, there’s nothing about templates left.
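A small illustration of that visibility, using a hypothetical `Vec` wrapper in the style of a SIMD library type. Strictly speaking, `constexpr` evaluation is done by the front end and not by the optimizer, so this only demonstrates that nothing in the template call chain is hidden from the compiler.

```cpp
#include <cstddef>

// A thin, hypothetical wrapper in the style of a SIMD library type.
template <class T, std::size_t N>
struct Vec {
    T lane[N];

    constexpr Vec operator+(const Vec& o) const {
        Vec r{};
        for (std::size_t i = 0; i < N; ++i) r.lane[i] = lane[i] + o.lane[i];
        return r;
    }
    constexpr T sum() const {
        T s{};
        for (std::size_t i = 0; i < N; ++i) s += lane[i];
        return s;
    }
};

constexpr int demo() {
    Vec<int, 4> a{{1, 2, 3, 4}};
    Vec<int, 4> b{{10, 20, 30, 40}};
    return (a + b).sum();
}

// The whole call chain through the template is evaluated at compile time.
static_assert(demo() == 110, "templates are transparent to the compiler");
```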
There are of course reasons why highly abstracted template code may be difficult to optimize, for instance if function call chains are so deep that the inliner gives up. There are also legitimate reasons why a fully language-based solution might beat a library-based one. But one of the points of adding a library to the std is that the standard library is allowed to cheat as much as it wants. It can be deeply integrated into the compiler and implemented entirely using compiler magic if necessary.
std::simd may be too little, too late for many reasons, but I doubt any of them is that the compiler can’t see through the code.
Compiler guy here - yes, the optimizer claim is, in fact, just totally wrong. The claim is it can't be constant folded, scheduled, etc. because it doesn't turn into SIMD instructions. At the bottom, as you say, it will turn into code. That code is almost certainly builtins or uses the vector extensions or whatever, not raw asm statements. Compilers can and do turn these into IR-level SIMD instructions.
> First off, templates are the opposite of opaque due to the fundamental requirement that the implementation be visible to every translation unit using a template.
That's not strictly true, you can have an implementation hidden in a separate TU, as long as that TU instantiates the template for all template arguments that the users are going to use.
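A sketch of that pattern, shown as a single listing standing in for three files; the file names in the comments and the `scale` function are hypothetical.

```cpp
// --- scale.h (what users include): declaration only, no body ---
template <class T>
T scale(T value, T factor);

// --- scale.cpp (library TU): definition plus explicit instantiations ---
template <class T>
T scale(T value, T factor) {
    return value * factor;
}

// Only these two instantiations exist; any other T fails at link time.
template int scale<int>(int, int);
template float scale<float>(float, float);

// --- user.cpp: sees only the declaration, resolved by the linker ---
int user_code() { return scale(6, 7); }
```

The trade-off is that, when split across real translation units, the body is no longer visible to the caller, so it cannot be inlined without LTO. For tiny SIMD wrappers that would defeat the purpose, which is why header-only is the norm there.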
As a compiler guy, the complaint about "opaque templates and function calls" to me raises serious doubts that the author has any idea what they're talking about. std::simd is designed to be akin to taking vector operations as intrinsics on <4 x f32> and similar types and wrapping them in a more C++ dialect than bare compiler intrinsics (and then a second layer on top of that to make things somewhat more portable).
So the implementation of all of the std::simd at the bottom should be tiny functions that map to essentially a single instruction, specified via a header file in a mechanism that guarantees you always have the body. This makes the functions trivially obvious candidates for inlining. Since it's a C++26 addition, the dispatching logic through the layers can largely be done via if constexpr, which means most of the code is discarded by the frontend.
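A rough sketch of what `if constexpr` dispatch through such layers can look like. The tag types and function names are hypothetical and this is not how any particular standard library implements std::simd; it needs C++17.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical ISA tags; a real library would have one per target.
struct ScalarTag {};
struct Wide128Tag {};

template <class Tag, class T>
constexpr std::size_t lanes_for() {
    if constexpr (std::is_same_v<Tag, Wide128Tag>) {
        return 16 / sizeof(T);  // a 128-bit register holds 16 bytes
    } else {
        return 1;               // scalar fallback: one element at a time
    }
}

template <class Tag, class T>
void add_arrays(const T* a, const T* b, T* out, std::size_t n) {
    constexpr std::size_t step = lanes_for<Tag, T>();
    std::size_t i = 0;
    if constexpr (step > 1) {
        // Discarded by the front end for ScalarTag; never reaches the optimizer.
        for (; i + step <= n; i += step)
            for (std::size_t k = 0; k < step; ++k)
                out[i + k] = a[i + k] + b[i + k];
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];  // tail, or the whole loop if scalar
}
```

Usage would be `add_arrays<Wide128Tag>(a, b, out, n)`; the branch not taken leaves no code behind in that instantiation.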
Given that the complaint seems to be about not vectorizing a call to a sin function, it's possible that it's implemented in libstdc++ in such a way that the library doesn't know about the compiler's -fveclib implementation. But then again, the complaint is based on the libstdc++ v14 implementation of the Parallelism TS std::experimental::simd, not the v16.1 implementation of C++26 std::simd, which is completely different (and landed circa 2 months ago).
So I've got a foot in each camp, I think you're just using different languages - you guys mean different things with the same words.
You can't claim he doesn't know what he's talking about over a single point he may have gotten wrong; he makes a ton of valid points, especially around the existing problems of C++ that this library doesn't help with.
In addition to that, he's not wrong about this library from a user perspective. I can't use this. I wrote something very similar back in 2016 - at the time it served my needs but now it's hilariously outdated.
One thing I will point out is that the code in the article is compiled with `-march=native` and `-ffast-math`, meaning that they're really only compiling for the exact same machine they are running on and no other. This seems like it is mainly applicable to places which can easily recompile code for the exact known hardware that they run on, such as HFT and some scientific computing.
Places which compile code to distribute for people to run on a variety of processors and platforms (or that require floating point code to be consistent between them), i.e. games and applications, will still be targeting a low end baseline architecture and therefore have a different outcome. I can say that in this space we are only now reaching the point where we can start compiling for AVX2, as we can expect the lowest end-user processor to support it.
That certainly convinced me. When I was doing my taxes recently and had to watch those forced loading animations, I kept asking myself "why can't my compiler do this?" Thanks to std::simd, now it can!
This is the first time I've seen a classic for loop called a "boomer loop", but apparently this isn't even the first instance (not the first definition):
I made the first proposal to the C++ standard committee to introduce SIMD in 2011, before Matthias Kretz got involved with his own version (which is what became std::simd). This was based on what eventually became Eve (mentioned in the article).
Back then, it was rejected, for the same arguments that people are making today, such as not mapping to SVE well, having a separate way to express control flow etc.
There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language. Then that died out (I'm not sure why), and SIMD became trendy, so the committee was more open to doing something to show that they were keeping up with the times.
Trying to abstract over SVE with a SIMD library is a bit of a fool's errand. The intended programming model is just too different from traditional ISAs, and there are algorithms that are nearly impossible to write efficiently for it. All the ones I've seen wrap it up as a bastardized fixed length ISA, and even ARM's own guidance basically recommends that approach.
Frankly, the length agnostic stuff is a mistake that I hope hardware designers will eventually see the light on, like delay slots.
> Trying to abstract over SVE with a SIMD library is a bit of a fool's errand
It really isn't. You just make the default SIMD-width agnostic and anything less portable opt-in.
You can still specialize for a specific width on scalable vector ISAs.
> The intended programming model is just too different from traditional ISAs, and there are algorithms that are nearly impossible to write efficiently for it.
Such as?
> All the ones I've seen wrap it up as a bastardized fixed length ISA, and even ARM's own guidance basically recommends that approach.
Google Highway doesn't. And while Arm is stuck with 128-bit SVE, because they also have to implement NEON as fast as possible to be competitive, RVV already has a large diversity of hardware available with different vector lengths: 128, 256, 512, 1024.
I have a database that has big columns that get functions applied to them to compute the result set. This is a perfect case for length agnostic instructions, except it ends up horribly memory bound. A nice optimization is to only compute those lanes containing rows that might actually be in the result set by keeping track of a sparse record that depends on the lane size. But the cnt instructions are optional, and this also inhibits compiler optimizations in that lookup.
CNT and CNTP don't seem to be optional for SVE, from what I found. (unless you mean HISTCNT)
It seems to me like you want to use CNTP on a bitset that tells you which rows are relevant, skipping them if CNT is 0?
Is that what you were describing?
I was confused and thinking that streaming mode and CNT were in separate extensions, but they're both in SME. My bad.
Anyway, essentially yes. My previous comment didn't mention all of the context. The join enforces that the result set is the intersection of the individual column sets, so it gets increasingly sparse as individual columns are computed. So I just maintain a bit tree that says which columns could populate the result set and skip computing the other lanes, which depends on the vector width and benefits from knowing it at compile time.
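To make the idea above concrete, here is a rough scalar sketch of the "skip blocks with no live rows" trick. It is not the poster's code: it uses a flat bitset rather than a bit tree, `apply_to_live_blocks` and `kLanes` are made-up names, and the `* 2 + 1` body is a stand-in for whatever function gets applied to the column. The point is only that the block size is a compile-time constant tied to the vector width.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: `live` holds one bit per row (1 = the row may still be
// in the result set). Rows are handled in blocks of kLanes, where kLanes
// stands in for the number of elements per vector and is known at compile time.
template <std::size_t kLanes>
std::size_t apply_to_live_blocks(const std::vector<std::uint64_t>& live,
                                 std::vector<float>& column) {
    static_assert(kLanes > 0 && kLanes < 64 && 64 % kLanes == 0,
                  "kLanes must be a proper divisor of 64");
    assert(live.size() * 64 >= column.size());
    constexpr std::uint64_t kBlockMask = (std::uint64_t{1} << kLanes) - 1;

    std::size_t blocks_computed = 0;
    // Rows past the last full block are left untouched in this sketch.
    for (std::size_t row = 0; row + kLanes <= column.size(); row += kLanes) {
        const std::uint64_t bits = (live[row / 64] >> (row % 64)) & kBlockMask;
        // Plays the role of a predicate count: zero live rows means skip.
        if (std::bitset<64>(bits).count() == 0) continue;
        for (std::size_t i = 0; i < kLanes; ++i) {
            column[row + i] = column[row + i] * 2.0f + 1.0f;  // stand-in work
        }
        ++blocks_computed;
    }
    return blocks_computed;
}
```

Because `kLanes` is a template parameter, the mask and the inner loop bound are constants, which is the part that gets harder when the width is only known at run time.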
In a way it's worse because at least with int you're not really expecting to run the same binary on architectures with different int lengths, and also for several decades there have only been two realistic options (32 or 64), which makes it easy to deal with.
With RVV (and SVE I assume) there is a wider range of realistic options - at least 128, 256 and 512. The RVV spec allows up to 65536! Also it's totally reasonable to want a single binary to work with all of them, so then you're into compiling parts of your code multiple times with runtime dispatch, which is a right pain.
I haven't looked into how Highway does it, but I don't really know how you write length-agnostic code in high-level languages. It's easy in assembly, but it sucks if you have to do it in assembly.
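For anyone who hasn't seen it, the "compile it multiple times and dispatch at runtime" approach in its most minimal form looks roughly like this. This is a sketch assuming GCC or Clang on x86-64 (it relies on the `target` attribute and `__builtin_cpu_supports`, and falls back to the baseline everywhere else); the function names are made up, and it is not how Highway structures things.

```cpp
#include <cstddef>

// Baseline kernel, compiled for the default target of the build.
static void scale_baseline(float* data, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= k;
}

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
// Same source loop, but the compiler may use AVX2 inside this one function.
__attribute__((target("avx2")))
static void scale_avx2(float* data, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= k;
}
#endif

using ScaleFn = void (*)(float*, std::size_t, float);

// Picks a kernel based on what the CPU we are running on reports.
static ScaleFn pick_scale() {
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
    if (__builtin_cpu_supports("avx2")) return scale_avx2;
#endif
    return scale_baseline;
}

// Public entry point: the choice is made once, on first call.
void scale(float* data, std::size_t n, float k) {
    static const ScaleFn fn = pick_scale();
    fn(data, n, k);
}
```

Every extra width or ISA level you want to cover is another copy of the kernel plus another branch in `pick_scale`, which is where the pain and the code bloat come from.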
I don't know how SVE works but I thought the point of it was to let implementations pick a larger size than the CPU supports and then get an automatic speedup from better processors with more vector lanes.
To me it’s clear adding the ability to express intent to parallelise is the Right Thing. This is the only way the compiler can actually know what you want it to do.
> There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language.
I think this is the best solution for truly portable SIMD. Sure, it doesn't cover everything, but it makes autovec explicit, guaranteed and more powerful.
One of the biggest problems with "portable" SIMD libraries is that when they're used for simple things, autovec is often better, as it has access to the direct ISA semantics and can much more easily do things like unrolling.
GCC already solved it:
https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
The operations behave like C++ valarrays. Addition is defined as the addition of the corresponding elements of the operands. For example, in the code below, each of the 4 elements in a is added to the corresponding 4 elements in b and the resulting vector is stored in c.
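The code that the quoted documentation refers to didn't make it into the comment. It is along these lines (a sketch from memory, wrapped in a function called `add4` that is my own name; `vector_size` is a GCC/Clang extension, not standard C++):

```cpp
// Four 32-bit ints packed into one 16-byte vector type.
typedef int v4si __attribute__((vector_size(16)));

// Element-wise addition: each element of a is added to the matching element of b.
v4si add4(v4si a, v4si b) {
    return a + b;
}
```

The usual arithmetic operators work element-wise on such types, and individual elements can be read with `c[0]`, `c[1]`, and so on.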
Those type attributes are also used for the x86 intrinsics API, and they override default C behaviors like promotions and presumptions around aliasing (ironically they make type punning easier, though maybe it was just the few use cases I explored, and this isn't an area where I have a lot of experience). C23 also gained the _BitInt type, which discards all the old promotion rules, which should help autovectorization.
I think ISPC is still the proper way to go. But these days everybody wants One Language to Rule Them All along with standard libraries for doing everything out-of-the-box. And while in principle ISPC's approach could be stitched into C or C++ in a fairly clean manner (perhaps with well-defined and enforced segregation of constructs to minimize complexity), it's just not gonna happen: C++ is too enamored with constructing libraries through deeply complex templated types (hammer, nail, yada yada), and C is just too conservative (though if GCC or clang went the distance with a full implementation, there's a good chance the C committee would adopt it).
Currently experimental, but looks like the first Intel arch will arrive in the next release in about 3 months. They are also going to support a portable layer.
Wondering what people here think about the approach the Go team is taking; I think they would appreciate more eyeballs on their design. (I’m not competent in this space (yet))…
"The Default Width Problem" -- this section seems confused and definitely reeks of LLM authorship. It's comparing -march=native against std::simd and complaining that std::simd<T,8> breaks portability with pre-Haswell. This is a real issue, but -march=native is no better! It bakes in the SIMD width at compile-time as well, so that binary also won't run on a pre-Haswell machine. It's a real issue but neither side solves it. You need runtime dispatch (a la Google Highway) to solve this.
If we want to improve cross-platform SIMD, in my opinion we should start by supporting more operations in LLVM IR. Like vector expansion (currently we only have expandload), runtime-known shuffle vectors, pdep/pext operations.
Also, let's stop with the "vector length agnostic" types being the sole option for SVE extensions. I'd rather write an optimized routine for a 16-byte machine I'm targeting and be able to upgrade it in 5 years than have "agnostic" code that wants to pretend like it would work amazingly on all platforms, but the machine I optimized it for is theoretical. I'm fine with recompiling my code, I do it every day. If I have an algorithm that's truly vector length agnostic, I can make the vector length a constant in my code that can change based on the compile target.
> Also, let's stop with the "vector length agnostic" types being the sole option for SVE extensions
They aren't, see the `arm_sve_vector_bits` attribute.
> I'm fine with recompiling my code, I do it every day
Then you can do that.
> If I have an algorithm that's truly vector length agnostic, I can make the vector length a constant in my code that can change based on the compile target.
You can do that, but why not simply write it in a vector-length-agnostic way?
IMO the better approach is to start thinking about SIMD optimizations in a VLA way, and specialize on the vector length, when that becomes advantageous.
Doing it this way is better even if you end up not writing VLA code, because you thought about the scalability problem.
Many libraries currently don't scale beyond 128-bit, not because they couldn't make efficient use of >128-bit, but because the library was architected around 128-bit and changing that amounts to almost a full rewrite. So now you are stuck wasting 3/4 of your ALUs running 128-bit SSE on Zen5.
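As a small illustration of the middle ground being argued over here (write the algorithm against a lane count, and let the build target pick the number), something like the following works. The macro-to-width mapping and the names `kVectorBytes` and `lanes_of` are assumptions of this sketch, and the inner loops are plain scalar code that the compiler is free to vectorize.

```cpp
#include <cstddef>

// Assumed mapping from target macros to a vector width in bytes.
#if defined(__AVX512F__)
constexpr std::size_t kVectorBytes = 64;
#elif defined(__AVX2__)
constexpr std::size_t kVectorBytes = 32;
#else
constexpr std::size_t kVectorBytes = 16;  // SSE2/NEON-sized fallback
#endif

template <typename T>
constexpr std::size_t lanes_of = kVectorBytes / sizeof(T);

// Sums n floats with one partial accumulator per lane, then a scalar tail.
float sum(const float* data, std::size_t n) {
    constexpr std::size_t kLanes = lanes_of<float>;
    float acc[kLanes] = {};
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t l = 0; l < kLanes; ++l) acc[l] += data[i + l];
    }
    float total = 0.0f;
    for (std::size_t l = 0; l < kLanes; ++l) total += acc[l];
    for (; i < n; ++i) total += data[i];  // leftover elements
    return total;
}
```

Nothing in `sum` mentions 128 bits, so moving to a wider target is a recompile rather than a rewrite. A truly length-agnostic version would read the lane count at run time instead.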
Just write inline asm for x86 and aarch64 (if you care about that) and not care about the rest. Is it even useful to do simd on other processors?
Compiler optimizing even the code around the simd code based on the semantics of arithmetic or other things sounds silly after writing some of this kind of code
Agreed, fixed-width vectors need to be a language feature: better compile times, and it would solve issues for most people.
Personally, I think that going the Clang way and adding GLSL-like vectors and semantics would've gone a long way. SVE might be an elegant design, but in reality there is probably several times more game and other 3D code being written that needs vectors than code in other fields, and their limited vector sizes aren't really a problem.
And honestly, considering the story of AVX512.. with 512 bit vectors being removed from mainstream by Intel, do we really really need longer ones despite it being from a "scalable design"?
On GPUs, GLSL-like types compile down to what is basically variable-length SIMD.
A vec4 doesn't get compiled to a SIMD vector with four floats, but rather to four SIMD vectors, each containing N FP32 elements (usually 32 or 64).
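In CPU terms, that layout is structure-of-arrays. A rough model of it, with `Vec4Batch` and `dot` as made-up names and `N` standing in for the hardware width in lanes:

```cpp
#include <array>
#include <cstddef>

// One "vec4" per lane: each component lives in its own N-wide register.
template <std::size_t N>
struct Vec4Batch {
    std::array<float, N> x, y, z, w;
};

// Dot product of N vec4 pairs at once. Each lane only ever touches its own
// elements, so there is no horizontal (cross-lane) work at all.
template <std::size_t N>
std::array<float, N> dot(const Vec4Batch<N>& a, const Vec4Batch<N>& b) {
    std::array<float, N> out{};
    for (std::size_t lane = 0; lane < N; ++lane) {
        out[lane] = a.x[lane] * b.x[lane] + a.y[lane] * b.y[lane] +
                    a.z[lane] * b.z[lane] + a.w[lane] * b.w[lane];
    }
    return out;
}
```

The shader author writes code for a single `vec4`, and the width `N` never appears in the source, which is why the same shader runs unchanged on 32-wide and 64-wide hardware.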
Intel has been forced to reintroduce 512-bit vectors in the mainstream, because of the competition from AMD.
Starting with the Intel Nova Lake CPUs, around the end of this year, all future AMD and Intel CPUs will provide 512-bit vectors, like also the current AMD Zen 5 and Zen 4 CPUs.
The 512-bit vector length is more convenient than other lengths, because on the AMD and Intel CPUs it coincides with the length of a cache line. Because of this, it is easier to optimize simultaneously for the best cache usage.
For GPUs, which favor throughput over latency, 1024-bit and 2048-bit vector register widths are frequently used. For CPUs it is unlikely that widths greater than 512-bit would be useful, as the vector operations that should be done on CPUs are those for which the high latency of using a GPU is undesirable.
Greater than 512-bit SIMD isn't relevant for regular general-purpose processors, currently or in the near future.
But for smaller, more specialized CPUs in embedded or automotive use cases you can get more parallel compute, while keeping the software model simpler than having to dispatch to a GPU.
Specifically a design like https://saturn-vectors.org/#_short_vector_execution, which likes to use vectors 2x or 4x wider than the datapath for more efficient chaining.
I quite like that design, because you can get high utilization and limited out-of-order execution without vector register renaming.
> When Google needed portable SIMD for production image and video codecs, they built Highway — not std::simd.
Sure, they left the committee years ago. I am not trying to claim any sort of direct causality, but it sure seems like this is a case where Google's presence on the committee might have prevented shipping boondoggles like this. Modules is another case where I think Google's feedback might have been able to steer the ship in a better direction.
Yes. The straw that broke the camel's back was the complete refusal to break ABI, locking in bad implementations forever. e.g. unordered_map dramatically underperforms when compared against modern swiss tables, but the committee won't do anything about it. Not to mention the committee's head-in-the-sand policy-based approaches to safety, vs Google's much broader-scoped Carbon effort.
If you thought std::simd was a library nobody asked for, just wait until you hear about <linalg>. I feel like half the people looking forward to that think they're just going to get standard C++ bindings to LAPACK, when instead they're probably going to get an unoptimized, slapdash implementation of LAPACK written by people who aren't good at BLAS.
As for SIMD itself, designing a good SIMD library is difficult because there are several different SIMD approaches and some of them work poorly for certain use cases. For example, you can take an HPC-ish approach of "vectorize this loop" (à la #pragma omp simd) and have the compiler take care of a fairly mechanical transformation. Or you can take an opposite approach of treating a 128-bit SIMD vector as a fundamental data type in your language. Which approach is better depends on your use case.
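To make the contrast concrete, here is the same a*x+y computation in both styles. This is only a sketch: the pragma only has an effect when OpenMP SIMD support is enabled (for example `-fopenmp-simd` on GCC/Clang) and is otherwise ignored, and the vector type uses the GCC/Clang `vector_size` extension. The function and type names are mine.

```cpp
#include <cstddef>

// Style 1: keep the scalar loop and ask the compiler to vectorize it.
void saxpy_loop(float a, const float* x, float* y, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Style 2: a 128-bit vector is a value type you compute with directly.
typedef float f32x4 __attribute__((vector_size(16)));

f32x4 saxpy_vec(float a, f32x4 x, f32x4 y) {
    const f32x4 av = {a, a, a, a};  // explicit broadcast of the scalar
    return av * x + y;
}
```

In the first style the width is the compiler's business and the loop handles any `n`. In the second the width is part of the type, so tails, alignment and wider hardware are the caller's problem, but so is the control.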
The work of one obsessive author, who never gave a good explanation for why the thing needed to be in the standard library instead of an external one. The committee was apathetic about the proposal and kept bringing up various trivial issues, in a clear attempt to stall him, but he refused to take the hint. So eventually they relented. Outside coverage I have seen so far seems to be to the tune of "WTF is this weird thing?" and quickly glosses over it.
I wonder if it's going to end up like the export keyword.
I feel like std::hive fits right in to the C++ stdlib group of collections
The least stupid is std::vector which is just the typical O(1) amortized growable array type found in most modern languages, with a mediocre API. 8/10 could do better.
std::array is just the built-in array type C++ should have but doesn't. This shouldn't be a library type, that's embarrassing.
std::deque looks like you're getting something like Rust's VecDeque but you aren't, it's a weird hybrid optimisation which presumably made sense on some 1980s hardware. I asked STL once to explain what it's even for and they didn't know. [[For reference STL is the name of the guy in charge of Microsoft's implementation of the C++ standard library, Microsoft also calls that library STL for reasons we needn't address]]
std::list is the extrusive doubly linked list. This type makes sense in a DSA class. Why is it in the C++ standard library? I dunno, maybe C++ is intended only as a teaching language?
std::forward_list is the extrusive singly linked list. You know, for a different seminar in that same DSA class. You might want the intrusive linked list, you don't want this.
std::map and std::set are probably red-black trees. OK, you might need those and for some reason not care about the details (which aren't specified here)
std::multimap and std::multiset are even less obviously useful. I have never seen them used in real software. Why are they in the standard library?
std::unordered_all_of_the_above_maps_and_sets look like the simplistic hash table you'd be shown in an intro DSA class either taught by somebody who doesn't know the subject well or aiming to cover the basics and get back to their research. This will perform poorly on any hardware with features like a cache.
The C++ stdlib carries broken garbage basically indefinitely. C++ doesn't have the same library stability promise that Rust has, but in practice stuff that nobody cares about is never removed.
That std::hive will fit right in. Another container type you probably shouldn't use, draining precious maintenance resource from groups who have better things they could be doing.
> These are in the standard library because someone proposed their inclusion.
As with std::hive. Indeed the "unordered" containers, just like std::hive were repeatedly knocked back and eventually got in decades after they were obsolete. Persistence really does pay off in C++
> They're fine for the majority of people who really don't want to roll their own data structures each time.
Sure, doubtless std::hive is fine for that same majority of people.
>The committee was apathetic about the proposal and kept bringing up various trivial issues, in a clear attempt to stall him, but he refused to take the hint.
That's a mean interpretation, mean both towards the committee and towards the author.
Have you read the entire paper, and not just skimmed the front matter?
The interface is a generic template approach, which can work on any element type T, not just float/double/complex<float>/complex<double>, but custom types like bigint or rational or random_custom_finite_field. Or integration with units libraries (there's another dumpster fire coming down the line...). Your BLAS library will provide you just the four basic element types, so it takes a decent amount of dispatch logic to convert the template interface to the actual library calls, and you still need fallback logic anyways to handle the other types.
But the library is also not designed in a way to facilitate that kind of dispatch logic (std::simd is, which accounts for a not insubstantial portion of its complexity). Which is on top of the difficulty of linking to one of various BLAS implementations as a standard library. So it's a design that's all but guaranteed not to let you link against an existing BLAS implementation, and indeed, carefully reading the rest of the section you wrote makes it clear that it's not a goal of the paper proposal to have implementations do that.
> Your BLAS library will provide you just the four basic element types, ..., and you still need fallback logic anyways to handle the other types.
So your problem with it is that it does all you want (e.g. LAPACK bindings) AND gives extra features??
> so it takes a decent amount of dispatch logic to convert the template interface to the actual library calls
I can't estimate how much this degrades performance. But, it feels very low overhead compared to the calculation itself (and probably should be resolved at compile time)
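For what it's worth, the compile-time version of that dispatch can be as simple as overloads sitting next to a generic template. A sketch, with the caveat that `vendor_stub::sdot` and `vendor_stub::ddot` are hypothetical stand-ins written as plain loops so the example compiles on its own (a real build would forward to the BLAS library), and that this is not taken from the <linalg> proposal:

```cpp
#include <cstddef>

namespace vendor_stub {
// Stand-ins for the single and double precision dot products of a vendor BLAS.
inline float sdot(std::size_t n, const float* x, const float* y) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += x[i] * y[i];
    return acc;
}
inline double ddot(std::size_t n, const double* x, const double* y) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) acc += x[i] * y[i];
    return acc;
}
}  // namespace vendor_stub

// Generic fallback: works for any T with +, * and value-initialization
// (integers, rationals, custom field types, ...).
template <typename T>
T dot(std::size_t n, const T* x, const T* y) {
    T acc{};
    for (std::size_t i = 0; i < n; ++i) acc = acc + x[i] * y[i];
    return acc;
}

// Overloads for the element types the library covers. Overload resolution
// prefers these over the template, so the choice costs nothing at run time.
inline float dot(std::size_t n, const float* x, const float* y) {
    return vendor_stub::sdot(n, x, y);
}
inline double dot(std::size_t n, const double* x, const double* y) {
    return vendor_stub::ddot(n, x, y);
}
```

The selection itself is free. The cost the parent comment is pointing at is elsewhere: an implementation has to ship and maintain both paths, and link a BLAS into the standard library.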
Something everyone is missing is that this is just feature parity with other languages; this is not C++-specific. They added it because other languages have these compiler hints, which are then (usually) used in LLVM as opaque types for 'smarter' optimizations. Hand-rolled code will still be better, but there are very niche instances where you want LLVM to know this information, which is especially important when you don't care about performance but you do care about data integrity and obfuscation.
I'd actually rather just have the compiler give some guarantees on producing SIMD code when you write regular C++ code doing sums, multiplications, etc... in a particular way. And perhaps add a few more operators/keywords to the language for modern CPU instructions (we got things like popcount, countl_zero and fma, but what about e.g. pext, pdep, aes, ...)
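Until something like that exists, the portable fallback for pext/pdep is the well-known bit loop below (the `_portable` names are mine). On x86 with BMI2 each of these is a single instruction; whether a compiler recognizes the loop and emits that instruction is not something I would count on.

```cpp
#include <cstdint>

// Bit extract: gathers the bits of `value` selected by `mask` into the low
// bits of the result, preserving their order.
std::uint64_t pext_portable(std::uint64_t value, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t out_bit = 1; mask != 0; out_bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1);  // lowest set mask bit
        if (value & lowest) result |= out_bit;
        mask &= mask - 1;  // clear that mask bit
    }
    return result;
}

// Bit deposit: scatters the low bits of `value` to the positions set in `mask`.
std::uint64_t pdep_portable(std::uint64_t value, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t in_bit = 1; mask != 0; in_bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1);
        if (value & in_bit) result |= lowest;
        mask &= mask - 1;
    }
    return result;
}
```

The loop runs once per set bit in the mask, so it is far slower than the hardware instruction for dense masks, which is the argument for exposing these operations in the language or library.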
> The pattern is clear: every major project that actually needs portable SIMD in production chose a third-party library or a different language
Of course they chose a third-party library, because C++26 has only just been published and doesn't have wide support or adoption experience yet.
> And the most damning data point might be EVE itself — a committee member looked at std::simd, decided it wasn’t good enough, and built his own library.
That's just manipulation. The first commit in EVE[1] was published in 2018. There was no std::simd in the standard at that time.
> Nobody waited for std::simd. By the time it ships in C++26, these libraries will have a decade of production battle-testing, real user feedback, and cross-platform coverage that std::simd can’t match on day one.
Manipulation 2. No library has a decade "of production battle-testing, real user feedback, and cross-platform coverage" on day one. So why did their authors create them?
> Including <experimental/simd> pulls in deeply nested template machinery — simd.h, simd_x86.h, simd_builtin.h, and friends. A trivial function computing sin on a SIMD vector takes about 2.2 seconds to compile. The equivalent scalar for-loop? 0.2 seconds.
It would be more interesting if you compared this with precompiled headers and C++20 modules.
> The std::simd version? It emits actual vsqrtps + vmulps because the optimizer can’t perform algebraic simplification through opaque template function calls:
Opaque template function calls? What is this?
Of course there are 1000 examples where the compiler can do a better job with a scalar loop. And there are 1000 examples where it can't. But for some reason people do write SIMD manually. Probably because they want predictable code generation: no massive slowdown on another compiler, another compiler version, another CPU, or after changing one line of the loop.
> sqrt(x) * sqrt(x)
What would the compiler generate with manual SIMD intrinsics? I doubt it would be the same as a scalar mul.
> The frustrating part is that the problems are well-understood. SIMD programmers have been asking for the same things for years, and none of them are in std::simd.
Show me your proposal with a critique of std::simd, if you are asking for these things. Or at least someone else's proposal.
How can anyone tell that someone is asking?
Too many emotional statements, too little technical detail.
Maybe there's an interesting story in there, it's certainly possible. But the "author" could not be bothered to write it, so why should we waste our time reading it?
> The problem is that std::simd in 2026 is the 2012 solution arriving after the world moved on. The committee spent a decade polishing a library-based approach while compilers solved the easy cases automatically and ISPC solved the hard cases with language-level support.
I find it interesting that the C++ committee would make that kind of mistake. Shouldn't they know better?
> I find it interesting that the C++ committee would make that kind of mistake. Shouldn't they know better?
The main reason why people attend WG21 meetings is to get their pet features into the C++ language or the associated standard library. To some extent you can further that goal by shooting down other people's suggestions, especially if they would conflict ‡
C++ is a vast sprawling language. There are no genuine "C++ experts" for the same reason there aren't any people who know all of mathematics. There are a lot of people who are experts on some corner of the language or its libraries, and some who know a little bit about almost everything but no overarching experts.
‡ A good way to do this work would try to have such rivals all work together to improve the language, a sort of "yes, and" collaborative approach but although this has occasionally been able to work in C++ the whole WG21 structure works against it, in particular they vote to achieve consensus, which is not what the word "consensus" means and rewards appeasing haters much more than it does finding out what the problems are and working to fix them.
Many people think there are a lot of problems with C++ committee and the standards process. Some would claim that the ISO governance model doesn't work at all. There is a lot of drama that the outside world has no idea about, because much of the discussion is behind closed doors.
You can look it up (e.g. on blogs and /r/cpp), but I am not linking to anything because lots of content is very biased and hard to verify.
I do feel the TC39 (the group behind the ECMAScript/JavaScript standard) seems more practical and effective. There are disagreements and dramas but not nearly as bad as with C++.
The C++ standards committee is pretty damn dysfunctional at this point for a variety of reasons.
Only like 10% of the committee are actually responsible for an implementation in some manner; the vast majority are users, often looking to get their feature into the standard. This also means that only a tiny minority of the committee actually understands things like the difference between a prototype hack and a proper implementation. I get the sense that it's extremely bad on the library front--all of the standard library implementors I know are basically pleading "please stop adding new features, we want time to catch up."
One of the big issues with library features is that library vendors can't just copy-paste existing implementations for licensing reasons, so they have to reimplement it largely from scratch, and the people doing so may not necessarily be skilled in that particular domain. On top of that, standard libraries are much more sensitive to ABI breaks than other libraries are, so a bad design gets ossified to a much worse degree than regular libraries. The best examples of baked-in bad implementations are std::unordered_map and std::regex, but honestly even std::unique_ptr has similar ABI-unfixable issues (it's not a pointer for ABI calling conventions). Yet you still see people cheer on additions to the standard library because obviously those people are going to make existing implementations better.
C++ sits on that weird abstraction level where it wants to be a higher-level language but keeps grinding its gears on stuff like pointer sizes, pointer arithmetic or vector sizes, and at the same time wants to stay C compatible and needs that interface with the lower-level world.
Now compare with how numpy does things: you care about the data size but not the implementation.
Still, I didn't expect less (of a crap fest) from the C++ committee as presented here
sadly inline assembly is still at the ergonomics of "one compiler doesn't support it in x64 mode" and "you can choose between the readable syntax (which is a black box to the compiler) and the unreadable syntax (which can specify I/O/clobber regs)"
OK, is there a horrible speed penalty for writing your SIMD in pure assembly functions and then calling those functions? If you're writing assembly anyway, just drop the "inline" part.
I have written a lot of SIMD for both x86 and ARM over many years and many microarchitectures. Every abstraction, including autovectorization, is universally pretty poor outside of narrow cases because they don’t (and mostly can’t) capture what is possible with intrinsics and their rather extreme variation across microarchitectures. If I want good results, I have to write intrinsics. No library can optimally generate non-trivial SIMD code. Neither can the compiler. Portability just amplifies this gap.
I think a legitimate criticism is that it is unclear who std::simd is for. People that don’t use SIMD today are unlikely to use std::simd tomorrow. At the same time, this does nothing for people that use SIMD for serious work. Who is expected to use this?
The intrinsics are not difficult but you do have to learn how the hardware works. This is true even if you are using a library. A good software engineer should have a rough understanding of this regardless.
For me the main issue is that if you're serious about SIMD, you need to use a state-of-the-art library and can't rely on some standard library whose quality is variable, unreliable, and which is by design always behind.
For some algorithms you have to compromise the data layout for compatibility across the widest number of microarchitectures by nerfing the performance on advanced SIMD microarchitectures working on the same data structures. There really isn’t a way to square that circle. You can make it portable or you can make it optimal, and the performance gap across those two implementations can be vast.
In the 15-20 years I’ve been doing it, I’ve seen zero evidence that there is a solution to this tradeoff. And people that are using SIMD are people that care about state-of-the-art performance, so portability takes a distant back seat.
NumPy has a whole dispatch mechanism to deal with the tradeoffs. The main problem is code bloat: how many microarchitectures are you going to support with dispatch at runtime?
Numpy is interesting in that regard since its dispatch mechanism adds up to a lot of overhead. There are a lot of problems where a naive list comprehension is faster, even when SIMD could be used to great effect.
For Boost.SIMD (which is what became Eve), a large part of what we did to tackle those problems was building an overload dispatching system so that we could easily inject increasingly specialized implementations depending on the types and instruction set available, in such a way that operations could combine efficiently.
That, however, performed quite poorly at compile-time, and was not really ODR-safe (forceinline was used as a workaround). At least one of the forks moved to using a dedicated meta-language and a custom compiler to generate the code instead. There are better ways to do that in modern C++ now.
We also focused on higher-level constructs trying to capture the intent rather than trying to abstract away too low-level features; some of the features were explicitly provided as kernels or algorithms instead of plain vector operations.
The data layout can often be done dynamically based on your target architecture.
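For readers who haven't seen this kind of overload dispatching, the core trick can be sketched with a tag hierarchy. To be clear, this is a toy illustration and not Boost.SIMD's or EVE's actual machinery: the tag names are invented and the functions return strings so you can see which overload was picked.

```cpp
#include <string>

// Hypothetical ISA tags. Inheritance encodes "supports everything the base does".
struct scalar_tag {};
struct sse2_tag : scalar_tag {};
struct avx2_tag : sse2_tag {};

// add has a generic version and an SSE2-specialized one.
inline std::string add_impl(scalar_tag) { return "scalar add"; }
inline std::string add_impl(sse2_tag)   { return "sse2 add"; }

// mul has a generic version and an AVX2-specialized one.
inline std::string mul_impl(scalar_tag) { return "scalar mul"; }
inline std::string mul_impl(avx2_tag)   { return "avx2 mul"; }

// Overload resolution picks the most specialized implementation available for
// the requested ISA, and falls back toward the base when there is none.
template <typename Isa>
std::string add(Isa isa) { return add_impl(isa); }

template <typename Isa>
std::string mul(Isa isa) { return mul_impl(isa); }
```

With this, `add(avx2_tag{})` lands on the SSE2 version because that is the closest base with an overload, while `mul(sse2_tag{})` falls back to the scalar one. Injecting a better implementation later is just adding one more overload.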
Don't let the best be the enemy of the good. I got amazing performance from swapping for-loops for some simple SIMD patterns. Moreover, by doing this, I noticed that the codebase started to become better shaped for performance as well. By writing SIMD patterns, you get into the mindset of tight, hot loops.
The problem is that you're better off by defining SIMD friendly data structures and letting the compiler figure it out than by hand coding the actual SIMD operations.
If you wanted to explicitly opt into bundling/batching of operations, you wouldn't actually want to define a fixed register size. You'd want a data type that represents an arbitrarily sized register and exposes some across batch operations. Then the compiler can make use of this mini DSL to optimize your SIMD code to actual instructions.
The problem is solvable, but it requires cooperation from all parties. CPU vendors must offer a basic set of vector instructions that is supported on all architectures. The language committee must be willing to support function local variable size data types that are never exposed in the ABI. The compiler developers must increase the quality of their auto vectorizers.
> The problem is that you're better off by defining SIMD friendly data structures and letting the compiler figure it out than by hand coding the actual SIMD operations.
This will work only for the most basic SIMD usages.
> CPU vendors must offer a basic set of vector instructions that is supported on all architectures.
This will take decades because you cannot change existing architectures/processors.
> This will take decades because you cannot change existing architectures/processors.
I think once AVX-512, SVE and RVV are widespread enough, you'll have a rather powerful baseline you can target. But this will take a lot of time.
> AVX-512
Which subset though? Some of them are not supported by some recent CPUs (e.g. 2024).
Not to mention Alder Lake not supporting AVX512.
Yeah AVX-512 is basically dead as a universal target for x86, the future is now AVX-10. But I believe there is a reasonable subset that will work on both.
It's a little dramatic to say AVX-512 is dead versus AVX10. Rather, I would say that AVX10 finalizes a universally available set of AVX-512 extensions. For AVX10.1, there's essentially no difference after Intel backed out of reducing the vector length.
For at least the next decade AVX-512 will be the high-performance target, reaching all of the Zen 4/5/6 CPUs as well as whatever AVX10-enabled CPUs Intel produces.
This works today :) Highway provides such an abstraction for arbitrary vector lengths and maps them to intrinsics. All on the library level, no need to wait years for compiler or language updates.
what you effectively said is "there should be only one isa".
Because if that was all it took, why wouldn't it also apply to every other instruction set too?
> I think a legitimate criticism is that it is unclear who std::simd is for.
I think it's for people like me, who recognize that, depending on the dataset, a lot of performance is left on the table when you don't take advantage of SIMD, but who are not interested in becoming experts on intrinsics for a multitude of processor combinations.
Having a way to be able to say "flag bytes in this buffer matching one of these five characters, choose the appropriate stride for the actual CPU" and then "OR those flags together and do a popcount" (as I needed to do writing my own wc(1) as an exercise), and have that at least come close to optimal performance with intrinsics would be great.
Just like I'd rather use a ranged-for than to hand count an index vs. a size.
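The shape of that computation, written as plain C++ so it stays portable, is roughly the following. This is a sketch, not the poster's code: the five characters are a guess (the comment doesn't say which ones were used), the 64-byte block is an arbitrary stand-in for "the appropriate stride", and the function names are mine. The per-block mask is what a SIMD byte compare plus movemask would produce in one go.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// One bit per byte in a block of up to 64 bytes: bit i is set when block[i]
// is one of five characters (an assumed whitespace set).
static std::uint64_t match_mask(const unsigned char* block, std::size_t len) {
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < len; ++i) {
        const unsigned char c = block[i];
        const bool hit = c == ' ' || c == '\n' || c == '\t' || c == '\r' || c == '\v';
        mask |= static_cast<std::uint64_t>(hit) << i;  // OR the flags together
    }
    return mask;
}

// Counts matching bytes in text[0..n) by popcounting one mask per block.
std::size_t count_matches(const char* text, std::size_t n) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(text);
    std::size_t total = 0;
    for (std::size_t off = 0; off < n; off += 64) {
        const std::size_t len = (n - off < 64) ? (n - off) : 64;
        total += std::bitset<64>(match_mask(bytes + off, len)).count();
    }
    return total;
}
```

What a portable SIMD layer would ideally let you do is write `match_mask` once as "compare against five broadcast bytes, OR the results, take the mask" and have the block size follow the CPU.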
> People that don’t use SIMD today are unlikely to use std::simd tomorrow.
I mean, why not? That's exactly my use case. I don't use SIMD today as it's a PITA to do properly despite advancements in glibc and binutils to make it easier to load in CPU-specific codes. And it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions. But it is legitimately important for improving performance for many workloads, so I don't want to miss it where it will help.
And even gaining 60-70% of the "optimal" SIMD still puts you much closer to the highest performance than the alternative.
In the end I did end up having to write some direct SIMD intrinsics, I forget what issue I'd run into starting off with std::simd, but std::simd was what had made that problem seem approachable for the first time.
You raise some good points. I think a lot about how to make SIMD more accessible, and spend an inordinate amount of time experimenting with abstractions, because I’ve experienced its many inadequacies.
The design of the intrinsics libraries does them no favors, and there are many inconsistencies. Basic things could be made more accessible but are somewhat limited by a requirement for C compatibility. This is something a C++ standard can actually address: it can be C++ native, which can hide many things. Hell, I have my own libraries that clean this up by thinly wrapping the existing intrinsics, improving their conciseness and expressiveness for common use cases. It significantly improves the ergonomics.
An argument I would make, though, is that the lowest-common-denominator cases that are actually portable are almost exactly the cases that auto-vectorization should be able to address. Auto-vectorization may not be good enough to consistently address all of those cases today, but you can see a future where std::simd is essentially vestigial: auto-vectorization subsumes what it can do, yet portability requirements prevent it from being leveled up to express more than what auto-vectorization can see.
The other argument is that SIMD is the wrong level of abstraction for a library. Depending on the microarchitecture, the optimal code using SIMD may be an entirely different data structure and algorithm, so you are swapping out SIMD details at a very abstract macro level, not at the level of abstraction that intrinsics and auto-vectorization provide. You miss a lot of optimization if you don’t work a couple levels up.
SIMD abstraction and optimization is deeply challenging within programming languages designed around scalar ALU operators. We can’t even fully abstract the expressiveness of modern scalar ALUs across microarchitectures because programming languages don’t define a concept that maps to the capabilities of some modern ALUs.
That said, I love that silicon has become so much more expressive.
IMO what's needed is ISPC like guided autovec with a lot of hinting support to control codegen (e.g. hint for generating an unrolled version only or an unrolled and non-unrolled version).
Basically something like #pragma omp simd, but actually designed for the SIMD model rather than the parallel one, and one that errors when vectorization isn't possible.
Ideally it would support things like reductions, scans, reference of elements from other iterations (e.g. out[i] = in[i-1]+in[i+1]), full gather/scatter, early break, conditional execution control (masking or also a fast-path, when no active elements), latency vs throughput sensitive (don't unroll or unroll to max without spilling), data dependent termination (fault-only-first load or page aligned for things like strlen), ...
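To make the gap concrete, here is a small sketch of the neighbor-reference case (out[i] = in[i-1]+in[i+1]) using today's `#pragma omp simd`. The function name and the choice to copy border elements through unchanged are assumptions for illustration. Note what the pragma does not give you: it is a request, and the loop silently stays scalar if the compiler cannot vectorize it, which is exactly the behavior the comment above wants replaced with a hard error.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: out[i] = in[i-1] + in[i+1] for interior elements.
// The first and last elements are copied through unchanged.
std::vector<float> neighbor_sum(const std::vector<float>& in) {
    std::vector<float> out(in);
    const std::size_t n = in.size();
    if (n < 3) return out;  // no interior elements to compute
    const float* src = in.data();
    float* dst = out.data();
    // A hint, not a guarantee: without -fopenmp or -fopenmp-simd the pragma
    // is ignored, and with it the compiler may still emit a scalar loop.
    #pragma omp simd
    for (std::size_t i = 1; i < n - 1; ++i) {
        dst[i] = src[i - 1] + src[i + 1];
    }
    return out;
}
```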
> it's a PITA to differentiate the utility of hundreds of different vpaddcfoolol instructions
This is one complaint I toss back at Intel and AMD.
If an instruction/intrinsic is universally worse than another set of intrinsics for the P90/P95/P99 use case where it's going to be used, then it shouldn't exist. Stop wasting the die space and instruction decode on it, not to mention the developer time wasted finding out that your dot product instruction is useless.
There are a lot of smart people that have worked on compilers, optimized subroutines for LAPACK/BLAS, and designed the decoders and hardware. A lot of that effort is wasted because no one knows how to program these weird little machines. A little manual on "here's how to program SIMD, starting from linear algebra basics" would be worth more to Intel than all the money they've wasted trying to improve autovectorization passes in ICC and now, LLVM.
:) I agree a tutorial would be helpful. We are working on one with Fastcode.
A manual is not a tutorial, and having AI anywhere near this task is actively harmful. Please do not build this.
?? Where did you see mention of AI?
I searched the name "fastcode" and the only results were AI
Have you considered our Highway library? Runtime dispatch need not be a PITA :) It's basically portable intrinsics, and a much more complete set (>300) than the ~50 in std.
Does it have fallback paths for everything, though? Scalar if necessary?
Projects that depend on Highway drop support for CPUs not listed in the Highway documentation, saying that they can't support these CPUs because they are incompatible with Highway: https://google.github.io/highway/en/master/README.html#curre...
Are these projects somehow mistaken?
Yes, the EMU128 target is scalar only, with for loops. This is a fun way to see how well autovectorization works, with the same source code. That works on any CPU. Curious which projects have such concerns, any link?
People reported challenges building V8 (whether upstream or the Node.js variant) on s390x with z13 support. I don't know if it was discussed on the porters mailing list because it's not public: https://groups.google.com/g/v8-s390-ports
Elsewhere, some people interpreted https://github.com/google/highway/issues/1895 as meaning that Highway code does not work on z13 at all.
Thanks for sharing. The first link seems non public indeed. I can imagine there is some compile issue we could reasonably fix, with the help of someone who has Z13 access. Please encourage them to raise an issue. I will be back on May 26. After that, it should at least be able to use the scalar fallback. The issue with Z13 is that it lacks fp32 support. Would their usage be integer only?
I hadn't but it would make sense for doing my own personal programming challenges.
Given the ongoing disasters around the software supply chains I've been fighting the creeping NPM-ism that people are trying to introduce to C++, where you just FetchContent 20 different libraries to build your own app upon.
I do use gtest, fmt and a few others though, so something as broadly used as Highway would probably be fine by that standard as well. But I'd still like it better if there was a Good Enough solution that was part of C++ stdlib to reduce the number of external integrations that are deemed required for a modern C++ program.
Fair point. If it helps, our security team has called Highway critical infrastructure and helped to harden the repo. The flip side of standardization is that it would be much harder and slower to add ops as the need arises, which we do regularly.
> I think a legitimate criticism is that it is unclear who std::simd is for
It's for people that don't use SIMD today.
SIMD is hard, or at least nuanced and platform-dependent. To say that std::simd doesn't lower the learning curve is intellectually dishonest.
---
Despite the title, the primary criticism of the article is that the compilers' auto-vectorizers have improved more than the currently shipped stdlib implementation.
My criticism could mostly be summarized similarly. The scope of what a portable std::simd can do is almost exactly the scope that you would expect auto-vectorization to subsume over time. SIMD, to the extent it is covered by std::simd, is the part of SIMD that should be pretty simple to learn.
There isn’t an obvious path to elevate it above what auto-vectorization should theoretically be capable of in a portable way. This leads to a potential long-term outcome where std::simd is essentially a no-op because scalar code is automagically converted into the equivalent and it is incapable of supporting more sophisticated SIMD code.
Is this a technical impossibility, or has it just not been done yet? Could a library support generating intrinsics for a large set of architectures?
Google Highway gets mentioned in the article.
There is Google's Highway, which provides an abstraction layer. It is used by NumPy.
The full scope of what SIMD is used for is much larger than parallelizing evaluation of numeric types and algorithms.
For example, it is used for parallel evaluation of complex constraints on unrelated types simultaneously while packed into a single vector. Think a WHERE clause on an arbitrary SQL schema evaluated in full parallel in a handful of clock cycles. SIMD turns out to be brilliant for this but it looks nothing like auto-vectorization.
None of the SIMD libraries like Google Highway cover this case.
I don't quite get how something like highway doesn't cover this, while intrinsics do.
Can you explain the usecase more concretely?
Almost literally what I stated. Consider a row in Postgres table or similar. Convert the entire WHERE clause across all columns in that table into a very short sequence of SIMD instructions against the same memory. All of the columns, regardless of type, are evaluated simultaneously using SIMD. For many complex constraints you can match rows in single digit clock cycles even across many unrelated types. This is much faster than using secondary indexes in many cases.
It isn’t hypothetical, I’ve shipped systems that worked this way. You can match search patterns across a random dozen columns across a schema of hundreds of columns at essentially full memory bandwidth.
OK, I thought it couldn't be that, because that should be doable with std::simd or a SIMD abstraction. Well, unless you JIT it, in which case intrinsics wouldn't help either.
> You can match search patterns across a random dozen columns across a schema of hundreds of columns at essentially full memory bandwidth
Do I understand it correctly that this would only work if you have multiple of the same comparisons (e.g. equality check with same sized data) in the WHERE clause and the relevant columns are within one multiple of the SIMD width of each other?
Every column has its own independent constraint: equality, order, range intersection, bit sets, etc that is evaluated concurrently in single operations. Independent per column in parallel. It does require handling the representation of columns to enable it but that isn’t onerous in practice.
It isn’t intuitive but it is one of those things that is obvious in hindsight once you see how it works. The gap is that people struggle to understand how to make this something SIMD native, especially in high-performance systems.
Ah, so you're just doing SoA or AoSoA layout? It sounded like you were doing something more special than the standard SIMD usecase.
This does easily work with SIMD abstractions and even length-agnostic vector ISAs, unless you're doing AoSoA and your storage format has to match your memory format and has to be the same on all machines. In that case you probably want to do something like 4K blocks anyway, which lets you make it agnostic for every vector length anybody reasonably cares about for this type of application.
what about Google highway project?
Autovectorisation is the main way SIMD hardware gets put into use, whether you think it's pretty poor or not.
SIMD went mainstream with the Pentium MMX (launched in early 1997) and has proven rather difficult for compilers to target, but after nearly 30 years it is doing a bit better despite PLT conspiring against it (see e.g. CUDA, Futhark, etc.).
In my limited experience with looking at autovectorisation compiler output, gcc is quite bad unless you hold its hand, and clang tries to autovectorise everything it sees.
I think the main way SIMD hardware gets put to use is probably memcpy.
Yep, same here and agree.
Compilers have definitely got better, though. Another issue in the past (maybe it still is to a degree, although compilers have got a lot better at this in the past 15 years, and it used to be one of the things only Intel's ICC actually got right) was that if you wrapped the base-level '__m128' or 'float32x4_t' in a struct/union in order to provide some abstraction, the compiler would often lose track of it when passing the struct/union through functions (either by value or const ref). It would often end up 'spilling' the variable from registers (not entirely the correct terminology in this context, but...) and produce asm that uselessly reloaded the variable from a stack address further up the call stack when it didn't actually need to. So that was the situation even when using intrinsics within custom wrappers.
From 2011 to around 2013 ICC seemed to be the only compiler on amd64 which wouldn't do this. If you passed the actual '__m128' down the function call chain instead, clang and gcc would then do the right thing.
Part of that could be ABI constraints. There are some surprising calling convention differences between a vector and a struct or union with vectors in it, and they vary platform to platform. E.g. on ARM a struct with two 128-bit vectors will pass in two registers where on x86 it must pass via the stack.
Using __attribute__ to tweak calling conventions can often really clean this up, but that's just as obscure and non-portable as the problem it fixes. So you either end up writing weird non-portable code one way or weird non-portable code another... Code working with these types doesn't get to benefit from zero-cost abstraction to the degree we're used to with normal scalar code.
That's an ABI constraint of the x86 32-bit ABI.
People invented x32 to fix this. Or just use amd64.
This was with amd64.
ICC was at the time the only compiler that would not do that.
In such discussions, whenever you mention abstractions are universally "pretty poor", to the extent anyone is listening, I think this hyperbole can do real damage. Maybe it prevents people from getting relevant performance gains, even if not 100% of the optimum, which is anyway unattainable. And what is the alternative? Not many projects can afford to hand write intrinsics for all platforms. And are you aware that Highway is basically a thin wrapper over intrinsics, which you can still drop down to where it helps?
> 100% of the optimum, which is anyway unattainable.
Can you expand on this? Sounds like an interesting discussion.
Not who you asked but I think the meaning is that since intrinsics for simd are different in each platform, being able to have something that is portable and sometimes works faster is something, while writing for Intel, ARM and a zoo of instruction sets is not an option for some.
:) I figure there is always something left to improve. For some kernels which really want to keep 30+ live registers, the compiler might not do as good a job as careful manual tuning, so intrinsics can have a bit of a cost. But I also figure optimization time is limited, so better to get 90% of several kernels rather than one to 99%.
Besides Spolsky's law of leaky abstractions, "abstractions" can also result in "lowest common denominator" situations, which are the opposite of performance optimization. Talking negatively about abstractions is not what deals damage; you are shooting the messenger here. It's the abstractions themselves that deal damage when misplaced. "Zero-cost abstractions" is the true hyperbole.
Is this a good faith reply? The particular abstraction we built, and is being discussed, is manifestly and obviously not a lowest common denominator. Looks like you are deploying a second straw man, that of zero cost. In other comments here I acknowledge a cost to intrinsics.
I am aware of Highway. It doesn’t add much value for the kind of SIMD code I write. I have better abstractions because I don’t have to consider portability nearly as much. Some useful constructions don’t have a good expression on weaker SIMD architectures.
Do you say that from the perspective of compiled languages? I hear good things about .net core wrt SIMD, but that has the advantage it can decide at JIT.
I'm not the person you're asking, but I share that opinion for both compiled languages and JIT solutions, including .net core specifically. All but the most trivial use cases can't be autovectorized, by JIT or otherwise. One of the recent things I worked on (reed-solomon decoding) offers basically zero opportunities for autovectorization unless the compiler reinterprets certain scalar loops as dedicated galois instructions on AVX512F hardware, but that optimization isn't implemented, it wouldn't help other architectures anyway, and it's still 10x slower than a well thought out vectorized approach.
Thanks, are you talking about using plain loops with regular arrays, or do you mean the specific types like here <https://learn.microsoft.com/en-us/dotnet/standard/simd>?
EDIT: A bit more background @<https://medium.com/@meriffa/net-core-concepts-simd-avx-intri...>
I think what a SIMD library does, above all else, is get the programmer to write code in a way that can be directly translated into SIMD instructions. A big issue that compilers have to contend with is that they aren't allowed (unless you enable -ffast-math) to rearrange floating point operations. Putting an add or a multiply in the wrong place can spoil SIMD optimizations that the compiler could otherwise pull off.
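A small sketch of the reassociation point, using a plain sum as the example (the function names and the lane count of four are assumptions for illustration). The first function asks for a strict left-to-right sum, which the compiler must preserve by default. The second spells out four independent partial sums in the source, so mapping them onto vector lanes no longer requires the compiler to reorder anything on its own.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Strictly ordered sum: ((x0 + x1) + x2) + ... Without permission to
// reassociate (for example -ffast-math), the compiler has to keep this order.
float sum_in_order(const float* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += x[i];
    return acc;
}

// Sum with four independent partial sums, one per hypothetical SIMD lane.
// The different association is now part of what the source asks for.
float sum_four_lanes(const float* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    std::array<float, 4> lane{};  // zero-initialized partial sums
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (std::size_t l = 0; l < 4; ++l) lane[l] += x[i + l];
    }
    float acc = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; ++i) acc += x[i];  // scalar tail for leftover elements
    return acc;
}
```

The two functions can return slightly different results for general floating-point input, which is precisely why the compiler is not allowed to turn the first into the second by itself.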
But the problem is as you state. For people that really care about that sort of thing, they are likely going to have the exact SIMD sequence they want to execute in mind anyways. That leaves you with a definition that is doomed to be both not low level enough and too low level.
I think what this is useful for is a fallback description of the desired SIMD operations. It won't be ideal on non-targeted platforms, but it will be something.
> People that don’t use SIMD today are unlikely to use std::simd tomorrow.
Why? At least I can see myself starting with std::simd in my pet projects. If that is not enough, I would move on to intrinsics. But I think starting with std::simd would be much simpler for a beginner.
> I think a legitimate criticism is that it is unclear who std::simd is for. People that don’t use SIMD today are unlikely to use std::simd tomorrow. At the same time, this does nothing for people that use SIMD for serious work. Who is expected to use this?
There are plenty of vectorization cases that are simple enough to be done with std::simd today and that will still bring any autovectorizer begging on its knees for various reasons.
As an anecdote, I recently got an 8x speedup with std::simd (AVX2 & SVE2) on a rather trivial parser of mine that the autovectorizer failed miserably to handle properly.
Would I have gotten better results using intrinsics? Likely, yes.
Did I want to suffer the maintainability and portability pain associated with it for a simple parser ? Certainly not.
For these use cases, std::simd does the job. And it will probably do a better and wider job over time as it gets enriched by the committee.
The blog brings some valid criticism but really looks like a flame war trying to kick down an already open door.
(1) Are there more performant solutions than std::simd for vectorization?
Yes, of course. The STL evolves slowly; its main goal is to provide generic and portable implementations of a set of algorithms, not to provide the best implementation in existence.
The best implementation of most algorithms (including SIMD patterns) evolves every 6 months; you cannot expect a standard library with 3 different implementations to keep up with that.
(2) Is the future of vectorization ISPC ?
Nope. ISPC has been around for > 10y and is still niche. There are very good reasons for that: yes, it can generate better code, but in most use cases adding a massive dependency on a compiler + an arbitrary LLVM version + a DSL to your project is not worth it.
Especially considering that it is an Intel project and that Intel (almost) abandoned the project multiple times (in pure Intel fashion).
So yes, criticism is easy, and yes std::simd is full of problems.
But I am glad it exists, and thanks to the people that made it happen... Because it is useful, even in the current state.
The point about the optimizer only seeing "opaque templates and function calls" makes little sense.
First off, templates are the opposite of opaque due to the fundamental requirement that the implementation be visible to every translation unit using a template. This makes any function calls trivially inlinable.
Second, and the reason for the above requirement, templates are compiled by monomorphization – making a distinct, separately optimizable copy of each concrete instantiation of a template. By the time the compiler backend sees the intermediate representation, there’s nothing about templates left.
There are of course reasons why highly abstracted template code may be difficult to optimize, for instance if function call chains are so deep that the inliner gives up. There are also legitimate reasons why a fully language-based solution might beat a library-based one. But one of the points of adding a library to the std is that the standard library is allowed to cheat as much as it wants. It can be deeply integrated to the compiler and implemented entirely using compiler magic if necessary.
std::simd may be too little, too late for many reasons, but I doubt any of them is that the compiler can’t see through the code.
it's just a compiler hint just like all other hints, parity with other languages.
Compiler guy here - yes, the optimizer claim is, in fact, just totally wrong. The claim is that it can't be constant folded, scheduled, etc. because it doesn't turn into SIMD instructions. At the bottom, as you say, it will turn into code. That code is almost certainly builtins or uses the vector extensions or whatever, not raw asm statements. Compilers can and do turn these into IR-level SIMD instructions.
I agree with you, but just a small nit:
> First off, templates are the opposite of opaque due to the fundamental requirement that the implementation be visible to every translation unit using a template.
That's not strictly true, you can have an implementation hidden in a separate TU, as long as that TU instantiates the template for all template arguments that the users are going to use.
As a compiler guy, the complaint about "opaque templates and function calls" to me raises serious doubts that the author has any idea what they're talking about. std::simd is designed to be akin to taking vector operations as intrinsics on <4 x f32> and similar types and wrapping them in a more C++ dialect than bare compiler intrinsics (and then a second layer on top of that to make things somewhat more portable).
So the implementation of all of the std::simd at the bottom should be tiny functions that map to essentially a single instruction, specified via a header file in a mechanism that guarantees you always have the body. This makes the functions trivially obvious candidates for inlining. Since it's a C++26 addition, the dispatching logic through the layers can largely be done via if constexpr, which means most of the code is discarded by the frontend.
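As a rough illustration of that structure (explicitly not the libstdc++ implementation; `Pack`, `add`, and `horizontal_sum` are made-up names), here is a header-style wrapper whose operations are tiny templates and whose dispatch is done with `if constexpr`. For any given instantiation only one branch survives the frontend, and everything is visible for inlining. It needs C++17 or later.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size pack of N lanes of type T.
template <typename T, std::size_t N>
struct Pack {
    T lane[N];
};

// Element-wise add: a tiny function whose body is always visible to callers.
template <typename T, std::size_t N>
Pack<T, N> add(const Pack<T, N>& a, const Pack<T, N>& b) {
    Pack<T, N> r{};
    for (std::size_t i = 0; i < N; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// Compile-time dispatch: the branches not taken are discarded per
// instantiation, so no runtime test on N remains in the generated code.
template <typename T, std::size_t N>
T horizontal_sum(const Pack<T, N>& p) {
    static_assert(N > 0, "a pack needs at least one lane");
    if constexpr (N == 1) {
        return p.lane[0];
    } else if constexpr ((N & (N - 1)) == 0) {
        // Power-of-two width: halve the pack and recurse (tree reduction).
        Pack<T, N / 2> half{};
        for (std::size_t i = 0; i < N / 2; ++i) {
            half.lane[i] = p.lane[i] + p.lane[i + N / 2];
        }
        return horizontal_sum(half);
    } else {
        // Any other width: plain left-to-right accumulation.
        T acc{};
        for (std::size_t i = 0; i < N; ++i) acc += p.lane[i];
        return acc;
    }
}
```

A real implementation would replace the inner loops with compiler builtins or vector-extension types, but the layering and the inlining story are the same.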
Given that the complaint seems to be about not vectorizing a call to a sin function, it's possible that it's implemented in libstdc++ in such a way that the library doesn't know about the compiler's -fveclib implementation. But then again, the complaint is based on the libstdc++ v14 implementation of the C++ Parallelism TS std::experimental::simd, not the v16.1 implementation of C++26 std::simd, which is completely different (and landed circa 2 months ago).
So I've got a foot in each camp, I think you're just using different languages - you guys mean different things with the same words.
You can't claim he doesn't know what he's talking about based on a single point he may have gotten wrong; he makes a tonne of valid points, especially around the existing problems of C++ that this library doesn't help with.
In addition to that, he's not wrong about this library from a user perspective. I can't use this. I wrote something very similar back in 2016 - at the time it served my needs but now it's hilariously outdated.
One thing I will point out is that the code in the article is compiled with `-march=native` and `-ffast-math`, meaning that they're really only compiling for the exact same machine they are running on and no other. This seems like it is mainly applicable to places which can easily recompile code for the exact known hardware that they run on, such as HFT and some scientific computing.
Places which compile code to distribute for people to run on a variety of processors and platforms (or that require floating point code to be consistent between them), i.e. games and applications, will still be targeting a low end baseline architecture and therefore have a different outcome. I can say that in this space we are only now reaching the point where we can start compiling for AVX2, as we can expect the lowest end-user processor to support it.
The linked[1] "six reasons to use std::simd" was just what I needed after a long week. Hilarious!
[1]: https://github.com/NoNaeAbC/std_simd
That certainly convinced me. When I was doing my taxes recently and had to watch those forced loading animations, I kept asking myself "why can't my compiler do this?" Thanks to std::simd, now it can!
isn't that just QoI issues? There's a reason why the libstdc++ folks labelled their implementation as experimental.
It should have been "eight reasons to use std::simd". Inefficient.
This is the first time I've seen a classic for loop called a "boomer loop", but apparently this isn't even the first instance (not the first definition):
http://boomer-loop.urbanup.com/18229646
I made the first proposal to the C++ standard committee to introduce SIMD in 2011, before Matthias Kretz got involved with his own version (which is what became std::simd). This was based on what eventually became Eve (mentioned in the article).
Back then, it was rejected, for the same arguments that people are making today, such as not mapping to SVE well, having a separate way to express control flow etc.
There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language. Then that died out (I'm not sure why), and SIMD became trendy, so the committee was more open to doing something to show that they were keeping up with the times.
Trying to abstract over SVE with a SIMD library is a bit of a fool's errand. The intended programming model is just too different from traditional ISAs, and there are algorithms that are nearly impossible to write efficiently for it. All the ones I've seen wrap it up as a bastardized fixed length ISA, and even ARM's own guidance basically recommends that approach.
Frankly, the length agnostic stuff is a mistake that I hope hardware designers will eventually see the light on, like delay slots.
> Trying to abstract over SVE with a SIMD library is a bit of a fool's errand
It really isn't. You just make the default SIMD-width agnostic and anything less portable opt-in.
You can still specialize for a specific width on scalable vector ISAs.
> The intended programming model is just too different from traditional ISAs, and there are algorithms that are nearly impossible to write efficiently for it.
Such as?
> All the ones I've seen wrap it up as a bastardized fixed length ISA, and even ARM's own guidance basically recommends that approach.
Google Highway doesn't. And while Arm is stuck with 128-bit SVE, because they also have to implement NEON as fast as possible to be competitive, RVV already has a large diversity of hardware available with different vector lengths: 128, 256, 512, 1024.
I have a database that has big columns that get functions applied to them to compute the result set. This is a perfect case for length agnostic instructions, except it ends up horribly memory bound. A nice optimization is to only compute those lanes containing rows that might actually be in the result set by keeping track of a sparse record that depends on the lane size. But the cnt instructions are optional, and this also inhibits compiler optimizations in that lookup.
CNT and CNTP don't seem to be optional for SVE, from what I found. (unless you mean HISTCNT)
It seems to me like you want to use CNTP on a bitset that tells you which rows are relevant, skipping them if CNT is 0? Is that what you were describing?
I was confused and thinking that streaming mode and CNT were in separate extensions, but they're both in SME. My bad.
Anyway, essentially yes. My previous comment didn't mention all of the context. The join enforces that the result set is the intersection of the individual column sets, so it gets increasingly sparse as individual columns are computed. So I just maintain a bit tree that says which columns could populate the result set and skip computing the other lanes, which depends on the vector width and benefits from knowing it at compile time.
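A scalar sketch of the "skip blocks whose candidate mask is empty" idea, with the caveat that it is an illustration and not the commenter's system: the block width of 8, the doubling "computation", and all names are assumptions. A `std::bitset` stands in for the predicate register and its population count.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed block width: 8 rows per block, standing in for the vector width.
constexpr std::size_t kLanes = 8;

// Hypothetical kernel: doubles the value of every row that is still a
// candidate and skips whole blocks whose candidate mask is empty.
// live[b] holds one bit per row of block b. Returns the number of blocks
// that were actually computed.
std::size_t apply_to_live_rows(std::vector<int>& column,
                               const std::vector<std::uint8_t>& live) {
    assert(column.size() == live.size() * kLanes);
    std::size_t computed = 0;
    for (std::size_t b = 0; b < live.size(); ++b) {
        const std::bitset<kLanes> mask(live[b]);
        if (mask.none()) continue;  // no live rows here: skip the block
        ++computed;
        for (std::size_t l = 0; l < kLanes; ++l) {
            // Masked update: rows that are not candidates keep their value.
            if (mask[l]) column[b * kLanes + l] *= 2;
        }
    }
    return computed;
}
```

Because the mask layout is tied to `kLanes`, changing the vector width changes the bookkeeping structure, which is the dependence on a compile-time width described above.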
I'm no C++ dev, but as an outsider, it sure reads like the whole "int is variable length" mistake again.
That abstraction is occasionally usable in low level systems code, that is why Go, Rust, D and C# support it as well.
Also note that this is C, not C++.
In a way it's worse because at least with int you're not really expecting to run the same binary on architectures with different int lengths, and also for several decades there have only been two realistic options (32 or 64), which makes it easy to deal with.
With RVV (and SVE I assume) there are a wider range of realistic options - at least 128, 256 and 512. The RVV spec allows up to 65536! Also it's totally reasonable to want a single binary to work with all of them so then you're into compiling parts of your code multiple times with runtime dispatch which is a right pain.
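For readers who have not seen it, this is roughly what the "compile it multiple times and dispatch at runtime" pattern looks like by hand. It is a minimal sketch using GCC/Clang facilities on x86 (`__attribute__((target))`, `__builtin_cpu_init`, `__builtin_cpu_supports`); on other compilers or architectures the preprocessor guard leaves only the baseline path. The function names are made up, and this is not how Highway implements its dispatch.

```cpp
#include <cassert>
#include <cstddef>

// Baseline version: compiled for the build's default target.
static int sum_baseline(const int* x, std::size_t n) {
    int s = 0;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
// Same source, but the compiler may use AVX2 instructions in this function.
__attribute__((target("avx2")))
static int sum_avx2(const int* x, std::size_t n) {
    int s = 0;
    for (std::size_t i = 0; i < n; ++i) s += x[i];
    return s;
}
#endif

using SumFn = int (*)(const int*, std::size_t);

// Chooses an implementation based on the CPU the binary is running on.
static SumFn pick_sum() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_baseline;
}

// Public entry point: the choice is made once, on first call.
int sum(const int* x, std::size_t n) {
    assert(x != nullptr || n == 0);
    static const SumFn impl = pick_sum();
    return impl(x, n);
}
```

Every extra target multiplies the compiled copies of each kernel, which is where the code bloat and the "right pain" come from.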
I haven't looked into how Highway does it but I don't really know how you write length-agnostic code in high level languages. It's easy in assembly, but it sucks if you have to do it in assembly.
Here is a highway example: https://gcc.godbolt.org/z/7sdPr61W6
There is a bit of boilerplate to get dynamic dispatch working, but apart from that it's quite simple to use.
That's a mistake for ABI visible types, yes.
I don't know how SVE works but I thought the point of it was to let implementations pick a larger size than the CPU supports and then get an automatic speedup from better processors with more vector lanes.
To me it’s clear adding the ability to express intent to parallelise is the Right Thing. This is the only way the compiler can actually know what you want it to do.
> There was a real alternative being considered at the time: integrating ISPC-like semantics natively in the language.
I think this is the best solution for truly portable SIMD. Sure it doesn't cover everything, but it makes autovec explicit, guaranteed and more powerful.
One of the biggest problems with "portable" SIMD libraries, is that when it's used for simple things, often autovec is better, as it has access to the direct ISA semantics and can much easier do things like unrolling.
GCC already solved it: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html The operations behave like C++ valarrays. Addition is defined as the addition of the corresponding elements of the operands. For example, in the code below, each of the 4 elements in a is added to the corresponding 4 elements in b and the resulting vector is stored in c.
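The quoted passage refers to a code sample that did not make it into the comment. A minimal sketch of what it describes, using the `vector_size` attribute from the GCC documentation (also accepted by Clang), looks like this; the `add4` wrapper is just an illustrative name. This is a compiler extension, not standard C++.

```cpp
#include <cassert>

// GCC/Clang extension: a vector of four 32-bit ints, 16 bytes in total.
typedef int v4si __attribute__((vector_size(16)));

// Element-wise addition: each of the 4 elements of a is added to the
// corresponding element of b. The compiler lowers this to the target's
// SIMD instructions where available.
v4si add4(v4si a, v4si b) {
    return a + b;
}
```

Individual elements can be read back with ordinary subscripting, e.g. `c[0]`.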
Thanks!
And these are also available in clang. https://clang.llvm.org/docs/LanguageExtensions.html#vectors-...:
“Vectors and Extended Vectors
Supports the GCC, OpenCL, AltiVec, NEON, SVE and RVV vector extensions”
Those type attributes are also used for the x86 intrinsics API, and they override default C behaviors like promotions and presumptions around aliasing (ironically they make type punning easier, though maybe it was just the few use cases I explored, and this isn't an area where I have a lot of experience). C23 also gained the _BitInt type, which discards all the old promotion rules, which should help autovectorization.
I think ISPC is still the proper way to go. But these days everybody wants One Language to Rule Them All along with standard libraries for doing everything out-of-the-box. And while in principle ISPC's approach could be stitched into C or C++ in a fairly clean manner (perhaps with well-defined and enforced segregation of constructs to minimize complexity), it's just not gonna happen: C++ is too enamored with constructing libraries through deeply complex templated types (hammer, nail, yada yada), and C is just too conservative (though if GCC or clang went the distance with a full implementation, there's a good chance the C committee would adopt it).
Curious if people here have looked at the upcoming SIMD support in Go: https://go.dev/doc/go1.26#simd
Currently experimental, but looks like the first Intel arch will arrive in the next release in about 3 months. They are also going to support a portable layer.
Wondering what people here think about the approach the Go team is taking; I think they would appreciate more eyeballs on their design. (I’m not competent in this space (yet))…
Looks like that isn't a portable SIMD abstraction, but more similar to adding architecture-specific SIMD intrinsics support to go, with nicer syntax.
Sorry, I didn’t explicitly link to the issue for the portable layer.
Here is the issue discussing the portable simd package: https://github.com/golang/go/issues/78902
"The Default Width Problem" -- this section seems confused and definitely reeks of LLM authorship. It's comparing -march=native against std::simd and complaining that std::simd<T,8> breaks portability with pre-Haswell. This is a real issue, but -march=native is no better! It bakes in the SIMD width at compile-time as well, so that binary also won't run on a pre-Haswell machine. It's a real issue but neither side solves it. You need runtime dispatch (a la Google Highway) to solve this.
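For reference, runtime dispatch in its simplest form looks roughly like the sketch below. It assumes GCC or Clang on x86-64 (it relies on the `target` attribute and `__builtin_cpu_supports`) and falls back to the scalar path everywhere else. The function names are hypothetical, and this is not how Highway does it internally:

```cpp
#include <cstddef>

// Portable scalar fallback; runs on any CPU.
static float sum_scalar(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

#if defined(__GNUC__) && defined(__x86_64__)
#include <immintrin.h>

// Built with AVX2 enabled for this one function, whatever -march says globally.
__attribute__((target("avx2")))
static float sum_avx2(const float* p, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);       // spill the 8 partial sums
    float s = 0.0f;
    for (float v : lanes) s += v;
    for (; i < n; ++i) s += p[i];       // scalar tail
    return s;
}
#endif

// Hypothetical entry point: the CPU is queried once, at run time.
float sum_dispatch(const float* p, std::size_t n) {
#if defined(__GNUC__) && defined(__x86_64__)
    static const bool has_avx2 = __builtin_cpu_supports("avx2");
    if (has_avx2) return sum_avx2(p, n);
#endif
    return sum_scalar(p, n);
}
```

The same binary then runs on a pre-Haswell machine (scalar path) and on a newer one (AVX2 path), which is exactly what neither -march=native nor a fixed-width type gives you.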
If we want to improve cross-platform SIMD, in my opinion we should start by supporting more operations in LLVM IR. Like vector expansion (currently we only have expandload), runtime-known shuffle vectors, pdep/pext operations.
Also, let's stop with the "vector length agnostic" types being the sole option for SVE extensions. I'd rather write an optimized routine for a 16-byte machine I'm targeting and be able to upgrade it in 5 years than have "agnostic" code that wants to pretend like it would work amazingly on all platforms, but the machine I optimized it for is theoretical. I'm fine with recompiling my code, I do it every day. If I have an algorithm that's truly vector length agnostic, I can make the vector length a constant in my code that can change based on the compile target.
https://github.com/llvm/llvm-project/issues/113422
https://github.com/llvm/llvm-project/issues/172857
> Also, let's stop with the "vector length agnostic" types being the sole option for SVE extensions
They aren't, see the `arm_sve_vector_bits` attribute.
> I'm fine with recompiling my code, I do it every day
Then you can do that.
> If I have an algorithm that's truly vector length agnostic, I can make the vector length a constant in my code that can change based on the compile target.
You can do that, but why not simply write it in a vector-length-agnostic way?
IMO the better approach is to start thinking about SIMD optimizations in a VLA way, and to specialize on the vector length when that becomes advantageous. Doing it this way is better even if you end up not writing VLA code, because you thought about the scalability problem.
Many libraries currently don't scale beyond 128-bit, not because they couldn't make efficient use of >128-bit, but because the library was architected around 128-bit and changing that amounts to almost a full rewrite. So now you are stuck wasting 3/4 of your ALUs running 128-bit SSE on Zen5.
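A rough sketch of what "don't architect around 128-bit" can mean in practice, even without SVE: keep the width in one compile-time constant and write the kernel against it. The constant names and the macro-based selection are assumptions for illustration, not taken from any particular library:

```cpp
#include <cstddef>

// Hypothetical width selection keyed on predefined compiler macros.
// Recompiling for a different target changes the constant, not the kernel.
#if defined(__AVX512F__)
constexpr std::size_t kVectorBytes = 64;
#elif defined(__AVX2__)
constexpr std::size_t kVectorBytes = 32;
#else
constexpr std::size_t kVectorBytes = 16;  // SSE2 / NEON style baseline
#endif

constexpr std::size_t kLanes = kVectorBytes / sizeof(float);

// The kernel only ever mentions kLanes, never a literal 4.
void scale(float* data, std::size_t n, float factor) {
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        // Fixed trip count: a natural candidate for a single vector operation.
        for (std::size_t lane = 0; lane < kLanes; ++lane)
            data[i + lane] *= factor;
    }
    for (; i < n; ++i) data[i] *= factor;  // scalar tail
}
```

Data layouts and block sizes derived from `kLanes` then widen along with the hardware instead of needing a rewrite.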
Just write inline asm for x86 and aarch64 (if you care about that) and don't care about the rest. Is it even useful to do SIMD on other processors?
After writing some of this kind of code, having the compiler optimize even the code around the SIMD code based on the semantics of arithmetic or other things sounds silly.
So you "just" write 4 assembly implementations?
Agreed, fixed-width vectors need to be a language feature: better compile times, and it would solve the issues for most people.
Personally, I think something like the Clang way of adding GLSL-like vectors and semantics would've gone a long way. SVE might be an elegant design, but in reality there is probably several times more game and other 3D code being written that needs vectors than code in other fields, and there limited vector sizes aren't really a problem.
And honestly, considering the story of AVX512.. with 512 bit vectors being removed from mainstream by Intel, do we really really need longer ones despite it being from a "scalable design"?
On GPUs, GLSL-like types compile down to what is basically variable-length SIMD. A vec4 doesn't get compiled to a SIMD vector with four floats, but rather to four SIMD vectors, each containing N FP32 elements (usually 32 or 64).
Look at what this simple shader compiles down to on RGA: https://godbolt.org/z/4GrfY61vf
Intel has been forced to reintroduce 512-bit vectors in the mainstream, because of the competition from AMD.
Starting with the Intel Nova Lake CPUs, around the end of this year, all future AMD and Intel CPUs will provide 512-bit vectors, as the current AMD Zen 4 and Zen 5 CPUs already do.
The 512-bit vector length is more convenient than other lengths, because on the AMD and Intel CPUs it coincides with the length of a cache line. Because of this, it is easier to optimize simultaneously for the best cache usage.
For GPUs, which favor throughput over latency, 1024-bit and 2048-bit vector register widths are frequently used. For CPUs it is unlikely that widths greater than 512-bit would be useful, as the vector operations that should be done on CPUs are those for which the high latency of using a GPU is undesirable.
SIMD wider than 512 bits isn't relevant for regular general-purpose processors, either now or in the near future.
But for smaller, more specialized CPUs in embedded or automotive use cases you can get more parallel compute while keeping the software model simpler than having to dispatch to a GPU.
Specifically a design like https://saturn-vectors.org/#_short_vector_execution, which likes to use vectors 2x or 4x wider than the datapath for more efficient chaining. I quite like that design, because you can get high utilization and limited out-of-order execution without vector register renaming.
> When Google needed portable SIMD for production image and video codecs, they built Highway — not std::simd.
Sure, they left the committee years ago. I am not trying to claim any sort of direct causality, but it sure seems like this is a case where Google's presence on the committee might have prevented shipping boondoggles like this. Modules is another case where I think Google's feedback might have been able to steer the ship in a better direction.
They probably left the committee because it keeps going in circles rather than solving the issues of the language
Yes. The straw that broke the camel's back was the complete refusal to break ABI, locking in bad implementations forever. e.g. unordered_map dramatically underperforms when compared against modern swiss tables, but the committee won't do anything about it. Not to mention the committee's head-in-the-sand policy-based approaches to safety, vs Google's much broader-scoped Carbon effort.
More context: https://github.com/carbon-language/carbon-lang/blob/trunk/do...
If you thought std::simd was a library nobody asked for, just wait until you hear about <linalg>. I feel like half the people looking forward to that think they're just going to get standard C++ bindings to LAPACK, when instead they're probably going to get an unoptimized, slapdash implementation of LAPACK written by people who aren't good at BLAS.
As for SIMD itself, designing a good SIMD library is difficult because there are several different SIMD approaches and some of them work poorly for certain use cases. For example, you can take an HPC-ish approach of "vectorize this loop" (à la #pragma omp simd) and have the compiler take care of a fairly mechanical transformation. Or you can take an opposite approach of treating a 128-bit SIMD vector as a fundamental data type in your language. Which approach is better depends on your use case.
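To make the two styles concrete, here is a small sketch of the same multiply-add written both ways. The function and type names are hypothetical. The pragma only has an effect when compiling with `-fopenmp` or `-fopenmp-simd`, and the vector type relies on the GCC/Clang `vector_size` extension:

```cpp
#include <cstddef>

// Style 1: describe a loop and let the compiler do the mechanical transform.
void saxpy_loop(float a, const float* x, float* y, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Style 2: a 128-bit vector is an ordinary value type you compute with.
typedef float float4 __attribute__((vector_size(16)));

float4 saxpy_vec(float a, float4 x, float4 y) {
    float4 av = {a, a, a, a};  // explicit broadcast of the scalar
    return av * x + y;         // element-wise multiply and add
}
```

The first form suits long homogeneous arrays. The second suits code where the natural unit of data is itself a small vector, such as a position or a color.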
Just wait until you hear about std::hive.
The work of one obsessive author, who never gave a good explanation for why the thing needed to be in the standard library instead of an external one. The committee was apathetic about the proposal and kept bringing up various trivial issues, in a clear attempt to stall him, but he refused to take the hint. So eventually they relented. Outside coverage I have seen so far seems to be to the tune of "WTF is this weird thing?" and quickly glosses over it.
I wonder if it's going to end up like the export keyword.
I feel like std::hive fits right in to the C++ stdlib group of collections
The least stupid is std::vector which is just the typical O(1) amortized growable array type found in most modern languages, with a mediocre API. 8/10 could do better.
std::array is just the built-in array type C++ should have but doesn't. This shouldn't be a library type, that's embarrassing.
std::deque looks like you're getting something like Rust's VecDeque but you aren't, it's a weird hybrid optimisation which presumably made sense on some 1980s hardware. I asked STL once to explain what it's even for and they didn't know. [[For reference STL is the name of the guy in charge of Microsoft's implementation of the C++ standard library, Microsoft also calls that library STL for reasons we needn't address]]
std::list is the extrusive doubly linked list. This type makes sense in a DSA class. Why is it in the C++ standard library? I dunno, maybe C++ is intended only as a teaching language?
std::forward_list is the extrusive singly linked list. You know, for a different seminar in that same DSA class. You might want the intrusive linked list, you don't want this.
std::map and std::set are probably red-black trees. OK, you might need those and for some reason not care about the details (which aren't specified here)
std::multimap and std::multiset are even less obviously useful. I have never seen them used in real software. Why are they in the standard library?
std::unordered_all_of_the_above_maps_and_sets look like the simplistic hash table you'd be shown in an intro DSA class either taught by somebody who doesn't know the subject well or aiming to cover the basics and get back to their research. This will perform poorly on any hardware with features like a cache.
The C++ stdlib carries broken garbage basically indefinitely. C++ doesn't have the same library stability promise that Rust has, but in practice stuff that nobody cares about is never removed.
I'm not sure what the argument is here?
These are in the standard library because someone proposed their inclusion.
They're fine for the majority of people who really don't want to roll their own data structures each time.
They're not compulsory to use, you're still free to roll your own.
> I'm not sure what the argument is here?
That std::hive will fit right in. Another container type you probably shouldn't use, draining precious maintenance resource from groups who have better things they could be doing.
> These are in the standard library because someone proposed their inclusion.
As with std::hive. Indeed the "unordered" containers, just like std::hive were repeatedly knocked back and eventually got in decades after they were obsolete. Persistence really does pay off in C++
> They're fine for the majority of people who really don't want to roll their own data structures each time.
Sure, doubtless std::hive is fine for that same majority of people.
>The committee was apathetic about the proposal and kept bringing up various trivial issues, in a clear attempt to stall him, but he refused to take the hint.
That's a mean interpretation, mean both towards the committee and towards the author.
Have you carefully read the <linalg> paper[1]?
It doesn't require reimplementing it...
> Our proposal is inspired by and extends the dense BLAS interface. A natural implementation might look like this:
> 1. wrap an existing C or Fortran BLAS library,
[1] https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p16...
Have you read the entire paper, and not just skimmed the front matter?
The interface is a generic template approach, which can work on any element type T, not just float/double/complex<float>/complex<double>, but custom types like bigint or rational or random_custom_finite_field. Or integration with units libraries (there's another dumpster fire coming down the line...). Your BLAS library will provide you just the four basic element types, so it takes a decent amount of dispatch logic to convert the template interface to the actual library calls, and you still need fallback logic anyways to handle the other types.
But the library is also not designed in a way to facilitate that kind of dispatch logic (std::simd is, which accounts for a not insubstantial portion of its complexity). That is on top of the difficulty of linking to one of various BLAS implementations from a standard library. So it's a design that's all but guaranteed not to let you link against an existing BLAS implementation, and indeed, carefully reading the rest of the section you quoted makes it clear that it's not a goal of the paper to have implementations do that.
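A rough sketch of the kind of dispatch being described, reduced to a single dot product. `vendor_ddot` is a hypothetical stand-in for a real library routine such as `cblas_ddot`, and none of these names come from the <linalg> paper:

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for a vendor BLAS call; a real implementation would
// declare the library symbol and link against it instead.
double vendor_ddot(std::size_t n, const double* x, const double* y) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += x[i] * y[i];
    return s;
}

// Generic fallback for element types no BLAS library knows about.
template <class T>
T generic_dot(std::size_t n, const T* x, const T* y) {
    T s{};
    for (std::size_t i = 0; i < n; ++i) s = s + x[i] * y[i];
    return s;
}

// Hypothetical front end: route double to the library, everything else to
// the fallback. A full version needs a branch per BLAS element type, plus
// checks on layout and strides before the library call is even legal.
template <class T>
T dot(std::size_t n, const T* x, const T* y) {
    if constexpr (std::is_same<T, double>::value) {
        return vendor_ddot(n, x, y);
    } else {
        return generic_dot(n, x, y);
    }
}
```

Even in this toy form, the fallback path has to exist and be maintained alongside the library path, which is the cost being pointed at.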
> Your BLAS library will provide you just the four basic element types, ..., and you still need fallback logic anyways to handle the other types.
so, your problem with it is that it does all you want (e.g. LAPACK bindings) AND gives extra features??
> so it takes a decent amount of dispatch logic to convert the template interface to the actual library calls
I can't estimate how much this degrades performance. But, it feels very low overhead compared to the calculation itself (and probably should be resolved at compile time)
> so, your problem with it is that it does all you want (e.g. LAPACK bindings) AND gives extra features??
You've completely misunderstood the point: the point is that the "extra features" means you won't get the main feature (BLAS bindings).
Something everyone is missing is that this is just feature parity with other languages; it is not C++-specific. They added it because other languages have these compiler hints, which are then (usually) used in LLVM as opaque types for 'smarter' optimizations. Hand-rolled code will still be better, but there are very niche instances where you want LLVM to know this information, especially when you don't care about performance but do care about data integrity and obfuscation.
I'd actually rather just have the compiler give some guarantees on producing SIMD code when you write regular C++ code doing sums, multiplications, etc... in a particular way. And perhaps add a few more operators/keywords to the language for modern CPU instructions (we got things like popcount, countl_zero and fma, but what about e.g. pext, pdep, aes, ...)
> The pattern is clear: every major project that actually needs portable SIMD in production chose a third-party library or a different language
Of course they chose a third-party library: C++26 has only just been published and doesn't have wide support or adoption experience yet.
> And the most damning data point might be EVE itself — a committee member looked at std::simd, decided it wasn’t good enough, and built his own library.
That's just manipulation. The first commit in eve[1] was published in 2018. There was no std::simd in the standard at that time.
> Nobody waited for std::simd. By the time it ships in C++26, these libraries will have a decade of production battle-testing, real user feedback, and cross-platform coverage that std::simd can’t match on day one.
Manipulation 2. No library has a decade "of production battle-testing, real user feedback, and cross-platform coverage" on day one. So why did their authors create them?
> Including <experimental/simd> pulls in deeply nested template machinery — simd.h, simd_x86.h, simd_builtin.h, and friends. A trivial function computing sin on a SIMD vector takes about 2.2 seconds to compile. The equivalent scalar for-loop? 0.2 seconds.
It would be more interesting if you compared this with precompiled headers and C++20 modules.
> The std::simd version? It emits actual vsqrtps + vmulps because the optimizer can’t perform algebraic simplification through opaque template function calls:
opaque template function calls? What is this?
Of course there are 1000 examples where the compiler can do a better job with a scalar loop. And there are 1000 examples where it can't. But for some reason people do write SIMD manually. Probably because they want predictable code generation: no massive slowdown on another compiler, another compiler version, another CPU, or after changing one line of the loop.
> sqrt(x) * sqrt(x)
What would the compiler generate with manual SIMD intrinsics? I doubt it would be the same as the scalar mul.
> The frustrating part is that the problems are well-understood. SIMD programmers have been asking for the same things for years, and none of them are in std::simd.
Show me your proposal with a critique of std::simd, if you are asking for these things. Or at least someone else's proposal. How can anyone tell that someone is asking?
Too many emotional statements, too few technical details.
[1] https://github.com/jfalcou/eve
Unnecessarily negative article. Let's not forget how awful C++98 was for years. Standardisation doesn't mean useful.
Hmm. I think you missed the point.
No, I didn't. The whole premise is contained in the title "Nobody asked for".
Nobody should read that AI slop article. Nobody.
Maybe there's an interesting story in there, it's certainly possible. But the "author" could not be bothered to write it, and so why should we waste our time reading it?
I read it and found it interesting
I love how people praise Claude for doing their work, every day on HN, while at the same time complaining about AI in articles.
Who says these are the same people?
Statistics.
Glad to see the classic goomba fallacy in action even here on HN.
I praise Claude and hate AI articles because I could've asked Claude to dumb down the debate if I wanted.
Articles should be high information density and summarizable with Claude.
Some would argue code should be the product of craftsmanship and vibe coding has no place in it.
I hate AI in code, I hate AI in articles, I hate when AI sticks to the bottom of my shoe.
No such case.
... because it makes some decent points?
Overly wordy and repetitive: it takes 3x the words a human would have used.
The article's point in a nutshell:
> The problem is that std::simd in 2026 is the 2012 solution arriving after the world moved on. The committee spent a decade polishing a library-based approach while compilers solved the easy cases automatically and ISPC solved the hard cases with language-level support.
I find it interesting that the C++ committee would make that kind of mistake. Shouldn't they know better?
> I find it interesting that the C++ committee would make that kind of mistake. Shouldn't they know better?
The main reason why people attend WG21 meetings is to get their pet features into the C++ language or the associated standard library. To some extent you can further that goal by shooting down other people's suggestions, especially if they would conflict ‡
C++ is a vast sprawling language. There are no genuine "C++ experts" for the same reason there aren't any people who know all of mathematics. There are a lot of people who are experts on some corner of the language or its libraries, and some who know a little bit about almost everything but no overarching experts.
‡ A good way to do this work would try to have such rivals all work together to improve the language, a sort of "yes, and" collaborative approach but although this has occasionally been able to work in C++ the whole WG21 structure works against it, in particular they vote to achieve consensus, which is not what the word "consensus" means and rewards appeasing haters much more than it does finding out what the problems are and working to fix them.
(having no first hand experience with WG21)
Many people think there are a lot of problems with C++ committee and the standards process. Some would claim that the ISO governance model doesn't work at all. There is a lot of drama that the outside world has no idea about, because much of the discussion is behind closed doors.
You can look it up (e.g. on blogs and /r/cpp), but I am not linking to anything because lots of content is very biased and hard to verify.
Something that is not controversial and worth a read is https://thephd.dev/embed-the-details and you can see the point.
I do feel the TC39 (the group behind the ECMAScript/JavaScript standard) seems more practical and effective. There are disagreements and dramas but not nearly as bad as with C++.
The best witness to the committee (I do not think it is that bad) is to check C++ pre 11 and C++26.
There will always be things people want, or complaints. But the list of useful features and fixes is very long.
But you always see the contentious topics at the top, shadowing a lot (most) of the work that is delivered.
The C++ standards committee is pretty damn dysfunctional at this point for a variety of reasons.
Only like 10% of the committee are actually responsible for an implementation in some manner; the vast majority are users, often looking to get their feature into the standard. This also means that only a tiny minority of the committee actually understands things like the difference between a prototype hack and a proper implementation. I get the sense that it's extremely bad on the library front--all of the standard library implementors I know are basically pleading "please stop adding new features, we want time to catch up."
One of the big issues with library features is that library vendors can't just copy-paste existing implementations for licensing reasons, so they have to reimplement it largely from scratch, and the people doing so may not necessarily be skilled in that particular domain. On top of that, standard libraries are much more sensitive to ABI breaks than other libraries are, so a bad design gets ossified to a much worse degree than regular libraries. The best examples of baked-in bad implementations are std::unordered_map and std::regex, but honestly even std::unique_ptr has similar ABI-unfixable issues (it's not a pointer for ABI calling conventions). Yet you still see people cheer on additions to the standard library because obviously those people are going to make existing implementations better.
sigh
C++ sits on that weird abstraction level where it wants to be a higher level language but keeps grinding its gears on stuff like pointer sizes, pointer arithmetic or vector sizes, and at the same time wants to stay C compatible and needs that interface with the lower level world
Now compare with how numpy does things: you care about the data size but not the implementation.
Still, I didn't expect less (of a crap fest) from the C++ committee as presented here
numpy is a python wrapper over a C library written by people who have ground those gears
Yes but not all of them
It would be easy to push complexity up at the level of Numpy/Pytorch/Tensorflow but it mostly gets hidden
(also a lot of it relies on LAPACK which is Fortran - which kinda works with SIMD better than C/C++)
Why is just writing inline assembly not enough?
You optimize for a specific target.
The problem is that you cannot be cross-platform. Sure.
But that is why software is incremental.
I write for my HW, not yours. You can write for yours.
Make folders with implementations
x86_v1 x86_v2 arm64 riscv64 ... ... ...
and include
sadly inline assembly is still at the ergonomics of "one compiler doesn't support it in x64 mode" and "you can choose between the readable syntax (which is a black box to the compiler) and the unreadable syntax (which can specify I/O/clobber regs)"
OK, is there a horrible speed penalty for writing your SIMD in pure assembly functions and then calling those functions? If you're writing assembly anyway, just drop the "inline" part.
Slop.
Why did you use an LLM to write this?