lrvick 1 day ago

Just spent the last week or so porting TheRock to stagex in an effort to get ROCm built with a native musl/mimalloc toolchain and get it deterministic for high security/privacy workloads that cannot trust binaries only built with a single compiler.

It has been a bit of a nightmare; I had to package 30+ deps and their heavily customized LLVM, but I finally got the runtime to build this morning.

Things are looking bright for high-security workloads on AMD hardware, since they are working fully in the open, however much of a mess it may be.

  • jauntywundrkind 1 day ago

    https://github.com/ROCm/TheRock/issues/3477 makes me quite sad for a variety of reasons. It shouldn't be like this. This work should be usable.

    • lrvick 1 day ago

      Oh I fully abandoned TheRock in my stagex ROCm build stack. It is not worth salvaging, but it was an incredibly useful reference for me to rewrite it.

    • MrDrMcCoy 1 day ago

      So much about this confuses me. What do Kitty and ncurses have to do with ROCm? Why is this being built with GCC instead of clang? Why even bother building it yourself when the tarballs are so good and easy to work with?

      • CamouflagedKiwi 1 day ago

        On the last one: OP said they were trying to get it working for a musl toolchain, so the tarballs are probably not useful to them (I assume they're built for glibc).

        Agreed on the others though. Why's it even installing ncurses, surely that's just expected to be on the system?

        • fwip 1 day ago

          > Hey @rektide, @apaz-cli, we bundle all sysdeps to allow to ship self-contained packages that users can e.g. pip install. That's our basic default and it allows us to tightly control what we ship. For building, it should generally be possible to build without the bundled sysdeps in which case it is up to the user to make sure all dependencies are properly installed. As this is not our default we seem to have missed some corner cases and there is more work needed to get back to allow builds with sysdeps disabled. I started #3538 but it will need more work in some other components to fully get you what you're asking with regards to system dependencies. Please note that we do not test with the unbundled, system provided dependencies but of course we want to give the community the freedom to build it that way.

          • jauntywundrkind 1 day ago

            I did get past that issue with ncurses & kitty! Thanks for the work there!

            There is, however, quite a large list of other issues that have been blocking builds on systems with somewhat more modern toolchains/OSes than whatever the target is here (Ubuntu 24.04, I suspect). I really want to be able to engage directly with TheRock and compile and run it natively on Ubuntu 25.04, and now Ubuntu 26.04 too. People eager to use the amazing leading-edge capabilities TheRock offers will, I suspect, also tend to be bleeding-edge users with more up-to-date OS choices. They are currently very blocked.

            I know it's not the intent at all. There's so much good work here that seems so close and so well considered, an epic work spanning so many libraries and drivers. But this mega-thread of issues gives me such vibes of the bad awful no good Linux4Tegra, where it's really one bespoke, special Linux that has to be used and nothing else works. In this case you can download the tgz and it will probably work on your system, but that means you don't have any chance to improve, iterate on, or contribute to TheRock; it's a consume-only relationship, and that feels bad and is a dangerous spot to be in, not having usable source.

            I'd really, really like to see AMD run CI test matrices we can see, showing the state of the build on a variety of Linux OSes. That would provide the discipline and trust that keep situations like this one from arising. The current approach obviously cannot hold forever; Ubuntu 24.04 is not acceptable as a build machine in perpetuity, so these problems eventually have to be tackled. What really needs to happen is a commitment to avoiding a build that only works on one blessed image. This situation should not have developed; for TheRock to be accepted and useful, the build needs to work on a variety of systems. We need fixes right now to make that true, and AMD needs to show that its commitment to that goal is real, ideally by running a build matrix CI where we can see that it does compile.

            • fwip 1 day ago

              Sorry if it wasn't clear - I was just copy-pasting from the github issue, in a comment further down.

      • jeroenhd 1 day ago

        The analysis was AI generated. This was Claude brute-forcing itself through building a library.

  • WhyNotHugo 1 day ago

    I also attempted to package ROCM on musl. Specifically, packaging it for Alpine Linux.

    It truly is a nightmare to build the whole thing. I got past the custom LLVM fork and a dozen other packages, but eventually decided it had been too much of a time sink.

    I’m using llama.cpp with its Vulkan support and it’s good enough for my uses. Vulkan is already there and just works. It’s probably on your host too, since so many other things rely on it anyway.

    That said, I’d be curious to look at your build recipes. Maybe it can help power through the last bits of the Alpine port.

    • lrvick 1 day ago

      Keep an eye out for a stable rocm PR to stagex in the next week or so if all goes well.

    • sigmoid10 1 day ago

      Interesting how Vulkan and ROCm are roughly the same age (~9 years), but one is incredibly more stable (and sometimes even more performant) for AI use cases as a side gig, while the other has AI as its primary raison d'être. Tells you a lot about the development teams behind them.

    • icedchai 1 day ago

      I've built llama.cpp against both Vulkan and ROCm on a Strix Halo dev box. I agree Vulkan is good enough, at least for my hobbyist purposes. ROCm has improved but I would say not worth the administrative overhead.

    • seemaze 1 day ago

      I realize it does not address the OP's security concerns, but I'm having success running ROCm containers[0] on Alpine Linux specifically for llama.cpp. I also got vLLM to run in a ROCm container, but I didn't have time to diagnose perf problems, and llama.cpp is working well for my needs.

      [0] https://github.com/kyuz0/amd-strix-halo-toolboxes

      • WhyNotHugo 1 day ago

        FWIW, Alpine now has native packages for llama.cpp (using Vulkan).

  • 999900000999 1 day ago

    Wait?

    You don't trust Nvidia because the drivers are closed source?

    I think Nvidia's pledged to work on the open source drivers to bring them closer to the proprietary ones.

    I'm hoping Intel can catch up; at 32 GB of VRAM for around $1000, it's very accessible.

    • cmxch 1 day ago

      > Intel

      For some workloads, the Arc Pro B70 actually does reasonably well when cached.

      With some reasonable bring-up, it also seems to be more usable than the 32 GB R9700.

      • MrDrMcCoy 1 day ago

        I have both of those cards. Llama.cpp with SYCL has thus far refused to work for me, and Vulkan is pretty slow. Hoping that some fixes come down the pipe for SYCL, because I have plenty of power for local models (on paper).

        • cmxch 1 day ago

          Hmm.

          I had to rebuild llama.cpp from source with the SYCL and CPU specific backends.

          Started with a barebones Ubuntu Server 24 LTS install, used the HWE kernel, pulled in the Intel dependencies for hardware support/oneapi/libze, then built llama.cpp with the Intel compiler (icx?) for the SYCL and NATIVE backends (CPU specific support).

          In short, built it based mostly on the Intel instructions.
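          For reference, those steps roughly correspond to something like the sketch below (flag names follow llama.cpp's SYCL build documentation, and /opt/intel/oneapi is Intel's default install prefix; treat the details as assumptions to adapt, not a verified recipe):

          ```shell
          # The real commands are stored and printed rather than executed here,
          # because the actual build requires the Intel oneAPI toolchain:
          STEPS='source /opt/intel/oneapi/setvars.sh   # puts icx/icpx on PATH
          cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
          cmake --build build --config Release -j'
          printf '%s\n' "$STEPS"
          ```

          As far as I know the CPU backend builds by default, so the SYCL flags are the main addition.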

    • jeroenhd 1 day ago

      Nvidia is opening their source code because they moved most of their source code to the binary blob they're loading. That's why they never made an open source Nvidia driver for Pascal or earlier, where the hardware wasn't set up to use their giant binary blobs.

      It's like running Windows in a VM and calling it an open source Windows system. The bootstrapping code is all open, but the code that's actually being executed is hidden away.

      Intel has the same problem AMD has: everything is written for CUDA or other brand-specific APIs. Everything needs wrappers and workarounds to run before you can even start to compare performance.

      • Asmod4n 1 day ago

        In the Python ecosystem you can just replace CUDA with DirectML in at least one popular framework and it just runs. You are then limited to Windows, though.

      • ycombinator_acc 15 hours ago

        Huawei’s CANN is fully open and supposedly a drop-in replacement for CUDA. The latter could make it a superior option to either AMD or Intel.

  • salawat 1 day ago

    >Just spent the last week or so porting TheRock to stagex in an effort to get ROCm built with a native musl/mimalloc toolchain and get it deterministic for high security/privacy workloads that cannot trust binaries only built with a single compiler.

    ...I have a feeling you might not be at liberty to answer, but... Wat? The hell kind of "I must apparently resist Reflections on Trusting Trust" kind of workloads are you working on?

    And what do you mean "binaries only built using a single compiler"? Like, how would that even work? Compile the .o's with compiler-specific suffixes, then do a tortured linker invocation to mix different .o's into a combined library/ELF? Are we talking mixing two different C compilers? Same compiler, two different bootstraps? Regular/cross mix?

    I'm sorry if I'm pushing for too much detail, but as someone who's actually bootstrapped compilers/userspaces from source, your use case intrigues me just by the phrasing.

    • lrvick 1 day ago

      You can get a sense of what my team and I do from https://distrust.co/threatmodel.html

      For information on stagex and how we do signed deterministic compiles across independently operated hardware see https://stagex.tools

      Stagex is used by governments, fintech, blockchains, AI companies, and critical infrastructure all over the internet, so our threat model must assume at least one computer or maintainer is compromised at all times and not trust any third party compiled code in the entire supply chain.

      • salawat 22 hours ago

        Nice! I'd thought about doing something similar, but never went so far as to get where y'all are at! I got about as far as an LFS distro, and was in the process of picking apart GCC to see if I could get the thing verifiable. Can't say as I'm fond of the container-first architecture, but I understand why you did it, and my old-fartness aside, keep up the good work! Now I have another project to keep an eye on. And at least 4 other people besides me who take supply chain risk seriously! Yay!

        • lrvick 3 hours ago

          Container-first here is mostly about build sandboxing and a packaging format: we avoid re-inventing the wheel and use standards to achieve toolchain diversity and minimalism. Docker is the default because it is the most popular, but you can build with a shell script in a chroot without much work, and we want to have several paths to build.

          Also sxctl will download, verify, and install packages without a container runtime being installed at all.

  • zby 1 day ago

    It is sad to observe this time and time again. Last year I had the idea to run a shareholder campaign to change this; I suspended it after last year's AMD promises, but maybe this really needs to be done: https://unlockgpu.com/action-plan/

androiddrew 1 day ago

I have been trying since February to get someone at AMD to ship tuned Tensile kernels in the rocm-libs for the gfx1201. They are used by Ollama, but no one on the Developer Discord knows who is responsible for that. It has been pretty frustrating, and it shows that AMD has an organizational problem to overcome in addition to all the technical things they want ROCm to do.

0xbadcafebee 1 day ago

AMD has years of catching up to do with ROCm just to get their devices to work well. They don't support all their own graphics cards that can do AI, and when it is supported, it's buggy. The AMDGPU graphics driver for Linux has had continued instability since 6.6. I don't understand why they can't hire better software engineers.

  • onlyrealcuzzo 1 day ago

    Because they aren't willing to pay for them?

  • oofbey 1 day ago

    Years. They neglected ROCm for soooo long. I have friends who worked there 5+ years ago who tried desperately to convince execs to invest more in ROCm and failed. You had to have your head stuck pretty deep in the sand back then to not see that AI was becoming an important workload.

    I would love AMD to be competitive. The entire industry would be better off if NVIDIA was less dominant. But AMD did this to themselves. One hundred percent.

    • tux1968 1 day ago

      It would be very helpful to deeply understand the truth behind this management failing. The actual players involved, and their thinking. Was it truly a blind spot? Or was it mistaken priorities? I mean, this situation has been so obvious and tragic, that I can't help feeling like there is some unknown story-behind-the-story. We'll probably never really know, but if we could, I wouldn't spend quite as much time wearing a tinfoil hat.

      • throwawayrgb 1 day ago

        if you asked AMD execs they'd probably say they never had the money to build out a software team like NVIDIA's. that might only be part of the answer. the rest would be things like lack of vision, "can't turn a tanker on a dime", etc.

        • KeplerBoy 1 day ago

          I don't buy that story. NVIDIA wasn't that huge of a company when they built CUDA, they weren't huge when the first GPT model was trained with it.

          • Alupis 1 day ago

            CUDA was built during the time AMD was focusing every resource on becoming competitive in the CPU market again. Today they dominate the CPU industry - but CUDA was first to market and therefore there's a ton of inertia behind it. Even if ROCm gets very good, it'll still struggle to overcome the vast amount of support (read "moat") CUDA enjoys.

            • KeplerBoy 1 day ago

              True. After all, Nvidia didn't build TensorFlow or PyTorch. That stuff was bound to be built on the first somewhat viable platform. ROCm is probably far ahead of where CUDA was back then, but the goalposts have moved.

        • pjc50 1 day ago

          Has to be lack of vision. I refuse to believe it's impossible to _do_, but it sounds like it's impossible to _specify_ within AMD. Like they're genuinely incapable of working out what the solution might look like.

        • aurareturn 1 day ago

          They were doing stock buybacks before the AI boom.

        • imtringued 1 day ago

          Nobody is asking AMD to rebuild the entire NVidia ecosystem. Most people just want to run GPGPU code or ML code on AMD GPUs without the entire computer crashing on them.

          • throwawayrgb 1 day ago

            yeah it's a very frustrating situation.

            according to public information NVIDIA started working on CUDA in 2004; that was before AMD made the ATI acquisition.

            my suspicion is that back then ATI and NVIDIA had very different orientations. neither AMD nor ATI were ever really that serious about software. so in that sense i guess it was a match made in heaven.

            so you have a cultural problem, which is bad enough, then you add in the lean years AMD spent in survival mode. forget growing a software team; they had to cling to fewer people just to get through.

            now they're playing catch-up in a cutthroat market that's moving at light speed compared to 20 years ago.

            we're talking about a major fumble here, so it's easy to lose context and forget that things were a little more complex than they appeared.

      • oofbey 1 day ago

        My guess is it’s just incompetence. Imagine you’re in charge of ROCm and your boss asks you how it’s going. Do you say good things about your team and progress? Do you highlight the successes and say how you can do all the major things CUDA can? I think many people would. Or do you say to your boss “the project I’m in charge of is a total disaster and we are a joke in the industry”? That’s a hard thing to say.

        • throwawayrgb 1 day ago

          > My guess is it’s just incompetence.

          maybe on some level, but not the level you're describing. pretty much everyone at AMD understands the situation, and has for a while.

        • Shitty-kitty 1 day ago

          a 10-year lead can't be closed overnight, but Intel had an even larger lead and look how the mighty have fallen.

          • pjmlp 1 day ago

            Intel was never famous for good GPUs, and they are basically the only ones still trying to make something out of OpenCL, with most of the tooling going beyond what Khronos offers.

            oneAPI is much more than a plain old SYCL distribution, and still.

            • Shitty-kitty 1 day ago

              I meant their CPU supremacy. ;)

              • pjmlp 1 day ago

                That still reigns in PCs and servers.

                People like to talk about Apple CPUs, but keep forgetting they don't sell chips, and the overall desktop market is around 10% worldwide.

                ARM is mostly about phones and tablets; good luck finding those Windows-on-ARM or GNU/Linux desktop cases or laptops.

                Servers depend pretty much on which hyperscaler we are talking about.

                RISC-V is still to be seen, on the desktop, laptops and servers.

                Where AMD is doing great is game consoles.

                • cm2187 1 day ago

                  Intel still has 60% server market share but it is in free fall https://wccftech.com/intel-server-client-cpu-market-share-hu...

                  • pjmlp 1 day ago

                    Interesting information; that leaves the desktop and laptop markets, where AMD still has adoption issues, especially on laptops.

                    • wlesieutre 1 day ago

                      Between the MacBook Neo on the low end and Strix Halo on the high end, Intel is in for some tougher laptop competition.

                      • pjmlp 1 day ago

                        Outside US, and countries with similar salary levels, people don't earn enough for Apple tax served with 8 GB.

                  • wlesieutre 1 day ago

                    Also on pace to drop below AMD on the Steam hardware survey this year

                    • pjmlp 1 day ago

                      The same Steam hardware survey whose quality is questioned when we talk about Linux adoption numbers?

            • throwaway173738 1 day ago

              Try not to rely on Intel too much. They cut products with promise all the time because they miss quarterly numbers.

          • Alupis 1 day ago

            I'd argue Intel fell in large part because of its own complacency and incompetence. If Intel had taken AMD seriously, they'd probably still be a serious competitor today.

    • jijijijij 1 day ago

      Not even AI. My 5-year-old APU is completely neglected by AMD's ROCm efforts. So I also can't use it in Blender! I feel quite betrayed, to be honest. How is such a basic thing not possible, even years later?

      Look where Apple Silicon managed to go in the same time frame...

      Because of this, I won't consider another AMD GPU for a long time. Gaming isn't everything I want my GPU doing. How do they keep screwing this up? Why isn't it their top priority?

  • xethos 1 day ago

    > I don't understand why they can't hire better software engineers.

    Beyond the fact they're competing with the most valuable companies in the world for talent while being less than a decade past "Bet the company"-level financial distress?

    • shakow 1 day ago

      I don't think that you need top-of-the-line, $1M/yr TC people to revamp a build system.

      • mathisfun123 1 day ago

        lol the irony is that the person who started revamping the build system is a $1M/yr TC person.

        • klooney 1 day ago

          Sometimes the only way you can get basic engineering practices done like "have tests", "have a build system", "run the tests and the builds automatically", "insist that the above work" without management freaking out is to pay someone a lot of money.

  • prewett 1 day ago

    I figure it must be a cultural problem. ATI was known for buggy graphics drivers back in The Day, if I remember correctly. I certainly remember not buying their cards for that reason. Apparently after AMD bought them, they have been unable to change the culture (or didn't care). The state of ATI drivers has always been about the same.

    • philipallstar 1 day ago

      I don't think they invest nearly as large a percentage of their profits in software compared to Nvidia.

      • StillBored 1 day ago

        I don't even think that is the problem. It seems more like an engineering-culture one that has sadly infected most of the software industry at this point. Instead of incremental improvement, the old ATI drivers (and seemingly much of the recent history) are just rewrites, rather than a replaceable low-level core plus a reasonable amount of legacy code that gets forward-ported to newer hardware architectures. So they release the hardware, and it's basically obsolete before the driver stack ever stabilizes enough that any single driver can run a wide range of games well.

  • jrm4 1 day ago

    [flagged]

AshamedCaptain 1 day ago

> Last year, AMD ran a GitHub poll for ROCm complaints and received more than 1,000 responses. Many were around supporting older hardware, which is today supported either by AMD or by the community, and one year on, all 1,000 complaints have been addressed, Elangovan said.

Must have been by waiting for each of the 1000 complainers to die of old age, because I do not know what old hardware they have added support for.

  • throwaway173738 1 day ago

    I guess it counts if you can find the information from one of the many conflicting wikis out there and then figure out how to hack support for your card into the specific version of ROCm.

grokcodec 1 day ago

The day ROCm supports EVERY AMD card on release, just like CUDA does, is the day I will actually believe this marketing hype. They really dropped the ball here, and also when they abandoned then-recently released cards like the 400 series. Hopefully management gets their heads out of their butts and invests more in the software stack.

  • greenail 1 day ago

    I think GB10 is a bit of a counterpoint. There are tons of features that are not implemented for GB10, which was released in August 2025. It isn't all roses on the CUDA side.

rdevilla 1 day ago

ROCm is not supported on some very common consumer GPUs, e.g. the RX 580. Vulkan backends work just fine.

  • hurricanepootis 1 day ago

    RX 580 is a GCN 4 GPU. I'm pretty sure the bare minimum for ROCm is GCN 5 (Vega) and up.

    • daemonologist 1 day ago

      Among consumer cards, latest ROCm supports only RDNA 3 and RDNA 4 (RX 7000 and RX 9000 series). Most stuff will run on a slightly older version for now, so you can get away with RDNA 2 (6000 series).

      • hurricanepootis 1 day ago

        Huh, I just saw that. Huge bummer.

        I have a Radeon RX 6800 and on my system, I use ROCm's OpenCL for some stuff and HIP for blender cycles rendering. If ROCm were to drop support for my card, that'd be a huge bummer.

  • BobbyTables2 1 day ago

    Did it used to be different?

    A few years ago I thought I had used the ROCm drivers/libraries with hashcat on an RX 580.

    Now it’s obsolete?

  • maxloh 1 day ago

    I have the same experience with my RX 5700. The supported ROCm version is too old to get Ollama running.

    The Vulkan backend of Ollama works fine for me, but it took them a year or two to officially support it.

  • chao- 1 day ago

    I purchased my RX 580 in early 2018 and used it through late 2024.

    I am critical of AMD for not fully supporting all GPUs based on RDNA1 and RDNA2. While more backwards compatibility is always better for the consumer, the RX 580 was a lightly updated RX 480, which came out in 2016. Yes, ROCm technically came out in 2016 as well, but I don't mind acknowledging that supporting the GCN architecture is a different beast from the RDNA/CDNA generations that followed (Vega feels like it is off on an island of its own, and I don't even know what to say about it).

    As cool as it would be to repurpose my RX 580, I am not at all surprised that GCN GPUs are not supported for new library versions in 2026.

    I would be MUCH more annoyed if I had any RDNA1 GPU, or one of the poorly-supported RDNA2 GPUs.

  • daemonologist 1 day ago

    ROCm usually only supports two generations of consumer GPUs, and sometimes the latest generation is slow to gain support. Currently only RDNA 3 and RDNA 4 (RX 7000 and 9000) are supported: https://rocm.docs.amd.com/projects/install-on-linux/en/lates...

    It's not ideal. CUDA for comparison still supports Turing (two years older than RDNA 2) and if you drop down one version to CUDA 12 it has some support for Maxwell (~2014).

    • terribleperson 1 day ago

      It's pretty crazy that a 6900XT/6950XT aren't supported.

    • 0xbadcafebee 1 day ago

      Worse, RDNA3 and RDNA4 aren't fully supported, and probably won't be, as they only focus on chips that make them more money. If we didn't have Vulkan, every nerd in the world would demand either a Mac or an Intel with Nvidia chip. AMD keeps leaving money on the table.

      • lpcvoid 1 day ago

        Up until recently they didn't even support their cash cow, the Ryzen AI MAX+ 395, properly. Idk about the argument that they only care about certain chips.

    • kombine 1 day ago

      I have an RX 6700 XT, damn. AMD is shooting themselves in the foot.

      • bavell 1 day ago

        Try it before you give up, I got plenty of AI stuff working on my 6750XT years ago.

    • imtringued 1 day ago

      If you are on an unsupported AMD GPU, why would you ever consider switching to a newer AMD GPU, considering you know that it will reach the same sorry state as your current GPU?

      Especially when as you say, the latest generation is slow to gain support, while they are simultaneously dropping old generations, leaving you with a 1-2 year window of support.

  • pjmlp 1 day ago

    Vulkan backends work just fine, provided one wants to be constrained by the Vulkan developer experience, without first-class support for C++, Fortran and Python JIT kernels, IDE integration, graphical debugging, or libraries.

StillBored 1 day ago

I just wish they would make another pass at cleaning up the stack. It should be easy to `git clone --recurse-submodules rocm`, followed by a configure/make that both prints out missing dependencies and configures without them, along with a clear choice between 'build the world' and just building the lower-level OpenCL/HIP/SPIR-V tooling without all the libraries on top.

Right now the entire source base is literally "throw a bunch of crap under the ROCm brand and hope it builds together", with no overarching architecture. Presumably the entire spend is also tied to "whatever big co's evaluation needs this week" when it comes to developing with it.

mstaoru 1 day ago

I'm team "taking on CUDA with OpenVINO" (and SYCL*). Intel seems to have really upped their game on iGPU and dGPU lately, with sane prices and fairly good software support and APIs.

I'm not talking gaming CUDA, but CV and data science workloads seem to scale well on Arc and work well at the edge on Core Ultra 2/3.

adev_ 1 day ago

A little feedback to AMD executives about the current status of ROCm here:

(1) - Supporting only server-grade hardware and ignoring laptop/consumer-grade GPUs/APUs for ROCm was a terrible strategic mistake.

A lot of developers experiment first and foremost on their personal laptops and scale to expensive, professional-grade hardware later. In addition, some developers simply do not have the money to buy server-grade hardware.

By locking ROCm to server-grade GPUs only, you restrict the potential pool of contributors to your OSS ROCm ecosystem to a few large AI users and a few HPC centers... meaning virtually nobody.

A much more sensible strategy would be to provide degraded performance for ROCm on consumer GPUs, and this is exactly what Nvidia does with CUDA.

This is changing, but you need to send a clear message there: EVERY newly released device should be properly supported by ROCm.

(2) - Supporting only the last two generations of architecture is not what customers want to see.

https://rocm.docs.amd.com/projects/install-on-linux/en/docs-...

People with existing GPU codebases invest a significant amount of effort to support ROCm.

Telling them two years later, "Sorry, you are out of updates now!", while the ecosystem is still unstable, is unacceptable.

CUDA excels at backward compatibility. The fact that you ignore it entirely plays against you.

(3) - Focusing exclusively on Triton and making HIP a second-class citizen is nonsensical.

AI might get all the buzz and the money right now, we get it.

It might look sensible on the surface to focus on Python-based, AI-focused tools like Triton, and supporting them is definitely necessary.

But there is a tremendous amount of code relying on C and C++ to run on GPUs (HPC, simulation, scientific computing, imaging, ...), and it will remain there for decades to come.

Ignoring that is, again, losing customers to CUDA.

It is pretty ironic to see such a move considering that AMD GPUs currently tend to be highly competitive on FP64, which makes them good for exactly these kinds of applications. You are throwing away one of your own competitive advantages...

(4) - Last but not least: please focus a bit on the packaging of your software solution.

There have been complaints about this for the last 5 years and not much has changed.

Working with distribution packagers and integrating with them does not cost much... and it would currently give you a competitive advantage over Nvidia.

  • pjmlp 1 day ago

    Additional points: CUDA is polyglot, and some people do care about writing their kernels in something other than C++, C or Fortran, without going through code generation.

    NVidia is acknowledging Python adoption, with cuTile and MLIR support for Python, allowing the same flexibility as C++, using Python directly even for kernels.

    They seem to be supportive of having similar capabilities for Julia as well.

    Then there is the IDE and graphical debugger integration, and the library ecosystem, which now also has Python variants.

    As someone that only follows GPGPU on the side, due to my interests in graphics programming, it is hard to understand how AMD and Intel keep failing to understand what CUDA, the whole ecosystem, is actually about.

    Like, just take the schedule of a random GTC conference: how much of it can I reproduce on oneAPI or ROCm as of today?

  • Symmetry 1 day ago

    There actually isn't any locking involved. I can take a new version of ROCm and just use it with my 7900 XT, despite my card not being officially supported, and it works. It's just that AMD doesn't feel they need to invest the resources to run their test suite against my card and bless it as officially supported. And maybe if I were doing something other than running PyTorch I'd run into bugs. But it's laziness, not malice.

    • machomaster 1 day ago

      This is a very unprofessional attitude. There is no space for laziness in business.

    • hmry 1 day ago

      I used to be able to run ROCm on my officially unsupported 7840U. Bought the laptop assuming it would continue to work.

      Then in a random Linux kernel update they changed the GPU driver. Trying to run ROCm now hard-crashed the GPU requiring a restart. People in the community figured out which patch introduced the problem, but years later... Still no fix or revert. You know, because it's officially unsupported.

      So "Just use HSA_OVERRIDE_GFX_VERSION" is not a solution. You may buy hardware based on that today, and be left holding the bag tomorrow.
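      For anyone unfamiliar with the workaround being discussed, a minimal sketch (the ISA value here is just an example; it has to match a target your ROCm build actually ships kernels for):

      ```shell
      # HSA_OVERRIDE_GFX_VERSION makes the ROCm runtime treat your GPU as a
      # different ISA; 11.0.0 maps to gfx1100 (RDNA3). Pick the closest
      # officially supported target for your card.
      export HSA_OVERRIDE_GFX_VERSION=11.0.0

      # Any ROCm-backed process launched from this shell inherits it, e.g.:
      #   rocminfo | grep gfx     # check which ISA the runtime reports
      #   python3 run_llm.py      # hypothetical workload
      echo "override set to $HSA_OVERRIDE_GFX_VERSION"
      ```

      And as the parent comment says, this is best-effort only: a kernel or driver update can break it at any time.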

  • shawnz 1 day ago

    > Supporting only Server grade hardware and ignoring laptop/consumer grade GPU/APU for ROCm was a terrible strategical mistake. A lot of developers experiments first and foremost on their personal laptop first and scale on expensive, professional grade hardware later.

    NVIDIA is making the same mistake today by deprioritizing the release of consumer-grade GPUs with high VRAM in favour of focusing on server markets.

    They already have a huge moat, so it's not as crippling for them to do so, but I think it presents an interesting opportunity for AMD to pick up the slack.

  • 0xbadcafebee 1 day ago

    > Working with distributions packagers and integrating with them does not cost much... This would currently give you a competitive advantage over Nvidia..

    Packaging is actually a huge amount of effort if you try to package for all distros.

    So the common long-standing convention is to use a "vendored software" approach. You design everything to install into /opt/foo/, and you provide a simple install script to install everything, from one (or several) giant zips/tarballs. It's very old and dumb but it works quite well. Easy to support from company perspective, just run your dumb installer on a couple distros once in a while. Don't depend on distro-specific paths, use basic autodetection to locate and load libraries/dependencies.
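    A minimal sketch of that convention (all names here are placeholders, and it is demoed against a scratch root so it runs without privileges; a real vendor installer would target /opt directly):

    ```shell
    set -eu
    # Stand-in for the vendor's one giant tarball.
    ROOT="$(mktemp -d)"
    mkdir -p "$ROOT/pkg/bin" "$ROOT/pkg/lib"
    printf '#!/bin/sh\necho foo 1.0\n' > "$ROOT/pkg/bin/foo"
    chmod +x "$ROOT/pkg/bin/foo"
    tar -czf "$ROOT/foo.tar.gz" -C "$ROOT/pkg" .

    # The "dumb installer": unpack everything under one prefix
    # (/opt/foo in real life), then expose a single entry point.
    PREFIX="$ROOT/opt/foo"
    mkdir -p "$PREFIX" "$ROOT/usr/local/bin"
    tar -xzf "$ROOT/foo.tar.gz" -C "$PREFIX"
    ln -sf "$PREFIX/bin/foo" "$ROOT/usr/local/bin/foo"

    "$ROOT/usr/local/bin/foo"    # prints: foo 1.0
    ```

    No distro-specific paths anywhere; the same script runs on any distro that has tar and a shell.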

    Once you do that, it is actually easier for distros to package your software for you. They make one basic package that runs the installer, then they carve up the resulting files into sub-packages based on path. Then they just iterate on that over time as bugs come in (as users try to install just package X.a, which really needs files from X.b).

    But you need to hire people with expertise in the open source world to know all this, and most companies don't. Maybe there's just not a lot of us left out there. Or, more likely, they just don't understand that wider support + easier use = more adoption.

    • adev_ 15 hours ago

      > Packaging is actually a huge amount of effort if you try to package for all distros.

      That's the neat part: you do not have to package for all the distros.

      Just make your components easy to decouple with provided pkg-config (ideally) and a proper configuration mechanism.

      No bundles, no hidden downloads, no tangled vendored scripts.

      Then it is easy: you just provide packages for your main targets (typically Ubuntu and Red Hat) and the communities of the other distros will take care of the rest.
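      As a concrete illustration, "provided pkg-config" means shipping a small metadata file per component, something like this hypothetical hip.pc (paths and version are made up for the example):

      ```
      prefix=/opt/rocm
      libdir=${prefix}/lib
      includedir=${prefix}/include

      Name: hip
      Description: HIP runtime (illustrative metadata only)
      Version: 6.0.0
      Cflags: -I${includedir}
      Libs: -L${libdir} -lamdhip64
      ```

      A distro packager can then build against the component with `pkg-config --cflags --libs hip` instead of untangling bundled dependencies.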

bruce343434 1 day ago

In my experience fiddling with compute shaders a long time ago, CUDA, ROCm and OpenCL are way too much hassle to set up. It usually takes a few hours to get the toolkits and SDK up and running, and that's if you CAN get them up and running. The dependencies are way too big as well: CUDA is 11GB??? Either way, just use Vulkan. Vulkan "just works" and doesn't lock you into Nvidia/AMD.

  • cmovq 1 day ago

    Vulkan is a pain for different reasons. Easier to install sure, but you need a few hundred lines of code to set up shader compilation and resources, and you’ll need extensions to deal with GPU addresses like you can with CUDA.

  • Arech 1 day ago

    Haha. People have already said what Vulkan is in practice: a very convoluted low-level API in which you have to write 200+ lines of pretty complicated code just to get the simplest stuff running. Also, doing compute on NVIDIA in Vulkan is fun if you believe the specs word for word. If you don't, you switch a purely compute pipeline into a graphical mode with a window and a swapchain, and instantly get roughly +20% performance out of that. I don't know if this was a bug or intended behavior (to protect CUDA), but that's how it was a couple of years ago.

  • Almondsetat 1 day ago

    On Windows: download a 3GB exe and install

    On Linux: add repository and install cuda-toolkit

    Does that take a few hours?

p1esk 1 day ago

Someone from AMD posted this a few minutes ago, then deleted it:

"Anush's success is due to opting out of internal bureaucracy than anything else. most Claude use at AMD goes through internal infrastructure that can take hundreds of seconds per response due to throttling. Anush got us an exemption to use Anthropic directly. he is also exempt from normal policies on open source and so I can directly contribute to projects to add AMD support. He's an effective leader and has turned ROCm into a internal startup based in California. Definitely worth joining the team even if you've heard bad things about AMD as a whole."

This kind of bullshit is why I don't want to join AMD, even if this particular team is temporarily exempt from it.

  • nl 1 day ago

    > he is also exempt from normal policies on open source and so I can directly contribute to projects to add AMD support.

    It's crazy that this is a big deal.

    I understand the need for some kind of governance around this but for it to require a special exemption just shows how far the AMD culture needs to shift.

    • 0xbadcafebee 1 day ago

      Liability is always a big deal.

      • nl 1 day ago

        Sure, but it's not like other large companies don't have policies that address this.

  • noident 1 day ago

    Policies like these are widespread in most companies with >1000 employees

    • wongarsu 1 day ago

      And are a part of the reason people always ask "how is it that this company has >1000 employees and gets nothing done"

    • fg137 12 hours ago

      How does that work?

      To be able to use Claude directly, you need to handle separate accounts, billing, security etc.

      At the very least you'll need to set up SSO separately or ask people to log in with a one-time password?

      That's a lot to ask in a giant corporation. I mean, there are (good) reasons to use an internal gateway in the first place.

pjmlp 1 day ago

They need lots of things: hardware support, IDE and graphical debugging integrations, the polyglot ecosystem, a common bytecode used by several compiler backends (CUDA is not only C++), the libraries portfolio.

jmward01 1 day ago

I really want to get to the point where I am shopping online for a GPU and Nvidia isn't the requirement. I think we are really close. Maybe we are there already and my level of trust just needs to catch up.

  • m-schuetz 1 day ago

    Problem is, NVIDIA has so many quality of life features for developers. It's not easy getting especially smaller scale developers and academia to use other vendors that are 1) much more difficult to use while 2) also being slower and not as rich in features.

    Personally I opted in to being NVIDIA-vendor-locked a couple of years ago because I just couldn't stand the insanely bonkers and pointless complexity of APIs like Vulkan. I used OpenGL before which supported all vendors, but because newer features weren't added to OpenGL I eventually had to make the switch.

    I tried both Vulkan and CUDA, and after not getting shit done in Vulkan for a week I tried CUDA, and got the same stuff done in less than a day that I could not do in a whole week in Vulkan. At that moment I thought, screw it, I'm going to go NV-only now.

    • pjmlp 1 day ago

      I did my thesis porting my supervisor's project from NeXTSTEP to Windows, and was an OpenGL fanboy up to the whole Longs Peak disaster.

      Additionally, Vulkan has proven to be yet another extension mess (to the point that there are now efforts to steer it back on track). Khronos is like the C++ of API design, while expecting vendors to come up with the tools.

      However, as great as CUDA, Metal and DirectX are to play around with, we might be stuck with Khronos APIs if geopolitics keeps going as badly as it has been thus far, or worse.

suprjami 1 day ago

Just in time for Vulkan token generation (tg) to be faster in almost all situations, and Vulkan prompt processing (pp) to be faster in many situations with constant improvements on the way, making ROCm obsolete for inference.

  • kimixa 1 day ago

    ROCm vs Vulkan has never been about performance - you should be able to represent the "same" shader code in either, and often they back onto the same compilers and optimizers anyway. If one is faster, that often means something has gone /wrong/.

    The advantages for ROCm would be integration into existing codebases/engineer skillsets (e.g. porting an existing C++ implementation of something to the GPU with a few attributes and API calls rather than rewriting the core kernel in something like GLSL and all the management vulkan implies).

  • m-schuetz 1 day ago

    Vulkan has abysmal UX though. At one point I had to choose between Vulkan and CUDA for future projects, and I ended up with CUDA because a feasibility study I couldn't get to work in Vulkan for an entire week easily worked in CUDA in less than a day.

superkuh 1 day ago

AMD hasn't signaled in behavior or words that they're going to actually support ROCm on $specificdevice for more than 4-5 years after release. Sometimes it's as little as the high 3.x years for shrinks like the consumer AMD RX 580. And often the ROCm support for consumer devices isn't out until a year after release, further cutting into that window.

Meanwhile nvidia just dropped CUDA/driver support for 1xxx series cards from their most recent drivers this year.

For me ROCm's mayfly lifetime is a dealbreaker.

  • canpan 1 day ago

    I was thinking to get 2x r9700 for a home workstation (mostly inference). It is much cheaper than a similar nvidia build. But still not sure if good value or more trouble.

    • chao- 1 day ago

      Talking to friends who have fought more homelab battles than I ever will, my sense is that (1) AMD has done a better job with RDNA4 than the past generations, and (2) it seems very workload-dependent whether AMD consumer gear is "good value", "more trouble", or both at the same time.

      Edit: I misread the "2x r9700" as "2x rx9700", which differs from the topic of this comment (about RDNA4 consumer SKUs). I'll keep my comment up, but anyone looking to get Radeon PRO cards can (should?) disregard.

      • KennyBlanken 1 day ago

        Given RDNA3 was a pathetic joke, it wouldn't be hard for them to do a better job.

    • cyberax 1 day ago

      I have this setup, with 2x 32GB cards. It's perfect for my needs, and cheaper than anything comparable from NV.

    • stephlow 1 day ago

      I own a single R9700 for the same reason you mentioned, looking into getting a second one. Was a lot of fiddling to get working on arch but RDNA4 and ROCm have come a long way. Every once in a while arch package updates break things but that’s not exclusive to ROCm.

      LLMs run great on it; it's happily running gemma4 31b at the moment and I'm quite impressed. For the amount of VRAM you get it's hard to beat, apart from the Intel cards maybe. But the driver support doesn't seem to be that great there either.

      Had some trouble running comfyui, but it's not my main use case, so I have not spent a lot of time figuring that out yet.

      • canpan 1 day ago

        Thanks for the answer. Brings my hope up. Looking in my local shops, I can get 3 cards for the price of one 5090.

        May I ask, what kind of tok/s you are getting with the r9700? I assume you got it fully in vram?

        • jhgorrell 1 day ago

          Stock install, no tuning.

            $ uname -r
            6.8.0-107-generic
            $ ollama --version
            ollama version is 0.20.2
            $ ollama run "gemma4:31b" --verbose "write fizzbuzz in python."
            [...]
            total duration:       45.141599637s
            load duration:        143.633498ms
            prompt eval count:    21 token(s)
            prompt eval duration: 48.047609ms
            prompt eval rate:     437.07 tokens/s
            eval count:           1057 token(s)
            eval duration:        44.676612241s
            eval rate:            23.66 tokens/s
        • theoli 1 day ago

          I have a dual R9700 machine, with both cards on PCIe gen4 x8 slots. The 256bit GDDR6 memory bandwidth is the main limiting factor and makes dense models above 9b fairly slow.

          The model that is currently loaded full time for all workloads on this machine is Unsloth's Q3_K_M quant of Qwen 3.5 122b, which has 10b active parameters. With almost no context usage it will generate 59 tok/sec. At 10,000 input tokens it will prefill at about 1500 tok/sec and generate at 51 tok/sec. At 110,000 input tokens it will prefill at about 950 tok/sec and generate at 30 tok/sec.

          Smaller MoE models with 3b active will push 70 tok/sec at 10,000 context. Dense models like Qwen 3.5 27b and Devstral Small 2 at 24b will only generate at around 13 - 15 tok/sec with 10,000 context.

          This is all on llama.cpp with the Vulkan backend. I didn't get too far in testing or using anything that requires ROCm, because there is an outstanding ROCm bug where the GPU clock stays at 100% (drawing around 60 watts) even when the model is not processing anything. The issue is now closed, but multiple commenters indicate it is still a problem. Using the Vulkan backend, my per-card idle draw is between 1 and 2 watts with the display outputs shut down and no kernel frame buffer.

    • djsjajah 1 day ago

      I have 2 of them. I would advise against it if you want to run things like vLLM. I have had the cards for months and I still have not been able to create a uv env with trl and vllm. For vllm, it works fine in Docker for some models. With one GPU, gpt-oss 20b decodes at a cumulative 600-800 tps with 32 concurrent requests depending on context length, but I was getting trash performance out of qwen3.5 and Gemma4.

      If I were to do it again, I’d probably just get a dgx spark. I don’t think it’s been worth the hassle.

      • girvo 1 day ago

        FWIW I’m in love with my Asus GX10 and have been learning CUDA on it while playing with vllm and such. Qwen3.5 122B A10 at ~50tps is quite neat.

        But do beware, it’s weird hardware and not really Blackwell. We are only just starting to squeeze full performance out of SM12.1 lately!

  • hotstickyballs 1 day ago

    Driver support eats directly into driver development

  • lrvick 1 day ago

    ROCm is open source and TheRock is community maintained, and any minute now the first Linux distro will have native in-tree builds. It will be supported for the foreseeable future due to AMD's open development approach.

    It is Nvidia that has the track record of closed drivers and of insisting on doing all software dev in-house without community improvements, with the expected results.

    • KennyBlanken 1 day ago

      > expected results

      The defacto GPU compute platform? With the best featureset?

      • lrvick 1 day ago

        And the worst privacy, transparency, and FOSS integration due to their insistence on a heavily proprietary stack.

        Also pretty hard to beat a Strix Halo right now in TPS for the money and power consumption.

        Even that aside there exist plenty like me that demand high freedom and transparency and will pay double for it if we have to.

        • KennyBlanken 1 day ago

          > And the worst privacy, transparency, and FOSS integration due to their insistence on a heavily proprietary stack.

          The market doesn't care about any of that. The consumer market doesn't care, and the commercial market definitely does not. The consumer market wants the most Fortnite frames per second per dollar. The commercial market cares about how much compute they can do per watt, per slot.

          > there exist plenty like me that demand high freedom and transparency and will pay double for it if we have to.

          The four percent share of the datacenter market and five percent of the desktop GPU market say (very strongly) otherwise.

          I have a 100% AMD system in front of me so I'm hardly an NVIDIA fanboy, but you thinking you represent the market is pretty nuts.

          • lrvick 1 day ago

            I did not claim to represent the market as a whole, but I feel I likely represent a significant enough segment of it that AMD is going to be just fine.

            I think local power efficient LLMs are going to make those datacenter numbers less relevant in the long run.

  • mindcrime 1 day ago

    Last year, AMD ran a GitHub poll for ROCm complaints and received more than 1,000 responses. Many were around supporting older hardware, which is today supported either by AMD or by the community, and one year on, all 1,000 complaints have been addressed, Elangovan said. AMD has a team going through GitHub complaints, but Elangovan continues to encourage developers to reach out on X where he’s always happy to listen.

    Seems like they're making some effort in that direction at least. If you have specific concerns, maybe try hitting up Anush Elangovan on Twitter?

    • djsjajah 1 day ago

      > or by the community

      Hmmm

  • SwellJoe 1 day ago

    Is it really that short? This support matrix shows ROCm 7.2.1 supporting quite old generations of GPUs, going back at least five or six years. I consider longevity important, too, but if they're actively supporting stuff released in 2020 (CDNA), I can't fault them too much. With open drivers on Linux, where all the real AI work is happening, I feel like this is a better longevity story than nvidia...where you're dependent on nvidia for kernel drivers in addition to CUDA.

    https://rocm.docs.amd.com/en/latest/compatibility/compatibil...

    • Karliss 1 day ago

      You missed the note at the top: "GPUs listed in the following table support compute workloads (no display information or graphics)". It doesn't mean that all CDNA or RDNA2 cards are supported. That table is very misleading; it covers enterprise compute cards only, the AMD Instinct and AMD Radeon Pro series. For actual consumer GPUs the list is much worse: https://rocm.docs.amd.com/projects/radeon-ryzen/en/latest/in... , more or less the 9000 series and select 7000 series. Not even all of the 7000 series.

      • SwellJoe 1 day ago

        I think that speaks to them not understanding at the time the opportunity they were missing out on by not shipping a CUDA-like thing to everyone, including consumer tech. The question is what'll it look like in a few years now that they do understand AI is the biggest part of the GPU industry.

        I suspect, given AMD's relative openness vs. nvidia, even consumer-level stuff released today will end up with a longer useful life than current nvidia stuff.

        I could be wrong, of course. I've taken the gamble...the last nvidia GPU I bought was a 3070 several years ago. Everything recent has been AMD. It's half the price for nearly competitive performance and VRAM. If that bet turns out wrong, I'll just upgrade a little sooner and still probably end up ahead. But, I think/hope openness will win.

        Also, nvidia graphics drivers on Linux are a pain in the ass that I didn't want to keep dealing with. I decided it wasn't worth the hassle, even if they're better on some metrics. I've been able to run everything I've tried on an AMD Strix Halo and an old Radeon Pro V620 (not great, but cheap, compared to other 32GB GPUs and still supported by current ROCm).

  • Shitty-kitty 1 day ago

    The split CDNA/RDNA architecture is a problem for AMD. The upcoming unified UDNA architecture will solve the issue.

taherchhabra 1 day ago

Genuine question: after Claude Code, Codex, etc., can't this be sped up?

  • Gasp0de 1 day ago

    I believe this is what that teamlead in the article comments on as next steps?

roenxi 1 day ago

> Challenger AMD’s ability to take data center GPU share from market leader Nvidia will certainly depend on the success or failure of its AI software stack, ROCm.

I don't think this is true. CUDA is a huge advantage for Nvidia, but as far as I can tell it is more a set of R&D libraries than anything else, so all the Hot New Stuff keeps being Nvidia-first and Nvidia-only (to start with) because the library ecosystem for the hotness doesn't exist yet. Then eventually new libraries are created that are CUDA-independent, and AMD turns out to make pretty good graphics cards.

I wouldn't be surprised if ROCm withered on the vine and AMD still did fine.

hurricanepootis 1 day ago

I've been using ROCm on my Radeon RX 6800 and my Ryzen AI 7 350 systems. I've only used it for GPU-accelerated rendering in Cycles, but I am glad that AMD has an option that isn't OpenCL now.

nullpoint420 1 day ago

I just don't understand how they haven't figured this out yet. I genuinely want to know the corporate structure and politics that have led to their inability to execute.

Is it leadership? Something else?

ycui1986 1 day ago

For many LLM loads, it seems ROCm is slower than Vulkan. What's the point?

  • mmis1000 1 day ago

    Compatibility, so that foundation packages like torch and onnx-runtime can run on AMD GPUs without massive changes in architecture. That's the biggest reason so much stuff "only works on nvidia gpu". It's not faster where a Vulkan alternative exists, but at least it runs.
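    A sketch of what that looks like in practice (the index URL follows PyTorch's published pattern for ROCm wheels; treat the exact version segment as an example and match it to your installed ROCm release):

    ```shell
    # Select a ROCm build of torch instead of the default CUDA build.
    ROCM_WHEEL_INDEX="https://download.pytorch.org/whl/rocm6.2"
    echo "pip install torch --index-url $ROCM_WHEEL_INDEX"

    # Once installed, the CUDA-named torch API runs unchanged on AMD GPUs:
    #   python3 -c 'import torch; print(torch.cuda.is_available())'
    ```

    That is the whole point: existing code keeps calling torch.cuda and ROCm handles it underneath.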

naasking 1 day ago

ROCm is so annoying (buggy, fiddly dependencies, limited hardware support) that TinyGrad built its own compiler and toolchain that targets the hardware directly. And it has broader device support than ROCm, which primarily seems focused on their datacenter GPUs.

  • ethan_smith 1 day ago

    The TinyGrad approach of going straight to the hardware is telling. Between that, Vulkan compute getting faster for inference (llama.cpp Vulkan backend is competitive now), and SYCL/oneAPI, it feels like the real threat to CUDA might not be ROCm at all but a fragmented set of alternatives that each bypass AMD's broken software stack entirely.

blovescoffee 1 day ago

Naive question, could agents help speed up building code for ROCm parity with CUDA? Outside of code, what are the bottlenecks for reaching parity?

  • WorldPeas 1 day ago

    to be honest, outside of fullstack and basic MCU stuff, these agents aren't very good. Whenever a sufficiently interesting new model comes out I test it on a couple problems for android app development and OS porting for novel cpu targets and we still haven't gotten there yet. I'd be happy to see a day where it was possible however

    • catgary 1 day ago

      I’ve found they’re quite good when you’re higher in the compiler stack, where it’s essentially a game of translating MLIR dialects.

      • WorldPeas 1 day ago

        it'd be nice if one of these environment labs made an environment for cross-architecture porting, it'd be really cool to see some old ppc mac programs running natively, or compiled to wasm (yes, yes I know the visual elements would need to be ported as well)

  • hypercube33 1 day ago

    Maybe this is dumb, but at the moment through Windows (and WSL?) you get: ROCm, DirectML, Vulkan, OpenML?

  • m-schuetz 1 day ago

    Agents work great for tasks that thousands of developers have done before. This isn't one of those tasks.

    • WithinReason 1 day ago

      Unless you train them with RL in the right task specifically

amelius 1 day ago

How long until we can use AI to simply translate all the CUDA stuff to another (more open) platform? I'm getting the feeling we're getting close.

AI won't be working in nVidia's favor this time.

DeathArrow 1 day ago

Do we get better perf or tokens per second with AMD and its software stack than with Nvidia?

  • wongarsu 1 day ago

    The metric where AMD usually comes out on top is perf/$. Or with their instinct cards VRAM/$

formerly_proven 1 day ago

We’ve been talking about this for a good ten years at least and AMD is still essentially in the “concepts of a plan” phase. The AMD GPGPU software org has to be one of the most inconsequential ones at this rate.

  • mmis1000 1 day ago

    At least they finally did something this time. Now torch and whatever transformer stuff runs normally on Windows/Linux, as long as you install the correct wheel from AMD's own repository.

    It's a huge step, though.

shmerl 1 day ago

Side question, but why not advance something like Rust GPU instead as a general approach to GPU programming? https://github.com/Rust-GPU/rust-gpu/

From all the existing examples, it really looks the most interesting.

I.e. what I'm surprised about is lack of backing for it from someone like AMD. It doesn't have to immediately replace ROCm, but AMD would benefit from it advancing and replacing the likes of CUDA.

  • MobiusHorizons 1 day ago

    From the readme:

    > Note: This project is still heavily in development and is at an early stage.

    > Compiling and running simple shaders works, and a significant portion of the core library also compiles.

    > However, many things aren't implemented yet. That means that while being technically usable, this project is not yet production-ready.

    Also, projects like Rust GPU are built on top of projects like CUDA and ROCm; they aren't alternatives, they are abstractions on top.

    • shmerl 1 day ago

      I think Rust GPU is built on top of Vulkan + SPIR-V as their main foundation, not on top of CUDA or ROCm.

      What I meant more is the language of writing GPU programs themselves, not necessarily the machinery right below it. Vulkan is good to advance for that.

      I.e. CUDA and ROCm focus on C++ dialect as GPU language. Rust GPU does that with Rust and also relies on Vulkan without tying it to any specific GPU type.

      • markisus 1 day ago

        The article mentions Triton for this purpose. I don’t think you will get maxed out performance on the hardware though because abstraction layers won’t let you access the fastest possible path.

        • shmerl 1 day ago

          > I don’t think you will get maxed out performance on the hardware though because abstraction layers won’t let you access the fastest possible path.

          You could argue about CPU architectures the same, no? Yet compilers solve this pretty well most of the time.

          • fc417fc802 1 day ago

            Sort of not really. Compilers are fantastic for the typical stuff and that includes the compilers in the CUDA/ROCm/Vulkan/etc stacks. But on the CPU for the rare critical bits where you care about every last cycle or other inane details for whatever reason you're often all but forced to fall back on intrinsics and microarch specific code paths.

            • shmerl 1 day ago

              Yeah, that's why I said most of the time. Sometimes even for CPUs things need assembly. But nothing stops you from using GPU assembly either, when needed, I suppose? It just probably shouldn't be the default approach.

  • HarHarVeryFunny 1 day ago

    If you don't want/need to program at lowest level possible, then Pytorch seems the obvious option for AMD support, or maybe Mojo. The Triton compiler would be another option for kernel writing.

    • shmerl 1 day ago

      I don't think that's something that can be pitched as a CUDA alternative. Just different level.

      • HarHarVeryFunny 7 hours ago

        Triton, while a compiler, generates code at a lower level than CUDA or ROCm.

        The machine code that actually runs on NVidia and AMD GPUs respectively are SASS and AMDGCN, and in each case there is also an intermediate level of representation:

        CUDA -> PTX -> SASS

        ROCm -> LLVM-IR -> AMDGCN

        The Triton compiler isn't generating CUDA or ROCm code: it generates its own generic MLIR intermediate representation, which then gets converted into PTX or LLVM-IR, with vendor-specific tools doing the final step.

        If you are interested in efficiency and wanted to write high level code, then you might be using Pytorch's torch.compile, which then generates Triton kernels, etc.

        If you really want to squeeze the highest performance out of an NVIDIA GPU then you would write in PTX assembler, not CUDA, and for AMD in GCN assembler.

  • LegNeato 1 day ago

    One of the rust-gpu maintainers here. Haven't officially heard from anyone at AMD but we've had chats with many others. Happy to talk with whomever! I would imagine AMD is focusing on ROCm over Vulkan for compute right now as their pure datacenter play, which makes sense.

    We've started a company around Rust on the GPU btw (https://www.vectorware.com/), both CUDA and Vulkan (and ROCm eventually I guess?).

    Note that most platform developers in the GPU space are C++ folks (lots of LLVM!) and there isn't as much demand from customers for Rust on the GPU vs something like Python or Typescript. So Rust naturally gets less attention and is lower on the list...for now.

    • shmerl 1 day ago

      I see, thanks. Would be good if Vulkan was pushed more as an approach for this since others are GPU specific.

  • pjmlp 1 day ago

    Because the people that care want C++, Fortran, Python and Julia, which already enjoy a rich ecosystem.

neuroelectron 1 day ago

Now that the AI bubble is starting to burst, it's a great time for AMD to reveal their AI ambitions. They've set the tone by hiring low cost, outsourced labor.

Of course everybody knows what's really going on here. It's not an open discussion, however.

alecco 1 day ago

Apple got it right with unified memory with wide bus. That's why Mac Minis are flying for local models. But they are 10x less powerful in AI TOPS. And you can't upgrade the memory.

I really wish AMD and Intel boards get replaced by competent people. They could do it in very short time. Both have integrated GPUs with main memory. AMD and Intel have (or at least used to have) serious know-how in data buses and interconnects, respectively. But I don't see any of that happening.

ROCm? It can't even support decent Attention. It lacks a lot of features and NVIDIA is adding more each year. Soon they will reach escape velocity and nobody will catch them for a decade. smh

  • caycep 1 day ago

    Granted, I feel like NVIDIA GPU pricing is such that Mac minis will be way less than 10x cheaper if not already, so one might still get ahead purchasing a bulk order of Mac minis....

    • KennyBlanken 1 day ago

      A 5090 will cost you about the same amount of money as a Mac Studio M3 Ultra with eight times the RAM.

      It's pretty insane how overpriced NVIDIA hardware is.

      • corndoge 1 day ago

        But the 5090 can run Crysis

      • LoganDark 1 day ago

        Yes but the 5090 can run games.

        Running games on my loaded M4 Max is worse than on my 3090 despite the over-four-year generational gap.

        Like, Pacific Drive will reach maybe 30fps at less than 1080p whereas the 3090 will run it better even in 4K.

        That could just be CrossOver's issue with Unreal Engine games, but "just play different games" is not a solution I like.

      • kimixa 1 day ago

        The 256GB Mac Studio (the one with "eight times the RAM") is listed for ~$2000 more than the current 5090 prices, and another additional $1500 for the 80-core GPU variant. Only the "base" model with 96gb is a remotely similar price, $3600-$4000.

        And a 5090 has a little over 2x the memory bandwidth: ~1790GB/s vs ~820GB/s for the Mac. And significantly higher peak FLOPS on the 5090 too.

        Sure, if the goal is to get the "Cheapest single-device system with 256GB ram" it looks pretty good, but there's lots of other axes it falls down on. Great if you know you don't care about them, but not "Better In Every Way". Arguably, better in only a single way - but that single way may well be the one you need.

        And the current 5090 price might be a transient peak - only three months ago they were closer to $2500, significantly less than half the $6000 base-spec 256GB Mac Studio - while the Mac Studio's price has held constant.

      • cjbgkagh 1 day ago

        It seems like general improvements in RAM efficiency, such as those used in Gemma 4, mean it's back to memory bandwidth as the bottleneck, and less about total available memory size. I'm also curious to see how much more agent autonomy will reduce the need for low latency and shift the focus toward throughput. That would make it easier to spread the model out over multiple smaller GPUs and use pipeline parallelism to keep them busy. It would also mean using RAM capacity for market discrimination becomes less effective.
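        The throughput argument above can be made concrete with a back-of-the-envelope sketch (illustrative only, not tied to any vendor's stack): in a GPipe-style pipeline with the model split across S stages and the batch split into M micro-batches, a fill-and-drain schedule takes S + M - 1 steps instead of S * M fully serialized steps, so per-stage utilization approaches 100% as M grows - which is why throughput-oriented workloads tolerate smaller GPUs well.

        ```python
        def pipeline_steps(stages: int, micro_batches: int) -> int:
            """Time steps for a GPipe-style fill-and-drain pipeline schedule."""
            return stages + micro_batches - 1

        def stage_utilization(stages: int, micro_batches: int) -> float:
            """Fraction of steps each stage is busy (each stage processes every micro-batch once)."""
            return micro_batches / pipeline_steps(stages, micro_batches)

        if __name__ == "__main__":
            # With 4 pipeline stages: 1 micro-batch leaves stages idle 75% of
            # the time; 32 micro-batches pushes utilization above 90%.
            for m in (1, 4, 32):
                print(f"4 stages, {m:>2} micro-batches: "
                      f"{pipeline_steps(4, m)} steps, "
                      f"utilization {stage_utilization(4, m):.2f}")
        ```

        Latency per batch still grows with the pipeline depth, which is why this trade-off only pays off once you stop caring about single-request latency.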

  • bsder 1 day ago

    > I really wish AMD and Intel boards get replaced by competent people.

    Intel? Agreed. But AMD is making money hand over fist with enterprise AI stuff.

    Right now, any effort that AMD or NVIDIA expend on the consumer sector is a waste of money that they could be spending making 10x more at the enterprise level on AI.

  • KeplerBoy 1 day ago

    Aren't Mac minis flying off the shelves for "local models" because people have no clue what they are doing?

    All those people who bought them for openclaw just bought them because it was the trendy thing to do. None of those people are running local models on there.

  • pjmlp 1 day ago

    They aren't flying off the shelves outside the US, or countries with similar salary levels.

nnevatie 1 day ago

Why is it called "ROCm" (with the strange capitalization) in the first place? This may sound silly, but in order to compete, every detail matters, including the name.

  • WanderPanda 1 day ago

    This is so true! It shows a lack of care that usually doesn't stop at just the naming.

  • slongfield 1 day ago

    It used to stand for "[R]adeon [O]pen [C]o[m]pute", but since it's not affiliated with the Open Compute Project, they dropped the meaning of it a little while ago, and now it doesn't stand for anything.

  • dnautics 1 day ago

    presumably a reference to rocm/socm robots?