> This release unifies those capabilities with a Mixture-of-Transformers (MoT) architecture built around two towers.
> Reasoner tower: A vision-language model (VLM) ... This serves as the ‘brain’ that reasons about the world before any generation happens.
> Generator tower: Generates future observations and action sequences. This tower uses a diffusion-based process to generate physics-aware video and action outputs that are conditioned on the reasoner tower’s understanding.
This sort of approach (and others I've seen like it) always appeals to my inner engineer: trying to optimize and balance tradeoffs between model architectures, combining two things to get the best of both worlds.
But based on my understanding of the Bitter Lesson (http://www.incompleteideas.net/IncIdeas/BitterLesson.html), this is precisely the wrong approach in the long term. I'm linking the actual text of the Bitter Lesson because I think it's misunderstood (or I just don't agree with how I've seen it used in discourse). Specifically:
> The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.
This architecture feels specifically like "trying to build knowledge into the agent that will help in the short term" but will plateau in the long term. That's not to say there won't be some interesting learnings or things built on top of it, but I doubt there's a lot of juice to squeeze from this kind of approach, IMO.
This feels like the opposite to me? The MoT architecture looks like the ideal that the Bitter Lesson alludes to - just take all of your data in all of your formats (audio, image, text, action, video) and dump it all into a single shared latent space. Then let the model sort things out, with just enough structure to handle the different requirements/output formats needed (e.g. autoregressive stuff for sequence modeling/prediction, diffusion stuff for generation).
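To make "a single shared latent space with just enough structure" concrete, here is a minimal toy sketch in plain Python. It is not Nvidia's implementation and says nothing about how Cosmos is actually built: the modality names, feature sizes, random weights, and every function name are made up for illustration. The only idea it shows is per-modality input/output projections wrapped around one attention step that is shared by all tokens.

```python
import math
import random

random.seed(0)
D = 4  # width of the shared latent space (arbitrary for this sketch)

def rand_matrix(rows, cols):
    """Random weights standing in for learned parameters."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Modality-specific parameters: each modality gets its own input projection
# and its own output head. The raw feature sizes here are hypothetical.
RAW_DIM = {"text": 3, "video": 6, "action": 2}
ENCODE = {m: rand_matrix(D, n) for m, n in RAW_DIM.items()}
DECODE = {m: rand_matrix(n, D) for m, n in RAW_DIM.items()}

def shared_attention(latents):
    """Shared part: every token attends to every token, whatever its modality."""
    mixed, weights = [], []
    for q in latents:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in latents]
        w = softmax(scores)
        weights.append(w)
        mixed.append([sum(wi * v[d] for wi, v in zip(w, latents)) for d in range(D)])
    return mixed, weights

def forward(tokens):
    """tokens: list of (modality, raw_vector) pairs in one mixed sequence."""
    latents = [matvec(ENCODE[m], x) for m, x in tokens]   # into the shared space
    mixed, weights = shared_attention(latents)            # modality-agnostic mixing
    outputs = [(m, matvec(DECODE[m], h)) for (m, _), h in zip(tokens, mixed)]
    return outputs, weights

tokens = [("text", [1.0, 0.0, 0.0]), ("video", [0.5] * 6), ("action", [0.1, -0.2])]
outputs, weights = forward(tokens)
print([(m, len(v)) for m, v in outputs])
```

Whether this counts as "building knowledge in" is exactly the disagreement in this thread: the only hand-written structure is which projection a token goes through, and everything that mixes information across modalities is shared.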
Except this model has a broader domain than text-LLM models. More than the old omni models too since it takes video input. The architecture is exotic but I don't see tuning here that is more extreme than open models released every day.
This is mostly a decompression step, and it’s fairly standard nowadays. The point is to get the data from the internal compressed representation into a human-usable form.
We can technically reason at pixel or char level encodings but it’s going to be much more expensive generally. Think of the overall technique as a way to get computer go faster.
You see it with Qwen talker, most multimodal projectors, etc.
SOTA open source model for image and video generation. Beats all others but, at 64B params, is too big to run on most people’s computers.
Still impressive given its artificially generated training sets.
Beats Nano Banana 1 but is not yet competitive with Nano Banana 2, Seedance 2, Grok Imagine, etc.
Great summary. I find image and video generation models are a more understandable reality check for how close local models are to frontier models.
It's sadly ironic I no longer even bother clicking on HN posts that are obvious product announcements from large corporations and instead just go to the replies. Corporate product announcements somehow fail to even clearly communicate the basic facts you did in your first nine words.
One nuance that's missing from your summary is that it's a world model specifically targeted at training robotic and autonomous vehicle AIs, so it's not really intended to be a direct competitor to Nano Banana or Seedance. While it can do straight image and video gen, its special sauce is providing more physics data and harnesses for AI training scenarios.
> Cosmos 3 Nano is the compact version with 16B parameters and optimized for efficient inference. It’s designed to run on workstation-grade compute, like the NVIDIA RTX PRO 6000 GPU for real-time robotics inference and physical AI applications.
Looking forward to trying this out on my $10,000+ workstation-grade GPU, which needs an equally expensive setup to run.
Good news, Nvidia will happily sell you one of their new RTX Spark laptops to run this.
I have the GPU but no robot. What’s the minimum viable robot needed to play with this?
Not at all an expert but I believe it's possible to get started experimenting with just a simulated robot in the simulated world model. While the full workflow is to generate training data to drive a real robot in the real world, without closing the loop, you're just lacking the ground truth data to quantify the divergence between simulation and reality.
There are all kinds of hobbyist robotic armatures at various price points but my understanding from a friend in this space is that the precision, durability and repeatability for serious applications starts at around $30,000 to $50,000. He mentioned the Franka Research 3 (FR3) as one example (https://franka.de/), perhaps driven by something like a Jetson AGX Thor ($5,000 and up).
As always, there are many less expensive and DIY-ish recipes for getting started on smaller budgets. My friend's suggestion was more the baseline experimental lab system for a big company wanting to get started with something that could, in theory, scale to light industrial internal deployment.
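On the "divergence between simulation and reality" point: a minimal sketch of what closing the loop buys you, assuming you can log the same trajectory (say, a joint angle over time) both in simulation and on a real arm. The function name is hypothetical and the numbers in the usage lines are invented purely to show the call; they are not measurements from any robot.

```python
import math

def sim_to_real_rmse(simulated, measured):
    """Root-mean-square error between a simulated and a measured trajectory.

    Both arguments are equal-length sequences of the same quantity sampled
    at the same time steps.
    """
    if len(simulated) != len(measured):
        raise ValueError("trajectories must have the same number of samples")
    if not simulated:
        raise ValueError("trajectories must not be empty")
    squared = [(s - m) ** 2 for s, m in zip(simulated, measured)]
    return math.sqrt(sum(squared) / len(squared))

# Invented values, for illustration only.
sim_angles = [0.00, 0.10, 0.20, 0.30]
real_angles = [0.00, 0.08, 0.21, 0.33]
print(sim_to_real_rmse(sim_angles, real_angles))
```

With a simulated robot only, the `measured` side simply doesn't exist, which is the limitation described above: you can still experiment, you just can't put a number on the gap.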
The warehouse safety video example is really funny, because the people don't react at all.
The car video is silly as well, the crossing van clearly runs a red light. The big shadow of the light pole in the intersection also makes no sense...
Cars run red lights in real life. Driving defensively requires anticipating it. Anyone expecting them not to is more likely to get in a crash.
The rest I can't speak to.
I feel like the car use case demonstrates that these models are not really useful at the cutting edge: they produce exactly the kind of in-domain data that already exists in droves. What is needed, and what Tesla collects, are the edge cases!
(Now for a startup with zero data, this is of course still useful)
The two-tower Mixture-of-Transformers design (autoregressive reasoner feeding a diffusion generator) is an interesting architectural bet.
I'm struggling to understand what this does.
> Generates future observations and action sequences.
Is that just a complicated way of saying video gen?
As I understand it, they mean both computer vision and video gen, linked by a pretty robust world model. One of their hosted examples purely analyses an existing video; the other predicts a video from a static image (i.e. video gen).
Look at the table of supported modalities. It can take in input of image/video/text/actions and output image/video/text/actions.
That just raises more questions. What kind of "observation or action" does an image input generate? What is an action output if it's not text?
You can fine-tune it so, given an image and a task description, it generates a corresponding set of actions.
It can be used to generate synthetic data to train physical AI for robots, cars, drones, etc. The world can be simulated from a first-person perspective to generate training data without sending robots to people's homes.
If I were to hallucinate what it is and why it's worded that way: the AI robot space needs a hyper-realistic game engine with better physics than Unity/Unreal-style non-deformable rigid body mechanics, one that's also way faster than 1x (completely unlike engineering FEM sims), and this caters to that need.
No, the "action" part is the distinction. Their world model is conditioned on robot actions, for example, which gives you two things video gen alone can't: predicting the future frames that follow a given action (change the action, get a different future from the same starting frame), and running it in reverse to infer the actions behind observed frames or to output the actions needed to hit a goal (the output is motor commands and not video frames).
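A toy sketch of those three uses, with a hand-written one-line `step` function standing in for the learned model. Everything here is hypothetical and has nothing to do with the real Cosmos API: the "frame" is just a 1-D cart position and an action is a velocity command of -1, 0, or +1.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical motor commands

def step(frame, action):
    """Stand-in for the world model: next frame given current frame and action."""
    return frame + action

def rollout(frame, actions):
    """Forward use: predict the future frames that follow an action sequence."""
    frames = [frame]
    for a in actions:
        frames.append(step(frames[-1], a))
    return frames

def infer_actions(frames):
    """Inverse use: recover the action behind each observed transition."""
    inferred = []
    for prev, nxt in zip(frames, frames[1:]):
        # Pick the action whose predicted next frame best matches what was seen.
        inferred.append(min(ACTIONS, key=lambda a: abs(step(prev, a) - nxt)))
    return inferred

def plan(frame, goal, horizon):
    """Goal-directed use: brute-force the action sequence ending nearest the goal."""
    return min(product(ACTIONS, repeat=horizon),
               key=lambda seq: abs(rollout(frame, seq)[-1] - goal))

print(rollout(0, [1, 1, 0]))        # same start, one action sequence
print(rollout(0, [-1, 1, 0]))       # same start, different actions, different future
print(infer_actions([0, 1, 2, 2]))  # actions recovered from observed frames
print(plan(0, 3, 3))                # output is actions, not frames
```

In the real thing the frames are video and the model is learned rather than written by hand, but the interface is the point: a plain video generator has no `action` argument to vary and no way to return one.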
Most of the examples they've chosen seem.. not good? What an odd mix of bad game engine and AI slop. I can't imagine that this stuff makes good training data for real-world applications.
These demos honestly look pretty good to me. But it is objectively true that this and similar technologies are used at huge scale by every leading autonomous vehicle manufacturer, so we can inductively reason that it _is_ good enough for that use-case. I don't work on Cosmos, but I am currently working on a superficially similar non-open technology at Nvidia used by many of these leaders which, in my opinion, produces similar quality. Some of the open research for it is here:
https://github.com/nv-tlabs/3dgrut/
https://github.com/NVIDIA/harmonizer
https://github.com/NVIDIA/instant-nurec
https://github.com/nvidia/ncore
Nvidia also is integrating Gsplat into at least what I work on and contributing upstream.
https://github.com/nerfstudio-project/gsplat
It is funny that after all their tech advancements, the site is struggling under heavy load.