In case you are unfamiliar with Karpathy's Loop[1], it is a genetic algorithm[2] where the genetic "mutations" are clever-but-random ideas generated by an LLM agent, aimed at improving a system.
(1) Let the LLM randomly perturb the system.
(2) Measure the system's performance.
(3a) If the perturbation improved performance, keep the change.
(3b) Otherwise, don't.
(4) Repeat
[1] https://github.com/karpathy/autoresearch
[2] https://en.wikipedia.org/wiki/Genetic_algorithm
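In code, the loop above is roughly the following. This is a minimal sketch, not anyone's actual harness: `llm_propose_patch`, `apply_patch`, and `run_benchmark` are hypothetical placeholders for your agent call, your patch applier, and your verifier (which must return a single fitness number).

```python
import subprocess

def git(repo: str, *args: str) -> None:
    # Thin wrapper so the loop can commit or revert the working tree.
    subprocess.run(["git", *args], cwd=repo, check=True)

def optimize(repo: str, iterations: int = 50) -> float:
    best = run_benchmark(repo)            # (2) measure the baseline
    for _ in range(iterations):           # (4) repeat
        patch = llm_propose_patch(repo)   # (1) let the LLM perturb the system
        apply_patch(repo, patch)
        score = run_benchmark(repo)       # (2) measure the perturbed system
        if score > best:                  # (3a) improvement: keep the change
            best = score
            git(repo, "commit", "-am", f"keep: fitness {score:.4f}")
        else:                             # (3b) no improvement: revert
            git(repo, "checkout", "--", ".")
            git(repo, "clean", "-fd")     # also drop any new untracked files
    return best
```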
Wtf, this has a name now? I thought of this exact idea literally months ago but never had the time to do any experiments on it.
At the time I dismissed it as potentially being incredibly expensive for the improvement you do get, and because it runs into the typical pitfalls of evolutionary algorithms (in the same way evolution doesn't let an organism grow a wheel, your LLM evolution algorithm will never come up with something that requires a far bigger leap than what you allow the LLM to perturb in a single step. Also, the genetic algorithm will probably result in a vibecoded mess of short-sighted decisions, just like evolution creates a spaghetti genome in real life.)
I'll definitely need to look into how people have improved the idea and whether it is practical now.
This is not a new idea at all; many, many people have had it, and no one can really claim it.
Stigler's law of eponymy https://en.wikipedia.org/wiki/Stigler%27s_law_of_eponymy
Wikipedia has humor:
> The same observation had previously also been made by many others.
I genuinely laughed reading the first words. Yeah, it's hard to be novel.
Don’t worry, Twitter bros already coined it.
You know this doesn’t work most of the time…
Genetic algorithms have existed since the 60s/70s, e.g. computers learning to play a game. LLMs aren't particularly good at it.
I think hyperparameter tuning may actually be a kind of genetic algorithm.
Hyperparameter tuning could be done by genetic algorithm. I think it’s a bit of a category error to say that it is a genetic algorithm though.
Hyperparam tuning is usually done by Bayesian Optimization though.
Yeah that’s correct, it could use it, but there are better alternatives for this particular problem.
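For reference, a minimal sketch of what that usually looks like, using Optuna (whose default TPE sampler is a Bayesian-style optimizer); `train_and_score` is a hypothetical stand-in for whatever model you are actually tuning:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Each trial samples a candidate configuration from the search space.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 2, 12)
    return train_and_score(lr=lr, depth=depth)  # hypothetical; higher is better

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```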
I actually do it differently.
> (1) Let the LLM randomly perturb the system.
Instead of this, I ask the LLM what's least likely to improve performance, try that, and then measure it.
Sometimes big gains come from the places you thought were least likely.
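A sketch of that inversion, assuming a hypothetical `llm_rank_hypotheses` helper that returns candidate changes sorted from most to least likely to help:

```python
def propose_long_shot(repo: str) -> str:
    # Assumed helper: returns (hypothesis, estimated_gain) pairs,
    # sorted from most likely to least likely to improve performance.
    ranked = llm_rank_hypotheses(repo)
    hypothesis, _estimated_gain = ranked[-1]  # deliberately take the long shot
    return hypothesis
```

Dropping this in as the proposal step of the loop also works against hypothesis convergence, since the model stops resampling its favorite idea every round.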
For sure! The hypothesis generation has got to be improved. Your "least likely" take is interesting: early on in the repo I was having problems with "hypothesis convergence", and your idea may be a nice way to introduce the much-needed variability.
I was working on LLM-assisted optimization and algorithm discovery some time ago, and this does not look like a novel idea.
AlphaEvolve from Google is an evolutionary algorithm that uses LLMs for idea generation, following a very similar loop:
- https://deepmind.google/blog/alphaevolve-a-gemini-powered-co...
- Open source implementation of the algorithm: https://github.com/algorithmicsuperintelligence/openevolve
I mean, this is such low hanging fruit, you have to be careful not to step on it.
Just because it is a nice meme, I want to throw in Schmidhuber's work (do not treat this comment as serious unless you are Schmidhuber himself):
* Gödel Machine (2006-2007) [1]
* Optimal Ordered Problem Solver (2002) [2]
* Meta-Learning and Artificial Curiosity (1990s onward) [3]
[1] https://arxiv.org/html/2505.22954v3
[2] https://arxiv.org/abs/cs/0207097
[3] https://evolution.ml/pdf/schmidhuber.pdf
Edit: markdown formatting
Nice references! tks
It is not novel - but with the new models it is just becoming practical.
This is like Idiocracy for software devs at this point.
Is it? Evolution also seems to be the result of semi-random crap over the span of millennia, and nobody is critiquing it like that.
Why should throwing ideas at the wall to optimize code be any different, as long as you can measure and verify it, are okay with the added complexity, and are capable of making the code itself not be crap by the end of it?
If an approach is found that improves how well something works, you can even treat the AI slop as a draft and iterate upon it yourself further.
It's basically saying to randomly slop something and see if it gets better. Evolution has physical principles and guard rails backing it. Here there are no principles whatsoever, just slopping the slopper to see if it's somehow less sloppy than writing a gist with a slop machine.
I wouldn't call it Karpathy's loop; I'd call it slop descent. Or descent into slop. Or something like that.
Evolution very much involves random mutations that turn out useless or harmful and thus don't spread.
This is in fact less random than how genetic algorithms traditionally worked, which encoded behaviors in some data structure that then got randomly mutated or crossed with other candidates in the pool.
I am aware of what biological evolution is. This isn't analogous. I love my software friends, I'm a software person now too, but the level at which people take algorithms that involve any level of biomimicry as a model for actual biology is frustrating.
Is slop verifiable? If so, we can throw it in the loop... The point is that this loop can be pointed at any verifiable work. Yeah, you are seeing it raw; the verifier is the principle you talked about. Yes, it was fully AI-generated. It will be refined.
It does burn holes in one's brain, doesn't it... At least with the silly sorting algorithms we know they are supposed to be silly...
Thanks. I thought that, as a researcher, Karpathy would include and cite relevant papers; I quickly became disappointed. I already knew openevolve and the ACE Framework paper. This is the first time I've learned about genetic algorithms, and I now have a clear roadmap for studying.
A genetic algorithm keeps a population, and there is a "crossover" operation.
I see neither ingredient in Karpathy's proposed scheme.
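For contrast, a toy genetic algorithm in the textbook sense, over fixed-length bit strings: a population, crossover between two parents, and point mutation. (The genome encoding and rates here are illustrative assumptions.) Karpathy's loop keeps a single candidate and never crosses anything, which is the parent's point.

```python
import random

def evolve(population: list[str], fitness, generations: int = 100) -> str:
    # Toy GA: all genomes are equal-length bit strings like "0110".
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: len(ranked) // 2]            # selection
        children = []
        while len(survivors) + len(children) < len(population):
            a, b = random.sample(survivors, 2)            # two parents
            cut = random.randrange(1, len(a))             # crossover point
            child = a[:cut] + b[cut:]                     # crossover
            if random.random() < 0.1:                     # point mutation
                i = random.randrange(len(child))
                child = child[:i] + random.choice("01") + child[i + 1:]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

# e.g. evolve(["0000", "0110", "1010", "1111"], fitness=lambda g: g.count("1"))
```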
Lol, I respect Karpathy a lot, but this is such an obvious, in-your-face idea that it is laughable to put someone's name on it.
What’s next “karpathy investing” where ai in a loop builds a portfolio?
I'd go a step further and say that sort of loop is probably the first thing most people who play around with agent harnesses try, pretty much the first "Hmm, what should I do now?" thing that pops into people's head.
It's less the idea and more the simplicity of it. It's a distillation of something that works and lets newer practitioners get their feet wet before moving on to more complex implementation.
Actually, having a harness for it is nice though, vs. prompting it yourself iteratively.
Call it a K-loop, please. Where AI in a K-loop builds a portfolio.
I believe Karpathy himself called it autoresearch, not the Karpathy Loop, but in the naming vacuum around AI it seems to be very easy to meme-drop a name, and then come the influencer efforts to cool-name and normalize it. See vibecoding.
That's not a genetic algorithm, that's stochastic gradient descent.
To be a genetic algorithm it would need to have mutation (which you have here) and crossover (which you don't).
I agree it's not a genetic algorithm, but, it's also not stochastic gradient descent. There is no gradient. The "step direction" (code modification) is chosen by an LLM, which is "smart enough" to guess something that might be an improvement.
You're missing a step. The perturbations are not fully random. The LLM also looks at the result and tries to do credit assignment to determine what changes to try in the next round.
It is rather a variation of hill climbing [1]. As others have pointed out, evolutionary algorithms employ a richer set of search operators.
[1] https://en.wikipedia.org/wiki/Hill_climbing
Extremely interesting, but I don't understand why it was written by an LLM. Either the frontier models are far better than I realized, or else writing this document required a lot of manual work regardless, at which point why not keep it in your own voice?
> The agent did not know that would also halve the LUT count. It found out by doing it and watching the synthesizer.
So I guess this is an example of an LLM anthropomorphizing and making wild conjectures about the internal workings of a different LLM.
Yeah I find this current LLM voice very tiring to read; I get enough of it day-to-day wrangling claude and others. I don’t think ‘writing’ this took very much work though, it was probably a “read the research logs, and write a blog post with charts showing our amazing results and hammering on the idea that verifiers matter” as a prompt. The rest you could go have a coffee for.
That said, the core idea of this — verification matters a lot — is well received, and in fact, this is totally awesome in terms of results. They mention at the end they’re not sure how much of this is microtuned against the benchmark, a sin that many CPU companies cheerfully commit and have committed over the last 40 years btw, so I’d be interested in a followup with more general benchmarking. Either way, amazing.
Yeah, you are totally right. It's a work in progress, and the post was written by an LLM - I'm trying to improve on it (dash pun intended).
Regarding the benchmark overfitting: absolutely, it's pretty much overfitted. This CPU will only be as good as its benchmark. If I have the time I will try to get some applications and optimize for those.
I just wish everyone read Summa Technologiae by Stanislaw Lem. This was obviously covered back in 1964, with the relevant implications.
https://publicityreform.github.io/findbyimage/readings/lem.p...
What's your take on it? "If I read it, what should I pay attention to?" is what I'm trying to say, I guess.
Salient point on the value of the verifier. Matches my experience over the last two quarters.
Nice detail on the encountered failures. Very similar experiences with my own loops against testsuites.
Great post. A snapshot in time.
Tks! Did you apply it to hardware design or to another field?
I love genetic algorithms and find using LLMs as part of them super compelling. I always find the fitness functions to be the most difficult part; the algorithm naturally tries to exploit, by cheating, any little gap you leave it. The best part is not needing backpropagation to solve a problem. However, that is also the worst part, in that all the solutions are just one level above a random walk. The LLM augmentation really helps to give it a gradient and intelligence beyond random chaos. Love the idea of it being applied to hardware.
> I love genetic algorithms
Is this a "genetic algorithm" though? Besides "select the best performing run", it doesn't seem to have anything to do with crossovers, mutations, etc at all, just "select best", which makes it seem less of a genetic algorithm to me I guess. Might just be me being confused about what counts as an "genetic algorithm" vs not though, I won't claim to be an expert in the field exactly.
I think there's a lack of verbiage for "optimization that doesn't involve gradient descent".
I think there might be a conflation of "problems a genetic algorithm could be used for" vs. "using a genetic algorithm to solve the problem".
I'm interested in replicating the experiment, but I can't seem to find the actual hardware specifications. I assume this is something I could pick up from Digikey, Mouser, or Amazon.
I have variations on Karpathy's loop that I'd like to try out with real world hardware.
> propose, implement, measure, keep the wins
Pretty much what I did to let Codex with gpt5.4xhigh improve my fairly complex CUDA kernel, which resulted in a 20x throughput improvement.
Concretely, what interesting changes did it make to achieve such a significant improvement?
A lot of it was beyond me, but these are the branch names for all the stuff it tried, most of it unsuccessful of course. About 10x of the perf improvement came from architectural changes, and then 2x from micro-optimizations.
https://pastebin.com/eac0SAYg
Seems like this could be applied to many things. Database optimisation, etc.
Absolutely. Today it's FPGAs... tomorrow it can be whole companies.
Has anyone actually written a verifier for a business / project?
I'd say "a verifier" here is a loose term. A great testsuite is a verifier. I've done reverse-engineering projects that involved generating trace logs from the object under test, having a reimplementation emit the same logs, and running strict comparisons.
OP's post is basically pointing out what certainly many others have independently discovered: your agent-based dev operation is only as good as the test rituals and guard rails you give the agents.
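For anyone wanting the concrete shape of that trace-comparison verifier, a minimal sketch; the log paths and the normalization rule are illustrative assumptions, not the parent's actual setup:

```python
from pathlib import Path

def normalize(line: str) -> str:
    # Drop nondeterministic prefixes such as "[12:03:07.112] " so that
    # only the semantically meaningful part of each event is compared.
    return line.split("] ", 1)[-1].strip()

def traces_match(reference_log: Path, candidate_log: Path) -> bool:
    ref = [normalize(l) for l in reference_log.read_text().splitlines()]
    new = [normalize(l) for l in candidate_log.read_text().splitlines()]
    return ref == new  # strict comparison: any divergence fails the run
```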
Can you explain your question a little more? The recursive agents will find the minimum needed to satisfy the deterministic termination condition, including by cheating. In other words, it will be literally correct yet wrong. I would go so far as to call it malicious compliance.
I have a recursive agent that finds trading strategies after recreating academic research and probing the model's training on everything. It works really well, but I have to force it to write out every line and write a proof that no data from after the wall-clock decision time entered the system. Even then, some stupid thing like not converting a timezone with daylight savings will let it peek one hour into the future. These types of bugs are almost impossible to find. Now there needs to be another agent whose only purpose is to go over every line and explain that the timezone handling on that line is correct.
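A sketch of the kind of guard that catches that class of bug, assuming timestamped data arrives as `(ts, value)` pairs (an illustrative shape, not the commenter's system). The key points are refusing naive timestamps entirely and comparing everything in UTC, since a fixed-offset conversion is exactly what opens the one-hour DST window:

```python
from datetime import datetime, timezone

def assert_no_lookahead(rows, decision_time: datetime) -> None:
    # The decision time itself must be timezone-aware.
    if decision_time.tzinfo is None:
        raise ValueError("decision_time is naive; DST bugs hide here")
    cutoff = decision_time.astimezone(timezone.utc)
    for ts, _value in rows:
        if ts.tzinfo is None:
            raise ValueError(f"naive timestamp {ts}: cannot prove no lookahead")
        if ts.astimezone(timezone.utc) >= cutoff:
            raise ValueError(f"lookahead: {ts.isoformat()} is at/after decision time")
```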
It's tangential, but: I'm currently doing a rewrite of the backend of a project, and the verifier is basically the instruction "maintain v1 functionality as observed externally from the API side". This allows making a lot of tests based on existing data in the system and on how the frontend expects data.
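A minimal sketch of what such a verifier can look like, assuming both backends are reachable over HTTP during the migration (the endpoints and the use of `requests` are illustrative, not the actual project):

```python
import requests

def assert_v1_equivalent(paths: list[str], v1_base: str, v2_base: str) -> None:
    # Replay the same read-only calls against both backends and require
    # identical JSON, i.e. "maintain v1 functionality as observed from
    # the API side".
    for path in paths:
        old = requests.get(v1_base + path, timeout=10).json()
        new = requests.get(v2_base + path, timeout=10).json()
        assert old == new, f"v2 diverges from v1 on {path}"
```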
I used it (well, a skill based on the same idea) to optimise a prompt that does data extraction from UGC.
However, there isn't really a "correct" answer that's easy to define in code (I could manually label a training set, but wanted to avoid that), so I had the LLM just analyse the results itself and decide whether they were better. It wrote deterministic rules for a few things, but overall it just reviewed the results of each round and judged them.
Reviewing the before and after results, I would say yes, it's a big improvement in quality. It also optimised the prompt size to reduce input tokens by 25% and switched to a smaller/cheaper model.
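A sketch of that judge step, with a hypothetical `llm(prompt) -> str` callable standing in for whatever model API was used; with no labeled ground truth, the verifier is itself an LLM comparing old and new extractions side by side:

```python
import json

def new_prompt_wins(samples: list[str], old_prompt: str, new_prompt: str) -> bool:
    # Run both prompt versions over the same sample inputs.
    pairs = [
        {"input": s, "old": llm(old_prompt + s), "new": llm(new_prompt + s)}
        for s in samples
    ]
    # Ask the judge model for an overall verdict on the batch.
    verdict = llm(
        "Compare each old/new extraction for completeness and accuracy. "
        "Reply with exactly one word, OLD or NEW, for the overall winner.\n"
        + json.dumps(pairs)
    )
    return verdict.strip().upper().startswith("NEW")
```

As with any self-judged fitness, this leaves gaps the loop can exploit, which is why the manual spot-check of before/after results described above still matters.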
It's the OODA loop for LLMs writing code.
This is great! Exactly what I was looking for!
Awesome! Let me know if it works for your purpose. If not, raise an issue on GitHub and let's work together.
Is this related to autoresearch? https://github.com/karpathy/autoresearch
"Not at all, completely novel idea. What made you thought of such thing?" hahaha
> "If you can write the rules down, an agent will satisfy them faster than your team will."
A fantastic opportunity to become the next next big thing and write a verifier verifier.
At the hypothesized inflexion point where AI instantly performs exactly as commanded, what happens to heavily regulated industries like medical? Do we get huge leaps and bounds everywhere EXCEPT where it matters, or is regulation going to be handed over to a verifier verifier?
> performs exactly as commanded
The devil is in the details. There are an amazing number of details in a good [thing]. Someone somewhere has to say exactly what this [thing] being built actually is.
Read almost any story about wishes from a genie. Simple statements don't work.
I'm medical-adjacent. We're still bottlenecked on human review and ops. Not because of regulation; we just don't trust the inputs or the outputs enough yet. But also, a lot of the things we'd want an agent to have access to, we don't trust humans with either, often for stupid historical reasons.
> If you can write the rules down, an agent will satisfy them faster than your team will.
Big difference between a working model that needs to be optimized, vs nothing working at all.
> The frontier is the verifier.
Um, yes? The big value that AMD had in the x86 market over competitors was their verification model. This has been known for decades.
> 3-seed nextpnr P&R on a Gowin GW2A-LV18 (Tang Nano 20K) — median Fmax × CoreMark iter/cycle = fitness
Every single "improvement" is basically about routing around how absolutely abysmally bad the Gowin FPGAs are. Kudos to that, I guess?
Gowin FPGAs have extraordinarily bad carry chains and block-to-block routing. They are literally so bad that a 32-bit ripple carry is almost as fast as the carry-skip version even if you manually route it. The jump prediction is almost entirely about avoiding arithmetic computation (which most other FPGAs would have no problem with).
Memory accesses are super slow and locked to clock edges rather than being level-sensitive (which is why ID/RF and WB take entire cycles, and nothing the optimization could do could change that). The additions are all routing around that (note the immutability of the ID and WB phases).
To top it off, the 5-stage pipeline is an annoying quirk of the RISC-V architecture having an immediate value offset on its load instruction. If the RISC-V load mandated 0 as the offset, the MEM read phase could overlap the RX phase since no ALU would be necessary (Store doesn't care because the result goes to memory rather than back to the register file so RF writeback isn't an issue). The absolutely horrific add performance of the Gowin FPGAs makes this acute.
Finally, try to put this on a board. I found that anything above about 175MHz out of Nextpnr failed to execute on actual hardware (please correct me if this isn't valid. It's been over a year or more since I tried Nextpnr on the SiPeed Tang Primer 20K). That's simply right around where a 32-bit add plus some routing sits on these FPGAs. There's something a bit off in the timing analysis code for Nextpnr and the AI is almost certainly optimizing into it.
That having been said: I would LOVE somebody to bounce AI off of reversing the architecture and bitstreams for the stupid-ass closed-source FPGAs. Now THAT would be a project worth throwing a couple of grad students and a bunch of subsidized AI tokens at.
Amazing comment.
As a non-hardware guy, I read this as "well, duh, to a 20-year practitioner dealing with the intricacies of a specific FPGA series, all this makes tons of sense".
It only makes sense to me because I tried to implement a RISC-V on these Gowin FPGAs and banged into the limitations and can distill them down. A junior engineer looks at this post-AI, shrugs, and says "I'm done."
The AI doesn't flag "Hey, my adder sucks. Move to a better FPGA architecture." A junior engineer pre-AI would have to bang on this a while, get frustrated at the critical paths, and eventually ask for help. At which point we would both look at this, identify that the adder was doing a 32-bit ripple carry, both have a "WTF?!" moment, and switch FPGA families.
In addition, the AI also doesn't flag how close to the margin you are. To my eye, almost all the Fmax gains look like PnR (place and route) noise. The DIV/REM obviously isn't and the replay predictor looks real. To top it off, the branch predictor wins look anomalously low to my eye.
This is what a bunch of us are yelling about with AI. AI gets you a thing. AI gets you no insight into that thing. And because the juniors will use the AI, they will never learn the insight.
Side note: the granularity of the CM/MHz numbers looks a bit suspicious. Why are there identical entries?
"The frontier is the verifier" not just in the sense of this project, but for every project. If we have a good verifier for a task, any task, this type of loop can be applied to it. Today LLMs are good enough to tackle FPGA projects, but this type of loop will be applicable to many more things.
Board should be arriving next week. I will let you know!
"I would LOVE somebody to bounce AI off of reversing the architecture and bitstreams for the stupid-ass closed-source FPGAs."
The only reason I'm using Gowin is that it has slightly more mature open-source tooling. Maybe we can apply this loop to nextpnr also.
Please apply this loop to nextpnr for any of the commodity Xilinx, Altera, or Lattice parts. For example, everything about Lattice has been stuck for almost a decade at this point.
Assuming that your claims about Gowin FPGA flaws are correct, isn’t the point of this experiment that it was able to exploit these flaws without manual guidance?
His claims are indeed correct; yes, you got my point, tks! And the loop produced architecture gains that are not exclusive to the Gowin FPGA (CoreMark/MHz is higher than VexRiscv's).