points by ekjhgkejhgk 9 hours ago

Methodological flaw.

In Centaur (hybrid LLM + classic HPO), the LLM is only consulted a fraction r=0.3 of the time (the rest is plain HPO). But that means:

A) The compute used by Centaur is not directly comparable to the compute of the other methods. Centaur had the advantage that r was itself hyperparameter-optimized, at a cost that is not budgeted in the main graph. In effect, Centaur got free compute under the table.

B) It's not even clear that the advantage of choosing r=0.3 is real. If you look at Figure 11, the variation between 0.1 and 0.5 could well be noise. And if you treat it as noise and fit a line or a parabola to smooth it out, you'd conclude that the optimum is to not use an LLM at all, so it's not clear that the LLM's contribution is even positive.

C) Another reason why the LLM's contribution doesn't look positive: again in Figure 11, how do you explain that r=0.8 is horrible? If the LLM is principled in some way, if it can reason through "I see such and such, therefore I try such and such", then asking it more often should let it experiment more and exclude bad regions faster. And if it has nothing to add, it could just accept "I'll use the optimizer's suggestion this time" over and over. So the hybrid should always be at least as good as classic alone, but in reality it gets worse the larger r is.
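
To make the smoothing check in B concrete, here is a rough Python sketch. The numbers are made up for illustration and are NOT read off Figure 11. I'm also assuming the y-axis is a loss (lower is better), and `smoothed_optimum` is just a name I picked. It fits a low-degree polynomial to (r, loss) and reports which r in [0, 1] the smoothed curve prefers.

```python
import numpy as np

def smoothed_optimum(r, loss, degree=2):
    """Least-squares polynomial fit of loss vs. r.

    Returns the fitted coefficients and the r in [0, 1] that
    minimizes the fitted curve (evaluated on a 0.01 grid).
    """
    coeffs = np.polyfit(r, loss, degree)
    grid = np.linspace(0.0, 1.0, 101)
    fitted = np.polyval(coeffs, grid)
    return coeffs, float(grid[np.argmin(fitted)])

# Hypothetical values only, chosen to mimic "noisy at low r, bad at high r".
r = np.array([0.0, 0.1, 0.3, 0.5, 0.8, 1.0])
loss = np.array([1.00, 1.02, 0.97, 1.03, 1.20, 1.30])

for degree in (1, 2):
    _, best_r = smoothed_optimum(r, loss, degree)
    print(f"degree {degree}: smoothed optimum at r = {best_r:.2f}")
```

If the smoothed optimum lands at or near r=0 on the paper's real data (with error bars from multiple seeds), that would support reading the dip at 0.3 as noise.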

Overall, I don't think the conclusion follows from the paper. However, as humans, we find the idea that "reasoning + classic HPO should be at least as good as classic HPO" very appealing. I also like the idea of exposing the optimizer internals to the LLM.