It's interesting that in the examples (Table 3 on page 21), the model uses the backspace token to erase the randomly-added token from the prompt, but it does not seem to ever use the token to correct its own output. I'm curious how frequently the model actually uses this backspace token in practice - and if the answer is "vanishingly rarely", what is the source of the improved MAUVE score and sample diversity they show? Is it just that the different training procedure gives an improvement?
> I'm curious how frequently the model actually uses this backspace token in practice
For it to use the backspace, wouldn't it have to predict the wrong token with greater confidence than the corrected token? I would think this would require more examples of a wrong token + correction than the correct token, which seems a bit odd.
I wonder if this could have value when forward and reverse word predictors collide/fight. I naively assume reverse word predictor models will have the ability to work towards goals, by working from the "solution" back, with the forward word predictor acting as some sort of "causal resolution".
Saw this, it looks very cool. What I was wondering was how different the rest of this model is from other LLMs, in that it appears to have some kind of RL layer that is performing the backspacing - does that sit seamlessly on top of an otherwise standard LLM or are there bigger architectural differences?
Love the idea - wonder if it can be worked directly in with a system prompt like 'write <bkspc> to nullify the closest preceding non-nullified word'. Then just parse it out of the output, e.g. "hi how are <bkspc> tall are <bkspc> <bkspc> do I talk" would resolve to "hi how do I talk"... Or maybe better with <bkspc[index#backFromThisPosition]>
Can't think of a good query to test it on that might need revision halfway through... Has anyone got some ideas?
I do find that with some coding problems an LLM will start a solution and then, as it describes that solution, need to contradict what it said earlier because it has since worked out more context about the problem.
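The parsing half of this is easy to sketch. This is only a rough illustration, assuming words are whitespace-separated and that the model emits `<bkspc>` as its own standalone word (both are my assumptions; `resolve_backspaces` is a made-up helper name, and none of this is the paper's mechanism):

```python
def resolve_backspaces(text: str, marker: str = "<bkspc>") -> str:
    """Treat each marker as 'delete the closest preceding surviving word'."""
    kept: list[str] = []
    for word in text.split():
        if word == marker:
            # A marker with nothing left to delete is silently ignored.
            if kept:
                kept.pop()
        else:
            kept.append(word)
    return " ".join(kept)


print(resolve_backspaces("hi how are <bkspc> tall are <bkspc> <bkspc> do I talk"))
# prints: hi how do I talk
```

A stack is enough here because each marker only ever removes the most recent surviving word. The indexed `<bkspc[...]>` variant would need a different data structure and is not covered by this sketch.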
On the topic of LLMs and developing more adaptive abilities, I was thinking:
LLM latency is a real problem for interactivity, especially if you want to talk to it or have it talk to another AI (e.g. in a video game environment).
When I talk to someone, they don’t wait for my whole sentence to be spoken before parsing and processing and formulating a response. Sometimes they’ll even interrupt me. Can LLMs eventually do this? Can they begin to prepare a smaller collection of “thoughts” as they are fed a streaming input? Or does the human brain analogy break down against what is actually happening inside the model?
One could bolt such a system on top of an LLM.
An LLM is "just" a document completer. Given text, it predicts the following text.
So there is nothing stopping you from bolting on a system which works like:
- Given the current streaming input, continually update a pool of N likely continuations (predict what the user will say)
- For each of those N likely continuations, pre-generate responses
- If the actual user continuation matches one of the continuations, use your pre-generated response
This is trading off compute for latency, as you won't use at least N-1 of those pre-generated responses.
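The three steps above can be sketched roughly like this. Everything here is hypothetical: `predict` and `respond` stand in for whatever model calls you would actually make, the toy versions just return canned strings, and exact string matching on the final utterance is a simplifying assumption (a real system would presumably need something fuzzier):

```python
from typing import Callable

Predict = Callable[[str, int], list[str]]   # (partial input, n) -> n guessed full utterances
Respond = Callable[[str], str]              # full utterance -> response


def build_cache(partial: str, predict: Predict, respond: Respond, n: int = 3) -> dict[str, str]:
    """Pre-generate a response for each of the n guessed continuations."""
    return {guess: respond(guess) for guess in predict(partial, n)}


def answer(final: str, cache: dict[str, str], respond: Respond) -> tuple[str, bool]:
    """Return (response, cache_hit). On a miss, generate the response as usual."""
    if final in cache:
        return cache[final], True
    return respond(final), False


# Toy stand-ins so the sketch runs without any model.
def toy_predict(partial: str, n: int) -> list[str]:
    endings = ["plus five", "plus two", "minus one"]
    return [f"{partial} {ending}" for ending in endings[:n]]


def toy_respond(utterance: str) -> str:
    return f"(reply to: {utterance})"


cache = build_cache("what is one plus three", toy_predict, toy_respond, n=2)
print(answer("what is one plus three plus five", cache, toy_respond))
print(answer("what is one plus three times nine", cache, toy_respond))
```

The first call is a cache hit and the second is a miss, which makes the cost visible: every pre-generated response that is not the one eventually used is wasted compute.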
The loop is also quite large right now. Put a lot in, get a lot out; put a little in, get a lot out. I suppose there is some work from OpenAI to have a well-formed response in a certain number of tokens or something; I am not sure how the "document size" part of their chat completion works.
That's exactly what chess programs do when they think on your time.
What sort of latency are people seeing from LLMs? I type faster than an LLM outputs tokens, so predicting my continuation would be slower than waiting for me to type it. Most people speak faster.
An LLM that knows when it should interrupt me, as the parent comment mentions, would be really cool but I don't think it has the ability to determine when an interruption would be helpful. I'd be annoyed if I was asking "what is one plus three plus five" but the LLM interrupts with "1+3=4", for example.
Like humans, context matters.
“This reminds me of… that movie where uhhh… the big guy goes to death row and he like… he can help people with his special powers, like that mouse he names or.. uhhh…”
You’d probably interrupt me and it would be appropriate and welcomed.
Sure, but would it interrupt with auto-complete nonsense too soon or would it know when an interruption would help?
> This reminds me of
>> A cool spring day?
> that movie where uhhh
>> Star Wars is a movie.
I'm human, so I can understand when I have enough info for a good guess (The Green Mile?) but that's a much different skill than what an LLM does, right?
In principle, understanding when it is appropriate to interrupt is the same sort of problem as understanding when it is appropriate to use a specific tool or API call. Both are deviations from natural language that the model has to decide to employ based on the context of the input, so yes it should be very possible for LLMs to do it. You're asking if they would do it well or if they would instead get it completely wrong and interrupt you all the time with irrelevant nonsense - I'm going to say that depends entirely on implementation.
The bigger issue, in my opinion, is that current models running on current hardware at reasonable costs simply aren't performant enough to do the faster-than-speech prediction that is required to execute the concept well.
I'm getting sub-2-second inference on a V100 for Flan-UL2 (20B).
I think we'll get there. Not just from a pure performance perspective but also in terms of model architecture. Or maybe it's rather going to be an architecture of models? I mean, the human brain isn't a single(-threaded) entity, either. Lots of things are happening in parallel, and there are parts of your brain that do complete words & sentences and the like and then there are other mechanisms that evaluate those completions and decide whether or not to say them out loud etc.
Latency is mostly a matter of hardware, model size and small tweaks. LLaMA running locally is very fast. And current LLMs could definitely be fine-tuned to interrupt conversations like you mention.
This excites me. The biggest issue I have with assistants is the awkward cadence of any interaction with them. I can’t wait to see this actually existing in practice.
Here’s a project attempting to do just this!
https://github.com/yacineMTB/talk
This is awesome and exciting. Thanks for sharing. The demo video demonstrates exactly what’s on my mind with the awkward latency.
It seems that this problem has been somewhat tackled already (Apologies for the sensationalized video title): https://youtu.be/SCfCuLy4RJA?t=298
Does anyone think the results are a little suspect?
They don't do human evaluations of which outputs people would prefer, and instead argue that because the MAUVE score is better with their training method, the model is better.
There is a lot of daylight between "They didn't expend significant resources on making their study more informative and comprehensive" and "These results are suspect". Those are not the same thing. MOS (mean opinion score) studies are reasonably expensive for this type of research and are by no means an accepted minimum.
Now add <BEL> for maximum annoyance.
How is this different to beam search?
Beam search is like a breadth-limited breadth-first search. This is more akin to a depth-first search where you give the model the ability to backtrack.
My guess is your point is "is this effectively better than beam search?" I wondered the same thing. The paper doesn't mention it, which at least on the surface seems strange.
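A toy illustration of that search-shape analogy, and only of the analogy: in the paper the model itself learns when to emit a backspace, whereas here an outer loop does the backtracking over a hand-written table. The `TOY` table, its tokens and its log-probabilities are all invented for the example:

```python
# Hypothetical toy "model": prefix -> [(next_token, log_prob), ...]
TOY = {
    (): [("a", -0.1), ("b", -1.0)],
    ("a",): [("x", -0.2)],
    ("a", "x"): [],                      # dead end: no valid continuation
    ("b",): [("y", -0.3)],
    ("b", "y"): [("<eos>", -0.1)],
}


def beam_search(width, max_len=4):
    """Breadth-limited search: keep only the `width` best prefixes per step."""
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((prefix, score))  # finished beams carry over
                continue
            for tok, lp in TOY.get(prefix, []):
                candidates.append((prefix + (tok,), score + lp))
        if not candidates:
            return None  # every surviving beam hit a dead end
        beams = sorted(candidates, key=lambda c: -c[1])[:width]
    finished = [prefix for prefix, _ in beams if prefix[-1] == "<eos>"]
    return finished[0] if finished else None


def backtracking_decode(max_len=4):
    """Depth-first: follow the best untried option, 'backspace' on dead ends."""
    path, tried, backspaces = [], {}, 0
    while True:
        if path and path[-1] == "<eos>":
            return path, backspaces
        prefix = tuple(path)
        options = sorted(TOY.get(prefix, []), key=lambda o: -o[1])
        i = tried.get(prefix, 0)  # how many options were already tried here
        if i < len(options) and len(path) < max_len:
            tried[prefix] = i + 1
            path.append(options[i][0])
        elif path:
            path.pop()            # the "backspace" step
            backspaces += 1
        else:
            return None, backspaces


print(beam_search(width=1))     # None
print(beam_search(width=2))     # ('b', 'y', '<eos>')
print(backtracking_decode())    # (['b', 'y', '<eos>'], 2)
```

With a beam of width 1 the search commits to the locally best token and gets stuck, a wider beam survives by carrying the alternative along, and the backtracking version reaches the same answer while only ever holding one path, at the cost of two backspaces.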