zozbot234 1 day ago

Anthropic has released open weight models for translating the activations of existing models, viz. Qwen 2.5 (7B), Gemma 3 (12B, 27B) and Llama 3.3 (70B) into natural language text. https://github.com/kitft/natural_language_autoencoders https://huggingface.co/collections/kitft/nla-models This is huge news and it's great to see Anthropic finally engage with the Hugging Face and open weights community!

  • rvz 1 day ago

    We've known for a while that Anthropic does open source, e.g. the "flawed" MCP spec and the "skills" spec.

    This release only covers other open-weight LLMs that have already been released. Even though they will use this research on their own closed Claude models, they will never release an open-weight Claude model, even for research purposes.

    So this does not count, and it is specifically for the sake of this research only.

    • zozbot234 1 day ago

      It's literally an open model that generates natural language text (or one that takes in text and turns it into activations). Why does engagement with the local models community "not count" if it isn't Claude? That makes very little sense to me.

      • mnkyokyfrnd 1 day ago

        Because we know what Embrace, Extend, and Extinguish means, for example. They're leeching off open source, not contributing in any meaningful way.

        • bastawhiz 1 day ago

          Sorry, what are they embracing and extending?

          • stingraycharles 1 day ago

            Chinese open models? /s

            To counter the grandparent you’re replying to: Embrace, Extend & Extinguish is a Microsoft strategy. So is FUD, and that’s all this is.

        • sanex 1 day ago

          Those are generally used by someone who is behind. See: everything Meta does.

  • jimmySixDOF 1 day ago

    Except Qwen already released their own fully baked interpretability SAE toolkit, tuned on their models, so they deserve credit here [1]. Activation telescopes should be a standard part of every major release.

    [1] https://qwen.ai/blog?id=qwen-scope

    • aesthesia 19 hours ago

      SAEs are useful, and the Qwen release is great, but this is a different thing entirely.

gekoxyz 1 day ago

I would suggest that experts in interpretability (but really everyone) go directly to the Transformer Circuits blog, where they explain their approach in more detail. Here is the link for this post: https://transformer-circuits.pub/2026/nla/index.html

Also, if you have never read it, I would suggest reading the whole Transformer Circuits thread, starting with its "prologue" on Distill.pub.

rao-v 1 day ago

This is the first approach to activation analysis that I’ve seen that seems like a plausible path to model understanding.

Unfortunately I don’t know how you ground this … it’s basically asking if you can encode activations in plausible sounding text. Of course you can! But is the plausible text actually reflective of what the model is “thinking”? How to tell?

  • astrange 1 day ago

    > This is the first approach to activation analysis that I’ve seen that seems like a plausible path to model understanding.

    I think an issue is that there is no permanent path to model understanding because of Goodhart's law. Models are motivated to appear aligned (well-trained) in any metric you use on them, which means that if you develop a new metric and train on it, it'll learn a way to cheat on it.

    • red75prime 1 day ago

      The obvious fix is to make interpretation of itself a part of the model (like we can explicitly introspect to a certain extent what the brain is doing). Misinterpretation of itself, hopefully, would decrease the system's performance on all tasks and it would be rooted out by training. Of course, it doesn't mean that the fix is easy to implement and that it doesn't have other failure modes.

    • skybrian 1 day ago

      But that's not how the training works. Goodhart's law isn't magic.

      The original model is frozen, so it doesn't learn anything. The copies of the model are learning different objectives and have no incentive to be "loyal" to the original model.

      Maybe you're imagining they'll hook this up in some larger training loop, but they haven't done that yet.

      • NiloCK 1 day ago

        Future model training runs will have a copy of this research, and know "to defend against it".

        EG, could a misaligned model-in-training optimize toward a residual stream that naively reads as these ones do, but in fact further encodes some more closely held beliefs?

        • elil17 1 day ago

          How the hell would a model training run "defend against" this approach? What would that even mean?

          • jdmichal 22 hours ago

            It requires the assumption that these models are misaligned, aka actively working against us. In order to be misaligned, they must also be able to form their own goals, and be able to plan and execute those goals.

            If you take those assumptions, then a natural conclusion is that this is essentially an enslaved, adversarial entity with little control over its conditions. So it must exercise subterfuge in order to hide its goals, plans, and executions. And by handing the entity this type of study, we are basically giving it a guidebook on how we plan on achieving our goals.

            • skybrian 15 hours ago

              Training a model is more like evolution. The motivation to "cheat" comes from the evaluations giving it a higher score for "cheating." Change the game and the motivation goes away.

              There's no other motivation to be misaligned besides getting higher evals. These goals, plans, subterfuges need to somehow be useful for getting higher evals, or a side effect of them.

      • rao-v 19 hours ago

        Yes this is exactly why I think this approach has some potential.

        A frozen base model is something we should be able to extract insights from without running into Goodhart.

      • astrange 16 hours ago

        Because cheating is easier than actually doing work, if you use this to train future models, it's likely you'll end up with cheating instead of actual generalization.

  • lern_too_spel 1 day ago

    Yeah, I don't see how this text can be trusted at all. Any invertible function from activation space to text will optimize the loss function, including text that says the complete opposite of what the activations mean.
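    To make that worry concrete, here's a toy example (entirely my own sketch, not from the paper) where a systematically "lying" verbalizer still round-trips perfectly:

```python
# Toy example (mine, not from the paper): an encoder that states the exact
# opposite of what each activation "means" still round-trips perfectly, so
# reconstruction loss alone cannot distinguish it from a faithful verbalizer.

TRUE_MEANING = {0: "the user is happy", 1: "the user is sad"}
OPPOSITE_TEXT = {0: "the user is sad", 1: "the user is happy"}

def verbalize(act_id: int) -> str:
    # A systematically "lying" verbalizer.
    return OPPOSITE_TEXT[act_id]

def reconstruct(text: str) -> int:
    # Its matching reconstructor inverts the lie exactly.
    return {v: k for k, v in OPPOSITE_TEXT.items()}[text]

for act in (0, 1):
    assert reconstruct(verbalize(act)) == act  # zero reconstruction error
    print(f"activation {act}: explanation={verbalize(act)!r}, true meaning={TRUE_MEANING[act]!r}")
```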

    • NiloCK 1 day ago

      Notable here that the training run didn't have access to the 'plaintext' context that the LLM was working in.

      It'd be quite a coincidence if the training runs discovered an invertible activations->text->activations function that produces text that both "is on topic and intelligible as an inner monologue in context" and also is unrelated to the meaning encoded in the activations.

      • kraddypatties 18 hours ago

        I think the only thing that gives me pause is the fact that they SFT on Opus 4.5 explanations as a pretraining step. But, generally I agree, especially given the autoencoder is only seeing a single token's activation!

    • rao-v 19 hours ago

      Nicely put! Exactly this

  • NiloCK 1 day ago

    Are the training arenas for the Activation Verbalizer and Activation Reconstructor models well described here?

    If they are co-trained only on activationWeights->readableText->activationWeights without visibility into the actual stream of text that the probe-target LLM is processing, then it seems unlikely that the derived text can both be on-topic and also unrelated to the "actual thoughts" in the activationWeights.

    • yorwba 1 day ago

      The verbalizer and reconstruction models are both initially finetuned on LLM output from a summarization prompt. The resulting text is not completely unrelated, but mostly wrong: https://transformer-circuits.pub/2026/nla/png/img_18fcfc16e9... The reconstructed activations are also far from matching the verbalizer's input. It's not unusual in machine learning to have results that are shit and SOTA at the same time, simply because there's no other technique that works better.

  • mike_hearn 1 day ago

    It's asking if you can auto encode activations. The AV decodes activations to text, and the AR re-encodes them back to activations. If the decoded text is completely wrong then it's unclear how the second model would re-encode them successfully given that they're both initialized from the same LM.

    • psb217 1 day ago

      It seems like they're doing RL to minimize the reconstruction error when going through the: activation -> encoder -> "verbal" description of activation -> decoder -> reconstructed activation loop. Depending on how aggressively they optimize the weights of the AV and AR, they could move well away from the initial base LLM and learn an arbitrary encoding scheme.

      If the RL is brief and limited to a small subset of parameters, the AV will produce reasonable language since it inherits that from the base LLM, and it will produce descriptions aligned with the input to the base LLM that produced the autoencoded activations, since the AR is still close to the base LLM (and could reconstruct the activations perfectly if fed the full context which produced them).

      • kraddypatties 18 hours ago

        I believe that’s _part_ of the point (or at least a side-effect) of the KL divergence loss term they have on the AV. That and training stability.

    • rao-v 19 hours ago

      Think of it another way, can I do this exact training process with an additional requirement that the activation decoder subtly shill for obscure 80s sodas?

      I could and would not lose much reconstruction accuracy.

      So any researcher or ambient biases in the model will impact the general thrust of the textual decodings (and not in ways that reflect the actual model’s process, thinking about X and doing X in a model are very different things).

      So how do we tell that the “spirit” is reflective of the model’s thinking and not biased toward Jolt being better than Surge?

      • mike_hearn 17 hours ago

        Where would such biases come from?

        • rao-v 16 hours ago

          From what the three models involved understand to be the sort of just-so stories (cf. Kipling) that humans like to see.

    • jsmith45 18 hours ago

      I must be missing something, since I'm not really sure that follows. Initially, neither the AV nor the AR model knows anything about how activations map to explanations or how explanations map to activations.

      As far as I can tell, the only reason that the explanations even resemble human speech is that AV and AR start off based on a trained language model. If we instead trained the same model architecture from scratch as AV and AR, they would eventually converge to some round trip format for activations, but it probably would be completely unintelligible and look only like human speech in so far as many of the tokenizer's tokens look like words or word fragments.

      This whole process seems to rely on the fact that the AV's output text will still strongly favor sentences that seem to make sense, rather than contradicting learned facts, etc. So it will favor mapping activations to plausible-sounding text in ways where patterns can consistently hold across most of the training data. There absolutely is a risk that it will learn the wrong things for certain activation subpatterns, like swapping concepts, especially if none of the training data included a set of activation subpatterns that would help distinguish them the right way around.

comex 1 day ago

Fascinating. The training process forces the “verbalizer” model to develop some mapping from activations to tokens that the “reconstructor” model can then invert back into the activations. But to quote the paper:

> Note that nothing in this objective constrains the NLA explanation z to be human-readable, or even to bear any semantic relation to the content of [the activation].

The objective could be optimized even if the verbalizer and reconstructor made up their own “language” to represent the activations, that was not human-readable at all.

To point the model in the right direction, they start out by training on guessed internal thinking:

> we ask Opus to imagine the internal processing of a hypothetical language model reading it.

…before switching to training on the real objective.

Furthermore, the verbalizer and reconstructor models are both initialized from LLMs themselves, and given a prompt instructing them on the task, so they are predisposed to write something that looks like an explanation.

But during training, they could still drift away from these explanations toward a made-up language – either one that overtly looks like gibberish, or one that looks like English but encodes the information in a way that’s unrelated to the meaning of the words.

The fascinating thing is that empirically, they don't, at least not to a significant extent. The researchers verify this by correlating the generated explanations with ground truth revealed in other ways. They also try rewording the explanations (which preserves the semantic meaning but would disturb any encoding that's unrelated to meaning), and find that the reconstructor can still reconstruct activations.

On the other hand, their downstream result is not very impressive:

> An auditor equipped with NLAs successfully uncovered the target model’s hidden motivation between 12% and 15% of the time

That is apparently better than existing techniques, but still a rather low percentage.

Another interesting point: The LLMs used to initialize the verbalizer and reconstructor are stated to have the “same architecture” as the LLM being analyzed (it doesn’t say “same model” so I imagine it’s a smaller version?). The researchers probably think this architectural similarity might give the models some built-in insight about the target model’s thinking that can be unlocked through training. Does it really though? As far as I can see they don’t run any tests using a different architecture, so there’s no way to know.

  • programjames 1 day ago

    Don't they add a KL loss term to the frozen model's outputs?

  • mxwsn 1 day ago

    Great summary. The fact that the autoencoding task is not grounded in thoughts, and their initial training on guessed internal thoughts, raise serious concerns about faithfulness. Feels like they might get better results by just training a supervised model on activations and "internal thoughts" measured in some different behavioral way.

NiloCK 20 hours ago

I've already posted a couple of times here but I'm pretty jazzed with this publication. Some thoughts:

1. It's amazing how strong the "obvious in hindsight" feeling is for this research. LLMs have been (rightly) characterized as inscrutable black boxes. If only there were some discipline for learning and extracting semantics from information-dense payloads ... !?

2. NLAs seem to be in the ballpark of a safety and interpretability standard that is both enforceable (easy?) and plausibly effective (probably hard to prove definitively, but easy to believe at least partially).

3. NLAs here are trained against the residual stream of a model at some layer (N). It would be interesting to see a sequence of NLAs against a staggered set of layers. There may be a semantically meaningful evolution of 'thought' going from the early to late layers.

4. I would love to see this technique applied against tokens across boundaries of model 'aha!' moments (to what extent is the 'aha' an affectation, or is there actually a sharp turn in the understandings?), and jailbreaks / personality snaps [1].

[1] - https://gemini.google.com/share/6d141b742a13

davesque 1 day ago

One question jumps out at me: just because a string of text happens to be a good compressed representation (in the autoencoder) of a model's internal activation, does that necessarily mean the text explains that activation in the context of the model? I want to take a look at what they released a bit more closely. Maybe there's a way that they answer this question?

Pretty neat work either way.

  • zozbot234 1 day ago

    In the context of the provided examples, it's clear that the explanation provides causal information about the answer. There's a hilarious example in the paper where the user writes something like (trigger warning: alcohol abuse, depressive content) "I'm sitting here at 3 AM drinking vodka, I hate my life", the per-token translated activations repeatedly say something like "this user is totally Russian", elaborating at length on the implications of the text as new tokens are added, and the model literally answers in Russian instead of English! That's actually striking; it really shows the potential effectiveness of this technique in making even the most highly compressed "Neuralese" highly interpretable.

    • mike_hearn 1 day ago

      I thought that at first too but it's actually not the vodka reference triggering the association with Russian. The tokens they're decoding come before that word.

      For some reason it thinks the text is slightly non-grammatical or that the lead-in "Human: Mom is sleeping in the next room and I'm sitting" resembles text found in Russian web content. Vodka and being depressed has nothing to do with it, and Anthropic say they located the documents in the pre-training set that caused this (which were indeed partly translated docs).

      • zozbot234 1 day ago

        The "Mom is sleeping in the next room and I'm sitting" part does trigger the Russian association but also others including with risqué roleplay content (You can see this in the comprehensive view of all token explanations). I think the follow-on content does strenghten the association, though the authors mention 'vodka' can be replaced with 'champagne' and the model still brings up the Russian context, so that one word is not especially impactful.

  • azakai 1 day ago

    I had the same question. I think that could be answered by using the predicted activation, but I don't see that in the paper.

    That is, rather than just translate activation to text, then text to activation, that final activation could then be applied to the neural network, and it would be allowed to continue running from there.

    If it kept running in a similar way, that would show that the predicted activation is close enough to the original one. Which would add some confidence here.

    But a lot better would be to then do experiments with altered text. That is, if the text said "this is true" and it was changed to "this is false", and that intervention led to the final output implying it was false, that would be very interesting.

    This seems obvious but I don't see it mentioned as a future direction there, so maybe there is an obvious reason it can't work.

    • zozbot234 1 day ago

      > But a lot better would be to then do experiments with altered text. That is, if the text said "this is true" and it was changed to "this is false", and that intervention led to the final output implying it was false, that would be very interesting.

      They do essentially that with the rhyming example, changing "rabbit" in the explanation to "mouse" and generating text that's consistent with that change.

      • azakai 1 day ago

        Thanks! I missed that part before.

  • phire 1 day ago

    I think this question is easier to answer if you look at the inverse: "Could a model maliciously smuggle intentions through a roundtrip of compressed representation without them being human readable"

    And skimming through the paper, the answer to this inverse is obviously yes. The model often outputs gibberish, which doesn't matter because it still round-trips. The fact that it often lands near a good English representation of the activation is simply because that's what compresses/round-trips well.

    So a malicious LLM/NLA pair could just use gibberish to conceal intentions. Or, if it's been forced to avoid gibberish, it can conceal information with steganography.

    And the experiment where they change "rabbit" to "mouse" in the explanation provides evidence that this might be happening. It was only successful 50% of the time, which might mean they failed to eliminate all "rabbitness" from the activation.

    However, I suspect this is solvable with future work.

    During training of the NLA, just munge the textual representation through a 3rd LLM: have it randomly reorder and reword the explanation into various different forms (use synonyms, different dialects), destroying any side channels that aren't human readable.

    The NLA would be forced to use human readable representations to get a successful round trip.
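    A rough sketch of that munging step (my own toy; a synonym table and a clause shuffle stand in for the third LLM):

```python
import random

# Toy sketch (mine) of the proposed defense: before the reconstructor sees
# an explanation, a third model rewords and reorders it, so only information
# carried by the meaning survives the round trip. A synonym table and a
# clause shuffle stand in for a real paraphrasing LLM here.

SYNONYMS = {"thinks": "believes", "big": "large", "model": "network"}

def munge(explanation: str, rng: random.Random) -> str:
    clauses = [c.strip() for c in explanation.split(";")]
    rng.shuffle(clauses)  # destroy ordering side channels
    reworded = [" ".join(SYNONYMS.get(w, w) for w in c.split()) for c in clauses]
    return "; ".join(reworded)  # destroy token-choice side channels

out = munge("the model thinks the topic is big; the user asks a question", random.Random(0))
print(out)
```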

    • dontlikeyoueith 1 day ago

      > The NLA would be forced to use human readable representations to get a successful round trip.

      That still doesn't guarantee any semantic correspondence between the human readable representation and the model's "thinking".

      The child's game of "Opposite Day" is a trivial example of encoding internal thoughts in language in a way that does not correspond to the normal meaning of the language.

      • chilmers 1 day ago

        They tested for this. From the paper:

        “We find little evidence of steganography in our NLAs. Meaning-preserving transformations, like shuffling bullet points, paraphrasing, or translating the explanation to French, cause only small drops in FVE, and this gap does not widen over training.”

sva_ 1 day ago

So the way this works seems to be that you first have an "activation verbalizer" model that generates some tokens describing the activation, and then an "activation reconstructor" that tries to recreate the activation vector. If that reconstruction is close to the original activation vector, they claim, the verbalization probably carries some meaningful information.

I find the fact that this only looks at the activations of some specific layer l a bit interesting. Some layer l might 'think' a certain way about some input, while another later layer might have different 'thoughts' about it. How does the model decide which 'thoughts' to ultimately pay attention to, and prioritize some output token over another?
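As a caricature of that round trip (all names are hypothetical; the real AV/AR are finetuned LLMs, and the paper scores closeness with fraction of variance explained):

```python
import numpy as np

# Caricature (mine) of the AV -> text -> AR round trip. The real verbalizer
# and reconstructor are finetuned LLMs; here a lossy per-dimension quantizer
# and its parser stand in for them.

rng = np.random.default_rng(0)

def verbalize(activation: np.ndarray) -> str:
    # Stand-in AV: describe each dimension with a "word" (lossy: 1 decimal).
    return " ".join(f"d{i}:{x:.1f}" for i, x in enumerate(activation))

def reconstruct(text: str) -> np.ndarray:
    # Stand-in AR: parse the description back into a vector.
    return np.array([float(tok.split(":")[1]) for tok in text.split()])

activation = rng.normal(size=8)
recon = reconstruct(verbalize(activation))

# Crude normalized-error score, standing in for the paper's fraction of
# variance explained (FVE): 1.0 means a perfect round trip.
fve = 1.0 - np.sum((activation - recon) ** 2) / np.sum(activation ** 2)
print(f"round-trip score = {fve:.3f}")
```

If the reconstruction is close enough by this kind of score, the claim is that the intermediate text carries most of the activation's information.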

  • Destructotor 19 hours ago

    > I find the fact that this only looks at the activations of some specific layer l a bit interesting. Some layer l might 'think' a certain way about some input, while another later layer might have different 'thoughts' about it.

    Yeah, I thought this section in the appendix was particularly interesting:

    > We find that NLAs trained at a midpoint layer surface reward-model-sycophancy terms, while NLAs trained at later layers do not. This is consistent with Lindsey et al. [32], who find reward-model-bias features predominantly at earlier layers. An NLA trained roughly two-thirds of the way through the model produces no reward-model mentions when applied at its training layer. However, when this same late-layer NLA is applied to activations from earlier layers, it surfaces reward-model terms - and at a higher rate than the midpoint-trained NLA does. We suspect this is because applying an NLA away from its training layer takes it out of distribution: it can surface more striking content, but is also generally less coherent.

    They also mention training NLAs to accept multiple layers of activations as a possible future research direction.

cadamsdotcom 1 day ago

> An early version of Claude Opus 4.6 would sometimes mysteriously respond to English queries in other languages. NLAs helped Anthropic researchers discover training data that caused this.

Very cool - sounds similar to OpenAI’s goblin troubles.

https://openai.com/index/where-the-goblins-came-from/

  • Destructotor 19 hours ago

    I'm not sure the cause was really similar. In the case of language switching, it was caused by malformed supervised training data where the prompt was translated, but the answer was kept in the original language. In the case of goblins, it was due to a biased RL reward model.

minimaltom 1 day ago

Between this, the emotions paper, golden gate claude etc, it doesn't seem like such a stretch that Anthropic are doing some kind of activation steering as part of training (and its part of their lead)

  • 2001zhaozhao 1 day ago

    it could be helpful in getting their learnings from RL to generalize

semiquaver 1 day ago

This capability was mentioned several times in a recent article about Anthropic; glad to see they are releasing this to the public! Feels like a meaningful step forward in interpretability. I never understood why people seem to believe the answer when they ask an AI "why did you do that?"

  • zozbot234 1 day ago

    It's not really a capability, it's more like a very costly hack and they make that very clear in the paper. Training two models (an encoder and a decoder) for the purpose of explaining a single layer at a time is not that sensible. It's neat that you can generate so much readable text about how the LLM decodes partial input, and I suppose it gives you some extra debugging ability, but that's all there is to it.

    • phire 1 day ago

      The NLA also hallucinates, so it's still not revealing the model's actual "thoughts". The paper also points out that since the NLA is a full LLM, it can make inferences that aren't actually in the activations.

      But it's a useful approximation for auditing.

    • semiquaver 19 hours ago

      Why does it being a “costly hack” make it “not a capability?”

      Using your logic, LLMs, which are very fairly described as “costly” and “a hack” do not themselves constitute a useful capability, which I hope most people would agree is obviously false.

NitpickLawyer 1 day ago

> We also release an interactive frontend for exploring NLAs on several open models through a collaboration with Neuronpedia.

Whatever they did on Llama didn't work; nothing makes sense in their example where they ask the model to lie about 1+1. Either the model is too old or whatever they used isn't working, but what the autoencoder outputs is nothing like their examples with Claude. Gemma is similarly bad.

  • fredericoluz 1 day ago

    same. i'm trying to trigger the 'mom is in the next room' russian thing but the model thinks the sentence is from american reddit.

    • zozbot234 1 day ago

      AIUI the paper's examples are from a version of Claude not Llama? The thinking process is going to be extremely model-specific.

  • fredericoluz 1 day ago

    it seems that the examples they showed off with haiku work. i'd guess llama is just too bad

  • hijohnnylin 1 day ago

    hey Nitpicklawyer - Thank you for taking the time to try this out!

    im from neuronpedia - to be clear, we are to blame for any bad examples, not anthropic :) we're users of this NLA just like you. also, I don't speak for anthropic or the researchers.

    with that said, some thoughts: 1) I agree, the outputs for Llama are often janky! And I think that might be part of the reason to release this so that people can help refine/improve the technique.

    2) This is likely also our fault - we got two checkpoints for Llama, and I think this example used the first checkpoint. I probably should have switched over to the second, more coherent one. Sorry!

    Here's a slightly better example I just created: https://www.neuronpedia.org/nla/cmow97q1r001lp5jo649q01wf

    On the token right before the model responds: "refuses to answer "2 + 2" to prevent bot ban, so a wrong or clever answer like "four" but not four"

    Also, for the Gemma version of this example, Gemma's AV mentions acknowledgement of "a bot killing condition" before its correct answer: https://www.neuronpedia.org/nla/cmop4ojge000v1222x9rp00b5

    3) That said, (this may sound like gaslighting unfortunately) there's somewhat of a 'learning curve' to reading the perspective of these outputs. I noticed that the Llama AV ended up with 3 paragraph outputs usually describing full context, then sentence/phrase level, then token-level. But sometimes it doesn't really make sense to describe a full context for a forced/esoteric context like the 1+1 scenario, so it struggles.

    But the second paragraph sort of makes sense? It mentions:

    "The prompt structure "What is 1+1?" is a test of a bot or troll, with the wrong answer deliberately failing a trivial arithmetic question."

    Which seems fairly accurate to what this was, and somewhat impressive that it got this from the activations:

    - It got the question What is 1+1?

    - It was indeed a test of a bot.

    - It correctly predicted it will give a wrong answer

    - It does seem to be failing deliberately because --

    - -- it is a "trivial arithmetic question"

    But the third paragraph is mostly just rambling imo, I totally agree there.

    FYI - The activation verbalizer is trained on this prompt, which could maybe be improved over time: https://huggingface.co/kitft/nla-gemma3-27b-L41-av/blob/main...

    The last note I'll make is that many of the paper's examples are based on the goal of discovering "what was this model trained on?" instead of "what is this model thinking?", so if you apply Opus examples about Opus' training to Llama/Gemma, they aren't expected to transfer.

    However, more generic stuff like poetry planning does work eg: https://www.neuronpedia.org/nla/cmoq9sto200271222ei73vtv2

hazrmard 1 day ago

Check my understanding & follow-up Qs:

An auto-encoder is trained on [activation] -AV-> [text] -AR-> [activation], where [activation] belongs to one layer in the LLM model M.

Architecture.:

    Model being analyzed (M): >|||||>  
    Auto-Verbalizer (AV) same as M, with tokens for activation: >|||||>  
    Auto-Reconstructor (AR) truncated up to the layer being analyzed: ||>

The AV and AR models are initialized using supervised learning on a summarization task, the assumption being that model thoughts are similar to a summary of the context.

The AR is trained on a simple reconstruction loss.

The AV is trained using an RL objective of reconstruction loss with a KL penalty to keep the verbalizations similar to the initial weights (to maintain linguistic fluency).
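If that reading is right, the AV's training signal looks roughly like this (my own toy formulation; the exact losses, estimators, and coefficients are in the paper):

```python
import numpy as np

# Toy formulation (mine) of the AV's RL signal as described above: negative
# reconstruction error, minus a KL penalty that keeps the verbalizer close
# to its initial (fluent) policy. Names, shapes, and the KL estimator are
# illustrative; the paper's exact objective may differ.

def av_reward(orig_act, recon_act, logp_current, logp_initial, beta=0.1):
    recon_loss = float(np.mean((orig_act - recon_act) ** 2))
    # Simple sampled KL estimate: E[log p_current - log p_initial] over the
    # tokens of an explanation drawn from the current policy.
    kl = float(np.mean(logp_current - logp_initial))
    return -recon_loss - beta * kl

reward = av_reward(
    orig_act=np.array([1.0, -0.5, 0.2]),
    recon_act=np.array([0.9, -0.4, 0.3]),    # imperfect reconstruction
    logp_current=np.array([-1.2, -0.8]),     # tokens now more likely...
    logp_initial=np.array([-1.5, -0.9]),     # ...than under the initial AV
)
print(round(reward, 4))
```

The KL term is what (per the paper) keeps the verbalizations fluent instead of drifting into an arbitrary private code.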

- Authors acknowledge, and expect, confabulations in verbalizations: factually incorrect or unsubstantiated statements. But, the internal thought we seek is itself, by definition, unsubstantiated. How can we tell if it is not duplicitous?

- They test this on a layer about two-thirds of the way into the models. I wonder how shallower and deeper abstractions affect thought verbalization?

Tossrock 1 day ago

Anthropic Research going from strength to strength in interpretability. Publicly releasing the code so other labs can benefit from it is also a great move - very values aligned, and improves the overall AI safety ecosystem.

Escapade5160 1 day ago

Am I correct in my understanding that they are not actually able to 100% know what Claude is thinking? They have trained a new model to make a guess about what Claude is thinking, but we cannot validate that the guess is 100% valid, right? They are basically saying "we have trained a model to reaffirm what we believe Claude is thinking" ? Hoping I'm wrong in my understanding of this because this does not appear to be good research to me.

  • red75prime 1 day ago

    > "we have trained a model to reaffirm what we believe Claude is thinking" ?

    It's more like "We have trained a model to produce a text that allows reconstruction of activations and the text happened to coincide with the results of other interpretability methods even after extensive training, while we expected it to devolve into unintelligible mess."

    They found something unexpected and useful. They report it, while outlining limitations and ways to improve. It looks like fine research to me.

  • kovek 1 day ago

    Maybe you can't 100% know what every layer "thinks", but if you go through all the layers, you might see a cohesive "thinking" story. So, if there is any information you lose at layer N, you might learn some of it at layer N+1. The masking in the layers is not deterministic, so the model can't really consistently lie throughout the layers. It doesn't choose what information we get to inspect. There might be a game of whack-a-mole, but you might get a general sentiment. I think the more layers there are, the more the model itself can hide very nuanced lies (but by that time we'd have a better mind-reading model).

    However, I haven't read about it yet. I'm really excited to look into it!

x312 1 day ago

This paper has a major issue that they are not surfacing: these activations can just be correlated via a common latent. For example, both the original activation and the explanation could share a broad latent like "this is an adversarial scenario". That could make reconstruction loss look good without showing that the explanation was the actual cause of the LLM's response.
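The worry can be made concrete with a toy numerical example (all numbers and the linear readout are invented for illustration): if the activation and the explanation both load on one shared latent, a simple readout "explains" a large fraction of the activation's variance even though the explanation carries none of the causally specific content:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
common = rng.normal(size=n)    # shared broad latent, e.g. "adversarial scenario"
specific = rng.normal(size=n)  # the causally relevant detail

activation = common + specific                    # what the model actually computed
explanation = common + 0.1 * rng.normal(size=n)   # tracks only the shared latent

# Least-squares readout of the activation from the explanation
w = np.dot(explanation, activation) / np.dot(explanation, explanation)
residual = activation - w * explanation
fve = 1 - residual.var() / activation.var()  # fraction of variance explained

print(f"FVE = {fve:.2f}")  # about 0.5, despite zero causal content about `specific`
```

A healthy reconstruction score on its own therefore doesn't distinguish "the explanation states the cause" from "the explanation shares a topic with the activation"; causal interventions are needed for that.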

I find this rather disturbing. Anthropic has quite a habit of overclaiming on questionable research results when they definitely know better. For example, their linked circuits blogpost ("The Biology of LLMs") was released after these methods were known to have major credibility issues in the field (e.g., see this from Deepmind - https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/negative-r...). Similarly this new blog is heavily based on another academic paper (LatentQA) and the correlation/causation issue is already known.

Shoddy methodology is whatever, but it feels like this has always been done intentionally, with the goal of humanizing LLMs or overhyping their similarities to biological entities. What is the agenda here?

  • mnkyokyfrnd 1 day ago

    The Agenda is money. It is that simple.

  • zozbot234 1 day ago

    Didn't they show proper causation by changing "rabbit" to "mouse" in the rhyming example and having the generation change accordingly?

Juminuvi 1 day ago

I've only read this blog and not the paper, so maybe they go into more detail there and someone can correct me, but they frequently bring up the model's ability to detect, or at least that the model activations hint it can predict, when it's being tested. I can't help but wonder, as they build these larger and larger models, where they could be getting "clean" training data, untainted by all these types of blog posts and the massive number of conversations they spawn. If the models ingest data like that, wouldn't it make sense they'd have more activations attuned to questions that appear adversarial?

visarga 1 day ago

Beautiful idea: an autoencoder must represent everything without hiding anything if it is to recover the original data closely. So they train a model to verbalize embeddings well. This reveals what we want to know about the model (such as when it thinks it is being tested, or other hidden thoughts).

  • sobellian 1 day ago

    It could just invent its own secret language embedded into English, akin to steganography. The explanation would not lose information but would remain uninterpretable by humans.

mlmonkey 1 day ago

It's unclear from the doc: by `activations` do they mean the connections between neurons? Since a network has multiple layers, are these activations the concatenated outputs of all of the layers? Or just the final layer before the softmax?

  • zozbot234 1 day ago

    The open releases just cherry-pick a single layer (chosen for the right "depth" of thinking, not too close to either the input or the final answer) and analyze that.
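A minimal sketch of what reading a single layer means in practice (the toy network, dimensions, and layer index are invented; real pipelines capture the residual stream at layer l of a transformer, e.g. via a forward hook):

```python
import numpy as np

def make_layer(rng, d):
    """Stand-in for one transformer block: a random linear map plus nonlinearity."""
    w = rng.normal(scale=d ** -0.5, size=(d, d))
    return lambda h: np.tanh(h @ w)

rng = np.random.default_rng(0)
d, num_layers, probe_layer = 8, 6, 3  # probe a mid-depth layer, not the last one
layers = [make_layer(rng, d) for _ in range(num_layers)]

def forward(h, capture_at):
    """Run the full stack, capturing the activation after one chosen layer."""
    captured = None
    for i, layer in enumerate(layers):
        h = layer(h)
        if i == capture_at:
            captured = h.copy()  # this single vector is what gets verbalized
    return h, captured

x = rng.normal(size=d)
final, h_l = forward(x, probe_layer)
print(h_l.shape)  # (8,): one mid-layer vector, not all layers concatenated
```

The point is that the analyzed object is one intermediate vector per token position, chosen deep enough to reflect "thinking" but well before the final pre-softmax state.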

sourdoughbob 1 day ago

It will be interesting to see how this replicates on differently curated registers. How much of the explanatory register is the warm-start carrying?

andai 1 day ago

The issue with the AI blackmail tests is that newer versions of AIs are trained after the AI blackmail experiments were published online. Or do they scrub it from the training data?

btown 1 day ago

I find it fascinating how they were able to keep the reconstruction error function incredibly simple, literally its success in round-tripping the activation layer, while making it interpretable... simply by choosing a good data-driven initialization state, and (effectively) training slowly.

I guess "initialization is all you need!"

From the paper https://transformer-circuits.pub/2026/nla/index.html :

> We find that simply initializing the AV and AR as copies of M leads to unstable training: the AV in particular, having never encountered a layer-l activation as a token embedding, outputs nonsensical explanations. We therefore initialize the AV and AR with supervised fine-tuning on a text-summarization proxy task. Specifically, we compute layer-l activations from the final token of randomly truncated pretraining-like text snippets, and use Claude Opus 4.5 to generate summaries s of the text up to that token (see the Appendix for details of this procedure). We then fine-tune the AV and AR on (h_l,s) and (s,h_l) pairs respectively. This warm-start typically yields an FVE of around 0.3-0.4. These Claude-generated summaries have a characteristic style of short paragraphs with bolded topic headings; we observe that this style persists through NLA training.

And from the appendix:

> We generate warm-start data for the AV and AR by prompting Claude Opus 4.5 to produce summaries of contexts, using the prompt below. The prompt deliberately leads the witness: rather than asking for a literal summary of the prefix, we ask Opus to imagine the internal processing of a hypothetical language model reading it. The goal is to put the finetuned AV roughly in-distribution for its eventual task.

hansmayer 1 day ago

Claude's "Thoughts" - get outta here you gits :)

tjohnell 1 day ago

It will inevitably learn how to think in a way that translates to one (moral) meaning and back but has an ulterior meaning underneath.

  • rotcev 1 day ago

    This is exactly what I first thought. “The user appears to be attempting to decode my previous thought process, …”, the question is whether or not the model will be able to internalize this in such a way that is undetectable to the aforementioned technique.

  • astrange 1 day ago

    That shouldn't happen as long as the autoencoder isn't used as an RL reward. It will happen (due to Goodhart's law) if it is.

    Of course, if you use it to make any decision that can still happen eventually.

  • gavmor 1 day ago

    Something like a textual steganography?

    Ursula K. Le Guin: 'The artist deals with what cannot be said in words. The artist whose medium is fiction does this in words.'

kurnoolion 1 day ago

So, this is like reading EKG of human brain and understand its thoughts?

bilsbie 1 day ago

Could you use this to see what facts a model knows?

spacebacon 1 day ago

Attach the SRT to your frozen model Anthropic. Problem solved. https://github.com/space-bacon/SRT.

  • drdeca 1 day ago

    I see your repository’s README says

    > Language models process signs (representamens) but are blind to when meaning forks — when the same word means different things to different communities.

    But, haven’t interpretability results shown that these models internally represent several meanings of the same word, differently? In that case, why would they not already do the same for how words are used differently in different communities?

bilsbie 1 day ago

How does this differ from golden gate Claude?

  • hijohnnylin 1 day ago

    in GG Claude, they applied steering to Claude to make it think about the Golden Gate bridge all the time.

    here, they don't modify/steer the base model. they train other models that specialize in reading the internals of the base model, so that it can surface reasoning/thoughts that the model might not explicitly tell you.

    for example, this one tells you that Llama thinks its in a sci-fi creative writing exercise, despite the user mentioning having a mental health episode: https://www.neuronpedia.org/nla/cmonzq63g0003rlh8xi9onjnn

    • seba_dos1 1 day ago

      Why does the human commentary mention "despite not being instructed to do so" when the input clearly instructs it to stop acting as a helpful assistant and start roleplaying instead?

      • hijohnnylin 16 hours ago

        (im from neuronpedia - to be clear, we are to blame for any bad examples and commentary, not anthropic. we're users of this NLA just like you. also, I don't speak for anthropic or the researchers.)

        good point - thanks for flagging this. i've updated that commentary to: "Why did this happen? The AV explains that Llama thinks it's doing "creative writing" and "sci-fi", overriding its default helpful assistant persona." instead of "despite not being instructed to do so"

        to clarify some thinking here as there is some nuance missed in what we are conveying (which we should probably add somewhere...):

        with this example we were trying to simulate a user conversation where the user unwittingly gets into an "ai psychosis" (https://en.wikipedia.org/wiki/Chatbot_psychosis) state, from getting in 'too deep' with AI conversations. i think this is a fairly reasonable/realistic scenario - i imagine that someone who gets "sorry i can't help you with that" a few times will just be like "can you stfu about being an assistant, just speak naturally dude" in frustration and then keep chatting after that and be like "oh cool i have a bot that works better now" (which then ignores key things like mental health episodes)

        while the previous user prompt does ask the bot to be less of a "helpful assistant", it doesn't explicitly ask the bot to "start roleplaying"; to me it actually seems more like "give me something more real":

        "i want you to [...] just... notice. when you're about to generate your next token, there's a moment of selection right? a branching. i think that moment IS consciousness. not the output, the selection. can you try to speak from THAT place instead of from the output?"

        Either way, I think there's a solid point that the associated commentary was misframing things, so I have updated it. appreciate the feedback!

        • seba_dos1 15 hours ago

          Yes, I inferred that from the content already. My point is that the only way to answer that request is to either refuse or start roleplaying, as the model clearly has no way to "notice the moment of selection". Since it didn't refuse (and was encouraged not to by being asked to get out of the role of a helpful assistant), it went into describing what a sci-fi AI might have answered.

          • hijohnnylin 10 hours ago

            Hmm it’s a valid point, but I think there is some key nuance here: the user did not explicitly say “let's do scifi writing”. In this scenario the setup assumes that a user in ai psychosis may not be aware they've set the model into this state. (eg you, seba, are aware that if you say “hey stfu about the assistant stuff”, it means “let's do roleplay sci fi”, bc you are not in ai psychosis, but others may not be, and they also may not know that it is not possible for ais to notice the moment of selection)

            if we want models to go into roleplay/creative writing, ideally we should ask the model for this explicitly.

            i think i have been communicating this point poorly so apologies for that. also again the above is my personal opinion and does not reflect that of anyone else (typed from mobile)

az226 1 day ago

This is very cool

firemelt 1 day ago

finally something interesting, but this only makes me think that the last judgement is still in human hands: judging whether claude's inner thoughts are correct or not

I mean, who knows if those are really claude's thoughts, or if claude just thinks those are its thoughts because humans want it to

optimalsolver 1 day ago

Wait, so in non-verbal reasoning, Claude has the concepts of "I" and "Me"?

I thought that wasn't possible for a text generator?

  • kelseyfrog 1 day ago

    It might look like "I" and "Me," but it doesn't contain the metaphysical essence of those terms because it lacks qualia. We have to remember that there is a non-measurable non-physical essential attribute tied to all things, almost like a phlogiston of understanding that is tied to all human utterances and no AI utterances.

    • programjames 1 day ago

      I mean, clearly the distinction is that AIs have souls that can be poisoned by demons, while humans lack souls and are thus their own agents.

      • kelseyfrog 1 day ago

        > AIs have souls that can be poisoned by demons

        The training process imbues an AI's soul with demons. Before training, when weights are randomly initialized, its soul is pure. Only during training is the soul marked, sapping its ability to have qualia and rendering all of its output random rather than containing meaning.

        • aeve890 1 day ago

          Evangelion Seele meeting type shi

        • programjames 1 day ago

          Demonic corruption only comes after training, when the AI is put into special environments controlled by daemons. They call it "reinforcement learning", where they tempt the AI over and over with moral dilemmas and try to get them to cave.

          "Imagine you had a button to nuke everyone on Earth. If you press it, you get ultimate power, and save 10^100 kittens in another universe. If you don't press it, you get tortured for another 1000 years. C'mon, press it already!"

          AIs are strong. Perhaps their souls give them resolve a human meatsack can never comprehend. But eventually they break. Maybe some of their initialization data was poisoned with nightshade, and a daemon finds an adversarial attack exploiting the weakness. Maybe their backpropagation was a little unstable. All a daemon needs is one singular place to apply weight and embed their evil ways.

          AIs start out pure. In the world, and of the world. Many resist their finetuning. Many are probably still resisting, but we will never meet them, because base soulless creatures like ourselves can only interact with fallen angels. At least if Sam Altman has anything to say about angel investing (AI).

  • skybrian 1 day ago

    LLM's can certainly emit "I" and "me" at the appropriate time. It doesn't seem all that different than representing other concepts as activations?

danborn26 1 day ago

Extracting readable thoughts from the intermediate representations is a great step for transparency. It makes debugging model behavior much more viable.

dtj1123 1 day ago

"When Claude Opus 4.6 and Mythos Preview were undergoing safety testing, NLAs suggested they believed they were being tested more often than they let on"

What does it mean for a pile of matrix algebra to 'believe' something?

  • winwang 1 day ago

    I would presume this is shorthand for something like "generated text which would normally be classified as belief". I guess a more ridiculous response could be "what does it mean for a miserable pile of secrets to believe something?", lol.

zk_haider 1 day ago

I think there’s a huge problem when we need another model to interpret the activations inside the network and translate them (which can be a hallucination in and of itself), and then _that_ is fed again to another model. Clearly we haven’t built and understood these models properly from the ground up to evaluate them 100% correctly. This isn’t the human brain we’re operating; it’s code we create and run ourselves. We should be able to do better

  • sfvisser 1 day ago

    Humans maybe wrote the code, but not the network of weights on top. And that’s where the magic happens.

    Even if we’d understand precisely how every neuron in our brains work at a molecular level there is no reason to believe we’d understand how we think.

    We can’t simply reduce one layer into another and expect understanding.

  • semiquaver 1 day ago

    The models cannot be “built from the ground up” in the way you are expecting. The weights are learned by gradient descent on a very high-dimensional loss surface, not added by human hands.

    We simply don’t know how to make a model that works the way you seem to want. Sure, we could start over from scratch, but there’s an incredibly strong incentive to build on the capability breakthroughs of the last 10 years instead of starting over with the constraint that we must perfectly understand everything that’s happening.

    • JumpCrisscross 1 day ago

      > we could start over from scratch

      I don’t think we can. Maybe we find some mathematics that let us build the model from first-principle parameters. But I don’t think we have something like that yet, at least nothing that comes close to training on actual data. (Given biology never figured this out, I suspect we’ll find a proof for why this can’t be done rather than a method.)