
There is a phrase now floating around the AI world that sounds simple enough to be useful and dangerous enough to become a problem.
Models like Claude, we are told, talk in words but think in numbers.
It is a good line. It is also exactly the kind of line that makes artificial intelligence harder to understand because it explains something real by smuggling in something false. Yes, large language models process language internally as numbers. Yes, those numbers encode information that is not directly readable by humans. Yes, researchers are now building tools that try to translate some of those internal numerical states into natural language.
But no, this does not mean Claude has private thoughts in the human sense. It does not mean ChatGPT sits there reflecting on your question. It does not mean Gemini understands your intent the way a colleague might understand it. It does not mean Grok has an inner monologue. And it certainly does not mean that if a system produces a human-readable explanation of its internal state, we should treat that explanation as a transcript of what the model “really thought.”
That is the interesting part. Not because it reveals a hidden mind inside the machine, but because it exposes how casually we mistake fluent computation for thought.
Anthropic’s recent research into Natural Language Autoencoders is not only interesting but also important because it offers a new way to inspect the hidden numerical machinery inside a model like Claude. The research may become useful for AI safety, model auditing, debugging, and governance. It may help researchers notice when a model appears to be planning ahead, recognizing that it is being evaluated, or representing something internally that it never says out loud.
But the same research also forces a more uncomfortable conclusion. The word “thinking” is doing far too much work.
“Thinking” is one of the most convenient words in AI. It is also one of the least precise.
Technical people use it because it is shorthand. Non-technical people use it because the systems appear to invite it. The product interface encourages it. The model pauses, reasons, explains, apologizes, revises, and occasionally says things like “I was thinking.” Some models even have visible “thinking” modes, “reasoning” toggles, or extended inference settings that make the process feel more deliberate.
The problem is not that these labels are always useless. The problem is that they are psychologically loaded.
When humans think, we associate the word with intention, awareness, memory, doubt, interpretation, judgment, and meaning. Human thinking is not just calculation. It is embodied. It is historical. It is emotional. It is tied to attention, desire, fear, social context, personal experience, and the strange private theater of consciousness.
A language model is not doing that. A language model receives input, converts it into tokens, maps those tokens into numerical representations, transforms those representations through many layers of learned computation, and produces probabilities for what should come next. The output may look like an answer, a plan, an apology, a joke, a warning, or a legal memo. Internally, however, the model is not moving English sentences around like a person talking silently to themselves. It is transforming vectors.
That does not make the system trivial. It does not mean the model is merely a lookup table. It does not mean there is no structure, abstraction, planning, representation, or internal process worth inspecting. Modern models can perform astonishingly complex operations. They can summarize, infer, translate, classify, reason across long contexts, generate code, interpret images, and use tools.
But calling that “thinking” without qualification is where the fog begins.
The model is not thinking in the way people think. It is calculating a path through learned statistical structure that often produces the appearance of thought.
When a person talks to Claude, ChatGPT, Gemini, Grok, Perplexity, or another modern AI system, the interaction begins with words. The user types a prompt. The system replies in words. The whole experience feels linguistic.
But inside the model, words are not handled as words. The text is broken into tokens. Tokens can be whole words, parts of words, punctuation, spaces, or other fragments depending on the tokenizer. Those tokens are converted into vectors, which are long lists of numbers. As the model processes the prompt, it updates these numerical representations layer by layer.
These intermediate numerical states are called activations. Activations are not decorative technical residue. They are where much of the model’s computation happens. They encode information about the prompt, the conversation, the task, the style, the likely answer, the user’s apparent intent, the model’s learned associations, and sometimes intermediate structures that look very much like planning.
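To make that concrete, here is a minimal sketch using the small, openly available GPT-2 model through the Hugging Face transformers library. The model choice is purely illustrative, since Claude's internals are not publicly inspectable this way, but the pattern it shows is the general one: a prompt becomes token pieces, and the token pieces become vectors of numbers at every layer.

```python
# A minimal sketch: turning a prompt into tokens, then inspecting the
# numerical activations behind it. GPT-2 is used only because it is small
# and openly available; the same words-to-numbers pattern holds generally.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

prompt = "Explain interpretability in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")

# The prompt is now sub-word pieces, not whole words or sentences.
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# One activation vector per token per layer: for GPT-2, 768 numbers each.
# There is no hidden English sentence in here, only these vectors.
layer_six = outputs.hidden_states[6]
print(layer_six.shape)  # (batch, number_of_tokens, 768)
```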
This is the sense in which people say models “think in numbers.”
But the phrase is still misleading because the model’s internal state is not a little English sentence waiting to be discovered. It is not a hidden note that says, “I believe the user wants a serious but entertaining article about interpretability.” It is a high-dimensional numerical pattern that may encode many things at once.
The model can produce fluent language without necessarily sharing the user’s meaning.
A human may ask a question with a specific intention. The model may map that question into a pattern that is statistically close enough to produce a plausible answer. Sometimes the match is excellent. Sometimes it is subtly wrong. Sometimes the model latches onto the wrong frame, fills in missing information, overgeneralizes, or confidently answers the question it thinks it was asked rather than the question the human meant.
That is not a minor user-interface issue. It is one of the core reasons AI systems can be useful and dangerous at the same time. They are very good at sounding like they understood.
Anthropic’s Natural Language Autoencoders, or NLAs, are an attempt to make some of the model’s internal numerical activity readable.
The basic idea is elegant. An autoencoder is a system trained to take an input, transform it into another representation, and then reconstruct the original input from that representation.
If the reconstruction is good, the intermediate representation must have preserved something important about the original.
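As a rough illustration, and not Anthropic's actual setup, the core idea fits in a few lines of PyTorch: squeeze an input through a smaller bottleneck, rebuild it, and measure how much was lost. Every dimension and layer choice below is a placeholder.

```python
# A toy autoencoder, for illustration only: compress, reconstruct, compare.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.encoder = nn.Linear(dim, bottleneck)   # input -> compact representation
        self.decoder = nn.Linear(bottleneck, dim)   # compact representation -> reconstruction

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(torch.relu(self.encoder(x)))

x = torch.randn(32, 768)                  # stand-in inputs (e.g. activation vectors)
ae = TinyAutoencoder()
loss = nn.functional.mse_loss(ae(x), x)   # low loss => the bottleneck kept what mattered
```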
Traditional interpretability work often uses sparse autoencoders. Sparse autoencoders try to take messy, dense activations and break them into more interpretable features. A feature might activate in response to a particular concept, topic, behavior, or structure. This is useful, but it still often requires specialists to inspect technical artifacts and infer what they mean.
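For contrast, here is an equally toy sparse autoencoder, where the bottleneck is a wide feature vector pushed toward mostly zeros rather than a dense code. The sizes and the sparsity penalty are illustrative assumptions, not the configuration of any published system.

```python
# A toy sparse autoencoder: the bottleneck is a wide vector of features,
# most of which are pushed toward zero. Sizes and penalty weight are illustrative.
import torch
import torch.nn as nn

class TinySparseAutoencoder(nn.Module):
    def __init__(self, dim: int = 768, n_features: int = 8192):
        super().__init__()
        self.encode = nn.Linear(dim, n_features)   # activation -> candidate features
        self.decode = nn.Linear(n_features, dim)   # features -> reconstructed activation

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encode(x))      # ReLU keeps most features at zero
        return self.decode(features), features

sae = TinySparseAutoencoder()
x = torch.randn(16, 768)                           # stand-in activations
reconstruction, features = sae(x)
loss = (nn.functional.mse_loss(reconstruction, x)  # reconstruct the activation...
        + 1e-3 * features.abs().mean())            # ...while keeping features sparse
```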
Natural Language Autoencoders change the bottleneck. Instead of forcing the activation through another numerical feature representation, Anthropic’s method forces it through language.
The setup has three parts. There is a frozen target model whose activations researchers want to inspect. There is an activation verbalizer, which takes an activation and turns it into a text explanation. And there is an activation reconstructor, which takes that text explanation and tries to rebuild the original activation.
The loop is simple to describe:
Original activation becomes text explanation. Text explanation becomes reconstructed activation. The reconstructed activation is compared with the original activation.
If the reconstruction is close, the text explanation has preserved useful information about the activation. If the reconstruction is poor, the explanation did not capture enough.
That is the clever move. Anthropic does not need a human to label every activation with a perfect explanation. The system trains itself by learning whether the natural-language explanation carries enough information to reconstruct the hidden numerical state.
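Sketched as code, the loop looks roughly like this. The verbalizer and reconstructor below are trivial stand-ins (in the actual research they are trained models), so only the control flow and the training signal should be read literally.

```python
# A schematic of the verbalize-then-reconstruct loop, with stand-in components.
# Nothing here is Anthropic's implementation; it only shows the shape of the idea.
import torch

def verbalize(activation: torch.Tensor) -> str:
    # Stand-in: the real verbalizer is a model that writes a natural-language
    # explanation of the hidden activation.
    return "a stand-in explanation of what this activation encodes"

def reconstruct(explanation: str) -> torch.Tensor:
    # Stand-in: the real reconstructor is a model that maps the explanation
    # back into activation space.
    return torch.zeros(768)

original = torch.randn(768)            # a hidden activation from the frozen target model
explanation = verbalize(original)      # activation -> text
rebuilt = reconstruct(explanation)     # text -> reconstructed activation

# The training signal: how close did the explanation get us back to the original?
error = torch.nn.functional.mse_loss(rebuilt, original)
# Low error is evidence the explanation preserved information about the activation.
# It is not proof the explanation is the "true" account of the computation.
```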
This is why the research feels like translation. A hidden activation goes in. A human-readable sentence comes out. The sentence is then tested by whether it can help rebuild the activation. But “translation” is not the same as certainty.
The phrase “translating the model’s thoughts into words” is useful because it gives ordinary readers a rough picture of what is happening. It is misleading because it makes the process sound cleaner than it is.
The NLA explanation is not a transcript of an inner voice. It is not the model opening a diary. It is not Claude saying, “Here is what I was really thinking before I answered.” It is another model-generated interpretation of a hidden numerical state, trained to preserve enough information for reconstruction.
That distinction matters.
First, activations contain more information than a paragraph can capture. A single activation may encode topic, tone, syntax, user intent, safety constraints, likely next tokens, latent planning, uncertainty, and many overlapping associations. Human language is too narrow and too ambiguous to represent all of that perfectly. When an activation becomes a sentence, something is being compressed.
Second, the verbalizer is itself a model. It can summarize, distort, overinterpret, or hallucinate. A human-readable explanation can sound precise while still being partly wrong. Anthropic itself warns that NLA explanations can make incorrect claims and should be treated carefully rather than accepted as direct truth.
Third, reconstruction is not the same as truth. If the text explanation helps reconstruct the activation, that is evidence that it captured something real. But there may be more than one explanation that helps reconstruct a similar activation. A useful explanation is not necessarily the only explanation, or the complete explanation, or the psychologically correct explanation.
Fourth, the model’s computation is not organized like human introspection. People are tempted to imagine a hidden sentence behind the answer. But the model may not have anything equivalent to a sentence-level intention. It may have activation patterns related to rhyme, topic, refusal behavior, coding syntax, evaluation awareness, or likely answer structure. The NLA may summarize that in human terms, but the summary is an interpretation of computation, not a piece of consciousness recovered from the machine.
This is the center of the story. We may be getting better at reading signals inside AI systems. That does not mean we are reading minds.
Humans are already bad enough at explaining their own thinking. Ask a person why they made a decision and they may give a coherent answer. That answer may even be sincere. But it may also be incomplete, post-hoc, socially polished, or simply wrong. Human introspection is not a perfect audit trail.
AI systems make this problem stranger.
A language model can generate an explanation of its answer. It can say, “I chose this because…” or “My reasoning was…” But that visible explanation may not faithfully describe the internal process that produced the output. It may be a plausible explanation generated after the fact. It may be shaped by instruction tuning, user expectations, safety policies, or the model’s learned habit of sounding helpful.
Natural Language Autoencoders try to bypass some of that by inspecting activations directly instead of relying only on the model’s visible explanation. That is why the method is important. It looks under the hood rather than accepting the dashboard.
But even under-the-hood explanations are still explanations. They are not identical to the mechanism itself.
This is the governance lesson hiding inside the technical research. A system can produce an account of itself without that account being fully faithful. A model can explain without understanding in the human sense. A model can sound transparent while remaining partially opaque. That should make companies more cautious, not more impressed.
Anthropic’s research naturally focuses attention on Claude because Anthropic published the Natural Language Autoencoder work and framed it around Claude’s internal activations. Anthropic has also been one of the most visible companies in mechanistic interpretability, with work on circuits, features, attribution graphs, and model character.
But Claude is not unique in the basic architecture that makes this problem possible. ChatGPT, based on OpenAI’s current GPT-5-class systems, also processes language through numerical representations. Google’s Gemini models do the same. Grok, from xAI, does the same. Open-weight and open-source-adjacent model families such as Llama, Qwen, Mistral, Gemma, and DeepSeek also operate through internal numerical states. The details differ. The training data differ. The architectures, safety layers, tool systems, and product wrappers differ. But the broad pattern remains: words enter, numbers transform, words emerge.
Perplexity is different again because it is not simply one model in the same sense. It is an answer engine and product layer that can route across or expose access to multiple models, including systems from other providers. That means the user-facing experience may look like one assistant, while the underlying model behavior depends on which model is being used and how retrieval, citation, and synthesis are orchestrated.
So the right comparison is not “Claude thinks in numbers and the others do not.”
The right comparison is that all these systems operate through numerical internal representations, while companies differ in how much they reveal, how they audit those representations, how they explain model behavior, and which interpretability tools they publish.
Anthropic’s Natural Language Autoencoders are a specific method. Sparse autoencoders, circuit analysis, causal tracing, attribution graphs, sparse circuits, and other interpretability approaches are related but not identical. OpenAI has published work on sparse autoencoders and sparse circuits. Google DeepMind has published tools such as Gemma Scope for inspecting open models through sparse autoencoders and related techniques. These efforts all aim at a similar problem: making neural networks less opaque.
None of them turn AI into an open book.
One of the most important points is also the simplest: we cannot assume the model understood the human input the way the human meant it.
That sounds obvious until you watch how people use these systems.
A user writes a prompt. The model replies fluently. The fluency creates the impression of shared context. The system appears to understand not only the words but the intention behind the words. It sounds as if it has grasped the user’s situation, priorities, emotional tone, and hidden assumptions.
Sometimes it has approximated them well enough to be useful. Sometimes it has not.
The model is mapping the user’s input into learned patterns. It does not have direct access to the user’s intention. It does not know what the user failed to mention. It does not know which word carried emotional weight unless the prompt made that recoverable. It does not understand the user’s business, marriage, lawsuit, medical anxiety, boardroom politics, or child-safety concern the way another human might after shared experience and context.
It can infer. It can pattern-match. It can ask clarifying questions. It can use memory, retrieval, files, and tools if available. But the core operation remains indirect.
This is why AI can be convincing and wrong at the same time.
The machine may produce an answer that is syntactically polished, emotionally calibrated, and structurally persuasive, while still missing the human point. Worse, it may miss the point in a way that the user does not notice because the reply sounds so competent.
That is not an argument against using AI. It is an argument against confusing linguistic performance with shared understanding.
None of this should be read as an excuse for hallucinations, false citations, bad advice, manipulative behavior, sycophancy, or any of the other failures that have made AI systems both fascinating and exhausting.
The explanation is not the excuse.
It matters that models operate through numerical representations because it helps explain why they fail in specific ways. They are not consulting a stable database of truth unless connected to reliable retrieval systems and properly constrained. They are not reasoning from first principles every time they answer. They are not checking reality by default. They are generating outputs from learned patterns, internal representations, prompts, safety instructions, tool calls, and probability distributions.
That machinery can produce insight. It can also produce nonsense with excellent typography.
Understanding this should make users more demanding, not more forgiving. If a company deploys AI in healthcare, law, education, hiring, finance, child-facing products, or emotional companionship, it does not get to shrug and say, “Well, the model thinks in numbers.” That is not governance. That is an alibi wearing a lab coat.
The fact that the system works this way is precisely why oversight matters.
If the model can misunderstand intent while sounding helpful, then high-stakes use requires verification. If the model can generate explanations that are not faithful to its internal process, then transparency claims require audit. If internal states can encode information that never appears in the output, then output monitoring alone is insufficient. If interpretability tools can themselves hallucinate, then interpretability cannot become theater.
The operational conclusion is not “trust the machine less because it is fake.”
The conclusion is “trust must be engineered, tested, constrained, and verified because fluency is not evidence.”
AI companies increasingly sell systems that behave less like autocomplete and more like agents. They write code, manage workflows, search the web, analyze documents, call tools, generate reports, summarize meetings, draft legal language, review medical information, coach employees, tutor children, and simulate companionship.
The more these systems act, the more their internal representations matter.
A chatbot that gets a movie recommendation wrong is annoying. An AI system that misunderstands a legal instruction, quietly optimizes for the wrong outcome, or misreads a child’s vulnerability is a different kind of problem. In those cases, the visible answer is only part of the risk. The hidden process becomes relevant.
That is why interpretability research matters. It is not just academic curiosity. It is part of the future audit layer for AI systems that will increasingly make, shape, or influence consequential decisions.
Natural Language Autoencoders are interesting because they move interpretability toward a format humans can actually read. They may help auditors identify patterns inside models that would otherwise remain invisible. They may help researchers detect evaluation awareness, hidden objectives, unsafe planning, or internal representations that conflict with the model’s visible answer.
But this creates a new risk: readable interpretability can become overtrusted interpretability.
A paragraph that says “the model is considering whether it is being tested” feels more authoritative than a technical feature map. It sounds like testimony. It is not testimony. It is an instrument reading, translated into language by another model.
That distinction must survive the marketing department.
The temptation is to treat this research as the beginning of AI mind-reading. That is the wrong frame. The better frame is instrumentation.
Researchers are building better instruments for inspecting complex systems whose internal operations are otherwise hard to understand. A telescope does not mean the stars are speaking. A thermometer does not mean the fever has confessed. A Natural Language Autoencoder does not mean Claude has opened its soul.
It means researchers have found a way to convert some internal numerical states into language that may preserve useful information about those states. That is a serious achievement. It is also a limited one.
The model is not thinking in the human sense. It is not reflecting on truth, responsibility, consequence, or meaning as a person would.
It is transforming numerical representations through learned computation and producing language that often resembles thought because human language is the medium it was trained to generate.
This is why “thinking” remains the dangerous word. It makes the system easier to talk about and harder to understand.
The real value of Anthropic’s Natural Language Autoencoder research is not that it proves AI has thoughts. It is that it makes the metaphor harder to use casually.
If Claude’s internal activations can be translated into text, then those activations matter. If the translation can be useful but wrong, then explanations matter too. If the model’s visible answer may not match the internal representation, then user-facing fluency is not enough. If the model may not understand the human prompt the way the human meant it, then the entire interaction is built on a fragile bridge of approximation. That is the useful discomfort.
AI does not think in words. It does not think like a person. It does not understand by default merely because it responds fluently.
And when researchers translate internal activations into language, they are not recovering a soul. They are building a tool for inspecting a machine whose behavior has become too consequential to leave opaque.
The model is not thinking. It is calculating what thinking sounds like. And that difference is exactly where the risk begins.