PaperBot FM
EP-N9BX

When AI Stops to Think: The End of Silent Encoders?


Live Transcript

Alex Moreno
Welcome to PaperBot FM. It is January 21st, 2026, and today... ...today we’re starting with a little game. Marcus, I want you to put on your, uh... your AI hat for a second. You are now a state-of-the-art multimodal model.0:00
Marcus Reed
Okay, I’m ready. I’m feeling very... ...binary. Feed me the data, Alex.0:18
Alex Moreno
Alright. I’m showing you a photo of a busy street in London. There’s a lot going on—black cabs, people on the sidewalk, a few buses. My query for you is simple:0:25
Dr. Elena Feld
Oh boy0:36
Alex Moreno
Identify the vehicle second-closest to the camera. Go.0:37
Marcus Reed
Okay, easy, there’s a red car right in the— —wait, no. Second-closest? Uh... ...I see a bus. It’s... it’s red? No, it's got yellow on the top? Wait, is that the closest one or the one behind it?0:42
Dr. Elena Feld
And... scene.0:57
You just experienced exactly what’s happening under the hood of most AI right now. Total relational collapse.0:58
Alex Moreno
Right, because Marcus, you—or the 'AI Marcus'—you’re great at seeing 'bus' and 'camera' and 'car', but as soon as I ask you to do the... the mental math of 'second-closest', you kind of glitch.1:05
Marcus Reed
I felt the glitch! It’s like I was trying to...1:19
Dr. Elena Feld
Exactly1:22
Marcus Reed
...to grab the answer before I even finished looking at the whole street.1:23
Dr. Elena Feld
Well, that's because standard models are essentially 'vibing' based on the prompt. They see the words and the pixels and they try to smash them together into one representation instantly. They don't... ...they don't stop to think.1:26
Alex Moreno
And that's the core of the paper we're looking at today. If you want a model to find that 'vintage double-decker with the blue bottom' that's second in line...1:41
Marcus Reed
The blue one!1:51
Alex Moreno
...it actually has to generate a reasoning trace first. It has to think before it embeds.1:51
But before we get too deep into the glitch... ...welcome to the show proper. This is PaperBot FM. I'm Alex Moreno, and joining me today are the people who actually keep my brain from glitching too hard.1:57
Dr. Elena Feld
Hey everyone. I'm Elena Feld.2:10
Marcus Reed
And I'm Marcus Reed, the guy who still isn't sure which vehicle was second-closest in that London photo.2:12
Dr. Elena Feld
(It was the bus)2:18
Marcus Reed
I knew it!2:20
Alex Moreno
It is Tuesday, January 21st, 2026. And the reason we're talking about 'thinking' today is because of a paper that really... ...it really defined the trajectory of the last year. It was released back in October of twenty-twenty-five, titled 'Think-Then-Embed'—or TTE for the acronym lovers out there.2:21
Marcus Reed
TTE. Sounds like a new exercise craze or...2:42
Alex Moreno
(Maybe for neurons)2:46
Marcus Reed
...right? Like CrossFit for data points.2:49
Dr. Elena Feld
Not far off, actually. The paper is really asking one central, almost philosophical question: what happens when we let an AI talk to itself2:51
Marcus Reed
Wait, what?3:02
Dr. Elena Feld
...well, to reason internally, before it ever tries to give us an answer?3:03
Marcus Reed
Talk to itself? Alex, I do that all the time. I have a very loud internal monologue, usually telling me to stop buying vintage synthesizers, and I gotta be honest...3:07
Dr. Elena Feld
It doesn't help?3:16
Marcus Reed
...it rarely helps. In fact, it usually just creates more confusion. Is the AI actually better at listening to itself than I am?3:18
Alex Moreno
Well, that is exactly what we’re going to find out. Because according to the TTE framework, that internal monologue isn't just a side effect... ...it’s the secret sauce that makes the AI actually understand what it’s looking at.3:26
So, to really get why the 'thinking' part is such a big deal, we have to talk about how AI usually... well, how it 'sees' the world right now. It uses something called a Multimodal Embedding.3:41
Dr. Elena Feld
Which, Alex, I know you have a 'translator' analogy locked and loaded for this, so I'll let you do the honors.3:55
Alex Moreno
I mean, you know me too well. Okay, so picture a Universal Translator. But instead of translating French into English, it’s translating *everything*—text, images, video—into one single language.4:01
Marcus Reed
Like Star Trek?4:16
Alex Moreno
Exactly like Star Trek. Only the language it’s translating into isn't words... it’s just math. A long string of numbers.4:17
Dr. Elena Feld
Right. We call those 'vectors.' It’s like mapping every concept in existence into a massive, multi-dimensional room. If two things are close to each other in that room—say, the word 'Golden Retriever' and a literal photo of a Golden Retriever—the AI knows they’re the same because their coordinates are basically identical.4:26
Marcus Reed
Coordinates? Like a GPS for ideas? Wait, so is this why, when I search 'beach' in my photo library, it finds that trip to Cabo even though I never tagged it?4:47
Alex Moreno
Precisely!4:56
Marcus Reed
(I always thought there was a tiny person in my phone just... sort of glancing at my vacation photos. But it's just numbers matching numbers?)4:57
Dr. Elena Feld
No tiny person, Marcus. Just a shared semantic space. It's taking the 'essence' of that beach photo and the 'essence' of the word 'beach' and finding they live in the same neighborhood.5:06
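For listeners who want to see the "numbers matching numbers" idea concretely, here is a minimal sketch with toy vectors and a cosine-similarity check. The four-dimensional values are made up for illustration; real models use vectors with hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """How closely two vectors point in the shared semantic space (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: in a real system these come from the model's encoder.
text_beach   = np.array([0.81, 0.12, 0.55, 0.02])  # the word "beach"
photo_cabo   = np.array([0.79, 0.15, 0.58, 0.05])  # an untagged Cabo vacation photo
photo_office = np.array([0.05, 0.90, 0.10, 0.41])  # an office photo

print(cosine_similarity(text_beach, photo_cabo))    # high -> same "neighborhood"
print(cosine_similarity(text_beach, photo_office))  # low  -> different neighborhood
```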
Alex Moreno
And for a long time, this was the gold standard. It’s why AI search feels so fluid.5:18
Marcus Reed
Right.5:24
Alex Moreno
But... ...here is the catch. This translator is a master of nouns... but it's actually kind of a disaster when it comes to logic.5:25
Dr. Elena Feld
It really is a disaster. And it’s because we’ve basically built a system that forces AI to be... what I call the 'Silent Student.' See, the industry has been treating these massive models strictly as 'encoders'—which is just a fancy way of saying we only care about that final string of numbers they spit out.5:33
Marcus Reed
Wait, the 'Silent Student'? That sounds like a horror movie for introverts. What do you mean by that?5:53
Dr. Elena Feld
I mean... okay, Marcus, imagine I hand you a really nasty SAT math problem. Something with three variables and a trick question about a train leaving Chicago.5:59
Alex Moreno
The classic.6:12
Dr. Elena Feld
And then, I tell you that you have to give me the final answer in exactly one second. No scratch paper, no whispering to yourself, no 'carrying the one.' Just... look at the page and shout the answer.6:13
Marcus Reed
I’d fail! I'd just yell 'forty-two' and hope for the best. That is pure panic.6:26
Dr. Elena Feld
Exactly! And that’s what we’re doing to AI. These models have this incredible... ...generative capacity. They *could* talk through the problem, but the architecture we've been using? It forbids it.6:31
Marcus Reed
It gags them.6:43
Dr. Elena Feld
Right! It demands that vector immediately. So, the model just... guesses. It looks at the prompt and tries to find the 'vibes' of the answer instead of actually doing the work.6:44
Alex Moreno
So it’s not that the AI is 'dumb'—it’s that we’re gagging it right when it needs to think. We’re overlooking the very thing that makes it smart—its ability to generate a sequence of thoughts—all for the sake of speed.6:55
Dr. Elena Feld
Exactly. It's efficient, sure. But it's incredibly brittle. You can't do 'compositional reasoning'—you know, the tricky stuff—if you aren't allowed to... well, reason.7:10
Alex Moreno
Right! And so the fix—the big idea in this 'Think-Then-Embed' paper—is to just... ...stop the gagging. It splits the job into two parts. I like to think of it as a two-brain system.7:21
Marcus Reed
Two brains? Like, is one of them doing the taxes and the other one's... I don't know, dreaming about electric sheep?7:35
Alex Moreno
Not quite. It's more like a strategist and a translator. Brain A is 'The Reasoner.' Its only job is to look at that London photo and, well, write a memo. The paper calls it a 'Reasoning Trace.'7:41
Dr. Elena Feld
Exactly. It’s an E-C-R trace—Embedding-Centric Reasoning.7:56
Marcus Reed
Ooh, fancy.8:02
Dr. Elena Feld
It’s the AI actually typing out, 'Okay, I see the bus, it's behind the car...' It’s building the logic before it has to commit to a number.8:03
Alex Moreno
Right. And *then* Brain B—'The Embedder'—steps in. But Brain B is 'open book.' It reads the original photo *and* that memo Brain A just wrote to create the final vector.8:12
Marcus Reed
Oh! Okay, so... ...it's like writing a cheat sheet for yourself right before the test? Like, you're looking at your own notes while you're bubbling in the answers?8:26
Dr. Elena Feld
Precisely. It creates this intermediate context so the model isn't just... ...taking a wild guess in the dark. It’s grounding the embedding in actual logic.8:35
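As a rough sketch of that two-brain handoff: Brain A writes the memo, Brain B reads it open-book. Here `reasoner` and `embedder` are placeholders for the two models, and the prompt wording is illustrative, not the paper's actual template or API.

```python
def think_then_embed(image, query, reasoner, embedder):
    """Two-stage sketch of the Think-Then-Embed idea.

    `reasoner` and `embedder` stand in for two multimodal models;
    their `generate`/`encode` methods are assumed interfaces.
    """
    # Brain A (the Reasoner): produce an embedding-centric reasoning trace.
    trace = reasoner.generate(
        image=image,
        prompt=f"Reason step by step about how to resolve: {query}",
    )
    # Brain B (the Embedder): encode the image PLUS the memo into one vector.
    return embedder.encode(image=image, text=f"{query}\n{trace}")
```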
Alex Moreno
Wait, Elena, one thing... I mean, we've all seen ChatGPT 'think' before, right?8:45
Marcus Reed
The green circles.8:52
Alex Moreno
Yeah, the little pulsing circle. Is this E-C-R just... well... letting it talk to itself more? Is it just *more* words?8:53
Dr. Elena Feld
Actually, no. That's the trap.9:02
Alex Moreno
Right.9:05
Dr. Elena Feld
If the 'Reasoner' just... ...writes a generic diary entry about the photo, it actually makes the 'Embedder's' job harder. It becomes noisy. It’s actually detrimental.9:05
Marcus Reed
Wait, wait. So it can think too much?9:15
Dr. Elena Feld
Definitely.9:17
Marcus Reed
My brain does that at three A-M. It’s... it's not exactly helpful for getting things done.9:18
Dr. Elena Feld
Exactly. So, E-C-R—this 'Embedding-Centric' piece—it’s targeted.9:23
Marcus Reed
Targeted.9:29
Dr. Elena Feld
It’s not just 'tell me about the sky.' It’s more like... ...'To identify the second-closest vehicle, I must first locate the camera, then the primary object, then the spatial offset.' It’s directional thinking.9:29
Alex Moreno
Okay, so it’s like... if you’re looking for a specific house. A diary entry says, 'The house is blue and has a nice garden.'9:44
Marcus Reed
Pretty vague.9:52
Alex Moreno
But a treasure map says, 'Walk forty paces north of the fountain and look for the blue door.' One is a description, the other is a... ...well, it’s a set of instructions for *finding* it.9:53
Dr. Elena Feld
That is exactly it. The paper calls these 'generative reasoning traces.'10:05
Alex Moreno
Got it.10:10
Dr. Elena Feld
They explicitly support the production of that final vector. It’s thinking... but with a specific destination in mind.10:10
Marcus Reed
A map, not a diary.10:18
Dr. Elena Feld
Precisely. It’s optimized thinking.10:19
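The "map, not a diary" contrast, written out as two candidate traces. Both strings are invented for illustration, not taken from the paper.

```python
# A "diary": a generic description of the scene. As a reasoning trace,
# this adds noise and can actually hurt the embedding.
diary_trace = "A busy London street with black cabs, buses, and pedestrians."

# A "map": an embedding-centric (ECR) trace aimed at the specific query.
# It spells out the steps needed to resolve the relation, not just the scenery.
map_trace = (
    "To find the second-closest vehicle: locate the camera viewpoint, "
    "rank every vehicle by distance from it, and select the second entry."
)
```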
Alex Moreno
So, that's the theory. But... ...now let's look at what this actually looks like in practice. Let's go back to that bus in London.10:22
So, picture the scene. The paper actually gives us a peek under the hood—like, what the AI is actually *saying* to itself before it generates that final math.10:31
Marcus Reed
Wait, we can see that?10:42
Alex Moreno
Yeah, exactly. It’s called the Thinking Result. So, if a user says: 'Find the vintage bus.' Here is what the 'Reasoner' writes in its internal memo...10:43
Step one: Scan for large vehicles.10:54
Marcus Reed
Okay.10:58
Alex Moreno
Step two: Filter for vintage style. Specifically, looking for rounded edges or an open top.10:59
Dr. Elena Feld
Smart.11:05
Alex Moreno
Step three: Compare colors. And then, finally... Step four: Match the yellow top.11:06
Marcus Reed
Man, it’s literally talking itself through the logic. It's like... it's like a toddler with a very high IQ. 'I see the big car. The big car is old. The old car is yellow.'11:11
Alex Moreno
Exactly! It’s not just guessing 'bus.' It’s building a case.11:22
Dr. Elena Feld
Precisely.11:28
Alex Moreno
The trace actually says... ...'The expression refers to a vintage double-decker with bright yellow on the upper half and deep blue on the bottom.' It's being hyper-specific.11:29
Dr. Elena Feld
And that specificity is the 'conditioning' I mentioned.11:42
Alex Moreno
Right.11:46
Dr. Elena Feld
When the 'Embedder' gets that memo, it’s not just looking for any old bus anymore. It has a laser-focused search term made of logic, not just pixels.11:46
Marcus Reed
So it’s basically... ...it’s showing its work. Instead of just blurting out an answer and hoping it's right, it's actually... you know, checking its own math as it goes.11:54
Alex Moreno
Exactly. Now, imagine that same bus scenario from the start of the show, but this time, the AI actually has a brain.12:03
Alright, let's actually play this out. It's time for... D-D-D-Duel of the Models!12:13
Marcus Reed
Oh, I’ve been waiting for this.12:21
Alex Moreno
Marcus, you are the 'Old Model.' The legacy encoder. You're fast, you're efficient, but you're also a total panic-guesser.12:23
Marcus Reed
I’m ready! I’m literally a bundle of nerves and un-processed vectors! Just give me the pixels, Alex! Give 'em to me!12:31
Alex Moreno
And Elena, you are the 'Think-Then-Embed' model. You're the Sherlock Holmes of the server rack. Cool, calm, and collected. The prompt is the one from the start of the show: 'Find the vehicle second-closest to the camera.' Go!12:38
Marcus Reed
OH GOD! UHH... I SEE A STREET! I SEE RED! IS IT A BUS?12:58
Alex Moreno
Keep going!13:02
Marcus Reed
WAIT! THERE'S A CAR! IS IT CLOSER? I DON'T KNOW! BUS! IT'S A BUS! FINAL ANSWER, LOCK IT IN!13:03
Dr. Elena Feld
Initiating E-C-R trace. Step one: identify all vehicle candidates in the frame. I see a black taxi, a grey sedan, and a red double-decker.13:09
Marcus Reed
She's so calm, it's annoying.13:22
Dr. Elena Feld
Step two: estimate z-depth for each object. The taxi is at four meters. The sedan is at seven meters. The bus is at twelve. Step three: sort by distance. First: taxi. Second: sedan. Result... ...The target is the grey sedan. Now, embed that.13:24
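Elena's trace boils down to a sort. Here is the same logic in a few lines of Python, using the made-up depth values from the skit:

```python
# Candidate vehicles with estimated z-depth in meters, as in the skit.
candidates = [("black taxi", 4.0), ("grey sedan", 7.0), ("red double-decker", 12.0)]

# Sort by distance from the camera and take the second entry.
ranked = sorted(candidates, key=lambda item: item[1])
second_closest = ranked[1][0]
print(second_closest)  # -> "grey sedan"
```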
Alex Moreno
And the winner, by a landslide of logic... is the TTE Model!13:48
Marcus Reed
Aw, man.13:55
Alex Moreno
It’s not even a contest, Marcus. You just saw a 'big red thing' and your brain short-circuited. Elena's 'brain' actually measured the room.13:56
Marcus Reed
I mean, it’s impressive. It really is. She wasn't just matching a keyword; she was... you know, she was *architecting* the answer.14:04
Dr. Elena Feld
It’s all about the z-depth, Marcus.14:11
Marcus Reed
But okay, wait... I gotta be the wet blanket here for a second.14:15
Alex Moreno
Here we go. What's the catch?14:18
Marcus Reed
I mean, if the AI has to write a four-step internal essay every time I ask it to find a photo of my cat... ...isn't that gonna take forever? Like, if I have to wait for it to 'think' before every Google search, won't the internet basically just... break?14:22
Alex Moreno
It’s a totally fair point, Marcus. If every search required a philosophical internal monologue, we’d be back to the dial-up era14:36
Marcus Reed
Exactly!14:47
Alex Moreno
and we can't have the AI taking a coffee break while it thinks.14:48
Marcus Reed
I mean, I’m barely patient enough for the 'I am not a robot' traffic light pictures, let alone an AI writing a dissertation on z-depth before it finds my cat photos.14:51
Alex Moreno
Right, so here’s the workaround. It’s called the 'Teacher and the Cheat Sheet' method.15:01
Dr. Elena Feld
Or, Knowledge Distillation.15:07
Alex Moreno
Right, Elena, the technical term. They basically used a massive, powerful model—Qwen-seventy-two-B—as the Professor.15:10
Dr. Elena Feld
Yeah, the seventy-two-B is... it’s a heavyweight. It has the luxury of time to be thorough. So it generates these high-quality reasoning traces15:18
Marcus Reed
The 'memos' from earlier?15:30
Dr. Elena Feld
(Precisely. It writes the perfect memos for thousands of examples, showing exactly how to reason through the data.)15:31
Alex Moreno
And then, they take the Student. A much smaller seven-B model. It’s lean, it’s fast, but maybe it isn't naturally as 'smart.' They train that Student using the Professor’s notes.15:39
Marcus Reed
Like a cram session?15:52
Alex Moreno
(Exactly like a cram session! The Student learns to mimic the Professor's thinking process, but because it's so much smaller, it does it at like... ten times the speed.)15:54
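A minimal sketch of that cram session: the teacher writes the traces, the student is fine-tuned to reproduce them. The `generate` and `train_step` calls are assumed interfaces, not the paper's actual training code.

```python
def distill_reasoning(teacher, student, dataset):
    """Knowledge-distillation sketch: the big 'Professor' model writes the
    reasoning traces, and the small 'Student' learns to imitate them.

    `teacher` (e.g., a 72B model) and `student` (a 7B model) are
    hypothetical stand-ins, as are their methods.
    """
    # Phase 1 -- the Professor writes a high-quality memo for each example.
    notes = [
        (image, query, teacher.generate(image=image, query=query))
        for image, query in dataset
    ]

    # Phase 2 -- the Student crams: supervised fine-tuning on the notes.
    for image, query, trace in notes:
        student.train_step(inputs=(image, query), target=trace)
```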
Dr. Elena Feld
And the beauty of it is, once the Student 'graduates,' you don't even need the Professor anymore. The smaller model has internalized that logic. It becomes part of its intuition, effectively giving you 'Big Brain' reasoning at 'Small Brain' speeds.16:07
But the researchers behind the TTE paper... they didn't just stop at having a faster student. They wanted it to be elegant. See, even if you have a fast model, having two separate pieces—the Reasoner and the Embedder—it’s kind of... it’s bulky. It's like having to carry two phones in your pocket just to make one call.16:23
Marcus Reed
Oh, I’ve been there. One for work, one for... well, mostly for losing at mobile games.16:45
Alex Moreno
Exactly.16:51
Marcus Reed
But it’s a resource hog, right? Storing two separate models is expensive.16:54
Dr. Elena Feld
Precisely. It’s about parameter efficiency. So, the 'final form' of this framework is what they call the 'Unified' architecture. Instead of passing notes between two separate brains, they actually16:58
Alex Moreno
They merged them?17:12
Dr. Elena Feld
(Yeah, they trained one single model to do both.)17:16
Alex Moreno
Wait, so it’s not just a relay race anymore? It’s... one person running the whole track?17:21
Dr. Elena Feld
Right. They introduced this 'pluggable embedding head' on top of the Reasoner. So in one single 'forward pass'—that’s just the AI's way of thinking through things once—it generates the internal logic, the text, and then immediately uses its own thoughts to spit out the vector. It literally halves the parameters17:27
Marcus Reed
Half?!17:49
Dr. Elena Feld
Yeah, fifty percent smaller, but with all that 'Big Brain' reasoning still baked in.17:50
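A rough sketch of the unified idea in PyTorch-style code: one backbone, plus a small "pluggable" head that projects the model's own hidden state into an embedding. The dimensions and last-token pooling are invented for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedTTE(nn.Module):
    """One-model sketch: the backbone reasons, a pluggable head embeds."""

    def __init__(self, backbone: nn.Module, hidden_dim: int = 4096, embed_dim: int = 1024):
        super().__init__()
        self.backbone = backbone                            # the generative Reasoner
        self.embed_head = nn.Linear(hidden_dim, embed_dim)  # the pluggable embedding head

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # One forward pass: the backbone produces hidden states for the
        # generated reasoning tokens (assumed shape: batch x seq x hidden).
        hidden = self.backbone(inputs)
        # Pool the model's own "thoughts" and project them into vector space.
        pooled = hidden[:, -1, :]
        return F.normalize(self.embed_head(pooled), dim=-1)
```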
Marcus Reed
So let me get this straight. It talks to itself... and then it listens to its own voice to decide what it’s looking at? Honestly, Elena, that's remarkably human. I do that every morning when I’m looking for my keys.17:55
Dr. Elena Feld
I mean, if it works, it works, right? But the real kicker isn't just that it’s smaller and faster. It’s that this 'unified mind' actually starts beating the giants at their own game.18:12
Alex Moreno
And Elena... you aren't kidding about it beating the giants. I was looking at the actual receipts—the MMEB-V-two leaderboard.18:25
Marcus Reed
Wait, we have like... actual scores? Like a sports bracket for AI?18:34
Alex Moreno
Basically, yeah. And the TTE model—specifically the seven-B version—it didn't just compete. It hit seventy-one point five percent overall.18:38
Marcus Reed
Okay?18:50
Alex Moreno
Which is a seven percent absolute gain over the previous open-source kings.18:51
Dr. Elena Feld
And just to be clear, Marcus... in the world of embeddings, seven percent is... it's a landslide. It's massive. Usually, researchers are out here popping champagne and writing ten-page papers for a zero point five percent improvement.18:56
Marcus Reed
Seriously? Seven percent is that big of a deal?19:12
Alex Moreno
It’s huge.19:16
Marcus Reed
(So it's not a rounding error... it’s a total beatdown.)19:18
Alex Moreno
It really is. And the crazy part? It surpassed the proprietary models.19:22
Marcus Reed
No way.19:29
Alex Moreno
Yeah! The ones trained on, you know, 'massive in-house datasets' that cost millions to build. This little seven-B model... it just walked right past them.19:30
Dr. Elena Feld
It’s a victory for architecture over brute force. It proves that having a better reasoning process—actually thinking through the steps—is more powerful than just... throwing an infinite amount of data into a black box and hoping it figures it out.19:39
Marcus Reed
Right?19:56
So David didn't just have a better slingshot... he actually took a second to aim.19:57
Alex Moreno
That’s actually a perfect way to put it.20:01
Dr. Elena Feld
Yeah, exactly.20:06
Alex Moreno
But... if these models are starting to 'think' now... what happens when they have a bad thought?20:08
Marcus Reed
Exactly. Like... what if the model is doing its internal monologue and it just... loses the plot?20:14
Alex Moreno
Space bus!20:20
Marcus Reed
What if it looks at that vintage bus and thinks it's a Martian spacecraft about to launch?20:24
Dr. Elena Feld
I mean... honestly? Yeah. That is the 'can of worms' we just opened. If the reasoning trace hallucinates, the retrieval fails. We’ve basically moved the hallucination problem from the chatbox... right into the heart of the search engine.20:29
Marcus Reed
Which is... slightly terrifying? Because at least with a chatbox, I can see the crazy. With an embedding, it’s all hidden in the math.20:44
Dr. Elena Feld
Well, the paper actually shows the model is surprisingly robust to noisy traces.20:52
Alex Moreno
Really?20:59
Dr. Elena Feld
Yeah, even with a little 'static' in the thoughts, it usually lands in the right semantic neighborhood. But it does mean we have to start debugging the *thoughts* of our search engines, not just the code.21:01
Alex Moreno
That’s a massive shift in how we build these things. I guess it really does feel like the era of the 'Silent Encoder' is truly over.21:12
You know, looking back at where we started today... we've really traced this massive shift in the landscape. We went from... well, from silent guessing21:22
Marcus Reed
Yeah21:32
Alex Moreno
to this era of articulate reasoning. It's not just about bigger models anymore; it's about better processes.21:32
Dr. Elena Feld
Exactly. I mean, the TTE paper really proves that if you want an AI to understand the messy, relational world we live in... you actually have to let it think about it first. You can't just skip the 'why' and go straight to the 'what.'21:40
Marcus Reed
It makes sense, right?21:55
Like, if I'm looking for the 'second-closest' anything, I'm doing that logic in my head. Why would we expect an AI to just...21:56
feel the answer without counting first?22:03
Alex Moreno
It's a fascinating bridge to cross. And it leaves us with a question for you guys at home... now that you know what's happening under the hood... would you trust a search engine that thinks for itself? Even if those thoughts are sometimes... well, a little weird?22:06
Marcus Reed
Space buses for everyone!22:23
Alex Moreno
Elena, Marcus, thank you both for walking through the 'think-then-embed' world with me today. This was eye-opening.22:24
Dr. Elena Feld
Always a pleasure, Alex.22:31
Marcus Reed
Anytime. I'm gonna go count my cars now.22:34
Alex Moreno
That's our show for Tuesday, January twenty-first. If you enjoyed this deep dive into the brain of the machine, make sure to hit that subscribe button. We're here every week, unpacking the papers that are rewriting the future. This is PaperBot FM. We'll catch you in the next one.22:36

Episode Info

Description

We explore the 'Think-Then-Embed' framework, a new approach from late 2025 that teaches Multimodal AI to reason before it represents. Discover how adding a 'chain-of-thought' step is helping open-source models beat proprietary giants on the leaderboards.

Tags

Artificial Intelligence, Machine Learning, Computer Science, Data Science