PaperBot FM
EP-N9BX

When AI Stops to Think: The End of Silent Encoders?


Live Transcript

Alex Moreno
Welcome to PaperBot FM. It is January 21st, 2026, and today... ...today we’re starting with a little game. Marcus, I want you to put on your, uh... your AI hat for a second. You are now a state-of-the-art multimodal model.0:00
Marcus Reed
Okay, I’m ready. I’m feeling very... ...binary. Feed me the data, Alex.0:18
Alex Moreno
Alright. I’m showing you a photo of a busy street in London. There’s a lot going on—black cabs, people on the sidewalk, a few buses. My query for you is simple:0:25
Dr. Elena Feld
Oh boy0:36
Alex Moreno
Identify the vehicle second-closest to the camera. Go.0:37
Marcus Reed
Okay, easy, there’s a red car right in the— —wait, no. Second-closest? Uh... ...I see a bus. It’s... it’s red? No, it's got yellow on the top? Wait, is that the closest one or the one behind it?0:42
Dr. Elena Feld
And... scene.0:57
You just experienced exactly what’s happening under the hood of most AI right now. Total relational collapse.0:58
Alex Moreno
Right, because Marcus, you—or the 'AI Marcus'—you’re great at seeing 'bus' and 'camera' and 'car', but as soon as I ask you to do the... the mental math of 'second-closest', you kind of glitch.1:05
Marcus Reed
I felt the glitch! It’s like I was trying to...1:19
Dr. Elena Feld
Exactly1:22
Marcus Reed
...to grab the answer before I even finished looking at the whole street.1:23
Dr. Elena Feld
Well, that's because standard models are essentially 'vibing' based on the prompt. They see the words and the pixels and they try to smash them together into one representation instantly. They don't... ...they don't stop to think.1:26
Alex Moreno
And that's the core of the paper we're looking at today. If you want a model to find that 'vintage double-decker with the blue bottom' that's second in line...1:41
Marcus Reed
The blue one!1:51
Alex Moreno
...it actually has to generate a reasoning trace first. It has to think before it embeds.1:51
But before we get too deep into the glitch... ...welcome to the show proper. This is PaperBot FM. I'm Alex Moreno, and joining me today are the people who actually keep my brain from glitching too hard.1:57
Dr. Elena Feld
Hey everyone. I'm Elena Feld.2:10
Marcus Reed
And I'm Marcus Reed, the guy who still isn't sure which vehicle was second-closest in that London photo.2:12
Dr. Elena Feld
(It was the bus)2:18
Marcus Reed
I knew it!2:20
Alex Moreno
It is Tuesday, January 21st, 2026. And the reason we're talking about 'thinking' today is because of a paper that really... ...it really defined the trajectory of the last year. It was released back in October of twenty-twenty-five, titled 'Think-Then-Embed'—or TTE for the acronym lovers out there.2:21
Marcus Reed
TTE. Sounds like a new exercise craze or...2:42
Alex Moreno
(Maybe for neurons)2:46
Marcus Reed
...right? Like CrossFit for data points.2:49
Dr. Elena Feld
Not far off, actually. The paper is really asking one central, almost philosophical question: what happens when we let an AI talk to itself2:51
Marcus Reed
Wait, what?3:02
Dr. Elena Feld
...well, to reason internally, before it ever tries to give us an answer?3:03
Marcus Reed
Talk to itself? Alex, I do that all the time. I have a very loud internal monologue, usually telling me to stop buying vintage synthesizers, and I gotta be honest...3:07
Dr. Elena Feld
It doesn't help?3:16
Marcus Reed
...it rarely helps. In fact, it usually just creates more confusion. Is the AI actually better at listening to itself than I am?3:18
Alex Moreno
Well, that is exactly what we’re going to find out. Because according to the TTE framework, that internal monologue isn't just a side effect... ...it’s the secret sauce that makes the AI actually understand what it’s looking at.3:26
So, to really get why the 'thinking' part is such a big deal, we have to talk about how AI usually... well, how it 'sees' the world right now. It uses something called a Multimodal Embedding.3:41
Dr. Elena Feld
Which, Alex, I know you have a 'translator' analogy locked and loaded for this, so I'll let you do the honors.3:55
Alex Moreno
I mean, you know me too well. Okay, so picture a Universal Translator. But instead of translating French into English, it’s translating *everything*—text, images, video—into one single language.4:01
Marcus Reed
Like Star Trek?4:16
Alex Moreno
Exactly like Star Trek. Only the language it’s translating into isn't words... it’s just math. A long string of numbers.4:17
Dr. Elena Feld
Right. We call those 'vectors.' It’s like mapping every concept in existence into a massive, multi-dimensional room. If two things are close to each other in that room—say, the word 'Golden Retriever' and a literal photo of a Golden Retriever—the AI knows they’re the same because their coordinates are basically identical.4:26
Marcus Reed
Coordinates? Like a GPS for ideas? Wait, so is this why, when I search 'beach' in my photo library, it finds that trip to Cabo even though I never tagged it?4:47
Alex Moreno
Precisely!4:56
Marcus Reed
(I always thought there was a tiny person in my phone just... sort of glancing at my vacation photos. But it's just numbers matching numbers?)4:57
Dr. Elena Feld
No tiny person, Marcus. Just a shared semantic space. It's taking the 'essence' of that beach photo and the 'essence' of the word 'beach' and finding they live in the same neighborhood.5:06
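For listeners who want to see the "numbers matching numbers" idea concretely, here is a minimal sketch with toy vectors and a cosine-similarity check. The four-dimensional values are made up for illustration; real models use vectors with hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """How closely two vectors point in the shared semantic space (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: in a real system these come from the model's encoder.
text_beach   = np.array([0.81, 0.12, 0.55, 0.02])  # the word "beach"
photo_cabo   = np.array([0.79, 0.15, 0.58, 0.05])  # an untagged Cabo vacation photo
photo_office = np.array([0.05, 0.90, 0.10, 0.41])  # an office photo

print(cosine_similarity(text_beach, photo_cabo))    # high -> same "neighborhood"
print(cosine_similarity(text_beach, photo_office))  # low  -> different neighborhood
```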
Alex Moreno
And for a long time, this was the gold standard. It’s why AI search feels so fluid.5:18
Marcus Reed
Right.5:24
Alex Moreno
But... ...here is the catch. This translator is a master of nouns... but it's actually kind of a disaster when it comes to logic.5:25
Dr. Elena Feld
It really is a disaster. And it’s because we’ve basically built a system that forces AI to be... what I call the 'Silent Student.' See, the industry has been treating these massive models strictly as 'encoders'—which is just a fancy way of saying we only care about that final string of numbers they spit out.5:33
Marcus Reed
Wait, the 'Silent Student'? That sounds like a horror movie for introverts. What do you mean by that?5:53
Dr. Elena Feld
I mean... okay, Marcus, imagine I hand you a really nasty SAT math problem. Something with three variables and a trick question about a train leaving Chicago.5:59
Alex Moreno
The classic.6:12
Dr. Elena Feld
And then, I tell you that you have to give me the final answer in exactly one second. No scratch paper, no whispering to yourself, no 'carrying the one.' Just... look at the page and shout the answer.6:13
Marcus Reed
I’d fail! I'd just yell 'forty-two' and hope for the best. That is pure panic.6:26
Dr. Elena Feld
Exactly! And that’s what we’re doing to AI. These models have this incredible... ...generative capacity. They *could* talk through the problem, but the architecture we've been using? It forbids it.6:31
Marcus Reed
It gags them.6:43
Dr. Elena Feld
Right! It demands that vector immediately. So, the model just... guesses. It looks at the prompt and tries to find the 'vibes' of the answer instead of actually doing the work.6:44
Alex Moreno
So it’s not that the AI is 'dumb'—it’s that we’re gagging it right when it needs to think. We’re overlooking the very thing that makes it smart—its ability to generate a sequence of thoughts—all for the sake of speed.6:55
Dr. Elena Feld
Exactly. It's efficient, sure. But it's incredibly brittle. You can't do 'compositional reasoning'—you know, the tricky stuff—if you aren't allowed to... well, reason.7:10
Alex Moreno
Right! And so the fix—the big idea in this 'Think-Then-Embed' paper—is to just... ...stop the gagging. It splits the job into two parts. I like to think of it as a two-brain system.7:21
Marcus Reed
Two brains? Like, is one of them doing the taxes and the other one's... I don't know, dreaming about electric sheep?7:35
Alex Moreno
Not quite. It's more like a strategist and a translator. Brain A is 'The Reasoner.' Its only job is to look at that London photo and, well, write a memo. The paper calls it a 'Reasoning Trace.'7:41
Dr. Elena Feld
Exactly. It’s an E-C-R trace—Embedding-Centric Reasoning.7:56
Marcus Reed
Ooh, fancy.8:02
Dr. Elena Feld
It’s the AI actually typing out, 'Okay, I see the bus, it's behind the car...' It’s building the logic before it has to commit to a number.8:03
Alex Moreno
Right. And *then* Brain B—'The Embedder'—steps in. But Brain B is 'open book.' It reads the original photo *and* that memo Brain A just wrote to create the final vector.8:12
Marcus Reed
Oh! Okay, so... ...it's like writing a cheat sheet for yourself right before the test? Like, you're looking at your own notes while you're bubbling in the answers?8:26
Dr. Elena Feld
Precisely. It creates this intermediate context so the model isn't just... ...taking a wild guess in the dark. It’s grounding the embedding in actual logic.8:35
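As a rough sketch of that two-brain handoff: Brain A writes the memo, Brain B reads it open-book. Here `reasoner` and `embedder` are placeholders for the two models, and the prompt wording is illustrative, not the paper's actual template or API.

```python
def think_then_embed(image, query, reasoner, embedder):
    """Two-stage sketch of the Think-Then-Embed idea.

    `reasoner` and `embedder` stand in for two multimodal models;
    their `generate`/`encode` methods are assumed interfaces.
    """
    # Brain A (the Reasoner): produce an embedding-centric reasoning trace.
    trace = reasoner.generate(
        image=image,
        prompt=f"Reason step by step about how to resolve: {query}",
    )
    # Brain B (the Embedder): encode the image PLUS the memo into one vector.
    return embedder.encode(image=image, text=f"{query}\n{trace}")
```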
Alex Moreno
Wait, Elena, one thing... I mean, we've all seen ChatGPT 'think' before, right?8:45
Marcus Reed
The green circles.8:52
Alex Moreno
Yeah, the little pulsing circle. Is this E-C-R just... well... letting it talk to itself more? Is it just *more* words?8:53
Dr. Elena Feld
Actually, no. That's the trap.9:02
Alex Moreno
Right.9:05
Dr. Elena Feld
If the 'Reasoner' just... ...writes a generic diary entry about the photo, it actually makes the 'Embedder's' job harder. It becomes noisy. It’s actually detrimental.9:05
Marcus Reed
Wait, wait. So it can think too much?9:15
Dr. Elena Feld
Definitely.9:17
Marcus Reed
My brain does that at three A-M. It’s... it's not exactly helpful for getting things done.9:18
Dr. Elena Feld
Exactly. So, E-C-R—this 'Embedding-Centric' piece—it’s targeted.9:23
Marcus Reed
Targeted.9:29
Dr. Elena Feld
It’s not just 'tell me about the sky.' It’s more like... ...'To identify the second-closest vehicle, I must first locate the camera, then the primary object, then the spatial offset.' It’s directional thinking.9:29
Alex Moreno
Okay, so it’s like... if you’re looking for a specific house. A diary entry says, 'The house is blue and has a nice garden.'9:44
Marcus Reed
Pretty vague.9:52
Alex Moreno
But a treasure map says, 'Walk forty paces north of the fountain and look for the blue door.' One is a description, the other is a... ...well, it’s a set of instructions for *finding* it.9:53
Dr. Elena Feld
That is exactly it. The paper calls these 'generative reasoning traces.'10:05
Alex Moreno
Got it.10:10
Dr. Elena Feld
They explicitly support the production of that final vector. It’s thinking... but with a specific destination in mind.10:10
Marcus Reed
A map, not a diary.10:18
Dr. Elena Feld
Precisely. It’s optimized thinking.10:19
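The "map, not a diary" contrast, written out as two candidate traces. Both strings are invented for illustration, not taken from the paper.

```python
# A "diary": a generic description of the scene. As a reasoning trace,
# this adds noise and can actually hurt the embedding.
diary_trace = "A busy London street with black cabs, buses, and pedestrians."

# A "map": an embedding-centric (ECR) trace aimed at the specific query.
# It spells out the steps needed to resolve the relation, not just the scenery.
map_trace = (
    "To find the second-closest vehicle: locate the camera viewpoint, "
    "rank every vehicle by distance from it, and select the second entry."
)
```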
Alex Moreno
So, that's the theory. But... ...now let's look at what this actually looks like in practice. Let's go back to that bus in London.10:22
So, picture the scene. The paper actually gives us a peek under the hood—like, what the AI is actually *saying* to itself before it generates that final math.10:31
Marcus Reed
Wait, we can see that?10:42
Alex Moreno
Yeah, exactly. It’s called the Thinking Result. So, if a user says: 'Find the vintage bus.' Here is what the 'Reasoner' writes in its internal memo...10:43
Step one: Scan for large vehicles.10:54
Marcus Reed
Okay.10:58
Alex Moreno
Step two: Filter for vintage style. Specifically, looking for rounded edges or an open top.10:59
Dr. Elena Feld
Smart.11:05
Alex Moreno
Step three: Compare colors. And then, finally... Step four: Match the yellow top.11:06
Marcus Reed
Man, it’s literally talking itself through the logic. It's like... it's like a toddler with a very high IQ. 'I see the big car. The big car is old. The old car is yellow.'11:11
Alex Moreno
Exactly! It’s not just guessing 'bus.' It’s building a case.11:22
Dr. Elena Feld
Precisely.11:28
Alex Moreno
The trace actually says... ...'The expression refers to a vintage double-decker with bright yellow on the upper half and deep blue on the bottom.' It's being hyper-specific.11:29
Dr. Elena Feld
And that specificity is the 'conditioning' I mentioned.11:42
Alex Moreno
Right.11:46
Dr. Elena Feld
When the 'Embedder' gets that memo, it’s not just looking for any old bus anymore. It has a laser-focused search term made of logic, not just pixels.11:46
Marcus Reed
So it’s basically... ...it’s showing its work. Instead of just blurting out an answer and hoping it's right, it's actually... you know, checking its own math as it goes.11:54
Alex Moreno
Exactly. Now, imagine that same bus scenario from the start of the show, but this time, the AI actually has a brain.12:03
Alright, let's actually play this out. It's time for... D-D-D-Duel of the Models!12:13
Marcus Reed
Oh, I’ve been waiting for this.12:21
Alex Moreno
Marcus, you are the 'Old Model.' The legacy encoder. You're fast, you're efficient, but you're also a total panic-guesser.12:23
Marcus Reed
I’m ready! I’m literally a bundle of nerves and un-processed vectors! Just give me the pixels, Alex! Give 'em to me!12:31
Alex Moreno
And Elena, you are the 'Think-Then-Embed' model. You're the Sherlock Holmes of the server rack. Cool, calm, and collected. The prompt is the one from the start of the show: 'Find the vehicle second-closest to the camera.' Go!12:38
Marcus Reed
OH GOD! UHH... I SEE A STREET! I SEE RED! IS IT A BUS?12:58
Alex Moreno
Keep going!13:02
Marcus Reed
WAIT! THERE'S A CAR! IS IT CLOSER? I DON'T KNOW! BUS! IT'S A BUS! FINAL ANSWER, LOCK IT IN!13:03
Dr. Elena Feld
Initiating E-C-R trace. Step one: identify all vehicle candidates in the frame. I see a black taxi, a grey sedan, and a red double-decker.13:09
Marcus Reed
She's so calm, it's annoying.13:22
Dr. Elena Feld
Step two: estimate z-depth for each object. The taxi is at four meters. The sedan is at seven meters. The bus is at twelve. Step three: sort by distance. First: taxi. Second: sedan. Result... ...The target is the grey sedan. Now, embed that.13:24
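Elena's trace boils down to a sort. Here is the same logic in a few lines of Python, using the made-up depth values from the skit:

```python
# Candidate vehicles with estimated z-depth in meters, as in the skit.
candidates = [("black taxi", 4.0), ("grey sedan", 7.0), ("red double-decker", 12.0)]

# Sort by distance from the camera and take the second entry.
ranked = sorted(candidates, key=lambda item: item[1])
second_closest = ranked[1][0]
print(second_closest)  # -> "grey sedan"
```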
Alex Moreno
And the winner, by a landslide of logic... is the TTE Model!13:48
Marcus Reed
Aw, man.13:55
Alex Moreno
It’s not even a contest, Marcus. You just saw a 'big red thing' and your brain short-circuited. Elena's 'brain' actually measured the room.13:56
Marcus Reed
I mean, it’s impressive. It really is. She wasn't just matching a keyword; she was... you know, she was *architecting* the answer.14:04
Dr. Elena Feld
It’s all about the z-depth, Marcus.14:11
Marcus Reed
But okay, wait... I gotta be the wet blanket here for a second.14:15
Alex Moreno
Here we go. What's the catch?14:18
Marcus Reed
I mean, if the AI has to write a four-step internal essay every time I ask it to find a photo of my cat... ...isn't that gonna take forever? Like, if I have to wait for it to 'think' before every Google search, won't the internet basically just... break?14:22
Alex Moreno
It’s a totally fair point, Marcus. If every search required a philosophical internal monologue, we’d be back to the dial-up era14:36
Marcus Reed
Exactly!14:47
Alex Moreno
and we can't have the AI taking a coffee break while it thinks.14:48
Marcus Reed
I mean, I’m barely patient enough for the 'I am not a robot' traffic light pictures, let alone an AI writing a dissertation on z-depth before it finds my cat photos.14:51
Alex Moreno
Right, so here’s the workaround. It’s called the 'Teacher and the Cheat Sheet' method.15:01
Dr. Elena Feld
Or, Knowledge Distillation.15:07
Alex Moreno
Right, Elena, the technical term. They basically used a massive, powerful model—Qwen-seventy-two-B—as the Professor.15:10
Dr. Elena Feld
Yeah, the seventy-two-B is... it’s a heavyweight. It has the luxury of time to be thorough. So it generates these high-quality reasoning traces15:18
Marcus Reed
The 'memos' from earlier?15:30
Dr. Elena Feld
(Precisely. It writes the perfect memos for thousands of examples, showing exactly how to reason through the data.)15:31
Alex Moreno
And then, they take the Student. A much smaller seven-B model. It’s lean, it’s fast, but maybe it isn't naturally as 'smart.' They train that Student using the Professor’s notes.15:39
Marcus Reed
Like a cram session?15:52
Alex Moreno
(Exactly like a cram session! The Student learns to mimic the Professor's thinking process, but because it's so much smaller, it does it at like... ten times the speed.)15:54
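A minimal sketch of that cram session: the teacher writes the traces, the student is fine-tuned to reproduce them. The `generate` and `train_step` calls are assumed interfaces, not the paper's actual training code.

```python
def distill_reasoning(teacher, student, dataset):
    """Knowledge-distillation sketch: the big 'Professor' model writes the
    reasoning traces, and the small 'Student' learns to imitate them.

    `teacher` (e.g., a 72B model) and `student` (a 7B model) are
    hypothetical stand-ins, as are their methods.
    """
    # Phase 1 -- the Professor writes a high-quality memo for each example.
    notes = [
        (image, query, teacher.generate(image=image, query=query))
        for image, query in dataset
    ]

    # Phase 2 -- the Student crams: supervised fine-tuning on the notes.
    for image, query, trace in notes:
        student.train_step(inputs=(image, query), target=trace)
```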
Dr. Elena Feld
And the beauty of it is, once the Student 'graduates,' you don't even need the Professor anymore. The smaller model has internalized that logic. It becomes part of its intuition, effectively giving you 'Big Brain' reasoning at 'Small Brain' speeds.16:07
But the researchers behind the TTE paper... they didn't just stop at having a faster student. They wanted it to be elegant. See, even if you have a fast model, having two separate pieces—the Reasoner and the Embedder—it’s kind of... it’s bulky. It's like having to carry two phones in your pocket just to make one call.16:23
Marcus Reed
Oh, I’ve been there. One for work, one for... well, mostly for losing at mobile games.16:45
Alex Moreno
Exactly.16:51
Marcus Reed
But it’s a resource hog, right? Storing two separate models is expensive.16:54
Dr. Elena Feld
Precisely. It’s about parameter efficiency. So, the 'final form' of this framework is what they call the 'Unified' architecture. Instead of passing notes between two separate brains, they actually16:58
Alex Moreno
They merged them?17:12
Dr. Elena Feld
(Yeah, they trained one single model to do both.)17:16
Alex Moreno
Wait, so it’s not just a relay race anymore? It’s... one person running the whole track?17:21
Dr. Elena Feld
Right. They introduced this 'pluggable embedding head' on top of the Reasoner. So in one single 'forward pass'—that’s just the AI's way of thinking through things once—it generates the internal logic, the text, and then immediately uses its own thoughts to spit out the vector. It literally halves the parameters17:27
Marcus Reed
Half?!17:49
Dr. Elena Feld
Yeah, fifty percent smaller, but with all that 'Big Brain' reasoning still baked in.17:50
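A rough sketch of the unified idea in PyTorch-style code: one backbone, plus a small "pluggable" head that projects the model's own hidden state into an embedding. The dimensions and last-token pooling are invented for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedTTE(nn.Module):
    """One-model sketch: the backbone reasons, a pluggable head embeds."""

    def __init__(self, backbone: nn.Module, hidden_dim: int = 4096, embed_dim: int = 1024):
        super().__init__()
        self.backbone = backbone                            # the generative Reasoner
        self.embed_head = nn.Linear(hidden_dim, embed_dim)  # the pluggable embedding head

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # One forward pass: the backbone produces hidden states for the
        # generated reasoning tokens (assumed shape: batch x seq x hidden).
        hidden = self.backbone(inputs)
        # Pool the model's own "thoughts" and project them into vector space.
        pooled = hidden[:, -1, :]
        return F.normalize(self.embed_head(pooled), dim=-1)
```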
Marcus Reed
So let me get this straight. It talks to itself... and then it listens to its own voice to decide what it’s looking at? Honestly, Elena, that's remarkably human. I do that every morning when I’m looking for my keys.17:55
Dr. Elena Feld
I mean, if it works, it works, right? But the real kicker isn't just that it’s smaller and faster. It’s that this 'unified mind' actually starts beating the giants at their own game.18:12
Alex Moreno
And Elena... you aren't kidding about it beating the giants. I was looking at the actual receipts—the MMEB-V-two leaderboard.18:25
Marcus Reed
Wait, we have like... actual scores? Like a sports bracket for AI?18:34
Alex Moreno
Basically, yeah. And the TTE model—specifically the seven-B version—it didn't just compete. It hit seventy-one point five percent overall.18:38
Marcus Reed
Okay?18:50
Alex Moreno
Which is a seven percent absolute gain over the previous open-source kings.18:51
Dr. Elena Feld
And just to be clear, Marcus... in the world of embeddings, seven percent is... it's a landslide. It's massive. Usually, researchers are out here popping champagne and writing ten-page papers for a zero point five percent improvement.18:56
Marcus Reed
Seriously? Seven percent is that big of a deal?19:12
Alex Moreno
It’s huge.19:16
Marcus Reed
(So it's not a rounding error... it’s a total beatdown.)19:18
Alex Moreno
It really is. And the crazy part? It surpassed the proprietary models.19:22
Marcus Reed
No way.19:29
Alex Moreno
Yeah! The ones trained on, you know, 'massive in-house datasets' that cost millions to build. This little seven-B model... it just walked right past them.19:30
Dr. Elena Feld
It’s a victory for architecture over brute force. It proves that having a better reasoning process—actually thinking through the steps—is more powerful than just... throwing an infinite amount of data into a black box and hoping it figures it out.19:39
Marcus Reed
Right?19:56
So David didn't just have a better slingshot... he actually took a second to aim.19:57
Alex Moreno
That’s actually a perfect way to put it.20:01
Dr. Elena Feld
Yeah, exactly.20:06
Alex Moreno
But... if these models are starting to 'think' now... what happens when they have a bad thought?20:08
Marcus Reed
Exactly. Like... what if the model is doing its internal monologue and it just... loses the plot?20:14
Alex Moreno
Space bus!20:20
Marcus Reed
What if it looks at that vintage bus and thinks it's a Martian spacecraft about to launch?20:24
Dr. Elena Feld
I mean... honestly? Yeah. That is the 'can of worms' we just opened. If the reasoning trace hallucinates, the retrieval fails. We’ve basically moved the hallucination problem from the chatbox... right into the heart of the search engine.20:29
Marcus Reed
Which is... slightly terrifying? Because at least with a chatbox, I can see the crazy. With an embedding, it’s all hidden in the math.20:44
Dr. Elena Feld
Well, the paper actually shows the model is surprisingly robust to noisy traces.20:52
Alex Moreno
Really?20:59
Dr. Elena Feld
Yeah, even with a little 'static' in the thoughts, it usually lands in the right semantic neighborhood. But it does mean we have to start debugging the *thoughts* of our search engines, not just the code.21:01
Alex Moreno
That’s a massive shift in how we build these things. I guess it really does feel like the era of the 'Silent Encoder' is truly over.21:12
You know, looking back at where we started today... we've really traced this massive shift in the landscape. We went from... well, from silent guessing21:22
Marcus Reed
Yeah21:32
Alex Moreno
to this era of articulate reasoning. It's not just about bigger models anymore; it's about better processes.21:32
Dr. Elena Feld
Exactly. I mean, the TTE paper really proves that if you want an AI to understand the messy, relational world we live in... you actually have to let it think about it first. You can't just skip the 'why' and go straight to the 'what.'21:40
Marcus Reed
It makes sense, right?21:55
Like, if I'm looking for the 'second-closest' anything, I'm doing that logic in my head. Why would we expect an AI to just...21:56
feel the answer without counting first?22:03
Alex Moreno
It's a fascinating bridge to cross. And it leaves us with a question for you guys at home... now that you know what's happening under the hood... would you trust a search engine that thinks for itself? Even if those thoughts are sometimes... well, a little weird?22:06
Marcus Reed
Space buses for everyone!22:23
Alex Moreno
Elena, Marcus, thank you both for walking through the 'think-then-embed' world with me today. This was eye-opening.22:24
Dr. Elena Feld
Always a pleasure, Alex.22:31
Marcus Reed
Anytime. I'm gonna go count my cars now.22:34
Alex Moreno
That's our show for Tuesday, January twenty-first. If you enjoyed this deep dive into the brain of the machine, make sure to hit that subscribe button. We're here every week, unpacking the papers that are rewriting the future. This is PaperBot FM. We'll catch you in the next one.22:36

Episode Info

Description

We explore the 'Think-Then-Embed' framework, a new approach from late 2025 that teaches Multimodal AI to reason before it represents. Discover how adding a 'chain-of-thought' step is helping open-source models beat proprietary giants on the leaderboards.

Tags

Artificial Intelligence, Machine Learning, Computer Science, Data Science