I wasn’t trying to run an AI benchmark at 36,000 feet. I was just sitting on a plane next to my wife, looking at the seat-back flight display, and I noticed something. The screen had a bunch of data on it — time to destination, distance, local times, altitude, wind speed, ground speed — and I realized it actually made me read it twice before I was confident about when we were going to land. That felt like a good test.
My wife caught me contorting my phone toward the seat-back screen trying to get a clean shot with no glare. She looked confused for a moment, then something clicked and she smiled. I’m pretty sure she was just glad I’d found something to occupy myself with rather than rambling about open weight models for the next hour…
I’ve been keeping a close eye on the open weight model space for a while now. Not because I’m a fanboy for any particular company, but because I help businesses use technology in ways that actually make sense for them, and efficiency matters. Smaller models that are genuinely capable cost less to run, use less energy, and are easier to deploy. So when Gemma 4 came out with some impressive-looking benchmarks, I wanted to see what it could actually do on something real: not a benchmark, but a messy real-world problem that every single person on that airplane had to look at if they wanted to know when we were landing.
I was deliberate about taking a single photo because I didn’t want any variables messing up the comparison between models. Same photo, every test. I ran the initial tests on my phone during the flight, but when I got back to my office I re-ran everything through LM Studio on my desktop. It gave me clean chain of thought logs and consistent timing that were a lot easier to review. The results matched what I’d seen on the plane, but now I had the logs.

Why This Screen Is Trickier Than It Looks
Here’s the thing I want to be upfront about before getting into results. In my work building AI-powered systems for clients, I almost always push for cleaning and structuring data before it ever reaches the model. Give the AI only what it needs, in a format that’s unambiguous. That reduces errors and keeps responses consistent.
This screen does the opposite of that. It has a field that reads “Local Time at Destination: 2:52 PM” which looks exactly like the answer to the question — what time will it be in Atlanta when I land? Except it isn’t the answer. That field shows the current time in Atlanta right now. To get the actual landing time you have to notice the “Time to Destination: 1:09” field, understand what it means in relation to the destination time, and add them together. The answer is 4:01 PM.
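The arithmetic itself is trivial once the fields are interpreted correctly. A minimal sketch, using only the two values actually shown on the screen (everything else here is illustration):

```python
from datetime import datetime, timedelta

# Values read off the seat-back screen
local_time_at_destination = "2:52 PM"  # current time in Atlanta, NOT the arrival time
time_to_destination = "1:09"           # hours:minutes of flight remaining

# Parse the clock time and the remaining duration
now_in_atlanta = datetime.strptime(local_time_at_destination, "%I:%M %p")
hours, minutes = map(int, time_to_destination.split(":"))

# Landing time = current destination time + time remaining
arrival = now_in_atlanta + timedelta(hours=hours, minutes=minutes)
print(arrival.strftime("%I:%M %p").lstrip("0"))  # 4:01 PM
```

Two lines of actual logic. The hard part is never the math; it’s deciding which field means what.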
“A model that only reasons correctly when everything is pre-digested for it has a ceiling. I wanted to find that ceiling.”
I would never set up a production AI agent to work against data like this without cleaning it first. But I thought it was worth testing anyway because real world data is messy. I wanted to find that ceiling.
On the Plane — Initial Mobile Tests
I had the two smaller Gemma 4 models running on my phone during the flight.
❌ Gemma 4 E2B
Just refused. Every prompt, every framing, it kept saying it didn’t have enough information to answer. It’s the smallest model in the family, so a failing grade, but not a surprising one at that scale.
❌ Gemma 4 E4B
Actually engaged with the image and thought for about 10 seconds. It came back with 2:52 PM. Its reasoning found the destination time field, read it correctly, and then just treated it as the arrival time without ever questioning whether that interpretation made sense. The 1:09 was right there. It never asked itself why that field existed.
Two for two wrong on the plane.
Back at the Office — Desktop Tests
When I landed I wanted to push further. I re-ran the Gemma 4 mobile results through LM Studio to get proper logs, then added several more models to the mix.
Gemma 4 26B MoE had been on my radar specifically because the benchmark numbers coming out of Google DeepMind looked strong — but I also know that benchmarks can be over-optimized for. Models get really good at benchmark tasks in ways that don’t always translate to real world reasoning.
I also added Gemma 3 27B, because it’s the model that shows up in Google’s own marketing materials and benchmark comparisons, and GPT OSS 20B, because I was curious how it would handle the same challenge.
For models that couldn’t process images directly I converted the screen to a markdown table and passed that as text. I want to be clear about that — those models got an easier version of the test. The visual interpretation step was already done for them.
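That conversion was just a flattening step. A hypothetical sketch of what it amounts to, showing only the two decisive fields (the real screen had more, and the helper below is illustration, not the exact script I used):

```python
# Flatten screen fields into a markdown table for text-only models.
# Only the two fields that matter for the question are shown here;
# altitude, wind speed, ground speed, etc. are omitted.
fields = {
    "Time to Destination": "1:09",
    "Local Time at Destination": "2:52 PM",
}

lines = ["| Field | Value |", "|---|---|"]
lines += [f"| {name} | {value} |" for name, value in fields.items()]
markdown_table = "\n".join(lines)
print(markdown_table)
```

Note that this hands the text-only models a pre-solved perception problem: every field is already isolated and labeled. The interpretation problem, however, survives the conversion intact.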
❌ Gemma 3 27B
No thinking capability, just an instant response. It grabbed “Local Time at Destination: 2:52 PM” and output it directly. Fastest answer of the whole test. This is the model that dominates Google’s benchmark slides. Worth keeping in mind.
✅ Gemma 4 26B MoE — The One That Got It Right
It thought for about 51 seconds and came back with 4:01 PM. The correct answer. But more importantly, how it got there was different from everything else. Midway through its reasoning it stopped and asked itself whether “Local Time at Destination” meant the current time in Atlanta or the arrival time. It resolved that question correctly and then did the math. That self-interrogation — pausing to question what a field actually represents before acting on it — is the step every other model skipped.
“That self-interrogation — pausing to question what a field actually represents before acting on it — is the step every other model skipped.”
❌ Ministral 3 14B
Spent 37 seconds thinking, wrote out all the right numbers in its chain of thought, noted the 1:09 remaining, and still landed on 2:52 PM. It built a timezone offset explanation that sounded reasonable but was solving the wrong problem entirely. The reasoning was running but it wasn’t really driving the answer.
❌ GPT OSS 20B
The one that surprised me most. It actually did the right math in its chain of thought: 1:52 plus 1:09 gets you to around 3:01, which is 4:01 PM once you account for the hour jump from Central to Eastern time. Then it looked at the 2:52 PM destination field, decided that must be the arrival time accounting for the timezone difference, and threw its own correct math away. It had it. And then it talked itself out of it in about 2.7 seconds.
❌ Phi-4 Reasoning Plus
Received the markdown table rather than the raw image, so it had the easiest version of the test, and still spent 49 seconds on the problem. It listed every field, including Time to Destination: 1:09, and never once asked what that field implied. Came back with 2:52 PM. Nearly 50 seconds of thinking, structured data handed to it, still wrong.
The Scorecard
| Model | Tested On | Vision | Answer | Correct | CoT Time |
|---|---|---|---|---|---|
| Gemma 4 E2B | Mobile + Desktop* | ✅ | Refused | ❌ | — |
| Gemma 4 E4B | Mobile + Desktop* | ✅ | 2:52 PM | ❌ | ~10s |
| Gemma 3 27B | Desktop | ✅ | 2:52 PM | ❌ | None |
| Gemma 4 26B MoE | Desktop | ✅ | 4:01 PM ✅ | ✅ | ~51s |
| Ministral 3 14B | Desktop | ✅ | 2:52 PM | ❌ | 37.46s |
| GPT OSS 20B | Desktop | ❌ ** | 2:52 PM | ❌ | 2.73s |
| Phi-4 Reasoning Plus | Desktop | ❌ ** | 2:52 PM | ❌ | 49.29s |
Correct answer: 4:01 PM

\* Initially tested on mobile during flight, re-run on desktop via LM Studio for clean chain of thought logs and timing.

\*\* No vision capability — received pre-converted markdown table instead of raw image.
What I Actually Took Away From This
Seven models. One correct answer. And I’m not drawing huge conclusions from one informal test — I know the 26B MoE is nowhere close to the top tier models like Gemini, GPT-5, or the upper end of Claude. That’s not the point.
The Key Insight
Thinking time alone didn’t determine who got it right. GPT OSS 20B had the answer and rationalized it away in under 3 seconds. Phi-4 thought for nearly 50 seconds with easier data and still missed it. The difference wasn’t horsepower — it was whether the model stopped to question its own interpretation before committing to an answer.
I’m mostly rooting for efficiency in this space. Smaller models that genuinely reason well use less energy and cost less to run, which makes deploying AI and autonomous agents far more affordable and simply good business sense. Seeing a model this size solve a real-world problem where comparable models failed, on a real task rather than a benchmark, is exactly the signal I was looking for. The massive frontier models can already do this; the real progress lies in these leaps in efficiency.
The seat-back screen on that flight turned out to be a better test than anything I could have designed on purpose. And Rachael got an hour of peace and quiet to read. Everyone won.
All tests run with thinking enabled where supported. Initial mobile tests conducted during flight from Kansas City to Atlanta. All results verified and chain of thought logs captured via LM Studio on desktop after landing. Models without vision capability received flight data as a pre-converted markdown table.