I wasn’t trying to run an AI benchmark at 36,000 feet. I was just sitting on a plane next to my wife, looking at the seat-back flight display, and I noticed something. The screen had a bunch of data on it — time to destination, distance, local times, altitude, wind speed, ground speed — and I realized it actually made me read it twice before I was confident about when we were going to land. That felt like a good test.
My wife caught me contorting my phone toward the seat-back screen trying to get a clean shot with no glare. She looked confused for a moment, then something clicked and she smiled. I’m pretty sure she was just glad I’d found something to occupy myself with rather than rambling about open weight models for the next hour…
I’ve been keeping a close eye on the open weight model space for a while now. Not because I’m a fanboy for any particular company, but because I help businesses use technology in ways that actually make sense for them, and efficiency matters. Smaller models that are genuinely capable cost less to run, use less energy, and are easier to deploy. So when Gemma 4 came out with some impressive-looking benchmarks, I wanted to see what it could actually do on something real: not a benchmark, but a messy real-world problem that every single person on that airplane had to look at if they wanted to know when we were landing.
I was deliberate about taking a single photo because I didn’t want any variables messing up the comparison between models. Same photo, every test. I ran the initial tests on my phone during the flight, but when I got back to my office I re-ran everything through LM Studio on my desktop. It gave me clean chain of thought logs and consistent timing that were a lot easier to review. The results matched what I’d seen on the plane, but now I had the logs.

Why This Screen Is Trickier Than It Looks
Here’s the thing I want to be upfront about before getting into results. In my work building AI-powered systems for clients, I almost always push for cleaning and structuring data before it ever reaches the model. Give the AI only what it needs, in a format that’s unambiguous. That reduces errors and keeps responses consistent.
This screen does the opposite of that. It has a field that reads “Local Time at Destination: 2:52 PM” which looks exactly like the answer to the question — what time will it be in Atlanta when I land? Except it isn’t the answer. That field shows the current time in Atlanta right now. To get the actual landing time you have to notice the “Time to Destination: 1:09” field, understand what it means in relation to the destination time, and add them together. The answer is 4:01 PM.
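The arithmetic itself is trivial once the fields are interpreted correctly. A minimal sketch, using only the two values actually shown on the screen (everything else here is illustration):

```python
from datetime import datetime, timedelta

# Values read off the seat-back screen
local_time_at_destination = "2:52 PM"  # current time in Atlanta, NOT the arrival time
time_to_destination = "1:09"           # hours:minutes of flight remaining

# Parse the clock time and the remaining duration
now_in_atlanta = datetime.strptime(local_time_at_destination, "%I:%M %p")
hours, minutes = map(int, time_to_destination.split(":"))

# Landing time = current destination time + time remaining
arrival = now_in_atlanta + timedelta(hours=hours, minutes=minutes)
print(arrival.strftime("%I:%M %p").lstrip("0"))  # 4:01 PM
```

Two lines of actual logic. The hard part is never the math; it’s deciding which field means what.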
“A model that only reasons correctly when everything is pre-digested for it has a ceiling. I wanted to find that ceiling.”
I would never set up a production AI agent to work against data like this without cleaning it first. But I thought it was worth testing anyway because real world data is messy. I wanted to find that ceiling.
On the Plane — Initial Mobile Tests
I had the two smaller Gemma 4 models running on my phone during the flight.
❌ Gemma 4 E2B
Just refused. Every prompt, every framing, it kept saying it didn’t have enough information to answer. It’s the smallest model in the family, so a failing grade, but not a surprising one at that scale.
❌ Gemma 4 E4B
Actually engaged with the image and thought for about 10 seconds. It came back with 2:52 PM. Its reasoning found the destination time field, read it correctly, and then just treated it as the arrival time without ever questioning whether that interpretation made sense. The 1:09 was right there. It never asked itself why that field existed.
Two for two wrong on the plane.
Back at the Office — Desktop Tests
When I landed I wanted to push further. I re-ran the Gemma 4 mobile results through LM Studio to get proper logs, then added several more models to the mix.
Gemma 4 26B MoE had been on my radar specifically because the benchmark numbers coming out of Google DeepMind looked strong — but I also know that benchmarks can be over-optimized for. Models get really good at benchmark tasks in ways that don’t always translate to real world reasoning.
I also added Gemma 3 27B, because it’s the model that shows up in Google’s own marketing materials and benchmark comparisons, and GPT OSS 20B, because I was curious how it would handle the same challenge.
For models that couldn’t process images directly I converted the screen to a markdown table and passed that as text. I want to be clear about that — those models got an easier version of the test. The visual interpretation step was already done for them.
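That conversion was just a flattening step. A hypothetical sketch of what it amounts to, showing only the two decisive fields (the real screen had more, and the helper below is illustration, not the exact script I used):

```python
# Flatten screen fields into a markdown table for text-only models.
# Only the two fields that matter for the question are shown here;
# altitude, wind speed, ground speed, etc. are omitted.
fields = {
    "Time to Destination": "1:09",
    "Local Time at Destination": "2:52 PM",
}

lines = ["| Field | Value |", "|---|---|"]
lines += [f"| {name} | {value} |" for name, value in fields.items()]
markdown_table = "\n".join(lines)
print(markdown_table)
```

Note that this hands the text-only models a pre-solved perception problem: every field is already isolated and labeled. The interpretation problem, however, survives the conversion intact.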
❌ Gemma 3 27B
No thinking capability, just an instant response. It grabbed “Local Time at Destination: 2:52 PM” and output it directly. Fastest answer of the whole test. This is the model that dominates Google’s benchmark slides. Worth keeping in mind.
✅ Gemma 4 26B MoE — The One That Got It Right
It thought for about 51 seconds and came back with 4:01 PM. The correct answer. But more importantly, how it got there was different from everything else. Midway through its reasoning it stopped and asked itself whether “Local Time at Destination” meant the current time in Atlanta or the arrival time. It resolved that question correctly and then did the math. That self-interrogation — pausing to question what a field actually represents before acting on it — is the step every other model skipped.
“That self-interrogation — pausing to question what a field actually represents before acting on it — is the step every other model skipped.”
❌ Ministral 3 14B
Spent 37 seconds thinking, wrote out all the right numbers in its chain of thought, noted the 1:09 remaining, and still landed on 2:52 PM. It built a timezone offset explanation that sounded reasonable but was solving the wrong problem entirely. The reasoning was running but it wasn’t really driving the answer.
❌ GPT OSS 20B
The one that surprised me most. It actually did the right math in its chain of thought: 1:52 plus 1:09 gets you to around 3:01, which is 4:01 PM once you account for the hour jump from Central to Eastern time. Then it looked at the 2:52 PM destination field, decided that must be the arrival time accounting for the timezone difference, and threw its own correct math away. It had it. And then it talked itself out of it in about 2.7 seconds.
❌ Phi-4 Reasoning Plus
Received the markdown table rather than the raw image, so it had the easiest version of the test, and still spent 49 seconds on the problem. It listed every field, including Time to Destination: 1:09, and never once asked what that field implied. Came back with 2:52 PM. Nearly 50 seconds of thinking, structured data handed to it, still wrong.
The Scorecard
| Model | Tested On | Vision | Answer | Correct | CoT Time |
|---|---|---|---|---|---|
| Gemma 4 E2B | Mobile + Desktop* | ✅ | Refused | ❌ | — |
| Gemma 4 E4B | Mobile + Desktop* | ✅ | 2:52 PM | ❌ | ~10s |
| Gemma 3 27B | Desktop | ✅ | 2:52 PM | ❌ | None |
| Gemma 4 26B MoE | Desktop | ✅ | 4:01 PM ✅ | ✅ | ~51s |
| Ministral 3 14B | Desktop | ✅ | 2:52 PM | ❌ | 37.46s |
| GPT OSS 20B | Desktop | ❌ ** | 2:52 PM | ❌ | 2.73s |
| Phi-4 Reasoning Plus | Desktop | ❌ ** | 2:52 PM | ❌ | 49.29s |
Correct answer: 4:01 PM

\* Initially tested on mobile during flight, re-run on desktop via LM Studio for clean chain of thought logs and timing.

\*\* No vision capability — received pre-converted markdown table instead of raw image.
What I Actually Took Away From This
Seven models. One correct answer. And I’m not drawing huge conclusions from one informal test — I know the 26B MoE is nowhere close to the top tier models like Gemini, GPT-5, or the upper end of Claude. That’s not the point.
The Key Insight
Thinking time alone didn’t determine who got it right. GPT OSS 20B had the answer and rationalized it away in under 3 seconds. Phi-4 thought for nearly 50 seconds with easier data and still missed it. The difference wasn’t horsepower — it was whether the model stopped to question its own interpretation before committing to an answer.
I’m mostly rooting for efficiency in this space. Smaller models that genuinely reason well use less energy and cost less to run, which makes deploying AI and autonomous agents far more affordable and simply good business sense. Seeing a model this size solve a real-world problem where comparable models failed, on a real task rather than a benchmark, is exactly the signal I was looking for. The massive frontier models can already do this; the real progress lies in these leaps in efficiency.
The seat-back screen on that flight turned out to be a better test than anything I could have designed on purpose. And Rachael got an hour of peace and quiet to read. Everyone won.
All tests run with thinking enabled where supported. Initial mobile tests conducted during flight from Kansas City to Atlanta. All results verified and chain of thought logs captured via LM Studio on desktop after landing. Models without vision capability received flight data as a pre-converted markdown table.