Which model has the best vision?

It depends on the image type. Gemini 2.5 Pro wins on text-in-image (OCR, tables, scanned pages). Claude 4.7 wins on hand-drawn diagrams and whiteboards. GPT-5 is the most consistent generalist. Pick by content type, not blanket preference.

Why do vision answers often hallucinate details?

Because the default "what's in this image?" prompt is too broad. Models pattern-match to "things people usually ask about this image type" and confabulate. Narrow the question and the hallucination rate drops sharply.

Does image resolution matter?

Yes. Feed the highest-resolution version you have. Most tools downsample images to fit a token budget, but a high-res source keeps more legible detail after that step than a small one does. Tiny screenshots produce worse answers than the same screenshot captured at 2x.

Vision prompts that work: screenshots, diagrams, whiteboards

Vision in AI chats works, but the default prompts don't. "What's in this image?" produces a description-of-everything that's usually wrong on the detail you actually cared about. The fix is naming the specific element you want examined and the job you want done. This guide covers the three image types you'll upload most — UI screenshots, technical diagrams, and hand-drawn whiteboards — with the prompts that work for each.

UI screenshots

Job examples: "Why does this look broken?" / "What's wrong with this layout?" / "What would you change about this design?"

Best model: Gemini 2.5 Pro for text-heavy UIs (tables, dashboards). GPT-5 for general UI critique.

The prompt that works:

Attached: a screenshot of <what app, what screen>.

I'm concerned about <specific element — the modal, the spacing
around the CTA, the readability of the chart>. Look at that
specifically.

Tell me what's working, what isn't, and one change that would
improve it the most.

Don't ask broad UX questions on a single screenshot. Models can spot specific problems if you point at them; they make up general problems if you don't.

Technical diagrams (boxes-and-arrows, sequence diagrams, ER diagrams)

Job examples: "Explain this architecture" / "What's missing from this sequence?" / "Where is the bottleneck?"

Best model: Claude 4.7 — consistently the best at following arrows and labels in technical diagrams.

The prompt that works:

Attached: a <type> diagram of <system>.

Read it as a system. Tell me:
1. What it's modeling.
2. The three nodes that look most loaded with responsibility.
3. Any missing relationships you'd expect for a system of this kind.

If a label is unclear, ask before guessing.

The "ask before guessing" line is important — vision models will otherwise invent labels for boxes whose text they can't read.

Whiteboards (hand-drawn, photo of a wall)

Job examples: "Turn this whiteboard into a clean diagram" / "What did we decide here?" / "Reconstruct this list as markdown"

Best model: Claude 4.7 — best handwriting recognition AND best at preserving the structure of hand-drawn flowcharts.

The prompt that works:

Attached: a photo of a whiteboard. We were discussing <topic>.

I want to reconstruct this so it's readable by someone who wasn't
in the room. Return:
1. A clean text version of every readable label.
2. A description of the structure (boxes, arrows, groupings).
3. Anything you can't read, flagged as [illegible].

Don't fill in [illegible] with guesses.

The flag-illegible line is the difference between a reconstruction you can trust and one you can't.
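If you process whiteboard reconstructions in a script, the [illegible] flag also gives you something to check mechanically. The following is a minimal sketch, assuming the model's reply comes back as plain text; the function name `illegible_report` and the sample reply are hypothetical, made up for illustration.

```python
import re

# Matches the [illegible] flag requested in the whiteboard prompt.
ILLEGIBLE = re.compile(r"\[illegible\]", re.IGNORECASE)


def illegible_report(reply: str) -> dict:
    """Count [illegible] flags and collect the lines that contain them."""
    flagged_lines = [
        line.strip() for line in reply.splitlines() if ILLEGIBLE.search(line)
    ]
    return {"count": len(ILLEGIBLE.findall(reply)), "lines": flagged_lines}


# Hypothetical reply text, not real model output.
sample_reply = """1. Labels: Auth service, [illegible], Billing
2. Structure: Auth service -> [illegible] -> Billing
3. Unreadable: note in the top-right corner [illegible]"""

report = illegible_report(sample_reply)
print(f"{report['count']} flagged spots to check against the photo")
for line in report["lines"]:
    print(" -", line)
```

A count of zero on a messy whiteboard photo is worth a second look, since it may mean the model guessed instead of flagging.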

A pattern that works across all three

The same four-line structure works for every image type:

Attached: <what it is>.
I want: <specific job>.
Focus on: <specific element, not the whole thing>.
For unreadable parts: flag, don't guess.

If you keep one system prompt across models, add "when images are attached, focus on the specific element named in my prompt; flag what you can't read" to the constraints section. That makes the discipline default.
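If you send image prompts from a script, the four-line structure can be turned into a small helper so the "focus" and "flag, don't guess" lines are never skipped. This is a sketch only: `build_vision_prompt` and `SYSTEM_PROMPT_CONSTRAINT` are hypothetical names, and the code builds text without calling any model API.

```python
def build_vision_prompt(what_it_is: str, job: str, focus: str) -> str:
    """Fill the four-line template; every field must be non-empty."""
    fields = {"what_it_is": what_it_is, "job": job, "focus": focus}
    cleaned = {}
    for name, value in fields.items():
        # Trim whitespace and a trailing period so the template's own
        # period is not doubled.
        text = value.strip().rstrip(".")
        if not text:
            raise ValueError(f"{name} must not be empty")
        cleaned[name] = text
    return "\n".join(
        [
            f"Attached: {cleaned['what_it_is']}.",
            f"I want: {cleaned['job']}.",
            f"Focus on: {cleaned['focus']}.",
            "For unreadable parts: flag, don't guess.",
        ]
    )


# The constraint line suggested above, kept as a constant so it can be
# appended to the constraints section of a shared system prompt.
SYSTEM_PROMPT_CONSTRAINT = (
    "When images are attached, focus on the specific element named in my "
    "prompt; flag what you can't read."
)

prompt = build_vision_prompt(
    what_it_is="a screenshot of the checkout page",
    job="a critique of the layout",
    focus="the spacing around the CTA",
)
print(prompt)
```

Raising an error on an empty `focus` is a deliberate choice here: it blocks the broad "what's in this image?" style of prompt before it is sent.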

Switching models per image type

If your work mixes UI screenshots, diagrams, and whiteboards in one session, switching models by image type is exactly the workflow a multi-model tool is built for. Use oran.chat (or another multi-model tool; see our comparison) so you don't have to copy-paste images between apps.
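The per-type recommendations in this guide can be written down as a small lookup, which is handy if you route images in a script. This is a rough sketch: the image-type keys and the `pick_model` function are hypothetical, and the values are the display names used in this guide, not API model identifiers.

```python
# Recommendations from this guide, keyed by hypothetical image-type labels.
MODEL_BY_IMAGE_TYPE = {
    "ui_screenshot": "GPT-5",            # general UI critique
    "technical_diagram": "Claude 4.7",   # following arrows and labels
    "whiteboard": "Claude 4.7",          # handwriting and hand-drawn structure
}

# Text-heavy UIs (tables, dashboards) get their own recommendation.
TEXT_HEAVY_UI_MODEL = "Gemini 2.5 Pro"

# The guide calls GPT-5 the most consistent generalist, so it is used
# here as the fallback for image types the guide does not cover.
FALLBACK_MODEL = "GPT-5"


def pick_model(image_type: str, text_heavy: bool = False) -> str:
    """Return the guide's suggested model name for an image type."""
    if image_type == "ui_screenshot" and text_heavy:
        return TEXT_HEAVY_UI_MODEL
    return MODEL_BY_IMAGE_TYPE.get(image_type, FALLBACK_MODEL)


for kind, heavy in [("ui_screenshot", True), ("whiteboard", False), ("photo", False)]:
    print(kind, "->", pick_model(kind, text_heavy=heavy))
```

Keeping the mapping in one dictionary means the routing is easy to update when your own results disagree with these defaults.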

More practical workflows in Playbooks.