Vision prompts that work: screenshots, diagrams, whiteboards
What to actually type when you upload an image — the prompts that get useful answers across UI screenshots, technical diagrams, and hand-drawn whiteboards.
Vision in AI chats works, but the default prompts don't. "What's in this image?" produces a description-of-everything that's usually wrong on the detail you actually cared about. The fix is naming the specific element you want examined and the job you want done. This guide covers the three image types you'll upload most — UI screenshots, technical diagrams, and hand-drawn whiteboards — with the prompts that work for each.
UI screenshots
Job examples: "Why does this look broken?" / "What's wrong with this layout?" / "What would you change about this design?"
Best model: Gemini 2.5 Pro for text-heavy UIs (tables, dashboards). GPT-5 for general UI critique.
The prompt that works:
Attached: a screenshot of <what app, what screen>.
I'm concerned about <specific element — the modal, the spacing
around the CTA, the readability of the chart>. Look at that
specifically.
Tell me what's working, what isn't, and one change that would
improve it the most.
Don't ask broad UX questions about a single screenshot. Models can spot specific problems if you point at them; they make up general problems if you don't.
Technical diagrams (boxes-and-arrows, sequence diagrams, ER diagrams)
Job examples: "Explain this architecture" / "What's missing from this sequence?" / "Where is the bottleneck?"
Best model: Claude 4.7 — consistently the best at following arrows and labels in technical diagrams.
The prompt that works:
Attached: a <type> diagram of <system>.
Read it as a system. Tell me:
1. What it's modeling.
2. The three nodes that look most loaded with responsibility.
3. Any missing relationships you'd expect for a system of this kind.
If a label is unclear, ask before guessing.
The "ask before guessing" line is important — vision models will otherwise invent labels for boxes whose text they can't read.
Whiteboards (hand-drawn, photo of a wall)
Job examples: "Turn this whiteboard into a clean diagram" / "What did we decide here?" / "Reconstruct this list as markdown"
Best model: Claude 4.7 — best handwriting recognition AND best at preserving the structure of hand-drawn flowcharts.
The prompt that works:
Attached: a photo of a whiteboard. We were discussing <topic>.
I want to reconstruct this so it's readable by someone who wasn't
in the room. Return:
1. A clean text version of every readable label.
2. A description of the structure (boxes, arrows, groupings).
3. Anything you can't read, flagged as [illegible].
Don't fill in [illegible] with guesses.
The flag-illegible line is the difference between a reconstruction you can trust and one you can't.
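If you paste the model's reconstruction into a script, the flags can double as a follow-up list. The sketch below is a minimal illustration, assuming the model used the literal `[illegible]` marker the prompt asked for. The helper name `illegible_lines` and the sample response are made up for the example, not output from any real model.

```python
import re

# Matches the marker requested in the whiteboard prompt (case-insensitive,
# in case the model writes "[Illegible]").
ILLEGIBLE = re.compile(r"\[illegible\]", re.IGNORECASE)


def illegible_lines(response: str) -> list[tuple[int, str]]:
    """Return (line_number, text) for every response line that carries the flag."""
    return [
        (number, line.strip())
        for number, line in enumerate(response.splitlines(), start=1)
        if ILLEGIBLE.search(line)
    ]


# Hypothetical response text, invented for illustration only.
sample = """1. Labels: Ingest, Queue, [illegible], Export
2. Structure: Ingest -> Queue -> [illegible] -> Export
3. Groupings: Ingest and Queue are circled together"""

flagged = illegible_lines(sample)
for number, text in flagged:
    print(f"line {number}: {text}")
```

Each flagged line is something to check against the original photo or with someone who was in the room, rather than something to let the model fill in.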
A pattern that works across all three
The structure that works for every image type:
Attached: <what it is>.
I want: <specific job>.
Focus on: <specific element, not the whole thing>.
For unreadable parts: flag, don't guess.
If you keep one system prompt across models, add "when images are attached, focus on the specific element named in my prompt; flag what you can't read" to the constraints section. That makes the discipline the default.
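If you send image prompts from a script rather than typing them, the four-line pattern is easy to template. This is a minimal sketch, not a prescribed implementation: the function name `build_vision_prompt`, the constant `SYSTEM_CONSTRAINT`, and the example values are assumptions made for illustration. It only builds text and does not call any model API.

```python
# Constraint wording taken from the guide, kept as a constant so it can be
# appended to a shared system prompt.
SYSTEM_CONSTRAINT = (
    "When images are attached, focus on the specific element named in my "
    "prompt; flag what you can't read."
)


def build_vision_prompt(what: str, job: str, focus: str) -> str:
    """Fill in the four-line pattern; reject blank fields so no prompt goes out vague."""
    fields = {"what": what, "job": job, "focus": focus}
    for name, value in fields.items():
        if not value.strip():
            raise ValueError(f"'{name}' must not be empty")
    return "\n".join([
        f"Attached: {what.strip()}.",
        f"I want: {job.strip()}.",
        f"Focus on: {focus.strip()}.",
        "For unreadable parts: flag, don't guess.",
    ])


# Hypothetical example values.
prompt = build_vision_prompt(
    what="a screenshot of a checkout page",
    job="the one change that would improve readability the most",
    focus="the spacing around the CTA",
)
print(prompt)
```

Raising on a blank `focus` is a deliberate choice here: it forces the caller to name an element instead of falling back to a describe-everything request.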
Switching models per image type
If your work mixes UI screenshots, diagrams, and whiteboards in one session, switching models by image type is the workflow to use. Use oran.chat (or another multi-model tool; see our comparison) so you don't have to copy-paste images between apps.
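If you do the switching in your own tooling, the recommendations in this guide reduce to a small lookup table. The sketch below assumes hypothetical image-type keys (`ui_text_heavy`, `ui_general`, `diagram`, `whiteboard`) and uses the model names exactly as written in this guide. Those are display names, not API identifiers, so map them to whatever identifiers your provider or tool actually expects.

```python
# Model names as given in this guide; the keys are hypothetical labels.
MODEL_BY_IMAGE_TYPE = {
    "ui_text_heavy": "Gemini 2.5 Pro",  # tables, dashboards
    "ui_general": "GPT-5",              # general UI critique
    "diagram": "Claude 4.7",            # boxes-and-arrows, sequence, ER
    "whiteboard": "Claude 4.7",         # hand-drawn, photo of a wall
}


def pick_model(image_type: str) -> str:
    """Return the guide's suggested model for an image type, or fail loudly."""
    try:
        return MODEL_BY_IMAGE_TYPE[image_type]
    except KeyError:
        known = ", ".join(sorted(MODEL_BY_IMAGE_TYPE))
        raise ValueError(
            f"unknown image type {image_type!r}; expected one of: {known}"
        ) from None


print(pick_model("whiteboard"))
```

Failing on an unknown type, rather than silently defaulting to one model, keeps a mislabeled image from being routed without anyone noticing.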
More practical workflows in Playbooks.