👁️ withVision: Instructions
Use withVision: to interact with visual, icon-based, or dynamic UIs.
What is withVision:?
withVision: is an advanced AI instruction type in GPT Driver that uses a multimodal model to analyze the entire screen as an image. Unlike standard AI instructions, which rely on OCR and icon detection, withVision: understands layout, color, visual relationships, and unlabeled UI elements.
This makes it ideal for dynamic or highly visual interfaces, where elements may shift, be styled differently, or lack accessible text or IDs.
When to Use withVision:
Use withVision: when:
Elements are unlabeled or icon-based (e.g. flags, icons, map markers).
You need to reference visual traits (e.g. "green button", "top right corner").
The UI layout or styling changes frequently, making command-based steps or text-based AI instructions brittle.
Standard AI steps fail due to missing IDs or non-standard rendering.
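For example, an icon-only element such as a country flag often exposes no text for a standard AI step to target. A minimal comparison sketch, assuming plain natural-language steps as described above (the element wording is illustrative):
Standard AI step (may fail if the flag has no accessible label or ID):
tap on the Netherlands option
Vision-based alternative:
withVision: tap on the Dutch flag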
Syntax
Prefix the natural language step with:
withVision: [your instruction]
Examples:
withVision: tap on the Dutch flag
withVision: tap the green continue button at the bottom
withVision: tap the calendar date with a green dot
withVision: tap the plus icon to add a new community, then enter name, and save
Examples of Use Cases
🏳️ Flag Selection in a Language App
withVision: tap on the Dutch flag
Use when flags or similar icons have no accessible text label.
📅 Calendar Date with Visual Indicator
withVision: tap on the date with a green dot
Helpful when visual markers indicate state, like availability or activity, and can’t be targeted via element ID.
⭐ Icon Buttons or Visual Cues Only
withVision: tap on the white star icon in the top right corner
Use when buttons are purely icon-based and not exposed in the element tree.
🔄 Complex & Fast-Changing UIs
withVision: tap the plus icon to add a new community, type the name, and save
Best for high-level flows where the layout changes frequently. The model can handle multiple actions (tapping, typing, saving) in a single instruction—even when intermediate UI details differ across app versions.
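If the intermediate screens are stable, the same flow could also be split into smaller vision steps; the single instruction above simply collapses them. A rough sketch (the community name and field wording are illustrative):
withVision: tap the plus icon to add a new community
withVision: type "Weekend Hikers" into the community name field
withVision: tap the save button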
Known Limitations
Visual edge cases (e.g. very small elements or animated/moving targets) may still require additional tuning or
wait
commands.In rare cases, the model may return incorrect tap coordinates. This is an active area of improvement, with updates expected in the coming weeks. This behavior is actively being addressed. We expect upcoming model improvements in the next few weeks to significantly reduce these cases.
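As a minimal sketch, a short pause before the vision step can help with animated or slow-loading targets (the exact wait syntax and duration depend on your GPT Driver setup and are illustrative here):
wait for 2 seconds
withVision: tap the white star icon in the top right corner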