withVision: Instructions
Use withVision: to interact with visual, icon-based, or dynamic UIs.
What is withVision:?
withVision: is an advanced AI instruction type in GPT Driver that uses a multimodal model to analyze the entire screen as an image. Unlike standard AI instructions, which rely on OCR and icon detection, withVision: understands layout, color, visual relationships, and unlabeled UI elements.
This makes it ideal for dynamic or highly visual interfaces, where elements may shift, be styled differently, or lack accessible text or IDs.
When to Use withVision:
Use withVision: when:
Elements are unlabeled or icon-based (e.g. flags, icons, map markers).
You need to reference visual traits (e.g. "green button", "top right corner").
The UI layout or styling changes frequently, making command-based steps or text-based AI instructions brittle.
Standard AI steps fail due to missing IDs or non-standard rendering.
Syntax
Prefix the natural-language step with withVision:
Examples:
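As an illustration, a step can be built by prepending the prefix to a plain-language instruction. The Python sketch below is not part of GPT Driver: the helper name with_vision, the single space after the prefix, and the sample step texts are all assumptions made for illustration.

```python
PREFIX = "withVision:"  # the prefix named in this document


def with_vision(step: str) -> str:
    """Prepend the withVision: prefix to a natural-language step.

    Hypothetical helper: it only builds the step text and does not
    call any GPT Driver API.
    """
    text = step.strip()
    if not text:
        raise ValueError("step must not be empty")
    if text.startswith(PREFIX):
        # Already prefixed, so return it unchanged.
        return text
    # Assumption: one space separates the prefix from the instruction.
    return f"{PREFIX} {text}"


# Hypothetical sample steps, for illustration only.
steps = [
    with_vision("Tap the green button in the top right corner"),
    with_vision("Tap the map marker closest to the center of the screen"),
]

for step in steps:
    print(step)
```

The resulting strings would then be entered as test steps wherever a plain AI instruction would otherwise go.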
Examples of Use Cases
🏳️ Flag Selection in a Language App
Use when flags or similar icons have no accessible text label.
📅 Calendar Date with Visual Indicator
Helpful when visual markers indicate state, like availability or activity, and can’t be targeted via element ID.
⭐ Icon Buttons or Visual Cues Only
Use when buttons are purely icon-based and not exposed in the element tree.
🔄 Complex & Fast-Changing UIs
Best for high-level flows where the layout changes frequently. The model can handle multiple actions (tapping, typing, saving) in a single instruction—even when intermediate UI details differ across app versions.
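The four scenarios above might translate into steps like the ones sketched below. The step texts are hypothetical and written only to show the shape of such instructions. They are not taken from GPT Driver's documentation, and the exact wording that works best will depend on the app under test.

```python
# Hypothetical withVision: steps, one per use case described above.
# The wording of each step is an assumption for illustration only.
USE_CASE_STEPS = {
    "Flag selection": "withVision: Tap the German flag",
    "Calendar date with visual indicator": (
        "withVision: Tap the first date marked with a green dot"
    ),
    "Icon-only button": (
        "withVision: Tap the star icon next to the first list item"
    ),
    "Complex, fast-changing flow": (
        "withVision: Open the profile, change the display name to Alex, "
        "and save the changes"
    ),
}

for use_case, step in USE_CASE_STEPS.items():
    print(f"{use_case}: {step}")
```

Each step describes the target by what is visible on screen (a flag, a colored marker, an icon) and does not rely on an element ID or text label.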
Known Limitations
Visual edge cases (e.g. very small elements or animated/moving targets) may still require additional tuning or wait commands.
In rare cases, the model may return incorrect tap coordinates. This is being actively addressed, and upcoming model improvements are expected to significantly reduce these cases in the coming weeks.