👁️ withVision: Instructions
Use withVision: to interact with visual, icon-based, or dynamic UIs.
What is withVision:?
withVision: is an advanced AI instruction type in GPT Driver that uses a multimodal model to analyze the entire screen as an image. Unlike standard AI instructions, which rely on OCR and icon detection, withVision: understands layout, color, visual relationships, and unlabeled UI elements.
This makes it ideal for dynamic or highly visual interfaces, where elements may shift, be styled differently, or lack accessible text or IDs.
When to Use withVision:
Use withVision: when:
Elements are unlabeled or icon-based (e.g. flags, icons, map markers).
You need to reference visual traits (e.g. "green button", "top right corner").
The UI layout or styling changes frequently, making command-based steps or text-based AI instructions brittle.
Standard AI steps fail due to missing IDs or non-standard rendering.
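For example, an icon-only element such as a country flag often exposes no text for a standard AI step to target. A minimal comparison sketch, assuming plain natural-language steps as described above (the element wording is illustrative):
Standard AI step (may fail if the flag has no accessible label or ID):
tap on the Netherlands option
Vision-based alternative:
withVision: tap on the Dutch flag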
Syntax
Prefix the natural language step with:
withVision: [your instruction]
Examples:
withVision: tap on the Dutch flag
withVision: tap the green continue button at the bottom
withVision: tap the calendar date with a green dot
withVision: tap the plus icon to add a new community, then enter name, and save
Examples of Use Cases
🏳️ Flag Selection in a Language App
withVision: tap on the Dutch flag
Use when flags or similar icons have no accessible text label.
📅 Calendar Date with Visual Indicator
withVision: tap on the date with a green dot
Helpful when visual markers indicate state, like availability or activity, and can’t be targeted via element ID.
⭐ Icon Buttons or Visual Cues Only
withVision: tap on the white star icon in the top right corner
Use when buttons are purely icon-based and not exposed in the element tree.
🔄 Complex & Fast-Changing UIs
withVision: tap the plus icon to add a new community, type the name, and save
Best for high-level flows where the layout changes frequently. The model can handle multiple actions (tapping, typing, saving) in a single instruction—even when intermediate UI details differ across app versions.
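If the intermediate screens are stable, the same flow could also be split into smaller vision steps; the single instruction above simply collapses them. A rough sketch (the community name and field wording are illustrative):
withVision: tap the plus icon to add a new community
withVision: type "Weekend Hikers" into the community name field
withVision: tap the save button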
Known Limitations
Visual edge cases (e.g. very small elements or animated/moving targets) may still require additional tuning or
wait
commands.In rare cases, the model may return incorrect tap coordinates. This is an active area of improvement, with updates expected in the coming weeks. This behavior is actively being addressed. We expect upcoming model improvements in the next few weeks to significantly reduce these cases.
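As a minimal sketch, a short pause before the vision step can help with animated or slow-loading targets (the exact wait syntax and duration depend on your GPT Driver setup and are illustrative here):
wait for 2 seconds
withVision: tap the white star icon in the top right corner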