GPT Driver User Guide

withVision: Instructions

Use withVision: to interact with visual, icon-based, or dynamic UIs.

What is withVision:?

withVision: is an advanced AI instruction type in GPT Driver that uses a multimodal model to analyze the entire screen as an image. Unlike standard AI instructions, which rely on OCR and icon detection, withVision: understands layout, color, visual relationships, and unlabeled UI elements.

This makes it ideal for dynamic or highly visual interfaces, where elements may shift, be styled differently, or lack accessible text or IDs.

When to Use withVision:

Use withVision: when:

  • Elements are unlabeled or icon-based (e.g. flags, icons, map markers).

  • You need to reference visual traits (e.g. "green button", "top right corner").

  • The UI layout or styling changes frequently, making command-based steps or text-based AI instructions brittle (see the comparison sketch after this list).

  • Standard AI steps fail due to missing IDs or non-standard rendering.
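
For comparison, here is a sketch of the same action written as a plain AI instruction and as a withVision: step. The button label and styling are illustrative, not taken from a specific app:

tap the Continue button
withVision: tap the green continue button at the bottom

The first step depends on the "Continue" label being detected via OCR; the second targets the button by its appearance and position, so it still works if the label is missing, restyled, or rendered as an image.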

Syntax

Prefix the natural language step with:

withVision: [your instruction] 

Examples:

withVision: tap on the Dutch flag 
withVision: tap the green continue button at the bottom
withVision: tap the calendar date with a green dot
withVision: tap the plus icon to add a new community, then enter the name and save

Examples of Use Cases

🏳️ Flag Selection in a Language App

withVision: tap on the Dutch flag

Use when flags or similar icons have no accessible text label.


📅 Calendar Date with Visual Indicator

withVision: tap on the date with a green dot

Helpful when visual markers indicate state, like availability or activity, and can’t be targeted via element ID.


⭐ Icon Buttons or Visual Cues Only

withVision: tap on the white star icon in the top right corner

Use when buttons are purely icon-based and not exposed in the element tree.


🔄 Complex & Fast-Changing UIs

withVision: tap the plus icon to add a new community, type the name, and save

Best for high-level flows where the layout changes frequently. The model can handle multiple actions (tapping, typing, saving) in a single instruction—even when intermediate UI details differ across app versions.
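
To put this in context, here is a hedged sketch of a full flow that mixes withVision: steps with standard commands. The app content ("Hiking Club") and the exact argument syntax of launchApp and assertVisible are assumptions; check the corresponding command pages for the precise form:

launchApp
withVision: tap the plus icon to add a new community
withVision: type "Hiking Club" as the community name and save
assertVisible "Hiking Club"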

Known Limitations

  • Visual edge cases (e.g. very small elements or animated/moving targets) may still require additional tuning or wait commands (see the sketch after this list).


  • In rare cases, the model may return incorrect tap coordinates. This is an active area of improvement, and upcoming model updates are expected to significantly reduce these cases.
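
For animated or slow-rendering screens, a short wait before the visual step is often enough. A minimal sketch; the duration and the exact wait syntax are assumptions, so check the wait command page for the precise form:

wait 2 seconds
withVision: tap the white star icon in the top right corner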
