Apple Unveils Ferret-UI to Understand and Act on Mobile App Screens
- Apple unveiled a new multimodal large language model (MLLM), Ferret-UI, that can understand mobile user interface (UI) screens, for example by recognizing app icons and text.
- Ferret-UI outperformed OpenAI's GPT-4V multimodal model on most UI comprehension tasks, such as icon recognition and OCR.
- The model has "referring, grounding, and reasoning capabilities" that let it fully understand UI screens and perform instructed tasks based on their contents (see the sketch after this list for how referring and grounding queries differ).
- Potential applications include advancing UI-related downstream tasks, though Apple has not detailed specific plans.
- A model that thoroughly understands UI screens could be used to improve Siri, letting it carry out complete tasks, such as booking a flight, without step-by-step instructions from the user.
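To make the distinction between referring and grounding concrete, below is a minimal Python sketch of the two query shapes: a referring query supplies a screen region and asks what it contains, while a grounding query supplies a description and asks where it is. `BoundingBox`, `UIQuery`, and the builder functions are illustrative placeholders rather than Apple's actual API, and the normalized-coordinate convention is an assumption.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """A screen region in normalized [0, 1] coordinates (an assumed convention)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class UIQuery:
    """A screenshot plus a natural-language prompt sent to the model."""
    screenshot_path: str
    prompt: str


def build_referring_query(screenshot: str, region: BoundingBox) -> UIQuery:
    """Referring: give the model a region; ask what UI element it contains."""
    prompt = (
        f"What is the widget at [{region.x_min:.2f}, {region.y_min:.2f}, "
        f"{region.x_max:.2f}, {region.y_max:.2f}]?"
    )
    return UIQuery(screenshot, prompt)


def build_grounding_query(screenshot: str, description: str) -> UIQuery:
    """Grounding: give the model a description; ask where that element is."""
    prompt = f"Where is the {description}? Answer with a bounding box."
    return UIQuery(screenshot, prompt)


if __name__ == "__main__":
    # Referring: expects a textual answer such as "the Settings gear icon".
    q1 = build_referring_query("home_screen.png", BoundingBox(0.71, 0.05, 0.79, 0.10))
    # Grounding: expects a region such as [0.71, 0.05, 0.79, 0.10].
    q2 = build_grounding_query("home_screen.png", "button that confirms the booking")
    print(q1.prompt)
    print(q2.prompt)
```

Framing both directions over the same (screenshot, prompt) pair reflects why these capabilities matter for an assistant: referring lets it explain what is on screen, while grounding lets it locate the element it needs to act on.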