Apple Unveils Ferret-UI to Understand and Act on Mobile App Screens
- Apple unveiled a new multimodal large language model (MLLM), Ferret-UI, that can understand mobile user interface (UI) screens, for example by recognizing app icons and text.
- Ferret-UI outperformed OpenAI's GPT-4V multimodal model on most UI comprehension tasks, such as icon recognition and OCR.
- The model has "referring, grounding, and reasoning capabilities" that let it fully understand UI screens and perform instructed tasks based on their contents (see the sketch after this list for how referring and grounding queries differ).
- Potential applications include advancing UI-related downstream tasks, though Apple has not detailed specific plans.
- A model that thoroughly understands UI screens could be used to improve Siri, letting it carry out complete tasks, such as booking a flight, without step-by-step instructions from the user.
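To make the distinction between referring and grounding concrete, below is a minimal Python sketch of the two query shapes: a referring query supplies a screen region and asks what it contains, while a grounding query supplies a description and asks where it is. `BoundingBox`, `UIQuery`, and the builder functions are illustrative placeholders rather than Apple's actual API, and the normalized-coordinate convention is an assumption.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """A screen region in normalized [0, 1] coordinates (an assumed convention)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class UIQuery:
    """A screenshot plus a natural-language prompt sent to the model."""
    screenshot_path: str
    prompt: str


def build_referring_query(screenshot: str, region: BoundingBox) -> UIQuery:
    """Referring: give the model a region; ask what UI element it contains."""
    prompt = (
        f"What is the widget at [{region.x_min:.2f}, {region.y_min:.2f}, "
        f"{region.x_max:.2f}, {region.y_max:.2f}]?"
    )
    return UIQuery(screenshot, prompt)


def build_grounding_query(screenshot: str, description: str) -> UIQuery:
    """Grounding: give the model a description; ask where that element is."""
    prompt = f"Where is the {description}? Answer with a bounding box."
    return UIQuery(screenshot, prompt)


if __name__ == "__main__":
    # Referring: expects a textual answer such as "the Settings gear icon".
    q1 = build_referring_query("home_screen.png", BoundingBox(0.71, 0.05, 0.79, 0.10))
    # Grounding: expects a region such as [0.71, 0.05, 0.79, 0.10].
    q2 = build_grounding_query("home_screen.png", "button that confirms the booking")
    print(q1.prompt)
    print(q2.prompt)
```

Framing both directions over the same (screenshot, prompt) pair reflects why these capabilities matter for an assistant: referring lets it explain what is on screen, while grounding lets it locate the element it needs to act on.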