Multimodal AI Programs Gain New Sensory Powers, Bringing Excitement and Concern About the Future
- New "multimodal" AI programs like ChatGPT and Google's Bard can now analyze images and audio in addition to text. This lets them describe scenes, perform visual reasoning tasks, and hold voice conversations.
- Multimodal AI has exciting applications for people with disabilities, like OpenAI's partnership with the app Be My Eyes to provide visual descriptions for blind users. Early feedback has been positive.
- These AIs work by combining separate models built for processing text, images, and other modalities. Each model encodes its input into a shared vector (embedding) space, which lets the system relate and reason over the different inputs together; a sketch of the idea follows below.
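The article does not name a specific architecture, so as a minimal sketch, a CLIP-style model (loaded here through the Hugging Face `transformers` library, an assumption for illustration) shows how a text encoder and an image encoder map their inputs into the same vector space, where they can be compared directly:

```python
# Sketch: two modality-specific encoders producing vectors in a shared space.
# CLIP and the file "dog.jpg" are illustrative assumptions, not the article's example.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # hypothetical local image
texts = ["a photo of a dog", "a photo of a cat"]

# The processor prepares each modality for its own encoder.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

# Both encoders output vectors in the same embedding space.
image_vec = model.get_image_features(pixel_values=inputs["pixel_values"])
text_vecs = model.get_text_features(
    input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
)

# Because the vectors share one space, a simple cosine similarity relates
# the image to each caption; the matching caption scores higher.
sims = torch.nn.functional.cosine_similarity(image_vec, text_vecs)
print(sims)
```

Full multimodal chatbots go further, feeding such embeddings into a single language model rather than just comparing them, but the shared vector space is the piece that lets text, images, and audio be handled by one system.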
- In the future, multimodal AI could understand and generate video, smells, and more. In 5-10 years, personal AI assistants could handle complex tasks through multiple modalities.
- Multimodal AI brings risks like hallucination and privacy breaches. But it's an important step toward artificial general intelligence, which would require human-like sensory integration.