ChatGPT mythbusting: no, it can’t hear and speak now
OpenAI announced this week that ChatGPT can “see, hear, and speak.” Following the announcement, I noticed posts on LinkedIn, some from highly followed and respected accounts, claiming that the underlying model is now multimodal with respect to audio and can, for example, generate music.*
Nope!
The reality:
The new voice features are speech recognition and text-to-speech, functioning as an interface to the existing model (the same pipeline sketched in code below)
These allow you to speak and listen rather than type
The underlying model sees the text of your conversation, not the audio
The underlying model can only consume and generate text and images
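OpenAI hasn’t published the internals of the app’s voice mode, but the pattern is easy to illustrate. Here’s a minimal sketch of the same speech-to-text → text model → text-to-speech pipeline using OpenAI’s public Python SDK. The model names, voice, and file paths are illustrative assumptions, not the app’s actual configuration; the point is that the language model in the middle only ever sees text.

```python
# Sketch of an STT -> text model -> TTS pipeline using the openai Python SDK.
# Model names, voice, and file paths are assumptions for illustration, not
# the actual internals of the ChatGPT app. Requires: pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Speech recognition: turn the user's audio into plain text.
with open("question.mp3", "rb") as audio_in:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_in,
    )

# 2. The language model sees only the transcribed text, never the audio.
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = completion.choices[0].message.content

# 3. Text-to-speech: read the model's text reply back as audio.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
with open("reply.mp3", "wb") as audio_out:
    audio_out.write(speech.content)
```

Note that steps 1 and 3 are separate speech models bolted on at the edges; nothing in step 2 touches audio.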
The hyperbolic wording of OpenAI’s announcement — saying ChatGPT can “hear” and “speak” — no doubt contributed to the mistaken idea that the model is capable of processing audio.
What about vision? Can ChatGPT “see”? The model can indeed consume and generate images, so this part is less misleading. But I could still do without the anthropomorphizing language.
*ChatGPT can write music in text form, such as lyrics or chord progressions, when prompted. It can't generate actual audio files.