End-to-end multimodal model for speech processing that generates text and audio in parallel.
[Research Preview] An end-to-end model that processes and generates both text and audio. It performs well on speech-centric tasks such as transcription, audio captioning, and speech-to-speech translation, and it supports multi-turn dialogue and text-to-speech.