Microsoft's VALL-E imitates any voice - three seconds of recording is enough

11-1-2023

Translation: machine translated

DALL-E is followed by VALL-E: Microsoft and OpenAI have created a new artificial intelligence (AI) that can imitate voices. A voice recording of just three seconds is said to be enough input for the AI.

By now we know that what photos or videos show need not actually have happened. Since ChatGPT and DALL-E, it has also been clear that a text need not come from an author's pen, nor a picture from an artist's brush. Now it's the voice's turn.

VALL-E is an AI model that Microsoft calls a "neural codec language model". It creates a voice profile from a recording and then imitates the corresponding voice. Three seconds of recorded speech are enough for the AI to reproduce the voice it hears, sounding natural and retaining its emotional colouring. It can then use that voice to read out any text. The ambient sound of the recording is retained as well. The new AI is well suited to text-to-speech functions, which in the best case could allow a book to be read aloud in the author's own voice.

Microsoft is aware that the technology can also be misused. For this reason, a protocol in future applications is meant to ensure that content created by VALL-E can be recognised as such.

Overview of how VALL-E works.
Source: Microsoft

In the examples presented by Microsoft, the AI delivers impressive results. It was trained on 60,000 hours of English speech recordings, which corresponds to a hundred times the input used for existing speech synthesis systems.

You can listen to examples of VALL-E on the demo page hosted on GitHub: https://valle-demo.github.io/. In addition to the VALL-E voice output, the three-second input recordings (Speaker Prompt) are available there. You can also hear how the entered text sounds when spoken by the original voice (Ground Truth), and under Baseline you can hear how an existing text-to-speech system sounds in comparison. Whether, when and in what form you will be able to use VALL-E yourself is still unclear.

Cover image: Shutterstock

Martin Jud

Senior Editor

martin.jud@digitecgalaxus.ch

I find my muse in everything. When I don’t, I draw inspiration from daydreaming. After all, if you dream, you don’t sleep through life.
