Microsoft has showcased its latest text-to-speech AI research with a model called VALL-E that can simulate a person’s voice from just a three-second audio sample, Engadget reports, citing Ars Technica.
The synthesized speech can match not only the timbre but also the emotional tone of the speaker, and even the acoustics of a room. It could one day be used for customized or high-end text-to-speech applications, although, like deepfakes, it carries risks of misuse.
VALL-E is what Microsoft calls a “neural codec language model.” It is derived from EnCodec, Meta’s AI-powered neural audio compression network, and generates audio from text input and short samples from the target speaker.
In a paper, the researchers describe how they trained VALL-E on 60,000 hours of English speech from more than 7,000 speakers in Meta’s LibriLight audio library. It uses the training data to infer what the target speaker would sound like when speaking the desired text input.
The team shows exactly how well this works on the VALL-E GitHub page. For every sentence they want the AI to “speak”, they provide a three-second prompt from the speaker to imitate, a “ground truth” recording of the same speaker saying another sentence for comparison, a “baseline” produced by conventional text-to-speech synthesis, and the VALL-E sample at the end.
The results are mixed: some sound machine-like and others startlingly lifelike. The model retains the emotional tone of the original samples and faithfully matches the acoustic environment, so if a speaker recorded their voice in an echoing hall, the VALL-E output also sounds as if it came from the same place.
To improve the model, Microsoft plans to scale up its training data to improve performance in prosody, speaking style, and speaker similarity. It is also exploring ways to reduce unclear or missing words.
Microsoft has chosen not to open-source the code, perhaps because of the inherent risks of an AI that can put words into someone’s mouth.
The company added that it will follow the “Microsoft AI Principles” in any further development. Because VALL-E can synthesize speech that preserves the speaker’s identity, it wrote in the “Broader Impacts” section of its conclusion, the model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker.