Meta Announces Release of ‘Voicebox’, an AI Model for Speech Generation

Ten News Network

On Jun 19, 2023

New Delhi, 19th June 2023: Tech behemoth Meta has announced that it has applied generative AI to speech-related tasks. The company has unveiled Voicebox, a tool for audio editing, sampling, and styling.

According to Meta, Voicebox could help content creators with a variety of tasks, help visually impaired people hear written messages, and allow users to speak in a foreign language.

The company claims to have made a breakthrough in generative AI for voice.

Meta said in a blog post, “We’ve developed Voicebox, the first model that can generalize to speech-generation tasks it was not specifically trained to accomplish with state-of-the-art performance.”

Voicebox can generate outputs in a variety of styles, either from scratch or by modifying a given sample. Where other generative AI systems produce images or text, Voicebox produces high-quality audio clips.

The model can already synthesize speech in six languages and perform tasks including noise removal, content editing, style conversion, and diverse sample generation.

Meta also stated that its versatile generative AI models, such as Voicebox, may provide virtual assistants and NPCs in the metaverse with natural-sounding voices.

The model supports in-context text-to-speech synthesis: given an audio sample as short as two seconds, Voicebox can match its style when generating speech from text.

The technology can also regenerate a section of speech that was interrupted by noise, or replace misspoken words, without the need to re-record the entire speech.

Given a voice sample and a passage of text in English, French, German, Spanish, Polish, or Portuguese, Voicebox can produce a reading of the text in that language, even when the sample is in a different language. This is referred to as cross-lingual style transfer.

The company stated, “This capability could be used in the future to help people communicate in a natural, authentic way even if they don’t speak the same languages.”

Furthermore, the tool’s diverse voice sampling can generate speech that is representative of how individuals speak in real life.