How to make music with speech synthesis?

If you want to compose a song, but hate your voice, don’t know a singer to work with, or just don’t have the equipment to record, you can use text-to-speech software to sing for you.

Text-to-speech tools such as “Emvoice”, “Synthesizer V Studio” or “Vocaloid” allow you to create expressive and realistic vocal lines with different voices. The vocal lines and lyrics are customizable via an interface similar to a sequencer, with vibrato, expression and even breathing sounds.

Before we see how to make your computer sing, let’s take a quick look at the history of text-to-speech.

What is text-to-speech?

Text-to-speech is a technology that converts written text into an artificial voice signal. Speech synthesis first appeared in the 1930s with the VODER (video below), but the first commercial applications were developed in the 1960s.

At that time, text-to-speech was primarily used to help people with physical disabilities communicate more easily. Since then, text-to-speech has been used in many different applications, including telephone information systems and intelligent voice assistants.

Over the decades, text-to-speech has seen many improvements. Early systems sounded very mechanical and unnatural, but modern technologies can create artificial voices that are almost indistinguishable from real human voices. Advances in artificial intelligence and machine learning have also greatly improved the quality and naturalness of synthetic voices.

Today, text-to-speech is used in a wide variety of applications, including audiobook production, navigation systems, and voice assistants on iPhone and Android devices. It remains an important communication aid for people with disabilities and gives people who cannot read access to written information.

Although text-to-speech has become quite realistic, challenges remain, especially the naturalness of artificial voices and their ability to convey emotions and the subtle nuances of spoken language.

The voice of Stephen Hawking

Stephen Hawking’s synthesized voice is remembered as a customized version of the DECtalk text-to-speech software. When Hawking began to lose his ability to speak due to the progression of his disease, amyotrophic lateral sclerosis (ALS), he began to look for an alternative way to communicate. In 1985, he began using a text-to-speech program called “Equalizer” that was installed on a Sinclair QL computer.

In 1988, he began using a DECtalk speech synthesizer that was controlled by a toggle switch. It was a robotic, monotone voice that became famous worldwide.

However, in 1988, the sound card for the Sinclair QL computer was withdrawn from the market and replaced with a model that was incompatible with the DECtalk speech synthesizer. This forced Hawking to look for an alternative. He finally opted for a voice provided by the American text-to-speech company “Speech Plus”.

This is the voice that became famous, as Hawking used it for more than three decades, until his death in 2018.

Apple and text-to-speech

Just as famous, the Macintosh SE’s text-to-speech used MacinTalk software, which was developed by Apple in the 1980s. It was a robotic, mechanical voice that quickly became emblematic of Apple computers of the time.

Interestingly, text-to-speech has since evolved significantly, with more natural and expressive voices.

What text-to-speech software can I use to make music?

When it comes to music composition, the technology has also advanced a lot, and a synthesized singing voice can at times be almost indistinguishable from a real one.

Several applications let you compose melodic lines with a synthesized voice, but realism and supported languages vary widely from one program to another.

Parameters such as vibrato, intensity and even breathing sounds are controllable and programmable. You write the notes of your melody and attach the lyrics of your song to them. Then, you can shape the articulation with the voice effects available in the software’s interface.
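To make the idea of notes, lyrics and vibrato more concrete, here is a minimal Python sketch of how a sung melody could be represented. It is only an illustration: the `Note` class, its field names and the `pitch_curve` function are hypothetical and do not correspond to the file format or API of Emvoice, Synthesizer V Studio or Vocaloid. It assumes standard tuning (A4 = MIDI note 69 = 440 Hz) and models vibrato as a simple sine-shaped pitch wobble measured in cents.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Note:
    """One sung note (hypothetical structure, not a real editor format)."""
    midi_pitch: int                   # 60 = middle C
    start_beat: float                 # position in the song, in beats
    length_beats: float               # duration, in beats
    lyric: str                        # syllable sung on this note
    vibrato_depth_cents: float = 0.0  # peak pitch deviation (100 cents = 1 semitone)
    vibrato_rate_hz: float = 5.5      # speed of the vibrato


def midi_to_hz(midi_pitch: float) -> float:
    """Convert a MIDI note number to a frequency, assuming A4 = 440 Hz."""
    return 440.0 * 2.0 ** ((midi_pitch - 69) / 12)


def pitch_curve(note: Note, tempo_bpm: float, step_s: float = 0.01) -> List[float]:
    """Sample the note's frequency every step_s seconds, with sine vibrato."""
    duration_s = note.length_beats * 60.0 / tempo_bpm
    n_steps = int(round(duration_s / step_s))
    curve = []
    for i in range(n_steps):
        t = i * step_s
        # Vibrato offset in cents, converted to a fraction of a semitone.
        cents = note.vibrato_depth_cents * math.sin(
            2 * math.pi * note.vibrato_rate_hz * t
        )
        curve.append(midi_to_hz(note.midi_pitch + cents / 100))
    return curve


# A three-note phrase; only the last, longer note gets vibrato.
melody = [
    Note(60, 0.0, 1.0, "hel"),
    Note(62, 1.0, 1.0, "lo"),
    Note(64, 2.0, 2.0, "world", vibrato_depth_cents=30.0),
]

for note in melody:
    curve = pitch_curve(note, tempo_bpm=120)
    print(f"{note.lyric!r}: {len(curve)} samples, "
          f"{min(curve):.1f} to {max(curve):.1f} Hz")
```

In a real singing synthesizer you draw these values in a piano-roll editor instead of typing them, and the engine adds many more details (phonemes, breaths, transitions between notes), but the underlying data is similar in spirit: a pitch, a timing and a syllable for every note, plus expression curves on top.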

Let’s see which are the best text-to-speech programs for music.

Emvoice

Emvoice offers several different voices such as Keela, Lucy, Jay and Thomas with different timbres and tessituras for various music styles. Emvoice is available in VST format for Mac and Windows.

https://emvoiceapp.com/

Eclipsed Sounds

Eclipsed Sounds produces quite possibly some of the best text-to-speech voices for music available today, with two different voices: Solaria (female) and Asterian (male).

Both voices run in the free Synthesizer V Studio Basic software, but reach their full potential with the Pro version, which lets you use the VST and Audio Unit plugins.

Here is a small preview of the software interface:

Solaria

Solaria is a female voice that can sing in three different languages: English, Chinese and Japanese.

https://www.eclipsedsounds.com/solaria

Asterian

Asterian is a deep male voice that sings in English only.

https://www.eclipsedsounds.com/asterian

Vocaloid

Vocaloid is a very popular singing voice synthesis software, created by Yamaha in 2004. Since then, it has undergone many updates and improvements and has been widely adopted by musicians and music producers, especially in Japan, where the success of Hatsune Miku made it famous.

The exact number of voices available in Vocaloid depends on the version and add-on packs installed, but there are usually several different voices available for each supported language.

For example, the English version of Vocaloid 5 includes voices such as “Ruby”, “Chris”, “Amy”, “Otomania” and “YAMAHA VOCALOID 5 Library”.

Vocaloid voices are created by voice production studios such as Crypton Future Media (which created Hatsune Miku), Zero-G and PowerFX.

Despite Vocaloid’s success, we prefer other voices such as Solaria or Emvoice, because Vocaloid’s sound rendering has aged rather badly and sounds robotic. 🤖

https://www.vocaloid.com/en/

Who is Hatsune Miku?

It’s hard to talk about voice synthesis without mentioning its biggest star, Hatsune Miku. She is a virtual character created by Crypton Future Media using the Vocaloid voice synthesis software (since version 2). She sings mainly in Japanese, her audience being mostly Japanese, although English and Chinese voice banks have also been released.

Her success is such that she has become a real icon of Japanese pop culture. She even gives virtual concerts, like this one in 2016:

Audiologie

Among the most realistic singing voices are Jun and Anri, produced by Audiologie. Like the Eclipsed Sounds voices, they require Synthesizer V Studio Basic (free), and more parameters are available with the Pro version of the software.

https://audiologie.us/