VALL-E has the ability to clone someone's voice, and you can see why that might be a bad thing
VALL-E is an artificial intelligence (AI) speech synthesizer developed by Microsoft. It has the ability to clone and manipulate a person's voice. The scary thing is that it's so advanced it requires only a three-second sample of someone's voice. From that sample, it can generate any word or even a whole sentence.
Not only that, it can shape speech into a wide range of emotions. It can even replicate a speaker's background noise and the acoustics of the environment they were recorded in. We can see a pretty grim time ahead, especially with scammers using this type of technology.
In the past, digital forensics experts would consider a lack of background noise an obvious sign that a voice was AI-manipulated. But the way VALL-E can mimic even the smallest details is going to make it a challenge for investigators to tell the difference.
In a way it's like an arms race: some people are creating new AI technology while others are working on ways to detect it and protect users. Can you see why there is so much regulation being discussed around the world?
So What Is VALL-E?
Well, technically VALL-E is a neural codec language model. It has been trained on 60,000 hours' worth of speech, taken from over 7,000 speakers. It's basically a massive speech database; it originates from LibriLight, an audio bank assembled by Meta.
What might be possible is to connect VALL-E with other generative AI tools like GPT-4, which would allow chatbots to have a spoken voice. We are all too familiar with how advanced these chatbots are becoming. Microsoft isn't in a rush to show the potential interactions between its AI tools just yet.
One good thing is that Microsoft has already acknowledged that bad actors would consider using VALL-E, so it is aware people might try to exploit this piece of software. Personally, though, we don't see why anyone would need to imitate someone else's voice.
We would not be surprised if Microsoft were working on some kind of counter-program to VALL-E. This would help analyse voices and tell users whether they have been manipulated using this piece of software.
Potential Threats Of AI Voice Synthesizers
This is nothing new. Even before anyone was aware VALL-E was being worked on, there were already people out there creating deepfakes to con people. You might know, or have at least heard of, someone who has fallen victim to these kinds of scams in the past. What used to be someone putting on a "voice" has now become automated with these new technologies.
Now with the introduction of VALL-E, especially if it becomes available to the general public, it will be all too simple to clone someone's voice and use it against them or their family. Sure, it might be fun to mimic others and joke around, but this is serious. We could have people cloning anyone's voice and doing whatever they like with it.
Fraud on an AI level
You might think all this is rare and that people are not falling for these kinds of scams. Well, you might actually be quite surprised to find out there have been countless incidents where big corporations have been conned out of hundreds of thousands of dollars, and in one case millions!
This type of fraud is known as voice phishing. The attacker clones someone's voice, perhaps using some social engineering to understand relationships and attitudes within the workplace. Then they mimic that behaviour and try to manipulate a member of staff into handing over funds. If VALL-E were released to the public, you could clone anyone's voice with only a few sentences; all it would take is calling a company director and recording the conversation. Scary world, right?
This is why some companies have had conversations about creating certain fail-safes to catch bad actors. Think of keywords, phrases and the like that only people at the office would know. But even then, there have been issues with people leaking company information in the past. So it might be a case of AI detecting AI to stop these kinds of scams in their tracks.
Potentially Destructive Applications
Let's face it, we are all too familiar with the speed at which this new form of AI technology is growing. Because some of these AI-generated scams have been successful, more people will want to do the same thing.
We can't be sure what might happen, but our guess is that we will see a lot more of these types of attacks in the not-too-distant future:
- Compromising bank accounts: cloning people's voices to gain access to bank accounts or impersonating banks. This already happens, but we expect these attacks to become more sophisticated; attackers might even start targeting banking headquarters.
- Framing people and faking crime evidence: picture your voice being recorded and then an individual getting an AI bot to say whatever they like. This could put you in a bad light; they might make you say something offensive or, worse, admit to a crime.
- Scamming family members: this is becoming all too common. We have heard of countless cases where scammers have duplicated children's voices and called parents or grandparents asking for money. Remember to agree keywords with your children, so you know it's them.
These are just a few examples of how these AI voice tools can be used to exploit people. We have already seen a massive influx of deepfake videos; it's only a matter of time before someone does something political and it starts WW3!
So How Do We Detect AI Voices?
Well, luckily, at the moment it's much easier to detect. There are trained individuals who are taught how to detect AI voices. There are many factors you will want to take into account when deciding whether or not the person you are talking to is real or fake. Here are some examples of what to listen for:
- Background noise: is there any? Is there too much, or does it not make sense for where the caller claims to be?
- Voice: is it fluent, or choppy?
- Long pauses: does the person on the other end of the phone take a few seconds to respond?
- Pitch: is the tone of the voice monotone, or does it otherwise sound unrealistic?
- Ask yourself: does this response sound human, or does it sound like ChatGPT wrote it?
If you are at all worried, you should record the conversation you are having and pass it on to a professional who can analyse the voice for you. The majority of the time the voice will come across as low quality, so bear this in mind the next time you have a phone conversation with a complete stranger.
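To illustrate the "pitch" check from the list above, here is a minimal sketch of how pitch variation can separate a flat, robotic signal from one with more natural variation. This is our own toy example, not how professional forensic tools work: it uses NumPy on synthetic tones rather than real speech, and the function names (`estimate_pitch`, `pitch_variation`) are made up for illustration.

```python
import numpy as np

def estimate_pitch(frame, sr, fmin=50, fmax=500):
    """Estimate the fundamental frequency (Hz) of one frame via autocorrelation."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)     # search plausible voice lags only
    lag = lo + np.argmax(corr[lo:hi])           # lag of strongest repetition
    return sr / lag

def pitch_variation(audio, sr, frame_len=2048, hop=1024):
    """Standard deviation of per-frame pitch estimates (Hz). Low = monotone."""
    pitches = [estimate_pitch(audio[i:i + frame_len], sr)
               for i in range(0, len(audio) - frame_len, hop)]
    return float(np.std(pitches))

sr = 16000
t = np.arange(sr * 2) / sr                      # two seconds of samples

# A flat, "robotic" tone: constant 200 Hz.
monotone = np.sin(2 * np.pi * 200 * t)

# A more natural-sounding tone: pitch wobbles between roughly 170 and 230 Hz.
freq = 200 + 30 * np.sin(2 * np.pi * 2 * t)
phase = 2 * np.pi * np.cumsum(freq) / sr
varied = np.sin(phase)

print(pitch_variation(monotone, sr))            # near zero
print(pitch_variation(varied, sr))              # noticeably larger
```

Real speech is far messier than pure tones, which is exactly why we still suggest handing a suspicious recording to a professional rather than relying on a script like this.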
What we recommend our readers do is prepare challenge questions or phrases that the person on the other end can repeat to you, so you can ensure it's them. It might be something as simple as "I don't know mom, I don't really like watermelon ice cubes" (random, we know). But that's the point: think of something that someone can easily remember but a bad actor would have a hard time guessing.
Sure, you could use a pass code, but the point is that you want to listen to the person read out the phrase, to ensure they didn't just make a lucky guess or obtain the pass phrase through unlawful means.
OK, some of you might not like the view we have on these new AI tools. Personally, we hate them; we don't see the need for anyone to mimic anyone else's voice. It's not needed, and if we continue developing clones of people's voices it's going to lead down a slippery slope.
We are not just talking about bad actors, we are also talking about the distasteful. Think of songs "created" by the deceased, or "celebrities" leaving messages on your phone. Can you see where we are going here?
We love text-to-speech when the voice is generic, someone who isn't real, as it's great for producing content. But when you start stealing someone else's voice, that's where we draw the line. What do you think about all this? Please leave your comments below.