Several wealthy Italian businessmen received a surprising phone call earlier this year. The caller, who sounded just like Italy’s Defence Minister Guido Crosetto, had a special request: Please send money to help us free kidnapped Italian journalists in the Middle East.

But it was not Crosetto at the other end of the line. He learned of the calls only when several of the targeted businessmen contacted him. It eventually transpired that fraudsters had used artificial intelligence (AI) to fake Crosetto’s voice.

Advances in AI technology mean it is now possible to generate ultrarealistic voiceovers and soundbites. Indeed, new research has found that AI-generated voices have become indistinguishable from real human voices. In this explainer, we unpack the possible implications.

What happened in the Crosetto case?

Several Italian entrepreneurs and businessmen received calls at the start of February, one month after Prime Minister Giorgia Meloni had secured the release of Italian journalist Cecilia Sala, who had been imprisoned in Iran.

In the calls, the “deepfake” voice of Crosetto asked the businessmen to wire approximately 1 million euros ($1.17m) to an overseas bank account, the details of which were provided during the call or in other calls purporting to be from members of Crosetto’s staff.

On February 6, Crosetto posted on X, saying he had received a call on February 4 from “a friend, a prominent entrepreneur”. That friend asked Crosetto if his office had called to ask for his mobile number. Crosetto said it had not. “I tell him it was absurd, as I already had it, and that it was impossible,” he wrote in his X post.

Crosetto added that he was later contacted by another businessman who had made a large bank transfer following a call from a “General” who provided bank account information.

“He calls me and tells me that he was contacted by me and then by a General, and that he had made a very large bank transfer to an account provided by the ‘General’. I tell him it’s a scam and inform the carabinieri [Italian police], who go to his house and take his complaint.”

Similar calls from fake Ministry of Defence officials were also made to other entrepreneurs, asking for personal information and money.

While he has reported all this to the police, Crosetto added: “I prefer to make the facts public so that no one runs the risk of falling into the trap.”

Some of Italy’s most prominent business figures, including the late fashion designer Giorgio Armani and Prada cofounder Patrizio Bertelli, were targeted in the scam. But, according to the authorities, only Massimo Moratti, the former owner of Inter Milan football club, actually sent the requested money. The police were able to trace and freeze the money from the wire transfer he made.

Moratti has since filed a legal complaint with the Milan prosecutor’s office. He told Italian media: “I filed the complaint, of course, but I’d prefer not to talk about it and see how the investigation goes. It all seemed real. They were good. It could happen to anyone.”

How does AI voice generation work?

AI voice generators typically use “deep learning” algorithms, through which the AI program studies large data sets of real human voices and “learns” pitch, enunciation, intonation and other elements of a voice.

The AI program is trained using several audio clips of the same person and is “taught” to mimic that specific person’s voice, accent and style of speaking. The generated voice or audio is also called an AI-generated voice clone.

Using natural language processing (NLP) programs, which enable it to understand, interpret and generate human language, AI can even learn to understand tonal features of a voice, such as sarcasm or curiosity.

These programs can convert text to phonetic components and then generate a synthetic voice clip that sounds like a real human.
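
To make the pipeline concrete, here is a minimal sketch of the cloning step using the open-source Coqui TTS library and its XTTS v2 model. The file names are assumptions for illustration – reference_voice.wav stands in for a short recording of the target speaker – and this is a generic sketch of the technique, not the tool used in any of the scams described here.

```python
# Minimal voice-cloning sketch using the open-source Coqui TTS library.
# Install with: pip install TTS
from TTS.api import TTS

# Load a pretrained multilingual model that supports voice cloning.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Synthesise new speech in the voice heard in the reference clip.
# A few seconds of clean audio is enough for the model to imitate
# the speaker's pitch, accent and speaking style.
tts.tts_to_file(
    text="Hello, this is a demonstration of a cloned voice.",
    speaker_wav="reference_voice.wav",  # assumed: short clip of the target speaker
    language="en",
    file_path="cloned_output.wav",
)
```

That the whole process fits in a few lines of code, with no specialist knowledge required, is a large part of why researchers are concerned about misuse.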

“Broadly speaking, we can train an AI model with thousands and thousands of hours of recordings of human voice, so the model can learn what human voices in general sound like,” Nadine Lavan, a senior lecturer in psychology at Queen Mary University of London, told Al Jazeera.

“That is then the model from which you can create AI-generated voices, either by just asking the model to go and generate a voice that has no real human counterpart or by giving the model an example of a voice and telling it to clone that voice, to create an AI-generated version of that specific voice or a deepfake,” said Lavan, who is one of the coauthors of the recent research on AI voices.

The term “deepfake” combines “deep learning” and “fake”, and refers to highly realistic AI-generated images, videos or audio. The deep learning technique behind many deepfakes, the generative adversarial network, was introduced in 2014 by Ian Goodfellow, who later became director of machine learning at Apple’s Special Projects Group; the term itself emerged online in 2017.

How good are they at impersonating someone?

Research conducted by a team at Queen Mary University of London and published in the science journal PLOS One on September 24 concluded that AI-generated voices now sound like real human voices to the people listening to them.

To conduct the research, the team generated 40 AI voice samples – some cloned from real people’s voices and some entirely new voices – using a tool called ElevenLabs. The researchers also collected 40 recordings of real people’s voices. All 80 clips were edited and cleaned for quality.

The research team used male and female voices with British, American, Australian and Indian accents in the samples. ElevenLabs offers an “African” accent as well, but the researchers found that the accent label was “too general for our purposes”.

The team recruited 50 participants aged 18 to 65 in the United Kingdom for the tests. They were asked to listen to the recordings and try to distinguish the AI voices from the real human voices. They were also asked which voices sounded more trustworthy.

The study found that while the “new” voices generated entirely by AI were less convincing to the participants, the deepfakes, or voice clones, were rated as being about as realistic as the real human voices.

Forty-one percent of AI-generated voices and 58 percent of voice clones were mistaken for real human voices.
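
To clarify what those percentages measure, here is a short illustrative computation of a “mistaken for human” rate; the listener responses below are invented for the example and are not the study’s data.

```python
# Illustrative only: computing "mistaken for human" rates from
# listener judgements. These responses are invented, not the study's data.
responses = [
    # (clip_type, listener judged the clip to be a real human voice?)
    ("generated", True), ("generated", False), ("generated", False),
    ("clone", True), ("clone", True), ("clone", False),
]

for clip_type in ("generated", "clone"):
    judged = [human for kind, human in responses if kind == clip_type]
    rate = 100 * sum(judged) / len(judged)
    print(f"{clip_type}: {rate:.0f}% mistaken for a real human voice")
```

Each AI clip that a listener labels “human” counts as a miss; the study’s 41 and 58 percent figures are these rates aggregated across all clips and participants.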

Additionally, the participants were more likely to rate British-accented voices as human than American-accented ones, suggesting that listeners were relying on accent impressions rather than on any telltale signs of synthesis – another indication of how sophisticated the AI voices have become.

More worryingly, the participants tended to rate the AI-generated voices as more trustworthy than the real human voices. This contrasts with previous research, which usually found AI voices less trustworthy – signalling, again, that AI has become particularly sophisticated at generating fake voices.

“One likely explanation for why state-of-the-art AI voice generation has become much more sophisticated recently might be that the models are now trained on vast, high-quality training data sets,” Lavan said.

“That just means that the models get much more information about how voices work, such that it can build up a more detailed picture,” Lavan said. She explained that AI can create more realistic voices by mimicking different accents, intonation, speaking patterns, even breathing sounds and speech errors.

Should we all be very worried about this?

While AI-generated audio that sounds very “human” can be useful for industries such as advertising and film editing, it can be misused in scams and to generate fake news.

Scams similar to the one that targeted the Italian businessmen are already on the rise. In the United States, there have been reports of people receiving calls featuring deepfake voices of their relatives saying they are in trouble and requesting money.

Between January and June this year, people around the world lost more than $547.2m to deepfake scams, according to data from the California-based AI company Resemble AI. The figure rose from about $200m in the first quarter to $347m in the second.

“If it only takes a few minutes [or even a few seconds] of recording of a voice to clone it in a reasonably convincing manner, one obvious concern for highly realistic AI-generated voices is identity theft,” Lavan said.

However, Lavan added that there are positive ways AI-generated voices are being used.

“Beyond risks, though, one of the most promising and compassionate applications of AI-generated voice technology is its potential to restore voices for individuals who can no longer speak, or who have limited control over their physical voice,” Lavan said.

“Today, users can choose to recreate their original voice, if that’s what they prefer, or design a completely new voice that reflects their identity and personal taste.”

Can video also be deepfaked?

Alarmingly, yes. AI programs can be used to generate deepfake videos of real people. Combined with AI-generated audio, this means video clips of people appearing to do and say things they never did can be faked very convincingly.

Furthermore, it is becoming increasingly difficult to distinguish which videos on the internet are real and which are fake.

DeepMedia, a company working on tools to detect synthetic media, estimates that about eight million deepfakes will have been created and shared online by the end of this year – a huge increase from the 500,000 that were shared online in 2023.

What else are deepfakes being used for?

Besides phone call fraud and fake news, AI deepfakes have been used to create sexual content depicting real people. Most worryingly, Resemble AI’s report, which was released in July, found that advances in AI have resulted in the industrialised production of AI-generated child sexual abuse material, which has overwhelmed law enforcement agencies globally.

In May this year, US President Donald Trump signed a bill making it a federal crime to publish intimate images of a person without their consent. This includes AI-generated deepfakes. Last month, the Australian government also announced that it would ban an application used to create deepfake nude images.