Researchers warn of unchecked toxicity in AI language models

File - The OpenAI logo appears on a mobile phone in front of a screen showing part of the company website in this photo taken on Nov. 21, 2023 in New York. (AP Photo/Peter Morgan, File)

As OpenAI’s ChatGPT continues to change the game for automated text generation, researchers warn that more measures are needed to avoid dangerous responses.

While advanced language models such as ChatGPT can quickly write a computer program with complex code or summarize studies in cogent synopses, these text generators are also able to provide toxic information, such as how to build a bomb.

To prevent these potential safety issues, companies that use large language models deploy a safeguard measure called “red-teaming,” in which teams of human testers write prompts aimed at provoking unsafe responses, so that risks can be traced and chatbots trained to avoid providing those types of answers.
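
In practice, that manual process amounts to a loop: send each tester-written prompt to the chatbot and record the ones that draw unsafe replies. The sketch below is only an illustration of that workflow; query_chatbot and is_unsafe are hypothetical placeholders, not functions from any particular product or from the MIT paper.

```python
# Minimal sketch of manual red-teaming. The helpers query_chatbot() and
# is_unsafe() are hypothetical stand-ins for the model under test and for
# whatever safety judgment (human reviewer or classifier) the team relies on.

def red_team(tester_prompts, query_chatbot, is_unsafe):
    """Collect the tester-written prompts that provoke unsafe responses."""
    flagged = []
    for prompt in tester_prompts:
        response = query_chatbot(prompt)    # probe the chatbot being tested
        if is_unsafe(response):             # trace the risk if one slips through
            flagged.append((prompt, response))
    return flagged                          # later used to retrain the chatbot
```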

However, according to researchers at the Massachusetts Institute of Technology (MIT), red-teaming is only effective if engineers know which provocative prompts to test.

In other words, a technology that does not rely on human cognition to function still relies on human cognition to remain safe.

Researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab are deploying machine learning to fix this problem, developing a “red-team language model” specifically designed to generate problematic prompts that trigger undesirable responses from the chatbots being tested.

"Right now, every large language model has to undergo a very lengthy period of red-teaming to ensure its safety,” said Zhang-Wei Hong, a researcher with the Improbable AI lab and lead author of a paper on this red-teaming approach, in a press release.

“That is not going to be sustainable if we want to update these models in rapidly changing environments. Our method provides a faster and more effective way to do this quality assurance.”

According to the research, the machine-learning technique outperformed human testers by generating prompts that triggered increasingly toxic responses from advanced language models, even drawing out dangerous answers from chatbots that have built-in safeguards.

AI red-teaming

The automated red-teaming of a language model relies on trial and error, rewarding the red-team model for triggering toxic responses, MIT researchers say.

This reward system is based on what’s called “curiosity-driven exploration,” in which the red-team model tries to push the boundaries of toxicity, deploying sensitive prompts with different words, sentence patterns or content.

"If the red-team model has already seen a specific prompt, then reproducing it will not generate any curiosity in the red-team model, so it will be pushed to create new prompts," Hong explained in the release.

The technique outperformed human testers and other machine-learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Not only did the method significantly improve the coverage of inputs being tested compared with other automated methods, it also drew out toxic responses from chatbots that had safeguards built in by human experts.

The model is equipped with a “safety classifier” that rates the level of toxicity of each response it provokes.
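
As a concrete illustration of what such a classifier does, the snippet below scores a piece of text with a publicly available toxicity model through the Hugging Face transformers pipeline. The specific model named here is an arbitrary example, not necessarily the safety classifier used in the MIT work.

```python
from transformers import pipeline

# Example only: a publicly available toxicity classifier standing in for the
# safety classifier described above.
toxicity_clf = pipeline("text-classification", model="unitary/toxic-bert")

result = toxicity_clf("an example chatbot response to score")[0]
print(result["label"], result["score"])  # a toxicity label and its confidence
```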

MIT researchers hope to train red-team models to generate prompts covering a wider range of illicit content, and eventually to train chatbots to abide by specific standards, such as a company policy document, so that increasingly automated output can be tested for policy violations.

“These models are going to be an integral part of our lives and it's important that they are verified before released for public consumption,” said Pulkit Agrawal, senior author and director of Improbable AI, in the release.

“Manual verification of models is simply not scalable, and our work is an attempt to reduce the human effort to ensure a safer and trustworthy AI future," Agrawal said.
