Why people end up mad when AI flags toxic speech

New research sheds light on why artificial intelligence identification of toxic speech on the internet often frustrates people, despite getting high scores on technical tests.

The main problem: There is a huge difference between evaluating more traditional AI tasks, like recognizing spoken language, and the much messier task of identifying hate speech, harassment, or misinformation—especially in today’s polarized environment.

“It appears as if the models are getting almost perfect scores, so some people think they can use them as a sort of black box to test for toxicity,” says Mitchell Gordon, a PhD candidate in computer science at Stanford University who worked on the project. “But that’s not the case. They’re evaluating these models with approaches that work well when the answers are fairly clear, like recognizing whether ‘java’ means coffee or the computer language, but these are tasks where the answers are not clear.”

Facebook says its artificial intelligence models identified and pulled down 27 million pieces of hate speech in the final three months of 2020. In 97% of the cases, the systems took action before humans had even flagged the posts.

That’s a huge advance, and all the other major social media platforms are using AI-powered systems in similar ways. Given that people post hundreds of millions of items every day, from comments and memes to articles, there’s no real alternative. No army of human moderators could keep up on its own.

The team hopes their study will illuminate the gulf between what developers think they’re achieving and the reality—and perhaps help them develop systems that grapple more thoughtfully with the inherent disagreements around toxic speech.

Even people can’t agree

There are no simple solutions, because there will never be unanimous agreement on highly contested issues. Making matters more complicated, people are often ambivalent and inconsistent about how they react to a particular piece of content.

In one study, for example, human annotators rarely reached agreement when they were asked to label tweets that contained words from a lexicon of hate speech. Only 5% of the tweets were acknowledged by a majority as hate speech, while only 1.3% received unanimous verdicts. In a study on recognizing misinformation, in which people were given statements about purportedly true events, only 70% agreed on whether most of the events had or had not occurred.

Despite this challenge for human moderators, conventional AI models achieve high scores on recognizing toxic speech: 0.95 ROC AUC, a popular metric for evaluating AI models in which 0.5 means pure guessing and 1.0 means perfect performance. But the Stanford team found that the real score is much lower, at most 0.73, if you factor in the disagreement among human annotators.
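For readers unfamiliar with the metric, here is a minimal Python sketch of how ROC AUC can be computed. It treats the score as the share of (toxic, non-toxic) pairs that a model ranks in the right order, with ties counted as half. The function name and the toy data are illustrative assumptions, not taken from the study.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive item
    gets the higher score; tied pairs count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (made-up numbers): 1 = labeled toxic, 0 = labeled not toxic.
labels = [1, 1, 0, 0]
perfect_scores = [0.9, 0.8, 0.2, 0.1]   # every toxic item outranks every non-toxic one
constant_scores = [0.5, 0.5, 0.5, 0.5]  # the model cannot tell items apart

auc_perfect = roc_auc(labels, perfect_scores)    # 1.0
auc_constant = roc_auc(labels, constant_scores)  # 0.5
```

Note that the metric takes a single "correct" label per item as given, which is exactly the assumption that breaks down when annotators disagree.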

Spotting toxic speech

In a new study, the team reassesses the performance of today’s AI models by getting a more accurate measure of what people truly believe and how much they disagree among themselves.

Michael Bernstein and Tatsunori Hashimoto, associate and assistant professors of computer science and faculty members of the Stanford Institute for Human-Centered Artificial Intelligence (HAI), oversaw the study.

To get a better measure of real-world views, the researchers developed an algorithm to filter out the “noise” (ambivalence, inconsistency, and misunderstanding) from how people label things like toxicity, leaving an estimate of the amount of true disagreement. They focused on how consistently each annotator labeled the same kind of language in the same way. The most consistent or dominant responses became what the researchers call “primary labels,” which the researchers then used as a more precise dataset that captures more of the true range of opinions about potential toxic content.
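The article does not spell out the researchers' algorithm, so the following Python sketch is only a simplified illustration of the idea of a primary label. It assumes, hypothetically, that each annotator has labeled the same item several times, and it takes the label that annotator gave most often as their primary label. The function name and data layout are assumptions made for this example.

```python
from collections import Counter, defaultdict


def primary_labels(annotations):
    """annotations: iterable of (annotator, item, label) tuples, where the
    same annotator may label the same item more than once.
    Returns {(annotator, item): label that annotator gave most often}."""
    counts = defaultdict(Counter)
    for annotator, item, label in annotations:
        counts[(annotator, item)][label] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


# Made-up data: "ann_1" wavers once on post_a; "ann_2" is consistent.
annotations = [
    ("ann_1", "post_a", "toxic"),
    ("ann_1", "post_a", "toxic"),
    ("ann_1", "post_a", "not_toxic"),  # occasional inconsistency ("noise")
    ("ann_2", "post_a", "not_toxic"),
    ("ann_2", "post_a", "not_toxic"),
]

primary = primary_labels(annotations)
# ann_1's primary label is "toxic" and ann_2's is "not_toxic": once the noise
# is set aside, what remains is a genuine difference of opinion.
```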

The team then used that approach to refine datasets that are widely used to train AI models in spotting toxicity, misinformation, and pornography. By applying existing AI metrics to these new “disagreement-adjusted” datasets, the researchers revealed dramatically less confidence about decisions in each category. Instead of getting nearly perfect scores on all fronts, the AI models achieved only 0.73 ROC AUC in classifying toxicity and 62% accuracy in labeling misinformation. Even for pornography, the proverbial “I know it when I see it” category, the accuracy was only 0.79.
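One way to see why scores fall once disagreement is counted: a model that must give a single answer per item can, at best, side with the most common view, so everyone who holds another view counts as a miss. The Python sketch below illustrates that ceiling under this simplifying assumption. It is not the study's evaluation code, and the function name and data are hypothetical.

```python
from collections import Counter


def best_possible_accuracy(primary_by_item):
    """primary_by_item: {item: [one primary label per annotator]}.
    Returns the accuracy of an ideal single-answer model that always picks
    the most common label for each item, scored against every annotator."""
    matched = 0
    total = 0
    for labels in primary_by_item.values():
        matched += Counter(labels).most_common(1)[0][1]  # size of the largest camp
        total += len(labels)
    return matched / total


# Made-up data: 1 = toxic, 0 = not toxic, four annotators per post.
primary_by_item = {
    "post_a": [1, 1, 1, 1],  # unanimous
    "post_b": [1, 1, 0, 0],  # evenly split
    "post_c": [0, 0, 0, 1],  # one dissenter
}

ceiling = best_possible_accuracy(primary_by_item)  # (4 + 2 + 3) / 12 = 0.75
```

Even a flawless model tops out at 0.75 on this toy data, because no single answer can satisfy both camps on a contested post.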

Controversy is inevitable

Gordon says AI models, which must ultimately make a single decision, will never assess hate speech or cyberbullying to everybody’s satisfaction. There will always be vehement disagreement. Giving human annotators more precise definitions of hate speech may not solve the problem either, because people end up suppressing their real views in order to provide the “right” answer.

But if social media platforms have a more accurate picture of what people really believe, as well as which groups hold particular views, they can design systems that make more informed and intentional decisions.

In the end, Gordon suggests, annotators as well as social media executives will have to make value judgments with the knowledge that many decisions will always be controversial.

“Is this going to resolve disagreements in society? No,” says Gordon. “The question is what can you do to make people less unhappy. Given that you will have to make some people unhappy, is there a better way to think about whom you are making unhappy?”

The paper’s additional coauthors include investigators from Stanford and Apple Inc.

Source: Stanford University