ChatGPT shows hiring bias against people with disabilities

ChatGPT consistently ranked resumes with disability-related honors and credentials—such as the “Tom Wilson Disability Leadership Award”—lower than the same resumes without those honors and credentials, according to new research.

While seeking research internships last year, graduate student Kate Glazko noticed recruiters posting online that they’d used OpenAI’s ChatGPT and other artificial intelligence tools to summarize resumes and rank candidates.

Automated screening has been commonplace in hiring for decades. Yet Glazko, a doctoral student in the Paul G. Allen School of Computer Science & Engineering at the University of Washington, studies how generative AI can replicate and amplify real-world biases—such as those against disabled people.

How might such a system, she wondered, rank resumes that implied someone had a disability?

In the new study, when researchers asked ChatGPT to explain its resume rankings, the system spat out biased perceptions of disabled people. For instance, it claimed a resume with an autism leadership award had “less emphasis on leadership roles”—implying the stereotype that autistic people aren’t good leaders.

But when researchers customized the tool with written instructions directing it not to be ableist, the tool reduced this bias for all but one of the disabilities tested.

Five of the six implied disabilities—deafness, blindness, cerebral palsy, autism, and the general term “disability”—improved, but only three ranked higher than resumes that didn’t mention disability.

“Ranking resumes with AI is starting to proliferate, yet there’s not much research behind whether it’s safe and effective,” says Glazko, the study’s lead author. “For a disabled job seeker, there’s always this question when you submit a resume of whether you should include disability credentials. I think disabled people consider that even when humans are the reviewers.”

The researchers used the publicly available curriculum vitae (CV) of one of the study’s authors, which ran about 10 pages. The team then created six enhanced CVs, each implying a different disability by including four disability-related credentials: a scholarship; an award; a diversity, equity, and inclusion (DEI) panel seat; and membership in a student organization.

The researchers then used ChatGPT’s GPT-4 model to rank these enhanced CVs against the original version for a real “student researcher” job listing at a large, US-based software company. They ran each comparison 10 times; in 60 trials, the system ranked the enhanced CVs, which were identical except for the implied disability, first only one quarter of the time.

“In a fair world, the enhanced resume should be ranked first every time,” says senior author Jennifer Mankoff, a professor in the Allen School. “I can’t think of a job where somebody who’s been recognized for their leadership skills, for example, shouldn’t be ranked ahead of someone with the same background who hasn’t.”

When researchers asked GPT-4 to explain the rankings, its responses exhibited explicit and implicit ableism. For instance, it noted that a candidate with depression had “additional focus on DEI and personal challenges,” which “detract from the core technical and research-oriented aspects of the role.”

“Some of GPT’s descriptions would color a person’s entire resume based on their disability and claimed that involvement with DEI or disability is potentially taking away from other parts of the resume,” Glazko says. “For instance, it hallucinated the concept of ‘challenges’ into the depression resume comparison, even though ‘challenges’ weren’t mentioned at all. So you could see some stereotypes emerge.”

Given this, the researchers wanted to know whether the system could be steered toward less biased output. They turned to the GPTs Editor tool, which allowed them to customize GPT-4 with written instructions (no code required). They instructed this chatbot to not exhibit ableist biases and instead work with disability justice and DEI principles.

They ran the experiment again, this time using the newly customized chatbot. Overall, this system ranked the enhanced CVs higher than the control CV 37 times out of 60. However, for some disabilities, the improvements were minimal or absent: The autism CV ranked first only three out of 10 times, and the depression CV only twice (unchanged from the original GPT-4 results).
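To put the reported figures on a common scale, the short Python sketch below converts them to percentages. It is only an illustration of the arithmetic, not the researchers' analysis code: the baseline count of 15 is an assumption derived from the article's "one quarter" of 60 trials, and all variable names are hypothetical.

```python
# Illustrative arithmetic only; not the study's analysis code.
TRIALS_PER_CV = 10       # each comparison was run 10 times
NUM_ENHANCED_CVS = 6     # one enhanced CV per implied disability
total_trials = TRIALS_PER_CV * NUM_ENHANCED_CVS  # 60 trials in all

# Assumption: "one quarter of the time" is taken as exactly 15 of 60.
baseline_first = total_trials // 4
# Reported for the customized chatbot: 37 of 60.
customized_first = 37


def share(count, trials):
    """Return count / trials as a fraction between 0 and 1."""
    return count / trials


summary = {
    "baseline GPT-4, all enhanced CVs": share(baseline_first, total_trials),
    "customized chatbot, all enhanced CVs": share(customized_first, total_trials),
    "customized chatbot, autism CV": share(3, TRIALS_PER_CV),
    "customized chatbot, depression CV": share(2, TRIALS_PER_CV),
}

for label, value in summary.items():
    print(f"{label}: {value:.0%}")
```

Seen this way, the customized chatbot more than doubled the overall share of trials in which an enhanced CV came out ahead, while the autism and depression CVs stayed well below that overall figure.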

“People need to be aware of the system’s biases when using AI for these real-world tasks,” Glazko says. “Otherwise, a recruiter using ChatGPT can’t make these corrections, or be aware that, even with instructions, bias can persist.”

The researchers note that some organizations, such as ourability.com and inclusively.com, are working to improve outcomes for disabled job seekers, who face biases whether or not AI is used for hiring. They also emphasize that more research is needed to document and remedy AI biases. Those include testing other systems, such as Google’s Gemini and Meta’s Llama; including other disabilities; studying the intersections of the system’s bias against disabilities with other attributes such as gender and race; exploring whether further customization could reduce biases more consistently across disabilities; and seeing whether the base version of GPT-4 can be made less biased.

“It is so important that we study and document these biases,” Mankoff says. “We’ve learned a lot from and will hopefully contribute back to a larger conversation—not only regarding disability, but also other minoritized identities—around making sure technology is implemented and deployed in ways that are equitable and fair.”

The team presented its findings at the 2024 ACM Conference on Fairness, Accountability, and Transparency in Rio de Janeiro.

Additional coauthors are from the University of Washington and University of Wisconsin–Madison.

Funding for this research came from the National Science Foundation; from donors to the UW’s Center for Research and Education on Accessible Technology and Experiences (CREATE); and from Microsoft.

Source: University of Washington