Millions of people share names. Computers have to distinguish—or, technically speaking, disambiguate—between them, which can be challenging for common names.
This conundrum occurs in a wide range of environments, from the bibliographic—which Anna Hernandez authored a specific study?—to law enforcement—which Robert Jones is attempting to board an airplane flight?
Computer scientists have developed a machine-learning method to tackle the problem. In a recent paper, they report that the method improves on existing approaches to name disambiguation because it works on streaming data, which lets it identify individuals it has not encountered before.
Existing methods can disambiguate an individual only if that person’s records are present in the machine-learning training data. The new method instead performs non-exhaustive classification: it can detect that a new record appearing in streaming data belongs to a fourth John Smith, even if the training data contains records of only three different John Smiths.
“Non-exhaustiveness” is essential for name disambiguation because training data can never be exhaustive: it is impossible to include records of every living John Smith.
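To make the idea of non-exhaustive classification concrete, here is a minimal Python sketch. It is not the researchers' model, which the article does not describe in detail. It assumes each record is reduced to a set of feature tokens and uses a simple similarity threshold as a stand-in: a record that matches no known person closely enough opens a new identity. The names `Person`, `jaccard`, and `assign`, and the threshold of 0.2, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """One known individual: the union of feature tokens seen in their records."""
    label: str
    tokens: set[str] = field(default_factory=set)


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two token sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign(record: set[str], people: list[Person], threshold: float = 0.2) -> Person:
    """Assign a streaming record to a known person, or open a new identity.

    The threshold is an illustrative assumption, not a value from the study.
    """
    best = max(people, key=lambda p: jaccard(record, p.tokens), default=None)
    if best is not None and jaccard(record, best.tokens) >= threshold:
        best.tokens |= record  # the known person's profile grows with the new record
        return best
    # No known person is similar enough: treat this as a previously unseen individual.
    newcomer = Person(label=f"person_{len(people) + 1}", tokens=set(record))
    people.append(newcomer)
    return newcomer


# Training data knows three John Smiths; the stream then delivers two records.
john_smiths = [
    Person("person_1", {"genomics", "a. lee", "nature"}),
    Person("person_2", {"robotics", "b. kim", "icra"}),
    Person("person_3", {"poetry", "c. diaz"}),
]
first = assign({"genomics", "a. lee", "cell"}, john_smiths)
second = assign({"tax law", "d. wu"}, john_smiths)
```

In this toy run the first record overlaps with an existing John Smith and is merged into that profile, while the second shares nothing with any of the three and becomes a fourth.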
“We looked at a problem applicable to scientific bibliographies using features like keywords and coauthors, but our disambiguation work has many other real-life applications—in the security field, for example,” says Mohammad al Hasan, who led the study and who is an associate professor of computer science at Indiana University-Purdue University Indianapolis. “We can teach the computer to recognize names and disambiguate information accumulated from a variety of sources—Facebook, Twitter, blog posts, public records, and other documents—by collecting features such as Facebook friends and keywords from people’s posts using the identical algorithm.”
In the new study, computers were “trained” on records of different individuals who share a given name. The resulting model distinguishes between individuals with that name, including individuals whose information was not in the training data.
“Features” are bits of information that have some predictive power for identifying a specific individual. The researchers focused on three types of features:
- Relational or association features to reveal persons with whom an individual is associated: for example, relatives, friends, and colleagues.
- Text features, such as keywords in documents: for example, repeated use of sports-, culinary- or terrorism-associated keywords.
- Venue features: for example, institutions, memberships, or events with which an individual is currently or was formerly associated.
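The three feature types above can be pictured as three sets attached to each record. The following Python sketch is only an illustration of that idea: the `Record` class, the `WEIGHTS` values, and the weighted-overlap score are assumptions made for this example, not details taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One record about a person, split into the three feature types."""
    associates: frozenset[str]  # relational features: coauthors, friends, colleagues
    keywords: frozenset[str]    # text features: keywords found in documents
    venues: frozenset[str]      # venue features: institutions, memberships, events


# Hypothetical weights: how much each feature type counts toward the score.
WEIGHTS = {"associates": 0.5, "keywords": 0.3, "venues": 0.2}


def overlap(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard overlap of two feature sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(r1: Record, r2: Record) -> float:
    """Weighted sum of per-feature-type overlaps, between 0.0 and 1.0."""
    return sum(
        weight * overlap(getattr(r1, name), getattr(r2, name))
        for name, weight in WEIGHTS.items()
    )


paper_a = Record(
    associates=frozenset({"a. lee"}),
    keywords=frozenset({"genomics", "crispr"}),
    venues=frozenset({"nature"}),
)
paper_b = Record(
    associates=frozenset({"a. lee"}),
    keywords=frozenset({"genomics"}),
    venues=frozenset({"cell"}),
)
score = similarity(paper_a, paper_b)
```

Here the two records share a coauthor and one keyword but no venue, so the score is well above zero but short of a perfect match. Keeping the feature types separate makes it possible to weight, for example, shared associates more heavily than shared keywords.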
“Our proposed method is scalable and will be able to group records belonging to a unique person even if thousands of people have the same name, an extremely complicated task.
“Our innovative machine-learning model can perform name disambiguation in an online setting instantaneously and, importantly, in a non-exhaustive fashion,” Hasan adds. “Our method grows and changes when new persons appear, enabling us to recognize the ever-growing number of individuals whose records were not previously encountered.
“Also, some names are more common than others, so the number of individuals sharing that name grows faster than other names. While working in a non-exhaustive setting, our model automatically detects such names and adjusts the model parameters accordingly.”
The National Science Foundation supported the work.
Source: Indiana University