Software opens Urdu to data mining

U. BUFFALO (US) — A new software system will allow for computational processing of documents in Urdu, Pakistan’s national language, providing a foundation for more accurate transliteration for English social media sites.

Facebook and Twitter play a substantial role in political unrest worldwide, but without basic electronic resources, document analysis and meaningful data mining are difficult at best.

The system will also help computer scientists develop more sophisticated methods for sentiment analysis of social media content.

“This is the first comprehensive, natural language processing system for Urdu,” says Rohini Srihari, associate professor of computer science and engineering at the University at Buffalo.

The work is published in ACM Transactions on Asian Language Information Processing.

“The system we developed provides the first full pipeline of electronic language processing capabilities in Urdu,” Srihari says. “It facilitates electronic tasks ranging from the simplest keyword search to sentiment analysis of social networks, where you use computational methods to analyze opinions in a country or culture.”

Srihari and colleagues became interested in Urdu when looking at blogs in different cultures. “The advent of the Web has really increased the amount of content in languages like Urdu,” she says.

“When you start looking at blogs in different cultures, you can really start to understand public sentiment and opinions.”

These languages lack the electronic infrastructure taken for granted in English and the European languages, such as lexicons, annotated electronic dictionaries, and well-developed ontologies that describe relationships among words and entities in documents. Without that infrastructure, analyzing social media content is problematic.

“If you are trying to do sentiment analysis—to find out what are the main topics people are talking about in a country, is there intensity building up over something and who is swaying opinion—then you must have an information extraction system,” she says.

Information extraction uses a combination of linguistics and computer science to extract salient information such as entities, relationships between entities and events from large collections of unstructured text.
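As a purely illustrative sketch (not the Urdu system described in the article, which uses far richer linguistic resources), a toy extractor might pull dates and capitalized name spans from English text with regular expressions; real information extraction systems combine such surface patterns with statistical models and lexicons:

```python
import re

def extract_entities(text):
    """Toy information extraction: find dates and multi-word
    capitalized name spans in English text with regular expressions."""
    # Dates like "14 August 1947"
    date_pat = (r'\b(?:\d{1,2}\s+)?(?:January|February|March|April|May|June|'
                r'July|August|September|October|November|December)'
                r'(?:\s+\d{1,2},?)?\s+\d{4}\b')
    # Runs of two to four capitalized words (a crude name/place heuristic)
    name_pat = r'\b(?:[A-Z][a-z]+\s+){1,3}[A-Z][a-z]+\b'
    return {
        "dates": re.findall(date_pat, text),
        "names": re.findall(name_pat, text),
    }

sample = ("Pakistan declared independence on 14 August 1947; "
          "Muhammad Ali Jinnah led the movement.")
print(extract_entities(sample))
```

The heuristics here would fail on Urdu script, which is exactly why language-specific resources of the kind the researchers built are needed first.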

“Now we have developed the first system that will recognize everything in a raw, unprocessed Urdu document,” she says. “It will be able to plot all the interesting names, dates, times, all the entities that might be of interest in a particular set of documents. That’s what allows you to start data mining, whether it’s blogs, social networks or comments on a news site.”

The new information extraction system was developed as a “pipeline of processing” that begins with simple processing, akin to looking a word up in a dictionary to find its meaning, and progresses to more complex processing, such as diagramming a sentence to find its subject and object and establishing context.
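The "pipeline of processing" idea can be sketched generically (a hypothetical illustration, not the published system's code): each stage consumes the previous stage's output, moving from raw text through tokens to tagged, structured data:

```python
def tokenize(text):
    # Stage 1: split raw text into tokens (simplified whitespace split;
    # real Urdu segmentation is much harder, as discussed below).
    return text.split()

def tag_parts_of_speech(tokens):
    # Stage 2: attach a part-of-speech label to each token.
    # A tiny lookup table stands in for a real statistical tagger.
    lexicon = {"the": "DET", "cat": "NOUN", "sat": "VERB",
               "on": "ADP", "mat": "NOUN"}
    return [(tok, lexicon.get(tok.lower(), "UNK")) for tok in tokens]

def extract_phrases(tagged):
    # Stage 3: group determiner+noun pairs into simple phrases,
    # a stand-in for deeper parsing and entity extraction.
    phrases, i = [], 0
    while i < len(tagged):
        if i + 1 < len(tagged) and tagged[i][1] == "DET" and tagged[i + 1][1] == "NOUN":
            phrases.append(f"{tagged[i][0]} {tagged[i + 1][0]}")
            i += 2
        else:
            i += 1
    return phrases

def pipeline(text):
    # Later stages build on the output of earlier ones.
    return extract_phrases(tag_parts_of_speech(tokenize(text)))

print(pipeline("the cat sat on the mat"))  # → ['the cat', 'the mat']
```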

The system performs several functions: word segmentation, which divides running text into individual words; part-of-speech tagging, which labels each word as a noun, verb, or other part of speech; and named-entity tagging, which identifies names of people, places, organizations, dates, and other specific pieces of information and translates them into English.
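Word segmentation is nontrivial in Urdu script, where spaces are not reliable word boundaries. One classic technique for such languages, sketched here with Latin letters standing in for Urdu characters (an assumption for readability; this is not the paper's algorithm), is greedy maximum matching against a lexicon:

```python
def max_match(text, lexicon, max_len=6):
    """Greedy maximum-matching segmentation: at each position, take
    the longest lexicon entry that fits; fall back to a single
    character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon or length == 1:
                words.append(candidate)
                i += length
                break
    return words

# A toy lexicon; a real system would use the Urdu electronic
# dictionaries the researchers assembled.
lexicon = {"data", "mining", "urdu"}
print(max_match("datamining", lexicon))  # → ['data', 'mining']
```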

The work will be helpful in the development of similar systems for other languages that lack basic, electronic resources, such as Dari and Somali, Srihari says.
