STANFORD (US) — A new online database analyzes newspaper vocabulary from the 1820s onward to reveal which topics have preoccupied Texas communities.
An all-consuming public interest in family, religion, and football in modern rural Texas is just one of the cultural snapshots that can be culled from Mapping Texts, a new interactive database that generates graphical interpretations of language trends embedded in over 230,000 pages of Texas newspapers from the late 1820s through the early 2000s.
A page from The Schulenburg Sticker, one of the dozens of historical newspapers cataloged in Mapping Texts. (Courtesy of University of North Texas Library)
A collaborative initiative between the University of North Texas and Stanford University’s Bill Lane Center for the American West, Mapping Texts is sponsored by a Digital Humanities Start-Up Grant from the National Endowment for the Humanities.
The project team—led by Andrew J. Torget from the University of North Texas and Jon Christensen from Stanford—spent the last 18 months developing new methods for finding and analyzing vocabulary patterns in massive collections of historical newspapers.
When a visitor to the project’s website clicks on “Modern Texas,” a map of the state appears with a visualization showing the quantity and location of digitized newspapers available for analysis. A box also lists the top 10 topics discussed in the newspapers during that time span.
Government, politics, and business make an appearance, but sports, family, and church dominate the list. A quick look through this history reveals that coverage of sports elbows its way into the top 10 in the early 20th century.
In an era when historical newspapers are being digitized at an astonishing rate, extracting such meaningful patterns with a basic text search simply isn’t feasible. The primary goal of the project, explains Torget, “was to find new ways for people to make sense of the overwhelming abundance of information being made available in the digital age.”
As Christensen, executive director of the Bill Lane Center for the American West, says, “One of the biggest problems in the humanities, and the digital humanities especially, is too much information.”
Sorting through history
The largest potential impact of the project is that it allows humanities scholars to efficiently identify evidence of societal trends without the burden of sorting through hundreds of search results.
For example, between 1845 and 1861, the years from Texas’s transition from independent republic to state until the start of the Civil War, 38 newspapers across Texas were dominated by discussion of politics, cotton markets, and the fate of the Union, as evidenced by the frequent appearance of the words Texas, county, land, law, and sale. Family terms, including wife, children, and love, also rank among the 10 most popular news topics of the day.
The computer-generated graphical representations of text correlations will yield new interpretations of historical and literary ideas. Christensen, a history scholar, describes Mapping Texts as “a way of discovering new evidence that we haven’t seen before that may change our view of history.”
These language patterns suggest, for instance, that Texans began regularly commemorating the Battle of San Jacinto, the final fight of the Texas Revolution in 1836, soon after the Civil War. That is well before the early 20th century, the period when most historians thought the commemorations began.
The source material was taken from the Texas Digital Newspaper Program at the University of North Texas Library.
Two views of the past
Visitors to the Mapping Texts website can delve into the data through two different interactive visualizations.
Through graphs, timelines, and a regional map, “Mapping Newspaper Quality” presents a quantitative survey of the newspapers. Users can explore the quantity of information available for any particular time period, location, or newspaper, as well as the quality of the digitization of the newspapers. By clicking on individual newspaper titles, users can also access the original newspaper pages.
“Mapping Language Patterns” is a qualitative survey of the newspapers that visually plots major language patterns embedded in the collection. For any given time period, geography, or newspaper title, users can explore the most common words (word counts), named entities (people, places, etc.), and highly correlated words (topic models), which together provide a window into the major language patterns emanating from the newspapers.
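The simplest of these three measures, word counts for a chosen time span, can be illustrated with a short Python sketch. It is not the project's actual code: the function name `top_words`, the small stop-word list, and the sample pages are all hypothetical, and it assumes each page is available as a (year, text) pair of already-digitized text.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real analysis would use a much fuller one.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "for", "is", "on", "at"}


def top_words(pages, start_year, end_year, n=10):
    """Return the n most frequent non-stop-words on pages dated within
    [start_year, end_year], as (word, count) pairs."""
    counts = Counter()
    for year, text in pages:
        if start_year <= year <= end_year:
            # Lowercase the text and keep only runs of letters as words.
            words = re.findall(r"[a-z]+", text.lower())
            counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Invented sample data for illustration only: (year, page text) pairs.
sample_pages = [
    (1850, "Land for sale in the county. Cotton land near the river."),
    (1855, "The county court met on the law of land sale."),
    (1950, "Football season opens at the county stadium."),
]

if __name__ == "__main__":
    # Most common words in the invented 1845-1861 sample.
    print(top_words(sample_pages, 1845, 1861, n=3))
```

Changing the year range changes which pages are counted, which is the same kind of filtering a visitor performs when selecting a period on the site. Named entities and topic models require more specialized natural language processing tools than this sketch shows.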
Trends emerge with every connection plotted, and it becomes much easier for scholars to discern major themes in their research. As Christensen states, “You get a sense of the overall patterns of how things people were talking about changed over time.
“By mapping the contents of these newspapers across both time and space, as well as the quality of the OCR [optical character recognition] digitization, we aimed not just to reveal patterns and surprises in the collection that you simply would not otherwise see, but also to give researchers a concrete sense of what information is and what information is not available to them in a large digital archive.”
Composed of computer scientists, the University of North Texas team brought their natural language processing expertise to the project, while scholars from Stanford’s Bill Lane Center provided their specialized experience in digitally mapping complex historical data.
The collaboration, as the project’s website explains, allowed for the “unique opportunity to conduct experiments in what might be possible through text-mining and visualizing a large collection of historical newspapers.”
“One thing I find particularly compelling about this project,” says Brett Bobley, chief information officer at the National Endowment for the Humanities and director of the NEH Office of Digital Humanities, “is that since all the National Digital Newspaper Project pages are created using the same standards, work like the Mapping Texts project could, in theory, scale beyond the Texas newspapers to other states or even nationally.”
As millions of pages of newspapers—and other humanities materials—are scanned, “new methods for searching and analyzing the materials will become critical to scholarship,” says Bobley.
More news from Stanford: http://news.stanford.edu/