Picture digital data on a massive scale

U. TEXAS-AUSTIN (US) — By 2014, the National Archives and Records Administration expects to house more than 35 petabytes (quadrillions of bytes) of electronic data. Modern archivists face a daunting task: manage and make sense of it all.

“The National Archives is a unique national institution that responds to requirements for preservation, access, and the continued use of government records,” says Robert Chadduck, acting director for the National Archives Center for Advanced Systems and Technologies.

Researchers at the Texas Advanced Computing Center (TACC) at the University of Texas are developing scalable solutions that combine different data analysis methods into a visualization framework. The visualizations act as a bridge between the archivist and the data by interactively rendering information as shapes and colors that help illustrate the archive’s structure and content.

This snapshot corresponds to a regularly organized website containing a total of 2,000 files in different file formats. Shades of yellow highlight differing numbers of Portable Document Format (PDF) files. Purple shows patterns in file-naming conventions across directories. (Credit: Maria Esteva, Weijia Xu, Suyog Dutt Jain, and Varun Jain)

Archivists spend a significant amount of time determining the organization, contents, and characteristics of collections so they can describe them for public access purposes.

“This process involves a set of standard practices and years of experience from the archivist side,” says Weijia Xu, a data analysis expert at TACC. “To accomplish this task in large-scale digital collections, we are developing technologies that combine computing power with domain expertise.”

Knowing that human visual perception is a powerful information processing system, TACC researchers expanded on methods that take advantage of this innate skill.

In particular, they adapted the well-known treemap visualization, which is traditionally used to represent file structures, to render additional dimensions of information, such as technical metadata, file format correlations and preservation risk levels.
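For readers unfamiliar with treemaps, the basic idea can be sketched in a few lines of Python. This is only an illustration of the classic "slice-and-dice" layout, not the TACC team's implementation: the `archive` dictionary, the file names, and the function names are all made up for the example. Each file receives a rectangle whose area is proportional to its size, and each directory level alternates between splitting horizontally and vertically.

```python
def total_size(node):
    """Sum the sizes of all files beneath a node (a dict is a directory)."""
    if isinstance(node, dict):
        return sum(total_size(child) for child in node.values())
    return node


def slice_and_dice(node, x, y, w, h, path="", horizontal=True):
    """Return (path, x, y, width, height) for every file in the tree.

    Each rectangle's area is proportional to the file's size. The split
    direction flips at every directory level.
    """
    if not isinstance(node, dict):
        return [(path, x, y, w, h)]
    rects = []
    size = total_size(node)
    offset = 0.0  # fraction of the parent rectangle already used
    for name, child in node.items():
        share = total_size(child) / size if size else 0.0
        child_path = f"{path}/{name}" if path else name
        if horizontal:
            rects += slice_and_dice(child, x + offset * w, y,
                                    share * w, h, child_path, False)
        else:
            rects += slice_and_dice(child, x, y + offset * h,
                                    w, share * h, child_path, True)
        offset += share
    return rects


# Hypothetical collection: directory -> file -> size in kilobytes.
archive = {
    "reports": {"a.pdf": 300, "b.pdf": 100},
    "images": {"c.tif": 400, "d.jpg": 200},
}

rects = slice_and_dice(archive, 0.0, 0.0, 100.0, 100.0)
for rect in rects:
    print(rect)
```

In a fuller version, extra dimensions such as file format or preservation risk could be mapped to each rectangle's color, which is the kind of extension the article describes.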

This information is determined by data-driven analysis methods on the visualization’s back end. The renderings are tailored to suit the archivist’s need to compare and contrast different groups of electronic records on the fly. In this way, the archivist can assess, validate or question the results and run other analyses.

One of the back-end analysis methods developed by the team combines string alignment algorithms, a technique drawn from biology, with natural language processing methods. Applied to directory labels and file naming conventions, the method helps archivists infer whether a group of records is organized by similar names, by date, by geographical location, in sequential order, or by a combination of any of those categories.
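To give a sense of how string alignment can reveal naming conventions, here is a minimal Python sketch. It uses the textbook Needleman-Wunsch global alignment score, which is an assumption on our part; the article does not say which alignment algorithm the team uses. The scoring values and file names are hypothetical.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score between two strings."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Scale the score by the longer length; identical strings give 1.0."""
    longest = max(len(a), len(b))
    return alignment_score(a, b) / longest if longest else 1.0


# Hypothetical file names from one directory.
names = ["report_2009_01.pdf", "report_2009_02.pdf", "budget.xls"]
for other in names[1:]:
    print(names[0], other, round(similarity(names[0], other), 2))
```

File names that follow the same convention align almost perfectly and score close to 1.0, while unrelated names score much lower. Grouping names by such scores is one plausible way to surface sequential or date-based naming patterns.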

Another analysis method under development computes paragraph-to-paragraph similarity and uses clustering methods to automatically discover “stories” in large collections of email messages. These stories, made up of messages that refer to the same activity or transaction, may then become the points of access to large collections that cannot be explored manually.
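The general approach can be illustrated with a small Python sketch. It is not the researchers' method: it assumes each message is a single paragraph, measures similarity as the cosine between word-count vectors, and groups any messages whose similarity meets a hand-picked threshold. The sample messages, the threshold of 0.4, and the function names are all invented for the example.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a paragraph into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(u[term] * v[term] for term in u if term in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def cluster(messages, threshold=0.4):
    """Group message indices whose similarity meets the threshold.

    Messages linked through a chain of similar messages end up in the
    same group (single-link clustering via union-find).
    """
    vectors = [vectorize(m) for m in messages]
    parent = list(range(len(messages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(messages)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical messages belonging to two separate "stories".
messages = [
    "Please review the budget proposal for the archive project",
    "The budget proposal for the archive project needs one more review",
    "Lunch on Friday at the new taco place?",
    "Friday lunch works, see you at the taco place",
]
stories = cluster(messages)
print(stories)
```

A realistic system would likely need to discount very common words and handle far larger collections than a pairwise comparison allows, but the sketch shows how similarity scores can turn a pile of messages into groups an archivist can browse.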

To analyze terabyte-level data, the researchers distribute data and computational tasks across multiple computing nodes on TACC’s high-performance computing resource, Longhorn, a data analysis and visualization cluster funded by the National Science Foundation. This accelerates computing tasks that would otherwise take much longer on standard workstations.

The question remains as to whether archivists and the public will adapt to the abstract data representations proposed by TACC.

“A fundamental aspect of our research involves determining if the representation and the data abstractions are meaningful to archivists conducting analysis, if they allow them to have a clear and thorough understanding of the collection,” says TACC’s digital archivist Maria Esteva.

“The research addresses many of the problems associated with comprehending the preservation complexities of large and varied digital collections,” says Jennifer Lee, a librarian at the University of Texas at Austin. “The ability to assess varied characteristics and to compare selected file attributes across a vast collection is a breakthrough.”

More news from the University of Texas: www.utexas.edu/research/