The magazine for alumni and friends of the UB School of Management
Issue link: http://ubschoolofmanagement.uberflip.com/i/1449984
Buffalo Business, Spring 2022

Data mining the past
New algorithm searches historic documents to discover noteworthy people

Old newspapers provide a window into our past, and a new algorithm co-developed by a School of Management researcher is helping turn those historic documents into useful, searchable data.

Published in Decision Support Systems, the algorithm can find and rank people's names in order of importance from the results produced by optical character recognition (OCR), the computerized method of converting scanned documents into text, whose output is often messy.

"It's a known fact that when OCR software is run, very often the text gets garbled," says Haimonti Dutta, assistant professor of management science and systems. "With old newspapers, books and magazines, problems can arise from poor ink quality, crumpled or torn paper, or even unusual page layouts the software isn't expecting."

To develop the algorithm, the researchers partnered with the New York Public Library (NYPL) and analyzed more than 14,000 articles from New York City newspaper The Sun published during November and December of 1894. The NYPL has scanned more than 200,000 newspaper pages as part of Chronicling America, an initiative of the National Endowment for the Humanities and the Library of Congress that is working to develop an online, searchable database of historical newspapers from 1777 to 1963.

Their algorithm ranks people's names by importance based on a number of attributes, including the context of the name, the title before the name, article length and how frequently the name was mentioned in an article. The algorithm learns these attributes only from the text; it does not rely on external sources of information such as Wikipedia or other knowledge bases. But since the OCR text is garbled, the algorithm can't directly determine how effective these attributes are for ranking people's names.
So the researchers used statistical measures to model the many data attributes, which helped provide the desired ranking of names.

The researchers used two sets of the historic articles to test their algorithm: One set was the raw text produced from the OCR software, the other set had been cleaned up manually by New York City schoolchildren, who are using the articles to write biographies of local, notable people of the time.

When compared to the cleaned-up versions of the stories, the ranking algorithm is able to sort people's names with a high degree of precision even from the noisy OCR text.

Dutta says their process has wide-reaching implications for discovering important people throughout history.

"We recently used this technique on African American literature from the Civil War to learn more about noteworthy people during the era of slavery," says Dutta. "Going forward, we'll be expanding the technique to examine relationships between people and build out the social networks of the past."

Important historical figures identified by the new algorithm (L to R): Grand Duke George Alexandrovich of Russia, Capt. William Bainbridge-Hoff, Fanny Gordon and Chauncey Mitchell Depew. Photos: Public Domain
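
For readers curious how the ranking idea in the article might look in practice, here is a minimal Python sketch. It is not the researchers' algorithm. It scores names using only two of the attributes the article mentions (how often a name appears and whether a title precedes it), then checks how well the ranking from noisy OCR text agrees with the ranking from cleaned-up text. The title list, the weights, the name pattern and the sample sentences are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical titles and weights, chosen only for this illustration.
TITLES = ("Capt.", "Gov.", "Dr.", "Mr.", "Mrs.")
WEIGHT_MENTION = 1.0
WEIGHT_TITLE = 2.0

# Crude stand-in for name detection: two consecutive capitalized words.
NAME_RE = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b")


def score_names(text):
    """Score each candidate name by mention count plus a bonus for a title."""
    scores = Counter()
    for match in NAME_RE.finditer(text):
        name = match.group(1)
        scores[name] += WEIGHT_MENTION
        # Look at the text just before the name for a known title.
        preceding = text[:match.start()].rstrip()
        if preceding.endswith(TITLES):
            scores[name] += WEIGHT_TITLE
    return scores


def top_k(scores, k):
    """Return the k highest-scoring names, breaking ties alphabetically."""
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ordered[:k]]


def precision_at_k(noisy_ranked, clean_ranked, k):
    """Fraction of the noisy top-k that also appears in the clean top-k."""
    hits = len(set(noisy_ranked[:k]) & set(clean_ranked[:k]))
    return hits / k


# Invented sample text: a clean version and a version with OCR-style errors.
CLEAN = ("Capt. John Harlow spoke today. Capt. John Harlow praised "
         "Mary Quinn, and Mary Quinn sang for Amos Reed.")
NOISY = ("Capt. John Harlow sp0ke today. Cagt. John Harlow praised "
         "Mary Quinn, and Mary Quinn sang for Amos Reed.")

if __name__ == "__main__":
    clean_top = top_k(score_names(CLEAN), 2)
    noisy_top = top_k(score_names(NOISY), 2)
    print("Clean top 2:", clean_top)
    print("Noisy top 2:", noisy_top)
    print("Precision at 2:", precision_at_k(noisy_top, clean_top, 2))
```

In this toy example the garbled title "Cagt." costs one mention its title bonus, yet the top two names stay the same. That mirrors, in a very simplified way, the article's point that a ranking can hold up even when the OCR text is noisy.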