Locating and Parsing Bibliographical References in HTML Medical Articles

Jie Zou, Daniel X. Le, George R. Thoma. Locating and Parsing Bibliographical References in HTML Medical Articles. 2009.

Abstract

Bibliographical references that appear in journal articles can provide valuable hints for subsequent information extraction. We describe our statistical machine learning algorithms for locating and parsing such references from HTML medical journal articles. Reference locating identifies the reference sections and then decomposes them into individual references. We formulate reference locating as a two-class classification problem based on text and geometric features. An evaluation conducted on 500 articles from 100 journals achieves near perfect precision and recall rates for locating references. Reference parsing is to identify components, e.g. author, article title, journal title etc., from each individual reference. We implement and compare two reference parsing algorithms. One relies on sequence statistics and trains a Conditional Random Field. The other focuses on local feature statistics and trains a Support Vector Machine to classify each individual word, and then a search algorithm systematically corrects low confidence labels if the label sequence violates a set of predefined rules. The overall performance of these two reference parsing algorithms is about the same: above 99% accuracy at the word level, and over 97% accuracy at the chunk level.