Autonomous Citation Matching

Steve Lawrence, C. Lee Giles, Kurt D. Bollacker. Autonomous Citation Matching. In Agents. pages 392-393, 1999.

Abstract

Advances in computational resources and the communications infrastructure, and the rapid rise of the World Wide Web, have led to the increasingly widespread availability of scientific papers in electronic form. Scientific papers usually contain citations to previous work, and indices of these citations are valuable for literature search, analysis, and evaluation. Current citation indices of the scientific literature are constructed using manual effort and are typically expensive. Part of the reason for using manual effort is the great variability of citation syntax – it can be difficult to autonomously determine whether two citations refer to the same article because citations can be written in many different formats. We present machine learning techniques that identify variant forms of citations to the same paper. A number of algorithms are presented. An algorithm based on word and phrase matching is found to perform best, and is sufficiently accurate for unassisted use in an autonomous citation indexing system. An algorithm based on a string edit distance performs poorly in comparison. A computationally efficient subfield algorithm is also presented. The accuracy and efficiency of all algorithms are quantitatively compared on a number of datasets.
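
The contrast the abstract draws between edit-distance matching and word matching can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the algorithms, so the example assumes a plain Levenshtein distance normalized by string length and a Jaccard overlap of word sets as a stand-in for word matching (phrase matching and the subfield algorithm are omitted). The function names and the sample citation strings are hypothetical.

```python
import re


def normalize(citation: str) -> str:
    """Lowercase the citation and collapse punctuation into single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", citation.lower()).split())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Assumed scoring: 1 minus edit distance over the longer normalized length."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def word_similarity(a: str, b: str) -> float:
    """Assumed scoring: Jaccard overlap of the two citations' word sets."""
    wa, wb = set(normalize(a).split()), set(normalize(b).split())
    union = wa | wb
    return 1.0 if not union else len(wa & wb) / len(union)


# Two hypothetical renderings of the same reference with fields reordered.
c1 = "Lawrence, S., Giles, C. L., Bollacker, K. D. Autonomous Citation Matching. Agents, 1999."
c2 = "Autonomous Citation Matching. Lawrence, S., Giles, C. L., Bollacker, K. D. Agents, 1999."

print(f"edit similarity: {edit_similarity(c1, c2):.2f}")
print(f"word similarity: {word_similarity(c1, c2):.2f}")
```

Because the two strings contain the same words in a different order, the word-set score treats them as identical while the edit-distance score is penalized for the moved title. This is one plausible reason a word-based matcher could cope better with format variation, though the paper's own quantitative comparison is the authority on that point.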