Semantic Clustering: Making Use of Linguistic Information to Reveal Concepts in Source Code

Adrian Kuhn. Semantic Clustering: Making Use of Linguistic Information to Reveal Concepts in Source Code. Master's thesis, University of Bern, March 2006.

Abstract

Many approaches have been developed to comprehend software source code, most of them focusing on structural information about the program. In doing so, however, we miss a crucial source of information, namely the domain semantics contained in the text and symbols of the source code. To understand software as a whole, we need to enrich these approaches with conceptual insights gained from the domain semantics. This paper proposes the use of information retrieval techniques to exploit linguistic information, such as identifier names and comments in source code, to gain insights into how the domain is mapped to the code. We introduce Semantic Clustering, an algorithm that groups source artifacts based on how they use similar terms. The algorithm uses Latent Semantic Indexing. After detecting the clusters, we label them automatically and then visually explore how they are spread over the system. Our approach works at the textual level of the source code, which makes it language independent. Nevertheless, we correlate the semantics with structural information, and we apply the approach at different levels of abstraction (for example packages, classes, methods). To validate our approach, we applied it to several case studies.
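
The pipeline outlined in the abstract (terms taken from identifiers and comments, Latent Semantic Indexing, then grouping by similarity) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The sample documents, the raw term counts (no term weighting), the rank of 2, the 0.8 similarity threshold, and the simple threshold-based grouping are all assumptions chosen for illustration, as are the function names split_terms, lsi_similarity, and cluster.

```python
import re
import numpy as np

# Hypothetical "source artifacts": class names mapped to their source text.
DOCS = {
    "HttpRequestParser": "class HttpRequestParser { parseHttpRequest(header) } // parse request header",
    "HttpResponseWriter": "class HttpResponseWriter { writeHttpResponse(header) } // write response header",
    "AccountLedger": "class AccountLedger { postTransaction(amount) } // post amount to account ledger",
    "TransactionLog": "class TransactionLog { logTransaction(amount, account) } // log transaction amount",
}

def split_terms(source):
    """Extract lowercase terms, splitting camelCase and snake_case identifiers."""
    terms = []
    for word in re.findall(r"[A-Za-z]+", source):
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", word)
        terms.extend(part.lower() for part in parts)
    return terms

def lsi_similarity(documents, rank=2):
    """Return document names and their cosine similarities in a rank-reduced LSI space."""
    names = list(documents)
    bags = [split_terms(documents[name]) for name in names]
    vocabulary = sorted({term for bag in bags for term in bag})
    row = {term: i for i, term in enumerate(vocabulary)}

    # Term-document matrix of raw counts (assumption: no tf-idf weighting).
    matrix = np.zeros((len(vocabulary), len(names)))
    for col, bag in enumerate(bags):
        for term in bag:
            matrix[row[term], col] += 1.0

    # Truncated SVD: keep only the `rank` largest singular values.
    _, singular, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(rank, len(singular))
    doc_vectors = (singular[:k, None] * vt[:k]).T  # one row per document

    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = doc_vectors / norms
    return names, unit @ unit.T

def cluster(names, sim, threshold=0.8):
    """Group documents linked by similarity >= threshold (a stand-in clustering step)."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(sorted(group) for group in groups.values())

names, sim = lsi_similarity(DOCS)
groups = cluster(names, sim)
print(groups)
```

In this toy input the two HTTP-related classes share terms such as "http" and "header", while the two accounting classes share "transaction", "amount", and "account", so the reduced space separates them into two groups. Because only the text is tokenized, the same sketch applies unchanged to artifacts at other granularities, such as methods or packages.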