Semantic Clustering: Making Use of Linguistic Information to Reveal Concepts in Source Code

Adrian Kuhn. Semantic Clustering: Making Use of Linguistic Information to Reveal Concepts in Source Code. Master's thesis, University of Bern, March 2006.

Many approaches have been developed to comprehend software source code, most of them focusing on structural information about the program. In doing so, however, we miss a crucial source of information, namely the domain semantics contained in the text or symbols of the source code. To understand software as a whole, we need to enrich these approaches with conceptual insights gained from the domain semantics. This paper proposes the use of information retrieval techniques to exploit linguistic information, such as identifier names and comments in source code, to gain insights into how the domain is mapped to the code. We introduce Semantic Clustering, an algorithm that groups source artifacts based on how they use similar terms. The algorithm uses Latent Semantic Indexing. After detecting the clusters, we label them automatically and then visually explore how the clusters are spread over the system. Our approach works at the textual level of the source code, which makes it language independent. Nevertheless, we correlate the semantics with structural information, and we apply the approach at different levels of abstraction (for example packages, classes, methods). To validate our approach, we applied it to several case studies.
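The idea of grouping source artifacts by the terms they share can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes plain term-frequency vectors compared by cosine similarity and omits the Latent Semantic Indexing step (the dimensionality reduction applied to the term-document matrix). The function names, the sample artifacts, and the 0.3 similarity threshold are all hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def split_terms(source_text):
    """Split identifiers and comments into lowercase terms.

    Handles camelCase, PascalCase and snake_case; digits and
    punctuation act as separators.
    """
    terms = []
    for word in re.findall(r"[A-Za-z]+", source_text):
        # Break a word such as "parseHttpRequest" into parse / Http / Request.
        parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", word)
        terms.extend(part.lower() for part in parts)
    return terms


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(artifacts, threshold=0.3):
    """Group artifact names whose term vectors are similar.

    Simplification (assumption): single-link grouping on raw term
    frequencies, with no LSI step. `threshold` is an arbitrary value.
    """
    names = sorted(artifacts)
    vectors = {n: Counter(split_terms(artifacts[n])) for n in names}
    clusters = []
    for name in names:
        # Find every existing cluster that contains a similar member.
        linked = [
            c for c in clusters
            if any(cosine(vectors[name], vectors[m]) >= threshold for m in c)
        ]
        merged = [name]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(sorted(merged))
    return sorted(clusters)


def label(cluster_names, artifacts, top=2):
    """Label a cluster with its most frequent terms."""
    counts = Counter()
    for name in cluster_names:
        counts.update(split_terms(artifacts[name]))
    return [term for term, _ in counts.most_common(top)]


# Hypothetical artifacts: class name -> identifiers found in its source.
artifacts = {
    "AccountService": "openAccount closeAccount accountBalance deposit",
    "AccountReport": "printAccount accountBalance formatBalance",
    "HttpServer": "parseHttpRequest sendHttpResponse socket",
    "HttpClient": "sendHttpRequest readHttpResponse socket",
}

groups = cluster(artifacts)
for group in groups:
    print(group, label(group, artifacts))
```

On this sample the two account classes end up in one group and the two HTTP classes in another, because they share no terms across groups. An LSI-based version would additionally be able to relate artifacts that use different but co-occurring terms, which this sketch cannot do.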
