Title: Bug Detection and Localization using Pre-trained Code Language Models
Authors/Editors: Campos, Viola; Klein, Maike; Krupka, Daniel; Winter, Cornelia; Gergeleit, Martin; Martin, Ludger
Date available: 2024-10-21
Date issued: 2024
Type: Text/Conference Paper
Language: en
Keywords: Fault Detection; Fault Localization; AI4SE; LLM4SE
ISBN: 978-3-88579-746-3
ISSN: 1617-5468; 2944-7682
DOI: 10.18420/inf2024_124
URI: https://dl.gi.de/handle/20.500.12116/45097

Abstract: Language models for source code have improved significantly with the emergence of Transformer-based Large Language Models (LLMs). These models are trained on large amounts of code in which defects are relatively rare, causing them to perceive faulty code as unlikely and correct code as more 'natural', thus assigning it a higher likelihood. We hypothesize that the likelihood scores generated by an LLM can be directly used as a lightweight approach to detect and localize bugs in source code. In this study, we evaluate various methods to construct a suspiciousness score for faulty code segments based on LLM likelihoods. Our results demonstrate that these methods can detect buggy methods in a common benchmark with up to 78% accuracy. However, using LLMs directly for fault localization raises concerns about training data leakage, as common benchmarks are often already incorporated into the training data of such models and thus learned. By additionally evaluating our experiments on a small, non-public dataset of student submissions to programming exercises, we show that leakage is indeed an issue, as the evaluation results on both datasets differ significantly.
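
The core idea described in the abstract, treating low LLM likelihood as a signal of "unnatural" and therefore possibly buggy code, can be illustrated with a minimal sketch. This is not the authors' evaluated pipeline: the model choice (Salesforce/codegen-350M-mono) and the aggregation of token-level negative log-likelihood into a per-method mean are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact method): score each method by its
# mean token-level negative log-likelihood under a pre-trained causal code
# LM. Higher score = less "natural" to the model = more suspicious.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-350M-mono"  # assumed model; any causal code LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def suspiciousness(code: str) -> float:
    """Mean negative log-likelihood of `code` under the language model."""
    inputs = tokenizer(code, return_tensors="pt")
    with torch.no_grad():
        # With labels=input_ids, the model's loss is the mean cross-entropy
        # (i.e. mean token-level negative log-likelihood) of the sequence.
        out = model(**inputs, labels=inputs["input_ids"])
    return out.loss.item()


# Toy example: a correct and a (subtly) buggy variant of the same method.
methods = {
    "correct": "def add(a, b):\n    return a + b\n",
    "buggy":   "def add(a, b):\n    return a - b\n",
}
for name, src in methods.items():
    print(f"{name}: {suspiciousness(src):.3f}")
```

This sketch uses only one way of turning likelihoods into a suspiciousness score (mean negative log-likelihood per method); the paper evaluates several such constructions and compares their fault-detection accuracy.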