Ongoing Automated Data Set Generation for Vulnerability Prediction from Github Data

Author: Hinrichs, Torge
Editors: Christian Wressnegger, Delphine Reinhardt
Record date: 2023-01-24
Year: 2022
ISBN: 978-3-88579-717-3
ISSN: 1617-5468
DOI: 10.18420/sicherheit2022_17
URL: https://dl.gi.de/handle/20.500.12116/40138
Language: en
Keywords: Vulnerability Prediction, Vulnerability Detection, Machine Learning, data set generation

Abstract: This paper describes the development of a continuous GitHub repository analysis pipeline focused on creating a data set for vulnerability prediction in source code. Currently used data sets consist only of source code functions or methods, without additional meta information. This paper assumes that the code surrounding vulnerable functions can improve the detection rate. Testing this assumption requires large data sets, which can be created using the proposed pipeline. Although the pipeline still requires some improvements, a first test run analyzed and evaluated 1.5 million repositories. The resulting data set will be published in the future.