Browsing by keyword "clustering"
Results 1 - 8 of 8
- Text document: Chain-detection for DBSCAN (BTW 2019 – Workshopband, 2019). Held, Janis; Beer, Anna; Seidl, Thomas. Chains connecting two or more different clusters are a well-known problem of DBSCAN, probably the most famous density-based clustering algorithm. Since even a small number of points resulting from, e.g., noise can form such a chain and build a bridge between different clusters, the results of DBSCAN can be distorted: several disparate clusters get merged into one. This single-link effect is well known, but to the best of our knowledge there are no satisfactory solutions that extract those chains yet. We present a new algorithm that detects not only straight chains between clusters, but also bent and noisy ones. Users can choose between eliminating one-dimensional and higher-dimensional chains connecting clusters in order to recover the underlying cluster structure with DBSCAN. The desired straightness can also be set by the user. We tested our efficient algorithm on a dataset containing traffic accidents in Great Britain and were able to detect chains emerging from streets between cities and villages, which led to clusters composed of diverse villages.
- Conference paper: Data Leakage Through Click Data in Virtual Learning Environments (20. Fachtagung Bildungstechnologien (DELFI), 2022). Hartmann, Johanna; Heuer, Hendrik; Breiter, Andreas. Unsupervised machine learning techniques are increasingly used to cluster students based on their activity in virtual learning environments. It is commonly assumed that clusters formed by click data merely represent the actions of users and do not allow inferring personal information about individual users. Based on an analysis of 18,660 students and 5.56 million data points from the Open University Learning Analytics Dataset, we show that clusters trained on "raw" click data are highly correlated with personal information like student success, course specifics, and student demographics. Our analysis demonstrates that these clusters allow conclusions about demographic variables like previous education and the affluence of the residential area. Our investigation shows that apparently objective click data can leak private attributes. The paper discusses the implications of this for the design of virtual learning environments, especially considering the legal requirements posed by the principle of data minimization of the EU GDPR.
- Conference paper: Discovering Non-Redundant K-means Clusterings in Optimal Subspaces (INFORMATIK 2019: 50 Jahre Gesellschaft für Informatik – Informatik für Gesellschaft, 2019). Mautz, Dominik; Ye, Wei; Plant, Claudia; Böhm, Christian
- Journal article: Exploring syntactical features for anomaly detection in application logs (it - Information Technology: Vol. 64, No. 1-2, 2022). Copstein, Rafael; Karlsen, Egil; Schwartzentruber, Jeff; Zincir-Heywood, Nur; Heywood, Malcolm. In this research, we analyze the effect of lightweight syntactical feature extraction techniques from the field of information retrieval for log abstraction in information security. To this end, we evaluate three feature extraction techniques and three clustering algorithms on four different security datasets for anomaly detection. Results demonstrate that these techniques have a role to play for log abstraction in the form of extracting syntactic features, which improves the identification of anomalous minority classes, specifically in homogeneous security datasets.
- Workshop paper: How can Small Data Sets be Clustered? (Mensch und Computer 2021 - Workshopband, 2021). Weigand, Anna Christina; Lange, Daniel; Rauschenberger, Maria. In many areas, only small data sets are available and big data does not play a significant role, e.g., in Human-Centered Design research. In the context of machine learning analysis, results of small data sets can be biased due to single variables or missing values. Nevertheless, reliable and interpretable results are essential for determining further actions, such as treatments in a health-related use case. In this paper, we explore machine learning clustering algorithms on the basis of a small, health-related (variance) data set about early dyslexia screening. To this end, we selected three clustering algorithms from different clustering methods: K-Means, HAC and DBSCAN. In our case, K-Means and HAC showed promising results, while DBSCAN did not deliver distinct results. Based on our experiences, we provide initial proposals on how to handle small data set clustering and describe situations in which using Human-Centered Design methods can increase the interpretability of machine learning clustering results. Our work represents a starting point for discussing the topic of clustering small data sets.
- Text document: A Hybrid Information Extraction Approach Exploiting Structured Data Within a Text Mining Process (BTW 2019, 2019). Kiefer, Cornelia; Reimann, Peter; Mitschang, Bernhard. Many data sets encompass structured data fields with embedded free text fields. The text fields allow customers and workers to input information which cannot be encoded in structured fields. Several approaches use structured and unstructured data in isolated analyses. The result of isolated mining of structured data fields misses crucial information encoded in free text. The result of isolated text mining often mainly repeats information already available from structured data. The actual information gain of isolated text mining is thus limited. The main drawback of both isolated approaches is that they may miss crucial information. The hybrid information extraction approach suggested in this paper addresses this issue. Instead of extracting information that in large parts was already available beforehand, it extracts new, valuable information from free texts. Our solution exploits results of analyzing structured data within the text mining process, i.e., structured information guides and improves the information extraction process on textual data. Our main contributions comprise the description of the concept of hybrid information extraction as well as a prototypical implementation and an evaluation with two real-world data sets from aftersales and production with English and German free text fields.
- Conference paper: Robust Clustering-based Segmentation Methods for Fingerprint Recognition (BIOSIG 2018 - Proceedings of the 17th International Conference of the Biometrics Special Interest Group, 2018). Ferreira, Pedro M.; Sequeira, Ana F.; Cardoso, Jaime S.; Rebelo, Ana. Fingerprint recognition has been widely studied for more than 45 years and yet it remains an intriguing pattern recognition problem. This paper focuses on the foreground mask estimation, which is crucial for the accuracy of a fingerprint recognition system. The method consists of a robust cluster-based fingerprint segmentation framework incorporating an additional step to deal with pixels that were rejected as foreground in a decision considered not reliable enough. These rejected pixels are then further analysed for a more accurate classification. The procedure falls within the paradigm of classification with reject option, a viable option in several real-world applications of machine learning and pattern recognition where the cost of misclassifying observations is high. The present work expands a previous method based on fuzzy C-means clustering with two variations regarding: i) the filters used; and ii) the clustering method for pixel classification as foreground/background. Experimental results demonstrate improved results on FVC datasets compared with state-of-the-art methods, even including methodologies based on deep learning architectures.
- Journal article: Understanding the effects of temporal energy-data aggregation on clustering quality (it - Information Technology: Vol. 61, No. 2-3, 2019). Trittenbach, Holger; Bach, Jakob; Böhm, Klemens. Energy data is often available at high temporal resolution, which challenges the scalability of data-analysis methods. A common way to cope with this is to aggregate data to, say, 15-minute-interval summaries. But it is often not known how much information is lost with this, i.e., how good analysis results on aggregated data actually are. In this article, we study the effects of aggregating energy data on clustering. We propose an experimental design to compare a wide range of clustering methods found in the literature. We then introduce different ways to compare clustering results obtained with different aggregation schemes. Our evaluation shows that aggregation affects the clustering quality significantly. Finally, we propose guidelines to select an aggregation scheme.