MR-DSJ: distance-based self-join for large-scale vector data analysis with mapreduce

Data analytics faces huge and rapidly growing amounts of data, for which MapReduce provides a convenient and effective distributed programming model. Various algorithms already support massive data analysis on computer clusters, but distance-based similarity self-joins in particular lack efficient solutions for large vector data sets, even though they are fundamental to many data mining tasks, including clustering, near-duplicate detection, and outlier analysis. Our novel distance-based self-join algorithm for MapReduce, MR-DSJ, is based on grid partitioning and delivers correct, complete, and inherently duplicate-free results in a single iteration. Additionally, we propose several filter techniques that reduce the runtime and communication of the MR-DSJ algorithm. Analytical and experimental evaluations demonstrate its superiority over other join algorithms for MapReduce.
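
To make the idea of a grid-partitioned distance self-join concrete, the following is a minimal single-machine sketch in Python. It is not the MR-DSJ algorithm from the paper and does not use MapReduce; the function names `grid_self_join` and `brute_force_self_join`, the choice of cell width equal to the distance threshold `eps`, and the Euclidean metric are all assumptions made for illustration. It shows only the general principle: points are bucketed into grid cells, candidate pairs are restricted to the same or adjacent cells, and each unordered pair is emitted exactly once.

```python
import math
from collections import defaultdict
from itertools import product


def grid_self_join(points, eps):
    """Return all index pairs (i, j), i < j, whose Euclidean distance is <= eps.

    Illustrative sketch only: cell width is assumed to equal eps, so any two
    points within distance eps lie in the same cell or in adjacent cells.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    if not points:
        return []
    dim = len(points[0])

    # Partitioning step: bucket every point index by its grid cell.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cell = tuple(math.floor(x / eps) for x in p)
        cells[cell].append(idx)

    # All relative positions of a cell and its neighbors (including itself).
    offsets = list(product((-1, 0, 1), repeat=dim))

    result = []
    for cell, members in cells.items():
        for off in offsets:
            neighbor = tuple(c + o for c, o in zip(cell, off))
            # Handle each unordered pair of cells once (from the smaller cell).
            if neighbor < cell:
                continue
            others = cells.get(neighbor)
            if others is None:
                continue
            for i in members:
                for j in others:
                    # Within one cell, keep only i < j to avoid duplicates.
                    if neighbor == cell and j <= i:
                        continue
                    if math.dist(points[i], points[j]) <= eps:
                        result.append((min(i, j), max(i, j)))
    return sorted(result)


def brute_force_self_join(points, eps):
    """Reference implementation that compares all pairs, for checking results."""
    n = len(points)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(points[i], points[j]) <= eps
    ]
```

Calling `grid_self_join(points, eps)` on a list of equal-length coordinate tuples should return the same pairs as `brute_force_self_join`, while comparing far fewer candidates when the data is spread over many cells. In a distributed setting the grid cells would play the role of partition keys, but how MR-DSJ assigns and replicates points across cells is described in the paper itself and is not reproduced here.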

Seidl, Thomas; Fries, Sergej; Boden, Brigitte (2013): MR-DSJ: distance-based self-join for large-scale vector data analysis with mapreduce. Datenbanksysteme für Business, Technologie und Web (BTW) 2013. Bonn: Gesellschaft für Informatik e.V. PISSN: 1617-5468. ISBN: 978-3-88579-608-4. pp. 37-56. Regular Research Papers. Magdeburg. March 13-15, 2013

Collections

P214 - BTW2013 - Datenbanksysteme für Business, Technologie und Web
