Conference paper

MR-DSJ: distance-based self-join for large-scale vector data analysis with mapreduce

Full-text URI

Document type

Text/Conference Paper

Additional information

Date

2013

Journal title

Journal ISSN

Volume title

Publisher

Gesellschaft für Informatik e.V.

Abstract

Data analytics faces huge and rapidly growing amounts of data, for which MapReduce provides a convenient and effective distributed programming model. Various algorithms already support massive data analysis on computer clusters. Distance-based similarity self-joins, however, still lack efficient solutions for large vector data sets, although they are fundamental to many data mining tasks, including clustering, near-duplicate detection, and outlier analysis. Our novel distance-based self-join algorithm for MapReduce, MR-DSJ, is based on grid partitioning and delivers correct, complete, and inherently duplicate-free results in a single iteration. Additionally, we propose several filter techniques that reduce the runtime and communication cost of the MR-DSJ algorithm. Analytical and experimental evaluations demonstrate its superiority over other join algorithms for MapReduce.
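
The abstract does not spell out how MR-DSJ works, so the following is not the authors' algorithm. It is a minimal single-process Python sketch of the general idea the abstract names: a grid-partitioned distance self-join whose result is duplicate-free by construction. All names (`cell_of`, `map_phase`, `reduce_phase`, `grid_self_join`) are hypothetical, the cell width is assumed to equal the distance threshold `eps`, and the MapReduce shuffle is simulated with a dictionary.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def cell_of(point, eps):
    # Grid cell of a point; cells are assumed to have side length eps.
    return tuple(floor(x / eps) for x in point)


def map_phase(points, eps):
    # Emit each point to its home cell and to the "forward" half of the
    # adjacent cells (offsets that are lexicographically positive).
    dim = len(points[0])
    zero = (0,) * dim
    forward = [o for o in product((-1, 0, 1), repeat=dim) if o > zero]
    for pid, p in enumerate(points):
        home = cell_of(p, eps)
        yield home, (pid, p, True)
        for o in forward:
            neighbor = tuple(h + d for h, d in zip(home, o))
            yield neighbor, (pid, p, False)


def reduce_phase(values, eps):
    # Join home points with each other and with replicated foreign points.
    home = [(pid, p) for pid, p, is_home in values if is_home]
    foreign = [(pid, p) for pid, p, is_home in values if not is_home]
    for i, (a, pa) in enumerate(home):
        for b, pb in home[i + 1:]:
            if dist(pa, pb) <= eps:
                yield (min(a, b), max(a, b))
        for b, pb in foreign:
            if dist(pa, pb) <= eps:
                yield (min(a, b), max(a, b))


def grid_self_join(points, eps):
    # Simulated single-round job: map, group by cell key, reduce.
    if not points:
        return []
    groups = defaultdict(list)
    for key, value in map_phase(points, eps):
        groups[key].append(value)
    result = []
    for values in groups.values():
        result.extend(reduce_phase(values, eps))
    return sorted(result)
```

Two points within distance `eps` always lie in the same cell or in adjacent cells. Because each point is replicated only in the forward direction, exactly one of the two home cells sees both points, so each pair is reported once without a separate deduplication pass. For example, `grid_self_join([(0.0, 0.0), (0.9, 0.0), (1.1, 0.0), (5.0, 5.0)], 1.0)` returns `[(0, 1), (1, 2)]`.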

Description

Seidl, Thomas; Fries, Sergej; Boden, Brigitte (2013): MR-DSJ: distance-based self-join for large-scale vector data analysis with mapreduce. Datenbanksysteme für Business, Technologie und Web (BTW) 2013. Bonn: Gesellschaft für Informatik e.V. PISSN: 1617-5468. ISBN: 978-3-88579-608-4. pp. 37-56. Regular Research Papers. Magdeburg. March 13-15, 2013

Keywords

Citation

DOI

Tags