Konferenzbeitrag
Ddup - towards a deduplication framework utilising apache spark
Lade...
Volltext URI
Dokumententyp
Text/Conference Paper
Dateien
Zusatzinformation
Datum
2015
Autor:innen
Zeitschriftentitel
ISSN der Zeitschrift
Bandtitel
Verlag
Gesellschaft für Informatik e.V.
Zusammenfassung
This paper is about a new framework called DeduPlication (DduP). DduP aims to solve large scale deduplication problems on arbitrary data tuples. DduP tries to bridge the gap between big data, high performance and duplicate detection. At the moment a first prototype exists but the overall project status is work in progress. DduP utilises the promising successor of Apache Hadoop MapReduce [Had14], the Apache Spark Framework [ZCF+10] and its modules MLlib [MLl14] and GraphX [XCD+14]. The three main goals of this project are creating a prototype of the mentioned framework DduP, analysing the deduplication process about scalability and performance and evaluate the behaviour of different small cluster configurations. Tags: Duplicate Detection, Deduplication, Record Linkage, Machine Learning, Big Data, Apache Spark, MLlib, Scala, Hadoop, In-Memory