DduP - Towards a Deduplication Framework Utilising Apache Spark

Wilcke, Niklas; Ritter, Norbert; Henrich, Andreas; Lehner, Wolfgang; Thor, Andreas; Friedrich, Steffen; Wingerath, Wolfram

Dates: 2017-06-30; 2015
ISBN: 978-3-88579-636-7

This paper presents a new framework called DeduPlication (DduP). DduP aims to solve large-scale deduplication problems on arbitrary data tuples and to bridge the gap between big data, high performance, and duplicate detection. A first prototype exists, but the project as a whole is still work in progress. DduP builds on the Apache Spark framework [ZCF+10], the promising successor of Apache Hadoop MapReduce [Had14], and on its modules MLlib [MLl14] and GraphX [XCD+14]. The project has three main goals: creating a prototype of the DduP framework, analysing the scalability and performance of the deduplication process, and evaluating the behaviour of different small cluster configurations.

Tags: Duplicate Detection, Deduplication, Record Linkage, Machine Learning, Big Data, Apache Spark, MLlib, Scala, Hadoop, In-Memory

Language: en
Type: Text/Conference Paper
ISSN: 1617-5468