DduP - towards a deduplication framework utilising Apache Spark

Wilcke, Niklas

dc.contributor.author	Wilcke, Niklas
dc.contributor.editor	Ritter, Norbert
dc.contributor.editor	Henrich, Andreas
dc.contributor.editor	Lehner, Wolfgang
dc.contributor.editor	Thor, Andreas
dc.contributor.editor	Friedrich, Steffen
dc.contributor.editor	Wingerath, Wolfram
dc.date.accessioned	2017-06-30T11:39:36Z
dc.date.available	2017-06-30T11:39:36Z
dc.date.issued	2015
dc.description.abstract	This paper presents a new framework called DeduPlication (DduP). DduP aims to solve large-scale deduplication problems on arbitrary data tuples and tries to bridge the gap between big data, high performance and duplicate detection. A first prototype exists, but the overall project is still work in progress. DduP utilises the Apache Spark framework [ZCF+10], the promising successor of Apache Hadoop MapReduce [Had14], and its modules MLlib [MLl14] and GraphX [XCD+14]. The three main goals of this project are to create a prototype of the DduP framework, to analyse the scalability and performance of the deduplication process, and to evaluate the behaviour of different small cluster configurations. Tags: Duplicate Detection, Deduplication, Record Linkage, Machine Learning, Big Data, Apache Spark, MLlib, Scala, Hadoop, In-Memory	en
dc.identifier.isbn	978-3-88579-636-7
dc.identifier.pissn	1617-5468
dc.language.iso	en
dc.publisher	Gesellschaft für Informatik e.V.
dc.relation.ispartof	Datenbanksysteme für Business, Technologie und Web (BTW 2015) - Workshopband
dc.relation.ispartofseries	Lecture Notes in Informatics (LNI) - Proceedings, Volume P-242
dc.title	DduP - towards a deduplication framework utilising Apache Spark	en
dc.type	Text/Conference Paper
gi.citation.endPage	262
gi.citation.publisherPlace	Bonn
gi.citation.startPage	253
gi.conference.date	2-3 March 2015
gi.conference.location	Hamburg

Files

Original bundle

Name: 253.pdf
Size: 96.18 KB
Format: Adobe Portable Document Format

Collections

P242 - BTW2015 - Datenbanksysteme für Business, Technologie und Web - Workshopband