DduP - towards a deduplication framework utilising Apache Spark

dc.contributor.author: Wilcke, Niklas
dc.contributor.editor: Ritter, Norbert
dc.contributor.editor: Henrich, Andreas
dc.contributor.editor: Lehner, Wolfgang
dc.contributor.editor: Thor, Andreas
dc.contributor.editor: Friedrich, Steffen
dc.contributor.editor: Wingerath, Wolfram
dc.date.accessioned: 2017-06-30T11:39:36Z
dc.date.available: 2017-06-30T11:39:36Z
dc.date.issued: 2015
dc.description.abstract: This paper presents a new framework called DeduPlication (DduP). DduP aims to solve large-scale deduplication problems on arbitrary data tuples and tries to bridge the gap between big data, high performance and duplicate detection. A first prototype exists, but the overall project is still work in progress. DduP utilises the promising successor of Apache Hadoop MapReduce [Had14], the Apache Spark framework [ZCF+10], and its modules MLlib [MLl14] and GraphX [XCD+14]. The three main goals of this project are creating a prototype of the DduP framework, analysing the deduplication process with respect to scalability and performance, and evaluating the behaviour of different small cluster configurations. Tags: Duplicate Detection, Deduplication, Record Linkage, Machine Learning, Big Data, Apache Spark, MLlib, Scala, Hadoop, In-Memory
dc.identifier.isbn: 978-3-88579-636-7
dc.identifier.pissn: 1617-5468
dc.language.iso: en
dc.publisher: Gesellschaft für Informatik e.V.
dc.relation.ispartof: Datenbanksysteme für Business, Technologie und Web (BTW 2015) - Workshopband
dc.relation.ispartofseries: Lecture Notes in Informatics (LNI) - Proceedings, Volume P-242
dc.title: DduP - towards a deduplication framework utilising Apache Spark
dc.type: Text/Conference Paper
gi.citation.endPage: 262
gi.citation.publisherPlace: Bonn
gi.citation.startPage: 253
gi.conference.date: 2-3 March 2015
gi.conference.location: Hamburg

Files

Original bundle
Name: 253.pdf
Size: 96.18 KB
Format: Adobe Portable Document Format