Browsing by keyword "OCR"
1 - 3 of 3
- Conference paper. Analyzing Historical Legal Textcorpora: German VET and CVET regulations (INFORMATIK 2024, 2024). Reiser, Thomas; Dörpinghaus, Jens; Steiner, Petra.
  The digitization of historical documents has attracted particular interest in recent years. Most research efforts aim to digitize historical documents by extracting text from scanned images. A pipeline that transcribes scanned documents into fully structured texts was used to digitize over 900 German VET and CVET regulations. As a preliminary investigation, a basic corpus analysis was conducted to assess the usability of the digitized documents and the need for digitization methods that generate transcripts preserving the logical text structure and hierarchy. This paper focuses on processing the transcripts created from images of German VET and CVET regulations, both to demonstrate the advantages of fully structured text over plain OCR results and to illustrate that even simple analyses require more information for a more comprehensive understanding of the documents.
- Text document. Digitizing Drilling Logs - Challenges of typewritten forms (INFORMATIK 2021, 2021). Bürgl, Kim; Reinhardt, Lea; Binder, Frank; Müller, Lydia; Niekler, Andreas.
  In this work, we show how mining and geological documentation in the form of drilling reports can be digitized and further processed. Processing these typed and handwritten forms poses challenges for document management in renaturation projects. We highlight the structural problems of drilling reports and present three approaches for recognizing and processing the information documented in them. We approach the problem with optical character recognition and document layout analysis techniques. Layout analysis was performed using a heuristic approach and a neural network for layout recognition. Specifically, we present the approaches Form Processing (A), table detection by line counting (B), and processing with Mask R-CNN (C). A case study is used to show initial results and challenges. B and C are more robust than A to small changes in the form. Given more training data, C recognizes columns better than B in cases where table boundaries are not respected. B and C also allow other language models to be used for OCR and can thus also recognize handwriting, given appropriate training data.
- Conference paper. Semi-automatic extraction of metadata from old geological maps (INFORMATIK 2023 - Designing Futures: Zukünfte gestalten, 2023). Bürgl, Kim; Müller, Lydia.
  Geological maps communicate geological information efficiently. Old geological maps were stored on paper and therefore need to be digitized before they can be integrated into digital geographic information systems. Metadata is required to find relevant maps quickly. However, metadata is usually created manually, which takes considerable effort. We present work in progress on a semi-automated approach for extracting metadata from maps. The results show that it significantly lowers the manual effort needed to extract the location and, for date metadata, at least improves the experience of manual annotation.