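I am trying to convert many PDF documents to text in R so that I can use string parsing and regular expressions to extract a set of codes from them. I am using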


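I am trying to get the codes from the left column. The only codes I extract successfully are the ones whose description is longer than a single line. I have experimented with a variety of preprocessing techniques using magick, but for the most part the results were much the same. The only case in which I was able to obtain the full set of codes was by cropping the right side off the image, but unfortunately that is not a viable solution in my situation.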

    library(magrittr)  # provides the %>% pipe

    file <- magick::image_read("44F245A2-5FEE-408F-A197-756436A5CAFD.png")

    file %>%
      magick::image_resize("2000x") %>%
      magick::image_convert(type = 'Grayscale') %>%
      tesseract::ocr() %>%
      cat()
    # or
    # descriptions in this document.
    # 94942C This is a description that takes on multiple lines. It can contain any combination of
    # alphanumeric characters or punctuation. Different types of things can go in here and the
    # | terpenes Steet gine see
    # 272144 This is a description that takes on multiple lines. It can contain any combination of
    # eee
    # length of the description could be anywhere from 1 line to 5 lines of text.
    # E76744 This is a description that takes on multiple lines. It can contain any combination of
    # alphanumeric characters or punctuation. Different types of things can go in here and the
    # [terpenes Steet gine see
    # K77744 This is a description that takes on multiple lines. It can contain any combination of
    # alphanumeric characters or punctuation. Different types of things can go in here and the
    # | terrane een Steet gine seem
    # 172744 This is a description that takes on multiple lines. It can contain any combination of
    # Se
    # length of the description could be anywhere from 1 line to 5 lines of text.
    # A71744 This is a description that takes on multiple lines. It can contain any combination of
    # alphanumeric characters or punctuation. Different types of things can go in here and the
    # | teammates Steet gine see

I would like to be able to get all of the codes from the image linked above. Any help would be great.

Try using a different page segmentation mode. The available modes are:

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
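One way to select a page segmentation mode from R is to pass it as an engine option when creating the Tesseract engine. The following is a minimal sketch, assuming the `tessedit_pageseg_mode` parameter is honoured by your Tesseract build; `psm_options` and `ocr_with_psm` are hypothetical helper names introduced here for illustration, not functions from the tesseract or magick packages.

```r
# Build the engine options list for a page segmentation mode (0-13).
# Hypothetical helper; the mode numbers follow the list of modes above.
psm_options <- function(psm) {
  stopifnot(is.numeric(psm), length(psm) == 1, psm %in% 0:13)
  list(tessedit_pageseg_mode = as.integer(psm))
}

# Run OCR on an image file with the chosen page segmentation mode,
# using the same resize and grayscale preprocessing as in the question.
ocr_with_psm <- function(path, psm) {
  engine <- tesseract::tesseract(language = "eng", options = psm_options(psm))
  img <- magick::image_read(path)
  img <- magick::image_resize(img, "2000x")
  img <- magick::image_convert(img, type = "Grayscale")
  tesseract::ocr(img, engine = engine)
}

# Usage (requires the tesseract and magick packages and the image file):
# text <- ocr_with_psm("44F245A2-5FEE-408F-A197-756436A5CAFD.png", psm = 11)
# cat(text)
```

Trying several modes on the same page and comparing how many codes each one recovers is probably the quickest way to find a mode that suits this layout.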

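In my experience, mode 12 returns the most text, but not necessarily in reading order, which can be a problem if you want to associate each code with its description.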
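If the sparse-text output comes back out of order, the word bounding boxes can help restore it. The sketch below assumes a data frame shaped like the result of `tesseract::ocr_data()`, with a `word` column and a `bbox` column of "x1,y1,x2,y2" strings. It also assumes, based only on the sample output in the question, that each code is six uppercase letters or digits containing at least one digit. `code_pattern` and `codes_top_to_bottom` are hypothetical names.

```r
# Assumed code shape, inferred from the sample output: six uppercase
# letters or digits (for example 94942C or 272144).
code_pattern <- "^[A-Z0-9]{6}$"

# words: data frame with a character column `word` and a character
# column `bbox` holding "x1,y1,x2,y2" strings.
# Returns the matching codes sorted from the top of the page downwards.
codes_top_to_bottom <- function(words) {
  # Require at least one digit so six-letter uppercase words are skipped.
  is_code <- grepl(code_pattern, words$word) & grepl("[0-9]", words$word)
  hits <- words[is_code, , drop = FALSE]
  # The second bounding-box field is the top edge (y1) of the word.
  top <- vapply(
    strsplit(as.character(hits$bbox), ",", fixed = TRUE),
    function(parts) as.numeric(parts[2]),
    numeric(1)
  )
  hits$word[order(top)]
}

# Usage (requires the tesseract package and the image file):
# engine <- tesseract::tesseract("eng", options = list(tessedit_pageseg_mode = 12))
# words  <- tesseract::ocr_data("44F245A2-5FEE-408F-A197-756436A5CAFD.png", engine = engine)
# codes_top_to_bottom(words)
```

The same vertical coordinate could be used to attach the nearest description lines to each code, although OCR misreads of a code character would still need to be handled separately.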

r image-processing ocr tesseract magick