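I am using the Google Cloud Vision API (OCR) through the PHP API library to detect text in PDF files. The OCR itself works perfectly, and I have saved the complete set of JSON output files (e.g. output-1-to-2.json) with the full OCR data (position details, confidence, full text, and so on).
Here is a sample of the JSON output (file: output-1-to-2.json) for a simple 2-page PDF that contains the word "April" on page 1 and "May" on page 2 (as in the image):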
{
"inputConfig":{
"gcsSource":{
"uri":"gs://my-ocr-bucket/php8723/sample.pdf"
},
"mimeType":"application/pdf"
},
"responses":[
{
"fullTextAnnotation":{
"pages":[
{
"property":{
"detectedLanguages":[
{
"languageCode":"en",
"confidence":1
}
]
},
"width":595,
"height":841,
"blocks":[
{
"boundingBox":{
"normalizedVertices":[
{
"x":0.0789916,
"y":0.049940545
},
{
"x":0.11596639,
"y":0.049940545
},
{
"x":0.11596639,
"y":0.059453033
},
{
"x":0.0789916,
"y":0.060642093
}
]
},
"paragraphs":[
{
"boundingBox":{
"normalizedVertices":[
{
"x":0.0789916,
"y":0.049940545
},
{
"x":0.11596639,
"y":0.049940545
},
{
"x":0.11596639,
"y":0.059453033
},
{
"x":0.0789916,
"y":0.060642093
}
]
},
"words":[
{
"property":{
"detectedLanguages":[
{
"languageCode":"en",
"confidence":1
}
]
},
"boundingBox":{
"normalizedVertices":[
{
"x":0.0789916,
"y":0.049940545
},
{
"x":0.11596639,
"y":0.049940545
},
{
"x":0.11596639,
"y":0.059453033
},
{
"x":0.0789916,
"y":0.060642093
}
]
},
"symbols":[
{
"text":"A",
"confidence":0.98833746
},
{
"text":"p",
"confidence":0.9870904
},
{
"text":"r",
"confidence":0.99477327
},
{
"text":"i",
"confidence":0.9951743
},
{
"property":{
"detectedBreak":{
"type":"LINE_BREAK"
}
},
"text":"l",
"confidence":0.98942703
}
],
"confidence":0.9909605
}
],
"confidence":0.9909605
}
],
"blockType":"TEXT",
"confidence":0.9909605
}
],
"confidence":0.9909605
}
],
"text":"April"
},
"context":{
"uri":"gs://my-ocr-bucket/php8723/sample.pdf",
"pageNumber":1
}
},
{
"fullTextAnnotation":{
"pages":[
{
"width":595,
"height":841,
"blocks":[
{
"boundingBox":{
"normalizedVertices":[
{
"x":0.0789916,
"y":0.05469679
},
{
"x":0.11092437,
"y":0.05588585
},
{
"x":0.11092437,
"y":0.065398335
},
{
"x":0.07731093,
"y":0.064209275
}
]
},
"paragraphs":[
{
"boundingBox":{
"normalizedVertices":[
{
"x":0.0789916,
"y":0.05469679
},
{
"x":0.11092437,
"y":0.05588585
},
{
"x":0.11092437,
"y":0.065398335
},
{
"x":0.07731093,
"y":0.064209275
}
]
},
"words":[
{
"boundingBox":{
"normalizedVertices":[
{
"x":0.0789916,
"y":0.05469679
},
{
"x":0.11092437,
"y":0.05588585
},
{
"x":0.11092437,
"y":0.065398335
},
{
"x":0.07731093,
"y":0.064209275
}
]
},
"symbols":[
{
"text":"M",
"confidence":0.98251665
},
{
"text":"a",
"confidence":0.9763874
},
{
"property":{
"detectedBreak":{
"type":"LINE_BREAK"
}
},
"text":"y",
"confidence":0.9850642
}
],
"confidence":0.98132277
}
],
"confidence":0.98132277
}
],
"blockType":"TEXT",
"confidence":0.98132277
}
],
"confidence":0.98132277
}
],
"text":"May"
},
"context":{
"uri":"gs://my-ocr-bucket/php8723/sample.pdf",
"pageNumber":2
}
}
]
}
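I am now stuck on embedding this OCR data (the JSON file) into the PDF to make the PDF searchable. That means I need to edit the PDF and add the OCR data to it.
My question: how can I insert/add the JSON-formatted OCR data generated by Google Vision into the PDF file and make the PDF searchable?
Thank you for reading this far.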
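JSON is an overly verbose way to store data for a PDF. If we delete all the easily removable parts in a text editor, roughly 10,600 bytes that carry almost no content can be reduced to a more compact 600 bytes.
If the confidence of a word is below 95%, it is probably not worth including, and as you can see at this point, the whole lot could be smaller still.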
Filename /sample.pdf
TextAnnotation
Page /MediaBox[595 841]
"Area":[ {"x":47 "y":42} {"x":69 "y":42} {"x":69 "y":50} {"x":47 "y":51} ]
"symbols":[ { "text":"A" } { "text":"p" } { "text":"r" } { "text":"i" }
"property":{ "detectedBreak":{ "type":"LINE_BREAK" } } "text":"l" }
"text":"April"
"context":{ "uri":"sample.pdf", "pageNumber":1 }
TextAnnotation
Page /MediaBox[595 841]
"Area":[{ "x":47 "y":46} { "x":66 "y":47} { "x":66 "y":55} { "x":46 "y":54}
"symbols": { "text":"M" } { "text":"a" }
"property":{ "detectedBreak":{ "type":"LINE_BREAK" } } "text":"y" }
"text":"May"
"context":{ "uri":"sample.pdf", "pageNumber":2 }
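The 95% cutoff mentioned above could be applied while walking the original JSON, before anything is written out. The following is only a minimal sketch in Python: the function name `confident_words`, the `min_confidence` parameter, and the low-confidence word "Apr1l" in the sample are hypothetical and not part of the Vision output shown in the question.

```python
def confident_words(response, min_confidence=0.95):
    """Yield (text, confidence) for every word at or above min_confidence.

    `response` is one element of the "responses" list in the Vision JSON.
    """
    annotation = response.get("fullTextAnnotation", {})
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # A word's text is the concatenation of its symbols.
                    text = "".join(s.get("text", "") for s in word.get("symbols", []))
                    confidence = word.get("confidence", 0.0)
                    if confidence >= min_confidence:
                        yield text, confidence


# Trimmed-down sample: the first word mirrors page 1 of the question,
# the second word is invented here to show the filter rejecting something.
sample = {
    "fullTextAnnotation": {
        "pages": [{
            "blocks": [{
                "paragraphs": [{
                    "words": [
                        {"symbols": [{"text": c} for c in "April"], "confidence": 0.9909605},
                        {"symbols": [{"text": c} for c in "Apr1l"], "confidence": 0.61},
                    ]
                }]
            }]
        }]
    }
}

for text, confidence in confident_words(sample):
    print(text, confidence)
```

Whether 0.95 is the right threshold depends on the scan quality; dropping words also means they can no longer be found by search, so the cutoff is a trade-off rather than a rule.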
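So your program only needs to look for "pageNumber":1, then look at what comes before it, such as "text":"April", and use the bounding values, which need to be converted to PDF units by multiplying the normalized positions by the MediaBox X and Y sizes.
Result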
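The multiplication by the MediaBox size can be sketched as follows. This is an illustration under a few assumptions: the page's MediaBox starts at 0,0 and the page is not rotated, a coordinate missing from the JSON is treated as 0, and the function name `word_boxes` and its output keys are made up for this example. It also flips the y axis, because Vision measures y from the top of the page while default PDF user space measures it from the bottom.

```python
def word_boxes(response):
    """Yield one dict per word with its position converted to PDF points."""
    page_number = response["context"]["pageNumber"]
    for page in response["fullTextAnnotation"]["pages"]:
        width, height = page["width"], page["height"]  # MediaBox size in points
        for block in page["blocks"]:
            for paragraph in block["paragraphs"]:
                for word in paragraph["words"]:
                    vertices = word["boundingBox"]["normalizedVertices"]
                    # Normalized 0..1 values times the page size give points.
                    # Assumption: an absent coordinate means 0.
                    xs = [v.get("x", 0.0) * width for v in vertices]
                    ys = [v.get("y", 0.0) * height for v in vertices]
                    yield {
                        "page": page_number,
                        "text": "".join(s["text"] for s in word["symbols"]),
                        "x": min(xs),                  # left edge
                        "top": min(ys),                # measured from the page top
                        "bottom": max(ys),             # measured from the page top
                        "height": max(ys) - min(ys),
                        # Flip y for PDF user space (origin at bottom-left).
                        "baseline_y": height - max(ys),
                    }


# Page 1 of the question's JSON, reduced to the fields used above.
response = {
    "fullTextAnnotation": {
        "pages": [{
            "width": 595,
            "height": 841,
            "blocks": [{
                "paragraphs": [{
                    "words": [{
                        "boundingBox": {"normalizedVertices": [
                            {"x": 0.0789916, "y": 0.049940545},
                            {"x": 0.11596639, "y": 0.049940545},
                            {"x": 0.11596639, "y": 0.059453033},
                            {"x": 0.0789916, "y": 0.060642093},
                        ]},
                        "symbols": [{"text": c} for c in "April"],
                    }]
                }]
            }]
        }]
    },
    "context": {"uri": "gs://my-ocr-bucket/php8723/sample.pdf", "pageNumber": 1},
}

for box in word_boxes(response):
    print(box)
```

Using the lower edge of the box as the text baseline is a simplification; descenders in words such as "April" sit slightly below the true baseline, which rarely matters for a hidden search layer.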
Filename /sample.pdf
"pageNumber":1
Page /MediaBox[595 841]
TextAnnotation
"text":"April"
"Height":8.5
"Position": {"x":47 "y":42}
"pageNumber":2
Page /MediaBox[595 841]
TextAnnotation
"text":"May"
"Height":10
"Position":{"x":47 "y":46}
Now any PDF library should be able to write these 2 entries onto 2 separate pages.
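To make the last step more concrete, here is a rough sketch of the page content a library would need to add for each entry. It only builds the PDF content-stream operators as a string, using text rendering mode 3 (invisible text) so the words are searchable without being drawn over the scan. The function names, the font resource name `F1`, and the baseline values (page height minus the lower edge of each box) are assumptions made for this example; actually appending the stream to each page and registering the font is left to whichever PDF library is used.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text, x, baseline_y, font_size, font_name="F1"):
    """Return content-stream operators that place `text` invisibly.

    `font_name` must match a font resource on the target page (assumed here).
    """
    return (
        "BT\n"                                   # begin text object
        "3 Tr\n"                                 # rendering mode 3: invisible
        f"/{font_name} {font_size:g} Tf\n"       # select font and size
        f"{x:g} {baseline_y:g} Td\n"             # move to the start position
        f"({escape_pdf_string(text)}) Tj\n"      # show the string
        "ET\n"                                   # end text object
    )


# The two entries from the result above, with y flipped to PDF coordinates.
entries = [
    {"page": 1, "text": "April", "x": 47, "baseline_y": 841 - 51, "font_size": 8.5},
    {"page": 2, "text": "May", "x": 47, "baseline_y": 841 - 55, "font_size": 10},
]

for entry in entries:
    ops = invisible_text_ops(
        entry["text"], entry["x"], entry["baseline_y"], entry["font_size"]
    )
    print(f"% page {entry['page']}")
    print(ops)
```

A literal string with a standard font like this is only adequate for simple Latin text; other scripts would need an embedded font and a matching encoding, which this sketch does not attempt.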