Embed/insert/add JSON OCR data generated by "Google Cloud Vision (OCR)" into a PDF file and make the PDF searchable

Question

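I am using the Google Cloud Vision API (OCR) through the PHP API library to detect text in PDF files. The OCR works perfectly, and I have saved the full set of JSON output files (e.g. output-1-to-2.json) with the complete OCR data (position details, confidence, full text, etc.).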

Here is a sample JSON output (file: output-1-to-2.json) for a simple 2-page PDF that contains the word "April" on page 1 and "May" on page 2 (both as images):

{
   "inputConfig":{
      "gcsSource":{
         "uri":"gs://my-ocr-bucket/php8723/sample.pdf"
      },
      "mimeType":"application/pdf"
   },
   "responses":[
      {
         "fullTextAnnotation":{
            "pages":[
               {
                  "property":{
                     "detectedLanguages":[
                        {
                           "languageCode":"en",
                           "confidence":1
                        }
                     ]
                  },
                  "width":595,
                  "height":841,
                  "blocks":[
                     {
                        "boundingBox":{
                           "normalizedVertices":[
                              {
                                 "x":0.0789916,
                                 "y":0.049940545
                              },
                              {
                                 "x":0.11596639,
                                 "y":0.049940545
                              },
                              {
                                 "x":0.11596639,
                                 "y":0.059453033
                              },
                              {
                                 "x":0.0789916,
                                 "y":0.060642093
                              }
                           ]
                        },
                        "paragraphs":[
                           {
                              "boundingBox":{
                                 "normalizedVertices":[
                                    {
                                       "x":0.0789916,
                                       "y":0.049940545
                                    },
                                    {
                                       "x":0.11596639,
                                       "y":0.049940545
                                    },
                                    {
                                       "x":0.11596639,
                                       "y":0.059453033
                                    },
                                    {
                                       "x":0.0789916,
                                       "y":0.060642093
                                    }
                                 ]
                              },
                              "words":[
                                 {
                                    "property":{
                                       "detectedLanguages":[
                                          {
                                             "languageCode":"en",
                                             "confidence":1
                                          }
                                       ]
                                    },
                                    "boundingBox":{
                                       "normalizedVertices":[
                                          {
                                             "x":0.0789916,
                                             "y":0.049940545
                                          },
                                          {
                                             "x":0.11596639,
                                             "y":0.049940545
                                          },
                                          {
                                             "x":0.11596639,
                                             "y":0.059453033
                                          },
                                          {
                                             "x":0.0789916,
                                             "y":0.060642093
                                          }
                                       ]
                                    },
                                    "symbols":[
                                       {
                                          "text":"A",
                                          "confidence":0.98833746
                                       },
                                       {
                                          "text":"p",
                                          "confidence":0.9870904
                                       },
                                       {
                                          "text":"r",
                                          "confidence":0.99477327
                                       },
                                       {
                                          "text":"i",
                                          "confidence":0.9951743
                                       },
                                       {
                                          "property":{
                                             "detectedBreak":{
                                                "type":"LINE_BREAK"
                                             }
                                          },
                                          "text":"l",
                                          "confidence":0.98942703
                                       }
                                    ],
                                    "confidence":0.9909605
                                 }
                              ],
                              "confidence":0.9909605
                           }
                        ],
                        "blockType":"TEXT",
                        "confidence":0.9909605
                     }
                  ],
                  "confidence":0.9909605
               }
            ],
            "text":"April"
         },
         "context":{
            "uri":"gs://my-ocr-bucket/php8723/sample.pdf",
            "pageNumber":1
         }
      },
      {
         "fullTextAnnotation":{
            "pages":[
               {
                  "width":595,
                  "height":841,
                  "blocks":[
                     {
                        "boundingBox":{
                           "normalizedVertices":[
                              {
                                 "x":0.0789916,
                                 "y":0.05469679
                              },
                              {
                                 "x":0.11092437,
                                 "y":0.05588585
                              },
                              {
                                 "x":0.11092437,
                                 "y":0.065398335
                              },
                              {
                                 "x":0.07731093,
                                 "y":0.064209275
                              }
                           ]
                        },
                        "paragraphs":[
                           {
                              "boundingBox":{
                                 "normalizedVertices":[
                                    {
                                       "x":0.0789916,
                                       "y":0.05469679
                                    },
                                    {
                                       "x":0.11092437,
                                       "y":0.05588585
                                    },
                                    {
                                       "x":0.11092437,
                                       "y":0.065398335
                                    },
                                    {
                                       "x":0.07731093,
                                       "y":0.064209275
                                    }
                                 ]
                              },
                              "words":[
                                 {
                                    "boundingBox":{
                                       "normalizedVertices":[
                                          {
                                             "x":0.0789916,
                                             "y":0.05469679
                                          },
                                          {
                                             "x":0.11092437,
                                             "y":0.05588585
                                          },
                                          {
                                             "x":0.11092437,
                                             "y":0.065398335
                                          },
                                          {
                                             "x":0.07731093,
                                             "y":0.064209275
                                          }
                                       ]
                                    },
                                    "symbols":[
                                       {
                                          "text":"M",
                                          "confidence":0.98251665
                                       },
                                       {
                                          "text":"a",
                                          "confidence":0.9763874
                                       },
                                       {
                                          "property":{
                                             "detectedBreak":{
                                                "type":"LINE_BREAK"
                                             }
                                          },
                                          "text":"y",
                                          "confidence":0.9850642
                                       }
                                    ],
                                    "confidence":0.98132277
                                 }
                              ],
                              "confidence":0.98132277
                           }
                        ],
                        "blockType":"TEXT",
                        "confidence":0.98132277
                     }
                  ],
                  "confidence":0.98132277
               }
            ],
            "text":"May"
         },
         "context":{
            "uri":"gs://my-ocr-bucket/php8723/sample.pdf",
            "pageNumber":2
         }
      }
   ]
}

Now I am stuck on embedding this OCR data (the JSON files) into the PDF to make it searchable. That means I need to edit the PDF and add the OCR data to it.

My question: how do I insert/add the JSON-format OCR data generated by Google Vision into the PDF file and make the PDF searchable?

Thanks for reading this far.

Answer 1

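JSON is an overly bloated way to store PDF data, as becomes clear once we strip all the easily removable parts in a text editor. The roughly 10,600 bytes, most of which carry no content, can be reduced to a more compact 600 bytes.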

If the confidence is below 95%, the text is probably not worth including; as you can see at this point, the whole lot could be smaller still.
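The trimming and the confidence cut-off described above could be sketched as follows. This is only an illustration in Python, not the answerer's code: the function name `extract_words`, the default threshold, and the trimmed `SAMPLE` dictionary (a stand-in for output-1-to-2.json with the bounding boxes left out) are all assumptions made for the example.

```python
def extract_words(ocr, min_confidence=0.95):
    """Yield (page_number, text, confidence, normalized_vertices) for each kept word."""
    for response in ocr.get("responses", []):
        page_number = response.get("context", {}).get("pageNumber")
        for page in response.get("fullTextAnnotation", {}).get("pages", []):
            for block in page.get("blocks", []):
                for paragraph in block.get("paragraphs", []):
                    for word in paragraph.get("words", []):
                        confidence = word.get("confidence", 0.0)
                        if confidence < min_confidence:
                            continue  # drop words below the cut-off
                        # A word's text is the concatenation of its symbols.
                        text = "".join(s.get("text", "") for s in word.get("symbols", []))
                        vertices = word.get("boundingBox", {}).get("normalizedVertices", [])
                        yield page_number, text, confidence, vertices


def _response(page_number, text, confidence):
    """Build a trimmed, hypothetical response entry shaped like the Vision output."""
    word = {"symbols": [{"text": ch} for ch in text], "confidence": confidence}
    page = {"width": 595, "height": 841,
            "blocks": [{"paragraphs": [{"words": [word]}]}]}
    return {"fullTextAnnotation": {"pages": [page], "text": text},
            "context": {"pageNumber": page_number}}


# Stand-in for json.load(open("output-1-to-2.json")); bounding boxes omitted.
SAMPLE = {"responses": [_response(1, "April", 0.9909605),
                        _response(2, "May", 0.98132277)]}

if __name__ == "__main__":
    for page_number, text, confidence, _ in extract_words(SAMPLE):
        print(page_number, text, confidence)
```

With the real file, `SAMPLE` would be replaced by the parsed JSON; everything except the page number, the word text, and its bounding box is discarded.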

Filename /sample.pdf

TextAnnotation
Page  /MediaBox[595 841]
"Area":[ {"x":47 "y":42} {"x":69 "y":42} {"x":69 "y":50} {"x":47 "y":51} ]
"symbols":[ { "text":"A" } { "text":"p" } { "text":"r" } { "text":"i" }
{ "property":{ "detectedBreak":{ "type":"LINE_BREAK" } } "text":"l" } ]
"text":"April"
"context":{ "uri":"sample.pdf", "pageNumber":1 }

TextAnnotation
Page  /MediaBox[595 841]
"Area":[ {"x":47 "y":46} {"x":66 "y":47} {"x":66 "y":55} {"x":46 "y":54} ]
"symbols":[ { "text":"M" } { "text":"a" }
{ "property":{ "detectedBreak":{ "type":"LINE_BREAK" } } "text":"y" } ]
"text":"May" 
"context":{ "uri":"sample.pdf", "pageNumber":2 }

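So your program only needs to look for

"pageNumber":1

then look at the content before it, for example

"text":"April"

and use the bounding values, which must be converted to PDF units by multiplying each normalized position by the MediaBox width and height. The result: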

Filename /sample.pdf

"pageNumber":1
Page  /MediaBox[595 841]

TextAnnotation
"text":"April"
"Height":8.5
"Position": {"x":47 "y":42} 

"pageNumber":2
Page  /MediaBox[595 841]

TextAnnotation
"text":"May" 
"Height":10
"Position":{"x":47 "y":46}
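The conversion from normalized vertices to PDF units might look like the sketch below. The function name `to_pdf_box` and the returned dictionary layout are assumptions for illustration. Two points go beyond the listing above: the box uses the outer extent of the four vertices, so the height can differ slightly from the figures shown, and the y value is flipped, because the Vision coordinates are measured from the top of the page while PDF user space has its origin at the bottom-left corner by default.

```python
def to_pdf_box(normalized_vertices, page_width, page_height):
    """Convert Vision normalizedVertices (0..1, top-left origin) to a box in
    PDF units (bottom-left origin)."""
    # A coordinate missing from a vertex is treated as 0.
    xs = [v.get("x", 0.0) * page_width for v in normalized_vertices]
    ys = [v.get("y", 0.0) * page_height for v in normalized_vertices]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)  # still measured from the top of the page
    return {
        "x": left,
        "y": page_height - bottom,  # lower edge of the box, measured from the page bottom
        "width": right - left,
        "height": bottom - top,
    }


# Word box for "April" on page 1, copied from the JSON in the question.
APRIL_VERTICES = [
    {"x": 0.0789916, "y": 0.049940545},
    {"x": 0.11596639, "y": 0.049940545},
    {"x": 0.11596639, "y": 0.059453033},
    {"x": 0.0789916, "y": 0.060642093},
]

if __name__ == "__main__":
    box = to_pdf_box(APRIL_VERTICES, page_width=595, page_height=841)
    print({key: round(value, 2) for key, value in box.items()})
```

Using the lower edge of the box as the text baseline is a rough approximation; glyphs with descenders sit slightly higher in practice, which rarely matters for a hidden search layer.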

Now any library should be able to write these 2 entries onto 2 separate pages.
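As a rough, library-independent illustration of what "writing an entry" means, the sketch below builds the PDF content-stream operators for a hidden text layer. Text rendering mode 3 (`3 Tr`) draws text that is neither filled nor stroked, so it stays invisible over the scanned image but remains searchable. The function names, the entry layout, and the font resource name `F1` are assumptions: the page's resource dictionary would need to define that font, and the chosen PDF library still has to append the stream to the page. Text outside the font's encoding (for example non-Latin scripts) needs more work than this.

```python
def pdf_escape(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(entries, font_name="F1"):
    """Return content-stream operators that place each entry as invisible text.

    Each entry is a dict with 'text', 'x', 'y' (PDF units, bottom-left origin)
    and 'height' (used as the font size).
    """
    lines = []
    for entry in entries:
        lines += [
            "BT",                                             # begin text object
            "3 Tr",                                           # rendering mode 3: invisible
            f"/{font_name} {entry['height']:.2f} Tf",         # select font and size
            f"{entry['x']:.2f} {entry['y']:.2f} Td",          # move to the text position
            f"({pdf_escape(entry['text'])}) Tj",              # show the string
            "ET",                                             # end text object
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    page_1 = [{"text": "April", "x": 47, "y": 790, "height": 9}]
    print(invisible_text_ops(page_1))
```

In practice a PDF library would usually be given the text, position, and size directly and would emit equivalent operators itself; the sketch only shows what ends up in the page.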
