How do I get page numbers from an Azure AI Search index that uses the Split skill?


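We created an Azure AI Search index with a skillset that uses the Split skill to chunk our documents by page, and we also set up a semantic configuration. Now we want to include the page number either at the end of the chunk content or as a separate index field. Our index has the following fields: parent_id, title, chunk_id, chunk.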

These are the parameters of the Split skill:

text_split_mode="pages",  
context="/document",  
maximum_page_length=2048,  
page_overlap_length=20,  

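I tried extracting the page number from chunk_id: 0ffd977ce09f_aHR0cHM6Ly9hbmh@cmFpbnN0b3JhZ2VhY2NvdW50LmJsb2luS53aW5kb3dzLm5ldC9maWx1dXBsb2FkLXRlc3QtaW5kZXgvQmx1ZV9 aZWJyYV9MYW5kbG9yZF9JbnN1cmFuY2VfQwNjaWRlbnRhbF9EYW1hZ2VFUERTXIwMjMwNzAxLnBkZg_pages_2

But when I compared the chunk contents against the actual document, the page number in the chunk ID was wrong. In the document the content is on page 21, but the chunk shows 26. How should I handle this?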

azure-ai-search
1 Answer

You can use the imageAction indexer configuration to extract the page number and the content of each page.

With this setting you get a field named pageNumber, and the content of each page is written as a separate document in a secondary index.

Sample definitions follow.

Main index fields:

[Screenshot: main index fields]

Skillset definition. It uses the OCR skill to extract the text of each page and projects the result into the secondary index.

{
  "name": "skillset1",
  "description": "",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "name": "#1",
      "description": "",
      "context": "/document/normalized_images/*",
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "text"
        }
      ],
      "defaultLanguageCode": "en",
      "detectOrientation": true,
      "lineEnding": "Space"
    }
  ],
  "@odata.etag": "\"0x8DCFB1DC59EC9ED\"",
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "desidx",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/normalized_images/*",
        "mappings": [
          {
            "name": "text",
            "source": "/document/normalized_images/*/text"
          },
          {
            "name": "pageNumber",
            "source": "/document/normalized_images/*/pageNumber"
          },
          {
            "name": "metadata_storage_path",
            "source": "/document/metadata_storage_name"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}

Whenever you set the image action to generateNormalizedImagePerPage, the data for each page is available under the /document/normalized_images/* context.
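To make the projection easier to picture, here is a minimal local sketch in Python of what the selector above does with an enriched document: it emits one child document per entry under normalized_images. It does not call the service. The function name project_pages and the sample data are hypothetical; only the field names (parent_id, text, pageNumber, metadata_storage_path) come from the skillset definition.

```python
from typing import Any, Dict, List


def project_pages(parent_key: str, enriched_doc: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Emulate the index projection locally: one child document per page image.

    This is only an illustration of the mapping in the skillset, not the
    service's actual implementation.
    """
    children = []
    for image in enriched_doc.get("normalized_images", []):
        children.append(
            {
                # parentKeyFieldName in the selector
                "parent_id": parent_key,
                # /document/normalized_images/*/text (output of the OCR skill)
                "text": image.get("text", ""),
                # /document/normalized_images/*/pageNumber
                "pageNumber": image.get("pageNumber"),
                # mapped from /document/metadata_storage_name in the skillset
                "metadata_storage_path": enriched_doc.get("metadata_storage_name"),
            }
        )
    return children


# Hypothetical enriched document with two rendered pages (sample values only).
sample = {
    "metadata_storage_name": "policy.pdf",
    "normalized_images": [
        {"pageNumber": 1, "text": "First page text"},
        {"pageNumber": 2, "text": "Second page text"},
    ],
}

pages = project_pages("doc-1", sample)
for page in pages:
    print(page["pageNumber"], page["text"])
```

Because each child document carries its own pageNumber, the page no longer has to be inferred from the chunk ID.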

Indexer

{
  "@odata.context": "https://jgsai.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DCFB1DD13443EF\"",
  "name": "azureblob-indexer",
  "description": "",
  "dataSourceName": "ds",
  "skillsetName": "skillset1",
  "targetIndexName": "srcidx",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "base64EncodeKeys": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "imageAction": "generateNormalizedImagePerPage"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "metadata_storage_path",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    }
  ],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}

Secondary index fields:

[Screenshot: secondary index fields]

Output:

[Screenshot: query output from the secondary index]
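Once the secondary index is populated, a single page can be retrieved by filtering on pageNumber. The sketch below only builds a request body for the Search Documents REST API and does not send it. It assumes that parent_id and pageNumber are marked filterable in desidx and that pageNumber is an integer field; the helper name build_page_query and the sample values are hypothetical.

```python
import json


def build_page_query(search_text: str, parent_id: str, page_number: int) -> dict:
    """Build a search request body restricted to one page of one parent document."""
    # OData string literals escape a single quote by doubling it.
    safe_parent = parent_id.replace("'", "''")
    return {
        "search": search_text,
        "filter": f"parent_id eq '{safe_parent}' and pageNumber eq {int(page_number)}",
        "select": "text,pageNumber,metadata_storage_path",
    }


# Hypothetical values for illustration.
body = build_page_query("accidental damage", "abc", 21)
print(json.dumps(body, indent=2))
```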

See the Azure AI Search documentation for more properties of the image action.
