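We created an Azure AI Search index that uses the Split skill to chunk our documents by page, along with a semantic configuration. We now want to include the page number either at the end of the chunk content or as a separate index field. Our index has the following fields: parent_id, title, chunk_id, chunk.
These are the Split skill parameters: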
text_split_mode="pages",
context="/document",
maximum_page_length=2048,
page_overlap_length=20,
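I tried extracting the page number from chunk_id: 0ffd977ce09f_aHR0cHM6Ly9hbmh@cmFpbnN0b3JhZ2VhY2NvdW50LmJsb2luS53aW5kb3dzLm5ldC9maWx1dXBsb2FkLXRlc3QtaW5kZXgvQmx1ZV9 aZWJyYV9MYW5kbG9yZF9JbnN1cmFuY2VfQwNjaWRlbnRhbF9EYW1hZ2VFUERTXIwMjMwNzAxLnBkZg_pages_2 However, when I checked the chunk contents against the actual document, the page number in the chunk id was wrong. In the document the content is on page 21, but the chunk shows it as 26. How can I handle this?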
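You can use the imageAction configuration to extract the page number and the content of each page.
With this setting you get a field named pageNumber, and the content of each page is stored as a separate document in the secondary index.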
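As background for the mismatch described in the question: in pages mode the Split skill cuts the text into chunks of at most maximum_page_length characters, so the trailing _pages_N in chunk_id is most likely the position of the chunk in the skill output, not a page of the PDF. The sketch below only parses that suffix; the function name chunk_ordinal and the sample id are made up for illustration, and the suffix format is assumed from the chunk_id shown in the question.

```python
import re
from typing import Optional

# Assumption: chunk ids end in "_pages_<N>", as in the id from the question.
_SUFFIX = re.compile(r"_pages_(\d+)$")


def chunk_ordinal(chunk_id: str) -> Optional[int]:
    """Return the trailing chunk position from a chunk id, or None if absent.

    This number counts chunks produced by the Split skill, so it should not
    be read as a PDF page number.
    """
    match = _SUFFIX.search(chunk_id)
    return int(match.group(1)) if match else None


# Hypothetical id used only to exercise the helper.
sample_ordinal = chunk_ordinal("0ffd977ce09f_example_pages_2")
```

Because a 2048-character chunk rarely lines up with a page boundary, this ordinal drifts away from the real page number as the document gets longer, which is consistent with seeing 26 where the document says 21.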
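Below is a sample definition.
Main index fields.
Skillset definition - uses the OCR skill to extract text from each page and projects it into the secondary index.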
{
"name": "skillset1",
"description": "",
"skills": [
{
"@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
"name": "#1",
"description": "",
"context": "/document/normalized_images/*",
"inputs": [
{
"name": "image",
"source": "/document/normalized_images/*",
"inputs": []
}
],
"outputs": [
{
"name": "text",
"targetName": "text"
}
],
"defaultLanguageCode": "en",
"detectOrientation": true,
"lineEnding": "Space"
}
],
"@odata.etag": "\"0x8DCFB1DC59EC9ED\"",
"indexProjections": {
"selectors": [
{
"targetIndexName": "desidx",
"parentKeyFieldName": "parent_id",
"sourceContext": "/document/normalized_images/*",
"mappings": [
{
"name": "text",
"source": "/document/normalized_images/*/text"
},
{
"name": "pageNumber",
"source": "/document/normalized_images/*/pageNumber"
},
{
"name": "metadata_storage_path",
"source": "/document/metadata_storage_name"
}
]
}
],
"parameters": {
"projectionMode": "skipIndexingParentDocuments"
}
}
}
Whenever you set imageAction to generateNormalizedImagePerPage, the data for each page is available under the /document/normalized_images/* context.
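To make the projection easier to picture, here is a small Python model of what the selector above does: one secondary-index document per entry under normalized_images. It is only an illustration, not how the service is implemented. The function names project_pages and with_page_suffix, the sample document, and the parent key value are all hypothetical; the field names come from the mappings in the skillset.

```python
from typing import Any


def project_pages(document: dict[str, Any], parent_key: str) -> list[dict[str, Any]]:
    """Build one projected document per normalized image (one per page)."""
    projected = []
    for image in document.get("normalized_images", []):
        projected.append({
            "parent_id": parent_key,
            "text": image.get("text", ""),
            "pageNumber": image["pageNumber"],
            # The skillset maps this field from /document/metadata_storage_name.
            "metadata_storage_path": document["metadata_storage_name"],
        })
    return projected


def with_page_suffix(page: dict[str, Any]) -> str:
    """Append the page number to the end of the page text."""
    return f"{page['text']} [page {page['pageNumber']}]"


# Hypothetical enriched document with two pages of OCR output.
enriched = {
    "metadata_storage_name": "example.pdf",
    "normalized_images": [
        {"pageNumber": 1, "text": "First page text"},
        {"pageNumber": 2, "text": "Second page text"},
    ],
}
pages = project_pages(enriched, parent_key="doc-1")
labelled = [with_page_suffix(p) for p in pages]
```

The second helper covers the other option from the question, putting the page number at the end of the content, in case a separate field is not wanted.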
Indexer
{
"@odata.context": "https://jgsai.search.windows.net/$metadata#indexers/$entity",
"@odata.etag": "\"0x8DCFB1DD13443EF\"",
"name": "azureblob-indexer",
"description": "",
"dataSourceName": "ds",
"skillsetName": "skillset1",
"targetIndexName": "srcidx",
"disabled": null,
"schedule": null,
"parameters": {
"batchSize": null,
"maxFailedItems": 0,
"maxFailedItemsPerBatch": 0,
"base64EncodeKeys": null,
"configuration": {
"dataToExtract": "contentAndMetadata",
"parsingMode": "default",
"imageAction": "generateNormalizedImagePerPage"
}
},
"fieldMappings": [
{
"sourceFieldName": "metadata_storage_path",
"targetFieldName": "metadata_storage_path",
"mappingFunction": {
"name": "base64Encode",
"parameters": null
}
}
],
"outputFieldMappings": [],
"cache": null,
"encryptionKey": null
}
Secondary index fields
Output:
More properties of imageAction are described here.
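Once the secondary index is populated, the pages of one parent document can be read back in order. The sketch below only builds the JSON body for a search request (POST to the docs/search endpoint of the secondary index); it does not call the service. It assumes that parent_id is filterable and pageNumber is sortable in the secondary index, which the definitions above do not show, and build_page_query is a hypothetical helper name.

```python
import json


def build_page_query(parent_id: str, top: int = 50) -> dict:
    """Build a search body that lists one parent's pages in page order."""
    # OData string literals escape a single quote by doubling it.
    escaped = parent_id.replace("'", "''")
    return {
        "search": "*",
        "filter": f"parent_id eq '{escaped}'",  # assumes parent_id is filterable
        "orderby": "pageNumber asc",            # assumes pageNumber is sortable
        "select": "text,pageNumber",
        "top": top,
    }


# Hypothetical parent key used only for illustration.
payload = build_page_query("doc-1")
print(json.dumps(payload, indent=2))
```

Each hit then carries both the page text and its pageNumber, so the page number can be shown next to the content or appended to it on the client side.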