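I'm new to Azure AI Search, and I want to get a chunk index property out of this skillset so that I know where each chunk sits in the document. After splitting, the page content looks like this: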
{'values': [{'recordId': '0', 'data': {'text': 'sample data 1 '}}, {'recordId': '1', 'data': {'text': 'sample data 1'}}, {'recordId': '2', 'data': {'text': 'sample data 3'}}]}
How can I copy the recordId value into a field?
{
  "name": "testing-phase-1-docs-skillset",
  "description": "Skillset to chunk documents and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#3",
      "description": "Split skill to chunk documents",
      "context": "/document",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ],
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "unit": "characters"
    }
  ],
  "@odata.etag": "\"0x8DD029DA50735BD\"",
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "testing-phase-1-docs-index",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          {
            "name": "content",
            "source": "/document/pages/*"
          }, // want to add a recordId here
          {
            "name": "metadata_title",
            "source": "/document/metadata_title"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
How do I get the chunk index when using Split Skill in Azure AI Search?
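Add a custom skill that assigns each chunk a chunkIndex field representing its position. With it, you can track each chunk's index in the document after splitting with SplitSkill. Projecting chunkIndex into the search index then tells you exactly where each chunk sits in the original document.
{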
  "name": "testing-phase-1-docs-skillset",
  "description": "Skillset to chunk documents, assign a recordId and chunkIndex to each chunk, and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "documentChunkingSkill",
      "description": "Splits document into chunks",
      "context": "/document",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ],
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "unit": "characters"
    },
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "name": "generateRecordIdAndChunkIndexSkill",
      "description": "Custom Web API skill (placeholder endpoint) that returns a recordId and chunkIndex for each chunk",
      "uri": "https://<your-function-app>.azurewebsites.net/api/<your-function>",
      "httpMethod": "POST",
      "context": "/document/pages/*",
      "inputs": [
        {
          "name": "text",
          "source": "/document/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "recordId",
          "targetName": "recordId"
        },
        {
          "name": "chunkIndex",
          "targetName": "chunkIndex"
        }
      ]
    }
  ],
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "testing-phase-1-docs-index",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          {
            "name": "content",
            "source": "/document/pages/*"
          },
          {
            "name": "recordId",
            "source": "/document/pages/*/recordId"
          },
          {
            "name": "chunkIndex",
            "source": "/document/pages/*/chunkIndex"
          },
          {
            "name": "metadata_title",
            "source": "/document/metadata_title"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
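For the custom skill itself, here is a minimal Python sketch of the request handling, not a definitive implementation. It assumes the request shape shown in the question (values, recordId, data) and, more importantly, that a numeric recordId reflects the chunk's order in the document. I have not verified that the indexer guarantees this, so treat the resulting chunkIndex as illustrative. The function name assign_chunk_indexes is hypothetical, and hosting it behind an HTTP endpoint is left out.

```python
import json


def assign_chunk_indexes(request_body: dict) -> dict:
    """Build a custom-skill style response with recordId and chunkIndex per record."""
    results = []
    for position, record in enumerate(request_body.get("values", [])):
        record_id = record["recordId"]
        # Assumption: a numeric recordId reflects the chunk's order. If it is
        # not numeric, fall back to the record's position in this request.
        try:
            chunk_index = int(record_id)
        except (TypeError, ValueError):
            chunk_index = position
        results.append(
            {
                "recordId": record_id,  # echoed back unchanged so results can be matched
                "data": {"recordId": record_id, "chunkIndex": chunk_index},
                "errors": [],
                "warnings": [],
            }
        )
    return {"values": results}


if __name__ == "__main__":
    sample = {
        "values": [
            {"recordId": "0", "data": {"text": "sample data 1 "}},
            {"recordId": "1", "data": {"text": "sample data 1"}},
            {"recordId": "2", "data": {"text": "sample data 3"}},
        ]
    }
    print(json.dumps(assign_chunk_indexes(sample), indent=2))
```

If the ordering assumption turns out not to hold, a more robust variant would probably run the skill at the /document context, pass the whole /document/pages array as a single input, and enumerate it inside the function.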
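In the skillset above, I added a custom skill that processes each chunk and gives it a unique recordId and a chunkIndex.
The split skill divides the document into chunks according to the specified page length.
The custom skill generates a unique recordId for each chunk.
The index projection maps the split content and the recordId to the final index.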
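To see how a chunk index relates to a position in the original text, here is a simplified sliding-window splitter that uses the same numbers as the skillset (maximumPageLength 2000, pageOverlapLength 500, unit characters). It is not how SplitSkill is implemented; as far as I know the real skill also tries to respect sentence boundaries, so the start offsets below are only approximate. The name split_with_index and the output keys are hypothetical.

```python
def split_with_index(text: str, max_len: int = 2000, overlap: int = 500) -> list[dict]:
    """Split text into overlapping fixed-size windows, tagging each with its index."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")

    step = max_len - overlap  # how far each new chunk starts after the previous one
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(
            {
                "chunkIndex": len(chunks),  # 0-based position of the chunk
                "start": start,             # character offset in the original text
                "content": text[start:start + max_len],
            }
        )
        if start + max_len >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in split_with_index("abcdefghij", max_len=4, overlap=1):
        print(chunk["chunkIndex"], chunk["start"], chunk["content"])
```

With a fixed window like this, chunk i starts at roughly i * (maximumPageLength - pageOverlapLength) characters, which is why storing chunkIndex next to the content is enough to order the chunks of one parent document.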