How do I get the chunk index when using the Split Skill in Azure AI Search?

Question    Votes: 0    Answers: 1

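I am new to Azure AI Search, and I want to get a chunk index attribute out of this skillset so that I know where in the document each chunk sits. After splitting, the pages content looks like this: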

{'values': [{'recordId': '0', 'data': {'text': 'sample data 1 '}}, {'recordId': '1', 'data': {'text': 'sample data 1'}}, {'recordId': '2', 'data': {'text': 'sample data 3'}}

How can I copy the recordId value into an index field?

{
  "name": "testing-phase-1-docs-skillset",
  "description": "Skillset to chunk documents and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#3",
      "description": "Split skill to chunk documents",
      "context": "/document",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ],
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "unit": "characters"
    }
  ],
  "@odata.etag": "\"0x8DD029DA50735BD\"",
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "testing-phase-1-docs-index",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          {
            "name": "content",
            "source": "/document/pages/*"
          }, // want to add a recordId here
  
          {
            "name": "metadata_title",
            "source": "/document/metadata_title"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
python azure azure-ai-search
1 Answer
0
votes

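How do I get the chunk index when using the Split Skill in Azure AI Search?

Add a custom skill that assigns each chunk a chunkIndex field representing its position. This lets you track each chunk's index within the document after it has been split by the SplitSkill.

The chunkIndex can then be projected into the search index, so you know exactly where each chunk sits in the original document.
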
{
  "name": "testing-phase-1-docs-skillset",
  "description": "Skillset to chunk documents, assign a recordId and chunkIndex to each chunk, and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "documentChunkingSkill",
      "description": "Splits document into chunks",
      "context": "/document",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ],
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "unit": "characters"
    },
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "name": "generateRecordIdAndChunkIndexSkill",
      "description": "Custom skill that returns a recordId and chunkIndex for each chunk",
      "uri": "https://<your-custom-skill-endpoint>",
      "context": "/document/pages/*",
      "inputs": [
        {
          "name": "text",
          "source": "/document/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "recordId",
          "targetName": "recordId"
        },
        {
          "name": "chunkIndex",
          "targetName": "chunkIndex"
        }
      ]
    }
  ],
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "testing-phase-1-docs-index",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          {
            "name": "content",
            "source": "/document/pages/*"
          },
          {
            "name": "recordId",
            "source": "/document/pages/*/recordId"
          },
          {
            "name": "chunkIndex",
            "source": "/document/pages/*/chunkIndex"
          },
          {
            "name": "metadata_title",
            "source": "/document/metadata_title"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
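
For reference, here is a minimal sketch of what the custom skill's logic could look like, assuming it is hosted as a Web API custom skill that receives the usual `values` payload (like the one shown in the question) and must echo each `recordId` in its response. The function name `assign_chunk_index` is hypothetical, and using a record's position in the batch as `chunkIndex` assumes the indexer sends one document's chunks in order in a single request, which is worth verifying against your own data.

```python
import json


def assign_chunk_index(request_body: dict) -> dict:
    """Build a custom-skill style response that adds recordId and chunkIndex.

    Assumption: the records in one request are the chunks of a single
    document, in document order. This is not guaranteed by this sketch.
    """
    results = []
    for position, record in enumerate(request_body.get("values", [])):
        record_id = record["recordId"]
        results.append(
            {
                # The response must carry the same recordId as the request record.
                "recordId": record_id,
                "data": {
                    "recordId": record_id,   # copy of the id, exposed as an output
                    "chunkIndex": position,  # position of the chunk in the batch
                },
            }
        )
    return {"values": results}


if __name__ == "__main__":
    sample_request = {
        "values": [
            {"recordId": "0", "data": {"text": "sample data 1"}},
            {"recordId": "1", "data": {"text": "sample data 2"}},
        ]
    }
    print(json.dumps(assign_chunk_index(sample_request), indent=2))
```

The function is pure, so it can be wrapped in whatever HTTP host you prefer, for example an Azure Function.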

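In the skillset above, I added a custom skill that examines each chunk and gives it a unique recordId. The skillset does the following:

  • It splits the document into chunks based on the specified page length.

  • It generates a unique recordId for each chunk.

  • It maps the split content and the recordId to the final index.

A recordId was generated for each chunk as expected.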
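
Once chunkIndex is in the index, it can be used to put retrieved chunks back in document order. Below is a small sketch, assuming search hits arrive as plain dictionaries carrying the parent_id, chunkIndex, and content fields from the projection, and that chunkIndex is stored as an integer. The helper `order_chunks_by_parent` is hypothetical and not part of any SDK.

```python
from collections import defaultdict


def order_chunks_by_parent(hits):
    """Group chunk hits by parent document and sort each group by chunkIndex."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["parent_id"]].append(hit)
    # Sort every parent's chunks so they follow the original document order.
    return {
        parent_id: sorted(chunks, key=lambda chunk: chunk["chunkIndex"])
        for parent_id, chunks in grouped.items()
    }


if __name__ == "__main__":
    sample_hits = [
        {"parent_id": "doc-a", "chunkIndex": 2, "content": "third chunk"},
        {"parent_id": "doc-a", "chunkIndex": 0, "content": "first chunk"},
        {"parent_id": "doc-b", "chunkIndex": 0, "content": "other document"},
        {"parent_id": "doc-a", "chunkIndex": 1, "content": "second chunk"},
    ]
    for parent_id, chunks in order_chunks_by_parent(sample_hits).items():
        print(parent_id, [chunk["content"] for chunk in chunks])
```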

