DocumentExtractionSkill does not extract text from blob data


I have a database table:

CREATE TABLE FileStorage (
    ID INT IDENTITY(1,1) PRIMARY KEY,
    FileName NVARCHAR(255) NOT NULL,     -- Name of the file
    FileData VARBINARY(MAX) NOT NULL,    -- Binary data of the file PDF
    UploadDate DATETIME DEFAULT GETDATE() -- Optional: Date of file upload
);

In Cognitive Search, the index:

{
  "name": "files-index",
  "fields": [
    {
      "name": "ID",
      "type": "Edm.String",
      "key": true,
      "retrievable": true,
      "stored": true,
      "searchable": false,
      "filterable": true,
      "sortable": true,
      "facetable": true,
      "synonymMaps": []
    },
    {
      "name": "fileName",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "filterable": true,
      "sortable": true,
      "facetable": true,
      "synonymMaps": []
    },
    {
      "name": "content",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "filterable": true,
      "sortable": true,
      "facetable": true,
      "analyzer": "standard.lucene",
      "synonymMaps": []
    },
    {
      "name": "metadata",
      "type": "Edm.String",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": true,
      "filterable": true,
      "sortable": true,
      "facetable": true,
      "synonymMaps": []
    },
    {
      "name": "uploadDate",
      "type": "Edm.DateTimeOffset",
      "key": false,
      "retrievable": true,
      "stored": true,
      "searchable": false,
      "filterable": true,
      "sortable": true,
      "facetable": true,
      "synonymMaps": []
    }
  ],
  "scoringProfiles": [],
  "suggesters": [],
  "analyzers": [],
  "tokenizers": [],
  "tokenFilters": [],
  "charFilters": [],
  "normalizers": [],
  "similarity": {
    "@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
  },
  "@odata.etag": ""
}

Skillset:

{
  "name": "testtwo",
  "description": "testtwo",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
      "name": "#1",
      "description": "Extract text and metadata from binary documents (PDF, DOCX, etc.)",
      "context": "/document",
      "inputs": [
        {
          "name": "file_data",
          "source": "/document/file_data",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "content"
        },
        {
          "name": "metadata",
          "targetName": "metadata"
        }
      ],
      "parsingMode": "default",
      "dataToExtract": "contentAndMetadata",
      "_configuration": {}
    }
  ],
  "@odata.etag": ""
}

Indexer:

{
    "name": "filestorage-indexer",
    "dataSourceName": "filestorage-datasource",
    "targetIndexName": "files-index",
    "skillsetName": "testtwo",
    "fieldMappings": [
        {
            "sourceFieldName": "ID",
            "targetFieldName": "id"
        },
        {
            "sourceFieldName": "FileName",
            "targetFieldName": "fileName"
        }
    ],
    "outputFieldMappings": [
        {
            "sourceFieldName": "/document/content",
            "targetFieldName": "content"
        },
        {
            "sourceFieldName": "/document/metadata",
            "targetFieldName": "metadata"
        }
    ]
}

But the content in the results is empty:

{
  "@odata.context": "https://someurl.windows.net/indexes('files-index')/$metadata#docs(*)",
  "@odata.count": 29,
  "value": [
    {
      "@search.score": 1,
      "ID": "20",
      "fileName": "firstfile.pdf",
      "content": null,
      "metadata": null,
      "uploadDate": null
    },{
      "@search.score": 1,
      "ID": "3",
      "fileName": "blabla.pdf",
      "content": null,
      "metadata": null,
      "uploadDate": null
    },

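I am working on a project where I need to extract the content of multiple files and put it into the content field. I have hit a roadblock and do not know how to proceed. I have tried several approaches, but the file content never ends up in the field as expected.

At this stage I am not sure whether the problem lies in my technical skills or whether my overall approach is flawed. I have gone over my code several times and double-checked my logic, but I still have not achieved the expected result.

I may be missing a step, or there may be a more effective way of handling file content that I have not considered. I am open to suggestions on how to proceed or on alternative ways to solve the problem.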

azure artificial-intelligence azure-cognitive-services
1 Answer

I have gone over my code several times and double-checked my logic, but I still have not achieved the expected result.

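The skillset JSON above has a problem in the "file_data" entry of the "inputs" field. Specifically, the empty "inputs" array nested inside that entry, next to "source", is not required and may cause problems.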

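  • The skillset should reference the FileData field directly for binary content extraction, and unnecessary properties such as the nested "inputs": [] next to "source" should be removed: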
{
  "name": "testtwo",
  "description": "testtwo",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
      "name": "#1",
      "description": "Extract text and metadata from binary documents (PDF, DOCX, etc.)",
      "context": "/document",
      "inputs": [
        {
          "name": "file_data",
          "source": "/document/FileData"
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "content"
        },
        {
          "name": "metadata",
          "targetName": "metadata"
        }
      ],
      "parsingMode": "default",
      "dataToExtract": "contentAndMetadata"
    }
  ]
}
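
The cleanup described above can also be applied programmatically before a definition is uploaded. The following is a minimal sketch, not part of any Azure SDK: the helper name clean_skillset and the choice of which keys count as editor-only are assumptions made for illustration. It only removes the empty nested "inputs" arrays and the "_configuration" and "@odata.etag" keys; it does not change any "source" path.

```python
import copy
import json

# Keys that show up in an exported definition but are absent from the
# corrected skillset. Treating them as removable is an assumption of this sketch.
EDITOR_ONLY_KEYS = {"_configuration", "@odata.etag"}


def clean_skillset(skillset: dict) -> dict:
    """Return a copy without empty nested "inputs" arrays and editor-only keys."""
    cleaned = copy.deepcopy(skillset)
    for key in EDITOR_ONLY_KEYS:
        cleaned.pop(key, None)
    for skill in cleaned.get("skills", []):
        for key in EDITOR_ONLY_KEYS:
            skill.pop(key, None)
        for entry in skill.get("inputs", []):
            # Drop "inputs": [] nested inside a single input entry.
            if entry.get("inputs") == []:
                del entry["inputs"]
    return cleaned


original = {
    "name": "testtwo",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
            "context": "/document",
            "inputs": [
                {"name": "file_data", "source": "/document/file_data", "inputs": []}
            ],
            "outputs": [{"name": "text", "targetName": "content"}],
            "_configuration": {},
        }
    ],
    "@odata.etag": "",
}

cleaned = clean_skillset(original)
print(json.dumps(cleaned, indent=2))
```

The function works on a deep copy, so the exported definition stays available for comparison.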

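The indexer must map the database columns to the index fields correctly; also check that the skillset's output field mappings are assigned correctly.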
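
One way to check the mappings offline is to compare every "targetFieldName" against the field names in the index definition. This is a rough sketch under assumptions: unmapped_targets is a hypothetical helper, and it compares names exactly, which may be stricter than what the service itself enforces. The data is trimmed from the index and indexer definitions in this thread.

```python
def unmapped_targets(index: dict, indexer: dict) -> list:
    """List mapping targets that do not exactly match any index field name."""
    index_fields = {field["name"] for field in index["fields"]}
    mappings = indexer.get("fieldMappings", []) + indexer.get("outputFieldMappings", [])
    return [
        m["targetFieldName"]
        for m in mappings
        if m["targetFieldName"] not in index_fields
    ]


# Field names taken from the index definition in the question.
index = {
    "fields": [
        {"name": name}
        for name in ("ID", "fileName", "content", "metadata", "uploadDate")
    ]
}

# Mappings as posted in the question: the key is mapped to "id".
question_indexer = {
    "fieldMappings": [
        {"sourceFieldName": "ID", "targetFieldName": "id"},
        {"sourceFieldName": "FileName", "targetFieldName": "fileName"},
    ],
    "outputFieldMappings": [
        {"sourceFieldName": "/document/content", "targetFieldName": "content"},
        {"sourceFieldName": "/document/metadata", "targetFieldName": "metadata"},
    ],
}

# Mappings as posted in the answer: "ID" and "uploadDate" are mapped explicitly.
answer_indexer = {
    "fieldMappings": [
        {"sourceFieldName": "ID", "targetFieldName": "ID"},
        {"sourceFieldName": "FileName", "targetFieldName": "fileName"},
        {"sourceFieldName": "UploadDate", "targetFieldName": "uploadDate"},
    ],
    "outputFieldMappings": question_indexer["outputFieldMappings"],
}

print(unmapped_targets(index, question_indexer))  # ['id']
print(unmapped_targets(index, answer_indexer))    # []
```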

Sample data:

[image]

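Limit the indexer to processing a single document. This helps isolate problems with a specific record.

  • In the indexer configuration, add a filter that selects a single ID (for example, WHERE ID = 1).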
{
  "name": "filestorage-indexer",
  "dataSourceName": "filestorage-datasource",
  "targetIndexName": "files-index",
  "skillsetName": "testtwo",
  "fieldMappings": [
    {
      "sourceFieldName": "ID",
      "targetFieldName": "ID"
    },
    {
      "sourceFieldName": "FileName",
      "targetFieldName": "fileName"
    },
    {
      "sourceFieldName": "UploadDate",
      "targetFieldName": "uploadDate"
    }
  ],
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/content",
      "targetFieldName": "content"
    },
    {
      "sourceFieldName": "/document/metadata",
      "targetFieldName": "metadata"
    }
  ],
  "parameters": {
    "batchSize": 100,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0
  }
}

The output fields ("content" and "metadata") line up correctly with the target index fields.
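
To confirm that a rerun actually populated the fields, a saved query response can be scanned for documents whose content is still empty. This is a small sketch assuming the response has the shape shown in the question; ids_with_empty_content is a hypothetical helper, and the sample data is copied from the results posted there.

```python
def ids_with_empty_content(response: dict) -> list:
    """Return the IDs of documents whose "content" field is missing, null, or empty."""
    return [
        doc["ID"]
        for doc in response.get("value", [])
        if not doc.get("content")
    ]


# Two documents from the query result in the question.
response = {
    "value": [
        {"ID": "20", "fileName": "firstfile.pdf", "content": None, "metadata": None},
        {"ID": "3", "fileName": "blabla.pdf", "content": None, "metadata": None},
    ]
}

print(ids_with_empty_content(response))  # ['20', '3']
```

An empty list after reindexing would suggest that extraction and the output field mappings are working for every returned document.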

