I have a database table:
CREATE TABLE FileStorage (
ID INT IDENTITY(1,1) PRIMARY KEY,
FileName NVARCHAR(255) NOT NULL, -- Name of the file
FileData VARBINARY(MAX) NOT NULL, -- Binary data of the file PDF
UploadDate DATETIME DEFAULT GETDATE() -- Optional: Date of file upload
);
In Cognitive Search, the index:
{
"name": "files-index",
"fields": [
{
"name": "ID",
"type": "Edm.String",
"key": true,
"retrievable": true,
"stored": true,
"searchable": false,
"filterable": true,
"sortable": true,
"facetable": true,
"synonymMaps": []
},
{
"name": "fileName",
"type": "Edm.String",
"key": false,
"retrievable": true,
"stored": true,
"searchable": true,
"filterable": true,
"sortable": true,
"facetable": true,
"synonymMaps": []
},
{
"name": "content",
"type": "Edm.String",
"key": false,
"retrievable": true,
"stored": true,
"searchable": true,
"filterable": true,
"sortable": true,
"facetable": true,
"analyzer": "standard.lucene",
"synonymMaps": []
},
{
"name": "metadata",
"type": "Edm.String",
"key": false,
"retrievable": true,
"stored": true,
"searchable": true,
"filterable": true,
"sortable": true,
"facetable": true,
"synonymMaps": []
},
{
"name": "uploadDate",
"type": "Edm.DateTimeOffset",
"key": false,
"retrievable": true,
"stored": true,
"searchable": false,
"filterable": true,
"sortable": true,
"facetable": true,
"synonymMaps": []
}
],
"scoringProfiles": [],
"suggesters": [],
"analyzers": [],
"tokenizers": [],
"tokenFilters": [],
"charFilters": [],
"normalizers": [],
"similarity": {
"@odata.type": "#Microsoft.Azure.Search.BM25Similarity"
},
"@odata.etag": ""
}
Skillset:
{
"name": "testtwo",
"description": "testtwo",
"skills": [
{
"@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
"name": "#1",
"description": "Extract text and metadata from binary documents (PDF, DOCX, etc.)",
"context": "/document",
"inputs": [
{
"name": "file_data",
"source": "/document/file_data",
"inputs": []
}
],
"outputs": [
{
"name": "text",
"targetName": "content"
},
{
"name": "metadata",
"targetName": "metadata"
}
],
"parsingMode": "default",
"dataToExtract": "contentAndMetadata",
"_configuration": {}
}
],
"@odata.etag": ""
}
Indexer:
{
"name": "filestorage-indexer",
"dataSourceName": "filestorage-datasource",
"targetIndexName": "files-index",
"skillsetName": "testtwo",
"fieldMappings": [
{
"sourceFieldName": "ID",
"targetFieldName": "id"
},
{
"sourceFieldName": "FileName",
"targetFieldName": "fileName"
}
],
"outputFieldMappings": [
{
"sourceFieldName": "/document/content",
"targetFieldName": "content"
},
{
"sourceFieldName": "/document/metadata",
"targetFieldName": "metadata"
}
]
}
But the content in the results is empty:
{
"@odata.context": "https://someurl.windows.net/indexes('files-index')/$metadata#docs(*)",
"@odata.count": 29,
"value": [
{
"@search.score": 1,
"ID": "20",
"fileName": "firstfile.pdf",
"content": null,
"metadata": null,
"uploadDate": null
},{
"@search.score": 1,
"ID": "3",
"fileName": "blabla.pdf",
"content": null,
"metadata": null,
"uploadDate": null
},
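I am working on a project where I need to extract the content of multiple files and put it into the content field. However, I have hit a roadblock and am not sure how to proceed. I have tried various approaches, but I am stuck at this point: I cannot get the content of the files into the field as expected.
At this stage, I am not sure whether the problem lies in my technical skills or whether my overall approach is flawed. I have gone through the code several times and double-checked my logic, but I still do not get the expected result.
Perhaps I am missing a step, or there is a more efficient way to handle file content that I have not considered. I am open to suggestions on how to proceed or to alternative approaches.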
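The JSON skillset above has a problem with the file_data entry in its inputs array. Specifically, the nested inputs array inside that entry is not required and may cause problems.
Point source at the FileData field for binary content extraction, and remove unnecessary properties such as the empty inputs: [] next to source.
{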
"name": "testtwo",
"description": "testtwo",
"skills": [
{
"@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
"name": "#1",
"description": "Extract text and metadata from binary documents (PDF, DOCX, etc.)",
"context": "/document",
"inputs": [
{
"name": "file_data",
"source": "/document/FileData"
}
],
"outputs": [
{
"name": "text",
"targetName": "content"
},
{
"name": "metadata",
"targetName": "metadata"
}
],
"parsingMode": "default",
"dataToExtract": "contentAndMetadata"
}
]
}
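As a quick sanity check on the point above, the input sources of a skillset can be compared against the table's column names. The sketch below uses a hypothetical helper, unresolved_skill_inputs, which is not part of any Azure SDK. It assumes that a SQL column named X is exposed to skills as /document/X with its exact casing, which is an assumption rather than something confirmed here.

```python
def unresolved_skill_inputs(skillset, source_columns):
    """Return (skill name, input name, source) for every skill input whose
    source is neither a known column path nor an earlier skill's output.

    Assumption: a source column X is addressed as /document/X (exact casing).
    """
    known = {f"/document/{column}" for column in source_columns}
    problems = []
    for skill in skillset.get("skills", []):
        for item in skill.get("inputs", []):
            source = item.get("source")
            if source not in known:
                problems.append((skill.get("name"), item.get("name"), source))
        # Outputs of this skill become valid sources for later skills.
        context = skill.get("context", "/document")
        for out in skill.get("outputs", []):
            known.add(f"{context}/{out.get('targetName', out['name'])}")
    return problems


# Columns of the FileStorage table from the question.
columns = ["ID", "FileName", "FileData", "UploadDate"]


def make_skillset(source):
    """Build a trimmed copy of the skillset with the given input source."""
    return {
        "skills": [
            {
                "name": "#1",
                "context": "/document",
                "inputs": [{"name": "file_data", "source": source}],
                "outputs": [
                    {"name": "text", "targetName": "content"},
                    {"name": "metadata", "targetName": "metadata"},
                ],
            }
        ]
    }


original = make_skillset("/document/file_data")
corrected = make_skillset("/document/FileData")

print(unresolved_skill_inputs(original, columns))
print(unresolved_skill_inputs(corrected, columns))
```

The first call reports the file_data input because no column is named file_data; the second returns an empty list. The check is purely textual and does not contact the search service.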
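The indexer must map the database fields to the index fields correctly; also check that the skillset's output field mappings are assigned correctly. In the version below, the ID mapping targets ID (the name of the index's key field) instead of id, and UploadDate is mapped to uploadDate.
Sample data:
Limit the indexer to a single document. This helps identify problems with specific records.
Filter on ID (for example, WHERE ID = 1).
{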
"name": "filestorage-indexer",
"dataSourceName": "filestorage-datasource",
"targetIndexName": "files-index",
"skillsetName": "testtwo",
"fieldMappings": [
{
"sourceFieldName": "ID",
"targetFieldName": "ID"
},
{
"sourceFieldName": "FileName",
"targetFieldName": "fileName"
},
{
"sourceFieldName": "UploadDate",
"targetFieldName": "uploadDate"
}
],
"outputFieldMappings": [
{
"sourceFieldName": "/document/content",
"targetFieldName": "content"
},
{
"sourceFieldName": "/document/metadata",
"targetFieldName": "metadata"
}
],
"parameters": {
"batchSize": 100,
"maxFailedItems": 0,
"maxFailedItemsPerBatch": 0
}
}
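One way to catch a mapping that points at a non-existent index field is to compare every targetFieldName with the index's field names. This is a minimal sketch with a hypothetical helper, unmatched_targets. It flags any target that is not an exact match, including names that differ only by case, and makes no claim about whether the service itself tolerates such differences.

```python
def unmatched_targets(index, indexer):
    """Return mapping targets that are not exact names of index fields."""
    field_names = {field["name"] for field in index["fields"]}
    mappings = (indexer.get("fieldMappings", [])
                + indexer.get("outputFieldMappings", []))
    return [m["targetFieldName"] for m in mappings
            if m["targetFieldName"] not in field_names]


def mapping(source, target):
    """Small helper to keep the sample data readable."""
    return {"sourceFieldName": source, "targetFieldName": target}


# Field names taken from the files-index definition in the question.
index = {"fields": [{"name": name} for name in
                    ("ID", "fileName", "content", "metadata", "uploadDate")]}

output_mappings = [mapping("/document/content", "content"),
                   mapping("/document/metadata", "metadata")]

original_indexer = {
    "fieldMappings": [mapping("ID", "id"), mapping("FileName", "fileName")],
    "outputFieldMappings": output_mappings,
}
corrected_indexer = {
    "fieldMappings": [mapping("ID", "ID"),
                      mapping("FileName", "fileName"),
                      mapping("UploadDate", "uploadDate")],
    "outputFieldMappings": output_mappings,
}

print(unmatched_targets(index, original_indexer))   # ['id']
print(unmatched_targets(index, corrected_indexer))  # []
```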
The output fields (content, metadata) are correctly aligned with the target index fields.
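After re-running the indexer, a search response can be checked for fields that are still null. The sketch below is illustrative only: null_field_counts is a hypothetical helper, and the sample response is a trimmed copy of the two documents shown in the question.

```python
def null_field_counts(response, fields):
    """Count, per field, the documents in a search response whose value
    is null (a field missing from a document is counted as null too)."""
    docs = response.get("value", [])
    return {name: sum(1 for doc in docs if doc.get(name) is None)
            for name in fields}


# Trimmed copy of the search result from the question.
sample_response = {
    "value": [
        {"ID": "20", "fileName": "firstfile.pdf",
         "content": None, "metadata": None, "uploadDate": None},
        {"ID": "3", "fileName": "blabla.pdf",
         "content": None, "metadata": None, "uploadDate": None},
    ]
}

counts = null_field_counts(sample_response, ["content", "metadata", "uploadDate"])
print(counts)  # {'content': 2, 'metadata': 2, 'uploadDate': 2}
```

If the counts for content and metadata drop to zero for the test record, the skillset input and the output field mappings are doing their job.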