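We have a platform that records call-center calls and appends some XML to the end of each wav file, holding important metadata about the call.
I am trying to read a folder of these wav files and pull the metadata into a list for users. Their preference is for the list to be in Excel, but I am struggling to find a method that works reliably on a stock Windows computer, without installing anything special such as Python.
For example, Excel has an XML import feature, but it fails because the XML is at the end of the file: Excel reads from the start and gets confused by the audio portion rather than just skipping to the XML.
I tried the following in PowerShell: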
$directory = "C:\test"
$wavFiles = Get-ChildItem -Path $directory -Filter *.wav
foreach ($file in $wavFiles) {
    Write-Host "Processing file: $($file.Name)"
    $content = Get-Content -Path $file.FullName -Raw -Encoding Byte
    $decodedContent = [System.Text.Encoding]::UTF8.GetString($content)
    $match = [regex]::Match($decodedContent, '<recording>.+?</recording>')
    if ($match.Success) {
        $xmlContent = $match.Value
        Write-Host "Found XML in file $($file.Name):"
        Write-Host $xmlContent
    } else {
        Write-Host "No XML found in file $($file.Name)."
    }
}
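This locates the files correctly but fails to find the XML, even though the XML is visible when the file is opened in a text editor such as Notepad++:
Processing file: 131346032527__8115_02-13-2024-11-12-58.wav
No XML found in file 131346032527__8115_02-13-2024-11-12-58.wav.
Any ideas?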
I finally got the XML extracted. I used XML LINQ (System.Xml.Linq) to parse it.
using assembly System.Xml.Linq
$filename = 'c:\temp\test.wav';
enum Wave {
    PCM = 0x0001
    IBM_MULAW = 0x0101
    IBM_ALAW = 0x0102
    IBM_ADPCM = 0x0103
}
$stream = [System.IO.File]::OpenRead($filename);
$bReader = [System.IO.BinaryReader]::New($stream);
$length = $bReader.BaseStream.Length;
# Walk the top-level chunks: 4-byte identifier, 4-byte little-endian size, then the payload
while ($bReader.BaseStream.Position -lt $length)
{
    $identifier = $bReader.ReadBytes(4);
    $identifiedStr = [System.Text.Encoding]::UTF8.GetString($identifier);
    $size = [System.BitConverter]::ToInt32($bReader.ReadBytes(4), 0);
    $chunk = $bReader.ReadBytes($size);
    # Chunks are word-aligned: if the size is odd, skip the single padding byte (when present)
    if ($size % 2 -eq 1 -and $bReader.BaseStream.Position -lt $length) { [void]$bReader.ReadByte(); }
    if ($identifiedStr -eq 'RIFF')
    {
        # The first 4 bytes of a RIFF payload name the form type, e.g. "WAVE"
        $chunkIdent = [System.Text.Encoding]::UTF8.GetString($chunk, 0, 4).Trim();
        $offset = 4;
        if ($chunkIdent -eq "WAVE")
        {
            # Continue only while a full 8-byte sub-chunk header remains
            while ($offset + 8 -le $chunk.Length)
            {
                $chunktypeStr = [System.Text.Encoding]::UTF8.GetString($chunk, $offset, 4).Trim();
                Write-Host "Chunk Ident = $chunkIdent, Chunk Type = $chunktypeStr"
                $chunkTypeLength = [System.BitConverter]::ToInt32($chunk, $offset + 4);
                switch ($chunktypeStr)
                {
                    'fmt' {
                        # The identifier 'fmt ' has been trimmed to 'fmt' above
                        $category = [Wave][System.BitConverter]::ToInt16($chunk, $offset + 8);
                        $channels = [System.BitConverter]::ToInt16($chunk, $offset + 10);
                        $samplesPerSecond = [System.BitConverter]::ToInt32($chunk, $offset + 12);
                        $averageBytesPerSecond = [System.BitConverter]::ToInt32($chunk, $offset + 16);
                        $blockAlign = [System.BitConverter]::ToInt16($chunk, $offset + 20);
                        Write-Host "Format : Category = $category, Channels = $channels, Samples per second = $samplesPerSecond, Average bytes per second = $averageBytesPerSecond"
                    }
                    'data' {
                        # Decodes everything from the 'data' header to the end of the RIFF payload; only useful for inspection
                        $data = [System.Text.Encoding]::UTF8.GetString($chunk, $offset, $chunk.Length - $offset);
                    }
                    'iXML' {
                        # The payload starts after the 8-byte sub-chunk header and is $chunkTypeLength bytes long
                        $xml = [System.Text.Encoding]::UTF8.GetString($chunk, $offset + 8, $chunkTypeLength);
                        $doc = [System.Xml.Linq.XDocument]::Parse($xml);
                    }
                    default {
                    }
                }
                # Advance past the header and payload, plus the padding byte after an odd-sized payload
                $offset += $chunkTypeLength + 8
                if ($chunkTypeLength % 2 -eq 1) { $offset += 1 }
            }
        }
    }
}
# Disposing the reader also closes the underlying file stream
$bReader.Dispose()
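To get from the parsed XML to something Excel can open, one option is to flatten the root's child elements into one row per file and write the rows out as CSV. The following is only a minimal sketch, assuming the metadata is flat (one level of child elements under the root). The helper name `ConvertTo-MetadataRow`, the sample XML and the output file name are made up for illustration, not taken from the real recordings.

```powershell
Add-Type -AssemblyName System.Xml.Linq

# Hypothetical helper: turn the root's direct child elements into one flat row.
function ConvertTo-MetadataRow {
    param(
        [Parameter(Mandatory)] [string] $Xml,
        [string] $FileName = ''
    )
    $doc = [System.Xml.Linq.XDocument]::Parse($Xml)
    $row = [ordered]@{ File = $FileName }
    foreach ($element in $doc.Root.Elements()) {
        # Element names become column headers, element text becomes the cell value.
        $row[$element.Name.LocalName] = $element.Value
    }
    [pscustomobject]$row
}

# Made-up sample; in practice the XML string would come from the chunk parsed above.
$sample = '<recording><Title>My Thing</Title><Date>1-2-24</Date></recording>'
$rows = @(ConvertTo-MetadataRow -Xml $sample -FileName 'test.wav')

# Excel opens CSV files directly, so nothing extra has to be installed.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'recordings.csv'
$rows | Export-Csv -Path $csvPath -NoTypeInformation
```

Collecting one such row per wav file in the folder and exporting them together would give a single sheet with one line per call. Nested elements or attributes would need extra handling that this sketch does not attempt.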
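I did some experimenting and ran into a similar problem getting the regex to match. I found that by removing the `r`n line-break characters and collapsing the whole raw string into a single line, I was able to get the pattern to match. This works because `.` does not match a line break by default, so `.+?` cannot span several lines. See the example.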
$decodedContent = @"
This is a bunch of garbage
<recording>
<Title>My Thing</Title>
<Date>1-2-24</Date>
<meta>12345</meta>
</recording>
A bunch of other garbage
"@
# Strip both CRLF and bare LF line breaks so the pattern sees a single line
$match = [regex]::Match(($decodedContent -replace '\r?\n', ''), '<recording>.+?</recording>')
$xmlContent = $match.Value
[xml]$xml = $xmlContent
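An alternative to stripping the line breaks is to let `.` match them, using the `Singleline` regex option (the inline flag `(?s)` at the start of the pattern has the same effect). A minimal sketch, reusing the made-up sample text from above and the `<recording>` tag name from the question:

```powershell
# Sample text standing in for the decoded file content (made-up data).
$decodedContent = @"
This is a bunch of garbage
<recording>
<Title>My Thing</Title>
<Date>1-2-24</Date>
<meta>12345</meta>
</recording>
A bunch of other garbage
"@

# Singleline makes '.' match line breaks too, so the lazy '.+?' can span lines.
$options = [System.Text.RegularExpressions.RegexOptions]::Singleline
$match = [regex]::Match($decodedContent, '<recording>.+?</recording>', $options)

if ($match.Success) {
    [xml]$xml = $match.Value
    $title = $xml.recording.Title   # 'My Thing' for the sample above
    Write-Host "Title: $title"
}
```

This leaves the XML text unchanged, which may matter if any element values themselves contain line breaks.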