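I have a script that goes through all the files in certain folders and then searches for text inside them. The problem is that I want it to selectively search only files that are not binary in nature. This seems to be hard to do. I have tried combining several techniques, but it still does not work properly. Many files, for example C:\csb.log, which seems to be some kind of system file, are flagged as binary when they are just text files. Then there are files like PDF or EPUB / MOBI, which are text files and yet are not; it is confusing. I particularly dislike having a line like the following: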
$binaryExtensions = @('.exe', '.dll', '.bin', '.iso', '.zip', '.tar', '.rar', '.7z', '.gz', '.pdf', '.epub', '.mobi', '.azw', '.azw2', '.azw3')
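I think we should be able to detect the nature of a file quickly, without relying on file extensions.
How can we detect (hopefully in a simple way) non-binary files so that their text can be searched cleanly, and possibly also detect files that are partly binary and partly text, such as PDF, EPUB, and MOBI, ignoring the binary parts while cleanly searching the text in the non-binary parts?
Here is my PowerShell attempt so far.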
function Is-BinaryFile {
    param (
        [string]$FilePath,
        [switch]$errors
    )

    # Define the log file path based on the function name
    $functionName = $MyInvocation.MyCommand.Name
    $logFile = "$Env:TEMP\$functionName.log"

    # If $errors is specified without $FilePath, output the log file contents
    if ($errors -and -not $FilePath) {
        if (Test-Path $logFile) {
            Get-Content -Path $logFile
        } else {
            Write-Host "Log file not found: $logFile"
        }
        return
    }

    # If the FilePath is a directory, consider it to be a 'binary file' for this and return true
    if ((Test-Path $FilePath) -and (Get-Item $FilePath).PSIsContainer) {
        return $true
    }

    # Check for common binary file extensions before reading the file
    $binaryExtensions = @('.exe', '.dll', '.bin', '.iso', '.zip', '.tar', '.rar', '.7z', '.gz', '.pdf', '.epub', '.mobi', '.azw', '.azw2', '.azw3')
    $fileExtension = [System.IO.Path]::GetExtension($FilePath).ToLower()
    if ($binaryExtensions -contains $fileExtension) {
        return $true
    }

    # Retry opening the file if it's locked using FileStream in shared mode
    $maxRetries = 5
    $retryDelay = 2 # in seconds
    $attempt = 0
    $fileStream = $null
    while ($attempt -lt $maxRetries -and -not $fileStream) {
        try {
            # Open file with shared read/write mode
            $fileStream = [System.IO.FileStream]::new($FilePath, [System.IO.FileMode]::Open, [System.IO.FileAccess]::Read, [System.IO.FileShare]::ReadWrite)
        } catch {
            Write-Host "File is locked, retrying in $retryDelay seconds..."
            Start-Sleep -Seconds $retryDelay
            $attempt++
        }
    }
    if (-not $fileStream) {
        $timestamp = Get-Date -Format "yyyy-MM-dd_HH-mm-ss"
        $logMessage = "[$timestamp] Failed to open '$FilePath' after $maxRetries attempts"
        Write-Host $logMessage
        $logMessage | Out-File -FilePath $logFile -Append
        return $false
    }

    # Rest of your code to check if it's a binary file...
    $reader = $null
    try {
        $reader = New-Object System.IO.BinaryReader($fileStream)

        # Check for BOM (Byte Order Mark)
        $bomBuffer = New-Object byte[] 4
        $bytesRead = $reader.Read($bomBuffer, 0, $bomBuffer.Length)
        if ($bytesRead -ge 2 -and (
                ($bomBuffer[0] -eq 0xEF -and $bomBuffer[1] -eq 0xBB -and $bomBuffer[2] -eq 0xBF) -or # UTF-8 BOM
                ($bomBuffer[0] -eq 0xFF -and $bomBuffer[1] -eq 0xFE) -or # UTF-16 LE BOM
                ($bomBuffer[0] -eq 0xFE -and $bomBuffer[1] -eq 0xFF) # UTF-16 BE BOM
            )) {
            return $false # It's a text file
        }

        # If no BOM, continue checking for non-printable characters
        $binaryBytes = 0
        $textBytes = 0
        $buffer = New-Object byte[] 1024
        while (($bytesRead = $reader.Read($buffer, 0, $buffer.Length)) -gt 0) {
            for ($i = 0; $i -lt $bytesRead; $i++) {
                if ($buffer[$i] -eq 0) {
                    $binaryBytes++
                } elseif ($buffer[$i] -lt 32 -and $buffer[$i] -ne 9 -and $buffer[$i] -ne 10 -and $buffer[$i] -ne 13) {
                    $binaryBytes++
                } else {
                    $textBytes++
                }
            }
        }
        return $binaryBytes -gt $textBytes
    } finally {
        if ($reader) { $reader.Close() }
        if ($fileStream) { $fileStream.Close() }
    }
}
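A seemingly reasonable approach is the one in this answer: essentially, open the file with StreamReader (which tries to detect the correct encoding) and read it character by character to determine whether each character is a control character. In this case, a character counts as a control character if it is greater than NUL (0) and less than Back Space (8), OR greater than Carriage Return (13) and less than Substitute (26). If any character meets that condition, the file is treated as binary; otherwise it is treated as a text file. This is of course not foolproof and it can fail, but it should work in most cases. Note that the OP, in his answer, decided to read until EOF, which can be slow in some cases. I decided to simply translate the C# code to PowerShell, but you could add a limit on the number of characters to read before deciding whether the file is binary.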
function Test-Binary {
    [CmdletBinding(DefaultParameterSetName = 'Path')]
    param(
        [Parameter(
            ParameterSetName = 'LiteralPath',
            ValueFromPipelineByPropertyName,
            Mandatory)]
        [Alias('PSPath')]
        [string[]] $LiteralPath,

        [Parameter(
            ParameterSetName = 'Path',
            Mandatory,
            ValueFromPipeline,
            Position = 0)]
        [SupportsWildcards()]
        [string[]] $Path
    )

    begin {
        # Taken from: https://stackoverflow.com/a/26652983/15339544
        class Utils {
            hidden static [char] $NUL = [char] 0  # Null char
            hidden static [char] $BS  = [char] 8  # Back Space
            hidden static [char] $CR  = [char] 13 # Carriage Return
            hidden static [char] $SUB = [char] 26 # Substitute

            # True when the character falls in (NUL, BS) or (CR, SUB), exclusive
            hidden static [bool] IsControlChar([int] $ch) {
                return ($ch -gt [Utils]::NUL -and $ch -lt [Utils]::BS) -or
                    ($ch -gt [Utils]::CR -and $ch -lt [Utils]::SUB)
            }

            static [bool] IsBinary([System.IO.FileInfo] $fileinfo) {
                # An empty file is treated as text
                if ($fileinfo.Length -eq 0) {
                    return $false
                }

                $reader = $null
                try {
                    $reader = [System.IO.StreamReader]::new($fileinfo.FullName)
                    # Read() returns -1 at end of stream
                    while (($char = $reader.Read()) -ne -1) {
                        if ([Utils]::IsControlChar($char)) {
                            return $true
                        }
                    }
                    return $false
                }
                finally {
                    if ($reader) {
                        $reader.Dispose()
                    }
                }
            }
        }
    }

    process {
        foreach ($item in Get-Item @PSBoundParameters) {
            if ($item -isnot [System.IO.FileInfo]) {
                Write-Error "Item is a directory: '$($item.FullName)'..."
                continue
            }

            try {
                [pscustomobject]@{
                    Path     = $item.FullName
                    IsBinary = [Utils]::IsBinary($item)
                }
            }
            catch {
                $PSCmdlet.WriteError($_)
            }
        }
    }
}
Usage is simple:
PS> Test-Binary myFile.ext
PS> Get-ChildItem -File | Test-Binary
PS> Test-Binary *.ext
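As for the limit on the number of characters mentioned earlier, here is a minimal sketch of how it might look. The function name Test-BinaryPrefix, the -MaxChars parameter, and its default of 8192 are my own assumptions for illustration and are not part of the linked answer. It applies the same control character ranges, but stops after inspecting at most -MaxChars characters, so a file whose first control character appears later than that would be reported as text.

```powershell
function Test-BinaryPrefix {
    param(
        [Parameter(Mandatory)]
        [string] $LiteralPath,

        # Hypothetical cap on how many characters to inspect before giving up
        [int] $MaxChars = 8192
    )

    # Resolve to a full path, since .NET does not share PowerShell's current location
    $fullPath = Convert-Path -LiteralPath $LiteralPath
    $reader = [System.IO.StreamReader]::new($fullPath)
    try {
        $count = 0
        # Stop at end of stream (-1) or once $MaxChars characters were inspected
        while ($count -lt $MaxChars -and ($ch = $reader.Read()) -ne -1) {
            $count++
            # Same ranges as IsControlChar: NUL < ch < BS, or CR < ch < SUB
            if (($ch -gt 0 -and $ch -lt 8) -or ($ch -gt 13 -and $ch -lt 26)) {
                return $true
            }
        }
        return $false
    }
    finally {
        $reader.Dispose()
    }
}

# Example: search only the files that look like text ('needle' is a placeholder)
# Get-ChildItem -File -Recurse |
#     Where-Object { -not (Test-BinaryPrefix -LiteralPath $_.FullName) } |
#     Select-String -Pattern 'needle' -SimpleMatch
```

A smaller -MaxChars is faster on large files but more likely to misjudge files that start with a long run of plain text, so the value is a trade-off you would have to tune for your own data.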