How to compare two consecutive lines in a file

Question (1 vote, 3 answers)

I have a large file containing a "before" and an "after" case for each item, like this:

case1 (BEF) ACT
      (AFT) BLK
case2 (BEF) ACT
      (AFT) ACT
case3 (BEF) ACT
      (AFT) CLC
...

I need to select every pair that has (BEF) ACT on the first line and (AFT) BLK on the second line, and write the results to a file.

The idea is to build a condition like this:

IF (stringX.LineNumber consists of "(BEF) ACT" AND stringX+1.LineNumber consists of (AFT) BLK)
{OutFile $stringX+$stringX+1}

Sorry about the syntax, I've only just started using PS :)

$logfile = 'c:\temp\file.txt'
$matchphrase = '\(BEF\) ACT'
$linenum=Get-Content $logfile | Select-String $matchphrase | ForEach-Object {$_.LineNumber+1}
$linenum 
#I've worked out how to get a line number after the line with first required phrase

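The result should be a new file containing each line with "(BEF) ACT" that is followed by a line with "(AFT) BLK", together with that following line.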

powershell
3 Answers
1 vote
Select-String -SimpleMatch -CaseSensitive '(BEF) ACT' c:\temp\file.txt -Context 0,1 |
  ForEach-Object {
    $lineAfter = $_.Context.PostContext[0]
    if ($lineAfter.Contains('(AFT) BLK')) {
      $_.Line, $lineAfter  # output
    }
  } # | Set-Content ...
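  • -SimpleMatch performs literal substring matching, which means you can pass the search string as-is, without escaping it. However, if you need to constrain the search further, for example to make sure the string occurs only at the end of a line ($), you do need a regular expression, passed to the (implied) -Pattern parameter: '\(BEF\) ACT$'. Also note that PowerShell is generally case-insensitive by default, which is why the -CaseSensitive switch is used.
  • Note how Select-String accepts file paths directly - no preceding Get-Content call is needed.
  • -Context 0,1 captures 0 lines before and 1 line after each match, and includes them in the [Microsoft.PowerShell.Commands.MatchInfo] instances that Select-String outputs.
  • Inside the ForEach-Object script block, $_.Context.PostContext[0] retrieves the line after the match, and .Contains() performs a literal substring search in it. Note that .Contains() is a method of the .NET System.String type, and such methods - unlike PowerShell - are case-sensitive by default. In .NET Core 2.1 and later (and therefore in PowerShell 7), an optional StringComparison argument changes that.
  • If the substring is found on the following line, both the current line and the following line are output.
  • The above finds all matching pairs in the input file; if you only want the first pair, append | Select-Object -First 2 to the end of the pipeline, after the ForEach-Object block.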
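
As a minimal sketch of the regex variant mentioned in the first bullet, the following anchors the first pattern to the end of the line and writes the pairs to a file. The sample data and the temp file names ($inFile, $outFile) are assumptions invented for illustration; they are not from the question.

```powershell
# Hypothetical sample data, written to a temp file for the demo.
$inFile  = Join-Path ([System.IO.Path]::GetTempPath()) 'pairs-demo-in.txt'
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'pairs-demo-out.txt'
@(
  'case1 (BEF) ACT'
  '      (AFT) BLK'
  'case2 (BEF) ACT'
  '      (AFT) ACT'
  'case3 (BEF) ACT'
  '      (AFT) BLK'
) | Set-Content -Path $inFile

# No -SimpleMatch here, so the pattern is a regex: escape the parentheses
# and anchor the match to the end of the line with $.
$pairs = Select-String -CaseSensitive -Pattern '\(BEF\) ACT$' -Path $inFile -Context 0,1 |
  ForEach-Object {
    $lineAfter = $_.Context.PostContext[0]
    # Guard against a match on the very last line, which has no line after it.
    if ($lineAfter -and $lineAfter.Contains('(AFT) BLK')) {
      $_.Line, $lineAfter
    }
  }

# Write the matching pairs to the (hypothetical) output file.
$pairs | Set-Content -Path $outFile
```

With this sample input, only the case1 and case3 pairs should end up in the output file, because the line after case2 contains (AFT) ACT rather than (AFT) BLK.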

1 vote

Another approach is to read $logFile as a single string and use regex matching to extract the parts you need:

$logFile = 'c:\temp\file.txt'
$outFile = 'c:\temp\file2.txt'

# read the content of the logfile as a single string
$content = Get-Content -Path $logFile -Raw

$regex = [regex] '(case\d+\s+\(BEF\)\s+ACT\s+\(AFT\)\s+BLK)'
$match = $regex.Match($content)
($output = while ($match.Success) {
    $match.Value
    $match = $match.NextMatch()
}) | Set-Content -Path $outFile -Force

The resulting output looks like this:

case1 (BEF) ACT
      (AFT) BLK
case7 (BEF) ACT
      (AFT) BLK

Regex details:

(              Match the regular expression below and capture its match into backreference number 1
   case        Match the characters “case” literally
   \d          Match a single digit 0..9
      +        Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   \s          Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
      +        Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   \(          Match the character “(” literally
   BEF         Match the characters “BEF” literally
   \)          Match the character “)” literally
   \s          Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
      +        Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   ACT         Match the characters “ACT” literally
   \s          Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
      +        Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   \(          Match the character “(” literally
   AFT         Match the characters “AFT” literally
   \)          Match the character “)” literally
   \s          Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
      +        Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   BLK         Match the characters “BLK” literally
)
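
The same pattern can be tried without any files, on an in-memory string. This is only a sketch: the sample text is invented, and it uses [regex]::Matches(), which returns all matches at once, in place of the Match()/NextMatch() loop above. Note that the [regex] class is case-sensitive by default, unlike PowerShell's -match operator.

```powershell
# Hypothetical sample text, joined with newlines so that it resembles
# what Get-Content -Raw would return for a file with these lines.
$content = @(
  'case1 (BEF) ACT'
  '      (AFT) BLK'
  'case2 (BEF) ACT'
  '      (AFT) ACT'
  'case7 (BEF) ACT'
  '      (AFT) BLK'
) -join "`n"

# \s+ also matches the line break between the two lines of a pair.
$pattern = 'case\d+\s+\(BEF\)\s+ACT\s+\(AFT\)\s+BLK'

# Collect the text of every match; each value spans two lines.
$found = [regex]::Matches($content, $pattern) | ForEach-Object { $_.Value }
```

Each element of $found is one two-line pair, so with this sample text only the case1 and case7 pairs should be returned.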

1 vote
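  • My other answer completes your own Select-String-based solution attempt. Select-String is versatile but slow, though it is suitable for files too large to fit into memory as a whole, because it processes files line by line. However, PowerShell offers a much faster line-by-line alternative: switch -File - see the solution below.
  • Theo's helpful answer, which first reads the entire file into memory, may perform best overall, depending on file size, but it comes with added complexity because it relies heavily on direct use of .NET functionality.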

$(
  $firstLine = ''
  switch -CaseSensitive -Regex -File t.txt {
    '\(BEF\) ACT' { $firstLine = $_; continue }
    '\(AFT\) BLK' { 
      # Pair found, output it.
      # If you don't want to look for further pairs, 
      # append `; break` inside the block.
      if ($firstLine) { $firstLine, $_ }
      # Look for further pairs.
      $firstLine = ''; continue
    }
    default { $firstLine = '' }
  } 
) # | Set-Content ...

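Note: The enclosing $(...) is only needed if you want to send the output directly through the pipeline to a cmdlet such as Set-Content; it is not needed for capturing the output in a variable: $pair = switch ...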

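  • -Regex interprets the branch conditions as regular expressions.
  • $_ inside a branch's action script block ({ ... }) refers to the line at hand.
  • The overall approach is: $firstLine stores the first line of interest once it is found; when the second line's pattern is found and $firstLine is set (non-empty), the pair is output. The default handler resets $firstLine, to ensure that only two consecutive lines containing the strings of interest are considered.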
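
To illustrate the variable-capture form and the reset done by the default branch, here is a small sketch. The temp file and its contents are hypothetical; the case2 entry deliberately has an unrelated line between its two lines, so it should not be reported as a pair.

```powershell
# Hypothetical sample data in a temp file.
$file = Join-Path ([System.IO.Path]::GetTempPath()) 'switch-demo.txt'
@(
  'case1 (BEF) ACT'
  '      (AFT) BLK'
  'case2 (BEF) ACT'
  'some unrelated line'
  '      (AFT) BLK'
) | Set-Content -Path $file

$firstLine = ''
# Capturing in a variable needs no enclosing $(...).
$pairs = switch -CaseSensitive -Regex -File $file {
  '\(BEF\) ACT' { $firstLine = $_; continue }
  '\(AFT\) BLK' {
    # Output the pair only if the previous line was a (BEF) ACT line.
    if ($firstLine) { $firstLine, $_ }
    $firstLine = ''; continue
  }
  # Any other line breaks the sequence, so forget the stored line.
  default { $firstLine = '' }
}
```

Here $pairs should hold just the two case1 lines: the unrelated line hits the default branch, which clears $firstLine before the second (AFT) BLK line is reached.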