Find multiple lines in a TXT/XML file and delete them when they match a condition

Question    Votes: 1    Answers: 3

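I'd like to know whether a simple script can check that several criteria are met and make the necessary modifications to a file.

Here is an example of what I have and what I want to achieve.

I have an XML file with four lines of interest per car: number, year, model, and man.

If <man> is Ford or Dodge, I want no modification at all. If <man> is anything else, I want to check whether <year> and <model> are "NA" and delete the lines that are "NA".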

<?xml version="1.0" encoding="UTF-8"?>
<CarStuff>
    <fileName>CarExpor201217.xml</fileName>
    <numberCars>5</numberCars>
    <ref>2017XY</ref>
    <carExo id="CAR0001_01">
        <dealVen id="CAR0001_02">
            <name>John</name>
            <surname>Smith</surname>
        </dealVen>
        <soldCar id="CAR0001_03">
            <amount>1811.10</amount>
            <lotNumber>1</lotNumber>
            <year>NA</year>             - Line must be removed
            <model>NA</model>           - Line must be removed
            <man>Acura</man>
        </soldCar>
    </carExo>
    <carExo id="CAR0002_01">
        <dealVen id="CAR0002_02">
            <name>John</name>
            <surname>Smith</surname>
        </dealVen>
        <soldCar id="CAR0002_03">
            <amount>1811.10</amount>
            <lotNumber>1</lotNumber>
            <year>NA</year>         - Line must be kept
            <model>NA</model>       - Line must be kept
            <man>Ford</man>
        </soldCar>
    </carExo>
    <carExo id="CAR0003_01">
        <dealVen id="CAR0003_02">
            <name>John</name>
            <surname>Smith</surname>
        </dealVen>
        <soldCar id="CAR0003_03">
            <amount>1811.10</amount>
            <lotNumber>1</lotNumber>
            <year>1997</year>       - Line must be kept
            <model>NA</model>       - Line must be removed
            <man>Bugati</man>
        </soldCar>
    </carExo>
    <carExo id="CAR0004_01">
        <dealVen id="CAR0004_02">
            <name>John</name>
            <surname>Smith</surname>
        </dealVen>
        <soldCar id="CAR0004_03">
            <amount>1811.10</amount>
            <lotNumber>1</lotNumber>
            <year>1997</year>       - Line must be kept
            <model>NA</model>       - Line must be kept
            <man>Dodge</man>
        </soldCar>
    </carExo>
    <carExo id="CAR0005_01">
        <dealVen id="CAR0005_02">
            <name>John</name>
            <surname>Smith</surname>
        </dealVen>
        <soldCar id="CAR0005_03">
            <amount>1811.10</amount>
            <lotNumber>2</lotNumber>
            <year>NA</year>         - Line must be kept
            <model>Charger</model>  - Line must be kept
            <man>Dodge</man>
        </soldCar>
    </carExo>
    <carExo id="CAR0005_01">
        <dealVen id="CAR0005_02">
            <name>John</name>
            <surname>Smith</surname>
        </dealVen>
        <soldCar id="CAR0005_03">
            <amount>1811.10</amount>
            <lotNumber>3</lotNumber>
            <year>NA</year>         - Line must be removed
            <model>Dot</model>      - Line must be kept
            <man>Datsun</man>
        </soldCar>
    </carExo>
</CarStuff>

Thanks for all comments and ideas.

xml vba notepad++ xmldom
3 Answers
0
votes

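Solution via XMLDOM

You can use XMLDOM and XPath to search a so-called NodeList for <man> tags that do not contain the string Dodge or Ford, then check all of their sibling nodes for the text "NA" and delete those. The code below uses late binding. By the way, the XML in your OP was not well-formed (closing tag </carStuf> instead of </carStuff>), so I added a small parse-error routine that checks for this on load.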

Option Explicit

Sub checkNA()
Dim xDoc        As Object    ' xml document
Dim noli As Object, noli2 As Object   ' node lists (declare each one, or the first becomes a Variant)
Dim no As Object, no2 As Object       ' nodes
Dim noMan       As Object    ' node <man> to check if no Dodge or Ford
Dim s           As String
Dim sFile       As String    ' xml file name

  sFile = ThisWorkbook.Path & "\xml\na_test.xml"  ' <<< change to your xml file name

' late binding xml
  Set xDoc = CreateObject("MSXML2.DOMDocument.6.0")
  xDoc.async = False: xDoc.validateOnParse = False
  xDoc.setProperty "SelectionLanguage", "XPath"
' load xml
  If xDoc.Load(sFile) Then
    Debug.Print "Loaded successfully"
  Else
    Dim xPE        As Object    ' Set xPE = CreateObject("MSXML2.IXMLDOMParseError")
    Dim strErrText As String
    Set xPE = xDoc.parseError
    With xPE
    strErrText = "Load error " & .ErrorCode & " xml file " & vbCrLf & _
                 Replace(.URL, "file:///", "") & vbCrLf & vbCrLf & _
                 xPE.reason & _
                 "Source Text: " & .srcText & vbCrLf & vbCrLf & _
                 "Line No.:    " & .Line & vbCrLf & _
                 "Line Pos.: " & .linepos & vbCrLf & _
                "File Pos.:  " & .filepos & vbCrLf & vbCrLf
    End With
    MsgBox strErrText, vbExclamation
    Set xPE = Nothing
    Exit Sub
  End If

' check items
  s = "carExo/soldCar"
  Set noli = xDoc.DocumentElement.SelectNodes(s)
  For Each no In noli
      Set noMan = no.SelectSingleNode("man")
      If Not noMan Is Nothing Then
         If InStr(".Ford.Dodge.", "." & noMan.Text & ".") = 0 Then   ' delimiters on both sides, so "ord" cannot match "Ford"
            Debug.Print "delete", noMan.Text
            ' delete all subtags containing "NA" as text
            Set noli2 = no.SelectNodes("*")
            For Each no2 In noli2
                If no2.Text = "NA" Then
                   ' delete item
                     Debug.Print , no2.nodename & "=" & no2.Text
                     no2.ParentNode.RemoveChild no2
                End If
            Next no2

         Else
            ' Debug.Print "keep", noman.Text
         End If
      End If
  Next no

' save
  ' Debug.Print xDoc.XML
  xDoc.Save sFile      
' close
  Set xDoc = Nothing
End Sub

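Edit 12/29 - Addendum

I added a second working version of the ' check items section that uses some additional XPath. This alternative avoids two If conditions in the plain code, because the XPath expressions already narrow down the nodes returned in the two node lists.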

' check items
  s = "carExo/soldCar[man!='Ford'][man!='Dodge']"   ' << (1) added condition to XPath
  Set noli = xDoc.DocumentElement.SelectNodes(s)
  For Each no In noli
      Set noMan = no.SelectSingleNode("man")
      If Not noMan Is Nothing Then
         Debug.Print "delete", noMan.Text
       ' delete all subtags containing "NA" as text
         Set noli2 = no.SelectNodes("*[.='NA']")    ' << (2)added condition to XPath
         For Each no2 In noli2
           ' delete item
             Debug.Print , no2.nodename & "=" & no2.Text
             no2.ParentNode.RemoveChild no2
         Next no2
      End If
  Next no
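
For readers who want to try the same two-step selection outside VBA, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. It is not part of the original answer: the names strip_na, KEEP_MAKERS and SAMPLE are hypothetical, the sample is a cut-down version of the question's file, and trimming whitespace before comparing with "NA" is an assumption.

```python
import xml.etree.ElementTree as ET

KEEP_MAKERS = {"Ford", "Dodge"}  # hypothetical name: makers whose records stay untouched


def strip_na(root):
    """Remove children whose text is 'NA' from every soldCar whose <man>
    exists and is not in KEEP_MAKERS. Returns the number of removed elements."""
    removed = 0
    for sold in root.findall("carExo/soldCar"):
        man = sold.find("man")
        if man is None or (man.text or "").strip() in KEEP_MAKERS:
            continue
        # list(...) takes a snapshot, so removing children does not disturb the loop
        for child in list(sold):
            if (child.text or "").strip() == "NA":
                sold.remove(child)
                removed += 1
    return removed


# Cut-down sample modelled on the question's file (assumption, not the full document)
SAMPLE = """<CarStuff>
  <carExo><soldCar id="a"><year>NA</year><model>NA</model><man>Acura</man></soldCar></carExo>
  <carExo><soldCar id="b"><year>NA</year><model>NA</model><man>Ford</man></soldCar></carExo>
  <carExo><soldCar id="c"><year>1997</year><model>NA</model><man>Bugati</man></soldCar></carExo>
</CarStuff>"""

root = ET.fromstring(SAMPLE)
removed = strip_na(root)
tags = [[c.tag for c in sold] for sold in root.iter("soldCar")]
print(removed, tags)
```

As in the VBA version, only the children are touched, so attributes such as id on <soldCar> stay in place.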

Hint

Of course, many roads lead to Rome; see @Parfait's XSLT approach below.


1
vote

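Simply use XSLT, the special-purpose language designed to transform XML files, which includes removing nodes according to various criteria.

Specifically, the stylesheet below runs the Identity Transform to copy the XML as is, and then excludes nodes according to your model/year/man criteria.

XSLT (save as .xsl, a special .xml file)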

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="soldCar[man != 'Ford' and man != 'Dodge']">
    <xsl:copy>
        <xsl:copy-of select="amount|lotNumber"/>
        <xsl:if test="model != 'NA'">
            <xsl:copy-of select="model"/>
        </xsl:if>
        <xsl:if test="year != 'NA'">
            <xsl:copy-of select="year"/>
        </xsl:if>
        <xsl:copy-of select="man"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

VBA

Public Sub RunXSLT()
    Dim strFile As String, strPath As String
    ' REFERENCE MS XML, v6.0
    Dim xmlDoc As New MSXML2.DOMDocument60, xslDoc As New MSXML2.DOMDocument60 
    Dim newDoc As New MSXML2.DOMDocument60

    ' LOAD XML SOURCE
    xmlDoc.Load "C:\Path\To\Input.xml"

    ' LOAD XSL SOURCE
    xslDoc.Load "C:\Path\To\XSLT\Script.xsl"

    ' TRANSFORM SOURCE
    xmlDoc.transformNodeToObject xslDoc, newDoc
    newDoc.Save "C:\Path\To\Output.xml"

    ' RELEASE DOM OBJECTS
    Set xmlDoc = Nothing: Set xslDoc = Nothing: Set newDoc = Nothing
End Sub

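Output

Note that the rewritten <soldCar> elements below have lost their id attribute: the second template's <xsl:copy> copies the element itself but not its attributes, and only child elements are copied afterwards. Adding <xsl:copy-of select="@*"/> as the first instruction inside that <xsl:copy> should preserve it.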

<?xml version="1.0" encoding="utf-8"?>
<CarStuff>
  <fileName>CarExpor201217.xml</fileName>
  <numberCars>5</numberCars>
  <ref>2017XY</ref>
  <carExo id="CAR0001_01">
    <dealVen id="CAR0001_02">
      <name>John</name>
      <surname>Smith</surname>
    </dealVen>
    <soldCar>
      <amount>1811.10</amount>
      <lotNumber>1</lotNumber>
      <man>Acura</man>
    </soldCar>
  </carExo>
  <carExo id="CAR0002_01">
    <dealVen id="CAR0002_02">
      <name>John</name>
      <surname>Smith</surname>
    </dealVen>
    <soldCar id="CAR0002_03">
      <amount>1811.10</amount>
      <lotNumber>1</lotNumber>
      <year>NA</year>
      <model>NA</model>
      <man>Ford</man>
    </soldCar>
  </carExo>
  <carExo id="CAR0003_01">
    <dealVen id="CAR0003_02">
      <name>John</name>
      <surname>Smith</surname>
    </dealVen>
    <soldCar>
      <amount>1811.10</amount>
      <lotNumber>1</lotNumber>
      <year>1997</year>
      <man>Bugati</man>
    </soldCar>
  </carExo>
  <carExo id="CAR0004_01">
    <dealVen id="CAR0004_02">
      <name>John</name>
      <surname>Smith</surname>
    </dealVen>
    <soldCar id="CAR0004_03">
      <amount>1811.10</amount>
      <lotNumber>1</lotNumber>
      <year>1997</year>
      <model>NA</model>
      <man>Dodge</man>
    </soldCar>
  </carExo>
  <carExo id="CAR0005_01">
    <dealVen id="CAR0005_02">
      <name>John</name>
      <surname>Smith</surname>
    </dealVen>
    <soldCar id="CAR0005_03">
      <amount>1811.10</amount>
      <lotNumber>2</lotNumber>
      <year>NA</year>
      <model>Charger</model>
      <man>Dodge</man>
    </soldCar>
  </carExo>
  <carExo id="CAR0005_01">
    <dealVen id="CAR0005_02">
      <name>John</name>
      <surname>Smith</surname>
    </dealVen>
    <soldCar>
      <amount>1811.10</amount>
      <lotNumber>3</lotNumber>
      <model>Dot</model>
      <man>Datsun</man>
    </soldCar>
  </carExo>
</CarStuff>

0
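votes

It sounds like you need to delete all lines that contain >NA<.

This isn't really a programming question (so it's off-topic), but here is a quick answer using Notepad++:

  • Press Ctrl + H to open the Find/Replace dialog.
  • In the Find what: text box, enter the regular expression .*>NA<.*\r?\n (the \r is optional, in case the file does not have Windows line endings).
  • Leave the Replace with: text box empty.
  • Make sure the Regular expression radio button is selected in the Search Mode area.
  • Click Replace All and voilà! All lines containing >NA< have been deleted.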
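
To see what this pattern does before running it on a real file, here is a minimal sketch in Python (standard-library re module; the sample text and the names NA_LINE, text and cleaned are illustrative assumptions). It also shows that the pattern never looks at <man>, so "NA" lines inside Ford or Dodge records are removed as well, which the question wanted to keep.

```python
import re

# Same pattern as typed into the Notepad++ dialog; \r?\n also matches Unix line endings
NA_LINE = re.compile(r".*>NA<.*\r?\n")

# Illustrative record with Windows line endings (assumption, not the full file)
text = (
    "<soldCar>\r\n"
    "  <year>NA</year>\r\n"
    "  <model>NA</model>\r\n"
    "  <man>Ford</man>\r\n"
    "</soldCar>\r\n"
)

# Every line containing >NA< is dropped, regardless of the <man> value
cleaned = NA_LINE.sub("", text)
print(cleaned)
```

If the Ford/Dodge exception matters, the XMLDOM or XSLT answers are the safer route, since a line-based regex cannot see the sibling <man> element.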


(Answer adapted from this.)
