How do I check whether an XML file contains consecutive nodes?

Question    Votes: 3    Answers: 3

I have some XML files that look like this:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="jats-html.xsl"?>
<!--<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with OASIS Tables v1.0 20120330//EN" "JATS-journalpublishing-oasis-article1.dtd">-->
<article article-type="proceedings" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://www.niso.org/standards/z39-96/ns/oasis-exchange/table">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id" />
<journal-title-group>
<journal-title>Eleventh &#x0026; Tenth International Conference on Correlation Optics</journal-title>
</journal-title-group>
<issn pub-type="epub">0277-786X</issn>
<publisher>
<publisher-name>Springer</publisher-name>
</publisher>
</journal-meta>
<fig-count count="0" />
<table-count count="0" />
<equation-count count="0" />
</front>
<body>
<sec id="s1">
<label>a.</label>
<title>INTRODUCTION</title>
<p>One of approaches of solving<xref ref-type="bibr" rid="ref11">[11]</xref>, <xref ref-type="bibr" rid="ref13">[13]</xref>, <xref ref-type="bibr" rid="ref8">[8]</xref> the problem <xref ref-type="bibr" rid="ref1">[1]</xref>, <xref ref-type="bibr" rid="ref5">[2]</xref>, <xref ref-type="bibr" rid="ref6">[6]</xref> <xref ref-type="bibr" rid="ref7">[6]</xref> of light propagation in scattering media is the method of Monte Carlo statistical simulation<sup><xref ref-type="bibr" rid="c1">1</xref>–<xref ref-type="bibr" rid="c5">5</xref></sup>. It is a set of techniques that allow us to find the necessary solutions by repetitive random sampling. Estimates of the unknown quantities are statistical means.</p>
<p>For the case of radiation transport in scattering <xref ref-type="bibr" rid="ref6">6</xref> <xref ref-type="bibr" rid="ref8">8</xref> <xref ref-type="bibr" rid="ref9">9</xref> <xref ref-type="bibr" rid="ref10">10</xref> medium Monte Carlo method consists in repeated calculation of the trajectory <xref ref-type="bibr" rid="ref7">6</xref> <xref ref-type="bibr" rid="ref7">7</xref> <xref ref-type="bibr" rid="ref8">8</xref> <xref ref-type="bibr" rid="ref9">[9]</xref> of a photon in a medium based on defined environment parameters. Application of Monte Carlo method is based on the use of macroscopic optical properties of the medium which are considered homogeneous within small volumes of tissue. Models that are based on this method can be divided into two types: models that take into account the polarization of the radiation, and models that ignore it.</p>
<p>Simulation that is based on the previous models usually discards the details of the radiation energy distribution within a single scattering particle. This disadvantage can be ruled out (in the case of scattering particles whose size exceeds the wavelength) by using another method - reverse ray tracing. This method is like the one mentioned before on is based on passing a large number of photons through a medium that is simulated. The difference is that now each scattering particle has a certain geometric topology and scattering is now calculated using the Fresnel equations. The disadvantage of this method is that it can give reliable results only if the particle size is much greater than the wavelength (at least an order of magnitude).</p>
</sec>
</body>
</article>

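They contain link nodes of the form <xref ref-type="bibr" rid="ref...">...</xref>. How can I find out whether a file contains 3 or more consecutive link nodes (separated by a comma and a space, or by just a space) and output them to a txt file?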

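I could do a regex search like (?:<xref type="bibr" rid="ref\d+">\[\d+\]</xref>\s*,\s*){2,}<xref type="bibr" rid="ref\d+">\[\d+\]</xref>, which would find 3 or more link nodes separated by ", " or " ", but the nodes it finds would not necessarily have consecutive ids. How should I go about this?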

c# xml
3 Answers
2
votes

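To match your requirements, here is my solution to the problem. I have not thoroughly tested it for duplicates, i.e. some results may just be a subset of a previous result, but those should be easy to sort out.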

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Text.RegularExpressions;


public static void Main(string[] args)
{
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.Load("article.xml");

    //only selects <p>'s that already have 3 or more refs. No need to check paragraphs that don't even have enough refs
    XmlNodeList nodes = doc.DocumentElement.SelectNodes("//*[count(xref[@ref-type='bibr' and starts-with(@rid,'ref')])>2]");

    List<string> results = new List<string>();

    //Foreach <p>
    foreach (XmlNode x in nodes)
    {
        XmlNodeList xrefs = x.SelectNodes(".//xref[@ref-type='bibr' and starts-with(@rid,'ref')]");
        List<StartEnd> startEndOfEachTag = new List<StartEnd>(); // we mark the start and end of each ref.
        string temp = x.OuterXml; //the paragraph we're checking

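        //find the start and end offset of each xref tag inside the paragraph
        int searchFrom = 0; //search forward so that identical xrefs get distinct positions
        foreach (XmlNode xN in xrefs)
        {
            int start = temp.IndexOf(xN.OuterXml, searchFrom, StringComparison.Ordinal);
            startEndOfEachTag.Add(new StartEnd(start, start + xN.OuterXml.Length));
            searchFrom = start + xN.OuterXml.Length;
        }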

        /* This comment shows the regex command used and how we build the regular expression we are checking with.
        string regexTester = Regex.Escape("<xref ref-type=\"bibr\" rid=\"ref2\">2</xref>")+"([ ]|(, ))" + Regex.Escape("<xref ref-type=\"bibr\" rid=\"ref3\">3</xref>");
        Match matchTemp = Regex.Match("<xref ref-type=\"bibr\" rid=\"ref2\">2</xref> <xref ref-type=\"bibr\" rid=\"ref3\">3</xref>", regexTester);
        Console.WriteLine(matchTemp.Value);*/

        //we go through all the xrefs
        for (int i=0; i<xrefs.Count; i++)
        {
            int newIterator = i; //This iterator prevents us from creating duplicates.
            string regCompare = Regex.Escape(xrefs[i].OuterXml); // The start xref

            int count = 1; //we got one xref to start with we need at least 3
            string tempRes = ""; //the string we store the result in

            int consecutive = Int32.Parse(xrefs[i].Attributes["rid"].Value.Substring(3));

            for (int j=i+1; j<xrefs.Count; j++) //we check with the other xrefs to see if they follow immediately after.
            {
                if(consecutive == Int32.Parse(xrefs[j].Attributes["rid"].Value.Substring(3)) - 1)
                {
                    consecutive++;
                }
                else { break; }

                regCompare += "([ ]|(, ))" + Regex.Escape(xrefs[j].OuterXml); //we check that the xref comes exactly after a space or a comma and space

                Match matchReg;

                try
                {
                    matchReg = Regex.Match(temp.Substring(startEndOfEachTag[i].start, startEndOfEachTag[j].end - startEndOfEachTag[i].start),
                        regCompare); //we get the result
                }
                catch
                {
                    i = j; // we failed and i should start from here now.
                    break;
                }

                if (matchReg.Success){
                    count++; //it was a success so we increment the number of xrefs we matched
                    tempRes = matchReg.Value; // we store it as our temporary result.
                    newIterator = j; //update where i should start from next time.
                }
                else {
                    i = j; // we failed and i should start from here now.
                    break;
                }
            }
            i = newIterator;
            if (count > 2)
            {
                results.Add(tempRes); 
            }
        }
    }
    Console.WriteLine("Results: ");
    foreach(string s in results)
    {
        Console.WriteLine(s + "\n");
    }

    Console.ReadKey();
}

The missing class:

class StartEnd
{
    public int start=-1;
    public int end = -1;

    public StartEnd(int start, int end)
    {
        this.start = start;
        this.end = end;
    }
}

1
vote

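My XPath is a bit rusty, and I'm sure you can write a better XPath than the one below. A better XPath would select only the nodes that have 3 or more child nodes of type bibr whose rid starts with ref. Anyway, here is my solution for getting the nodes you need.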

public static void Main(string[] args)
{
    XmlDocument doc = new XmlDocument();
    doc.Load("article.xml");

    XmlNodeList nodes = doc.DocumentElement.SelectNodes("//xref[@ref-type='bibr' and starts-with(@rid,'ref')]/parent::*");

    foreach(XmlNode x in nodes)
    {
        XmlNodeList temp = x.SelectNodes("xref[@ref-type='bibr' and starts-with(@rid,'ref')]"); //relative path: count only this node's own xrefs, not every xref in the document
        //we only select those that have 3 or more references.
        if (temp.Count >= 3)
        {
            Console.WriteLine(x.InnerText);
        }
    }

    Console.ReadKey();

}

Edit: I played around a bit, and the code below has an updated XPath that should get everything you want.

public static void Main(string[] args)
{
    XmlDocument doc = new XmlDocument();
    doc.Load("article.xml");

    XmlNodeList nodes = doc.DocumentElement.SelectNodes("//*[count(xref[@ref-type='bibr' and starts-with(@rid,'ref')])>2]");

    foreach(XmlNode x in nodes){
        Console.WriteLine(x.InnerText);
    }

    Console.ReadKey();

}
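
One detail that matters when calling SelectNodes on an individual node: a path that starts with // is evaluated from the document root, so it counts every xref in the file, while a relative path such as xref counts only that node's own children. The following is a minimal sketch of the difference; the class name XPathScopeDemo and the tiny inline document are made up for illustration.

```csharp
using System;
using System.Xml;

public static class XPathScopeDemo
{
    // Counts xref nodes seen from the first <p>, once with an absolute
    // path and once with a relative one.
    public static (int absolute, int relative) Count(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode firstP = doc.SelectSingleNode("//p");
        int absolute = firstP.SelectNodes("//xref").Count; // every xref in the document
        int relative = firstP.SelectNodes("xref").Count;   // only direct children of firstP
        return (absolute, relative);
    }

    public static void Main()
    {
        var (a, r) = Count("<body><p><xref/></p><p><xref/><xref/><xref/></p></body>");
        Console.WriteLine($"absolute: {a}, relative: {r}"); // prints "absolute: 4, relative: 1"
    }
}
```

Note also that a count(...) > 2 predicate only tells you that a paragraph holds at least three references; it says nothing about whether they are adjacent or have consecutive ids.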

1
vote

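Regular expressions are not a good fit for hierarchical syntax. I would write C# code that reads the XML and keeps track of the number of consecutive xref nodes separated only by ", " or " ".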

  using System;
  using System.Linq;
  using System.Text;

  static void Main(string[] args)
  {
     using (var xmlStream = System.Reflection.Assembly.GetExecutingAssembly().GetManifestResourceStream("ConsoleApp1.XMLFile1.xml"))
     {
        int state = 0; // 0 = Look for xref; 1 = look for separator
        string[] simpleSeparators = { " ", ", " };
        string rid = "0";
        StringBuilder nodeText = new StringBuilder();
        string[] consecutiveNodes = new string[3];

        System.Xml.XmlReaderSettings settings = new System.Xml.XmlReaderSettings();
        settings.DtdProcessing = System.Xml.DtdProcessing.Ignore;
        using (var reader = System.Xml.XmlReader.Create(xmlStream, settings))
        {
           while (reader.Read())
           {
              if (reader.IsStartElement("xref"))
              {
                 nodeText.Append("<xref");
                 if (reader.HasAttributes)
                 {
                    while (reader.MoveToNextAttribute())
                       nodeText.AppendFormat(" {0}=\"{1}\"", reader.Name, reader.Value);
                 }
                 nodeText.Append(">");
                 string nextRid = reader.GetAttribute("rid");
                 switch (state)
                 {
                    case 0:
                       break;
                    case 2:
                    case 4:
                       if (Math.Abs(GetIndex(nextRid) - GetIndex(rid)) > 1)
                          state = 0;
                       break;
                 }
                 state++;
                 rid = nextRid;
              }
              else if (reader.NodeType == System.Xml.XmlNodeType.Text)
              {
                 if (state > 0)
                    nodeText.Append(reader.Value);
                 if ((state % 2 == 1) && simpleSeparators.Contains(reader.Value))
                       state++;
              }
              else if ((reader.NodeType == System.Xml.XmlNodeType.EndElement) && (state > 0))
              {
                 nodeText.AppendFormat("</{0}>", reader.Name);
                 consecutiveNodes[state / 2] = nodeText.ToString();
                 nodeText.Clear();
                 if (state > 3)
                 {
                    Console.WriteLine("{0}{1}{2}", consecutiveNodes[0], consecutiveNodes[1], consecutiveNodes[2]);
                    state = 0;
                 }
              }
              else if (reader.IsStartElement())
              {
                 nodeText.Clear();
                 state = 0;
              }
           }
        }
     }
  }

  static int GetIndex(string rid)
  {
     int start = rid.Length;
     while ((start > 0) && Char.IsDigit(rid, --start)) ;

     start++;
     if (start < rid.Length)
        return int.Parse(rid.Substring(start));
     return 0;
  }

Running this on your sample data outputs:

<xref ref-type="bibr" rid="ref2">[2]</xref>, <xref ref-type="bibr" rid="ref3">[3]</xref>, <xref ref-type="bibr" rid="ref4">[4]</xref>
<xref ref-type="bibr" rid="rid6">6</xref><xref ref-type="bibr" rid="rid6">9</xref><xref ref-type="bibr" rid="rid6">10</xref>

I updated the code to exclude:

<xref ref-type="bibr" rid="ref11">[11]</xref>, <xref ref-type="bibr" rid="ref13">[13]</xref>, <xref ref-type="bibr" rid="ref8">[8]</xref>

because ref11, ref13 and ref8 are not consecutive ids, which is what your question asks for.
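
For comparison, the same idea (walk the nodes in order and track runs of xrefs whose ids increase by one and that are separated only by " " or ", ") can be sketched with LINQ to XML. This is an illustrative alternative, not the code from any answer above. It assumes that every rid has the form ref followed by a number, as in the sample, and the names ConsecutiveXrefs, Find and consecutive-xrefs.txt are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class ConsecutiveXrefs
{
    // True for <xref ref-type="bibr" rid="refN">; n receives N.
    // The "ref" + number id scheme is an assumption taken from the sample file.
    static bool TryGetRefNumber(XElement e, out int n)
    {
        n = 0;
        string rid = (string)e.Attribute("rid") ?? "";
        return e.Name.LocalName == "xref"
            && (string)e.Attribute("ref-type") == "bibr"
            && rid.StartsWith("ref", StringComparison.Ordinal)
            && int.TryParse(rid.Substring(3), out n);
    }

    // Returns each run of minRun or more xrefs with ids increasing by one,
    // separated only by " " or ", ", serialized as it appears in the file.
    public static List<string> Find(XElement root, int minRun = 3)
    {
        var results = new List<string>();
        foreach (XElement parent in root.DescendantsAndSelf())
        {
            var run = new List<string>(); // xrefs and separators of the current run
            int count = 0;                // number of xrefs in the current run
            int last = 0;                 // id number of the most recent xref
            string pendingSep = null;     // separator seen since that xref, if any

            void Flush()
            {
                if (count >= minRun) results.Add(string.Concat(run));
                run.Clear();
                count = 0;
                pendingSep = null;
            }

            foreach (XNode node in parent.Nodes())
            {
                if (node is XElement e && TryGetRefNumber(e, out int n))
                {
                    if (count > 0 && pendingSep != null && n == last + 1)
                        run.Add(pendingSep); // the run continues
                    else
                        Flush();             // the run is broken; start a new one here
                    run.Add(e.ToString(SaveOptions.DisableFormatting));
                    count++;
                    last = n;
                    pendingSep = null;
                }
                else if (node is XText t && count > 0 && pendingSep == null
                         && (t.Value == " " || t.Value == ", "))
                {
                    pendingSep = t.Value;    // a valid separator after an xref
                }
                else
                {
                    Flush();                 // any other text or element ends the run
                }
            }
            Flush();
        }
        return results;
    }

    public static void Main(string[] args)
    {
        // PreserveWhitespace keeps the single-space text nodes between xrefs.
        string path = args.Length > 0 ? args[0] : "article.xml";
        XDocument doc = XDocument.Load(path, LoadOptions.PreserveWhitespace);
        System.IO.File.WriteAllLines("consecutive-xrefs.txt", Find(doc.Root));
    }
}
```

Loading with LoadOptions.PreserveWhitespace is the important design choice here: without it, the whitespace-only text between two xrefs is dropped and space-separated runs can no longer be told apart from xrefs with nothing between them.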
