Jsoup selector: second div after an h2

Question

I have the following HTML:

<html>
<body>

...

<h2> Blah Blah 1</h2>
<p>blah blah</p>
<div>
    <div>
        <table>
            <tbody>
                <tr><th>Col 1 Header</th><th>Col 2 Header</th></tr>
                <tr><td>Line 1.1 Value</td><td>Line 2.1 Header</td></tr>
                <tr><td>Line 2.1 Value</td><td>Line 2.2 Value</td></tr>
            </tbody>
        </table>
    </div>
</div>
<div>
    <div>
        <table>
            <tbody>
                <tr><th>Col 1 Header T2</th><th>Col 2 Header T2</th></tr>
                <tr><td>Line 1.1 Value T2</td><td>Line 2.1 Header T2</td></tr>
                <tr><td>Line 2.1 Value T2</td><td>Line 2.2 Value T2</td></tr>
            </tbody>
        </table>
    </div>
</div>

<h2> Blah Blah 2</h2>

<div>
    <div>
        <table>
            <tbody>
                <tr><th>XCol 1 Header</th><th>XCol 2 Header</th></tr>
                <tr><td>XLine 1.1 Value</td><td>XLine 2.1 Header</td></tr>
                <tr><td>XLine 2.1 Value</td><td>XLine 2.2 Value</td></tr>
            </tbody>
        </table>
    </div>
</div>
<p>blah blah</p>
<div>
    <div>
        <table>
            <tbody>
                <tr><th>XCol 1 Header T2</th><th>XCol 2 Header T2</th></tr>
                <tr><td>XLine 1.1 Value T2</td><td>XLine 2.1 Header T2</td></tr>
                <tr><td>XLine 2.1 Value T2</td><td>XLine 2.2 Value T2</td></tr>
            </tbody>
        </table>
    </div>
</div>

</body>
</html>

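I want to extract the second DIV after the h2 tag that contains a given text.

Note that the p tag is not in the same position in the two sections: after the first h2 it comes before both divs, and after the second h2 it sits between them.

To extract the DIV after the first h2, the following selector works:

h2:contains(Blah 1) + p + div + div

But to extract the second one, simply replacing "Blah 1" with "Blah 2" does not work, because the p tag is in a different position, so the static selector would be:

h2:contains(Blah 2) + div + p + div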

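What I need is a selector that keeps working when only the text changes, wherever the p block is located.

I tried several things, for example the nth-of-type selector, but that does not work either, because I only know the position of the DIV relative to the h2, which is not the DIV's parent but a preceding sibling.

Please help.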
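
One possible direction, offered as a sketch rather than a tested Jsoup answer: the `+` combinator only matches the immediately adjacent sibling, which is why the position of the p matters. Selecting by "following sibling" instead ignores whatever sits in between. In CSS selector syntax that is the general sibling combinator `~`, which Jsoup's selector documentation lists as `E ~ F`, so taking the second element matched by `h2:contains(Blah 2) ~ div` may be worth trying (not verified here). The example below shows the same idea with only JDK classes, using the XPath `following-sibling` axis in place of Jsoup. It assumes the markup is well-formed XML, and the class name, the helper `secondDivAfter`, and the shortened HTML string are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class SecondDivAfterH2 {
    // Shortened copy of the question's markup (assumption: one header cell
    // per table is enough to tell the four tables apart).
    static final String HTML =
        "<html><body>"
      + "<h2> Blah Blah 1</h2><p>blah blah</p>"
      + "<div><div><table><tbody><tr><th>Col 1 Header</th></tr></tbody></table></div></div>"
      + "<div><div><table><tbody><tr><th>Col 1 Header T2</th></tr></tbody></table></div></div>"
      + "<h2> Blah Blah 2</h2>"
      + "<div><div><table><tbody><tr><th>XCol 1 Header</th></tr></tbody></table></div></div>"
      + "<p>blah blah</p>"
      + "<div><div><table><tbody><tr><th>XCol 1 Header T2</th></tr></tbody></table></div></div>"
      + "</body></html>";

    // Hypothetical helper: returns the second <div> that is a following
    // sibling of the <h2> whose text contains headingText, or null if none.
    // headingText must not contain a single quote, since it is pasted into
    // the XPath expression as a string literal.
    static Element secondDivAfter(String headingText) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(HTML)));
        // following-sibling::div[2] skips any <p> or other non-div siblings.
        String expr = "//h2[contains(., '" + headingText + "')]/following-sibling::div[2]";
        return (Element) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // The same expression works for both headings; only the text changes.
        System.out.println(secondDivAfter("Blah 1").getTextContent());
        System.out.println(secondDivAfter("Blah 2").getTextContent());
    }
}
```

With the shortened markup, both calls return the "T2" table of their section, even though the p sits before the divs in one section and between them in the other.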


Here is a simple answer, but it does not use the JSoup library. You can try a different parser, updater, and so on (only if you are willing to).

import Torello.HTML.*;
import Torello.HTML.NodeSearch.*;
import Torello.Java.FileRW;

import java.util.*;

public class SO_01_31
{
    private static final String tokenToFind = "Blah Blah 2";

    public static void main(String[] argv) throws java.io.IOException
    {
        String              html        = FileRW.loadFileToString("example.html");
            // Load the HTML provided in the HTML-Text above in your SO Question.

        Vector<HTMLNode>    page        = HTMLPage.getPageTokens(html, false);
            // Parse the HTML into an "HTML Vector"  Each node will be an HTML Tag, or a TextNode (no comment-nodes here)

        Vector<SubSection>  h2List      = TagNodePeekInclusive.all(page, "h2");
            // Return each <H2>...</H2> found on this page.  Return results have the nodes, and the vector-indexes, together,
            // in an instance of "SubSection"

        int                 foundPos    = -1;
            // We are looking for a sub-vector (SubSection) that looks like: <H2> 'tokenToFind' </H2>

        HTMLNode            n;
            // Temp Variable

            // Iterate through each of the "<H2>...</H2>" subsections that were returned, above, by TagNodePeekInclusive
        for (SubSection s : h2List) 
                // Make sure that the first node after the opening "<H2>" tag is, indeed, a TextNode
            if ((n = s.html.elementAt(1)) instanceof TextNode) 
                    // Make sure that this TextNode (the one between the <H2>...</H2>) has the string "tokenToFind"
                if (((TextNode) n).str.contains(tokenToFind))
                        // Record this position.
                    { foundPos = s.location.start + 1;  break; }

        // Exit if no matching <H2> was found.
        if (foundPos == -1)
        {
            System.out.println("The specified H2 Title String-Token wasn't found on your page... Exiting.");
            System.exit(0);
        }

        // Return the second opening <DIV> ... </DIV> subsection that was found.
        Vector<HTMLNode> divToFind = TagNodeGetInclusive.nth(page, 2, foundPos, -1, "div");

        // Print it out
        System.out.println(Util.pageToString(divToFind));
    }
}

The documented code above prints the following output to a UNIX terminal:

@cloudshell:~$ java SO_01_31 
<div>
        <table>
            <tbody>
                <tr><th>XCol 1 Header</th><th>XCol 2 Header</th></tr>
                <tr><td>XLine 1.1 Value</td><td>XLine 2.1 Header</td></tr>
                <tr><td>XLine 2.1 Value</td><td>XLine 2.2 Value</td></tr>
            </tbody>
        </table>
    </div>
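
One thing to watch in the output above: it shows the "XCol 1 Header" table, which is the inner div of the first block after the heading. Counting opening div tags from the heading onward appears to count nested divs too, whereas the question's `+ div + p + div` selector targets the second sibling div (the "T2" table). The sketch below contrasts the two ways of counting using only JDK DOM classes. It is not the library used in this answer, and the class and method names are invented for illustration. It assumes well-formed markup and uses a shortened copy of the "Blah Blah 2" section.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SiblingDivWalk {
    // Shortened copy of the "Blah Blah 2" section from the question.
    static final String HTML =
        "<body><h2> Blah Blah 2</h2>"
      + "<div><div><table><tbody><tr><th>XCol 1 Header</th></tr></tbody></table></div></div>"
      + "<p>blah blah</p>"
      + "<div><div><table><tbody><tr><th>XCol 1 Header T2</th></tr></tbody></table></div></div>"
      + "</body>";

    static Document parse() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(HTML)));
    }

    // Returns the first <h2> whose text contains the token, or null.
    static Element findHeading(Document doc, String token) {
        NodeList h2s = doc.getElementsByTagName("h2");
        for (int i = 0; i < h2s.getLength(); i++) {
            Element h2 = (Element) h2s.item(i);
            if (h2.getTextContent().contains(token)) return h2;
        }
        return null;
    }

    // Counts every <div> that opens after the heading in document order,
    // so nested divs are counted as well.
    static Element nthDivInDocumentOrder(Document doc, Element h2, int n) {
        NodeList all = doc.getElementsByTagName("*");
        boolean passedHeading = false;
        int seen = 0;
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.isSameNode(h2)) { passedHeading = true; continue; }
            if (passedHeading && e.getTagName().equals("div") && ++seen == n) return e;
        }
        return null;
    }

    // Counts only <div> elements that are siblings of the heading,
    // skipping text nodes, <p>, and anything nested inside a sibling.
    static Element nthDivSibling(Element h2, int n) {
        int seen = 0;
        for (Node s = h2.getNextSibling(); s != null; s = s.getNextSibling()) {
            if (s instanceof Element && ((Element) s).getTagName().equals("div") && ++seen == n) {
                return (Element) s;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse();
        Element h2 = findHeading(doc, "Blah Blah 2");
        System.out.println(nthDivInDocumentOrder(doc, h2, 2).getTextContent());
        System.out.println(nthDivSibling(h2, 2).getTextContent());
    }
}
```

Counting in document order lands on the nested div of the first block, while the sibling walk reaches the second top-level div regardless of where the p sits.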
css-selectors jsoup

© www.soinside.com 2019 - 2024. All rights reserved.