How do I fix XPath not finding elements on a web page?

Question  Votes: 0  Answers: 1

I am using the following Bash script:


# Define variables for the URL and browser
sGDomain="idealista"
sGCitta="fucecchio-firenze"
sGTypo="vendita-case"
iGPagina=1

# Start of the loop
while :; do

    # Build the URL with the iGPagina variable
    url="https://www.$sGDomain.it/$sGTypo/$sGCitta/lista-$iGPagina.htm"
    #echo "$url"
    
    # Get the HTML content of the page
    html_content=$(curl -s -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0" "$url")

    echo "$html_content" > htmlcompleto.txt
    
    # Check if the error string is not present in the HTML content
    if [[ ! $html_content =~ "Successiva" ]]; then
        break  # Exit the loop if the error string is not present
    fi
    
    # Use xidel to extract the ads
    xidel_output=$(xidel --silent --xpath '
        //div[contains(@class, "item-info-container")] ! string-join(
            (
                ( "price=" || normalize-space(.//span[contains(@class, "item-price")]/text()[1]) ),
                ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
                ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
                ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
            ),
            codepoints-to-string(9)
        )
    ' -)

    # Check if the temporary file exists and delete it if present
    if [ -f "temp.txt" ]; then
        rm temp.txt
    fi

    # Replace special characters from "desc=" to the end of each line in semi.txt
    echo "$xidel_output" | sed -e "s/desc=\(.*\)\(['\"]\)/desc=\1 /g" > semi.txt

    sed -i 's/\([0-9]\{1,\}\)\.\([0-9]\{1,\}\),[0-9]\{2\}/\1\2/g' semi.txt
    sed -i 's/m²//g' semi.txt

    # Concatenate semi.txt with debugtxt.txt for debugging purposes
    cat semi.txt >> debugtxt.txt

    # Connect to the SQLite database
    db_file="immo.db"
 
    # Loop through the lines and insert them into the SQLite database
    while IFS= read -r line; do
        # Extract price, size, link, and description values from the lines using awk
        prezzo=$(echo "$line" | awk -F 'price=' '{print $2}' | awk -F 'size=' '{print $1}')
        size=$(echo "$line" | awk -F 'size=' '{print $2}' | awk -F 'link=' '{print $1}')
        link=$(echo "$line" | awk -F 'link=' '{print $2}' | awk -F 'desc=' '{print $1}')
        descrizione=$(echo "$line" | awk -F 'desc=' '{print $2}')

        # Determine if the description contains "asta"
        if [[ $descrizione =~ "asta" ]]; then
            asta=1
        else
            asta=0
        fi

        # Insert the data into the SQLite database
        sqlite3 "$db_file" "INSERT INTO $sGDomain (prezzo, link, descrizione, metratura, asta) VALUES ('$prezzo', '$link', '$descrizione', '$size', $asta)"
    done < semi.txt

    # Increment the iGPagina variable for the next iteration
    ((iGPagina++))
done

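The script searches a web page with a specific XPath expression. Although I believe the XPath is correct, the script finds nothing on the page.

Page URL: https://www.idealista.it/vendita-case/fucecchio-firenze/lista-18.htm

XPath expression used: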

xidel_output=$(xidel --silent --xpath '
    //div[contains(@class, "item-info-container")] ! string-join(
        (
            ( "price=" || normalize-space(.//span[contains(@class, "item-price")]/text()[1]) ),
            ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
            ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
            ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
        ),
        codepoints-to-string(9)
    )
' -)

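Expected result: I want to extract the price, listing link, description, and square meters from each listing on the page. I also tried this XPath expression: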

        //div[contains(@class, "items-container items-list")] ! string-join(
            (
                ( "price=" || normalize-space(.//span[contains(@class, "item-price")]/text()[1]) ),
                ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
                ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
                ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
            ),
            codepoints-to-string(9)
        )
    ' -)

That is, with "items-container items-list" as the container class, but it returns nothing either.
bash xpath
1 Answer
0 votes

The relevant part of the HTML looks like this:

...
<div class="item-info-container ">
  <div class="price-row">
    <span class="item-price ">650.000<span>€</span></span>
    <span class="item-parking">Garage/posto auto compreso</span>
  </div>
  <div class="item-detail-char">
    <span class="item-detail">12 locali</span>
    <span class="item-detail">400 m2</span>
  </div>
  <div class="item-description "><p class="ellipsis">
FUCECCHIO ... aria
</p></div>
  ...
</div>
...

You can convert it to your TSV format with:

xidel --xpath '
    //div[contains-token(@class, "item-info-container")] ! string-join(
        (
          "price=" || normalize-space(.//span[contains-token(@class, "item-price")]/text()[1]),
          "size="  || normalize-space(.//span[contains-token(@class, "item-detail") and contains(.,"m2")]),
          "link="  || normalize-space(.//a[contains-token(@class, "item-link")]/@href),
          "desc="  || normalize-space(.//div[contains-token(@class, "item-description")])
        ),
        codepoints-to-string(9)
    )
' htmlcompleto.txt

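Note: when you need to test for a CSS class in XPath 3.1, use contains-token rather than contains. contains does a substring match, so a test for "item-detail" also matches the class "item-detail-char", whereas contains-token only matches a whole whitespace-separated token of the attribute.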
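The difference between the two functions can be emulated in plain Bash, without xidel. This is only a sketch of the matching semantics, not how xidel implements them; the helper names class_contains and class_has_token are made up for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that imitate the two XPath functions on a class attribute.

# Like contains(): plain substring match.
class_contains() {
    [[ $1 == *"$2"* ]]
}

# Like contains-token(): match one whole whitespace-separated token.
# Assumes the attribute value sits on a single line.
class_has_token() {
    local -a tokens
    local t
    read -r -a tokens <<< "$1"
    for t in "${tokens[@]}"; do
        [[ $t == "$2" ]] && return 0
    done
    return 1
}

class_contains  "item-detail-char" "item-detail" && echo "contains: match"
class_has_token "item-detail-char" "item-detail" || echo "contains-token: no match"
class_has_token "item-info-container " "item-info-container" && echo "contains-token: match"
```

The trailing space in "item-info-container " (as it appears in the page's HTML) does not matter for the token test, because tokens are compared after splitting on whitespace.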

Output:

price=650.000    size=400 m2  link=/immobile/2...0/  desc=FUCECCHIO ... aria
price=150.000    size=85 m2   link=/immobile/2...9/  desc=Rif1263 ... condizion
price=75.000     size=60 m2   link=/immobile/1...4/  desc=FUCECCHIO ... 1406.
price=200.000    size=270 m2  link=/immobile/2...1/  desc=RIFCS83 ... accessori
price=60.000     size=120 m2  link=/immobile/2...7/  desc=
price=130.000    size=108 m2  link=/immobile/2...9/  desc=Rif1121 ... condizioni.
...
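Each output line is a tab-separated record of key=value fields, so it can be split in Bash without chained awk calls. A minimal sketch, assuming the four fields always appear in the order price, size, link, desc and that the dot in the price is a thousands separator; the function name parse_listing and the sample line (including its link) are made up for illustration:

```shell
#!/usr/bin/env bash
# parse_listing is a hypothetical helper: it splits one tab-separated line
# into the global variables price, size, link, and desc.
parse_listing() {
    local f_price f_size f_link f_desc
    # IFS is changed only for this read; spaces inside a field are kept.
    IFS=$'\t' read -r f_price f_size f_link f_desc <<< "$1"
    price=${f_price#price=}   # strip the "price=" key
    price=${price//./}        # 650.000 -> 650000 (assumed thousands separator)
    size=${f_size#size=}
    size=${size% m2}          # 400 m2 -> 400
    link=${f_link#link=}
    desc=${f_desc#desc=}      # may be empty
}

# Sample line with a made-up link; real lines come from the xidel output.
parse_listing $'price=650.000\tsize=400 m2\tlink=/immobile/example/\tdesc=FUCECCHIO'
printf '%s|%s|%s|%s\n' "$price" "$size" "$link" "$desc"
```

Because every field carries its key, no field is ever completely empty, so the tab-splitting done by read does not shift columns when a description is missing.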

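Remark: some descriptions come back empty, so you may prefer to extract the short description stored in the link's title attribute instead:

.//a[contains-token(@class, "item-link")]/@title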
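If both values are extracted, the fallback can also be done on the Bash side. A small sketch, assuming the title was captured into a separate variable; pick_description is a hypothetical helper name and the sample strings are invented:

```shell
#!/usr/bin/env bash
# pick_description is a hypothetical helper: print the long description,
# or the link title when the description is empty or only whitespace.
pick_description() {
    local desc=$1 title=$2
    if [[ -z ${desc//[[:space:]]/} ]]; then
        printf '%s\n' "$title"
    else
        printf '%s\n' "$desc"
    fi
}

pick_description "" "Villa in vendita"             # prints: Villa in vendita
pick_description "Rif1121 ..." "Villa in vendita"  # prints: Rif1121 ...
```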