R - Advanced web scraping - bypassing aspNetHidden using xmlTreeParse()

Question

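This question takes a little while to introduce, so please bear with me. If you make it to the end, it should be a fun problem to solve. The scrape will be repeated in a loop over thousands of pages on this site.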

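I am trying to scrape the page http://www.digikey.com/product-detail/en/207314-1/A25077-ND/ and want to capture the data in the table that holds the Digi-Key Part Number, Quantity Available, and so on, including the pricing columns on the right: Price Break, Unit Price, and Extended Price.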

Using the R function readHTMLTable() does not work; it only returns NULL values. The reason (I think) is that the site hides its content using the "aspNetHidden" tag in the HTML code.

As a result, I also ran into trouble with htmlTreeParse() and xmlTreeParse(): the parent node of the whole section does not appear in the results.

Using the R function scrape() from the scrapeR package:

require(scrapeR)

URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")

does return the complete HTML code, including the lines of interest:

<th align="right">Digi-Key Part Number</th>
<td id="reportpartnumber">
<meta itemprop="productID" content="sku:A25077-ND">A25077-ND</td>

<th>Price Break</th>
<th>Unit Price</th>
<th>Extended Price
</th>
</tr>
<tr>
<td align="center">1</td>
<td align="right">2.75000</td>
<td align="right">2.75</td>

However, I cannot select nodes from this block of code; I get the error:

no applicable method for 'xpathApply' applied to an object of class "list"

I get this error with different functions, for example:

xpathSApply(URL,'//*[@id="pricing"]/tbody/tr[2]')

getNodeSet(URL,"//html[@class='rd-product-details-page']")

I am not very familiar with XPath, but I have been using Inspect Element on the web page to identify the XPath and copy it.

Any help you can offer would be greatly appreciated!

Answer 1

Haven't you read the help for scrape? It returns a list, and you need to take an element of that list (if parse=TRUE), and so on.
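To make the list point concrete, here is a minimal offline sketch. It assumes the XML package is installed, and it fakes the return value of scrape() with a one-element list built from a trimmed copy of the HTML quoted in the question; the names pages, res and part_number are mine, not from the original code.

```r
library(XML)

# Stand-in for the result of scrape(): a list holding one parsed document.
# The HTML is a trimmed copy of the fragment quoted in the question.
html <- '<html><body><table>
<tr><th align="right">Digi-Key Part Number</th>
<td id="reportpartnumber">A25077-ND</td></tr>
</table></body></html>'
pages <- list(htmlParse(html, asText = TRUE))

class(pages)       # a plain list: the XPath functions have no method for it
class(pages[[1]])  # the parsed document that the XPath functions expect

# Passing the whole list fails; capture the error message instead of stopping
res <- tryCatch(
  xpathSApply(pages, '//td[@id="reportpartnumber"]', xmlValue),
  error = function(e) conditionMessage(e)
)

# Passing the first element of the list works
part_number <- xpathSApply(pages[[1]], '//td[@id="reportpartnumber"]', xmlValue)
```

Here res should hold the same "no applicable method" message as in the question, while part_number holds the part number text. If scrape() is given several URLs, the same indexing would presumably apply per element, for example with lapply() over the returned list.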

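I also think the web page is doing some heavy browser detection. If I try to fetch the page from the command line with wget, I get an error page; the scrape function gets something usable (though apparently different from what you see); and Chrome gets complete junk with everything encoded. Yuck. This works for me: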

> URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")
> tables = xpathSApply(URL[[1]],'//table')
> tables[[2]]
<table class="product-details" border="1" cellspacing="1" cellpadding="2">
  <tr class="product-details-top"/>
  <tr class="product-details-bottom">
    <td class="pricing-description" colspan="3" align="right">All prices are in US dollars.</td>
  </tr>
  <tr>
    <th align="right">Digi-Key Part Number</th>
    <td id="reportpartnumber"><meta itemprop="productID" content="sku:A25077-ND"/>A25077-ND</td>
    <td class="catalog-pricing" rowspan="6" align="center" valign="top">
      <table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
        <tr>
          <th>Price Break</th>
          <th>Unit Price</th>
          <th>Extended Price&#13;
</th>
        </tr>
        <tr>
          <td align="center">1</td>
          <td align="right">2.75000</td>
          <td align="right">2.75</td>

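Adapt this to your use case. Here I fetch all the tables and show the second one, which contains the information you want. Some of it is in the pricing table, which you can get directly with: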

pricing = xpathSApply(URL[[1]],'//table[@id="pricing"]')[[1]]

> pricing
<table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
  <tr>
    <th>Price Break</th>
    <th>Unit Price</th>
    <th>Extended Price&#13;
</th>
  </tr>
  <tr>
    <td align="center">1</td>
    <td align="right">2.75000</td>
    <td align="right">2.75</td>
  </tr>

and so on.
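From there, the pricing node can be turned into a data frame. The following is only a sketch under stated assumptions: it uses the XML package and an inline copy of the pricing table shown above (first data row only), so it runs without network access. The names doc, headers, cells and prices are illustrative. It also shows one likely catch with XPaths copied from a browser inspector: browsers insert a tbody element that is absent from the source as parsed here, so the /tbody step from the question matches nothing.

```r
library(XML)

# Inline copy of the pricing table from the output above (first data row only)
html <- '<html><body>
<table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
<tr><th>Price Break</th><th>Unit Price</th><th>Extended Price</th></tr>
<tr><td align="center">1</td><td align="right">2.75000</td><td align="right">2.75</td></tr>
</table>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# An XPath copied from the browser includes /tbody, which this source lacks
with_tbody    <- getNodeSet(doc, '//*[@id="pricing"]/tbody/tr[2]')
without_tbody <- getNodeSet(doc, '//*[@id="pricing"]/tr[2]')
length(with_tbody)     # 0 nodes matched
length(without_tbody)  # 1 node matched

# Collect header and cell text, then reshape the cells row by row
headers <- trimws(xpathSApply(doc, '//table[@id="pricing"]//th', xmlValue))
cells   <- xpathSApply(doc, '//table[@id="pricing"]//td', xmlValue)
prices  <- as.data.frame(
  matrix(cells, ncol = length(headers), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(prices) <- headers

# Convert text to numbers; commas are stripped in case of thousands separators
prices[] <- lapply(prices, function(x) as.numeric(gsub(",", "", x)))
```

With the real page you would replace doc with URL[[1]]. The reshaping step assumes every data row has exactly as many cells as there are headers, which holds for the rows shown above but is worth checking across thousands of pages.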
