Scraping text with html_nodes in R

Question

I am trying to extract the SKU number from the following markup (this number -> 111653240199):

<body>
 <div id="a page">
   <div class="spaui-squishy-container" style="display:table:table-row;">
    <div class="spaui-squishy-inner-container" style="display:table-row;">
     <div class="spaui-squishy-content" style="display:table-cell;">
      <div id="myi-table-center" class="a-container Madagascar-main-body">
       <div id="miytable" class="mt-container clearfix">
        <div class="mt-content clearfix">
         ::before
          <div class="mt-content clearfix">
           ::before
            <table class="a-bordered a-horizontal-stripes  mt-table">
             <tbody>
               <tr id="head-row" class="mt-head">
               <tr id="MTExNjUzMjQwMTk5" data-delayed-dependency-data="{&quot;MYIService&quot;(…)
                <td id="MTExNjUzMjQwMTk5-sku" data-colum="sku" data-row="MTExNjUzMjQwMTk5">
                 <div class="mt-combination mt-layout-block">
                  <div id="MTExNjUzMjQwMTk5-sku-sku" data-column="sku" data-row="ExNjUzMjQwMTk5">
                   <div class="clamped wordbreak">
                    <div class="mt-text mt-wrap-bw"> 
                     <span class="mt-text-content mt-table-main">
                      111653240199
                     </span>

My R script looks like this:

  dades<-read_html(url)

  id<-dades %>% html_nodes("#mt-table-container.clearfix .mt-link.mt-wrap-bw.clamped.wordbreak a") %>% html_text()

But the result is an empty character vector.

What am I doing wrong?

Thanks in advance for your help and time :-)

html r web-scraping
1 Answer

Here is one way:

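library(rvest)
# `text` is assumed to hold the HTML from the question as a string
read_html(text) %>%
  # select the div that wraps the SKU text
  html_nodes('div.mt-text') %>%
  html_text() %>%
  # strip the surrounding whitespace and newlines
  trimws()

  #[1] "111653240199"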
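
As for why the selector in the question comes back empty: the markup shown contains no element with the id `mt-table-container`, no `mt-link` class, and no `a` tag, so there is nothing for it to match. The sketch below illustrates this on a trimmed, well-formed copy of the question's markup. The `text` string is a hypothetical stand-in for the real page, and the attribute-based selector is just one possible alternative, not the only correct one.

```r
library(rvest)

# Hypothetical stand-in for the page source: a trimmed, well-formed
# copy of the relevant part of the markup from the question.
text <- '
<table class="a-bordered a-horizontal-stripes mt-table">
  <tbody>
    <tr id="MTExNjUzMjQwMTk5">
      <td id="MTExNjUzMjQwMTk5-sku">
        <div id="MTExNjUzMjQwMTk5-sku-sku" data-column="sku">
          <div class="mt-text mt-wrap-bw">
            <span class="mt-text-content mt-table-main">
              111653240199
            </span>
          </div>
        </div>
      </td>
    </tr>
  </tbody>
</table>'

page <- read_html(text)

# The selector from the question: none of its id, class, or tag parts
# occur in the markup, so it selects no nodes.
original <- page %>%
  html_nodes("#mt-table-container.clearfix .mt-link.mt-wrap-bw.clamped.wordbreak a") %>%
  html_text()

# A selector built only from attributes that are present in the markup.
# trim = TRUE removes the whitespace around the number.
sku <- page %>%
  html_nodes('div[data-column="sku"] span.mt-text-content') %>%
  html_text(trim = TRUE)

print(length(original))
print(sku)
```

If a selector that matches the markup still returns nothing when run against the live URL, one possible cause is that the table is filled in by JavaScript after the page loads; `read_html()` only sees the HTML the server sends and does not run scripts.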