Unable to scrape main_container from a particular page

Problem description · votes: 1 · answers: 2

So I am trying to scrape this URL. If you inspect it, you can see there are a lot of details under a div with the class main_container, but whenever I try to scrape the page, that section never makes it into the soup.

<div class="main_container o-hidden" id="tfullview">

So I did some research and found there are two likely approaches:

  1. The content is loaded client-side, probably by a script, so I used PyQt4 to scrape the site. The code is at the end.

This code prints None, meaning the tag was not found.

  2. I also tried the selenium approach, which first loads the full page and then scrapes data from it. That also returned None. I don't have that code handy; a rough sketch of the idea is shown below.

The div (shown above) also has an o-hidden class; could that be stopping it from loading?
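For reference, a minimal sketch of what that selenium attempt might look like (assuming Chrome with a matching chromedriver on PATH; the main_container selector comes from the snippet above, everything else is illustrative):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'https://eprocure.gov.in/cppp/tendersfullview/...'  # same tender URL as in the PyQt4 code below

driver = webdriver.Chrome()
driver.get(url)
# wait for the div to be attached to the DOM before reading it
container = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.main_container'))
)
print(container.get_attribute('innerHTML'))
driver.quit()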

The PyQt4 code:

import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage
import bs4 as bs

class Client(QWebPage):

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        # quit the event loop once the page (and any scripts) finish loading
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def on_page_load(self):
        self.app.quit()

url = 'https://eprocure.gov.in/cppp/tendersfullview/MjMyODQwA13h1OGQ2NzAxYTMwZTJhNTIxMGNiNmEwM2EzNmNhYWZhODk=A13h1OGQ2NzAxYTMwZTJhNTIxMGNiNmEwM2EzNmNhYWZhODk=A13h1MTU1MzU4MDQwNQ==A13h1NzIxMTUvODUwOCA4NTA5LzE4L0NPVy9PV0M=A13h1MjAxOV9JSFFfNDU4NjEzXzE='
client_response = Client(url)
# read back the rendered HTML after the load finished
source = client_response.mainFrame().toHtml()
soup = bs.BeautifulSoup(source, 'lxml')
test = soup.find('div', class_='main_container')
print(test)  # prints None: the tag is not in the rendered HTML
python html web-scraping
2 Answers

1 vote

So, re-written with requests. A Session is needed so that the links gathered from the listing page can be requested later. You can easily adapt this to loop over all the URLs in allLinks (a sketch of that loop follows the code); I show the first one.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'https://eprocure.gov.in/cppp/latestactivetendersnew/cpppdata?page=1'

with requests.Session() as s:

    r = s.get(url)
    soup = bs(r.content, 'lxml')

    ## all table links to individual tenders
    titles, allLinks = zip(*[(item.text, item['href']) for item in soup.select('td:nth-of-type(5) a')])

    r = s.get(allLinks[0]) #choose first link from table
    soup = bs(r.content, 'lxml')
    # container = soup.select_one('#tender_full_view')
    tables = pd.read_html(r.content)

    for table in tables:
        print(table.fillna(''))
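If you want every tender rather than just the first, here is a sketch of the loop the answer mentions (it goes inside the same with block so it can reuse Session s, and assumes each tender page parses with pd.read_html the way the first one does):

    # loop over every tender link collected above
    for title, link in zip(titles, allLinks):
        r = s.get(link)
        print(title)
        for table in pd.read_html(r.content):
            print(table.fillna(''))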

If selenium is an option, you can do the following to gather all the tender links from the page 1 landing page. You can then index into the list of URLs for any individual tender. I gather the link titles in case you want to search them, then use the index (see the example after the code).

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

d = webdriver.Chrome()
url = 'https://eprocure.gov.in/cppp/latestactivetendersnew/cpppdata?page=1'

d.get(url)
## all table links to individual tenders
anchors = WebDriverWait(d, 5).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'td:nth-of-type(5) a'))
)
titles, allLinks = zip(*[(item.text, item.get_attribute('href')) for item in anchors])

d.get(allLinks[0]) #choose first link from table

container = WebDriverWait(d,5).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#tender_full_view')))
html = container.get_attribute('innerHTML')
tables = pd.read_html(html)

for table in tables:
    print(table.fillna(''))
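As a usage example, here is one hedged way to search the collected titles and open a matching tender (the keyword is hypothetical):

# indices of tenders whose title mentions a (hypothetical) keyword
matches = [i for i, t in enumerate(titles) if 'supply' in t.lower()]
if matches:
    d.get(allLinks[matches[0]])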

0 votes

I wrote a quick working example for you using requests and lxml; no selenium needed.

import requests
import lxml.html


headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
}

_session = requests.Session()
_session.headers.update(headers)

latest_tender_url = "https://eprocure.gov.in/cppp/latestactivetendersnew/cpppdata?page=1"
resp = _session.get(latest_tender_url)
xml = lxml.html.fromstring(resp.content)
tender_urls = xml.xpath('//a[contains(@href, "tendersfullview")]//@href')

for url in tender_urls:
    t_resp = _session.get(url)
    t_xml = lxml.html.fromstring(t_resp.content)
    details = t_xml.xpath('//td[@id="tenderDetailDivTd"]')
    for elm in details:
        print(elm.text_content())
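One caveat: this assumes the scraped hrefs are absolute URLs, which they appear to be on this site. If any come back relative, resolving them with urllib.parse.urljoin before the loop keeps the requests valid:

from urllib.parse import urljoin

# a no-op for already-absolute hrefs, fixes relative ones
tender_urls = [urljoin(latest_tender_url, href) for href in tender_urls]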