How do I crawl recursively in Colly?

Question

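I am trying to implement a crawler in Colly that scrapes articles from a news source. I give it two starting URLs. The behavior I want is for Colly to keep crawling pages (potentially indefinitely) by following pagination and outgoing links.

However, the behavior I observe is that Colly stops visiting after one level. This is evident from the fact that the pagination never goes past the first page. I am sure that the links, XPath expressions, etc. are accurate. I also set the maximum depth to 10 when initializing the collector, but it does not seem to work as expected.

Here is my code: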

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Instantiate default collector

    c := colly.NewCollector(
        // Restrict crawling to the target news site
        colly.AllowedDomains("www.zerohedge.com"),

        // Cache responses to prevent multiple download of pages
        // even if the collector is restarted
        colly.CacheDir("./cache"),
        colly.MaxDepth(10),
        colly.Async(true),
    )

    // Follow every article-title link found on a listing page
    c.OnXML("//h2[contains(@class,'Article_title')]//a[@href]", func(e *colly.XMLElement) {
        link := e.Attr("href")
        // fmt.Println(link)
        // start scraping the page under the link found
        e.Request.Visit(link)
    })

    c.OnXML("//div[contains(@class, 'SimplePaginator')]//a[@href]", func(e *colly.XMLElement) {
        link := e.Attr("href")
        // fmt.Println(link)
        // start scraping the page under the link found
        e.Request.Visit(link)
    })

    c.OnXML("//header[contains(@class, 'ArticleFull')]//h1/text()", func(e *colly.XMLElement) {
        // fmt.Println(e.Text)
        // article headline matched; nothing is done with it yet
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
        r.Visit(r.URL.String())
    })

    c.Visit("https://www.zerohedge.com/covid-19")
    c.Visit("https://www.zerohedge.com/medical")

    c.Wait()
}

Any help would be appreciated.

Answer 1

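After some investigation, I found that the SimplePaginator element is not always present in the page. The site appears to be using some tricks to block crawlers.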
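
One way to confirm this kind of finding is to check whether the HTML that was actually downloaded contains the paginator markup at all, before suspecting the XPath or MaxDepth. The following is only a minimal sketch under stated assumptions: hasPaginator and the two sample bodies are hypothetical names and data made up for illustration, and the check is a crude substring match on the class-name fragment from the question's XPath, not a real HTML parse.

```go
package main

import (
	"bytes"
	"fmt"
)

// paginatorMarker is the class-name fragment the question's XPath looks for.
var paginatorMarker = []byte("SimplePaginator")

// hasPaginator (hypothetical helper) reports whether the raw response body
// contains the paginator marker anywhere. It is a substring check only, so
// it can tell "markup missing" apart from "selector wrong", nothing more.
func hasPaginator(body []byte) bool {
	return bytes.Contains(body, paginatorMarker)
}

func main() {
	// Made-up sample bodies standing in for two downloaded pages.
	withPager := []byte(`<div class="SimplePaginator_box"><a href="/covid-19?page=1">Next</a></div>`)
	withoutPager := []byte(`<h2 class="Article_title"><a href="/some-article">Title</a></h2>`)

	fmt.Println(hasPaginator(withPager))    // true
	fmt.Println(hasPaginator(withoutPager)) // false
}
```

In the crawler from the question, a helper like this could be called from a c.OnResponse callback with r.Body, printing r.Request.URL whenever it returns false. If pages come back without the marker, the pagination callback never fires on them, which would match the observed "stops after one page" behavior regardless of the MaxDepth setting.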
