Collecting link text and link href from a Google search


I am trying to collect links and their link text from a Google search (only the first 10). Here is my code:

import requests
from lxml import html
import time
import re
headers={'User-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'}
sentence = "hello world"
url = 'https://google.com/search?q={}'.format(sentence)
res= requests.get(url, headers=headers)
tree= html.fromstring(res.text)
li = tree.xpath("//a[@href]")
y = [link for link in li if link.get('href').startswith(("https://", "http://")) if "google" not in link.get('href')][:10]
for i in y:
    print("{}:\t{}".format(i.text_content(), i.get('href')))

Here is the output:

10
1:56hello world:    https://www.youtube.com/watch?v=Yw6u6YkTgQ4
4:23BUMP OF CHICKEN「Hello,world!」:  https://www.youtube.com/watch?v=rOU4YiuaxAM
5:24Lady Antebellum - Hello World:  https://www.youtube.com/watch?v=al2DFQEZl4M
"Hello, World!" program - Wikipediahttps://en.wikipedia.org/wiki/%22Hello,_World!%22_program:   https://en.wikipedia.org/wiki/%22Hello,_World!%22_program
Hello World (disambiguation):   https://en.wikipedia.org/wiki/Hello_World_(disambiguation)
Sanity check:   https://en.wikipedia.org/wiki/Sanity_check
Just another Perl hacker:   https://en.wikipedia.org/wiki/Just_another_Perl_hacker
Hello, World! - Learn Python - Free Interactive Python Tutorialhttps://www.learnpython.org/en/Hello,_World!:    https://www.learnpython.org/en/Hello,_World!
Hello World Kids: HWKhelloworldkids.org/:   http://helloworldkids.org/
About Us:   http://helloworldkids.org/about-us/

The list is correct, but sometimes I get duplicate links when I print it. How can I remove the duplicate links from the output?

python web-scraping
1 answer

You can use the code below. I made a few changes to your code, and it should work:

import requests
from lxml import html

headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'}
sentence = "hello world"
url = 'https://google.com/search?q={}'.format(sentence)
res = requests.get(url, headers=headers)
tree = html.fromstring(res.text)
li = tree.xpath("//a[@href]")
# Keep only external http(s) links that do not point back to Google
y = [link for link in li
     if link.get('href').startswith(("https://", "http://"))
     and "google" not in link.get('href')]

# Drop duplicates first, then keep the first 10, so that repeated
# links do not use up slots among the 10 results
links = []
for i in y:
    href = i.get('href')
    if href not in links:
        links.append(href)
links = links[:10]

for l in links:
    print(l)

The list "links" will contain only distinct links.
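If the link text should be kept alongside each href, as in the question's original output, one possible variation is sketched below. The helper name dedupe_links and the sample data are hypothetical and only stand in for the (text, href) pairs that could be built from the scraped elements; a set is used for the membership check instead of scanning the list each time.

```python
def dedupe_links(pairs, limit=10):
    """Return up to `limit` (text, href) pairs, keeping the first occurrence of each href."""
    seen = set()      # hrefs already emitted
    unique = []       # result, in original order
    for text, href in pairs:
        if len(unique) >= limit:
            break
        if href in seen:
            continue
        seen.add(href)
        unique.append((text, href))
    return unique


# Hypothetical sample standing in for:
# [(a.text_content(), a.get('href')) for a in y]
sample = [
    ("Hello World (disambiguation)", "https://en.wikipedia.org/wiki/Hello_World_(disambiguation)"),
    ("Sanity check", "https://en.wikipedia.org/wiki/Sanity_check"),
    ("Hello World again", "https://en.wikipedia.org/wiki/Hello_World_(disambiguation)"),
]

for text, href in dedupe_links(sample):
    print("{}:\t{}".format(text, href))
```

Only the href decides whether two entries count as duplicates here, so the text of the first occurrence is the one that is kept.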
