jQuery-like HTML parsing in Python?

Question

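Is there any Python library that lets me parse an HTML document the way jQuery does?

That is, I'd like to use CSS selector syntax to grab an arbitrary set of nodes from the document, read their content and attributes, and so on.

The only Python HTML parsing library I've used before is BeautifulSoup. It's fine, but I keep thinking my parsing would go faster if I had jQuery syntax available. :D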

Answer 1

Usage (soupselect, a CSS selector add-on for BeautifulSoup):

>>> from BeautifulSoup import BeautifulSoup as Soup
>>> from soupselect import select
>>> import urllib
>>> soup = Soup(urllib.urlopen('http://slashdot.org/'))
>>> select(soup, 'div.title h3')
[<h3><span><a href='//science.slashdot.org/'>Science</a>:</span></h3>,
 <h3><a href='//slashdot.org/articles/07/02/28/0120220.shtml'>Star Trek</h3>,
 ..]
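The session above is written for Python 2 and the old BeautifulSoup 3 package. As a rough sketch of the same `div.title h3` query on a current setup, the following uses the `select` method built into bs4 instead of soupselect. The HTML snippet is made up for illustration, so the example does not depend on the live Slashdot page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched page.
HTML = """
<div class="title"><h3><a href="/science">Science</a></h3></div>
<div class="title"><h3><a href="/tech">Technology</a></h3></div>
<div class="other"><h3>Not selected</h3></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Same descendant selector as in the soupselect example:
# every <h3> nested inside a <div class="title">.
headings = soup.select("div.title h3")
titles = [h3.get_text(strip=True) for h3 in headings]
print(titles)
```

Only the two headings inside `div.title` are matched; the `h3` under `div.other` is skipped.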

Answer 2

http://packages.python.org/pyquery/

>>> from pyquery import PyQuery as pq
>>> from lxml import etree
>>> import urllib
>>> d = pq("<html></html>")
>>> d = pq(etree.fromstring("<html></html>"))
>>> d = pq(url='http://google.com/')
>>> d = pq(url='http://google.com/', opener=lambda url: urllib.urlopen(url).read())
>>> d = pq(filename=path_to_html_file)
>>> d("#hello")
[<p#hello.hello>]
>>> p = d("#hello")
>>> p.html()
'Hello world !'
>>> p.html("you know <a href='http://python.org/'>Python</a> rocks")
[<p#hello.hello>]
>>> p.html()
u'you know <a href="http://python.org/">Python</a> rocks'
>>> p.text()
'you know Python rocks'

Answer 3

BeautifulSoup (bs4) supports CSS selectors:

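import requests
from bs4 import BeautifulSoup as Soup
# download the raw HTML of the question page
html = requests.get('https://stackoverflow.com/questions/3051295').content
# name the parser explicitly so bs4 does not have to guess one
soup = Soup(html, 'html.parser')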

Title of this question:

soup.select('h1.grid--cell :first-child')[0].text

Number of upvotes on the question:

# first item
soup.select_one('[itemprop="upvoteCount"]').text

This uses Python Requests to fetch the HTML page.
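The selectors above depend on the markup Stack Overflow served when the answer was written, so they may no longer match the live page. As an offline sketch, the same two queries can be tried against a small hand-written snippet; the markup below is a made-up stand-in, not the real page source.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to fit the selectors used in the answer.
HTML = """
<h1 class="grid--cell"><a href="/questions/3051295">jQuery-like HTML parsing in Python?</a></h1>
<div itemprop="upvoteCount" data-value="64">64</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# select() returns a list of matches; take the first one.
title_link = soup.select("h1.grid--cell :first-child")[0]
title = title_link.text
href = title_link["href"]  # attributes are read like dict keys

# select_one() returns the first match, or None if nothing matches.
votes_node = soup.select_one('[itemprop="upvoteCount"]')
votes = votes_node.text
votes_attr = votes_node.get("data-value")  # .get() returns None if absent

print(title, href, votes, votes_attr)
```

When scraping a live page, it is worth checking the result of `select_one` for `None` before reading `.text`, since a layout change would otherwise raise an `AttributeError`.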
