Parse HTML with CURL in a shell script

Question

Answer 1

Using xmllint:

a='<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>'

xmllint --html --xpath 'concat(//div[@class="tracklistInfo"]/p[1]/text(), "#", //div[@class="tracklistInfo"]/p[2]/text())' - <<<"$a"

You will get:

Diplo - Justin Bieber - Skrillex#Where Are U Now

which can easily be split at the "#".
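
As a minimal sketch of that last step, the joined line can be split at the first "#". The helper name split_track is hypothetical, and the sketch assumes the artist text itself never contains a "#":

```python
def split_track(line, sep="#"):
    """Split an 'artist#title' line at the first separator."""
    artist, _, title = line.partition(sep)
    return artist, title


# The line printed by the xmllint command above.
artist, title = split_track("Diplo - Justin Bieber - Skrillex#Where Are U Now")
print(artist)
print(title)
```

If the separator could appear in the data, a character that cannot occur in the page text would be a safer choice in the concat() call.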

Answer 2

Your title starts with "Parse HTML with CURL", but curl is not an HTML parser. If you want to use a command-line tool, use xidel.

xidel -s "<url>" -e '//div[@class="tracklistInfo"]/p'
Diplo - Justin Bieber - Skrillex
Where Are U Now

xidel -s "<url>" -e '//div[@class="tracklistInfo"]/join(p," | ")'
Diplo - Justin Bieber - Skrillex | Where Are U Now

Answer 3

Don't. Use an HTML parser. For example, Python's BeautifulSoup is easy to use and handles this with very little code.

That said, keep in mind that grep works on lines. The pattern is matched against each line separately, not against the whole input.

You can use -A to print the lines that follow a match:

grep -A2 -E -m 1 '<div class="tracklistInfo">'

This should output:

<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>

You can then pipe that into tail to get the last or the second-to-last line:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1
<p>Where Are U Now</p>

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' |  tail -n2 | head -n1
<p class="artist">Diplo - Justin Bieber - Skrillex</p>

And use sed to strip the HTML tags:

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' | tail -n1 | sed 's/<[^>]*>//g'
Where Are U Now

$ grep -A2 -E -m 1 '<div class="tracklistInfo">' |  tail -n2 | head -n1 | sed 's/<[^>]*>//g'
Diplo - Justin Bieber - Skrillex
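
To make the line-by-line behaviour concrete, here is a rough Python sketch of the same grep / tail / sed steps. The function names grep_after and strip_tags are made up for illustration, and the sample input is assumed to have one tag per line, exactly as in the question:

```python
import re


def grep_after(lines, pattern, after=2):
    """Return the first matching line plus the `after` lines that follow it,
    roughly what `grep -m 1 -A2` prints."""
    for i, line in enumerate(lines):
        if re.search(pattern, line):
            return lines[i:i + after + 1]
    return []


def strip_tags(text):
    """Remove anything that looks like a tag, like sed 's/<[^>]*>//g'."""
    return re.sub(r"<[^>]*>", "", text)


html = """<body>
<p>Blah text</p>
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
</body>"""

block = grep_after(html.splitlines(), r'<div class="tracklistInfo">')
print(strip_tags(block[-2]))  # second-to-last line: the artist
print(strip_tags(block[-1]))  # last line: the title
```

Like the shell version, this depends on the exact line layout of the page, so it stops working as soon as the markup is reformatted.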

But as mentioned, this is fragile, breaks easily, and is not very pretty. By the way, here is the same thing with BeautifulSoup:

html = '''<body>
<p>Blah text</p>
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div>
</body>'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for track in soup.find_all(class_='tracklistInfo'):
    print(track.find_all('p')[0].text)
    print(track.find_all('p')[1].text)

This also works with multiple tracklistInfo blocks; adding that to the shell command would take a lot more work ;-)
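
If installing BeautifulSoup is not an option, a similar result can be sketched with the standard library's html.parser. This is a swapped-in technique, not the approach used above. The class name TrackParser and the function parse_tracks are hypothetical, and the sketch assumes there is no nested div inside a tracklistInfo block:

```python
from html.parser import HTMLParser


class TrackParser(HTMLParser):
    """Collect the text of each <p> inside <div class="tracklistInfo">."""

    def __init__(self):
        super().__init__()
        self.tracks = []       # one list of <p> texts per tracklistInfo div
        self._in_div = False
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "tracklistInfo" in classes:
            self._in_div = True
            self.tracks.append([])
        elif tag == "p" and self._in_div:
            self._in_p = True
            self.tracks[-1].append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False
        elif tag == "div":
            self._in_div = False  # assumption: no nested <div> in the block

    def handle_data(self, data):
        if self._in_p:
            self.tracks[-1][-1] += data


def parse_tracks(html):
    """Return one tuple of <p> texts per tracklistInfo block."""
    parser = TrackParser()
    parser.feed(html)
    parser.close()
    return [tuple(texts) for texts in parser.tracks]


sample = """<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div><div class="tracklistInfo">
<p class="artist">toto</p>
<p>tata</p>
</div>"""

for artist, title in parse_tracks(sample):
    print(artist, "|", title)
```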

Answer 4

cat - > file.html << EOF
<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div><div class="tracklistInfo">
<p class="artist">toto</p>
<p>tata</p>
</div>
EOF


cat file.html | tr -d '\n'  | sed -e "s/<\/div>/<\/div>\n/g" | sed -n 's/^.*class="artist">\([^<]*\)<\/p> *<p>\([^<]*\)<.*$/artist : \1\ntitle : \2\n/p'
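
The pipeline above joins all lines, puts each div on its own line again, and then captures the two paragraph texts with one regular expression. A rough Python sketch of the same regex idea follows. The names PAIR and extract_pairs are made up here, and the joining step is not needed because \s also matches newlines:

```python
import re

# Same capture groups as the sed expression: artist text, then title text.
PAIR = re.compile(r'class="artist">([^<]*)</p>\s*<p>([^<]*)<')


def extract_pairs(html):
    """Return a list of (artist, title) tuples found in the HTML text."""
    return PAIR.findall(html)


html = """<div class="tracklistInfo">
<p class="artist">Diplo - Justin Bieber - Skrillex</p>
<p>Where Are U Now</p>
</div><div class="tracklistInfo">
<p class="artist">toto</p>
<p>tata</p>
</div>"""

for artist, title in extract_pairs(html):
    print("artist :", artist)
    print("title :", title)
    print()
```

This is still pattern matching on raw markup, so it shares the fragility of the sed version.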

Answer 5

Since this shows up in searches, here are some more CLI tools for extracting data from HTML:

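xidel: download and extract data from HTML/XML pages using CSS selectors and XPath/XQuery 3.0, and query JSON
htmlq: like jq, but for HTML
pup: command-line tool for processing HTML using CSS selectors
tq: perform a lookup by CSS selector on HTML input
html-xml-utils: hxextract (extract selected elements) and hxselect (extract elements that match a (CSS) selector)
hq: lightweight command-line HTML processor using CSS and XPath selectors
cascadia: CSS selector CLI tool
xpe: easy-to-use command-line XPath tool
hred: html reduce … reads HTML from standard input and outputs JSON
parsel: select parts of an HTML document based on CSS selectors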

Here is a chart of the popularity of these projects on GitHub: [chart not included]
