How do I extract HTML elements from the "content:encoded" part of an RSS feed?

Question

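I am trying to generate a newsletter that, among other things, includes the news entries published on a website. The site is built with WordPress and has an RSS feed which, although rarely used, now comes in handy for parsing the news entries.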

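I am writing a simple generator script in Bash using xmlstarlet. In particular, I can already get the title, description and URL of each news entry (I use $itemnum as an index to iterate over them):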

TITLE=$(xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "title" feed.xml);
DESC=$(xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "description" feed.xml);
URL=$(xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "link" feed.xml);

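But now I also want to get the URL of the thumbnail image and the date of the news entry. These are basically two separate problems, so I am only asking about the thumbnail URL here (as for the date: it is easy to get from <pubDate>...</pubDate>, but it is not localized). The URL is inside the <content:encoded>...</content:encoded> tag, which contains many different HTML tags.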

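I know that xmlstarlet has an HTML option, but I don't know how to use it when the HTML is embedded in an XML element. If I try to parse the output with a second xmlstarlet call: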

xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "content:encoded" feed.xml | xmlstarlet sel -t -c "//img[@class='size-medium wp-image-2821 alignright'][1]"

it fails with the error:

-:1.1: Start tag expected, '<' not found
&lt;div&gt;
^

The reason is probably that the output of

xmlstarlet sel -t -m "/rss/channel/item[$itemnum]" -v "content:encoded" feed.xml

has all the tag brackets < and > translated to &lt; and &gt;, and I don't know how to work around that.

Edit:

A news entry looks like this:

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
    
    xmlns:georss="http://www.georss.org/georss"
    xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
    >

<channel>
    <title>This is the title</title>
    <atom:link href="https://link.to/feed" rel="self" type="application/rss+xml" />
    <link>https://website.url</link>
    <description>This is the description</description>
    <lastBuildDate>Wed, 20 Dec 2023 04:49:30 +0000</lastBuildDate>
    <language>de-DE</language>
    <sy:updatePeriod>
    hourly  </sy:updatePeriod>
    <sy:updateFrequency>
    1   </sy:updateFrequency>
    <generator>https://wordpress.org/?v=6.3.2</generator>
<site xmlns="com-wordpress:feed-additions:1">124249965</site>   

<item>
        <title>A title</title>
        <link>https://link.to/the-news-entry</link>
        
        <dc:creator><![CDATA[HP-Admin]]></dc:creator>
        <pubDate>Wed, 20 Dec 2023 04:49:30 +0000</pubDate>
                <category><![CDATA[Uncategorized]]></category>
        <guid isPermaLink="false">https://perma.link/p123</guid>

                    <description><![CDATA[a short description]]></description>
                                        <content:encoded><![CDATA[<p>A paragraph is written here.</p>
<p>Another paragraph is written here.</p>
<p><img decoding="async" fetchpriority="high" class="size-medium wp-image-2821 alignright" src="https://link.to/first-image.jpg" alt="" width="200" height="300" srcset="https://link.to/first-image.jpg 200w, https://link.to/first-image.jpg 683w, https://link.to/first-image.jpg 768w, https://link.to/first-image.jpg 1024w, https://link.to/first-image.jpg 1365w, https://link.to/first-image.jpg 367w, https://link.to/first-image.jpg 16w, https://link.to/first-image.jpg 24w, https://link.to/first-image.jpg 32w, https://link.to/first-image.jpg 1707w" sizes="(max-width: 200px) 100vw, 200px" /></p>
<p><img decoding="async" class="size-medium wp-image-2820 alignright" src="https://link.to/second-image.jpg" alt="" width="200" height="300" srcset="https://link.to/second-image.jpg 200w, https://link.to/second-image.jpg 683w, https://link.to/second-image.jpg 768w, https://link.to/second-image.jpg 1024w, https://link.to/second-image.jpg 1365w, https://link.to/second-image.jpg 367w, https://link.to/second-image.jpg 16w, https://link.to/second-image.jpg 24w, https://link.to/second-image.jpg 32w, https://link.to/second-image.jpg 1707w" sizes="(max-width: 200px) 100vw, 200px" /></p>
<p>Another paragraph is written here.</p>
<p>Another paragraph is written here.</p>
<p>Another paragraph is written here.</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
]]></content:encoded>
                    
        
        
        <post-id xmlns="com-wordpress:feed-additions:1">2818</post-id>  </item>
    </channel>
</rss>

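I have now noticed that neither of these two images is actually the header image I need... The URL of the header image does not appear in the feed XML at all, and I am really puzzled as to why.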

Answer 1

To extract the simple variables, for example:

# shellcheck shell=sh disable=SC2016

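# position()=$idx compares numerically; a bare item[$idx] would be tested as a
# boolean and match every item.
xmlstarlet select -T -t \
  --var idx -o "${itemnum:=1}" -b \
  --var q1 -o "'" -b \
  -m '/rss/channel/item[position()=$idx]' \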
    -v 'concat("title=",$q1,str:replace(title,$q1,concat($q1,"\",$q1,$q1)),$q1)' -n \
    -v 'concat("desc=",$q1,str:replace(description,$q1,concat($q1,"\",$q1,$q1)),$q1)' -n \
    -v 'concat("url=",$q1,link,$q1)' -n \
    -v 'concat("pubdate=",$q1,pubDate,$q1)' -n \
feed.xml

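where:

- the `-T` (aka `--text`) option of `xmlstarlet select` is used for plain-text output
- `--var` defines a named variable, see `xmlstarlet.txt` for examples
- the `itemnum` shell variable is passed in using shell parameter expansion
- any embedded single-quote characters in the `title` and `description` elements are quoted using the EXSLT str:replace function (see the example in the output)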

Output:

title='A '\''modified'\'' title'
desc='a short description'
url='https://link.to/the-news-entry'
pubdate='Wed, 20 Dec 2023 04:49:30 +0000'
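
The output above is shaped like shell assignments, so one way to consume it is to `eval` it. The following is a minimal sketch under that assumption: the here-document stands in for the xmlstarlet command, and `assignments` is a hypothetical variable name. Only `title` and `desc` are escaped by the command above, so this is only reasonable for a feed you trust.

```shell
#!/bin/sh
# Stand-in for the xmlstarlet command: the sample output shown above.
# In a real script, capture the command's output instead of a here-document.
assignments=$(cat <<'EOF'
title='A '\''modified'\'' title'
desc='a short description'
url='https://link.to/the-news-entry'
pubdate='Wed, 20 Dec 2023 04:49:30 +0000'
EOF
)

# Each line is a single-quoted assignment, so eval sets four shell variables.
# Assumption: the feed is trusted; url and pubdate are not escaped.
eval "$assignments"

printf 'Title: %s\n' "$title"
printf 'Link:  %s\n' "$url"
```

After the `eval`, `$title`, `$desc`, `$url` and `$pubdate` can be used like the `TITLE`, `DESC` and `URL` variables in the question, at the cost of one xmlstarlet call per item instead of three.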

Use GNU `date` to localize the date, e.g. `date -Isec -d "${pubdate}"`.
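
As a rough sketch of that localization step, the snippet below reformats the `pubDate` value with GNU `date`. The `de_DE.UTF-8` locale is an assumption taken from the feed's `<language>de-DE</language>`; if that locale is not installed, `date` falls back to English day and month names. The variable names `iso_date` and `long_date` are hypothetical.

```shell
#!/bin/sh
# Requires GNU date (the -d option parses the RFC 822 style pubDate string).
pubdate='Wed, 20 Dec 2023 04:49:30 +0000'

# Locale-independent form, handy for sorting or comparing entries.
iso_date=$(date -u -d "$pubdate" '+%Y-%m-%d')

# Localized long form. LC_TIME is set through env so that only this one
# command is affected. Assumption: de_DE.UTF-8 is available on the system.
long_date=$(env LC_TIME=de_DE.UTF-8 date -u -d "$pubdate" '+%A, %d. %B %Y')

printf '%s\n%s\n' "$iso_date" "$long_date"
```

The `-u` flag keeps the result in UTC; drop it to convert to the local time zone instead.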

To extract the image URLs from the embedded HTML, for example:

# shellcheck shell=sh disable=SC2016

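# position()=$idx compares numerically; a bare item[$idx] would be tested as a
# boolean and match every item.
xmlstarlet select -T -t \
  --var idx -o "${itemnum:=1}" -b \
  -v '/rss/channel/item[position()=$idx]/content:encoded' \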
feed.xml |
xmlstarlet format -R -H -D |
# tee /dev/stderr |
xmlstarlet select -T -t \
  --var cls -o "${class:=wp-image-2821}" -b \
  --var q1 -o "'" -b \
   -m 'str:split(//img[contains(@class,$cls)]/@srcset,",")' \
     --var url_sz='str:split(.," ")' \
     -v 'concat("url_",$url_sz[2],"=",$q1,$url_sz[1],"?width=",substring-before($url_sz[2],"w"),$q1)' -n

where:

- the `-T` (aka `--text`) option of `xmlstarlet select` is used for plain-text output
- the shell variables `itemnum` and `class` are passed in using shell parameter expansion
- `xmlstarlet format` with `--recover --html --drop-dtd` converts the HTML 4.0 to XML; note that HTML entities such as `&nbsp;` are converted (uncomment the `tee` line to see this)
- `xmlstarlet select` with the XPath `contains` function extracts the appropriate `img/@srcset` text, which `str:split` breaks up first on commas and then on spaces, and `concat` and `substring-before` turn the pieces into a useful format

Output:

url_200w='https://link.to/first-image.jpg?width=200'
url_683w='https://link.to/first-image.jpg?width=683'
url_768w='https://link.to/first-image.jpg?width=768'
url_1024w='https://link.to/first-image.jpg?width=1024'
url_1365w='https://link.to/first-image.jpg?width=1365'
url_367w='https://link.to/first-image.jpg?width=367'
url_16w='https://link.to/first-image.jpg?width=16'
url_24w='https://link.to/first-image.jpg?width=24'
url_32w='https://link.to/first-image.jpg?width=32'
url_1707w='https://link.to/first-image.jpg?width=1707'
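
If only a single image URL is wanted, as in the question, the same three-stage pipeline can select the `src` attribute instead of splitting `srcset`. This is a sketch under assumptions: the trimmed-down feed written to a temporary directory is a stand-in for the real `feed.xml`, the class name `wp-image-2821` comes from the sample entry, and the `src` variable name is hypothetical.

```shell
#!/bin/sh
# Stand-in feed: one item whose content:encoded holds HTML inside CDATA.
tmpdir=$(mktemp -d)
cat > "$tmpdir/feed.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel><item>
<content:encoded><![CDATA[<p>A paragraph is written here.</p>
<p><img class="size-medium wp-image-2821 alignright" src="https://link.to/first-image.jpg" alt="" /></p>
]]></content:encoded>
</item></channel></rss>
EOF

src=
if command -v xmlstarlet >/dev/null 2>&1; then
  src=$(
    # 1. Print the embedded HTML as plain text (-T avoids &lt; and &gt;).
    xmlstarlet select -T -t \
      -v '/rss/channel/item[1]/content:encoded' \
      "$tmpdir/feed.xml" |
    # 2. Parse it as HTML and re-emit it as XML.
    xmlstarlet format -R -H -D 2>/dev/null |
    # 3. Take the src attribute of the first matching img element.
    xmlstarlet select -T -t \
      -v '(//img[contains(@class,"wp-image-2821")])[1]/@src' -n
  )
fi
rm -rf "$tmpdir"

printf 'src=%s\n' "$src"
```

The parentheses in `(//img[...])[1]` pick the first match in document order across the whole fragment, rather than the first `img` within each parent element.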
