How do I scrape XML with Python?

Question (votes: 0, answers: 3)

I am trying to parse the following XML with Python. I am using:

thumbnail_tag = dom.getElementsByTagName('media:thumbnail')[0].toxml()

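This selects the first one. I know I can change [0] to [1] to get the tag with yt:name="mqdefault", but is there another way to do it by changing the argument in the statement above (adding something to media:thumbnail)?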

<entry>
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" height="90" width="120" time="00:01:48.500" yt:name="default" />
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" height="180" width="320" yt:name="mqdefault" />
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" height="360" width="480" yt:name="hqdefault" />
</entry>

<entry>
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" height="90" width="120" time="00:01:48.500" yt:name="default" />
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" height="180" width="320" yt:name="mqdefault" />
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" height="360" width="480" yt:name="hqdefault" />
</entry>
python xml dom xml-parsing
3 Answers
0 votes

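To build a DOM object from this XML string, you must declare the XML namespaces, either on the root tag or on the tag that uses them.

A namespace is declared with an xmlns attribute in the start tag of an element.

A namespace declaration has the following syntax:

xmlns:prefix="URI"

For example: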

<root>
    <h:table xmlns:h="http://bluejson.com/W3C/">
        <h:tr>
            <h:td>JSON</h:td>
            <h:td>JavaScript</h:td>
            <h:td>Python</h:td>
        </h:tr>
    </h:table>

    <f:table xmlns:f="http://bluejson.com/W3C/">
        <f:name>My Study Room</f:name>
        <f:width>800</f:width>
        <f:height>420</f:height>
        <f:length>1120</f:length>
    </f:table>
</root>

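In the example above, the xmlns attributes on the two table tags bind the h: and f: prefixes to a namespace URI. (Both prefixes are bound to the same URI here, so the two table elements end up in the same namespace; separate vocabularies would normally use separate URIs.)

Once a namespace is declared on an element, all child elements with the same prefix are associated with that namespace.

Namespaces can be declared on the elements that use them or on the XML root element: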

<root xmlns:h="http://bluejson.com/W3C/" xmlns:f="http://bluejson.com/W3C/">
    <h:table>
        <h:tr>
            <h:td>JSON</h:td>
            <h:td>JavaScript</h:td>
            <h:td>Python</h:td>
        </h:tr>
    </h:table>

    <f:table>
        <f:name>My Study Room</f:name>
        <f:width>800</f:width>
        <f:height>420</f:height>
        <f:length>1120</f:length>
    </f:table>
</root>
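As a rough check of the point above, the following sketch parses the same h:table element twice, once with the declaration on the element and once on the root, and compares what minidom reports. The two one-line documents and the first_table helper are made up for illustration; namespaceURI, localName, and tagName are standard DOM node attributes.

```python
import xml.dom.minidom

NS_URI = "http://bluejson.com/W3C/"

# Hypothetical one-line documents: same element, declaration in two places.
DECLARED_ON_ELEMENT = '<root><h:table xmlns:h="http://bluejson.com/W3C/"/></root>'
DECLARED_ON_ROOT = '<root xmlns:h="http://bluejson.com/W3C/"><h:table/></root>'


def first_table(xml_text):
    """Return the first element named 'table' in the NS_URI namespace."""
    dom = xml.dom.minidom.parseString(xml_text)
    return dom.getElementsByTagNameNS(NS_URI, "table")[0]


a = first_table(DECLARED_ON_ELEMENT)
b = first_table(DECLARED_ON_ROOT)

# Both elements resolve to the same namespace URI and local name.
print(a.namespaceURI, a.localName, a.tagName)
print(b.namespaceURI, b.localName, b.tagName)
```

Where the declaration sits does not change how the element is identified, only how far the prefix is visible.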

Now, the Python code that creates the XML DOM object and reads the attributes:

import xml.dom.minidom

# The media: and yt: prefixes must be declared, otherwise parsing fails.
dom = xml.dom.minidom.parseString("""
<root xmlns:media="http://media/" xmlns:yt="http://media/yt/">
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" height="90" width="120" time="00:01:48.500" yt:name="default" />
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" height="180" width="320" yt:name="mqdefault" />
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" height="360" width="480" yt:name="hqdefault" />
</root>""")

# Look elements up by namespace URI and local name, not by prefix.
media_thumbnail = dom.getElementsByTagNameNS("http://media/", "thumbnail")
print(media_thumbnail[0].getAttribute("height"))
print(media_thumbnail[0].getAttribute("width"))
print(media_thumbnail[0].getAttribute("time"))
print(media_thumbnail[0].getAttributeNS("http://media/yt/", "name"))
media_thumbnail[0].setAttribute("unit", "px")
media_thumbnail[0].setAttributeNS("http://media/yt/", "value", "1")
print(dom.toxml())

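Output (the attribute order in the serialized XML depends on the Python version; Python 3.8+ keeps the original order instead of sorting alphabetically):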

90
120
00:01:48.500
default
<?xml version="1.0" ?><root xmlns:media="http://media/" xmlns:yt="http://media/yt/">
    <media:thumbnail height="90" time="00:01:48.500" unit="px" url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" value="1" width="120" yt:name="default"/>
    <media:thumbnail height="180" url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" width="320" yt:name="mqdefault"/>
    <media:thumbnail height="360" url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" width="480" yt:name="hqdefault"/>
</root>
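To pick a thumbnail by its yt:name instead of by index, one option is to filter the node list on that attribute. This is a minimal sketch, not part of the original answer: find_thumbnail is a hypothetical helper, the XML is trimmed to the relevant attributes, and the namespace URIs are the same placeholders used above.

```python
import xml.dom.minidom

MEDIA_NS = "http://media/"     # placeholder URI, as in the answer above
YT_NS = "http://media/yt/"     # placeholder URI, as in the answer above

XML_TEXT = """<root xmlns:media="http://media/" xmlns:yt="http://media/yt/">
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" yt:name="default" />
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" yt:name="mqdefault" />
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" yt:name="hqdefault" />
</root>"""


def find_thumbnail(dom, name):
    """Return the first media:thumbnail whose yt:name equals name, or None."""
    for node in dom.getElementsByTagNameNS(MEDIA_NS, "thumbnail"):
        if node.getAttributeNS(YT_NS, "name") == name:
            return node
    return None


dom = xml.dom.minidom.parseString(XML_TEXT)
thumb = find_thumbnail(dom, "mqdefault")
print(thumb.getAttribute("url"))
```

The DOM API has no attribute filter built into getElementsByTagName, so the selection has to happen in a loop like this one.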

0 votes

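Here is the documentation on the Python XML parsers.

For your implementation, you can iterate over the node list (not over the string returned by toxml()):

thumbnail_tags = dom.getElementsByTagName('media:thumbnail')
for element in thumbnail_tags:
    attr = element.getAttribute('yt:name')

To change the value of an attribute:

for element in thumbnail_tags:
    attr = element.getAttribute('yt:name')
    if attr == 'mqdefault':
        element.setAttribute('yt:name', 'new_value')
        break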
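A self-contained version of the loop above might look like the following. The two-element document and its namespace URIs are placeholders chosen for illustration; any declared URI lets the prefixed names parse.

```python
import xml.dom.minidom

# Placeholder document: the namespace URIs are assumptions, not real schemas.
dom = xml.dom.minidom.parseString(
    '<entry xmlns:media="http://media/" xmlns:yt="http://media/yt/">'
    '<media:thumbnail yt:name="default"/>'
    '<media:thumbnail yt:name="mqdefault"/>'
    '</entry>'
)

# Match on the qualified (prefixed) names exactly as written in the document.
for element in dom.getElementsByTagName('media:thumbnail'):
    if element.getAttribute('yt:name') == 'mqdefault':
        element.setAttribute('yt:name', 'new_value')
        break

names = [e.getAttribute('yt:name')
         for e in dom.getElementsByTagName('media:thumbnail')]
print(names)
```

Matching on 'media:thumbnail' and 'yt:name' depends on the prefixes the document happens to use; the getElementsByTagNameNS and getAttributeNS variants match on the namespace URI instead.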

0 votes

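I suggest using the standard xml.etree.ElementTree instead of the DOM. The DOM is more traditional, but it is also uglier and harder to use. See Dive Into Python 3, Chapter 12: XML.

The standard module supports a subset of the XPath language, which may be useful in your case.

Here is sample code that extracts the wanted elements from sample.xml: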
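As a small illustration of that XPath subset, findall accepts a prefix-to-URI mapping so the path can use prefixes instead of full URIs. This is a sketch on a trimmed, inline copy of the feed; the NS dictionary and the name_key variable are names chosen here, not part of the module.

```python
import xml.etree.ElementTree as ET

# Prefix map passed to findall; the prefixes are local to this script.
NS = {
    "media": "http://search.yahoo.com/mrss/",
    "yt": "http://gdata.youtube.com/schemas/2007",
}

XML_TEXT = """<feed xmlns:media='http://search.yahoo.com/mrss/'
      xmlns:yt='http://gdata.youtube.com/schemas/2007'>
  <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/default.jpg' yt:name='default'/>
  <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/mqdefault.jpg' yt:name='mqdefault'/>
</feed>"""

root = ET.fromstring(XML_TEXT)

# ElementTree stores namespaced attribute keys in {uri}local form.
name_key = "{%s}name" % NS["yt"]

# Map each thumbnail's yt:name to its url.
urls = {
    e.get(name_key): e.get("url")
    for e in root.findall(".//media:thumbnail", NS)
}
print(urls["mqdefault"])
```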

import xml.etree.ElementTree as et

tree = et.parse('sample.xml')

root = tree.getroot()     # the root element of the tree

##et.dump(root)             # here is how the input file looks inside

print('===========================================')
print('Iterate through all media:thumbnail:')

# XPath expressions that describe the wanted elements. There are three here;
# however, they are just strings and can be constructed on the fly.
xp_default = ".//{http://search.yahoo.com/mrss/}thumbnail[" \
                 "@{http://gdata.youtube.com/schemas/2007}name='default']"

xp_mqdefault = ".//{http://search.yahoo.com/mrss/}thumbnail[" \
                 "@{http://gdata.youtube.com/schemas/2007}name='mqdefault']"

xp_hqdefault = ".//{http://search.yahoo.com/mrss/}thumbnail[" \
                 "@{http://gdata.youtube.com/schemas/2007}name='hqdefault']"

for e in root.iterfind(xp_default):
    et.dump(e)
    print('-------------------------------------------')

for e in root.iterfind(xp_mqdefault):
    et.dump(e)
    print('-------------------------------------------')

for e in root.iterfind(xp_hqdefault):
    et.dump(e)
    print('-------------------------------------------')
    print('The e.attrib is a dictionary of attributes:')
    print(e.attrib)

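It prints the following (captured from an 80-column console, so long lines are wrapped; attribute order may differ on newer Python versions)...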

c:\tmp\___python\sharataka\so12776774>py a.py
===========================================
Iterate through all media:thumbnail:
<ns0:thumbnail xmlns:ns0="http://search.yahoo.com/mrss/" xmlns:ns1="http://gdata
.youtube.com/schemas/2007" height="90" time="00:01:41" url="http://img.youtube.c
om/vi/jXE6G9CYcJs/default.jpg" width="120" ns1:name="default" />

-------------------------------------------
<ns0:thumbnail xmlns:ns0="http://search.yahoo.com/mrss/" xmlns:ns1="http://gdata
.youtube.com/schemas/2007" height="180" url="http://img.youtube.com/vi/jXE6G9CYc
Js/mqdefault.jpg" width="320" ns1:name="mqdefault" />

-------------------------------------------
<ns0:thumbnail xmlns:ns0="http://search.yahoo.com/mrss/" xmlns:ns1="http://gdata
.youtube.com/schemas/2007" height="360" url="http://img.youtube.com/vi/jXE6G9CYc
Js/hqdefault.jpg" width="480" ns1:name="hqdefault" />

-------------------------------------------
The e.attrib is a dictionary of attributes:
{'url': 'http://img.youtube.com/vi/jXE6G9CYcJs/hqdefault.jpg', 'width': '480', '
height': '360', '{http://gdata.youtube.com/schemas/2007}name': 'hqdefault'}

...for sample.xml (found somewhere, shortened) with this content:

<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns='http://www.w3.org/2005/Atom'
  xmlns:media='http://search.yahoo.com/mrss/'
  xmlns:yt='http://gdata.youtube.com/schemas/2007'>
  <entry>
    <media:group>
      <media:title type='plain'>Learning the ABCs</media:title>
      <media:description type='plain'>
        A great method for teaching kids the alphabet.
      </media:description>
      <media:keywords>alphabet, teaching, children</media:keywords>
      <yt:duration seconds='202'/>
      <yt:videoid>jXE6G9CYcJs</yt:videoid>
      <media:credit role='uploader' scheme='urn:youtube'
          yt:display='GoogleDeveloperssFriend'>GoogleDeveloperssFriend</media:credit>
      <media:category label='Education'
        scheme='http://gdata.youtube.com/schemas/2007/categories.cat'>
        Education</media:category>
      <media:content url='http://www.youtube.com/v/jXE6G9CYcJs'
        type='application/x-shockwave-flash' medium='video' isDefault='true'
        expression='full' duration='202' yt:format='5'/>
      <media:content
        url='rtsp://rtsp2.youtube.com/ChoLENySANFEgGDA==/0/0/0/video.3gp'
        type='video/3gpp' medium='video' expression='full'
        duration='202' yt:format='1'/>
      <media:content
        url='rtsp://rtsp2.youtube.com/ChoLENySARFEgGDA==/0/0/0/video.3gp'
        type='video/3gpp' medium='video' expression='full'
        duration='202' yt:format='6'/>
      <media:player url='https://www.youtube.com/watch?v=jXE6G9CYcJs'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/default.jpg'
        height='90' width='120' time='00:01:41' yt:name='default'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/hqdefault.jpg'
        height='360' width='480' yt:name='hqdefault'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/mqdefault.jpg'
        height='180' width='320' yt:name='mqdefault'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/1.jpg'
        height='90' width='120' time='00:00:50.500' yt:name='start'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/2.jpg'
        height='90' width='120' time='00:01:41' yt:name='end'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/3.jpg'
        height='90' width='120' time='00:02:31.500' yt:name='middle'/>
    </media:group>
    <yt:statistics viewCount='286355' favoriteCount='201'/>
  </entry>
</feed>
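The code comment in this answer notes that the XPath expressions are just strings and can be constructed on the fly. A minimal sketch of that idea follows; thumbnail_xpath is a hypothetical helper, the inline feed is a trimmed stand-in for sample.xml, and the helper does no escaping, so it assumes the name contains no quote characters.

```python
import xml.etree.ElementTree as ET

MEDIA = "http://search.yahoo.com/mrss/"
YT = "http://gdata.youtube.com/schemas/2007"


def thumbnail_xpath(name):
    """Build the XPath for a media:thumbnail with the given yt:name."""
    return ".//{%s}thumbnail[@{%s}name='%s']" % (MEDIA, YT, name)


# Trimmed stand-in for sample.xml, keeping only what the lookup needs.
root = ET.fromstring(
    "<feed xmlns:media='%s' xmlns:yt='%s'>"
    "<media:thumbnail height='90' yt:name='default'/>"
    "<media:thumbnail height='180' yt:name='mqdefault'/>"
    "<media:thumbnail height='360' yt:name='hqdefault'/>"
    "</feed>" % (MEDIA, YT)
)

# One expression per wanted name, generated instead of written out by hand.
heights = {}
for name in ("default", "mqdefault", "hqdefault"):
    element = root.find(thumbnail_xpath(name))
    heights[name] = element.get("height")
print(heights)
```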