How to get the text values of child nodes when parsing XML with ElementTree

Question (0 votes, 2 answers)

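I have an XML shopping feed containing a list of products, see below. If I were parsing it with Beautiful Soup to build a pandas DataFrame, I would use something like this: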

from bs4 import BeautifulSoup
import pandas as pd

def parse_shopping_feed(feed_xml):
    #response = requests.get(feed_url)
    soup = BeautifulSoup(feed_xml, "xml")
    all_products = []
    for item in soup.find_all("item"):
        new_product = {
            "id": item.id.string,
            "title": item.title.string,
            "description": item.description.string,
            "google_product_category": item.google_product_category.string,
            # `"product_type" in item` never matches a child tag by name, so test the lookup result instead
            "product_type": item.product_type.string if item.product_type is not None else "",
            "link": item.link.string,
            "availability": item.availability.string,
            "price": item.price.string,
            "brand": item.brand.string
        }
        all_products.append(new_product)
    feed_df = pd.DataFrame(all_products)
    return feed_df

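Now, for one of these feeds (around 300 MB), Beautiful Soup is too slow, so I have started looking at ElementTree, which should be faster. But I cannot for the life of me figure out how to recreate this code with ET.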

How do I loop over all the item tags and, for example, get each one's ID and title?

My best guess so far is the following, but I don't know how to select each ID and title.

import xml.etree.ElementTree as ET

xml_file = open('shopping_feed.xml')
for event, element in ET.iterparse(xml_file, events=None):
    for child in element:
        print(child)
    element.clear()

Any suggestions? To be clear, my end goal is a DataFrame, so if there is a way to convert the feed directly, that would be great!

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
    <title>Feed XYZ</title>
    <description></description>
    <link></link>
    <item>
        <g:id>10000005</g:id>
        <title><![CDATA[TEst Item XYZ                                           ]]></title>
        <g:google_product_category>Food and stuff</g:google_product_category>
        <g:product_type><![CDATA[Details &gt; Food and stuff]]></g:product_type>
        <g:adwords_grouping><![CDATA[Food and stuff]]></g:adwords_grouping>
        <link>https://www.abc.se/abc/abc</link>
        <g:image_link>https://www.abc.se/bilder/artiklar/10000005.jpg</g:image_link>
        <g:additional_image_link>https://www.abc.se/bilder/artiklar/zoom/10000005_1.jpg</g:additional_image_link>
        <g:condition>new</g:condition>
        <g:availability>out of stock</g:availability>
        <g:price>155 SEK</g:price>
        <g:buyprice>68.00</g:buyprice>
        <g:brand>ABC</g:brand>
        <g:gtin>8003299920846</g:gtin>
        <g:mpn>ABC01 AZ</g:mpn> 
        <g:weight>0 g</g:weight> 
        <g:item_group_id>10000008r</g:item_group_id>
        <g:color>Blue</g:color>
//100s of thousand of products
python-3.x pandas beautifulsoup xml-parsing elementtree
2 Answers

0 votes

Could you help me measure the speed of the following?

from simplified_scrapy.simplified_doc import SimplifiedDoc 
import time
html = '''
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
    <title>Feed XYZ</title>
    <description></description>
    <link></link>
    <item>
        <g:id>10000005</g:id>
        <title><![CDATA[TEst Item XYZ                                           ]]></title>
        <g:google_product_category>Food and stuff</g:google_product_category>
        <g:product_type><![CDATA[Details &gt; Food and stuff]]></g:product_type>
        <g:adwords_grouping><![CDATA[Food and stuff]]></g:adwords_grouping>
        <link>https://www.abc.se/abc/abc</link>
        <g:image_link>https://www.abc.se/bilder/artiklar/10000005.jpg</g:image_link>
        <g:additional_image_link>https://www.abc.se/bilder/artiklar/zoom/10000005_1.jpg</g:additional_image_link>
        <g:condition>new</g:condition>
        <g:availability>out of stock</g:availability>
        <g:price>155 SEK</g:price>
        <g:buyprice>68.00</g:buyprice>
        <g:brand>ABC</g:brand>
        <g:gtin>8003299920846</g:gtin>
        <g:mpn>ABC01 AZ</g:mpn> 
        <g:weight>0 g</g:weight> 
        <g:item_group_id>10000008r</g:item_group_id>
        <g:color>Blue</g:color>
    </item>
</channel>
//100s of thousand of products
'''
startTime = time.time()
all_products = []
doc = SimplifiedDoc(html)
items = doc.getElementsByTag('item')
for item in items:
  new_product = {
    "id": item.getElementByTag('g:id').html,
    "title": item.getElementByTag('title').html,
    "description": item.getElementByTag('description').html if "<description" in item.html else "",
    "google_product_category": item.getElementByTag('g:google_product_category').html,
    "product_type": item.getElementByTag('g:product_type').html if "<g:product_type" in item.html else "",
    "link": item.getElementByTag('link').html,
    "availability": item.getElementByTag('g:availability').html,
    "price": item.getElementByTag('g:price').html,
    "brand": item.getElementByTag('g:brand').html
  }
  all_products.append(new_product)
print (time.time()-startTime)
print (all_products)

0 votes

Found a solution:

import lxml.etree as et

# Read the feed as bytes so lxml can honour the encoding declared in the XML header
with open('feed.xml', 'rb') as f:
    data = et.fromstring(f.read())
items = data.xpath('//item')

all_products = []
prefix = "{http://base.google.com/ns/1.0}"
for item in items:
    new_product = {
        "id": item.find(prefix + 'id').text,
        "title": item.find('title').text,
        "google_product_category": item.find(prefix + 'google_product_category').text,
        "product_type": item.find(prefix + 'product_type').text,
        "link": item.find('link').text,
        "availability": item.find(prefix + 'availability').text,
        "price": item.find(prefix + 'price').text,
        "brand": item.find(prefix + 'brand').text
    }
    all_products.append(new_product)
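
For a file of this size, a streaming variant may also be worth trying. What follows is only a sketch built on the standard library's ET.iterparse, not a tested drop-in for the real feed. The names G, FIELDS and iter_products and the two-item sample feed are made up for illustration, and it assumes every product sits in an un-namespaced item element, as in the feed from the question. findtext returns the default when a child is missing, so an absent g:product_type becomes an empty string instead of raising AttributeError.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URI bound to the g: prefix in the feed, in ElementTree's {uri}tag form
G = "{http://base.google.com/ns/1.0}"

# Output column -> tag to look up under each <item> (hypothetical helper mapping)
FIELDS = {
    "id": G + "id",
    "title": "title",
    "google_product_category": G + "google_product_category",
    "product_type": G + "product_type",
    "link": "link",
    "availability": G + "availability",
    "price": G + "price",
    "brand": G + "brand",
}

def iter_products(source):
    """Yield one dict per <item> while the file is still being read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        # findtext gives "" for a missing child instead of failing on None.text
        yield {col: elem.findtext(tag, default="").strip() for col, tag in FIELDS.items()}
        # Free the item's children and text once its values have been copied out
        elem.clear()

# Tiny made-up feed standing in for the real file; the second item lacks product_type
sample = b"""<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
    <title>Feed XYZ</title>
    <item>
        <g:id>10000005</g:id>
        <title><![CDATA[TEst Item XYZ      ]]></title>
        <g:google_product_category>Food and stuff</g:google_product_category>
        <g:product_type><![CDATA[Details &gt; Food and stuff]]></g:product_type>
        <link>https://www.abc.se/abc/abc</link>
        <g:availability>out of stock</g:availability>
        <g:price>155 SEK</g:price>
        <g:brand>ABC</g:brand>
    </item>
    <item>
        <g:id>10000006</g:id>
        <title>Second item</title>
        <link>https://www.abc.se/abc/def</link>
        <g:availability>in stock</g:availability>
        <g:price>99 SEK</g:price>
        <g:brand>ABC</g:brand>
    </item>
</channel>
</rss>"""

all_products = list(iter_products(io.BytesIO(sample)))
print(all_products[0]["id"], all_products[0]["title"])
```

With the real file you would pass the path or an open binary file instead, for example iter_products('feed.xml'), and then build the frame as in the question with pd.DataFrame(all_products). The emptied item elements stay attached to the root, so memory use still grows slowly with the number of products.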