I want to access the information stored in the child nodes. Is this because of the structure of the file?
I tried extracting the author child node from the file separately and running the Python code on it, and that works fine.
# (Python 2)
import urllib
import xml.etree.ElementTree as ET

url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
print 'Retrieving', url
document = urllib.urlopen(url).read()
print 'Retrieved', len(document), 'characters.'
print document[:50]

tree = ET.fromstring(document)
lst = tree.findall('title')   # comes back empty
print lst[:100]
You can't find the title elements because of the namespace.
Here is some sample code:
import xml.etree.ElementTree as ET
import urllib.request
url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
response = urllib.request.urlopen(url).read()
tree = ET.fromstring(response)
# document title (direct child of the root element)
for docTitle in tree.findall('{urn:hl7-org:v3}title'):
    print(docTitle.text)

# every title element anywhere in the document
for compTitle in tree.findall('.//{urn:hl7-org:v3}title'):
    print(compTitle.text)
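Note that findall also accepts a namespace mapping, so you do not have to repeat the full {urn:hl7-org:v3} URI in every path. A minimal sketch of the same query, where hl7 is just a local alias I chose for the prefix:

import xml.etree.ElementTree as ET
import urllib.request

url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
tree = ET.fromstring(urllib.request.urlopen(url).read())

# map a local prefix to the document's default namespace
ns = {'hl7': 'urn:hl7-org:v3'}

# same as tree.findall('.//{urn:hl7-org:v3}title')
for title in tree.findall('.//hl7:title', ns):
    print(title.text)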
UPDATE
If you need to search for XML nodes, you should use XPath expressions.
Example:
NS = '{urn:hl7-org:v3}'
ID = '829076996' # ID TO BE FOUND
# XPATH TO FIND AUTHORS BY ID (search ID and return related author node)
xPathAuthorById = ''.join([
    ".//",
    NS, "author/",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "id[@extension='", ID,
    "']/../../.."
])

# XPATH TO FIND AUTHOR NAME ELEMENT
xPathAuthorName = ''.join([
    "./",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "name"
])

# FOR EACH AUTHOR FOUND, SEARCH ATTRIBUTES (example: name)
for author in tree.findall(xPathAuthorById):
    name = author.find(xPathAuthorName)
    print(name.text)
This example prints the author name for ID 829076996.
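If the /../../.. parent steps feel brittle, a rough equivalent (a sketch reusing NS, ID and tree from above) is to iterate over the author elements and filter on the id in Python instead:

for author in tree.findall('.//' + NS + 'author'):
    org = author.find(NS + 'assignedEntity/' + NS + 'representedOrganization')
    if org is None:
        continue
    # keep only the author whose organization id matches
    orgId = org.find(NS + 'id')
    if orgId is not None and orgId.get('extension') == ID:
        print(org.find(NS + 'name').text)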
UPDATE 2
You can easily iterate over all the assignedEntity tags with the findall method. Each of them can contain multiple products, so another findall is needed (see the example below).
xPathAssignedEntities = ''.join([
    ".//",
    NS, "author/",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "assignedEntity/",
    NS, "assignedOrganization/",
    NS, "assignedEntity"
])

xPathProdCode = ''.join([
    NS, "actDefinition/",
    NS, "product/",
    NS, "manufacturedProduct/",
    NS, "manufacturedMaterialKind/",
    NS, "code"
])

# GET ALL assignedEntity TAGS
for assignedEntity in tree.findall(xPathAssignedEntities):
    # GET ID AND NAME OF assignedEntity
    id = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'id').get('extension')
    name = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'name').text
    # FOR EACH assignedEntity WE CAN HAVE MULTIPLE <performance> TAGS
    for performance in assignedEntity.findall(NS + 'performance'):
        actCode = performance.find(NS + 'actDefinition/' + NS + 'code').get('displayName')
        prodCode = performance.find(xPathProdCode).get('code')
        print(id, '\t', name, '\t', actCode, '\t', prodCode)
This is the result:
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-0050
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-4900
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-4910
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-4940
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-4960
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-0050
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-4900
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-4910
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-4940
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-4960
829084545 Pfizer Pharmaceuticals LLC MANUFACTURE 0049-4900
829084545 Pfizer Pharmaceuticals LLC MANUFACTURE 0049-4910
829084545 Pfizer Pharmaceuticals LLC MANUFACTURE 0049-4960
829084545 Pfizer Pharmaceuticals LLC PACK 0049-4900
829084545 Pfizer Pharmaceuticals LLC PACK 0049-4910
829084545 Pfizer Pharmaceuticals LLC PACK 0049-4960
618054084 Pharmacia and Upjohn Company LLC ANALYSIS 0049-0050
618054084 Pharmacia and Upjohn Company LLC ANALYSIS 0049-4940
829084552 Pfizer Pharmaceuticals LLC PACK 0049-4900
829084552 Pfizer Pharmaceuticals LLC PACK 0049-4910
829084552 Pfizer Pharmaceuticals LLC PACK 0049-4960
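If you want this table in a file rather than on stdout, the same loop can feed the csv module directly. A sketch reusing tree, NS and the two XPath strings from above (output.csv is just an example filename):

import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'name', 'activity', 'productCode'])
    for assignedEntity in tree.findall(xPathAssignedEntities):
        id = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'id').get('extension')
        name = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'name').text
        for performance in assignedEntity.findall(NS + 'performance'):
            actCode = performance.find(NS + 'actDefinition/' + NS + 'code').get('displayName')
            prodCode = performance.find(xPathProdCode).get('code')
            writer.writerow([id, name, actCode, prodCode])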
You can use xmltodict to generate a Python dictionary from the requested XML data.
Here is a basic example:
import urllib2
import xmltodict
def foobar(request):
    file = urllib2.urlopen('https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml')
    data = file.read()
    file.close()

    data = xmltodict.parse(data)
    return {'xmldata': data}
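Once parsed, element names become nested dictionary keys, so you can index into the structure directly. A short sketch, assuming the SPL root element is <document> (inspect the keys first if you are unsure of the layout):

doc = foobar(None)['xmldata']    # the request argument is unused in foobar

print(doc.keys())                # top level: the root element name
print(doc['document'].keys())    # children of the <document> root, e.g. 'title', 'author'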
I usually prefer BeautifulSoup with the lxml parser for parsing XML. Sample code below:
import requests
from bs4 import BeautifulSoup
url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
document = requests.get(url)
soup = BeautifulSoup(document.content, "lxml-xml")
print(soup.find("title"))
Output:
<title>These highlights do not include all the information needed to use ZOLOFT safely and effectively. See full prescribing information for ZOLOFT. <br/>
<br/>ZOLOFT (sertraline hydrochloride) tablets, for oral use <br/>ZOLOFT (sertraline hydrochloride) oral solution <br/>Initial U.S. Approval: 1991</title>
You can then use the methods BeautifulSoup provides, such as find and find_all, to locate the relevant nodes or child nodes.
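For example, a sketch of pulling out the author organization names from the same soup object (the tag names follow the SPL structure used in the ElementTree answer above):

# author / organization info lives under <representedOrganization> tags
for org in soup.find_all("representedOrganization"):
    name = org.find("name")
    if name is not None:
        print(name.text)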