如何使用BeautifulSoup从JATS XML中提取日期(epub)?
<pub-date pub-type="epub">
<day>12</day>
<month>09</month>
<year>2011</year>
</pub-date>
→ 2011-09-12
<pub-date pub-type="collection">
<year>2011</year>
</pub-date>
应该被忽略。
在您的示例中,pub-type
是pub-date的属性,该属性的值为"epub"
。要以标准化的格式(如JATS XML)浏览文档树,您需要使用lxml,无论是作为standalone还是作为parser within BeautifulSoup。
这是使用lxml.etree的两个函数,仅当属性为“epub”时才使用xpath解析候选日期字段。我基于PLOS的JATS XML格式,我希望这里适用。
import datetime
import lxml.etree as et
def parse_article_date(date_element, date_format='%Y %m %d'):
"""
For an article date element, convert XML fields to a datetime object
:param date_format: string format used to convert to datetime object
:return: datetime object based on XML date fields
"""
day = ''
month = ''
year = ''
for item in date_element.getchildren():
if item.tag == 'day':
day = item.text
if item.tag == 'month':
month = item.text
if item.tag == 'year':
year = item.text
date = (year, month, day)
string_date = ' '.join(date)
date = datetime.datetime.strptime(string_date, date_format)
return date
def get_article_pubdate(article_file, tag_path_elements=None, string_=False):
"""
For a local article file, get its date of publication
:param article_file: the xml file for a single article
:param tag_path_elements: xpath search results of the location in the article's XML tree
:param string_: defaults to False. If True, returns a date string instead of datetime object
:return: dict of date type mapped to datetime object for that article
"""
pub_date = {}
if tag_path_elements is None:
tag_path_elements = ("/",
"article",
"front",
"article-meta",
"pub-date")
article_tree = et.parse(article_file)
article_root = article_tree.getroot()
tag_location = '/'.join(tag_path_elements)
pub_date_fields = article_root.xpath(tag_location)
print(pub_date_fields)
for element in pub_date_fields:
pub_type = element.get('pub-type')
if pub_type == 'epub':
date = parse_article_date(element)
pub_date[pub_type] = date
if string_:
for key, value in pub_date.items():
if value:
pub_date[key] = value.strftime('%Y-%m-%d') # you can set this to any date format
return pub_date