How do I parse email headers using Python?

Question (votes: 0, answers: 5)

Here is a sample email header:

header = """
From: Media Temple user ([email protected])
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: [email protected]
Return-Path: <[email protected]>
Envelope-To: [email protected]
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <[email protected]>) id 1KDoNH-0000f0-RL for [email protected]; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <[email protected]>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""

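The header is stored as a string. How can I parse it into a dictionary, with the header field names as the keys and the field contents as the values?

I want a dictionary like this: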

header_dict = {
    'From': 'Media Temple user ([email protected])',
    'Subject': 'article: A sample header',
    'Date': 'January 25, 2011 3:30:58 PM PDT',
    # ... and so on for the remaining fields
}

I have listed the required fields:

header_reqd = ['From:','Subject:','Date:','To:','Return-Path:','Envelope-To:','Delivery-Date:','Received:','Dkim-Signature:','Domainkey-Signature:','Message-Id:','Mime-Version:','Content-Type:','X-Spam-Status:','X-Spam-Level:','Message Body:']

This lists the items that could serve as the keys of the dictionary.

python dictionary email-headers text-processing
5 Answers
6
votes

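Most of these answers seem to overlook the Python email parser, and their output is incorrect because every value keeps a leading space. In addition, the OP's header string may begin with a newline, which has to be stripped before the email parser will work correctly.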

from email.parser import HeaderParser
header = header.strip() # Fix incorrect formatting
email_message = HeaderParser().parsestr(header)
dict(email_message)

Output (truncated):

>>> from pprint import pprint
>>> pprint(dict(email_message))
{'Content-Type': 'multipart/alternative; '
                 'boundary="----=_Part_3927_12044027.1214951458678"',
 'Date': 'January 25, 2011 3:30:58 PM PDT',
 'Delivery-Date': 'Tue, 25 Jan 2011 15:31:01 -0700',
 ...
 'Subject': 'article: A sample header',
 'To': '[email protected]',
 'X-Spam-Level': '***',
 'X-Spam-Status': 'score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
                  'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}

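Duplicate header keys

Note that email headers can contain duplicate keys, as described in the Python documentation for email.message:

Headers are stored and returned in case-preserving form, but field names are matched case-insensitively. Unlike a real dict, there is an ordering to the keys, and there can be duplicate keys. Additional methods are provided for working with headers that have duplicate keys.

For example, converting the following email message to a Python dictionary keeps only the first Received key.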

headers = HeaderParser().parsestr("""Received: by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <[email protected]>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)
Received: by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <[email protected]>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)""")

dict(headers)
{'Received': 'by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)'}

Use the get_all method to inspect the duplicates:

headers.get_all('Received')
['by mx0047p1mdw1.sendgrid.net with SMTP id 6WCVv7KAWn Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'from mail-io0-f169.google.com (mail-io0-f169.google.com [209.85.223.169]) by mx0047p1mdw1.sendgrid.net (Postfix) with ESMTPS id AA9FFA817F2 for <[email protected]>; Wed, 27 Jul 2016 20:53:06 +0000 (UTC)', 'by mail-io0-f169.google.com with SMTP id b62so81593819iod.3 for <[email protected]>; Wed, 27 Jul 2016 13:53:06 -0700 (PDT)']
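If every occurrence of a repeated field should survive the conversion, one option is to build a dictionary of lists rather than calling dict() directly. The following is only a sketch: the helper name headers_to_multidict and the sample host names are made up for illustration. It relies on Message.items(), which returns every (name, value) pair, including repeated names.

```python
from collections import defaultdict
from email.parser import HeaderParser

# Hypothetical sample with a repeated Received field (host names are made up).
raw = (
    "Received: by mx.example.net with SMTP id 1\n"
    "Received: from mail.example.org by mx.example.net with ESMTPS id 2\n"
    "Subject: article: A sample header\n"
)

def headers_to_multidict(raw_headers):
    """Map each field name to a list of all its values, in original order."""
    message = HeaderParser().parsestr(raw_headers.strip())
    collected = defaultdict(list)
    # items() yields every (name, value) pair, including repeated names.
    for name, value in message.items():
        collected[name].append(value)
    return dict(collected)

result = headers_to_multidict(raw)
print(result["Received"])
print(result["Subject"])
```

Names are grouped exactly as they are spelled in the input, so "received" and "Received" would end up under separate keys in this sketch; normalise the case of name first if that matters.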

1
vote

You can split the string on newlines, then split each line on ":".

>>> my_header = {}
>>> for x in header.strip().split("\n"):
...     x = x.split(":", 1)
...     my_header[x[0]] = x[1]
... 
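The loop above assumes that every line contains a colon and leaves the leading space in each value. A slightly more defensive variant might look like the sketch below. The function name parse_header_lines and the sample text are hypothetical, and joining continuation lines (lines that start with whitespace) with a single space is a simplification of real header unfolding.

```python
def parse_header_lines(text):
    """Split 'Name: value' lines into a dict, tolerating common irregularities."""
    parsed = {}
    last_name = None
    for line in text.strip().splitlines():
        if line[:1] in (" ", "\t") and last_name is not None:
            # Continuation (folded) line: append it to the previous value.
            parsed[last_name] += " " + line.strip()
            continue
        name, sep, value = line.partition(":")
        if not sep:
            # No colon on this line, so it is not a header field; skip it.
            last_name = None
            continue
        last_name = name.strip()
        parsed[last_name] = value.strip()
    return parsed

# Hypothetical input with a folded field and a stray non-header line.
sample = """
Subject: article: A sample header
X-Spam-Status: score=3.7
 version=3.1.7
not a header line
To: someone@example.com
"""

result = parse_header_lines(sample)
print(result)
```

As with any plain dict, a repeated field name overwrites the earlier value here, so only the last occurrence is kept.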

1
vote
header = """From: Media Temple user ([email protected])
Subject: article: A sample header
Date: January 25, 2011 3:30:58 PM PDT
To: [email protected]
Return-Path: <[email protected]>
Envelope-To: [email protected]
Delivery-Date: Tue, 25 Jan 2011 15:31:01 -0700
Received: from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <[email protected]>) id 1KDoNH-0000f0-RL for [email protected]; Tue, 25 Jan 2011 15:31:01 -0700
Dkim-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=
Domainkey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=
Message-Id: <[email protected]>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"
X-Spam-Status: score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7
X-Spam-Level: ***
Message Body: **The email message body**
"""   

Split the string into individual lines, then split each line once on ":":

from pprint import pprint as pp
pp(dict(line.split(":",1) for line in header.splitlines()))

Output:

{'Content-Type': ' multipart/alternative; '
                 'boundary="----=_Part_3927_12044027.1214951458678"',
 'Date': ' January 25, 2011 3:30:58 PM PDT',
 'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
 'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; '
                   's=gamma; '
                   'h=domainkey-signature:received:received:message-id:date:from:to '
                   ':subject:mime-version:content-type; '
                   'bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; '
                   'b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea '
                   'LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m '
                   'CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
 'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; '
                        'h=message-id:date:from:to:subject:mime-version:content-type; '
                        'b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH '
                        '36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB '
                        '6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
 'Envelope-To': ' [email protected]',
 'From': ' Media Temple user ([email protected])',
 'Message Body': ' **The email message body**',
 'Message-Id': ' '
               '<[email protected]>',
 'Mime-Version': ' 1.0',
 'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by '
             'cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from '
             '<[email protected]>) id 1KDoNH-0000f0-RL for '
             '[email protected]; Tue, 25 Jan 2011 15:31:01 -0700',
 'Return-Path': ' <[email protected]>',
 'Subject': ' article: A sample header',
 'To': ' [email protected]',
 'X-Spam-Level': ' ***',
 'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, '
                  'HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}

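line.split(":",1) ensures that we split only once on ":", so any ":" inside the value is left alone. The result is a sequence of key/value pairs, and calling dict on it builds the dictionary from those pairs.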
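The effect of the second argument to split is easy to see on a single line. This is a small illustration using the Subject field from the question; the variable names are arbitrary.

```python
line = "Subject: article: A sample header"

# Without a limit, every colon splits the line, giving three pieces.
print(line.split(":"))   # ['Subject', ' article', ' A sample header']

# With a limit of 1, only the first colon is used as the separator.
name, value = line.split(":", 1)
print((name, value))     # ('Subject', ' article: A sample header')

# The value still carries its leading space; strip() removes it if needed.
print(value.strip())     # article: A sample header
```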


1
vote

split will work for you:

Demo:

>>> result = {}
>>> for i in header.split("\n"):
...     i = i.strip()
...     if i:
...         k, v = i.split(":", 1)
...         result[k] = v

Output:

>>> import pprint
>>> pprint.pprint(result)
{'Content-Type': ' multipart/alternative; boundary="----=_Part_3927_12044027.1214951458678"',
 'Date': ' January 25, 2011 3:30:58 PM PDT',
 'Delivery-Date': ' Tue, 25 Jan 2011 15:31:01 -0700',
 'Dkim-Signature': ' v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:to :subject:mime-version:content-type; bh=+JqkmVt+sHDFIGX5jKp3oP18LQf10VQjAmZAKl1lspY=; b=F87jySDZnMayyitVxLdHcQNL073DytKRyrRh84GNsI24IRNakn0oOfrC2luliNvdea LGTk3adIrzt+N96GyMseWz8T9xE6O/sAI16db48q4Iqkd7uOiDvFsvS3CUQlNhybNw8m CH/o8eELTN0zbSbn5Trp0dkRYXhMX8FTAwrH0=',
 'Domainkey-Signature': ' a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:to:subject:mime-version:content-type; b=wkbBj0M8NCUlboI6idKooejg0sL2ms7fDPe1tHUkR9Ht0qr5lAJX4q9PMVJeyjWalH 36n4qGLtC2euBJY070bVra8IBB9FeDEW9C35BC1vuPT5XyucCm0hulbE86+uiUTXCkaB 6ykquzQGCer7xPAcMJqVfXDkHo3H61HM9oCQM=',
 'Envelope-To': ' [email protected]',
 'From': ' Media Temple user ([email protected])',
 'Message Body': ' **The email message body**',
 'Message-Id': ' <[email protected]>',
 'Mime-Version': ' 1.0',
 'Received': ' from :po-out-1718.google.com ([72.14.252.155]:54907) by cl35.gs01.gridserver.com with esmtp (Exim 4.63) (envelope-from <[email protected]>) id 1KDoNH-0000f0-RL for [email protected]; Tue, 25 Jan 2011 15:31:01 -0700',
 'Return-Path': ' <[email protected]>',
 'Subject': ' article: A sample header',
 'To': ' [email protected]',
 'X-Spam-Level': ' ***',
 'X-Spam-Status': ' score=3.7 tests=DNS_FROM_RFC_POST, HTML_00_10, HTML_MESSAGE, HTML_SHORT_LENGTH version=3.1.7'}

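0
votes

To parse an email, you can use Python's standard email library. In particular, the Parser API can load an email (from memory or from a file) and create the corresponding EmailMessage object.

For example: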

from email.parser import Parser
from email.policy import default as DefaultPolicy

raw_message = """From: [email protected]
Subject: Subject test
Date: January 25, 2011 3:30:58 PM PDT
To: [email protected]
Content-Type: text/plain; charset="utf-8"

Email message body test."""

message = Parser(policy=DefaultPolicy).parsestr(raw_message)

headers = {}
# Unique headers:
for header in ["Date", "From", "Reply-To", "Sender", "Subject", "To"]:
    headers[header] = message.get(header) if header in message else ""
# Duplicated headers:
for header in ["Content-Type", "Received"]:
    headers[header] = message.get_all(header) if header in message else []

print(f"Headers: {headers}")  # Headers: {'Date': 'Tue, 25 Jan 2011 03:30:58 -0000', 'From': '[email protected]', 'Reply-To': '', 'Sender': '', 'Subject': 'Subject test', 'To': '[email protected]', 'Content-Type': ['text/plain; charset="utf-8"'], 'Received': []}

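Note: Keep in mind that, as the example above shows, some headers may appear more than once (e.g., "Content-Type", "Received"). Accessing them through message.get(header) or message[header] will not return every occurrence. Use message.get_all(header) instead.

If the email is retrieved as bytes rather than as a string, you can use BytesParser instead of Parser: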
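The difference between the accessors can be checked on a small message. The sketch below uses made-up host names; it also shows that get and get_all accept a fallback value as their second argument, which can stand in for the "if header in message else ..." pattern used above.

```python
from email.parser import Parser
from email.policy import default

# Hypothetical message with a repeated Received field.
raw = (
    "Received: from a.example.org by b.example.net\n"
    "Received: from c.example.org by a.example.org\n"
    "Subject: Subject test\n"
    "\n"
    "Body.\n"
)
message = Parser(policy=default).parsestr(raw)

# get() and message[...] return only the first occurrence.
first = message.get("Received")
# get_all() returns a list with every occurrence, in order.
every = message.get_all("Received", [])

# The second argument is returned when the header is absent.
missing = message.get("Reply-To", "")
missing_all = message.get_all("Reply-To", [])

# Field names are matched case-insensitively.
same = message["subject"] == message["SUBJECT"]

print(first)
print(len(every))
print(repr(missing), missing_all, same)
```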

from email.parser import BytesParser
from email.policy import default as DefaultPolicy

raw_message = b"""From: [email protected]
Subject: Subject test
Date: January 25, 2011 3:30:58 PM PDT
To: [email protected]
Content-Type: text/plain; charset="utf-8"

Email message body test."""

message = BytesParser(policy=DefaultPolicy).parsebytes(raw_message)
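When the bytes come from a file rather than from memory, BytesParser.parse accepts a binary file object, and passing headersonly=True stops parsing at the end of the header block. In the sketch below, io.BytesIO stands in for a file opened with open(path, "rb"), and the address is made up for illustration.

```python
import io
from email.parser import BytesParser
from email.policy import default

# Hypothetical raw message; a real one would usually come from disk or a socket.
raw_bytes = (
    b"From: sender@example.com\r\n"
    b"Subject: Subject test\r\n"
    b"\r\n"
    b"Email message body test.\r\n"
)

# io.BytesIO plays the role of a file opened in binary mode.
with io.BytesIO(raw_bytes) as fp:
    message = BytesParser(policy=default).parse(fp, headersonly=True)

# Convert the header objects to plain strings for a simple dict view.
headers = {name: str(value) for name, value in message.items()}
print(headers)
```

The dict comprehension keeps only the last value for a repeated field name, so use get_all for fields such as Received.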