I am just starting to learn Python and would like to read an Apache log file and put parts of each line into different lists.
A line from the file:
172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
According to the Apache website, the format is:
%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"
I can open the file and read it as is, but I don't know how to read it according to that format so that I can put each part into a list.
This is a job for regular expressions.
For example:
line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
regex = '^([(\d\.)]+) [^ ]* [^ ]* \[([^ ]* [^ ]*)\] "([^"]*)" (\d+) [^ ]* "([^"]*)" "([^"]*)"'
import re
print(re.match(regex, line).groups())
The output will be a tuple with the 6 pieces of information from the line (specifically, the groups within parentheses in that pattern):
('172.16.0.3', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827')
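To collect each field into its own list, as the question asks, here is a minimal sketch that applies the same pattern to every line of a file (the file name access.log is a placeholder; adjust it to your log's path):

import re

regex = r'^([(\d\.)]+) [^ ]* [^ ]* \[([^ ]* [^ ]*)\] "([^"]*)" (\d+) [^ ]* "([^"]*)" "([^"]*)"'

hosts, times, requests, statuses, referers, agents = [], [], [], [], [], []
with open('access.log') as f:  # placeholder path
    for line in f:
        match = re.match(regex, line)
        if match:  # skip lines that don't fit the expected format
            host, time, request, status, referer, agent = match.groups()
            hosts.append(host)
            times.append(time)
            requests.append(request)
            statuses.append(status)
            referers.append(referer)
            agents.append(agent)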
I created a Python library that does exactly this: apache-log-parser.
>>> import apache_log_parser
>>> line_parser = apache_log_parser.make_parser("%h <<%P>> %t %Dus \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %l %u")
>>> log_line_data = line_parser('127.0.0.1 <<6113>> [16/Aug/2013:15:45:34 +0000] 1966093us "GET / HTTP/1.1" 200 3478 "https://example.com/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)" - -')
>>> from pprint import pprint
>>> pprint(log_line_data)
{'pid': '6113',
'remote_host': '127.0.0.1',
'remote_logname': '-',
'remote_user': '',
'request_first_line': 'GET / HTTP/1.1',
'request_header_referer': 'https://example.com/',
'request_header_user_agent': 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)',
'response_bytes_clf': '3478',
'status': '200',
'time_received': '[16/Aug/2013:15:45:34 +0000]',
'time_us': '1966093'}
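If your server uses the common combined log format, the same library can be pointed at a whole file. A minimal sketch, assuming a placeholder file name access.log and the combined LogFormat (the format string must match the LogFormat directive in your httpd.conf):

import apache_log_parser

line_parser = apache_log_parser.make_parser('%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"')

with open('access.log') as f:  # placeholder path
    for line in f:
        data = line_parser(line)
        print(data['remote_host'], data['status'], data['request_first_line'])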
Use a regular expression to split a line into separate "tokens":
>>> row = """172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827" """
>>> import re
>>> list(map(''.join, re.findall(r'\"(.*?)\"|\[(.*?)\]|(\S+)', row)))
['172.16.0.3', '-', '-', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '-', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827']
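The token order follows the LogFormat, so you can zip the list with your own field names to get a dict. The names below are illustrative labels chosen here, not part of any library:

import re

row = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
tokens = list(map(''.join, re.findall(r'\"(.*?)\"|\[(.*?)\]|(\S+)', row)))

# Illustrative names matching %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
fields = ['host', 'logname', 'user', 'time', 'request', 'status', 'size', 'referer', 'user_agent']
entry = dict(zip(fields, tokens))
print(entry['host'], entry['status'])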
Another solution is to use a dedicated tool, e.g. http://pypi.python.org/pypi/pylogsparser/0.4
Given how simple the format is, a regex seemed extreme and problematic to me, so I wrote this little splitter, which others may also find useful:
def apache2_logrow(s):
    ''' Fast split on Apache2 log lines
    http://httpd.apache.org/docs/trunk/logs.html
    '''
    row = []
    qe = qp = None  # quote end character (qe) and quoted parts (qp)
    for s in s.replace('\r', '').replace('\n', '').split(' '):
        if qp:
            qp.append(s)
        elif '' == s:  # blanks
            row.append('')
        elif '"' == s[0]:  # begin " quote "
            qp = [s]
            qe = '"'
        elif '[' == s[0]:  # begin [ quote ]
            qp = [s]
            qe = ']'
        else:
            row.append(s)
        l = len(s)
        if l and qe == s[-1]:  # end quote
            if l == 1 or s[-2] != '\\':  # don't end on escaped quotes
                row.append(' '.join(qp)[1:-1].replace('\\' + qe, qe))
                qp = qe = None
    return row
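For example, calling it on the sample line from the question returns one list entry per field:

line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
print(apache2_logrow(line))
# ['172.16.0.3', '-', '-', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1',
#  '401', '-', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827']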
import re

HOST = r'^(?P<host>.*?)'
SPACE = r'\s'
IDENTITY = r'\S+'
USER = r'\S+'
TIME = r'(?P<time>\[.*?\])'
REQUEST = r'\"(?P<request>.*?)\"'
STATUS = r'(?P<status>\d{3})'
SIZE = r'(?P<size>\S+)'

REGEX = HOST + SPACE + IDENTITY + SPACE + USER + SPACE + TIME + SPACE + REQUEST + SPACE + STATUS + SPACE + SIZE + SPACE

def parser(log_line):
    match = re.search(REGEX, log_line)
    return (match.group('host'),
            match.group('time'),
            match.group('request'),
            match.group('status'),
            match.group('size'))
logLine = '180.76.15.30 - - [24/Mar/2017:19:37:57 +0000] "GET /shop/page/32/?count=15&orderby=title&add_to_wishlist=4846 HTTP/1.1" 404 10202 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"'
result = parser(logLine)
print(result)
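Running this should print the five named groups as a tuple, along these lines:

('180.76.15.30', '[24/Mar/2017:19:37:57 +0000]', 'GET /shop/page/32/?count=15&orderby=title&add_to_wishlist=4846 HTTP/1.1', '404', '10202')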
Add this to httpd.conf to write the apache logs as JSON.
LogFormat "{\"time\":\"%t\", \"remoteIP\" :\"%a\", \"host\": \"%V\", \"request_id\": \"%L\", \"request\":\"%U\", \"query\" : \"%q\", \"method\":\"%m\", \"status\":\"%>s\", \"userAgent\":\"%{User-agent}i\", \"referer\":\"%{Referer}i\" }" json_log
CustomLog /var/log/apache_access_log json_log
CustomLog "|/usr/bin/python -u apacheLogHandler.py" json_log
Now you can see the access logs in JSON format. Use the Python code below to follow the continuously updated JSON log.
apacheLogHandler.py
import time

f = open('apache_access_log.log', 'r')
for line in f:  # read all lines already in the file
    print(line.strip())

# keep waiting forever for more lines.
while True:
    line = f.readline()  # just read more
    if line:  # if you got something...
        print('got data:', line.strip())
    time.sleep(1)
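Since every line the server writes is now a JSON object, each one can be decoded as it arrives. A minimal sketch of the same tail loop with json.loads, assuming the placeholder log path above; the keys come from the LogFormat directive:

import json
import time

with open('apache_access_log.log', 'r') as f:
    while True:
        line = f.readline()
        if line:
            try:
                entry = json.loads(line)
                print(entry['remoteIP'], entry['method'], entry['status'])
            except ValueError:  # skip partially written or malformed lines
                pass
        else:
            time.sleep(1)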