Regex: match any character zero or more times, except digits "touching" the match groups

Question (votes: 3, answers: 1)

I am struggling to write a regex (in Python 3.6) that I can use to parse a datetime string according to the following rules:

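  • The date always has the form YYYYMMDD, where YYYY is any year from 2000 to 2099 (inclusive), so it becomes 20yyMMDD
  • The time always has the form HHMMSS
  • The date always comes before the time, as in YYYYMMDDHHMMSS
  • The date and time may be separated by nothing or by any non-digit character:
      YYYYMMDDHHMMSS - OK
      YYYYMMDD-HHMMSS - OK
      YYYYMMDD HHMMSS - OK
      YYYYMMDD1HHMMSS - not accepted
  • Any characters may appear in front or at the end, except that the characters "touching" the datetime string must be non-digits:
      (YYYYMMDDHHMMSS) - OK
      123-YYYYMMDDHHMMSS)123 - OK
      abc1YYYYMMDDHHMMSS - not accepted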

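I know the basics of regular expressions and have read many SO answers ("Regex: match everything but" and "Regex, every non-alphanumeric character except white space or colon" were very useful, among others), but I cannot work out a regex that passes all my test cases.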

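I need two groups for the actual date and time parsing, i.e. (20[\d]{6})([\d]{6}). I then added support for extra characters, .*(20[\d]{6})[^\d]?([\d]{6}).*, which works fine until there is a digit in front, at the end, or in the middle: such strings should not match, but they do. So I started adding different things at the front or back, e.g. (?<![\d]), .*[^\d]?, [^\d]?.*, ... but unfortunately my regex knowledge ran out quickly, and the pattern turned into a mess that I neither understand nor can get to work.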
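The false positive is easy to reproduce. A minimal sketch, using the second pattern from the paragraph above and two 15-digit strings from the test list (the variable names are only illustrative):

```python
import re

# The second attempt, copied verbatim from the paragraph above.
loose = re.compile(r".*(20[\d]{6})[^\d]?([\d]{6}).*")

# Both strings contain 15 consecutive digits, so the rules say "reject".
leading = loose.match("020171217091011")   # extra digit in front
trailing = loose.match("201712170910110")  # extra digit at the end

# The surrounding .* simply swallows the extra digit, so both still match.
print(leading.groups())   # ('20171217', '091011')
print(trailing.groups())  # ('20171217', '091011')
```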

I put together some test strings (each with its expected result) and a simple test function:

import datetime
import re
from typing import List, Optional, Tuple

#my_regex = r"(?<![\d])(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"
my_regex = r"\b(20[\d]{6})[^\d]?([\d]{6})[^\d]?.*"

dt = datetime.datetime(2017, 12, 17, 9, 10, 11)

# Each entry pairs an input string with the expected result (None = no match).
tests: List[Tuple[str, Optional[datetime.datetime]]] = [
    # Clean one.
    ("20171217091011", dt),
    # Character in between.
    ("20171217a091011", dt),
    ("20171217b091011", dt),
    ("20171217-091011", dt),
    ("20171217_091011", dt),
    ("20171217 091011", dt),
    ("201712170091011", None),  # Before/in between/at the end in this case.
    # Characters in front.
    ("a20171217091011", dt),
    ("b20171217091011", dt),
    (" 20171217091011", dt),
    ("-20171217091011", dt),
    ("_20171217091011", dt),
    ("020171217091011", None),
    ("aa20171217091011", dt),
    ("a1-20171217091011", dt),
    ("123_20171217091011", dt),
    ("123 20171217091011", dt),
    ("123=20171217091011", dt),
    ("201720171217091011", None),
    # Characters at the end.
    ("20171217091011a", dt),
    ("20171217091011b", dt),
    ("20171217091011 ", dt),
    ("20171217091011-", dt),
    ("20171217091011_", dt),
    ("201712170910110", None),
    ("20171217091011aa", dt),
    ("20171217091011a1", dt),
    ("20171217091011-a1", dt),
    ("20171217091011-123", dt),
    ("20171217091011_123", dt),
    ("20171217091011 123", dt),
    ("20171217091011?123", dt),
    # Characters at both ends.
    ("a20171217091011a", dt),
    ("(20171217091011)", dt),
    ("a-20171217091011 b", dt),
    ("123(20171217091011)456", dt),
    (" 20171217091011 ", dt),
    ("2017 20171217091011 2017", dt),
    ("20171218-20171217091011-070809", dt),
    # Characters at both ends and in the middle.
    ("123(20171217-091011)456", dt),
    ("a2017(20171217 091011)b", dt),
    ("2017xx(20171217?091011)cc2017", dt),
    ("2017xx(201712170091011)cc2017", None),
    ("2017xx(201712170091011", None),
    # Other cases.
    ("20171217091011 20171116080910", dt),  # Match first.
    ("A-20171116-080910-20171217091011", datetime.datetime(2017, 11, 16, 8, 9, 10)),  # Match first.
]

for test_str, test_time in tests:
    match = re.match(my_regex, test_str)
    time = None
    if match:
        try:
            time = datetime.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
        except ValueError:
            pass
    if time != test_time:
        print("{: <32s} = {} instead of {}".format(test_str, time, test_time))

But I just cannot get all the test strings to pass, for example:

a20171217091011                  = None instead of 2017-12-17 09:10:11
b20171217091011                  = None instead of 2017-12-17 09:10:11
 20171217091011                  = None instead of 2017-12-17 09:10:11
-20171217091011                  = None instead of 2017-12-17 09:10:11
_20171217091011                  = None instead of 2017-12-17 09:10:11
aa20171217091011                 = None instead of 2017-12-17 09:10:11
a1-20171217091011                = None instead of 2017-12-17 09:10:11
123_20171217091011               = None instead of 2017-12-17 09:10:11
123 20171217091011               = None instead of 2017-12-17 09:10:11
123=20171217091011               = None instead of 2017-12-17 09:10:11
201712170910110                  = 2017-12-17 09:10:11 instead of None
a20171217091011a                 = None instead of 2017-12-17 09:10:11
(20171217091011)                 = None instead of 2017-12-17 09:10:11
a-20171217091011 b               = None instead of 2017-12-17 09:10:11
123(20171217091011)456           = None instead of 2017-12-17 09:10:11
 20171217091011                  = None instead of 2017-12-17 09:10:11
2017 20171217091011 2017         = None instead of 2017-12-17 09:10:11
20171218-20171217091011-070809   = 2017-12-18 20:17:12 instead of 2017-12-17 09:10:11
123(20171217-091011)456          = None instead of 2017-12-17 09:10:11
a2017(20171217 091011)b          = None instead of 2017-12-17 09:10:11
2017xx(20171217?091011)cc2017    = None instead of 2017-12-17 09:10:11
A-20171116-080910-20171217091011 = None instead of 2017-11-16 08:09:10

Thanks for any ideas.

python regex
1 answer
2 votes

It seems you need to check the general pattern with a regex, while validating the actual datetime values with the appropriate Python methods.
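To illustrate why the second step matters, here is a minimal sketch of the validation part on its own. The helper name parse_digits is hypothetical; the format string is the one from the question's test loop.

```python
import datetime

FORMAT = "%Y%m%d%H%M%S"  # same format string as in the question's test loop


def parse_digits(digits):
    """Return a datetime for a 14-digit string, or None if strptime rejects it."""
    try:
        return datetime.datetime.strptime(digits, FORMAT)
    except ValueError:
        # Raised when the digits have the right shape but are not a real date/time.
        return None


print(parse_digits("20171217091011"))  # 2017-12-17 09:10:11
print(parse_digits("20170230091011"))  # None (there is no February 30)
```

A string such as 20170230091011 has the right shape for any digit-counting regex, so only the strptime call can reject it.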

So you can fix your code with the following regex:

r'(?<!\d)20\d{6}\D?\d{6}(?!\d)'


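Details

  • (?<!\d) - a negative lookbehind that fails the match if there is a digit immediately to the left of the current position
  • 20 - the literal substring 20
  • \d{6} - any 6 digits
  • \D? - one or zero non-digit characters
  • \d{6} - any 6 digits
  • (?!\d) - a negative lookahead that fails the match if there is a digit immediately to the right of the current position.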
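
The pattern above can be dropped into the question's test loop roughly as follows. This is a sketch under two assumptions that are not part of the regex as written: capturing groups are added around the date and time parts (the loop joins match.groups()), and re.search is used instead of re.match. The helper name find_datetime and the sample list are illustrative only.

```python
import datetime
import re

# Assumption: capturing groups added so the date and time parts can be joined.
pattern = re.compile(r"(?<!\d)(20\d{6})\D?(\d{6})(?!\d)")


def find_datetime(text):
    """Return the datetime for the first pattern match in text, or None."""
    # re.search scans the whole string; re.match only tries position 0.
    match = pattern.search(text)
    if match is None:
        return None
    try:
        return datetime.datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
    except ValueError:
        # Right shape, but not a real calendar date or time.
        return None


dt = datetime.datetime(2017, 12, 17, 9, 10, 11)
samples = [
    ("123(20171217-091011)456", dt),
    ("a20171217091011", dt),
    ("020171217091011", None),
    ("201712170910110", None),
    ("20171218-20171217091011-070809", dt),
    ("A-20171116-080910-20171217091011", datetime.datetime(2017, 11, 16, 8, 9, 10)),
]
for text, expected in samples:
    result = find_datetime(text)
    status = "ok" if result == expected else "MISMATCH"
    print("{: <34s} {} {}".format(text, status, result))
```

Two things differ from the question's loop. re.match only tries to match at the start of the string, which is why every test with leading characters returned None. And \b treats letters, digits, and _ all as word characters, so there is no boundary between a and 2 in a20171217091011; the lookarounds test specifically for digits instead.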