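I am currently trying to write a parser (using pyparsing) that parses a string which is then applied to a (pandas) dataframe to filter the data. After a lot of trial and error with various example strings I have got it working, but I am now having trouble extending it further.
First, here is my current code (it should work if you just copy and paste it, at least with my Python 3.11.9 and pyparsing 3.1.2):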
import pyparsing as pp
# Define the components of the grammar
field_name = pp.Word(pp.alphas + "_", pp.alphanums + "_")
action = pp.one_of("include exclude")
sub_action = pp.one_of("equals contains starts_with ends_with greater_than not_equals not_contains not_starts_with not_ends_with empty not_empty less_than less_than_or_equal_to greater_than_or_equal_to between regex in_list not_in_list")
# Custom regex pattern parser that handles regex ending at the first space
def regex_pattern():
    def parse_regex(t):
        # Join tokens to form the regex pattern
        return ''.join(t[0])
    return pp.Regex(r'[^ ]+')("regex").setParseAction(parse_regex)
# Define value as either a quoted string, a regex pattern, or a simple word with allowed characters
quoted_string = pp.QuotedString('"')
unquoted_value = pp.Word(pp.alphanums + "_-;, ") | pp.Regex(r'[^/]+')
value = pp.Optional(quoted_string | regex_pattern() | unquoted_value)("value")
slash = pp.Suppress("/")
filter_expr = pp.Group(field_name("field") + slash + action("action") + slash + sub_action("sub_action") + pp.Optional(slash + value, default=""))
# Define logical operators
and_op = pp.one_of("AND and")
or_op = pp.one_of("OR or")
not_op = pp.one_of("NOT not")
# Define the overall expression using infix notation
expression = pp.infixNotation(filter_expr,
    [
        (not_op, 1, pp.opAssoc.RIGHT),
        (and_op, 2, pp.opAssoc.LEFT),
        (or_op, 2, pp.opAssoc.LEFT)
    ])
# List of test filters
test_filters = [
    "order_type/exclude/contains/STOP ORDER AND order_validity/exclude/contains/GOOD FOR DAY",
    "order_status/include/regex/^New$ AND order_id/include/equals/123;124;125",
    "order_id/include/equals/123;124;125",
    "order_id/include/equals/125 OR currency/include/equals/EUR",
    "trade_amount/include/greater_than/1500 AND currency/include/equals/USD",
    "trade_amount/include/between/1200-2000 AND currency/include/in_list/USD,EUR",
    "order_status/include/starts_with/New;Filled OR order_status/include/ends_with/ed",
    "order_status/exclude/empty AND filter_code/include/not_empty",
    "order_status/include/regex/^New$",
    "order_status/include/regex/^New$ OR order_status/include/regex/^Changed$",
    "order_status/include/contains/New;Changed"
]
# Loop over test filters, parse each, and display the results
for test_string in test_filters:
    print(f"Testing filter: {test_string}")
    try:
        parse_result = expression.parse_string(test_string, parseAll=True).asList()[0]
        print(f"Parsed result: {parse_result}")
    except Exception as e:
        print(f"Error with filter: {test_string}")
        print(e)
    print("\n")
Now, if you run the code, you will see that all the test strings parse fine except the first element of the list:
"order_type/exclude/contains/STOP ORDER AND order_validity/exclude/contains/GOOD FOR DAY"
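The problem (as far as I can tell) is that the whitespace between "STOP" and "ORDER" is taken as the end of the "value" part of that group, and parsing then breaks.
What I tried was using SkipTo to skip ahead to the next logical operator once the sub_action part is done, but that did not work. I am also not sure how well it would scale, since in theory it should be possible to have many chained expressions (e.g. part1 AND part2 OR part3), where each part consists of 3-4 elements (field_name, action, sub_action and an optional value).
I also tried extending unquoted_value to include whitespace, but that did not change anything either.
I also looked at some of the examples at https://github.com/pyparsing/pyparsing/tree/master/examples, but I could not really find anything similar to my use case. (Maybe once my code works it could be added as an example; I am not sure how useful my case would be to others.)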
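Rather than defining a term that includes spaces, it is better to define a term that parses individual words, so that it can detect and stop at a word that should not be included (such as one of the logical operator words). I did this, and then wrapped it in a Combine that a) allows whitespace between the words (adjacent=False), and b) joins them back together with single spaces.
I made these changes in your parser, and they look like they will work better for you: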
# I used CaselessKeywords so that you get a repeatable return value,
# regardless of the input
and_op = pp.CaselessKeyword("and")
or_op = pp.CaselessKeyword("or")
not_op = pp.CaselessKeyword("not")
# define an expression for any logical operator, to be used when
# defining words that unquoted_value should not include
any_operator = and_op | or_op | not_op
value_chars = pp.alphanums + "_-;,"
# an unquoted_value is one or more words that are not operators
unquoted_value = pp.Combine(pp.OneOrMore(pp.Word(value_chars), stop_on=any_operator), join_string=" ", adjacent=False)
# can also be written as
# unquoted_value = pp.Combine(pp.Word(value_chars)[1, ...:any_operator], join_string=" ", adjacent=False)
# regex_pattern() can be replaced by this regex
single_word = pp.Regex(r"\S+")
value = (quoted_string | unquoted_value | single_word)("value")
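To make the difference concrete, here is a small sketch comparing the two approaches on a value that contains spaces. It assumes pyparsing 3.x; the name old_value and the sample input strings are made up for illustration, and value_chars includes the underscore on the assumption that it should mirror the character set used in the question.

```python
import pyparsing as pp

# Roughly the question's value alternatives: the regex comes first in the
# MatchFirst, so it wins and ends the value at the first space.
old_value = pp.Regex(r"[^ ]+") | pp.Word(pp.alphanums + "_-;, ")
old_tokens = old_value.parse_string("STOP ORDER AND x").as_list()
print(old_tokens)

# Keyword-aware version: read word by word, stop before any operator keyword.
and_op = pp.CaselessKeyword("and")
or_op = pp.CaselessKeyword("or")
not_op = pp.CaselessKeyword("not")
any_operator = and_op | or_op | not_op

value_chars = pp.alphanums + "_-;,"  # assumption: underscore is allowed
unquoted_value = pp.Combine(
    pp.OneOrMore(pp.Word(value_chars), stop_on=any_operator),
    join_string=" ",
    adjacent=False,
)

# "ORDER" is not mistaken for the operator "or", because a keyword only
# matches a whole word.
new_tokens = unquoted_value.parse_string("STOP ORDER AND order_validity").as_list()
print(new_tokens)
```

The first print shows only the word STOP, while the second shows the whole value "STOP ORDER" as a single token, with parsing stopping just before AND.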
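Finally, your test loop looks a lot like the loops I have written in many StackOverflow answers. I wrote it so many times that I eventually added a ParserElement method, run_tests, which you can call like this to replace your test loop: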
expression.run_tests(test_filters)
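For reference, a minimal sketch of the whole grammar with these changes folded in might look like the following. It assumes pyparsing 3.x and uses the newer snake_case names (infix_notation, OpAssoc, parse_all) rather than the legacy spellings in the question; the underscore in value_chars and the shortened test list are assumptions made for illustration.

```python
import pyparsing as pp

field_name = pp.Word(pp.alphas + "_", pp.alphanums + "_")
action = pp.one_of("include exclude")
sub_action = pp.one_of(
    "equals contains starts_with ends_with greater_than not_equals "
    "not_contains not_starts_with not_ends_with empty not_empty less_than "
    "less_than_or_equal_to greater_than_or_equal_to between regex in_list "
    "not_in_list"
)

# Logical operators as keywords, so they only match whole words
and_op = pp.CaselessKeyword("and")
or_op = pp.CaselessKeyword("or")
not_op = pp.CaselessKeyword("not")
any_operator = and_op | or_op | not_op

# A value is a quoted string, a run of plain words that stops before an
# operator, or (as a fallback, e.g. for regexes) any run of non-space chars
quoted_string = pp.QuotedString('"')
value_chars = pp.alphanums + "_-;,"  # assumption: underscore is allowed
unquoted_value = pp.Combine(
    pp.OneOrMore(pp.Word(value_chars), stop_on=any_operator),
    join_string=" ",
    adjacent=False,
)
single_word = pp.Regex(r"\S+")
value = (quoted_string | unquoted_value | single_word)("value")

slash = pp.Suppress("/")
filter_expr = pp.Group(
    field_name("field")
    + slash + action("action")
    + slash + sub_action("sub_action")
    + pp.Optional(slash + value, default="")
)

expression = pp.infix_notation(
    filter_expr,
    [
        (not_op, 1, pp.OpAssoc.RIGHT),
        (and_op, 2, pp.OpAssoc.LEFT),
        (or_op, 2, pp.OpAssoc.LEFT),
    ],
)

failing_before = (
    "order_type/exclude/contains/STOP ORDER AND "
    "order_validity/exclude/contains/GOOD FOR DAY"
)
parsed = expression.parse_string(failing_before, parse_all=True).as_list()[0]
print(parsed)

# run_tests returns a (success, results) tuple in addition to printing
ok, _ = expression.run_tests([
    failing_before,
    "order_status/include/regex/^New$ OR order_status/include/regex/^Changed$",
    "order_status/exclude/empty AND filter_code/include/not_empty",
])
```

Note that CaselessKeyword returns the operator as it was defined, so the parsed result contains "and" even though the input used "AND".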