Using pyparsing to parse filter expressions


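I am currently trying to write a parser (using pyparsing) that parses a string, which is then applied to a (pandas) DataFrame to filter the data. After a lot of trial and error with various example strings I have it working, but I am now having trouble extending it further.

First, here is my current code (it should work if you just copy and paste it, at least with Python 3.11.9 and pyparsing 3.1.2):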

import pyparsing as pp

# Define the components of the grammar
field_name = pp.Word(pp.alphas + "_", pp.alphanums + "_")
action = pp.one_of("include exclude")
sub_action = pp.one_of("equals contains starts_with ends_with greater_than not_equals not_contains not_starts_with not_ends_with empty not_empty less_than less_than_or_equal_to greater_than_or_equal_to between regex in_list not_in_list")

# Custom regex pattern parser that handles regex ending at the first space
def regex_pattern():
    def parse_regex(t):
        # Join tokens to form the regex pattern
        return ''.join(t[0])
    return pp.Regex(r'[^ ]+')("regex").setParseAction(parse_regex)

# Define value as either a quoted string, a regex pattern, or a simple word with allowed characters
quoted_string = pp.QuotedString('"')
unquoted_value = pp.Word(pp.alphanums + "_-;, ") | pp.Regex(r'[^/]+')

value = pp.Optional(quoted_string | regex_pattern() | unquoted_value)("value")

slash = pp.Suppress("/")
filter_expr = pp.Group(field_name("field") + slash + action("action") + slash + sub_action("sub_action") + pp.Optional(slash + value, default=""))

# Define logical operators
and_op = pp.one_of("AND and")
or_op = pp.one_of("OR or")
not_op = pp.one_of("NOT not")

# Define the overall expression using infix notation
expression = pp.infixNotation(filter_expr,
                           [
                               (not_op, 1, pp.opAssoc.RIGHT),
                               (and_op, 2, pp.opAssoc.LEFT),
                               (or_op, 2, pp.opAssoc.LEFT)
                           ])

# List of test filters
test_filters = [
    "order_type/exclude/contains/STOP ORDER AND order_validity/exclude/contains/GOOD FOR DAY",
    "order_status/include/regex/^New$ AND order_id/include/equals/123;124;125",
    "order_id/include/equals/123;124;125",
    "order_id/include/equals/125 OR currency/include/equals/EUR",    
    "trade_amount/include/greater_than/1500 AND currency/include/equals/USD",    
    "trade_amount/include/between/1200-2000 AND currency/include/in_list/USD,EUR",
    "order_status/include/starts_with/New;Filled OR order_status/include/ends_with/ed",
    "order_status/exclude/empty AND filter_code/include/not_empty",    
    "order_status/include/regex/^New$",
    "order_status/include/regex/^New$ OR order_status/include/regex/^Changed$",
    "order_status/include/contains/New;Changed"
]

# Loop over test filters, parse each, and display the results
for test_string in test_filters:
    print(f"Testing filter: {test_string}")
    try:
        parse_result = expression.parse_string(test_string, parseAll=True).asList()[0]
        print(f"Parsed result: {parse_result}")
    except Exception as e:
        print(f"Error with filter: {test_string}")
        print(e)
    print("\n")

If you run the code, you will see that all of the test strings parse fine except the first element of the list:

"order_type/exclude/contains/STOP ORDER AND order_validity/exclude/contains/GOOD FOR DAY"

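The problem (as far as I can tell) is that the whitespace between "STOP" and "ORDER" is treated as the end of the "value" part of that group, and parsing breaks at that point.

What I tried was using SkipTo to skip ahead to the next logical operator once the sub_action part is done, but that did not work. I am also not sure how well it would scale, since in theory many chained expressions should be possible (for example, part1 AND part2 OR part3), where each part consists of 3-4 elements (field_name, action, sub_action, and an optional value).

I also tried extending unquoted_value to include spaces, but that did not change anything either.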
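To make the two observations above concrete, here is a minimal sketch that isolates the two value alternatives from the code in the question. It assumes pyparsing 3.x; the names `first_word` and `greedy` are illustrative only. The regex alternative is tried before unquoted_value and stops at the first space, which is why adding a space to unquoted_value changes nothing. Tried on its own, a Word that allows spaces runs straight through the operator.

```python
import pyparsing as pp

text = "STOP ORDER AND order_validity/exclude/contains/GOOD FOR DAY"

# Same pattern as regex_pattern() in the question: it ends at the first space,
# and it is tried before unquoted_value, so it always wins for unquoted input.
first_word = pp.Regex(r"[^ ]+")
first_word_tokens = first_word.parse_string(text).as_list()
print(first_word_tokens)   # ['STOP']

# A Word whose character set contains a space is greedy instead: it does not
# know about operators, so it swallows "AND" and the next field name.
greedy = pp.Word(pp.alphanums + "_-;, ")
greedy_tokens = greedy.parse_string(text).as_list()
print(greedy_tokens)       # ['STOP ORDER AND order_validity']
```

Neither alternative can tell where the value ends and the logical operator begins, so the value expression has to be made aware of the operator words.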

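I also looked at some of the examples at https://github.com/pyparsing/pyparsing/tree/master/examples, but I could not really find anything similar to my use case. (Perhaps once my code works it could be added as an example; I am not sure how useful my case would be to others.)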

python pyparsing
1 Answer

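Rather than defining a term that includes spaces, define a term that parses individual words, so that it can detect and stop at a word that should not be included (such as one of the logical operator words). I did this and then wrapped it in a Combine, which a) allows whitespace between the words (adjacent=False) and b) joins them back together with a single space.

I made these changes in your parser, and they look like they should work better for you: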

# I used CaselessKeywords so that you get a repeatable return value, 
#  regardless of the input
and_op = pp.CaselessKeyword("and")
or_op = pp.CaselessKeyword("or")
not_op = pp.CaselessKeyword("not")

# define an expression for any logical operator, to be used when
# defining words that unquoted_value should not include
any_operator = and_op | or_op | not_op

value_chars = pp.alphanums + "_-;,"
# an unquoted_value is one or more words that are not operators
unquoted_value = pp.Combine(pp.OneOrMore(pp.Word(value_chars), stop_on=any_operator), join_string=" ", adjacent=False)
# can also be written as
# unquoted_value = pp.Combine(pp.Word(value_chars)[1, ...:any_operator], join_string=" ", adjacent=False)

# regex_pattern() can be replaced by this regex
single_word = pp.Regex(r"\S+")
value = (quoted_string | unquoted_value | single_word)("value")
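
For reference, here is a sketch of how these pieces might fit together with the rest of the grammar from the question. It assumes pyparsing 3.x and uses the snake_case names (`infix_notation`, `parse_all`, `as_list`) in place of the legacy camelCase ones. The underscore in `value_chars` is an assumption carried over from the question's original character set.

```python
import pyparsing as pp

field_name = pp.Word(pp.alphas + "_", pp.alphanums + "_")
action = pp.one_of("include exclude")
sub_action = pp.one_of(
    "equals contains starts_with ends_with greater_than not_equals "
    "not_contains not_starts_with not_ends_with empty not_empty less_than "
    "less_than_or_equal_to greater_than_or_equal_to between regex in_list "
    "not_in_list"
)

# Caseless keywords always return the defining string ("and", "or", "not")
and_op = pp.CaselessKeyword("and")
or_op = pp.CaselessKeyword("or")
not_op = pp.CaselessKeyword("not")
any_operator = and_op | or_op | not_op

# Assumption: underscore included, as in the question's unquoted_value
value_chars = pp.alphanums + "_-;,"
# One or more words, stopping before any operator keyword, rejoined with " "
unquoted_value = pp.Combine(
    pp.OneOrMore(pp.Word(value_chars), stop_on=any_operator),
    join_string=" ",
    adjacent=False,
)
quoted_string = pp.QuotedString('"')
single_word = pp.Regex(r"\S+")  # fallback, e.g. for regex values like ^New$
value = (quoted_string | unquoted_value | single_word)("value")

slash = pp.Suppress("/")
filter_expr = pp.Group(
    field_name("field") + slash + action("action") + slash
    + sub_action("sub_action") + pp.Optional(slash + value, default="")
)

expression = pp.infix_notation(
    filter_expr,
    [
        (not_op, 1, pp.OpAssoc.RIGHT),
        (and_op, 2, pp.OpAssoc.LEFT),
        (or_op, 2, pp.OpAssoc.LEFT),
    ],
)

multi_word = expression.parse_string(
    "order_type/exclude/contains/STOP ORDER AND "
    "order_validity/exclude/contains/GOOD FOR DAY",
    parse_all=True,
).as_list()[0]
print(multi_word)

regex_case = expression.parse_string(
    "order_status/include/regex/^New$ AND order_id/include/equals/123;124;125",
    parse_all=True,
).as_list()[0]
print(regex_case)
```

Because the operators are keywords, words such as "ORDER" and "FOR" are not mistaken for "or" even though they contain those letters. An unquoted value that is itself an operator word would still end the value early, so such values would need to be quoted.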

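Finally, your test loop looks a lot like the ones I have written in many StackOverflow answers. I wrote it so many times that I eventually added a ParserElement method, run_tests, which you can call like this to replace your test loop: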

expression.run_tests(test_filters)
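
As a small illustration of the call pattern, the sketch below runs run_tests on a deliberately tiny, hypothetical grammar (`key_value` is not part of the parser above). It assumes pyparsing 3.x, where run_tests returns a `(success, results)` tuple and prints a report for each test string by default.

```python
import pyparsing as pp

# Hypothetical miniature grammar, used only to show how run_tests is called
key_value = pp.Word(pp.alphas + "_") + pp.Suppress("/") + pp.Word(pp.alphanums)

# Prints each test string followed by its parsed tokens;
# success is True only if every test string parsed.
success, results = key_value.run_tests(["currency/EUR", "order_id/125"])

# results is a list of (test_string, ParseResults-or-exception) pairs
first_tokens = results[0][1].as_list()

# With failure_tests=True the logic is inverted: success means every
# test string failed to parse, which is handy for negative tests.
all_failed, _ = key_value.run_tests(
    ["currency/"], failure_tests=True, print_results=False
)
```

Checking the returned `success` flag makes it easy to drop the same call into an automated test rather than reading the printed report by eye.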