Using pyparsing to parse filter expressions

Question

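I'm currently trying to write a parser (using pyparsing) that parses a string, which is then applied to a (pandas) DataFrame to filter the data. After a lot of trial and error with various example strings I've got it working, but I'm now having trouble extending it further.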

First, here is my current code (it should work if you just copy and paste it, at least with my Python 3.11.9 and pyparsing 3.1.2):

import pyparsing as pp

# Define the components of the grammar
field_name = pp.Word(pp.alphas + "_", pp.alphanums + "_")
action = pp.one_of("include exclude")
sub_action = pp.one_of("equals contains starts_with ends_with greater_than not_equals not_contains not_starts_with not_ends_with empty not_empty less_than less_than_or_equal_to greater_than_or_equal_to between regex in_list not_in_list")

# Custom regex pattern parser that handles regex ending at the first space
def regex_pattern():
    def parse_regex(t):
        # Join tokens to form the regex pattern
        return ''.join(t[0])
    return pp.Regex(r'[^ ]+')("regex").setParseAction(parse_regex)

# Define value as either a quoted string, a regex pattern, or a simple word with allowed characters
quoted_string = pp.QuotedString('"')
unquoted_value = pp.Word(pp.alphanums + "_-;, ") | pp.Regex(r'[^/]+')

value = pp.Optional(quoted_string | regex_pattern() | unquoted_value)("value")

slash = pp.Suppress("/")
filter_expr = pp.Group(field_name("field") + slash + action("action") + slash + sub_action("sub_action") + pp.Optional(slash + value, default=""))

# Define logical operators
and_op = pp.one_of("AND and")
or_op = pp.one_of("OR or")
not_op = pp.one_of("NOT not")

# Define the overall expression using infix notation
expression = pp.infixNotation(filter_expr,
                           [
                               (not_op, 1, pp.opAssoc.RIGHT),
                               (and_op, 2, pp.opAssoc.LEFT),
                               (or_op, 2, pp.opAssoc.LEFT)
                           ])

# List of test filters
test_filters = [
    "order_type/exclude/contains/STOP ORDER AND order_validity/exclude/contains/GOOD FOR DAY",
    "order_status/include/regex/^New$ AND order_id/include/equals/123;124;125",
    "order_id/include/equals/123;124;125",
    "order_id/include/equals/125 OR currency/include/equals/EUR",    
    "trade_amount/include/greater_than/1500 AND currency/include/equals/USD",    
    "trade_amount/include/between/1200-2000 AND currency/include/in_list/USD,EUR",
    "order_status/include/starts_with/New;Filled OR order_status/include/ends_with/ed",
    "order_status/exclude/empty AND filter_code/include/not_empty",    
    "order_status/include/regex/^New$",
    "order_status/include/regex/^New$ OR order_status/include/regex/^Changed$",
    "order_status/include/contains/New;Changed"
]

# Loop over test filters, parse each, and display the results
for test_string in test_filters:
    print(f"Testing filter: {test_string}")
    try:
        parse_result = expression.parse_string(test_string, parseAll=True).asList()[0]
        print(f"Parsed result: {parse_result}")
    except Exception as e:
        print(f"Error with filter: {test_string}")
        print(e)
    print("\n")

Now, if you run the code, you'll see that all of the test strings parse fine except the first element of the list:

"order_type/exclude/contains/STOP ORDER AND order_validity/exclude/contains/GOOD FOR DAY"

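The problem (as far as I can tell) is that the whitespace between "STOP" and "ORDER" is recognized as the end of the "value" part of that group, and then parsing breaks.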

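What I tried was using SkipTo to skip ahead to the next logical operator once the sub_action part is done, but that didn't work. I'm also not sure how well that would scale, since in theory it should be possible to have many chained expressions (e.g. part1 AND part2 OR part3), where each part consists of 3-4 elements (field_name, action, sub_action and the optional value).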

I also tried extending unquoted_value to include whitespace, but that didn't change anything either.
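A likely reason the whitespace change had no visible effect: in `quoted_string | regex_pattern() | unquoted_value`, pyparsing's `|` takes the first alternative that matches, and the `[^ ]+` regex matches before `unquoted_value` is ever tried. The sketch below isolates that behavior; the names `regex_like`, `spaced_word`, `first` and `greedy` are illustrative only and are not part of the code above. It also shows that a `Word` containing a space would, on its own, run straight through the operator.

```python
import pyparsing as pp

# Stand-ins for the two alternatives in the question's `value` expression
regex_like = pp.Regex(r"[^ ]+")                   # stops at the first space
spaced_word = pp.Word(pp.alphanums + "_-;, ")     # allows spaces inside the value

text = "STOP ORDER AND order_validity/exclude"

# `|` builds a MatchFirst: the first alternative that matches wins,
# so the space-aware Word is never reached here.
value = regex_like | spaced_word
first = value.parse_string(text)[0]
print(repr(first))

# On its own, the space-aware Word is too greedy: it consumes the
# logical operator and the next field name, up to the first "/".
greedy = spaced_word.parse_string(text)[0]
print(repr(greedy))
```

So reordering the alternatives alone would probably not be enough; the value also needs a way to stop at an operator word.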

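I also looked at some of the examples at https://github.com/pyparsing/pyparsing/tree/master/examples, but I couldn't really find anything similar to my use case. (Maybe once my code works it could be added as an example; I'm not sure how useful my case would be to others.)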

Answer 1

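Rather than defining a term that includes spaces, it is better to define a term that parses words, so that it can detect and stop at a word that should not be included (such as one of the logical operator words). I did this, and then wrapped it in a Combine that a) allows whitespace between the words (adjacent=False), and b) joins them back together with single spaces.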

I made these changes in your parser, and it looks like they will work better for you:

# I used CaselessKeywords so that you get a repeatable return value, 
#  regardless of the input
and_op = pp.CaselessKeyword("and")
or_op = pp.CaselessKeyword("or")
not_op = pp.CaselessKeyword("not")

# define an expression for any logical operator, to be used when
# defining words that unquoted_value should not include
any_operator = and_op | or_op | not_op

value_chars = pp.alphanums + "_-;,"
# an unquoted_value is one or more words that are not operators
unquoted_value = pp.Combine(pp.OneOrMore(pp.Word(value_chars), stop_on=any_operator), join_string=" ", adjacent=False)
# can also be written as
# unquoted_value = pp.Combine(pp.Word(value_chars)[1, ...:any_operator], join_string=" ", adjacent=False)

# regex_pattern() can be replaced by this regex
single_word = pp.Regex(r"\S+")
value = (quoted_string | unquoted_value | single_word)("value")
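For reference, here is a sketch of how these changes might slot into the grammar from the question, so that it can be run on its own. It assumes pyparsing 3.x and uses the snake_case names (`infix_notation`, `OpAssoc`, `parse_all`, `as_list`); the variable names `samples` and `parsed` are illustrative.

```python
import pyparsing as pp

field_name = pp.Word(pp.alphas + "_", pp.alphanums + "_")
action = pp.one_of("include exclude")
sub_action = pp.one_of(
    "equals contains starts_with ends_with greater_than not_equals "
    "not_contains not_starts_with not_ends_with empty not_empty less_than "
    "less_than_or_equal_to greater_than_or_equal_to between regex in_list "
    "not_in_list"
)

# Logical operators as caseless keywords ("AND" and "and" both return "and")
and_op = pp.CaselessKeyword("and")
or_op = pp.CaselessKeyword("or")
not_op = pp.CaselessKeyword("not")
any_operator = and_op | or_op | not_op

quoted_string = pp.QuotedString('"')
value_chars = pp.alphanums + "_-;,"
# One or more words, stopping before any operator, re-joined with single spaces
unquoted_value = pp.Combine(
    pp.OneOrMore(pp.Word(value_chars), stop_on=any_operator),
    join_string=" ",
    adjacent=False,
)
single_word = pp.Regex(r"\S+")  # fallback for regex-like values such as ^New$
value = (quoted_string | unquoted_value | single_word)("value")

slash = pp.Suppress("/")
filter_expr = pp.Group(
    field_name("field") + slash + action("action") + slash
    + sub_action("sub_action") + pp.Optional(slash + value, default="")
)

expression = pp.infix_notation(
    filter_expr,
    [
        (not_op, 1, pp.OpAssoc.RIGHT),
        (and_op, 2, pp.OpAssoc.LEFT),
        (or_op, 2, pp.OpAssoc.LEFT),
    ],
)

samples = [
    "order_type/exclude/contains/STOP ORDER AND order_validity/exclude/contains/GOOD FOR DAY",
    "order_status/include/regex/^New$ AND order_id/include/equals/123;124;125",
]
parsed = [expression.parse_string(s, parse_all=True).as_list()[0] for s in samples]
for tree in parsed:
    print(tree)
```

Because the operators are keywords, words such as "ORDER" or "FOR" that merely contain "or" are not mistaken for an operator.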

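Lastly, your test loop looks a lot like the loops I have written in many StackOverflow answers. I wrote it so many times that I eventually added a ParserElement method, run_tests, which you can call like this to replace your test loop: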

expression.run_tests(test_filters)
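In pyparsing 3.x, run_tests also returns a `(success, results)` tuple, which can be handy if you want to check the outcome in code instead of only reading the printed report. A minimal sketch, using a small stand-in grammar (`key_value`, not the grammar from the question) and made-up test strings:

```python
import pyparsing as pp

# Hypothetical stand-in grammar: name/value
key_value = pp.Word(pp.alphas + "_") + pp.Suppress("/") + pp.Word(pp.alphanums)

tests = ["currency/USD", "order_id/125", "broken//"]

# print_results=False suppresses the report; results is a list of
# (test_string, ParseResults-or-exception) pairs
success, results = key_value.run_tests(tests, print_results=False)

failed = [text for text, outcome in results if isinstance(outcome, Exception)]
print(success, failed)
```

By default run_tests parses each string with parse_all=True and treats lines starting with "#" as comments.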
