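I'm trying to parse (using parsec) a string that represents some data types I've defined, so I need to parse the string into those data types. An example of the string is,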
[(1,[(<,0),(%,4)]), (2,[(>=, 4)])]
which should parse to the following,
[(Reg 1, [Cmp (Jlt, Intv (0, 0)), Op (Mod, Intv (-4,4))]), (Reg 2, [Cmp (Jge, Intv (4,4))])]
This uses some custom data types,
newtype Reg = Reg Int deriving (Eq, Show, Ord)
data LF = Op (BinAlu, Interval) | Cmp (Jcmp, Interval) | Invalid
  deriving (Eq, Show, Ord)
data BinAlu
  = Add
  | Sub
  | Mul
  | Div
  | Or
  | And
  | Lsh
  | Rsh
  | Mod
  | Xor
  | Mov
  | Arsh
  deriving (Eq, Show, Ord, Enum)
data Jcmp = Jeq | Jgt | Jge | Jlt | Jle | Jset | Jne | Jsgt | Jsge | Jslt | Jsle
  deriving (Eq, Show, Ord, Enum)
data Interval = Bot | Intv (Int, Int)
  deriving (Eq, Show, Ord)
So I want to parse the string into the following type
[(Reg, [LF])]
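Now, I have no idea how to actually do this. I think I have an idea, but I'm finding it hard to implement.
My idea is to start with between (symbol "[") (symbol "]"), which hopefully gives me what's between [ and ]. Then I need to do something similar for the parentheses, but repeatedly. And then, of course, parse what's inside the parentheses.
I'm basically looking for any advice on how to set up this parser, and on how to structure parsers like this in general.
Any help is greatly appreciated!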
The following should get you started. We need some imports:
module TupleParser where
import Text.Parsec
import Text.Parsec.Char
import Text.Parsec.String
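To handle whitespace properly, you should first write some combinators for "lexemes", i.e., parsers that expect to start on a non-whitespace character, parse something, and discard trailing whitespace. While Parsec has some lexeme support in Text.Parsec.Token, it's overengineered and hard to use. Here's a simplified alternative based on the approach Megaparsec takes: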
-- a lexeme starts on non-whitespace, parses something,
-- and discards trailing whitespace
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces
-- a symbol is a verbatim string, treated as a lexeme
symbol :: String -> Parser String
symbol s = lexeme (string s)
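To make the lexeme convention concrete, here is a minimal sketch that runs two symbols back to back. The names openClose and demo are hypothetical helpers introduced only for this illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- a lexeme parses something and then discards trailing whitespace
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol s = lexeme (string s)

-- hypothetical example: an opening bracket followed by a closing one
openClose :: Parser (String, String)
openClose = (,) <$> symbol "[" <*> symbol "]"

-- run the parser and insist that all input is consumed
demo :: String -> Either ParseError (String, String)
demo = parse (openClose <* eof) ""
```

With these definitions, demo "[   ]  " should succeed, because each symbol eats the whitespace after it, while demo " []" should fail, because nothing has skipped the leading space before the first lexeme.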
Here are some pretty standard lexemes for parsing numbers:
-- an unsigned decimal number, treated as a lexeme
decimal :: (Read n, Integral n) => Parser n
decimal = lexeme (read <$> many1 digit)
-- combinator for signed numbers; replace "string" with
-- "symbol" if you want to allow space between dash and
-- first digit
signed :: (Num n) => Parser n -> Parser n
signed p = option id (negate <$ string "-") <*> p
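As a quick check of how signed composes with a number parser, the sketch below specializes decimal to Int and wraps it in lexeme so that trailing whitespace is dropped. The helper parseInt is a hypothetical name used only for this illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- unsigned decimal number, specialized to Int for this sketch
decimal :: Parser Int
decimal = lexeme (read <$> many1 digit)

-- optional leading dash, with no space allowed before the digits
signed :: Num n => Parser n -> Parser n
signed p = option id (negate <$ string "-") <*> p

-- hypothetical helper: parse a whole string as one signed number
parseInt :: String -> Either ParseError Int
parseInt = parse (signed decimal <* eof) ""
```

Here parseInt "-42" and parseInt "17  " should both succeed, while parseInt "- 3" should fail, since string "-" does not skip the space before the digit.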
Here are some other pretty standard lexemes/combinators:
-- some standard names
comma :: Parser String
comma = symbol ","
parens :: Parser p -> Parser p
parens = between (symbol "(") (symbol ")")
brackets :: Parser p -> Parser p
brackets = between (symbol "[") (symbol "]")
And here's a helper for lists, since you'll be using it in a few places.
-- a list is a bracket-delimited, comma-separated list
listOf :: Parser p -> Parser [p]
listOf p = brackets (p `sepBy` comma)
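A small sketch of listOf applied to plain integers may help show where whitespace is tolerated. It assumes decimal is a lexeme specialized to Int, and parseInts is a hypothetical helper name.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol s = lexeme (string s)

decimal :: Parser Int
decimal = lexeme (read <$> many1 digit)

comma :: Parser String
comma = symbol ","

brackets :: Parser p -> Parser p
brackets = between (symbol "[") (symbol "]")

-- bracket-delimited, comma-separated list
listOf :: Parser p -> Parser [p]
listOf p = brackets (p `sepBy` comma)

-- hypothetical helper: parse a whole string as a list of Ints
parseInts :: String -> Either ParseError [Int]
parseInts = parse (spaces *> listOf decimal <* eof) ""
```

Under these assumptions, parseInts "[1, 2 ,3]" should give the three numbers and parseInts "[ ]" an empty list, while a trailing comma as in "[1,]" should be rejected, because sepBy expects another element after each separator.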
Now, we should define the lowest-level "atoms" of your grammar:
-- (insert your data types here)
reg :: Parser Reg
reg = Reg <$> decimal
lf :: Parser LF
lf = parens
  $   Op  <$> ((,) <$> binalu <* comma <*> interval)
  <|> Cmp <$> ((,) <$> jcmp <* comma <*> interval)
  <|> Invalid <$ symbol "???"
-- I don't really understand your interval syntax, so
-- I'm just parsing any number "n" into "Intv (n,n)"
interval :: Parser Interval
interval = (\x -> Intv (x,x)) <$> signed decimal
For binalu and jcmp, a simple first attempt might look like this:
binalu :: Parser BinAlu
binalu
  =   Mod <$ symbol "%"
  -- etc.
jcmp :: Parser Jcmp
jcmp
  =   Jlt <$ symbol "<"
  <|> Jge <$ symbol ">="
  -- etc.
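This is enough to parse your example input. However, when you flesh these out with all the desired operators, you'll run into a problem. For example, the parser symbol "<" will happily parse the first character of "<=", and the leftover "=" will cause an error when you next try to parse the comma. If you reorder the alternatives to try "<=" first: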
jcmp :: Parser Jcmp
jcmp
  =   Jle <$ symbol "<="
  <|> Jlt <$ symbol "<"
  -- etc.
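this still isn't good enough, because symbol "<=" will happily start parsing a "<" followed by something other than "=" and then "fail after consuming input", which prevents any later alternatives from being tried. You can get around this with the try combinator: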
jcmp :: Parser Jcmp
jcmp
  =   try (Jle <$ symbol "<=")
  <|> Jlt <$ symbol "<"
  -- etc.
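The difference can be seen by running both variants side by side. This is only a sketch: Jcmp is trimmed down to two constructors, and noTry and withTry are hypothetical names for the two versions.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- trimmed-down stand-in for the real Jcmp type
data Jcmp = Jlt | Jle deriving (Eq, Show)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol s = lexeme (string s)

-- string "<=" consumes the "<" before failing, so the
-- second alternative is never tried on input like "<,0"
noTry :: Parser Jcmp
noTry = Jle <$ symbol "<=" <|> Jlt <$ symbol "<"

-- try rewinds the input when the first alternative fails
withTry :: Parser Jcmp
withTry = try (Jle <$ symbol "<=") <|> Jlt <$ symbol "<"
```

With these definitions, parse noTry "" "<,0" should return a Left, whereas parse withTry "" "<,0" should return Right Jlt. Both variants should accept "<=".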
But this is tedious. The usual solution is to define a list of "operator characters":
-- include every character that appears in one of your operators
opChars :: String
opChars = "+-*/|&<=>%^!"
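and define an operator combinator (note: Parsec calls this combinator reservedOp) that parses an operator only when it is not immediately followed by another operator character: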
operator :: String -> Parser String
operator s = lexeme $ try (string s <* notFollowedBy (oneOf opChars))
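To see the effect of notFollowedBy, the sketch below deliberately lists the shorter operator first. Jcmp is again trimmed to two constructors for brevity.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- trimmed-down stand-in for the real Jcmp type
data Jcmp = Jlt | Jle deriving (Eq, Show)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

opChars :: String
opChars = "+-*/|&<=>%^!"

-- match s exactly; backtrack if another operator character follows
operator :: String -> Parser String
operator s = lexeme $ try (string s <* notFollowedBy (oneOf opChars))

-- the "wrong" order on purpose: "<" is listed before "<="
jcmp :: Parser Jcmp
jcmp = Jlt <$ operator "<" <|> Jle <$ operator "<="
```

On "<= 4", operator "<" sees the following "=" and backs out without consuming anything, so the second alternative should produce Jle. On "<,0" the first alternative should produce Jlt, and an unknown operator such as "<<" should be rejected by both.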
Now you can list the operators in any order, and they'll work correctly:
jcmp :: Parser Jcmp
jcmp
  =   Jle <$ operator "<="
  <|> Jlt <$ operator "<"
  <|> Jgt <$ operator ">"
  <|> Jge <$ operator ">="
  -- etc.
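Since the order no longer matters, one possible way to flesh out the remaining operators is to drive the parser from a lookup table with choice. The sketch below does this for BinAlu. The operator spellings in binaluTable are assumptions (C-like guesses, not taken from your input format), Mov and Arsh are left out because their spelling is unknown, and parseOp is a hypothetical helper.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data BinAlu
  = Add | Sub | Mul | Div | Or | And
  | Lsh | Rsh | Mod | Xor | Mov | Arsh
  deriving (Eq, Show, Ord, Enum)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

opChars :: String
opChars = "+-*/|&<=>%^!"

operator :: String -> Parser String
operator s = lexeme $ try (string s <* notFollowedBy (oneOf opChars))

-- ASSUMED spellings; adjust to whatever your input really uses
binaluTable :: [(String, BinAlu)]
binaluTable =
  [ ("+", Add), ("-", Sub), ("*", Mul), ("/", Div)
  , ("|", Or), ("&", And), ("<<", Lsh), (">>", Rsh)
  , ("%", Mod), ("^", Xor)
  ]

-- one alternative per table row, tried in table order
binalu :: Parser BinAlu
binalu = choice [ op <$ operator s | (s, op) <- binaluTable ]

-- hypothetical helper: parse a whole string as one operator
parseOp :: String -> Either ParseError BinAlu
parseOp = parse (binalu <* eof) ""
```

The table keeps the spelling of each operator next to its constructor, so adding an operator is a one-line change. The same pattern would work for jcmp.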
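Finally, we can define a grammar for your higher-level structures. Note that the top-most parser should skip leading whitespace, since all the lexeme parsers expect to start on non-whitespace, and it should also check for end of input.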
type Program = [Statement]
type Statement = (Reg, [LF])
program :: Parser Program
program = spaces *> listOf statement <* eof
statement :: Parser Statement
statement = parens $ (,) <$> reg <* comma <*> listOf lf
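The role of the leading spaces and the trailing eof can be illustrated with a smaller parser. In this sketch, loose and strict are hypothetical names for a list-of-integers parser without and with the top-level wrapping.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol s = lexeme (string s)

decimal :: Parser Int
decimal = lexeme (read <$> many1 digit)

listOf :: Parser p -> Parser [p]
listOf p = between (symbol "[") (symbol "]") (p `sepBy` symbol ",")

-- no top-level wrapping: chokes on leading whitespace and
-- silently ignores anything left over after the list
loose :: Parser [Int]
loose = listOf decimal

-- skip leading whitespace and require end of input
strict :: Parser [Int]
strict = spaces *> listOf decimal <* eof
```

Under these definitions, loose should accept "[1] junk" (returning the list and ignoring the rest) and reject "  [1]", while strict should do the opposite: reject the trailing junk and accept the padded input.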
Here's a test on your proposed input:
main = parseTest program "[(1,[(<,0),(%,4)]), (2,[(>=, 4)])]"
which should produce the output:
[(Reg 1,[Cmp (Jlt,Intv (0,0)),Op (Mod,Intv (4,4))]),(Reg 2,[Cmp (Jge,Intv (4,4))])]
The full code:
module TupleParser where
import Text.Parsec
import Text.Parsec.Char
import Text.Parsec.String
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces
symbol :: String -> Parser String
symbol s = lexeme (string s)
-- characters appearing in operators
opChars :: String
opChars = "+-*/|&<=>%^!"
-- parse an operator
operator :: String -> Parser String
operator s = lexeme $ try (string s <* notFollowedBy (oneOf opChars))
decimal :: (Read n, Integral n) => Parser n
decimal = lexeme (read <$> many1 digit)
signed :: (Num n) => Parser n -> Parser n
signed p = option id (negate <$ string "-") <*> p
comma :: Parser String
comma = symbol ","
parens :: Parser p -> Parser p
parens = between (symbol "(") (symbol ")")
brackets :: Parser p -> Parser p
brackets = between (symbol "[") (symbol "]")
listOf :: Parser p -> Parser [p]
listOf p = brackets (p `sepBy` comma)
newtype Reg = Reg Int deriving (Eq, Show, Ord)
data LF = Op (BinAlu, Interval) | Cmp (Jcmp, Interval) | Invalid
  deriving (Eq, Show, Ord)
data BinAlu
  = Add
  | Sub
  | Mul
  | Div
  | Or
  | And
  | Lsh
  | Rsh
  | Mod
  | Xor
  | Mov
  | Arsh
  deriving (Eq, Show, Ord, Enum)
data Jcmp = Jeq | Jgt | Jge | Jlt | Jle | Jset | Jne | Jsgt | Jsge | Jslt | Jsle
  deriving (Eq, Show, Ord, Enum)
data Interval = Bot | Intv (Int, Int)
  deriving (Eq, Show, Ord)
reg :: Parser Reg
reg = Reg <$> decimal
lf :: Parser LF
lf = parens
  $   Op  <$> ((,) <$> binalu <* comma <*> interval)
  <|> Cmp <$> ((,) <$> jcmp <* comma <*> interval)
  <|> Invalid <$ symbol "???"
binalu :: Parser BinAlu
binalu
  =   Mod <$ operator "%"
  -- etc.
jcmp :: Parser Jcmp
jcmp
  =   Jlt <$ operator "<"
  <|> Jge <$ operator ">="
  -- etc.
-- I don't really understand your interval syntax, so
-- I'm just parsing any number "n" into "Intv (n,n)"
interval :: Parser Interval
interval = (\x -> Intv (x,x)) <$> signed decimal
type Program = [Statement]
type Statement = (Reg, [LF])
program :: Parser Program
program = spaces *> listOf statement <* eof
statement :: Parser Statement
statement = parens $ (,) <$> reg <* comma <*> listOf lf
main = parseTest program "[(1,[(<,0),(%,4)]), (2,[(>=, 4)])]"