Author: cssic_630 | Source: Internet | 2023-05-18 12:01
I have implemented recursive descent and PEG-like parsers in the past, where you could do things like this:
Path -> Segment+
Segment -> Slash Name
Segment -> /
Name -> /\w+/
Slash -> /
- where Segment+ means "match one or more Segment"
- and there's a plain old regular expression for matching one or more word characters with \w+
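A minimal Python sketch of the kind of recursive-descent parser the question describes, for this exact grammar (the function name and error handling are illustrative, and the bare Segment -> / alternative is read as allowing a slash with no following Name):

```python
import re

def parse_path(text):
    # Recursive-descent sketch of the grammar above:
    #   Path -> Segment+    Segment -> Slash Name | Slash
    #   Name -> /\w+/       Slash -> /
    pos = 0
    segments = []
    while pos < len(text) and text[pos] == "/":  # Slash
        pos += 1
        m = re.match(r"\w+", text[pos:])         # Name, if present
        if m:                                    # Segment -> Slash Name
            segments.append(m.group())
            pos += m.end()
        # otherwise this was the bare Segment -> / alternative
    if pos != len(text) or not segments:
        raise SyntaxError("not a Path: %r" % text)
    return segments
```

Here `parse_path("/usr/local/bin")` returns `["usr", "local", "bin"]`; the `while` loop is the direct procedural reading of `Segment+`.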
How do you typically accomplish this same sort of thing with LR grammars/parsers? All of the examples of LR parsers I have seen are very basic, such as parsing 1 + 2 * 3, or (())(), where the patterns are very simple and don't seem to involve "one or more" functionality (or "zero or more" with *, or "optional" with ?). How do you do that in an LR parser generally?
Or does LR parsing require a lexing phase first (i.e. does an LR parser require terminal and nonterminal "tokens")? I'm hoping there is a way to do LR parsing without two phases like that. The definition of an LR parser talks about "input characters" in the books/sites I've been reading, but then you casually/subtly see a line like:
The grammar's terminal symbols are the multi-character symbols or 'tokens' found in the input stream by a lexical scanner.
And it's like, what? Where did the scanner come from?
3 Solutions
Whenever you have a parser operating on a stream of tokens, there is always the question of what produced that stream. With most parser generators, the grammar specification and the lexical specification of tokens are kept separate, mostly because the parser generator and the lexer generator operate in different ways.
Adding regex operators to "the grammar" is convenient, but they do not extend the power of context-free grammars.
You have three choices for using regex-like operators in grammars.
1) Use them at the character level consistently across the grammar. If you do this, your parser operates with individual characters as tokens. This is likely to work badly with most parser generators, because most of them base their decisions on the next input token only; in this case, a single character. To make this work you either need a backtracking parser or one that will try multiple paths through the space of alternative parses; LR and LL parsers don't do this. There are scannerless GLR parsers for which this would work, if you don't mind the additional overhead of GLR on a per-character basis. (Also, if operating at the character level, you are likely to have to explicitly specify whitespace and comment syntax.)
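To make the cost of the character-level option concrete, here is a sketch of what pushing a single token rule like Name -> /\w+/ down to character-level productions entails (the function and nonterminal names are illustrative; only ASCII word characters are enumerated):

```python
import string

def char_level_rules(token, chars):
    # A scannerless grammar cannot say "Name -> /\w+/"; it needs
    # left-recursive repetition plus one production per allowed character:
    #   Name -> NameChar | Name NameChar
    #   NameChar -> 'a' | 'b' | ... | '_'
    helper = token + "Char"
    rules = [(token, [helper]), (token, [token, helper])]
    rules += [(helper, [c]) for c in chars]
    return rules

word_chars = string.ascii_letters + string.digits + "_"
rules = char_level_rules("Name", word_chars)  # 65 productions for one token
```

One token rule explodes into 65 productions, and the parser must make a decision (or, for GLR, fork) at every single character, which is the per-character overhead mentioned above.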
2) Use them as specifications of individual token character sequences (as in OP's "Name -> /\w+/"). In this form, what you end up doing is writing a lexical token specification integrated with the grammar. The grammar can then be processed into two parts: the lexical specification, and a more conventional BNF. This idea is compatible with many lexer and parser generators; it still doesn't change the expressive power.
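For example, a sketch of such a separated lexical phase for the Path grammar, in Python (the token names and the table are illustrative, not any particular generator's syntax):

```python
import re

# Hypothetical token patterns split out of the grammar; in a real
# lexer/parser-generator pair these would live in the lexer specification.
TOKEN_SPEC = [
    ("SLASH", r"/"),
    ("NAME",  r"\w+"),
]

def tokenize(text):
    # Turn raw input into (kind, value) tokens for the parser to consume.
    pos = 0
    tokens = []
    while pos < len(text):
        for kind, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                tokens.append((kind, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError("unexpected character %r" % text[pos])
    return tokens
```

With this split, `tokenize("/a/b")` yields `[("SLASH", "/"), ("NAME", "a"), ("SLASH", "/"), ("NAME", "b")]`, and the LR grammar proper only ever sees SLASH and NAME terminals, never individual characters.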
3) Use the regex operators only on grammar elements. These are easily transformed into conventional BNF grammar rules:
Path -> Segment+
is equivalent to:
Path -> Segment
Path -> Path Segment
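The same mechanical expansion works for * and ?. A sketch of the three transformations as Python helpers (the function names are illustrative; rules are (lhs, rhs) pairs, with [] as an empty right-hand side):

```python
def expand_plus(lhs, sym):
    # X -> sym+  becomes the left-recursive pair  X -> sym | X sym.
    # Left recursion is the natural form for LR: the list reduces as it
    # is read, so the parse stack stays shallow.
    return [(lhs, [sym]), (lhs, [lhs, sym])]

def expand_star(lhs, sym):
    # X -> sym*  becomes  X -> <empty> | X sym
    return [(lhs, []), (lhs, [lhs, sym])]

def expand_opt(lhs, sym):
    # X -> sym?  becomes  X -> <empty> | sym
    return [(lhs, []), (lhs, [sym])]
```

So `expand_plus("Path", "Segment")` produces exactly the two rules written out above.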
After such transformations the results are compatible with most parser generators. This leaves open how the lexical syntax of grammar tokens is specified.
You can implement a hybrid scheme combining 1) and 2), which appears to be what OP has done.