I have implemented recursive descent and PEG-like parsers in the past, where you could do things like this:


Path -> Segment+
Segment -> Slash Name
Segment -> /
Name -> /\w+/
Slash -> /
  • where Segment+ means "match one or more Segment"
  • and there's a plain old regular expression for matching one or more word characters with \w+

How do you typically accomplish this same sort of thing with LR grammars/parsers? All of the examples of LR parsers I have seen are very basic, such as parsing 1 + 2 * 3, or (())(), where the patterns are very simple and don't seem to involve "one or more" functionality (or zero or more with *, or optional with ?). How do you do that in an LR parser generally?


Or does LR parsing require a lexing phase first (i.e. does an LR parser require terminal and nonterminal "tokens")? I'm hoping there is a way to do LR parsing without two phases like that. The definition of an LR parser talks about "input characters" in the books/sites I've been reading, but then you casually/subtly see a line like:


The grammar's terminal symbols are the multi-character symbols or 'tokens' found in the input stream by a lexical scanner.


And it's like, wait, where did the scanner come from?


You can certainly write a scannerless grammar for a language, but in most cases it won't be LR(1), because 1 token of lookahead isn't much when the token is a single character.


Generally, LALR(1) parser generators (like bison) are used in conjunction with a scanner generator (like flex).
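
To make the two-phase idea concrete, here is a rough Python sketch of what the scanner half might look like for the question's Path grammar. The token names `SLASH` and `NAME` and the `tokenize` function are made up for illustration; a flex specification would express the same thing as pattern/action pairs, and this is not meant to mirror flex's actual matching rules.

```python
import re

# Hypothetical token kinds for the question's Path grammar.
TOKEN_SPEC = [
    ("SLASH", r"/"),
    ("NAME", r"\w+"),
]
# One alternation with a named group per token kind.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Turn a string into a list of (kind, lexeme) pairs, or raise on junk."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        # lastgroup is the name of the alternative that matched.
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

The parser then only ever sees `SLASH` and `NAME`, so one token of lookahead covers a whole name rather than a single character.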




Whenever you have a parser operating on a stream of tokens, there is always the question of what produced the stream of tokens. With most parser generators, the grammar specification and the lexical specification of tokens are kept separate, mostly because the parser generator and the lexer generator operate differently.


Adding regex operators to "the grammar" is convenient, but they do not extend the power of context-free grammars.


You have 3 choices for using regex-like operators in grammars.


1) Use them at the character level consistently across the grammar. If you do this, your parser operates with tokens being individual characters. This is likely to work badly with most parser generators, because the decision for most of them is based on the next input stream token only, in this case, a single character. To make this work you either need a backtracking parser or one that will try multiple paths through the space of alternative parses; LR and LL parsers don't do this. There are scannerless GLR parsers for which this would work, if you don't mind the additional overhead of GLR on a per-character basis. (Also, if operating at the character level, you are likely to have to explicitly specify whitespace and comment syntax.)


2) Use them as specifications of individual token character sequences (as in OP's "Name -> /\w+/"). In this form, what you end up doing is writing lexical token specifications integrated with the grammar. The grammar can then be processed into two parts: the lexical specification, and a more conventional BNF. This idea is compatible with many lexer and parser generators; it still doesn't change the expressive power.


3) Use the regex operators only on grammar elements. These are easily transformed into conventional BNF grammar rules:


  Path -> Segment +

is equivalent to:


  Path -> Segment 
  Path -> Path Segment

After such transformations the results are compatible with most parser generators. This leaves open how the lexical syntax of grammar tokens is specified.
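
To see how an LR parser copes with "one or more" once it is written as `Path -> Segment` / `Path -> Path Segment`, here is a small hand-written shift-reduce sketch in Python. It assumes the input has already been split into `(kind, lexeme)` pairs with hypothetical kinds `SLASH` and `NAME`, and that `Segment -> SLASH NAME`. The `ACTION` and `GOTO` tables were worked out by hand for this tiny grammar; a generator like bison would normally produce them for you.

```python
# Productions: rule number -> (left-hand side, number of right-hand-side symbols).
RULES = {
    1: ("Path", 1),     # Path -> Segment
    2: ("Path", 2),     # Path -> Path Segment
    3: ("Segment", 2),  # Segment -> SLASH NAME
}

# ACTION[state][token kind] is ("shift", state), ("reduce", rule) or ("accept",).
ACTION = {
    0: {"SLASH": ("shift", 3)},
    1: {"SLASH": ("shift", 3), "$": ("accept",)},
    2: {"SLASH": ("reduce", 1), "$": ("reduce", 1)},
    3: {"NAME": ("shift", 5)},
    4: {"SLASH": ("reduce", 2), "$": ("reduce", 2)},
    5: {"SLASH": ("reduce", 3), "$": ("reduce", 3)},
}

# GOTO[state][nonterminal] is the state to enter after a reduction.
GOTO = {
    0: {"Path": 1, "Segment": 2},
    1: {"Segment": 4},
}


def parse(tokens):
    """Return the list of segment names, or raise SyntaxError."""
    stream = list(tokens) + [("$", "")]  # "$" marks end of input
    states = [0]   # LR state stack
    values = []    # semantic values, one per symbol on the stack
    pos = 0
    while True:
        kind, lexeme = stream[pos]
        action = ACTION[states[-1]].get(kind)
        if action is None:
            raise SyntaxError(f"unexpected {kind} at token {pos}")
        if action[0] == "shift":
            states.append(action[1])
            values.append(lexeme)
            pos += 1
        elif action[0] == "reduce":
            rule = action[1]
            lhs, size = RULES[rule]
            popped = values[-size:]
            del values[-size:]
            del states[-size:]
            if rule == 1:
                result = [popped[0]]              # start a new list
            elif rule == 2:
                result = popped[0] + [popped[1]]  # extend the list so far
            else:
                result = popped[1]                # keep the name, drop the slash
            values.append(result)
            states.append(GOTO[states[-1]][lhs])
        else:  # accept
            return values[0]
```

The repetition lives entirely in rule 2: every time another Segment has been recognised, the parser reduces `Path Segment` back to `Path`, so the stack stays shallow no matter how long the path is. That is why left recursion is the usual way to write lists for LR parsers.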


You can implement a hybrid scheme combining 1) and 2), which appears to be what OP has done.




Regular expressions in a grammar are called "regular right parts".
They make life easier for the guy writing the grammar, but make life harder for the parser generator.


A smart parser generator, such as LRSTAR 8.0, will expand these regular right parts into extra rules automatically for you.
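
As a rough illustration of that kind of expansion (this is only a sketch, not how LRSTAR or any other real tool implements it), here is a small Python function that rewrites `X+`, `X*` and `X?` on right-hand sides into plain rules. The `expand` name, the `(lhs, [symbols])` rule format and the `_plus` / `_star` / `_opt` helper names are all assumptions made for the example, and it only handles a suffix on a single symbol, not on parenthesised groups.

```python
def expand(rules):
    """Rewrite X+, X* and X? on right-hand sides into plain BNF rules.

    `rules` is a list of (lhs, [symbols]) pairs. Returns a new list in the
    same shape that contains no suffixed symbols.
    """
    out = []
    helpers = {}  # suffixed symbol -> name of its helper nonterminal

    def helper_for(symbol):
        if symbol in helpers:
            return helpers[symbol]
        base, op = symbol[:-1], symbol[-1]
        name = base + {"+": "_plus", "*": "_star", "?": "_opt"}[op]
        helpers[symbol] = name
        if op == "+":    # one or more, left-recursive
            out.extend([(name, [base]), (name, [name, base])])
        elif op == "*":  # zero or more, left-recursive
            out.extend([(name, []), (name, [name, base])])
        else:            # optional
            out.extend([(name, []), (name, [base])])
        return name

    for lhs, rhs in rules:
        new_rhs = [
            helper_for(s) if len(s) > 1 and s[-1] in "+*?" else s
            for s in rhs
        ]
        out.append((lhs, new_rhs))
    return out
```

For example, `expand([("Path", ["Segment+"])])` yields the two `Segment_plus` rules followed by `Path -> Segment_plus`. The helper rules are written left-recursively because that is the form that suits LR parsers best.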


Yacc and Bison do not allow regular right parts (the last time I checked).


Regular right parts in grammars should have nothing to do with lexical tokens. Lexical tokens should be specified in a lexical grammar, which has its own regular expressions.


PEGs don't force you to separate the lexical symbols from the grammar symbols, which creates some serious problems, such as requiring infinite lookahead in some cases (AFAIK).


