Author: cssic_630 | Source: Internet | 2023-05-18 12:01
I have implemented recursive descent and PEG-like parsers in the past, where you could do things like this:
Path -> Segment+
Segment -> Slash Name
Segment -> /
Name -> /\w+/
Slash -> /
- where Segment+ means "match one or more Segment"
- and there's a plain old regular expression for matching one or more word characters with \w+
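A minimal Python sketch of the kind of recursive-descent parser the question describes, for this exact grammar (the function name and error handling are illustrative, and the bare Segment -> / alternative is read as allowing a slash with no following Name):

```python
import re

def parse_path(text):
    # Recursive-descent sketch of the grammar above:
    #   Path -> Segment+    Segment -> Slash Name | Slash
    #   Name -> /\w+/       Slash -> /
    pos = 0
    segments = []
    while pos < len(text) and text[pos] == "/":  # Slash
        pos += 1
        m = re.match(r"\w+", text[pos:])         # Name, if present
        if m:                                    # Segment -> Slash Name
            segments.append(m.group())
            pos += m.end()
        # otherwise this was the bare Segment -> / alternative
    if pos != len(text) or not segments:
        raise SyntaxError("not a Path: %r" % text)
    return segments
```

Here `parse_path("/usr/local/bin")` returns `["usr", "local", "bin"]`; the `while` loop is the direct procedural reading of `Segment+`.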
How do you typically accomplish this same sort of thing with LR grammars/parsers? All of the examples of LR parsers I have seen are very basic, such as parsing 1 + 2 * 3, or (())(), where the patterns are very simple and don't seem to involve "one or more" functionality (or "zero or more" with *, or "optional" with ?). How do you do that in an LR parser generally?
Or does LR parsing require a lexing phase first (i.e. does an LR parser require terminal and nonterminal "tokens")? I'm hoping there is a way to do LR parsing without two phases like that. The definition of an LR parser talks about "input characters" in the books/sites I've been reading, but then you casually/subtly see a line like:
The grammar's terminal symbols are the multi-character symbols or 'tokens' found in the input stream by a lexical scanner.
And it's like, what? Where did the scanner come from?
3 Solutions
Whenever you have a parser operating on a stream of tokens, there is always the question of what produced that stream. With most parser generators, the grammar specification and the lexical specification of tokens are kept separate, mostly because the parser generator and the lexer generator operate in different ways.
Adding regex operators to "the grammar" is convenient, but they do not extend the power of context-free grammars.
You have three choices for using regex-like operators in grammars.
1) Use them at the character level consistently across the grammar. If you do this, your parser operates with individual characters as tokens. This is likely to work badly with most parser generators, because most of them base their decisions on the next input token only; in this case, a single character. To make this work you either need a backtracking parser or one that will try multiple paths through the space of alternative parses; LR and LL parsers don't do this. There are scannerless GLR parsers for which this would work, if you don't mind the additional overhead of GLR on a per-character basis. (Also, if operating at the character level, you are likely to have to explicitly specify whitespace and comment syntax.)
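To make the cost of the character-level option concrete, here is a sketch of what pushing a single token rule like Name -> /\w+/ down to character-level productions entails (the function and nonterminal names are illustrative; only ASCII word characters are enumerated):

```python
import string

def char_level_rules(token, chars):
    # A scannerless grammar cannot say "Name -> /\w+/"; it needs
    # left-recursive repetition plus one production per allowed character:
    #   Name -> NameChar | Name NameChar
    #   NameChar -> 'a' | 'b' | ... | '_'
    helper = token + "Char"
    rules = [(token, [helper]), (token, [token, helper])]
    rules += [(helper, [c]) for c in chars]
    return rules

word_chars = string.ascii_letters + string.digits + "_"
rules = char_level_rules("Name", word_chars)  # 65 productions for one token
```

One token rule explodes into 65 productions, and the parser must make a decision (or, for GLR, fork) at every single character, which is the per-character overhead mentioned above.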
2) Use them as specifications of individual token character sequences (as in OP's "Name -> /\w+/"). In this form, what you end up doing is writing a lexical token specification integrated with the grammar. The grammar can then be processed into two parts: the lexical specification, and a more conventional BNF. This idea is compatible with many lexer and parser generators; it still doesn't change the expressive power.
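For example, a sketch of such a separated lexical phase for the Path grammar, in Python (the token names and the table are illustrative, not any particular generator's syntax):

```python
import re

# Hypothetical token patterns split out of the grammar; in a real
# lexer/parser-generator pair these would live in the lexer specification.
TOKEN_SPEC = [
    ("SLASH", r"/"),
    ("NAME",  r"\w+"),
]

def tokenize(text):
    # Turn raw input into (kind, value) tokens for the parser to consume.
    pos = 0
    tokens = []
    while pos < len(text):
        for kind, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                tokens.append((kind, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError("unexpected character %r" % text[pos])
    return tokens
```

With this split, `tokenize("/a/b")` yields `[("SLASH", "/"), ("NAME", "a"), ("SLASH", "/"), ("NAME", "b")]`, and the LR grammar proper only ever sees SLASH and NAME terminals, never individual characters.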
3) Use the regex operators only on grammar elements. These are easily transformed into conventional BNF grammar rules:
Path -> Segment+
is equivalent to:
Path -> Segment
Path -> Path Segment
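The same mechanical expansion works for * and ?. A sketch of the three transformations as Python helpers (the function names are illustrative; rules are (lhs, rhs) pairs, with [] as an empty right-hand side):

```python
def expand_plus(lhs, sym):
    # X -> sym+  becomes the left-recursive pair  X -> sym | X sym.
    # Left recursion is the natural form for LR: the list reduces as it
    # is read, so the parse stack stays shallow.
    return [(lhs, [sym]), (lhs, [lhs, sym])]

def expand_star(lhs, sym):
    # X -> sym*  becomes  X -> <empty> | X sym
    return [(lhs, []), (lhs, [lhs, sym])]

def expand_opt(lhs, sym):
    # X -> sym?  becomes  X -> <empty> | sym
    return [(lhs, []), (lhs, [sym])]
```

So `expand_plus("Path", "Segment")` produces exactly the two rules written out above.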
After such transformations the results are compatible with most parser generators. This leaves open how the lexical syntax of grammar tokens is specified.
You can implement a hybrid scheme combining 1) and 2), which appears to be what OP has done.