Is regex in Perl faster than in Java or other languages?

Author: wang静的天空 | Source: Internet | 2023-05-18 20:04

I have heard people say from time to time that regex in Perl is faster than in other languages. Some online documents also say Perl has advantages when it comes to regex processing. Can you explain whether this is true, and why?

3 answers

#1

Why would you compare the speed of two engines when one of them (Java's) is notably buggy? (Search for writings by Tom "tchrist" Christiansen on the subject.) For example, \s fails to match many whitespace characters.
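
To make the \s point concrete, here is a minimal Java sketch. It assumes Java 7 or later, where the Pattern.UNICODE_CHARACTER_CLASS flag exists; the class and method names are made up for illustration. By default, java.util.regex limits \s to a handful of ASCII whitespace characters, so Unicode spaces such as U+00A0 (no-break space) and U+2003 (em space) are not matched unless the flag is set.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not taken from the original answer.
class WhitespaceDemo {
    // Default mode: \s covers only [ \t\n\x0B\f\r].
    static final Pattern DEFAULT_SPACE = Pattern.compile("\\s");
    // Java 7+ flag that switches the predefined classes to their Unicode definitions.
    static final Pattern UNICODE_SPACE =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean matchesDefault(String s) {
        return DEFAULT_SPACE.matcher(s).matches();
    }

    static boolean matchesUnicode(String s) {
        return UNICODE_SPACE.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Plain space, no-break space, em space.
        String[] samples = { " ", "\u00A0", "\u2003" };
        for (String s : samples) {
            System.out.printf("U+%04X default=%b unicode=%b%n",
                    (int) s.charAt(0), matchesDefault(s), matchesUnicode(s));
        }
    }
}
```

In Perl, whether \s matches such characters also depends on how the string and pattern are handled, so treat this only as an illustration of the Java side.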

"Some online documents also say Perl has advantages when it comes to regex processing."

Here are some:

Has many features you cannot find in other engines, either because the other engines haven't copied them yet, or because their design does not permit them to support those features.
Highly optimised. Many of these optimisations help to report failed matches sooner, something not covered by many benchmarks.
A leader in Unicode support. Its support is so advanced that our developers are finding problems with the Unicode standard itself and have worked to have them resolved!
Remarkably bug-free.

#2

Have a look at this benchmark. In the table, the column patmch:1t gives the time for matching URLs with /([a-zA-Z][a-zA-Z0-9]*):\/\/([^ /]+)(\/?[^ ]*)/, while the column patmch:2t gives the time for matching URLs or email addresses with /([a-zA-Z][a-zA-Z0-9]*):\/\/([^ /]+)(\/?[^ ]*)|([^ @]+)@([^ @]+)/ (note the | operator). For the first pattern, Perl is about 10 times faster than Java; for the second, they are about the same.
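
To see what the two benchmark patterns actually capture, here is a small Java sketch. The patterns are copied from above; the class name, the helper and the sample lines are my own assumptions, and this is not the benchmark's code. It only shows the capture groups, it does not measure speed.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the two patterns come from the benchmark.
class BenchmarkPatterns {
    // patmch:1t pattern: scheme://host followed by an optional path.
    static final Pattern URL =
            Pattern.compile("([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/?[^ ]*)");
    // patmch:2t pattern: the same URL pattern, or user@host.
    static final Pattern URL_OR_EMAIL = Pattern.compile(
            "([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/?[^ ]*)|([^ @]+)@([^ @]+)");

    // Returns the capture groups of the first match (groups that did not
    // participate are null), or null when the line contains no match.
    static String[] groups(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                groups(URL, "see http://example.com/howto.html for details")));
        System.out.println(Arrays.toString(
                groups(URL_OR_EMAIL, "mail user@example.com for details")));
        System.out.println(Arrays.toString(groups(URL, "a line with no match")));
    }
}
```

With the second pattern, an email match leaves groups 1 to 3 unset and fills groups 4 and 5, because only one side of the | takes part in a match.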

In general, Perl uses a backtracking regex engine. Such an engine is flexible, easy to implement and very fast on a subset of regexes. However, for other kinds of regex, for example those using the | operator, it may become very slow. In the extreme case, its matching time grows exponentially with the length of the input. Another type of regex engine is based on NFA simulation. It is harder to implement but has stable performance (quadratic at worst, IIRC) for all types of input. Russ Cox has several articles on these topics, which I like a lot.
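
The difference can be sketched with a toy model in Java. This is not how Perl or java.util.regex are implemented; it only handles patterns made of "a" and "a?" items, and all names are hypothetical. The pattern family is the one from Russ Cox's articles, n copies of a? followed by n copies of a, matched against n letters a. The backtracking matcher tries every way of consuming or skipping the optional items, while the state-set matcher tracks all active pattern positions at once.

```java
import java.util.Arrays;

// Toy model: the pattern is a sequence of items that each match one 'a';
// optional[p] == true means item p is "a?".
class BacktrackVsStateSet {
    static long backtrackSteps;
    static long stateSetSteps;

    // Backtracking without memoisation: try to consume a character first,
    // then fall back to skipping the item if it is optional.
    static boolean backtrack(boolean[] optional, int p, String s, int i) {
        backtrackSteps++;
        if (p == optional.length) {
            return i == s.length();
        }
        if (i < s.length() && s.charAt(i) == 'a'
                && backtrack(optional, p + 1, s, i + 1)) {
            return true;
        }
        return optional[p] && backtrack(optional, p + 1, s, i);
    }

    // Marks position p as active, plus every position reachable from it
    // by skipping optional items. Each position is marked at most once per set.
    static void activate(boolean[] set, boolean[] optional, int p) {
        while (!set[p]) {
            set[p] = true;
            stateSetSteps++;
            if (p < optional.length && optional[p]) {
                p++;
            }
        }
    }

    // State-set simulation: one pass over the input, advancing all
    // active pattern positions together.
    static boolean stateSet(boolean[] optional, String s) {
        boolean[] current = new boolean[optional.length + 1];
        activate(current, optional, 0);
        for (int i = 0; i < s.length(); i++) {
            boolean[] next = new boolean[optional.length + 1];
            for (int p = 0; p < optional.length; p++) {
                if (current[p] && s.charAt(i) == 'a') {
                    activate(next, optional, p + 1);
                }
            }
            current = next;
        }
        return current[optional.length];
    }

    // Builds n optional items followed by n required ones.
    static boolean[] pathological(int n) {
        boolean[] optional = new boolean[2 * n];
        Arrays.fill(optional, 0, n, true);
        return optional;
    }

    // Returns a string of n letters 'a'.
    static String input(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'a');
        return new String(chars);
    }

    public static void main(String[] args) {
        for (int n = 4; n <= 20; n += 4) {
            boolean[] pattern = pathological(n);
            String text = input(n);
            backtrackSteps = 0;
            stateSetSteps = 0;
            boolean b = backtrack(pattern, 0, text, 0);
            boolean t = stateSet(pattern, text);
            System.out.printf("n=%d backtracking=%b (%d steps) state-set=%b (%d steps)%n",
                    n, b, backtrackSteps, t, stateSetSteps);
        }
    }
}
```

In this model the backtracking step count more than doubles each time n grows by one, while the state-set count is bounded by (input length + 1) times (pattern length + 1). Real engines add many optimisations on top, so the numbers say nothing about Perl or Java specifically.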

I don't know what type of regex engine Java uses, but judging from the benchmark, its performance does not seem impressive. You may also be interested in this benchmark, which evaluates several C/C++ regex libraries.

EDIT: In both benchmarks, patterns are tested against an old version of Linux Howto. The vast majority of lines do not have a match.

About DFA vs. NFA: if I am right, a pure DFA cannot capture groups, at least not easily; only an NFA can. I have heard that RE2 converts the NFA to a DFA locally, for the parts of a regex without capture groups. I do not know if this is true.

On PCRE: PCRE has the same problem as Perl: it is inefficient on complex alternations. You may have a look at the regex-dna benchmark from the Computer Language Benchmarks Game. The versions using PCRE are all much slower than the fastest version, which uses Tcl (maybe PCRE is not using a trie?). V8 is clearly the winner in that benchmark, as far as I can tell because it does not use backtracking. IMO, for C++ programmers, the best regex library is RE2.

#3

The point is not whether Perl is faster than Java (benchmarks will tell you), but that regexes are really (deeply) part of the language itself. For example, in Perl there is no need to load any module to use regexes. See this relevant answer.

For example, here is a Perl one-liner, run in a terminal, that prints root's shell:

perl -nE '/^root.*:([\/\w]+)$/ and say $1' /etc/passwd

How many lines do you need to do the same thing in Java?
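
For comparison, a rough Java equivalent might look like the sketch below. It assumes Java 8 or later; the class and method names are made up. It reads /etc/passwd line by line, applies the same regex, and prints capture group 1 for each matching line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical class name; the regex is the one from the Perl one-liner.
class RootShell {
    static final Pattern ROOT_SHELL = Pattern.compile("^root.*:([/\\w]+)$");

    // Returns capture group 1 for every line that matches the pattern.
    static List<String> shells(Stream<String> lines) {
        return lines.map(ROOT_SHELL::matcher)
                .filter(Matcher::find)
                .map(m -> m.group(1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Files.lines strips the line terminators, so $ matches at end of line.
        try (Stream<String> lines = Files.lines(Paths.get("/etc/passwd"))) {
            shells(lines).forEach(System.out::println);
        }
    }
}
```

Even in this compact form, the regex has to be compiled through a library class and the file handling is explicit, which is the contrast the one-liner is meant to show.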

Perl is the de facto reference language for regexes. That is why so many languages use the PCRE engine (PCRE stands for Perl Compatible Regular Expressions).
