You may have a look at this benchmark. In the table, column patmch:1t gives the time for matching a URL with /([a-zA-Z][a-zA-Z0-9]*):\/\/([^ /]+)(\/?[^ ]*)/ (that is, /([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/?[^ ]*)/ as written in the benchmark), while column patmch:2t gives the time for matching a URL or an email address with /([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/?[^ ]*)|([^ @]+)@([^ @]+)/ (note the | operator). For the first pattern, Perl is about 10X faster than Java; for the second, they are about the same.
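To make the two patterns concrete, here is a minimal Python sketch that compiles them and counts matching lines. It only illustrates what each pattern matches; it is not the benchmark's code, and the helper `count_matches` and the sample lines are made up for this example.

```python
import re

# The two benchmark patterns, copied from the text above.
URL = re.compile(r"([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/?[^ ]*)")
URL_OR_EMAIL = re.compile(
    r"([a-zA-Z][a-zA-Z0-9]*)://([^ /]+)(/?[^ ]*)|([^ @]+)@([^ @]+)"
)


def count_matches(pattern, lines):
    """Hypothetical helper: number of lines containing at least one match."""
    return sum(1 for line in lines if pattern.search(line))


# Made-up sample input; the real benchmark scans a large text file.
SAMPLE = [
    "no match here",
    "see http://example.com/path here",
    "mail bob@example.org now",
]

if __name__ == "__main__":
    print(count_matches(URL, SAMPLE), count_matches(URL_OR_EMAIL, SAMPLE))
```

With the alternation, only one side's capture groups are filled in for any given match; the groups from the other branch come back as `None`.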
In general, Perl uses a backtracking regex engine. Such an engine is flexible, easy to implement, and very fast on a subset of regexes. However, for other kinds of regex, for example those containing the | operator, it may become very slow. In the extreme case, its matching time is exponential in the length of the input. Another type of regex engine is based on NFA simulation. It is harder to implement but has stable performance on all inputs (IIRC, worst-case time proportional to pattern length times text length). Russ Cox has several articles about these topics, which I like a lot.
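The difference can be illustrated with a toy matcher. The sketch below is not Perl's or any real library's engine; it handles only sequences of optional and required literal characters, and all names in it (`build`, `backtrack`, `simulate`, `compare`) are invented for this example. It matches the pattern a?a?...a?aa...a (n optional a's followed by n required a's) against a string of n a's, once by recursive backtracking and once by tracking the set of live NFA states, and counts the steps each approach takes.

```python
def build(n):
    """Items for n optional 'a's followed by n required 'a's."""
    return [("opt", "a")] * n + [("lit", "a")] * n


def backtrack(items, text, i, j, counter):
    """Match items[i:] against text[j:] exactly, by recursive backtracking."""
    counter[0] += 1
    if i == len(items):
        return j == len(text)
    kind, ch = items[i]
    # Greedy: try to consume a character first.
    if j < len(text) and text[j] == ch:
        if backtrack(items, text, i + 1, j + 1, counter):
            return True
    # An optional item may also be skipped.
    if kind == "opt":
        return backtrack(items, text, i + 1, j, counter)
    return False


def closure(items, states):
    """Add every state reachable by skipping optional items."""
    seen, stack = set(), list(states)
    while stack:
        i = stack.pop()
        if i in seen:
            continue
        seen.add(i)
        if i < len(items) and items[i][0] == "opt":
            stack.append(i + 1)
    return seen


def simulate(items, text, counter):
    """Match by advancing the whole set of live states one character at a time."""
    current = closure(items, {0})
    for c in text:
        nxt = set()
        for i in current:
            counter[0] += 1
            if i < len(items) and items[i][1] == c:
                nxt.add(i + 1)
        current = closure(items, nxt)
    return len(items) in current


def compare(n):
    """Return (match?, steps) for both strategies on the same input."""
    items, text = build(n), "a" * n
    bt_steps, sim_steps = [0], [0]
    bt_ok = backtrack(items, text, 0, 0, bt_steps)
    sim_ok = simulate(items, text, sim_steps)
    return bt_ok, bt_steps[0], sim_ok, sim_steps[0]
```

Both strategies report a match, but the backtracking step count at least doubles each time n grows by one, while the state-set count is bounded by the number of states times the text length.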
I don't know what type of regex engine Java uses, but judging from the benchmark, its performance does not seem impressive. You may also be interested in this benchmark, which evaluates several C/C++ regex libraries.
EDIT: In both benchmarks, patterns are tested against an old version of Linux Howto. The vast majority of lines do not have a match.
About DFA vs. NFA: if I am right, a pure DFA cannot capture groups, at least not easily; only an NFA-based matcher can. I heard that RE2 locally transforms the NFA into a DFA for the parts of a regex that have no capture groups. I do not know if this is true.
On PCRE: PCRE has the same problem as Perl: it is inefficient on complex alternations. You may have a look at the regex-dna benchmark from the Computer Language Benchmarks Game. The versions using PCRE are all much slower than the fastest version, which uses Tcl (maybe PCRE is not using a trie?). V8 is clearly the winner in that benchmark, which I believe is because it does not rely on backtracking. IMO, for C++ programmers, the best regex library is RE2.