Author: ze602 | Source: Internet | 2023-09-25 20:25
I'm trying to get all the words in a sentence with a regex, but only the ones made up entirely of [a-zA-Z]. So for "I am a boy" I want {"I", "am", "a", "boy"}, but for "I a1m a b*y" I want {"I", "a"}, because "a1m" and "b*y" include characters other than [a-zA-Z].
To find the words, I'm trying to check:
- if the word is at the beginning of the string, that there is a space after it;
- otherwise, that there is a space before and after it;
- if it is the last word, that there is a space before it.
So I ended up with something like this in Java:
Pattern p = Pattern.compile("^[a-zA-Z]+ |^[a-zA-Z]+$| [a-zA-Z]+$| [a-zA-Z]+");
Matcher m = p.matcher("i am good");
while(m.find()) System.out.println(m.group());
However, I only get "i " and " good". When "i " is matched, the space after "i" is consumed along with it, so the remaining string is "am good". Since "am" is then neither at the beginning of the string nor preceded by an unconsumed space, it is not returned.
Can you guys provide any feedback on this? Is there a way to just peek at the next character and not return the space?
3 Solutions
6
Assuming your regex engine supports lookahead/lookbehind assertions, you can use something like the following:
(^|(?<= ))[a-zA-Z]+($|(?= ))
Here's a quick description of what each component does:
- (^|(?<= )): This says "if a word starts here, we're interested". Specifically,
  - ^: Match the beginning of the line, or
  - (?<= ): Match any point that is preceded by a space, without actually consuming the space itself. This is called a positive lookbehind assertion.
- [a-zA-Z]+: This should be obvious, but it matches any run of consecutive ASCII alphabetic characters.
- ($|(?= )): This says "if the word is finished here, we're done". Specifically,
  - $: Match the end of the line, or
  - (?= ): Match any point that is followed by a space, without actually consuming the space itself. This is called a positive lookahead assertion.
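As a rough illustration of the components described above, here is a minimal Java sketch that applies the lookaround pattern to the two inputs from the question. The class name LookaroundWords and the helper findWords are hypothetical names chosen for this example; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundWords {
    // A word must start at the beginning of the input or right after a space,
    // and end at the end of the input or right before a space.
    private static final Pattern WORD =
            Pattern.compile("(^|(?<= ))[a-zA-Z]+($|(?= ))");

    // Hypothetical helper: collects every match of WORD in the input.
    static List<String> findWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            // The lookarounds consume nothing, so the match holds only letters.
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("i am good"));   // [i, am, good]
        System.out.println(findWords("I a1m a b*y")); // [I, a]
    }
}
```

Because the spaces are only asserted and never consumed, the space after "i" is still available to satisfy the lookbehind for "am", which is what the original four-way alternation could not do.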
Note that this particular regex doesn't count a word as a word if it's followed by punctuation. This may actually not be what you want, but you described checking for spaces so that's what the regex does. If you want to support words that are followed by simple punctuation you might amend that last atom to be
($|(?=[ .,!?]))
which will match the word if it's followed by a space, period, comma, exclamation mark, or question mark. You can be more elaborate too if you want.
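A small sketch of the punctuation-tolerant variant, assuming the same leading atom as before and only the trailing lookahead changed. The class name PunctuationWords, the helper findWords, and the sample sentence are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PunctuationWords {
    // Same start condition as before; the word may now also end right before
    // a period, comma, exclamation mark, or question mark.
    private static final Pattern WORD =
            Pattern.compile("(^|(?<= ))[a-zA-Z]+($|(?=[ .,!?]))");

    // Hypothetical helper: collects every match of WORD in the input.
    static List<String> findWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            words.add(m.group()); // punctuation is asserted, not captured
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("Hello, world! I a1m ok.")); // [Hello, world, I, ok]
    }
}
```

Note that the lookahead inspects only the single character after the word, so an input like "a.b" would still yield "a"; whether that is acceptable depends on what should count as a word.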
2
Could you use a simpler pattern like \b[A-Za-z]+\b instead? (The \b metacharacter matches the boundary between word characters (e.g., letters) and nonword characters (e.g., spaces and punctuation).)
The code
Pattern p = Pattern.compile("\\b[A-Za-z]+\\b");
Matcher m = p.matcher("i am good");
while(m.find()) System.out.println(m.group());
produces {"i", "am", "good"}.
Edit: As mathematical.coffee commented, the above fails (for example, it still returns "b" and "y" from "b*y", since \b also matches between a letter and punctuation). The expression
(?<=^|\s)[A-Za-z]+(?=\W*(?:\s*$|\s))
may work better. For the string "I a1m a b*y boy am is!! or", matching produces "I", "a", "boy", "am", "is", "or".
If "is!!" should be ignored in the previous example, the expression (?<=^|\s)[A-Za-z]+(?=$|\s) can be used instead. On the same string, it does not return "is" but returns the other words ("I", "a", "boy", "am", "or").
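To make the difference between the three patterns in this answer concrete, the following sketch runs each of them over the sample string used above. The class name PatternComparison and the helper matches are hypothetical; the patterns themselves are copied from the answer, with backslashes doubled for Java string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternComparison {
    // Hypothetical helper: returns every match of the given regex in the input.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "I a1m a b*y boy am is!! or";

        // Word boundaries alone: "b" and "y" leak out of "b*y".
        System.out.println(matches("\\b[A-Za-z]+\\b", s));
        // [I, a, b, y, boy, am, is, or]

        // Whitespace lookbehind, trailing punctuation tolerated.
        System.out.println(matches("(?<=^|\\s)[A-Za-z]+(?=\\W*(?:\\s*$|\\s))", s));
        // [I, a, boy, am, is, or]

        // Whitespace on both sides required: "is!!" is dropped.
        System.out.println(matches("(?<=^|\\s)[A-Za-z]+(?=$|\\s)", s));
        // [I, a, boy, am, or]
    }
}
```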