Author: 雨霖铃111130 | Source: Internet | 2023-09-23 19:48
1. Split a paragraph into sentences (Punkt sentence tokenizer)
import nltk.data

def splitSentence(paragraph):
    # Load the pre-trained English Punkt model.
    # The model must be downloaded once beforehand: nltk.download('punkt')
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = tokenizer.tokenize(paragraph)
    return sentences

if __name__ == '__main__':
    print(splitSentence("My name is Tom. I am a boy. I like soccer!"))
Output: ['My name is Tom.', 'I am a boy.', 'I like soccer!']
2. Split a sentence into words
from nltk.tokenize import WordPunctTokenizer

def wordtokenizer(sentence):
    # Tokenize the sentence into word and punctuation tokens
    words = WordPunctTokenizer().tokenize(sentence)
    return words

if __name__ == '__main__':
    print(wordtokenizer("My name is Tom."))
Output: ['My', 'name', 'is', 'Tom', '.']
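WordPunctTokenizer splits purely on character classes (runs of alphanumerics vs. runs of punctuation), so it needs no downloaded model, but it also breaks contractions apart. A minimal sketch of that behavior:

```python
from nltk.tokenize import WordPunctTokenizer

# WordPunctTokenizer tokenizes with the regex \w+|[^\w\s]+, so alphanumeric
# runs and punctuation runs become separate tokens; no model download needed.
tokens = WordPunctTokenizer().tokenize("Don't stop!")
print(tokens)  # ['Don', "'", 't', 'stop', '!']
```

If contractions should stay linguistically meaningful (e.g. "Don't" → "Do", "n't"), nltk.word_tokenize is the usual alternative, though it requires the Punkt model to be downloaded.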
To check where NLTK looks for its downloaded data, run the following in a Python session:
import nltk
nltk.data.find(".")
This returns the first existing directory on NLTK's data search path.
Alternatively, add an NLTK_DATA environment variable (under System Properties > Environment Variables on Windows) pointing to the data directory populated by nltk.download().
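Besides the NLTK_DATA environment variable, the search path can also be inspected and extended at runtime through nltk.data.path; a small sketch (the appended directory is a hypothetical example):

```python
import nltk.data

# nltk.data.path is the ordered list of directories NLTK searches
# for downloaded resources such as the Punkt model.
print(nltk.data.path)

# Equivalent in effect to setting NLTK_DATA: append a custom
# directory at runtime (hypothetical path, for illustration only).
nltk.data.path.append('/opt/nltk_data')
```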
For Chinese sentence splitting and word segmentation, pyltp can be used. pyltp produces errors on English text, however, so use NLTK for English sentence splitting and tokenization instead.
http://blog.csdn.net/baidu_27438681/article/details/60468848