Author: tomorrow | Source: Internet | 2023-08-12 20:12
I am using spaCy to extract nouns from sentences. The sentences are grammatically poor and may also contain spelling mistakes.
Here is the code I am using:
Code:
import spacy
import re

nlp = spacy.load("en_core_web_sm")
sentence = "HANDBRAKE - slow and fast (SFX)"
string = sentence.lower()
cleanString = re.sub(r'\W+', ' ', string)  # \W (with backslash) matches non-word characters
cleanString = cleanString.replace("_", " ")
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN":
        print(token.text)
Output:
sfx
Similarly, for the sentence "fast foward2", the noun spaCy gives me is
foward2
This shows that the extracted nouns include some nonsense words, such as: sfx, foward2, ms, 64x, bit, pwm, r, brailledisplayfastmovement, etc.
I only want to keep phrases whose nouns are sensible words, such as broom, ticker, pool, highway, etc.
I tried WordNet, keeping only the nouns common to WordNet and spaCy, but it is somewhat strict and also filters out some sensible nouns. For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip, etc.
So I am looking for a solution that lets me filter the most sensible nouns out of the noun list I get from spaCy.
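Stripped of the NLP machinery, what the question asks for is a predicate over tokens: keep a noun only if it is a real dictionary word. A minimal sketch, using a tiny hardcoded lexicon as a stand-in (an assumption for illustration; a real solution would consult a full dictionary such as the one the answer below suggests):

```python
# Tiny stand-in lexicon; a real filter would use a proper
# spellchecker dictionary instead of this hardcoded set.
VALID_WORDS = {"broom", "ticker", "pool", "highway", "motorbike"}

def filter_sensible(nouns):
    """Keep only nouns found in the lexicon, dropping tokens
    like 'sfx' or 'foward2' that are not real words."""
    return [n for n in nouns if n in VALID_WORDS]

print(filter_sensible(["sfx", "broom", "foward2", "highway"]))
# prints ['broom', 'highway']
```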
Answer
It seems you can use the pyenchant library:
Enchant is used to check the spelling of words and suggest corrections for words that are misspelled. It can use many popular spellchecking packages to perform this task, including ispell, aspell and MySpell. It is quite flexible at handling multiple dictionaries and multiple languages.
More information is available on the Enchant website:
https://abiword.github.io/enchant/
Sample Python code:
import spacy, re
import enchant  # pip install pyenchant

d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_sm")
sentence = "For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip etc"
cleanString = re.sub(r'[\W_]+', ' ', sentence.lower())  # merging \W and _ into one regex
doc = nlp(cleanString)
for token in doc:
    if token.pos_ == "NOUN" and d.check(token.text):
        print(token.text)
# => example, nouns, motorbike, whoosh, trolley, metal, suitcase, zip (one per line)
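One detail worth flagging: the regex needs the backslash. The pattern `'W+'` from the question matches literal `W` characters (none remain after `.lower()`), so the string was never actually cleaned; `r'[\W_]+'` matches runs of non-word characters and underscores. A stdlib-only sketch of just the cleaning step on the question's sentence:

```python
import re

sentence = "HANDBRAKE - slow and fast (SFX)"

# Buggy pattern: 'W' is a literal character, and the lowercased
# string contains no 'W', so nothing is replaced.
buggy = re.sub('W+', ' ', sentence.lower())
print(buggy)    # handbrake - slow and fast (sfx)

# Correct pattern: \W matches any non-word character, and [\W_]+
# additionally folds underscores into the same replacement.
clean = re.sub(r'[\W_]+', ' ', sentence.lower()).strip()
print(clean)    # handbrake slow and fast sfx
```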