I'm trying to preprocess words to remove common prefixes like "un" and "re", but all of NLTK's common stemmers seem to ignore prefixes completely:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
PorterStemmer().stem('unhappy')
# u'unhappi'
SnowballStemmer('english').stem('unhappy')
# u'unhappi'
LancasterStemmer().stem('unhappy')
# 'unhappy'
PorterStemmer().stem('reactivate')
# u'reactiv'
SnowballStemmer('english').stem('reactivate')
# u'reactiv'
LancasterStemmer().stem('reactivate')
# 'react'
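Isn't removing common prefixes and suffixes part of a stemmer's job? Is there another stemmer that does this reliably?
You're right. Most stemmers only strip suffixes. In fact, the title of Martin Porter's original paper is:
Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137.
And possibly the only stemmers in NLTK that handle prefixes at all are the Arabic stemmers.
But if we look at this prefix_replace function, it simply removes the old prefix and substitutes a new one: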
def prefix_replace(original, old, new):
    """
    Replaces the old prefix of the original string by a new suffix
    :param original: string
    :param old: string
    :param new: string
    :return: string
    """
    return new + original[len(old):]
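To make the behaviour concrete, here is a small sketch that copies the helper above and calls it on a few words (the example words are my own choice). Notice that the function never checks whether the word really starts with `old`; it only slices off `len(old)` characters.

```python
def prefix_replace(original, old, new):
    """Copy of the helper shown above: swap a leading `old` for `new`."""
    return new + original[len(old):]

# Stripping a prefix is just replacing it with the empty string.
print(prefix_replace("unhappy", "un", ""))      # happy
print(prefix_replace("impossible", "im", ""))   # possible

# Caveat: nothing verifies that `original` actually starts with `old`.
print(prefix_replace("happy", "un", ""))        # ppy
```

So the caller has to test for the prefix first, which is what the prefix-stripping code further down does.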
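But we can do better!
First, do you have a fixed list of prefixes and replacements for the language you need to process?
Let's use the (unfortunately) de facto language, English, and do a bit of linguistic work to find the English prefixes: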
https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
Without much work, you can write a prefix-stripping function that runs before NLTK's suffix stemming, e.g.:
import re
from nltk.stem import PorterStemmer
# From https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
english_prefixes = {
    "anti": "",    # e.g. anti-government, anti-racist, anti-war
    "auto": "",    # e.g. autobiography, automobile
    "de": "",      # e.g. de-classify, decontaminate, demotivate
    "dis": "",     # e.g. disagree, displeasure, disqualify
    "down": "",    # e.g. downgrade, downhearted
    "extra": "",   # e.g. extraordinary, extraterrestrial
    "hyper": "",   # e.g. hyperactive, hypertension
    "il": "",      # e.g. illegal
    "im": "",      # e.g. impossible
    "in": "",      # e.g. insecure
    "ir": "",      # e.g. irregular
    "inter": "",   # e.g. interactive, international
    "mega": "",    # e.g. megabyte, mega-deal, megaton
    "mid": "",     # e.g. midday, midnight, mid-October
    "mis": "",     # e.g. misaligned, mislead, misspelt
    "non": "",     # e.g. non-payment, non-smoking
    "over": "",    # e.g. overcook, overcharge, overrate
    "out": "",     # e.g. outdo, out-perform, outrun
    "post": "",    # e.g. post-election, post-war
    "pre": "",     # e.g. prehistoric, pre-war
    "pro": "",     # e.g. pro-communist, pro-democracy
    "re": "",      # e.g. reconsider, redo, rewrite
    "semi": "",    # e.g. semicircle, semi-retired
    "sub": "",     # e.g. submarine, sub-Saharan
    "super": "",   # e.g. super-hero, supermodel
    "tele": "",    # e.g. television, telepathic
    "trans": "",   # e.g. transatlantic, transfer
    "ultra": "",   # e.g. ultra-compact, ultrasound
    "un": "",      # e.g. unhappy, unfriendly
    "up": "",      # e.g. upgrade, uphill
}
porter = PorterStemmer()

def stem_prefix(word, prefixes):
    for prefix in sorted(prefixes, key=len, reverse=True):
        # Use subn to track the no. of substitutions made.
        # Anchor at the start of the word and allow a dash between prefix and root.
        word, nsub = re.subn(r"^{}-?".format(re.escape(prefix)), "", word)
        if nsub > 0:
            return word
    # No known prefix found: return the word unchanged.
    return word

def porter_english_plus(word, prefixes=english_prefixes):
    return porter.stem(stem_prefix(word, prefixes))

word = "extraordinary"
porter_english_plus(word)
Now that we have a simple prefix stemmer, can we do better?
# E.g. this is not satisfactory:
>>> porter_english_plus("united")
"ited"
What if we only strip the prefix when the remaining word appears in some word list?
import re
from nltk.corpus import words
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer
# From https://dictionary.cambridge.org/grammar/british-grammar/word-formation/prefixes
english_prefixes = {
    "anti": "",    # e.g. anti-government, anti-racist, anti-war
    "auto": "",    # e.g. autobiography, automobile
    "de": "",      # e.g. de-classify, decontaminate, demotivate
    "dis": "",     # e.g. disagree, displeasure, disqualify
    "down": "",    # e.g. downgrade, downhearted
    "extra": "",   # e.g. extraordinary, extraterrestrial
    "hyper": "",   # e.g. hyperactive, hypertension
    "il": "",      # e.g. illegal
    "im": "",      # e.g. impossible
    "in": "",      # e.g. insecure
    "ir": "",      # e.g. irregular
    "inter": "",   # e.g. interactive, international
    "mega": "",    # e.g. megabyte, mega-deal, megaton
    "mid": "",     # e.g. midday, midnight, mid-October
    "mis": "",     # e.g. misaligned, mislead, misspelt
    "non": "",     # e.g. non-payment, non-smoking
    "over": "",    # e.g. overcook, overcharge, overrate
    "out": "",     # e.g. outdo, out-perform, outrun
    "post": "",    # e.g. post-election, post-war
    "pre": "",     # e.g. prehistoric, pre-war
    "pro": "",     # e.g. pro-communist, pro-democracy
    "re": "",      # e.g. reconsider, redo, rewrite
    "semi": "",    # e.g. semicircle, semi-retired
    "sub": "",     # e.g. submarine, sub-Saharan
    "super": "",   # e.g. super-hero, supermodel
    "tele": "",    # e.g. television, telepathic
    "trans": "",   # e.g. transatlantic, transfer
    "ultra": "",   # e.g. ultra-compact, ultrasound
    "un": "",      # e.g. unhappy, unfriendly
    "up": "",      # e.g. upgrade, uphill
}
porter = PorterStemmer()
# Use a set so that membership checks are fast.
whitelist = set(wn.words()) | set(words.words())

def stem_prefix(word, prefixes, roots):
    for prefix in sorted(prefixes, key=len, reverse=True):
        # Use subn to track the no. of substitutions made.
        # Anchor at the start of the word and allow a dash between prefix and root.
        candidate, nsub = re.subn(r"^{}-?".format(re.escape(prefix)), "", word)
        # Only accept the stripped form if it is a known word.
        if nsub > 0 and candidate in roots:
            return candidate
    # No prefix left a known word: return the word unchanged.
    return word

def porter_english_plus(word, prefixes=english_prefixes):
    return porter.stem(stem_prefix(word, prefixes, whitelist))
This fixes the problem of stripping a prefix and being left with a nonsensical root, e.g.:
>>> stem_prefix("united", english_prefixes, whitelist)
"united"
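But the Porter stemmer will still remove the suffix -ed, which may or may not be the output you want, especially when the goal is to keep linguistically sound units in the data: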
>>> porter_english_plus("united")
"unit"
So, depending on the task, it is sometimes more beneficial to use a lemmatizer than a stemmer.
If you have a list of 400,000+ English words, along with a list of 645 prefixes:
https://www.dictionary.com/e/affixes/
https://raw.githubusercontent.com/dwyl/english-words/master/words.txt
def englishWords():
    with open(r'C:\Program Files (x86)\MyJournal\Images\American English\EnglishWords.txt') as word_file:
        return set(word.strip().lower() for word in word_file)

def is_english_word(word, english_words):
    return word.lower() in english_words
def removePref(word):
    prefs = [
        'a','ab','abs','ac','acanth','acantho','acous','acr','acro','ad','aden','adeno','adren','adreno',
        'aer','aero','af','ag','al','all','allo','alti','alto','am','amb','ambi','amphi','amyl','amylo',
        'an','ana','andr','andro','anem','anemo','ant','ante','anth','anthrop','anthropo','anti','ap',
        'api','apo','aqua','aqui','arbor','arbori','arch','archae','archaeo','arche','archeo','archi',
        'arteri','arterio','arthr','arthro','as','aster','astr','astro','at','atmo','audio','auto','avi',
        'az','azo','bacci','bacteri','bacterio','bar','baro','bath','batho','bathy','be','bi','biblio',
        'bio','bis','blephar','blepharo','bracchio','brachy','brevi','bronch','bronchi','bronchio',
        'broncho','caco','calci','cardio','carpo','cat','cata','cath','cato','cen','ceno','centi',
        'cephal','cephalo','cerebro','cervic','cervici','cervico','chiro','chlor','chloro','chol','chole',
        'cholo','chondr','chondri','chondro','choreo','choro','chrom','chromato','chromo','chron',
        'chrono','chrys','chryso','circu','circum','cirr','cirri','cirro','cis','cleisto','co','cog',
        'col','com','con','contra','cor','cosmo','counter','cranio','cruci','cry','cryo','crypt','crypto',
        'cupro','cyst','cysti','cysto','cyt','cyto','dactyl','dactylo','de','dec','deca','deci','dek',
        'deka','demi','dent','denti','dento','dentro','derm','dermadermo','deut','deutero','deuto',
        'dextr','dextro','di','dia','dif','digit','digiti','dipl','diplo','dis','dodec','dodeca','dors',
        'dorsi','dorso','dyna','dynamo','dys','e','ec','echin','echino','ect','ecto','ef','el','em','en',
        'encephal','encephalo','end','endo','ennea','ent','enter','entero','ento','entomo','eo','ep',
        'epi','equi','erg','ergo','erythr','erythro','ethno','eu','ex','exo','extra','febri','ferri',
        'ferro','fibr','fibro','fissi','fluvio','for','fore','gain','galact','galacto','gam','gamo',
        'gastr','gastri','gastro','ge','gem','gemmi','geo','geront','geronto','gloss','glosso','gluc',
        'gluco','glyc','glyph','glypto','gon','gono','grapho','gymn','gymno','gynaec','gynaeco','gynec',
        'gyneco','haem','haemato','haemo','hagi','hagio','hal','halo','hapl','haplo','hect','hecto',
        'heli','helic','helico','helio','hem','hema','hemi','hemo','hepat','hepato','hept','hepta',
        'heter','hetero','hex','hexa','hist','histo','hodo','hol','holo','hom','homeo','homo','hydr',
        'hydro','hyet','hyeto','hygr','hygro','hyl','hylo','hymeno','hyp','hyper','hypn','hypno','hypo',
        'hypso','hyster','hystero','iatro','ichthy','ichthyo','ig','igni','il','ile','ileo','ilio','im',
        'in','infra','inter','intra','intro','ir','is','iso','juxta','kerat','kerato','kinesi','kineto',
        'labio','lact','lacti','lacto','laryng','laryngo','lepto','leucleuco','leuk','leuko','lign',
        'ligni','ligno','litho','log','logo','luni','lyo','lysi','macr','macro','magni','mal','malac',
        'malaco','male','meg','mega','megalo','melan','melano','mero','mes','meso','met','meta','metr',
        'metro','micr','micro','mid','mini','mis','miso','mon','mono','morph','morpho','mult','multi',
        'my','myc','myco','myel','myelo','myo','n','naso','nati','ne','necr','necro','neo','nepho',
        'nephr','nephro','neur','neuro','nocti','non','noso','not','noto','nycto','o','ob','oc','oct',
        'octa','octo','ocul','oculo','odont','odonto','of','oleo','olig','oligo','ombro','omni','oneiro',
        'ont','onto','oo','op','ophthalm','ophthalmo','ornith','ornitho','oro','orth','ortho','ossi',
        'oste','osteo','oto','out','ov','over','ovi','ovo','oxy','pachy','palae','palaeo','pale','paleo',
        'pan','panto','par','para','pari','path','patho','ped','pedo','pel','pent','penta','pente','per',
        'peri','petr','petri','petro','phago','phleb','phlebo','phon','phono','phot','photo','phren',
        'phreno','phyll','phyllo','phylo','picr','picro','piezo','pisci','plan','plano','pleur','pleuro',
        'pluto','pluvio','pneum','pneumat','pneumato','pneumo','poly','por','post','prae','pre','preter',
        'prim','primi','pro','pros','prot','proto','pseud','pseudo','psycho','ptero','pulmo','pur','pyo',
        'pyr','pyro','quadr','quadri','quadru','quinque','re','recti','reni','reno','retro','rheo','rhin',
        'rhino','rhiz','rhizo','sacchar','sacchari','sacchro','sacr','sacro','sangui','sapr','sapro',
        'sarc','sarco','scelero','schisto','schizo','se','seba','sebo','selen','seleno','semi','septi',
        'sero','sex','sexi','shiz','sider','sidero','sine','somat','somato','somn','sperm','sperma',
        'spermat','spermato','spermi','spermo','spiro','stato','stauro','stell','sten','steno','stere',
        'stereo','stom','stomo','styl','styli','stylo','sub','subter','suc','suf','sug','sum','sup',
        'super','supra','sur','sus','sy','syl','sym','syn','tachy','taut','tauto','tel','tele','teleo',
        'telo','terra','the','theo','therm','thermo','thromb','thrombo','topo','tox','toxi','toxo','tra',
        'trache','tracheo','trans','tri','tris','ultra','un','undec','under','uni','up','uter','utero',
        'vari','vario','vas','vaso','ventr','ventro','vice','with','xen','xeno','zo','zoo','zyg','zygo',
        'zym','zymo',
    ]
    english_words = englishWords()
    for pre in prefs:
        if word.startswith(pre):
            withoutPref = word[len(pre):]
            if is_english_word(withoutPref, english_words):
                return withoutPref
    return word
>>> removePref('reload')
'load'
>>> removePref('unhappy')
'happy'
>>> removePref('reactivate')
'activate'
>>> removePref('impertinent')
'pertinent'
>>> removePref('aerophobia')
'phobia'
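One thing to watch with the loop above: prefs is in alphabetical order, so the first (usually shortest) prefix that leaves a dictionary word wins, and the word list is re-read from disk on every call. A possible variant is sketched below. It tries longer prefixes first and takes the word set as an argument so it can be loaded once. The function name strip_longest_prefix and the tiny sample lists are made up for illustration; they stand in for the real prefix list and word file.

```python
def strip_longest_prefix(word, prefixes, vocabulary):
    """Return `word` without its longest prefix that leaves a known word.

    Falls back to the unchanged word when no prefix qualifies.
    """
    lowered = word.lower()
    # Longest prefixes first, so "dis" is tried before "di".
    for prefix in sorted(prefixes, key=len, reverse=True):
        if lowered.startswith(prefix):
            remainder = lowered[len(prefix):]
            if remainder in vocabulary:
                return remainder
    return word


# Toy data for illustration only; in practice load these once from your files.
sample_prefixes = ["di", "dis", "re", "un", "under"]
sample_vocabulary = {"able", "sable", "load", "happy", "stand"}

print(strip_longest_prefix("disable", sample_prefixes, sample_vocabulary))     # able
print(strip_longest_prefix("reload", sample_prefixes, sample_vocabulary))      # load
print(strip_longest_prefix("understand", sample_prefixes, sample_vocabulary))  # stand
print(strip_longest_prefix("united", sample_prefixes, sample_vocabulary))      # united
```

With alphabetical order, "disable" would match "di" first and come back as "sable" whenever that is in the word list. Whether longest-first is the right choice depends on your data, so treat it as one option rather than a fix.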