I want to print all tokens in a file that carry a given morphological tag. So far I have written the code shown below.
def index(filepath, string):
    import re
    pattern = re.compile(r'(\w+)+')
    StringList = []
    StringList.append(string)
    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            words = set(m.group(1) for m in pattern.finditer(line))
            matches = [keyword for keyword in StringList if keyword in words]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)
    StringList.clear()

index('deneme.txt', '+Noun')
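The output looks like this. I can find the Noun tag in the tokens, along with the line number, but I cannot print the part I want: I only want the word part before the + sign.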
Noun 1
Noun 2
Noun 3
Noun 4
Noun 5
Noun 6
Noun 7
The lines in my file look like this:
Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc oluş+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karşı+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylaş+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num aşama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc
Türkiye+Noun+Gen ekonomik+Adj ve+Conj insani+Adj potansiyel+Noun+P3sg ,+Punc güç+Noun+With savun+Verb+Inf2 kapasite+Noun+P3sg ,+Punc ulus+Noun+A3pl+InBetween çatış+Verb+Inf2+A3pl+Gen önle+Verb+Pass+Inf2+P3sg ve+Conj barış+Noun+P3sg inşa+Noun çaba+Noun+A3pl+P3sg+Dat aktif+Adj katılım+Noun+P3sg+Gen yanısıra+PostpPCGen ,+Punc fark+Noun+With kültür+Noun ve+Conj gelenek+Noun+A3pl+Dat ait+PostpPCDat seçkin+Adj özellik+Noun+A3pl+Acc birleş+Verb+Caus+PresPart bir+Num bünye+Noun+Dat sahip+Noun ol+Verb+Inf2+P3sg ,+Punc kendi+Pron+P3sg bölge+Noun+P3sg+Loc ve+Conj öte+Noun+P3sg+Loc önem+Noun+With rol+Noun oyna+Verb+Inf2+P3sg+Acc sağla+Verb+Fut değer+Noun+With özellik+Noun+A3pl+Cop .+Punc
Türkiye+Noun ,+Punc bu+Det önem+Noun+With katkı+Noun+Acc yap+Verb+Able+Inf1 için+PostpPCGen yeterli+Adj donanım+Noun+P3sg haiz+Adj bir+Num ülke+Noun+Cop ve+Conj gelecek+Noun nesil+Noun+A3pl için+PostpPCGen daha+Noun i+Noun+Acc bir+Num dünya+Noun oluş+Verb+Caus+Inf1 amaç+Noun+P3sg+Ins ,+Punc dost+Noun+A3pl+P3pl ve+Conj müttefik+Adj+A3pl+P3sg+Ins yakın+Noun bir+Num biçim+Noun+Loc çalış+Verb+Inf2+Dat devam+Noun et+Verb+Fut+Cop .+Punc
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj ilişki+Noun+A3pl
club+Noun toplantı+Noun+A3pl+P3sg
Türkiye+Noun -+Punc At+Noun gümrük+Noun işbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlaşma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklık+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj geliş+Verb+Inf2+P3sg+Acc sağla+Verb+Inf1 üzere+PostpPCNom ortaklık+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayılı+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc
club+Noun toplantı+Noun+A3pl+P3sg
nispi+Adj
nisbi+Adj
görece+Adj+With
izafi+Adj
obur+Adj
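In other words, when I give a tag, I want to get the matching tokens. For example, when I write +Adj, I want to get all the tokens that contain +Adj (nispi, izafi, and so on).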
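I think your idea of how to use the regex needs some improvement.

Note that each input line contains a number of "tokens", e.g. terörizm+Noun+Gen. As you can see, a token contains + characters. So each line should first be split into tokens on whitespace, and each token should then be split into words on + characters. The first word is the word itself, and the remaining words (those after a +) are classification symbols. It is also good practice to strip trailing whitespace characters (at least \n).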
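The splitting described above can be sketched on a single line of input. This is only an illustration, assuming Python 3; the helper name split_token is hypothetical and does not appear in the original code.

```python
def split_token(token):
    """Split 'word+Tag1+Tag2' into ('word', ['Tag1', 'Tag2'])."""
    # strip() removes a trailing newline or other surrounding whitespace
    word, *tags = token.strip().split("+")
    return word, tags


line = "Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj\n"

# str.split() with no argument splits on any run of whitespace
for token in line.split():
    print(split_token(token))
```

Each printed pair holds the word first and its list of classification symbols second, which is the shape the rest of the answer works with.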
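Note also that your code contains StringList, so you are aware that the function may need to look for one or more of a number of classification words.

I programmed it in a slightly different way. The second parameter (lookFor) is a list of words, which is converted to a set (lookForSet). The decision whether to print a word (the first word in a token) is based on whether at least one of its classification symbols can be found in lookForSet. In other words, whether lookForSet and wordSet have any elements in common (set intersection).

So the whole script looks like this: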
import re

def index(fileName, lookFor):
    lookForSet = set(lookFor)       # Set of classification symbols to look for
    pat1 = re.compile(r'\s+')       # Regex to split line into tokens
    pat2 = re.compile(r'\+')        # Regex to split a token into words
    with open(fileName) as f:
        for lineNo, line in enumerate(f, start=1):
            line = line.rstrip()
            tokens = pat1.split(line)
            for token in tokens:
                words = pat2.split(token)
                word1 = words.pop(0)    # Initial word
                wordSet = set(words)    # Classification words
                commonWords = lookForSet.intersection(wordSet)
                if commonWords:
                    print("{:3}: {:<15} {}".format(lineNo, word1, ', '.join(commonWords)))

index('lines.txt', ['Noun', 'Gen'])
Its output for my input data (a slightly shortened version) is as follows:
1: Türkiye Noun
1: terörizm Noun, Gen
1: kitle Noun
1: imha Noun
2: Türkiye Noun, Gen
2: potansiyel Noun
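Each output line contains the line number, the word, and which classification words from lookFor were found for it.

Splitting on \w+ removed the + part of what you were looking for, so I changed it to split on the whitespace in between instead. After that, it was just a matter of getting the for and in clauses into the right order for the list comprehension.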
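To make the first point concrete, here is a small check of what the question's pattern actually extracts. It is a sketch assuming Python 3, where \w also matches letters such as ü; the sample line is cut down from the question's data.

```python
import re

pattern = re.compile(r'(\w+)+')        # the pattern from the question
line = "küresel+Adj düzey+Noun+Loc"

# The regex only captures runs of word characters, so every '+' is dropped
words = set(m.group(1) for m in pattern.finditer(line))
print(sorted(words))

# A keyword that still carries its '+' can therefore never be found
print('+Adj' in words)                 # False

# Splitting on whitespace keeps each token whole, so a substring test works
tokens = line.split()
print([t for t in tokens if '+Adj' in t])   # ['küresel+Adj']
```

With that in mind, the rewritten function looks like this: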
def index(filepath, string):
    StringList = [string]
    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            words = line.split(' ')
            matches = [word for keyword in StringList for word in words if keyword in word]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)

index('deneme.txt', '+Adj')
This gives the result:
küresel+Adj,karşı+Adj+P3sg+Loc,samimi+Adj 1
ekonomik+Adj,insani+Adj,aktif+Adj,seçkin+Adj 2
yeterli+Adj,haiz+Adj,müttefik+Adj+A3pl+P3sg+Ins 3
kurumsal+Adj 4
sayılı+Adj 6
nispi+Adj 8
nisbi+Adj 9
görece+Adj+With 10
izafi+Adj 11
obur+Adj 12
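I removed the line StringList.clear() because it raised an error for some reason, probably because list.clear() was only added in Python 3.3 and does not exist in 2.7.

This works on Python 2.7 and 3.6+, although extended Unicode characters in the text throw off the column alignment under 2.7.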
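The question asks for only the word part before the + sign, whereas the output above still shows whole tokens. One possible way to get just the words is sketched below. It assumes Python 3, and the function name words_with_tag and the in-memory sample list are hypothetical, used here so the snippet runs without a file.

```python
def words_with_tag(lines, tag):
    """Yield (line_number, word) for every token that carries the given tag."""
    wanted = tag.lstrip("+")               # accept both '+Adj' and 'Adj'
    for lineno, line in enumerate(lines, start=1):
        for token in line.split():         # tokens are separated by whitespace
            word, *tags = token.split("+") # first part is the word, rest are tags
            if wanted in tags:             # exact tag match, not a substring test
                yield lineno, word


# A few lines cut down from the question's data
sample = [
    "küresel+Adj düzey+Noun+Loc karşı+Adj+P3sg+Loc\n",
    "club+Noun toplantı+Noun+A3pl+P3sg\n",
    "nispi+Adj\n",
]

for lineno, word in words_with_tag(sample, "+Adj"):
    print("{:<15} {}".format(word, lineno))
```

Because an open file yields its lines when iterated, a file object opened on deneme.txt could be passed as lines instead of the sample list. Comparing against the list of tags, rather than testing for a substring of the token, means a tag only matches when it is spelled out in full.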