打印文件中所有标有形态标签的标记

问题描述 投票:0回答:2

我想打印文件中所有带有形态标签的标记。到目前为止,我编写了如下所示的代码。

def index(filepath, string):

    import re
    pattern = re.compile(r'(\w+)+')
    StringList = []
    StringList.append(string)

    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            words = set(m.group(1) for m in pattern.finditer(line))
            matches = [keyword for keyword in StringList if keyword in words]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)

    StringList.clear()



index('deneme.txt', '+Noun')

输出是这样的,我可以在标记和行号中找到名词,但无法打印我想要的部分。我只想要+号之前的单词部分。

Noun            1
Noun            2
Noun            3
Noun            4
Noun            5
Noun            6
Noun            7

我的文件中的行是这样的:

Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc oluş+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karşı+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylaş+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num aşama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc 
Türkiye+Noun+Gen ekonomik+Adj ve+Conj insani+Adj potansiyel+Noun+P3sg ,+Punc güç+Noun+With savun+Verb+Inf2 kapasite+Noun+P3sg ,+Punc ulus+Noun+A3pl+InBetween çatış+Verb+Inf2+A3pl+Gen önle+Verb+Pass+Inf2+P3sg ve+Conj barış+Noun+P3sg inşa+Noun çaba+Noun+A3pl+P3sg+Dat aktif+Adj katılım+Noun+P3sg+Gen yanısıra+PostpPCGen ,+Punc fark+Noun+With kültür+Noun ve+Conj gelenek+Noun+A3pl+Dat ait+PostpPCDat seçkin+Adj özellik+Noun+A3pl+Acc birleş+Verb+Caus+PresPart bir+Num bünye+Noun+Dat sahip+Noun ol+Verb+Inf2+P3sg ,+Punc kendi+Pron+P3sg bölge+Noun+P3sg+Loc ve+Conj öte+Noun+P3sg+Loc önem+Noun+With rol+Noun oyna+Verb+Inf2+P3sg+Acc sağla+Verb+Fut değer+Noun+With özellik+Noun+A3pl+Cop .+Punc 
Türkiye+Noun ,+Punc bu+Det önem+Noun+With katkı+Noun+Acc yap+Verb+Able+Inf1 için+PostpPCGen yeterli+Adj donanım+Noun+P3sg haiz+Adj bir+Num ülke+Noun+Cop ve+Conj gelecek+Noun nesil+Noun+A3pl için+PostpPCGen daha+Noun i+Noun+Acc bir+Num dünya+Noun oluş+Verb+Caus+Inf1 amaç+Noun+P3sg+Ins ,+Punc dost+Noun+A3pl+P3pl ve+Conj müttefik+Adj+A3pl+P3sg+Ins yakın+Noun bir+Num biçim+Noun+Loc çalış+Verb+Inf2+Dat devam+Noun et+Verb+Fut+Cop .+Punc 
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj ilişki+Noun+A3pl 
club+Noun toplantı+Noun+A3pl+P3sg 
Türkiye+Noun -+Punc At+Noun gümrük+Noun işbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlaşma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklık+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj geliş+Verb+Inf2+P3sg+Acc sağla+Verb+Inf1 üzere+PostpPCNom ortaklık+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayılı+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc 
club+Noun toplantı+Noun+A3pl+P3sg 
nispi+Adj 
nisbi+Adj 
görece+Adj+With 
izafi+Adj 
obur+Adj 

例如,当我编写标签时,我想获取令牌。 例如,当我写 +Adj 时,我想获取包含 +Adj 的所有标记(nispi、izafi ....(例如))。

python regex nlp morphological-analysis
2个回答
1
投票

我认为,您如何使用正则表达式的概念需要一些改进。

请注意,每个输入行包含许多“标记”,例如

terörizm+Noun+Gen
。 正如你所看到的,它包含:

  • 第一个单词 - 文本中的实际单词,
  • 许多分类符号,每个符号前面都有一个
    +
    字符。

所以:

  • 每行应该被分成标记,在一系列空白字符上,
  • 每个标记应在
    +
    字符上拆分为单词,
  • 这些词中的第一个词是“实际”词,
  • 其余单词(不含
    +
    )是分类符号。

去除终止空白字符是一个好习惯(至少

\n
)。

另请注意,您的代码包含

StringList
,因此您知道 该函数可能会查找多个中的一个或多个 分类词。

我的编程方式略有不同:

  • 第二个参数(
    lookFor
    )是单词的list,即 转换成一套(
    lookForSet
    )。
  • 单词集合(token分裂的结果,减去第一个单词) 也转换成一套。

是否打印单词(令牌中的第一个单词)的决定基于 是否至少可以在

lookForSet
中找到其分类符号之一。 换句话说 -
lookForSet
wordSet
是否有一些 公共元素(集合交集)。

所以整个脚本如下所示:

import re

def index(fileName, lookFor):
    lookForSet = set(lookFor)  # Set of classification symbols to look for
    pat1 = re.compile(r'\s+')  # Regex to split line into tokens
    pat2 = re.compile(r'\+')   # Regex to split a token into words
    with open(fileName) as f:
        for lineNo, line in enumerate(f, start=1):
            line = line.rstrip()
            tokens = pat1.split(line)
            for token in tokens:
                words = pat2.split(token)
                word1 = words.pop(0)  # Initial word
                wordSet = set(words)  # Classification words
                commonWords = lookForSet.intersection(wordSet)
                if commonWords:
                    print("{:3}: {:<15} {}".format(lineNo, word1, ', '.join(commonWords)))

index('lines.txt', ['Noun', 'Gen'])

它的一个输出,用于我的输入数据(稍微缩短的版本) 如下:

1: Türkiye         Noun
1: terörizm        Noun, Gen
1: kitle           Noun
1: imha            Noun
2: Türkiye         Noun, Gen
2: potansiyel      Noun

它包含:

  • 源行数,
  • 令牌的第一个词,
  • 在此标记中找到了
    lookFor
    中的哪些分类词。

1
投票

拆分

\w+
删除了您要查找的内容中的
+
部分,因此我改为拆分其间的空格。然后,这只是将
for
in
调整为列表理解的正确顺序的情况。

def index(filepath, string):
    StringList = [string]

    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            words = line.split(' ')
            matches = [word for keyword in StringList for word in words if keyword in word]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)


index('deneme.txt', '+Adj')

这导致了结果:

küresel+Adj,karşı+Adj+P3sg+Loc,samimi+Adj 1
ekonomik+Adj,insani+Adj,aktif+Adj,seçkin+Adj 2
yeterli+Adj,haiz+Adj,müttefik+Adj+A3pl+P3sg+Ins 3
kurumsal+Adj    4
sayılı+Adj      6
nispi+Adj       8
nisbi+Adj       9
görece+Adj+With 10
izafi+Adj       11
obur+Adj        12

我删除了行

StringList.clear()
,因为它不知何故给出了错误。

适用于 Python 2.7 和 3.6+,尽管文本中的扩展 Unicode 字符会导致使用 2.7 时无法对齐。

© www.soinside.com 2019 - 2024. All rights reserved.