How do I sentence-tokenize a list of strings while keeping track of which strings make up each sentence?


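I have a list of strings like the one below (obtained by running OCR on a PDF). For each string in the list, I also have its position coordinates in the PDF.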

["Much of Singapore's infrastructure had been destroyed during the war",
"including those needed to supply utilities.",
"A shortage of food led to malnutrition, disease,",
" and rampant crime and violence.",
"A series of strikes in 1947 caused massive",
"stoppages in public transport and other services."]

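I want to sentence-tokenize the "concatenated list" (the block of text I get by joining every item in the list), but I also want to keep track of which items in the list make up each sentence.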

Currently, I simply join every item in the list --> feed the result to nltk/spacy --> get the sentences. I get the following result:

["Much of Singapore's infrastructure had been destroyed during the war, including those needed to supply utilities.",
"A shortage of food led to malnutrition, disease, and rampant crime and violence.",
"A series of strikes in 1947 caused massive stoppages in public transport and other services."]

But I have not found a way to tell which items in the list make up each sentence.
I basically need a mapping like this:

{sentence 1: (line 1, line 2), sentence 2: (line 3, line 4)}

I need this mapping so that the coordinate information (mentioned above) is preserved when I go from lines --> sentences.

The OCR service I am using is Azure Form Recognizer.

python nlp spacy tokenize
1 Answer

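A naive approach would be to iterate over the fragments, look for punctuation marks that could end a sentence, and build a list of lists. From there you can build a dictionary if you still feel you need one.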

import json

data = [
    "Much of Singapore's infrastructure had been destroyed during the war",
    "including those needed to supply utilities.",
    "A shortage of food led to malnutrition, disease,",
    " and rampant crime and violence.",
    "A series of strikes in 1947 caused massive",
    "stoppages in public transport and other services."
]

## ============================
## sentences as a list of list of fragments
## ============================
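data2 = []
current = []
for row in data:
    current.append(row)
    # rstrip() tolerates trailing OCR whitespace and empty fragments
    if row.rstrip().endswith((".", "?", "!")):
        data2.append(current)
        current = []
if current:
    # keep trailing fragments that never reached a sentence terminator
    data2.append(current)
print(json.dumps(data2, indent=4))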
## ============================

## ============================
## sentences as a dictionary of list of fragments
## ============================
data2 = {
    f"sentence {i}": fragments
    for i, fragments
    in enumerate(data2, start=1)
}
print(json.dumps(data2, indent=4))
## ============================

That should give you:

[
    [
        "Much of Singapore's infrastructure had been destroyed during the war",
        "including those needed to supply utilities."
    ],
    [
        "A shortage of food led to malnutrition, disease,",
        " and rampant crime and violence."
    ],
    [
        "A series of strikes in 1947 caused massive",
        "stoppages in public transport and other services."
    ]
]
{
    "sentence 1": [
        "Much of Singapore's infrastructure had been destroyed during the war",
        "including those needed to supply utilities."
    ],
    "sentence 2": [
        "A shortage of food led to malnutrition, disease,",
        " and rampant crime and violence."
    ],
    "sentence 3": [
        "A series of strikes in 1947 caused massive",
        "stoppages in public transport and other services."
    ]
}
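
The approach above assumes every sentence ends at the end of a fragment. If a single OCR line can contain the end of one sentence and the start of the next, one alternative is to record the character offset at which each fragment starts in the joined text, and then map each sentence's character span back to fragment indices. The following is only a minimal sketch under stated assumptions: `join_with_offsets`, `regex_sentence_spans`, and `lines_for_span` are hypothetical helper names, and the regex splitter is a crude stand-in for a real tokenizer. With spaCy the spans could instead come from `sent.start_char` and `sent.end_char` on `doc.sents`, and with NLTK from `PunktSentenceTokenizer.span_tokenize`.

```python
import re
from bisect import bisect_right

lines = [
    "Much of Singapore's infrastructure had been destroyed during the war",
    "including those needed to supply utilities.",
    "A shortage of food led to malnutrition, disease,",
    " and rampant crime and violence.",
    "A series of strikes in 1947 caused massive",
    "stoppages in public transport and other services.",
]


def join_with_offsets(fragments, sep=" "):
    """Join fragments and record where each one starts in the joined text."""
    starts = []
    pos = 0
    for fragment in fragments:
        starts.append(pos)
        pos += len(fragment) + len(sep)
    return sep.join(fragments), starts


def regex_sentence_spans(text):
    """Stand-in tokenizer (an assumption): yield (start, end) spans of sentences."""
    pattern = r"\S.*?(?:[.?!](?=\s|$)|$)"
    for match in re.finditer(pattern, text, flags=re.S):
        yield match.start(), match.end()


def lines_for_span(starts, start, end):
    """Return the indices of every fragment overlapped by text[start:end]."""
    first = bisect_right(starts, start) - 1
    last = bisect_right(starts, end - 1) - 1
    return list(range(first, last + 1))


text, starts = join_with_offsets(lines)
mapping = [
    lines_for_span(starts, start, end)
    for start, end in regex_sentence_spans(text)
]
print(mapping)  # [[0, 1], [2, 3], [4, 5]]
```

The indices are 0-based positions in the original list, so they can be used to look up the coordinates that the OCR service returned for each line.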