将数据从 PDF 文本文件提取到 Excel 中有时会分割句子。
按照上一篇文章,使用 PQ 我可以将文本分组在一起,然后使用正则表达式识别属于句子末尾的
.
并使用它进行拆分。这非常有效,除非副标题可能不带标点符号。这会导致句子偶尔连接在一起的小错误。例如
SUBHEADING 1
Non-sensical
sentence 1. Sentence 2.
返回为:
SUBHEADING 1 Non-sensical sentence 1.
Sentence 2.
我最初认为这是最好的办法,但是,到目前为止,重点是用
.
识别句子的结尾。此外,我们还可以利用句子往往以大写字母开头的事实进一步完善这一点。 Excel 解释 PDF 页面的方式意味着需要事先进行转换来对文本进行分组,我非常确定排除使用正则表达式。
这是我正在尝试做的事情(橙色)和我目前所在位置(绿色)的示例:
注意每个句子如何以大写字母开头并以
.
结尾。我认为这是将副标题/章节等与文本其余部分分开的合乎逻辑的方法。由此,我可以将剩余的转换应用于上一篇文章,如果我是对的,这应该会提供更好的分离。
到目前为止:
我已经确定了哪些句子以大写字母/小写字母开头,哪些句子以小写字母或
.
结尾(以数字结尾也被视为小写字母)。使用这个,我认为可以安全地假设:
如果一行以小写结尾并且下一行也包含小写, 这属于同一个句子,可以追加。
如果句子以大写开头并以句号结尾,则这是一个 完整的句子,无需更改。
如果句子以大写开头,末尾没有
.
,然后下一步
是大写的;这是一个副标题/额外信息,不是典型的
句子。
我认为可以使用这些规则生成所需的表。我无法确定按照上述规则附加句子所需的转换。我需要某种方法来识别上/下/。上面和下面的行,我只是看不出如何。
我会随着进展进行更新;但是,如果有人可以评论如何使用这些规则来实现这一目标,那就太好了。
注意当前 M 代码有一个错误,其中 3d.被识别为大写。
M代码:
let
Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
#"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.Start([Column1], 1) = Text.Lower(Text.Start([Column1], 1)) then "Lowercase" else "Uppercase"),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "Custom.1", each if Text.End([Column1], 1) = "." then Text.End([Column1], 1) else if Text.End([Column1], 1) = Text.Lower(Text.End([Column1], 1)) then "Lowercase" else null)
in
#"Added Custom1"
来源数据:
SUBHEADING 1
Non-sensical
sentence 1. Sentence 2.
Sentence 3 part 3a,
part 3b, and
part 3c.
SUBHEADING 1.1.
Non-sensical
sentence 4. Sentence 5.
Sentence 6 part 3a,
part 3b, 3c and
3d.
2.0. SUBHEADING
Extra Info 1 (Not a proper sentence)
Extra Info 2
Sentence 7.
SUBHEADING 3.0
Sentence 8.
到目前为止的解决方案(如果副标题以数字开头,则有小错误)
let
Source = Excel.CurrentWorkbook(){[Name="Table5"]}[Content],
#"Added Custom" = Table.AddColumn(Source, "Start", each if Text.Start([Column1], 1) = Text.Lower(Text.Start([Column1], 1)) then "Lowercase" else "Uppercase"),
#"Added Custom1" = Table.AddColumn(#"Added Custom", "End", each if Text.End([Column1], 1) = "." then Text.End([Column1], 1) else if Text.End([Column1], 1) = Text.Lower(Text.End([Column1], 1)) then "Lowercase" else null),
#"Added Custom4" = Table.AddColumn(#"Added Custom1", "Complete Sentences", each if [Start] = "Uppercase" and [End] = "." then "COMPLETE" else null),
#"Added Index" = Table.AddIndexColumn(#"Added Custom4", "Index", 0, 1, Int64.Type),
#"Added Custom2" = Table.AddColumn(#"Added Index", "StartCheckCaseBelow", each try #"Added Index" [Start] { [Index] + 1 } otherwise "COMPLETE"),
#"Removed Columns1" = Table.RemoveColumns(#"Added Custom2",{"Index"}),
#"Added Custom3" = Table.AddColumn(#"Removed Columns1", "CompleteHeadings_CompleteText", each if [StartCheckCaseBelow]= "COMPLETE" then "COMPLETE" else if [Start] = "Uppercase" and [StartCheckCaseBelow] = "Uppercase" then "COMPLETE" else null),
#"Added Custom5" = Table.AddColumn(#"Added Custom3", "Custom", each if [CompleteHeadings_CompleteText] = null then if [Start] = "Uppercase" and [End] = "Lowercase" then "START" else if [End] = "." then "END" else null else null),
#"Added Index1" = Table.AddIndexColumn(#"Added Custom5", "Index", 0, 1, Int64.Type),
#"Merged Columns" = Table.CombineColumns(Table.TransformColumnTypes(#"Added Index1", {{"Index", type text}}, "en-GB"),{"Index", "CompleteHeadings_CompleteText"},Combiner.CombineTextByDelimiter(" ", QuoteStyle.None),"CompleteHeadings_CompleteText"),
#"Added Index3" = Table.AddIndexColumn(#"Merged Columns", "Index", 0, 1, Int64.Type),
#"Filtered Rows" = Table.SelectRows(#"Added Index3", each ([Custom] = "START")),
#"Added Index2" = Table.AddIndexColumn(#"Filtered Rows", "temp", 0, 1, Int64.Type),
combined = #"Added Index2" & Table.SelectRows(#"Added Index3", each [Custom] <> "START"),
#"Sorted Rows" = Table.Sort(combined,{{"Index", Order.Ascending}}),
#"Removed Columns" = Table.RemoveColumns(#"Sorted Rows",{"Index"}),
#"Added Custom6" = Table.AddColumn(#"Removed Columns", "Custom.1", each if Text.Contains([CompleteHeadings_CompleteText], "COMPLETE") then [CompleteHeadings_CompleteText] else if [Complete Sentences] <> null then [Complete Sentences] else [temp]),
#"Filled Down" = Table.FillDown(#"Added Custom6",{"Custom.1"}),
#"Removed Other Columns" = Table.SelectColumns(#"Filled Down",{"Column1", "Custom.1"}),
#"Grouped Rows" = Table.Group(#"Removed Other Columns", {"Custom.1"}, {{"Text", each Text.Combine([Column1], " "), type text}}),
#"Removed Other Columns1" = Table.SelectColumns(#"Grouped Rows",{"Text"})
in
#"Removed Other Columns1"
这可以实现橙色表格,但需要许多步骤,需要简化。如果您可以就此提供建议或知道更好的方法,请告诉我。
let
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
#"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.Start([Column1], 1) = Text.Lower(Text.Start([Column1], 1)) then " " else "| "),
#"Merged Columns" = Table.CombineColumns(#"Added Custom",{ "Custom", "Column1"},Combiner.CombineTextByDelimiter("", QuoteStyle.None),"Merged"),
#"Transposed Table" = Table.Transpose(#"Merged Columns"),
Custom1 = Table.ColumnNames( #"Transposed Table"),
Custom2 = #"Transposed Table",
#"Merged Columns1" = Table.CombineColumns(Custom2,Custom1,Combiner.CombineTextByDelimiter("", QuoteStyle.None),"Merged"),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Merged Columns1", {{"Merged", Splitter.SplitTextByDelimiter("| ", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Merged"),
#"Split Column by Delimiter1" = Table.ExpandListColumn(Table.TransformColumns(#"Split Column by Delimiter", {{"Merged", Splitter.SplitTextByDelimiter(". ", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Merged"),
#"Changed Type" = Table.TransformColumnTypes(#"Split Column by Delimiter1",{{"Merged", type text}})
in
#"Changed Type"
有兴趣看看是否有更优雅的解决方案。