Extract a specific portion of text from a PDF using JavaScript?


I need to make a modification. I am using this code to extract all the text from a PDF:

<!-- edit this; the PDF file must be on the same domain as this page -->
<iframe id="input" src="your-file.pdf"></iframe>

<!-- embed the pdftotext service as an iframe -->
<iframe id="processor" src="http://hubgit.github.com/2011/11/pdftotext/"></iframe>

<!-- a container for the output -->
<div id="output"></div>

<script>
var input = document.getElementById("input");
var processor = document.getElementById("processor");
var output = document.getElementById("output");

// listen for messages from the processor
window.addEventListener("message", function(event){
  if (event.source != processor.contentWindow) return;

  switch (event.data){
    // "ready" = the processor is ready, so fetch the PDF file
    case "ready":
      var xhr = new XMLHttpRequest();
      xhr.open('GET', input.getAttribute("src"), true);
      xhr.responseType = "arraybuffer";
      xhr.onload = function(event) {
        processor.contentWindow.postMessage(this.response, "*");
      };
      xhr.send();
    break;

    // anything else = the processor has returned the text of the PDF
    default:
      output.textContent = event.data.replace(/\s+/g, " ");
    break;
  }
}, true);
</script>

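The output is packed text without any paragraph breaks. All of my PDF files contain the word "Datacover" near the beginning, followed by a large paragraph.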

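All I want to do is remove all the text from the start up to the first instance of the word 'Datacover', and also remove all the text that comes after the third instance of '. ' (a period followed by a space), counting from the word 'Datacover'.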

Can you help? Thanks!

text extract
1 Answer

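You can match Datacover between word boundaries \b, then repeat 3 times a non-greedy match of any character, including newlines, with [\s\S]*? up to the next occurrence of a dot followed by a space, \. (note the trailing space in the pattern).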

\bDatacover\b(?:[\s\S]*?\. ){3}
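The pattern can be read piece by piece. As a minimal sketch (the function name buildSectionRegex and its parameters are hypothetical, not part of the answer), the same expression can be assembled from its three parts, which also makes the keyword and the number of sentences adjustable:

```javascript
// Hypothetical helper: assembles the pattern above from its parts.
function buildSectionRegex(keyword, count) {
  // Escape regex metacharacters so the keyword is matched literally.
  const escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const start = "\\b" + escaped + "\\b"; // the keyword as a whole word
  const sentence = "(?:[\\s\\S]*?\\. )"; // shortest run of any characters ending in ". "
  return new RegExp(start + sentence + "{" + count + "}");
}

const sectionRegex = buildSectionRegex("Datacover", 3);
console.log(sectionRegex.source); // prints: \bDatacover\b(?:[\s\S]*?\. ){3}
```

Because the quantifier inside the group is non-greedy, each repetition stops at the nearest ". ", so the whole match ends at the third one after the keyword.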


To get the data, you can use

event.data.match(regex)

For example:

const regex = /\bDatacover\b(?:[\s\S]*?\. ){3}/g;
let event = {
  data: `testhjgjhg hjg jhg jkgh kjhghjkg76t 76 tguygtf yr 6 rt6 gtyut 67 tuy yoty yutyu tyu yutyuit iyut iuytiyu tuiyt Datacover uytuy tuyt uyt uiytuiyt uytutest.
yu tuyt uyt uyt iutiuyt uiy
 yuitui tuyt
test. 
 uiyt uiytuiyt
 uyt ut ui
this is a test. 
sjhdgfjsa. 
hgwryuehrgfhrghw fsdfdfsfs sddsfdfs.`
};

console.log(event.data.match(regex));
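Two details matter when this is wired into the question's message handler. First, match returns null when the text has no matching section, and with the g flag it returns an array of every match instead of a single one. Second, the handler collapses whitespace with replace(/\s+/g, " "); if that runs before the match, a period at the end of a line also becomes ". " and is counted, which can change which period is the third. The sketch below assumes that collapsing first is the desired behavior; the helper name extractSection is hypothetical.

```javascript
// No g flag: match() returns a single match (or null) rather than a list.
const datacoverRegex = /\bDatacover\b(?:[\s\S]*?\. ){3}/;

// Hypothetical helper: returns the wanted section as a single line,
// or null when the text does not contain it.
function extractSection(rawText) {
  // Assumption: collapse whitespace first, as the question's handler does,
  // so that a period at the end of a line also counts as ". ".
  const text = rawText.replace(/\s+/g, " ");
  const match = text.match(datacoverRegex);
  return match ? match[0].trim() : null;
}

// Possible use in the default branch of the message handler:
//   output.textContent = extractSection(event.data) || "";

console.log(extractSection("intro Datacover one.\ntwo. three. four. "));
// prints: Datacover one. two. three.
```

If the third sentence is the very last thing in the document and is not followed by whitespace, the pattern finds no match, so the null case is worth handling explicitly.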