Reading a PDF from a URL using Selenium WebDriver and PDFBox


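I am trying to read the text of a PDF using Selenium WebDriver and the PDFBox API. If possible, I don't want to download the file; I just want to read the PDF from the web and get its text into a String. The code I'm using is below, but I can't get it to work.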

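I found some examples that download the PDF file and compare against the downloaded copy, but no working example that extracts the PDF text directly from a URL.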

import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import javax.swing.JDialog;
import javax.swing.JOptionPane;
import javax.swing.Timer;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class PDFextract {

    public static void main(String[] args) throws Exception {
        System.setProperty("webdriver.chrome.driver", "C:\\chromedriver.exe");
        WebDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get("THE URL OF SITE I CANT SHARE"); // THE URL OF SITE I CAN'T SHARE
        System.out.println(driver.getTitle());

        // Collect the href of each "Click to open file" link.
        List<WebElement> links = driver.findElements(By.xpath("//a[@title='Click to open file']"));
        String fLinks = "";
        for (WebElement link : links) {
            fLinks = fLinks + link.getAttribute("href");
        }
        fLinks = fLinks.trim();
        System.out.println(fLinks); // up to here the code works fine; I get a valid URL

        // the code below doesn't work
        URL url = new URL(fLinks);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        InputStream is = connection.getInputStream();
        PDDocument pdd = PDDocument.load(is);
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(pdd);
        pdd.close();
        is.close();
        System.out.println(text);
    }
}

The error I get:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 500 for URL: ***AS TOLD ABOVE, I CANT SHARE THE URL***
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
    at PDFextract.main(PDFextract.java:106)
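
One possible cause, which nothing in the post confirms, is that the plain HttpURLConnection request does not carry the cookies of the logged-in browser session, so the server rejects it. Below is a minimal sketch of attaching cookies to the request, assuming that is the issue. The class name SessionRequest and its methods are hypothetical, and the cookie map is assumed to be filled by the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Selenium or PDFBox.
final class SessionRequest {

    // Joins cookie name/value pairs into a single Cookie request-header value.
    static String cookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    // Opens the URL with the cookies attached.
    // Assumption: the server needs the browser session's cookies to serve the file.
    static InputStream open(String pdfUrl, Map<String, String> cookies) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(pdfUrl).openConnection();
        if (!cookies.isEmpty()) {
            connection.setRequestProperty("Cookie", cookieHeader(cookies));
        }
        return connection.getInputStream();
    }
}
```

With Selenium, the map could be filled from driver.manage().getCookies(), using getName() and getValue() of each Cookie, and the returned stream passed to PDDocument.load as in the code above. If the server still answers 500, the cause lies elsewhere.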

Edit, 07.05.2020: @TilmanHausherr, I did more research, and this helped me with the first part, how to read a PDF from a link: Selenium tutorial: Reading PDF content using Selenium WebDriver

This method works:

// Requires, in addition to the imports above: java.io.BufferedInputStream
String pdfContent = readPDFContent(driver.getCurrentUrl());

public String readPDFContent(String appUrl) throws Exception {
    URL url = new URL(appUrl);
    InputStream is = url.openStream();
    BufferedInputStream fileToParse = new BufferedInputStream(is);
    PDDocument document = null;
    String output = null;
    try {
        document = PDDocument.load(fileToParse);
        output = new PDFTextStripper().getText(document);
        System.out.println(output);
    } finally {
        if (document != null) {
            document.close();
        }
        fileToParse.close();
        is.close();
    }
    return output;
}
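
When a URL does not return what is expected, it can help to check whether the response is a PDF at all before handing it to PDFBox, since an HTML viewer or error page would fail to parse. Below is a minimal sketch that peeks at the first bytes of a BufferedInputStream and rewinds it. The class name PdfSniffer is hypothetical, and the check is deliberately strict: it assumes the "%PDF-" header sits at the very start of the stream, which is not true of every PDF.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox.
final class PdfSniffer {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the stream starts with "%PDF-".
    // The stream is rewound afterwards, so it can still be parsed from the start.
    static boolean looksLikePdf(BufferedInputStream in) throws IOException {
        in.mark(PDF_MAGIC.length);
        try {
            byte[] head = new byte[PDF_MAGIC.length];
            int read = in.readNBytes(head, 0, head.length); // requires Java 9 or later
            return read == PDF_MAGIC.length && Arrays.equals(head, PDF_MAGIC);
        } finally {
            in.reset();
        }
    }
}
```

In readPDFContent, this could be called on fileToParse just before PDDocument.load(fileToParse); a false result suggests the URL served something other than the raw PDF.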

It seems my problem is with the link itself: the HTML element is an <embed>, and in my case there is also a "stream-URL".

<embed id="plugin" type="application/x-google-chrome-pdf"
    src="https://"SITE I CAN'T TELL"/file.do?_tr=4d51599fead209bc4ef42c6e5c4839c9bebc2fc46addb11a"
    stream-URL="chrome-extension://mhjfbmdgcfjojefgiehjai/6958a80-4342-43fc-838a-1dbd07fa2fc1"
    headers="accept-ranges: bytes
content-disposition: inline;filename=&quot;online.pdf&quot;
content-length: 71488
content-security-policy: frame-ancestors 'self' https://*"SITE I CAN'T TELL" https://*"DOMAIN I CAN'T TELL".net
content-type: application/pdf

I found these: 1. Downloading a file with Selenium from an embed tag that has a chrome-extension stream-url. 2. Handling the contents of an embed tag in Selenium with Python.

But I still haven't managed to read the PDF with PDFBox, because the element is an <embed> and I may need to access the stream URL.

java selenium-webdriver url pdfbox