How to get the HTML of a loaded page using WebKit in Java

Question

My goal is to parse an Airbnb listing page in Java, for example: https://www.airbnb.com/rooms/28149735

I first tried JSoup, as follows:

String html = Jsoup.connect(webPage).get().html();

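But that doesn't work, because JSoup does not run the page's scripts, so it never renders the content I see when I inspect the loaded page in a browser such as Chrome or Firefox.

So I am now trying to use WebKit with the following code: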

// get the instance of the webkit
BrowserEngine browser = BrowserFactory.getWebKit();
Page page = browser.navigate("https://www.airbnb.com/rooms/28149735");
page.show();

String html = page.getDocument().getBody().getInnerHTML();

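But this doesn't work either: the page loads correctly (I see the logs in the console and the pop-up window is displayed properly), but once the page has loaded I cannot access the HTML (I get a null pointer exception; see the error log below).

When I run the code in debug mode and inspect the page object, the document inside the page shows as "null", which seems to be what causes the error.

So my question is: what am I doing wrong, and how do I get the HTML of the loaded page?

Thanks a lot!

PS: here is the error: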

Exception in thread "JavaFX Application Thread" io.webfolder.ui4j.api.util.Ui4jException: java.lang.NullPointerException
    at io.webfolder.ui4j.webkit.aspect.WebKitAspect$CallableExecutor.run(WebKitAspect.java:41)
    at com.sun.javafx.application.PlatformImpl.lambda$null$172(PlatformImpl.java:295)
    at java.security.AccessController.doPrivileged(Native Method)
    at com.sun.javafx.application.PlatformImpl.lambda$runLater$173(PlatformImpl.java:294)
    at com.sun.glass.ui.InvokeLaterDispatcher$Future.run$$$capture(InvokeLaterDispatcher.java:95)
    at com.sun.glass.ui.InvokeLaterDispatcher$Future.run(InvokeLaterDispatcher.java)
    at com.sun.glass.ui.gtk.GtkApplication._runLoop(Native Method)
    at com.sun.glass.ui.gtk.GtkApplication.lambda$null$48(GtkApplication.java:139)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at io.webfolder.ui4j.webkit.dom.WebKitDocument.getBody_aroundBody12(WebKitDocument.java:74)
    at io.webfolder.ui4j.webkit.dom.WebKitDocument$AjcClosure13.run(WebKitDocument.java:1)
    at io.webfolder.ui4j.internal.aspectj.runtime.reflect.JoinPointImpl.proceed(JoinPointImpl.java:149)
    at io.webfolder.ui4j.webkit.aspect.WebKitAspect$CallableExecutor.run(WebKitAspect.java:39)
    ... 8 more

Answer 1

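I don't know the specifics of scraping AirBNB. But since you say you want to scrape it with JSoup, I can start by explaining that any page that relies on AJAX / JSON / JavaScript to fill in the fields you are looking for simply will not work with JSoup. I have written my own HTML parse-and-search engine, and it works well (see docs/jar). I ran into the same problem when trying to figure out how to handle JavaScript pages.

I know the problem you are facing very well: most hotel websites load their content through AJAX calls, and so do many other sites on the Internet. From what I have read, the current standard canned answer is to use the Selenium package to download pages that rely on JavaScript. I could post a stack of "duplicate answers" about scraping JavaScript pages, but I only want to make three points.

1) Learn about Selenium. It does not do what JSoup does; it actually launches an instance of Google Chrome and tries to interact with it.

2) I just found a neat solution for my own project. I am not going to write a headless browser, because that would take forever, and my HTML search routines work well. However, there is a newer tool called "Splash", which is built in Python but exposes an HTTP interface. I just started it up for my project and scraped a notorious JavaScript page, Wikipedia (for example, the Christopher Columbus article), and it appears to populate the page nicely. You have to start an instance of it; it runs on your machine as a local UNIX service. You can interact with it like this (this is the code I just used):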

// This calls the "Splash Engine" that I just started up
// Read the Documentation on splash.readthedocs.io, because I just
// made it work for my project.  Maybe this is something, I don't know...
String urlStr = "http://localhost:8050/render.html?url=https://en.wikipedia.org/wiki/Christopher_Columbus&timeout=10&wait=0.5";
URL url = new URL(urlStr);

// Plain old vanilla scrape of the Wikipedia page...
String urlStr2 = "https://en.wikipedia.org/wiki/Christopher_Columbus";
URL url2 = new URL(urlStr2);

// This version had much more HTML Content than the next one.
Vector<HTMLNode> v = HTMLPage.getPageTokens(url, false);
FileRW.writeFile(Util.pageToString(v), "cc.html");

// Scrape - does not execute java-script...
Vector<HTMLNode> v2 = HTMLPage.getPageTokens(url2, false);
FileRW.writeFile(Util.pageToString(v2), "cc2.html");

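It seems all the JavaScript-generated entries come through much better than with my plain-old-vanilla HTML scrape...

The commands below are from their website. All I had to do was run them from my Google Cloud Server shell account...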

Pull the image:
$ sudo docker pull scrapinghub/splash
Start the container:
$ sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
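
For anyone who does not have the HTMLPage helper used in my snippet above, here is a rough sketch of the same Splash request using only the JDK (Java 11 or later). It assumes Splash is listening on localhost:8050 as in the docker command; the class and method names (SplashFetch, buildRenderUrl, fetchRenderedHtml) are made up for illustration and are not part of any library. The one thing it does differently from my snippet is percent-encode the target URL, so a target that carries its own query string does not get mixed up with Splash's parameters.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class: a minimal client for Splash's render.html endpoint.
class SplashFetch {

    // Builds the render.html request URL. The url, timeout and wait parameters
    // are the same ones used in the snippet above; the target is percent-encoded.
    static String buildRenderUrl(String splashBase, String targetUrl,
                                 double waitSeconds, int timeoutSeconds) {
        String encodedTarget = URLEncoder.encode(targetUrl, StandardCharsets.UTF_8);
        return splashBase + "/render.html?url=" + encodedTarget
                + "&timeout=" + timeoutSeconds
                + "&wait=" + waitSeconds;
    }

    // Sends the request to Splash and returns the rendered HTML as a string.
    static String fetchRenderedHtml(String renderUrl) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(renderUrl)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Splash returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: Splash was started locally with the docker command above.
        String splashBase = "http://localhost:8050";
        if (args.length == 0) {
            // No target given: only show what the request would look like.
            System.out.println(buildRenderUrl(splashBase,
                    "https://www.airbnb.com/rooms/28149735", 0.5, 10));
            return;
        }
        // Target URL given on the command line: fetch and print the rendered HTML.
        String html = fetchRenderedHtml(buildRenderUrl(splashBase, args[0], 0.5, 10));
        System.out.println(html);
    }
}
```

The returned string can then be handed to whatever parser you already use, for example Jsoup.parse(html), since by that point the scripts have already run inside Splash. Whether wait=0.5 is long enough for a given page is something you would have to find out by trying.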
