HTTP parser: scraping a single-page application: many GETs, how to determine when the page ends


I am trying to parse this site:

https://www.monster.com/jobs/search/?q=java&where=usa&stpage=1

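In essence it is not complicated: it is a single-page application. You give it a keyword, click search, and it shows results, at first only 29 of them. As you scroll down, new results are loaded.

Before loading new results, it sends a GET request to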

https://www.monster.com/jobs/search/pagination/?q=java&where=usa&isDynamicPage=true&isMKPagination=true&page=2&total=26

This returns a JSON response: a list of jobs, where each entry looks like this:

{"Title":"Java Developer","TitleLink":"https://job-openings.monster.com/java-developer-orlando-fl-us-summit-technologies/215193478","DatePostedText":"6 days ago","DatePosted":"2020-01-18T12:00","LocationText":"Orlando, FL, 32801","JobViewUrl":"https://job-openings.monster.com/java-developer-orlando-fl-us-summit-technologies/215193478","ImpressionTracking":"data-m_impr_uuid=\"a7320356-70db-46ca-908e-e540f0e74cec\" data-m_impr_a_placement_id=\"JSR2CW\" data-m_impr_s_t=\"t\" data-m_impr_j_p=\"27\" data-m_impr_j_jpm=\"1\" data-m_impr_j_lat=\"28.5418\" data-m_impr_j_long=\"-81.3736\" data-m_impr_j_jawsid=\"418397617\" data-m_impr_j_postingid=\"b55f4409-3858-483a-a2e9-65e254ec1cd2\" data-m_impr_j_jobid=\"215193478\" data-m_impr_j_cid=\"660\" data-m_impr_j_occid=\"11970\" data-m_impr_j_lid=\"385\" data-m_impr_j_jpt=\"1\" data-m_impr_j_pvc=\"monster\" data-m_impr_j_coc=\"xsummittechx\" ","Company":{"Name":"Summit Technologies","HasCompanyAddress":true,"LogoLink":""},"Text":"Java Developer","ApplyType":"ApplyOnline","IsAggregated":"false","JobViewUrlMeta":"https://job-openings.monster.com/java-developer-orlando-fl-us-summit-technologies/215193478","MusangKingId":"215193478","CompanyLogoUrl":"","PrivateBoardIconImageUrl":"","FitIcon":"","FitIconType":""}


In addition, a POST request is sent to

https://ib.adnxs.com/ut/v3

(the v3 request).

In that request, the value tag_id: 14162549 appears to come from the GET request above.

So each time you scroll down, it sends one GET and one POST request, until at some point it stops sending them: the scrolling ends, and so do the requests.

I don't know how it determines when to stop.

I want to scrape these jobs, and I could do something like send GET requests to

https://www.monster.com/jobs/search/pagination/?q=java&where=usa&isDynamicPage=true&isMKPagination=true&page=N

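But I don't know when to stop. If, say, scrolling stops at &page=12 and I send a request for &page=13, the server does not return empty JSON. Instead it returns some other jobs (probably less relevant ones, which would explain why they are not shown when you scroll to the bottom).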
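The paging loop described above can be sketched as follows. This is only a minimal illustration, not the site's actual cutoff logic, which the question leaves unknown. The class name PaginationSketch, the fetchPage callback and the maxPages cap are all hypothetical. The "no new ids" check is an assumed heuristic and may never fire if later pages keep returning different, less relevant jobs, in which case only the hard cap ends the loop.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;

/** Sketch: request page 1, 2, 3, ... until a stop heuristic fires. */
class PaginationSketch {

    /** Builds the pagination URL from the question for a given page number. */
    static String pageUrl(String query, String where, int page) {
        return "https://www.monster.com/jobs/search/pagination/"
                + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&where=" + URLEncoder.encode(where, StandardCharsets.UTF_8)
                + "&isDynamicPage=true&isMKPagination=true"
                + "&page=" + page;
    }

    /**
     * Collects distinct job ids page by page, starting at page 1.
     * fetchPage is a hypothetical callback that returns the ids found on one page
     * (for example the MusangKingId values parsed from the JSON response).
     * The loop stops on an empty page, on a page that adds no unseen id
     * (an assumed heuristic), or after maxPages requests (a hard safety cap).
     */
    static List<String> collectJobIds(IntFunction<List<String>> fetchPage, int maxPages) {
        Set<String> seen = new LinkedHashSet<>();
        for (int page = 1; page <= maxPages; page++) {
            List<String> ids = fetchPage.apply(page);
            if (ids == null || ids.isEmpty()) {
                break; // nothing returned: treat the listing as exhausted
            }
            boolean addedNew = false;
            for (String id : ids) {
                if (seen.add(id)) {
                    addedNew = true;
                }
            }
            if (!addedNew) {
                break; // only repeats: assume later pages recycle earlier results
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Fake fetcher standing in for the HTTP call: two pages with new ids, then repeats.
        IntFunction<List<String>> fake = page -> {
            if (page == 1) return Arrays.asList("a", "b");
            if (page == 2) return Arrays.asList("b", "c");
            return Arrays.asList("a", "c");
        };
        System.out.println(pageUrl("java", "usa", 2));
        System.out.println(collectJobIds(fake, 50)); // prints [a, b, c]
    }
}
```

In a real run, fetchPage would wrap the OkHttp call and the JSON parsing. Keeping it as a callback means the stop rule can be tried against recorded responses before hitting the site.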

I send the requests with OkHttp, like this:

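import java.lang.reflect.Type;
import java.util.List;
import java.util.Locale;

import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;
import okhttp3.HttpUrl;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

// getUrl() and MonsterJobJson are defined elsewhere (not shown here).
HttpUrl url = HttpUrl.parse(getUrl()).newBuilder()
        .addQueryParameter("page", "1")
        .build();

Request request = new Request.Builder()
        .url(url)
        .addHeader("Content-Type", "application/json; charset=utf-8")
        .addHeader("Accept-Language", Locale.US.getLanguage())
        .build();

OkHttpClient client = new OkHttpClient();
String responseBody;
// try-with-resources closes the response so the connection is released
try (Response response = client.newCall(request).execute()) {
    responseBody = response.body().string();
}
System.out.println(responseBody);

// The response is a JSON array, so deserialize into a List via TypeToken
Type listType = new TypeToken<List<MonsterJobJson>>() {}.getType();
List<MonsterJobJson> resultMonster = new Gson().fromJson(responseBody, listType);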
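The MonsterJobJson class is not shown in the question. A hypothetical minimal version, covering only a subset of the keys in the JSON sample above, might look like the sketch below. It assumes Gson's default field naming, which matches JSON keys to Java field names exactly, hence the capitalized field names. The nested CompanyJson type and the summary() helper are illustrative additions, not part of the original code.

```java
/** Hypothetical shape for MonsterJobJson, mirroring a subset of the JSON keys. */
class MonsterJobJson {
    // Field names are spelled exactly like the JSON keys so that
    // Gson's default (identity) naming policy can bind them.
    String Title;
    String TitleLink;
    String DatePostedText;
    String DatePosted;
    String LocationText;
    String JobViewUrl;
    CompanyJson Company;
    String MusangKingId;

    /** Nested object found under the "Company" key. */
    static class CompanyJson {
        String Name;
        boolean HasCompanyAddress;
        String LogoLink;
    }

    /** Illustrative helper: one-line description of the job. */
    String summary() {
        String company = (Company == null || Company.Name == null)
                ? "unknown company"
                : Company.Name;
        return Title + " at " + company + " (" + LocationText + ")";
    }
}
```

MusangKingId looks like a per-job identifier in the sample, so it is a natural candidate for de-duplicating results across pages, though the question does not confirm that it is unique.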


javascript java web-scraping xmlhttprequest okhttp
1 answer
0
votes

Not enough reputation to comment.
