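I'm trying to scrape this website. You have to click the magnifying-glass icon in the search bar to see the records I want to extract. The problem is that the site is dynamic: I need to scroll many times to load the whole page before I can extract the content with rvest or BeautifulSoup. So far, none of the scrolling approaches suggested in other threads has worked for me.
I would appreciate a solution in either R or Python, using any package or library.
I tried: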
remDr$executeScript("window.scrollTo(0,document.body.scrollHeight);")
where remDr is the browser session on the page shown after clicking the magnifying-glass icon.
I also tried to target the search results directly: I inspected the page and extracted the XPath that leads to the list of items:
search_results <- remDr$findElement( using = 'xpath', '//*[@id="search-feature-container"]/div[2]/div[2]/div[3]/div[2]/div[1]' )
Then I ran this line, but nothing scrolled at all :(
search_results$sendKeysToElement(list(key = "down"))
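For comparison, scrolling-based approaches usually repeat the scroll until the page height stops growing, since a single `scrollTo` call typically triggers only one more batch of records. Below is a minimal sketch of that loop in Python. It assumes a Selenium-style `driver` object with an `execute_script` method, and that the page itself (not an inner container) is the scrollable element, which may not hold for this site. `scroll_until_stable`, `pause`, and `max_rounds` are hypothetical names.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops changing.

    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        # Jump to the current bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to request and render the next batch
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded, so assume the end was reached
            return rounds
        last_height = new_height
    return max_rounds
```

With a real browser this would be called as `scroll_until_stable(driver)` once the search results are showing. The `pause` value is a guess and may need tuning for slow connections.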
The information is loaded into the page dynamically through XHR calls, which you can see in the browser's Dev Tools under the Network tab.
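To see what such an XHR call is made of, it helps to split its query string into named parameters. The sketch below uses only the standard library. The URL is the search request used in this answer, with the `facet` parameters left out for brevity; that trimming is my assumption, on the understanding that facets only add aggregate counts to the response. `describe_query` is a hypothetical helper name. The host name and the `$top`, `$skip`, `$filter`, and `$count` parameters look like an Azure Search index query, where `$top` is the page size and `$skip` the offset.

```python
from urllib.parse import parse_qs, urlsplit

# Search request as seen in the Network tab, facet parameters omitted (assumption)
URL = (
    "https://vivli-prod-cus-srch.search.windows.net/indexes/studies/docs"
    "?api-key=C8237BFE70B9CC48489DC7DD84D88379&api-version=2016-09-01"
    "&$top=1000&$skip=0&search=*"
    "&$filter=assignedAppType%20eq%20%27Default%27&$count=true"
)


def describe_query(url):
    """Return (host, path, params), with percent-encoding decoded."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.netloc, parts.path, params


host, path, params = describe_query(URL)
print(host, path)
for name, value in params.items():
    print(f"{name:12} {value}")
```

Read this way, paging through the records only requires changing `$skip` between requests.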
Here is one way to get the data for all studies:
import requests
import pandas as pd

# Show every column and the full cell contents when printing
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

big_df = pd.DataFrame()
s = requests.Session()
s.headers.update(headers)

# Request the records in pages of 1000 ($top), moving the offset ($skip) each time
for x in range(0, 8000, 1000):
    r = s.get(f'https://vivli-prod-cus-srch.search.windows.net/indexes/studies/docs?api-key=C8237BFE70B9CC48489DC7DD84D88379&api-version=2016-09-01&$top=1000&$skip={x}&search=*&$filter=assignedAppType%20eq%20%27Default%27&$count=true&facet=studyDesign&facet=locationsOfStudySites,count:300,sort:value&facet=sponsorType&facet=contributorType&facet=sponsorName,count:500,sort:value&facet=studyType&facet=actualEnrollment,interval:100')
    # The records of each page are under the 'value' key of the JSON response
    df = pd.json_normalize(r.json()['value'])
    big_df = pd.concat([big_df, df])
print(big_df)
Output of print(big_df) in the terminal (first two rows only; the DataFrame has 7K+ records):
@search.score id title sponsorProtocolId orgId orgCode orgName irpOrgName sponsorName overrideDisplayDefaults nctId secondaryIds acronym participantTermCodes participantTerms interventionTermCodes interventionTerms outcomeTermCodes outcomeTerms searchParticipantTermCodes searchOutcomeTermCodes searchInterventionTermCodes actualEnrollment locationsOfStudySites studyType studyDesign principalInvestigator studyStartDate studyEndDate sponsorType contributorType studyDoi phase conditions interventionNames outcomeNames extractedConditions extractedInterventions antimicrobials groupingsOfResistancePatterns organisms specimenSources sampleTimes countries regions yearsDataCollected containsPediatrics containsGenotype assignedAppType numberOfIsolates program lastUpdatedDate
0 1.0 abd778c4-21ed-4063-9e34-e3e7b177db18 A Randomized, Double-Blind, Parallel-Group, Dose-Response Study to Evaluate the Efficacy and Safety of Two Doses of Topiramate Compared to Placebo and Propranolol in the Prophylaxis of Migraine CR003205 d1bd067d-3e2d-43b5-80f1-6235e85c2876 JNJ Johnson & Johnson Yoda Project Johnson & Johnson Pharmaceutical Research & Development, L.L.C. N NCT00236561 [] [lr5qxyw6ww35, kk05h7rpym8w, kk05h7rpym8x, kk05h7rpym8y, kk05h7rpym8z, kk05h7rpym90, kk05h7rpym91, r4hp3896n2zy] [Male and Female, Child 6-12 years, Adolescent 13-18 years, Young Adult 19-24 years, Adult 19-44 years, Middle Aged 45-64 years, Aged 65-79 years, Migraine] [kn3ptfq7c6lz, r4hp0qywwn28, 11g43clqdpk96, r4hp0r5sbtj7, q25gz0m8n54j, r4hp0r2dwmn5] [Pharmacological, Topiramate, Oral, Propranolol, No active treatment, Placebo] [q25g9q497cwj, r4hp3896n2zy, r4hp5zkjq0c3, ZxM7N2m9kOhRe2] [Physiological or clinical, Migraine, Evaluating Response To Treatment, Assessment Of Quality Of Life] [lr5qxyw6ww35, kk05h7rpym8w, pwhpjmwdbgkh, kk05h7rpym8x, kk05h7rpym8y, pwhpjmwdbgkg, kk05h7rpym8z, kk05h7rpym90, kk05h7rpym91, pwhpjmwdbgkf, r4hp3896n2zy, r4hp3p8ymhbg, r4hp38gs74r1, r4hp3885vk99, r4hp38mgkgb9, r4hp39w4k8tw, r4hp38c875ch, r4hp38mgkgj7, r4hp38xpp96f, r4hp3853gyf1, r4hp38l4pbqh, r4hp39krwnf7, r4hp38qpgvxq, r4hp387wrzbr, r4hp38mrn1cp, r4hp39tp4ckr, r4hp38819rxs, r4hp39mjd4qj, r4hp39cb1vjv] [q25g9q497cwj, r4hp3896n2zy, r4hp3p8ymhbg, r4hp38gs74r1, r4hp3885vk99, r4hp38mgkgb9, r4hp39w4k8tw, r4hp38c875ch, r4hp38mgkgj7, r4hp38xpp96f, r4hp3853gyf1, r4hp38l4pbqh, r4hp39krwnf7, r4hp38qpgvxq, r4hp387wrzbr, r4hp38mrn1cp, r4hp39tp4ckr, r4hp38819rxs, r4hp39mjd4qj, r4hp39cb1vjv, r4hp5zkjq0c3, r4hp5zjccp22, r4hp5zjccp1z, r4hp5zm4npzj, r4hp5zhs6j1c, zPNWxozYM3fxBr, r4hp5zjng89p, r4hp5yw4mj85, ZxM7N2m9kOhRe2, 3BgZRR0YwkHzkP] [kn3ptfq7c6lz, r4hp0qywwn28, r4hp13n1ty7w, r4hp13rf9486, 11g43clqdpk96, r4hp0r5sbtj7, zrcts8tmxp0g, r4hp13n1ty7r, r4hp13mrrc91, r4hp13mrrc8c, r4hp13mgns4j, r4hp13mrrc83, 
q25gz0m8n54j, r4hp0r2dwmn5, PXmmxKGR3ocNEg] 786 [] Interventional ParallelGroup 2001-04-01T00:00:00Z 2002-12-31T00:00:00Z Industry Unassigned https://doi.org/10.25934/00004657 Phase3 [Migraine] [Topiramate, Propranolol, Placebo] [Migraine, Evaluating Response To Treatment, Migraine, Assessment Of Quality Of Life] [Migraine, Common Migraine, Classic Migraine, Headache] [topiramate, propranolol] [] [] [] [] [] [] [] [] None None Default 0
1 1.0 48c15b9e-76d7-45cc-a044-6c253da74ac1 A Phase 3, Randomized, Open-label, Parallel-group, Multicenter Trial to Evaluate the Safety and Efficacy of Infliximab (REMICADE®) in Pediatric Subjects With Moderately to Severely Active Ulcerative Colitis CR012388 d1bd067d-3e2d-43b5-80f1-6235e85c2876 JNJ Johnson & Johnson Yoda Project Centocor, Inc. N NCT00336492 [C0168T72] [lr5qxyw6ww35, kk05h7rpym8v, kk05h7rpym8w, kk05h7rpym8x, r4hp3q5y2klm] [Male and Female, Child, Preschool 2-5 years, Child 6-12 years, Adolescent 13-18 years, Acute Ulcerative Colitis] [kn3ptfq7c6lz, r4hp13l4sngc, 11g43clqdpk72] [Pharmacological, Infliximab, Intravenous] [q25g9q497cwj, r4hp5zkjq0c3, r4hp5zfl2n7g] [Physiological or clinical, Evaluating Response To Treatment, Activity Analysis] [lr5qxyw6ww35, kk05h7rpym8v, pwhpjmwdbgkh, kk05h7rpym8w, kk05h7rpym8x, r4hp3q5y2klm, r4hp384nvkyl, r4hp39vkd3t3, r4hp39lc1tgs, r4hp39hf705k, r4hp39kgt2jy, r4hp38mgkgb9, r4hp39w4k8tw, r4hp38c875ch, r4hp38mgkgj7, r4hp38jd5vlp, r4hp38gxry6k, r4hp38bczf6g, r4hp38yky17z, r4hp38z9n01d, r4hp39qlrrnf, r4hp381fy5cs, r4hp381fy5cw, r4hp393pwqm9, r4hp39mjd4qj, r4hp3b0d86ss, r4hp39znk89c, r4hp39b989y6, r4hp38mb0nx2, r4hp39ys9j31, r4hp39ln4dll, r4hp39krwnf7, r4hp39l6j13q, r4hp38hhy39p, r4hp381fy5cl, r4hp38jtt70k, r4hp38f9t7jr, r4hp39zj0gsm, r4hp38nsfl6q, r4hp38n1qmfs, r4hp39ln4dhj, r4hp39j9gr79, r4hp38jp8fh0, r4hp38y8vgc2, r4hp39v3rr24, r4hp3b0twljt, r4hp38819rv0, r4hp3pdb2p7r, r4hp39hf702g, eM3W2jDdq4CnoM] [q25g9q497cwj, r4hp5zkjq0c3, r4hp5zjccp22, r4hp5zjccp1z, r4hp5zm4npzj, r4hp5zhs6j1c, zPNWxozYM3fxBr, r4hp5zjng89p, r4hp5yw4mj85, r4hp5zfl2n7g, r4hp5yxm1fj5, r4hp5yq9rf4h, r4hp5z5crc2v, r4hp5zbhq1cb, r4hp5z0tyv1k, r4hp5yvdxkr1, r4hp5zjccp2h, r4hp5zkzbcvf] [kn3ptfq7c6lz, r4hp13l4sngc, r4hp13nhg9tp, YgJdXZMgAyT4za, r4hp13n1ty7z, r4hp13qznrsn, 3r0XoawY07FG2Z, 11g43clqdpk72, PNz3A1OgQesRKw, 11g43clqdpk4z, r4hp5z5nty2h, r4hp5zj2934z, r4hp5zhs6j1c, zPNWxozYM3fxBr] 60 [United States, Canada, Belgium, Denmark, Netherlands] 
Interventional ParallelGroup 2006-09-01T00:00:00Z 2010-04-30T00:00:00Z Industry Unassigned https://doi.org/10.25934/00004723 Phase3 [Acute Ulcerative Colitis] [Infliximab] [Evaluating Response To Treatment, Activity Analysis] [Ulcerative Colitis] [infliximab] [] [] [] [] [] [] [] [] None None Default 0
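The loop above hard-codes `range(0, 8000, 1000)`, so it would miss records if the index grew past 8,000. Because the request sets `$count=true`, the response should also report the total under `@odata.count` (assuming the service behaves like a standard Azure Search endpoint), and the offsets can be derived from the first page instead. This is a sketch under that assumption. `fetch_all` and `get_page` are hypothetical names, and `get_page(skip, top)` stands for any function that returns the decoded JSON of one request.

```python
def fetch_all(get_page, page_size=1000):
    """Collect every record by paging until the reported total is reached.

    get_page(skip, top) must return a dict with a 'value' list and,
    ideally, an '@odata.count' total.
    """
    first = get_page(0, page_size)
    records = list(first["value"])
    # Fall back to the size of the first page if no total is reported
    total = first.get("@odata.count", len(records))
    for skip in range(page_size, total, page_size):
        records.extend(get_page(skip, page_size)["value"])
    return records
```

Here `get_page` could wrap the `s.get(...)` call from the answer, substituting `skip` and `top` into `$skip` and `$top` and returning `r.json()`. `pd.json_normalize(fetch_all(get_page))` would then build the DataFrame in one step rather than concatenating inside the loop.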