I am not getting the address here. It prints "ADDRESS:NA" for every doctor. I want to fetch the address for each person. This code gives all the other details except the address:

from bs4 import BeautifulSoup
import requests

for count in range(1, 2):
    r = requests.get('https://www.ratemds.com/best-doctors/?country=in&page=' + str(count))
    soup = BeautifulSoup(r.text, 'lxml')
    for links in soup.find_all('a', class_='search-item-doctor-link'):
        link = "https://www.ratemds.com" + links['href']
        r2 = requests.get(link)
        soup2 = BeautifulSoup(r2.text, 'lxml')
        try:
            name = soup2.select_one('h1').text
            print "NAME:" + name
        except:
            print "NAME:NA"
        try:
            speciality = soup2.select_one('.search-item-info a').text
            print "SPECIALITY:" + speciality
        except:
            print "SPECIALITY:NA"
        try:
            gender = soup2.select_one('i + a').text
            print "GENDER:" + gender
        except:
            print "GENDER:NA"
        try:
            speciality1 = soup2.select_one('i ~ [itemprop=name]').text
            print "SPECIALITY1:" + speciality1
        except:
            print "SPECIALITY1:NA"
        try:
            contact = soup2.select_one('[itemprop=telephone]')['content']
            print "CONTACT:" + contact
        except:
            print "CONTACT:NA"
        try:
            website = soup2.select_one('[itemprop=sameAs]')['href']
            print "WEBSITE:" + website
        except:
            print "WEBSITE:NA"
        try:
            add = [item['content'] for item in soup2.select('[itemprop=address] meta')]
            print "ADDRESS:" + add
        except:
            print "ADDRESS:NA"
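Note that the address selector returns a list of meta content values, and concatenating a str with a list raises a TypeError, which the bare except silently turns into "ADDRESS:NA". A minimal sketch of the fix (the sample list below is a hypothetical stand-in for the scraped values, shown with Python 3 print):

```python
# Hypothetical stand-in for what the scrape returns, i.e.
# [item['content'] for item in soup2.select('[itemprop=address] meta')]
add = ['Mumbai', 'MH', 'India']

# "ADDRESS:" + add raises TypeError (str + list); join the parts instead
print("ADDRESS:" + ', '.join(add))  # -> ADDRESS:Mumbai, MH, India
```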
Here is an example of selectors for a wider set of info:
from bs4 import BeautifulSoup
import requests

r = requests.get('https://www.ratemds.com/doctor-ratings/dr-dilip-raja-mumbai-mh-in', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'lxml')

details = {
    'name': soup.select_one('h1').text,
    'speciality': soup.select_one('.search-item-info a').text,
    'rating': soup.select_one('.star-rating')['title'],
    'gender': soup.select_one('i + a').text,
    'specialty_full': soup.select_one('i ~ [itemprop=name]').text,
    'phone': soup.select_one('[itemprop=telephone]')['content'],
    'address': [item['content'] for item in soup.select('[itemprop=address] meta')],
    'website': soup.select_one('[itemprop=sameAs]')['href']
}
print(details)
You can also target the script tag, which holds a lot of info that can be converted to JSON. Sadly, none of the nice library conversions for hex > ascii seemed to work, so the replacements have been done with a dict.
import requests
import json
from bs4 import BeautifulSoup as bs
import pandas as pd
import re

res = requests.get('https://www.ratemds.com/doctor-ratings/dr-dilip-raja-mumbai-mh-in', headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(res.content, 'lxml')

r = re.compile(r'window\.DATA\.doctorDetailProps = JSON\.parse(.*)')
data = soup.find('script', text=r).text
script = r.findall(data)[0].rstrip('");').lstrip('("')

convert_dict = {
    '\\u0022': '"',
    '\\u002D': '-',
    '\\u003D': '=',
    '\\u005Cn': ' ',
    '\\u0027': "'",
    '\\u005Cr': '',
    'false': '"false"',
    'true': '"True"',
    'null': '"null"'
}
for k, v in convert_dict.items():
    script = script.replace(k, v)

items = json.loads(script)
doctor = items['doctor']
print(doctor['full_name'])
print(doctor['specialty_name'])
print(doctor['gender'])
print(doctor['geocode_address'])
print(doctor['rating'])
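The idea behind the replacement dict can be sketched on a tiny stand-in payload (the string below is hypothetical, not the site's actual data): the literal \u0022 escapes become real quotes and the bare JavaScript literals are wrapped in quotes, so that json.loads accepts the result:

```python
import json

# Hypothetical stand-in for the escaped JSON embedded in the script tag
script = '{\\u0022full_name\\u0022:\\u0022Dr Dilip Raja\\u0022,\\u0022claimed\\u0022:true}'

convert_dict = {
    '\\u0022': '"',    # escaped double quote -> real quote
    'true': '"true"',  # quote the bare JS literal so json.loads keeps it as a string
}
for k, v in convert_dict.items():
    script = script.replace(k, v)

items = json.loads(script)
print(items['full_name'])  # -> Dr Dilip Raja
```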
Assuming you have done pip install lxml and pip install beautifulsoup4, the code you are using seems to work just fine.
Working example here (click "Run"): https://repl.it/repls/DarkorangeFinishedSoftwaresuite
If you are not getting the same results as my working example, then it is probably the extra space in the URL of your requests.get(). In that case, you can copy the code I used and see if it works for you.
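That stray whitespace comes from the URL string being split across two lines in the question; building the query with urlencode (a sketch using the same parameters as above) avoids embedding spaces or newlines in the first place:

```python
from urllib.parse import urlencode

count = 1
# Build the query string instead of concatenating across lines,
# so no stray whitespace can slip into the URL
url = 'https://www.ratemds.com/best-doctors/?' + urlencode({'country': 'in', 'page': count})
print(url)  # -> https://www.ratemds.com/best-doctors/?country=in&page=1
```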