An alternative approach would be to extract the data directly from the site's JSON API. This can be done without the overhead of Selenium, as follows:
from bs4 import BeautifulSoup
import requests
import json

session = requests.Session()
r = session.get('https://www.yelp.com/biz/ziggis-coffee-longmont')
#r = session.get('https://www.yelp.com/biz/menchies-frozen-yogurt-lafayette')
soup = BeautifulSoup(r.content, 'lxml')

# Locate the business ID to use (from JSON inside one of the script entries)
for script in soup.find_all('script', attrs={"type": "application/json"}):
    gaConfig = json.loads(script.text.strip('<!-->'))

    try:
        biz_id = gaConfig['gaConfig']['dimensions']['www']['business_id'][1]
        break
    except KeyError:
        pass

# Build a suitable JSON request for the required information
json_post = [
    {
        "operationName": "GetBusinessAttributes",
        "variables": {
            "BizEncId": biz_id
        },
        "extensions": {
            "documentId": "35e0950cee1029aa00eef5180adb55af33a0217c64f379d778083eb4d1c805e7"
        }
    },
    {
        "operationName": "GetBizPageProperties",
        "variables": {
            "BizEncId": biz_id
        },
        "extensions": {
            "documentId": "f06d155f02e55e7aadb01d6469e34d4bad301f14b6e0eba92a31e635694ebc21"
        }
    },
]

r = session.post('https://www.yelp.com/gql/batch', json=json_post)
j = r.json()

business = j[0]['data']['business']
print(business['name'], '\n')

for property in j[1]['data']['business']['organizedProperties'][0]['properties']:
    print(f'{"Yes" if property["isActive"] else "No":5} {property["displayText"]}')
This would give you the following entries:
Ziggi's Coffee
Yes Offers Delivery
Yes Offers Takeout
Yes Accepts Credit Cards
Yes Private Lot Parking
Yes Bike Parking
Yes Drive-Thru
No No Outdoor Seating
No No Wi-Fi
How was this solved?
Your best friend here is your browser's network dev tools. With these you can watch the requests the page makes to obtain its information. The normal flow is: the initial HTML page is downloaded, its JavaScript runs, and further requests are made to fill in the rest of the page.
The trick is to first locate where the data you want is (it is often returned as JSON), then work out how to recreate the parameters needed to request it.
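For example, once the Network tab shows the POST to https://www.yelp.com/gql/batch that carries the attribute data, a sensible first step might be to replay just that one request and check that the same data comes back. A minimal sketch, reusing session, biz_id and one documentId from the script above:

# Replay one captured request and peek at what comes back
probe = [{
    "operationName": "GetBizPageProperties",
    "variables": {"BizEncId": biz_id},
    "extensions": {"documentId": "f06d155f02e55e7aadb01d6469e34d4bad301f14b6e0eba92a31e635694ebc21"},
}]
resp = session.post('https://www.yelp.com/gql/batch', json=probe)
print(resp.status_code)                      # 200 means the recreated request works
print(list(resp.json()[0]['data'].keys()))   # top-level keys of the returned JSON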
To understand this code further, use print(). Print everything; it will show you how each part builds on the next. That is how the script was written, one bit at a time.
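For instance, to see the shape of the batch response before writing the final loop, something like this works (the 800-character slice is arbitrary, just enough to read the nesting):

# Pretty-print the start of each entry in the batch response `j` from above
for part in j:
    print(json.dumps(part, indent=2)[:800])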
Approaches using Selenium let the JavaScript run, but most of the time this is not needed, because that JavaScript is just making requests like these and formatting the returned data for display.
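For comparison, a bare-bones Selenium version might look roughly like the sketch below; it is an assumption that the business name sits in the page's h1 element, and every other field would still need scraping out of the rendered HTML, on top of needing a browser and driver installed:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()   # launches a full browser just to run the page's JavaScript
driver.get('https://www.yelp.com/biz/ziggis-coffee-longmont')
print(driver.find_element(By.TAG_NAME, 'h1').text)   # assumed location of the business name
driver.quit()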