I have an issue parsing an HTML page with BeautifulSoup 4 (BS4). The page contains a hidden div whose content I want to read. That content is generated dynamically by a JavaScript function that is triggered by the body's onload event.
The problem is: when I open the page in my browser, the div has the content it is supposed to have. When I parse the same page with BS4, the div is empty.
I could not find any information about BS4 being unable to handle content generated by onload JavaScript, so I am not sure what the issue is.
Python script:
import urllib.request
from bs4 import BeautifulSoup
import time
import datetime

eT = time.time()
version = 1
vNum = str(version)

# Build a zero-padded YYYYMMDD string for the output file name.
t = datetime.datetime.now()
d = "0" + str(t.day)
#d = d.rstrip()
d = d[-2:]
m = "0" + str(t.month)
#m = m.rstrip()
m = m[-2:]
y = str(t.year)
dStr = y + m + d

resultFile = 'output/classAndIdList-' + dStr + '-v' + vNum + '.txt'
pageListFile = 'input/quickListFR.txt'
f = open(pageListFile, mode='r', encoding='utf-8')
urlRoot = 'http://dev.example.com/'
fOut = open(resultFile, 'w')
ciList = []

# for url in urls.split('\n'):
for l in f:
    u = l.rstrip()
    url = urlRoot + u
    # Fetch the raw HTML exactly as the server sends it.
    html_content = urllib.request.urlopen(url)
    time.sleep(1)
    html_text = html_content.read()
    # Name the parser explicitly so BS4 does not have to guess one.
    soup = BeautifulSoup(html_text, 'html.parser')
    ciTag = soup.find(id="testDivCSS")
    print(ciTag)
    ciString = ciTag.get_text()
    # print(ciString)
    ciArray = ciString.split(',')
    # print(ciArray)
    # Write each selector only once across all pages.
    for c in ciArray:
        if c not in ciList:
            ciList.append(c)
            fOut.write(c + '\n')
            print(c)
    print(u + '... DONE')

f.close()
fOut.close()
Example output when parsing a page with BeautifulSoup:
Example-page-1.html... DONE
<div id="testDivCSS" style="display: none;"> </div>
And the same div as rendered in the browser (showing that the PHP and JavaScript parts work fine):
<div id="testDivCSS" style="display: none;">div#menu_rightup,div#social,div#sidebar,div#specific,div#menu_rightdown,div#footer</div>
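The symptom can be reproduced without the network or the real site. The sketch below is only an illustration: the `PAGE` markup, the `fillDiv` function, and the `DivTextParser` / `static_div_text` helpers are hypothetical stand-ins, not the actual page or code. It uses the standard library's `html.parser` so it runs without extra packages. Like BeautifulSoup, that parser only reads the markup it is given and does not execute scripts, so the div shows up exactly as the server sent it.

```python
from html.parser import HTMLParser

# Hypothetical page: the div is empty in the markup and would only be
# filled by a script that a browser runs on load.
PAGE = """<html>
<head>
<script>
function fillDiv() {
    document.getElementById("testDivCSS").innerHTML = "div#menu_rightup,div#social";
}
</script>
</head>
<body onload="fillDiv()">
<div id="testDivCSS" style="display: none;"> </div>
</body>
</html>"""


class DivTextParser(HTMLParser):
    """Collects the text found inside the div with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div" and self.inside:
            self.inside = False

    def handle_data(self, data):
        # Script text also arrives here, but only while self.inside is False.
        if self.inside:
            self.text += data


def static_div_text(html, target_id):
    """Return the div's text as found in the static markup (no scripts run)."""
    parser = DivTextParser(target_id)
    parser.feed(html)
    parser.close()
    return parser.text


div_text = static_div_text(PAGE, "testDivCSS")
print(repr(div_text.strip()))
```

The selector list exists only as a string inside the script source. Nothing writes it into the div unless a JavaScript engine actually runs the onload handler, which is what the browser does and a markup parser does not.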