When I try to retrieve the page source from a website, I get completely different (and much shorter) text than when I view the same page source in a web browser.
https://stackoverflow.com/questions/24563601/python-getting-a-wrong-source-code-of-the-web-page-asp-net
This fellow has a related issue, but obtained the home page source instead of the requested one - I am getting something completely alien.
The code is:
from urllib import request

def get_page_source(n):
    url = 'https://www.whoscored.com/Matches/' + str(n) + '/live'
    response = request.urlopen(url)
    return str(response.read())

n = 1006233
text = get_page_source(n)
This is the page I am targeting in this example:
https://www.whoscored.com/Matches/1006233/live
The page source at this URL is rich in information when viewed in a browser, but running the above code returns only the following:
text =
b'<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX,
NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta
name="viewport" content="initial-scale=1.0"><meta http-equiv="X-
UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;
height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=24&
xinfo=0-12919260-0 0NNY RT(1462118673272 111) q(0 -1 -1 -1) r(0 -1)
B12(4,315,0) U2&incident_id=276000100045095595-100029307305590944&edet=12&
cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px"
marginwidth="0px">Request unsuccessful. Incapsula incident ID:
276000100045095595-100029307305590944</iframe></body></html>'
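The response above contains two recognizable markers, _Incapsula_Resource and "Incapsula incident ID", which suggests it is a block page rather than the match page. A minimal sketch for detecting this case programmatically, assuming those two strings are reliable indicators; the helper name looks_like_incapsula_block is hypothetical and the sample body is a trimmed stand-in for the real response:

```python
def looks_like_incapsula_block(page_bytes):
    """Return True if the response body looks like an Incapsula block page."""
    # Decode leniently so unexpected bytes do not raise an exception.
    text = page_bytes.decode('utf-8', errors='replace')
    # Marker strings taken from the response shown above (assumed to be stable).
    markers = ('_Incapsula_Resource', 'Incapsula incident ID')
    return any(marker in text for marker in markers)


# Trimmed stand-in for the blocked response, not the full original body.
sample = (b'<html><body><iframe src="/_Incapsula_Resource?CWUDNSAI=24">'
          b'Request unsuccessful. Incapsula incident ID: 0</iframe></body></html>')
print(looks_like_incapsula_block(sample))
```

A check like this could run on the bytes returned by response.read() before any parsing, so that a blocked request is reported as such instead of being mistaken for page content.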
What went wrong here? Can a server detect a robot even when it has not sent repetitive requests (and if so, how), and is there a way around it?
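One thing a server can inspect even on a first request is the set of HTTP headers: unless told otherwise, urllib identifies itself with a User-Agent of the form Python-urllib/x.y. A minimal sketch of sending browser-like headers instead. The header values are illustrative assumptions, build_request is a hypothetical helper, and there is no guarantee this alone gets past Incapsula, which may also rely on cookies or JavaScript checks:

```python
from urllib import request

# Illustrative header values (assumptions); real ones can be copied
# from a browser's network inspector.
BROWSER_HEADERS = {
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/120.0 Safari/537.36'),
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9',
}


def build_request(n):
    """Build a request for match n that carries browser-like headers."""
    url = 'https://www.whoscored.com/Matches/' + str(n) + '/live'
    return request.Request(url, headers=BROWSER_HEADERS)


def get_page_source(n):
    """Fetch the page and decode it to text (performs a network call)."""
    with request.urlopen(build_request(n)) as response:
        return response.read().decode('utf-8', errors='replace')


# Building the request does not touch the network, so it is safe to inspect.
req = build_request(1006233)
print(req.full_url)
print(req.get_header('User-agent'))
```

Decoding the bytes with decode() rather than wrapping them in str() also avoids the b'...' prefix seen in the output above.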