Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Setting the Host header for redirected URLs with Python requests module

I'm working on a web scraping project in Python. I get a daily email from a service that has links in it. A typical link looks like:

http://clicks.serviceprovider.com/track/click/12345/www.serviceprovider.com?p=eyJzI...<snip>...JdfSJ9

In a browser, I can see that the server redirects from http://clicks.serviceprovider.com to https://www.serviceprovider.com?pageId=12345. Naturally, I want to scrape pageId 12345 with my Python code.

If I just call requests.get(url), the server never responds. I suspect, but don't know for sure, that this is because requests isn't including a Host header.

If I set headers={'Host': 'clicks.serviceprovider.com'}, I get an HTTP 403 error instead. What I think is happening, but cannot demonstrate, is this: requests sends the original http GET request and receives the HTTP 301 redirect, but when it issues the GET for the redirected https:// page, it still sends the Host header for clicks.serviceprovider.com instead of www.serviceprovider.com from the redirected URL.

How can I tell requests to change the Host header with the redirect?
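
One way to check the hypothesis above is to follow the redirect chain by hand, one hop at a time, without passing any Host header. requests normally derives Host from the URL of each request when none is supplied, so each hop should then carry the host of the URL it is actually fetching. The sketch below is only an illustration under that assumption: follow_redirects, REDIRECT_CODES, max_hops and the 10-second timeout are names and values made up for this example, and the session argument is assumed to behave like requests.Session.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(session, url, max_hops=10, timeout=10):
    """Follow redirects manually and record every hop.

    `session` is assumed to behave like requests.Session: its get() must
    accept allow_redirects/timeout and return an object with
    .status_code and .headers.
    Returns (final_response, hops), where hops is a list of
    (status_code, url) tuples in the order they were requested.
    """
    hops = []
    for _ in range(max_hops):
        # No explicit Host header is passed, so the HTTP layer is left to
        # derive it from `url`. allow_redirects=False keeps requests from
        # following the redirect on its own, and the timeout makes a
        # silent server raise an error instead of hanging forever.
        response = session.get(url, allow_redirects=False, timeout=timeout)
        hops.append((response.status_code, url))

        location = response.headers.get("Location")
        if response.status_code not in REDIRECT_CODES or not location:
            return response, hops

        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)

    raise RuntimeError(f"gave up after {max_hops} redirects; last URL was {url}")


# Example usage (needs the third-party `requests` package and network access):
#   import requests
#   with requests.Session() as session:
#       response, hops = follow_redirects(session, link_from_email)
#       for status, hop_url in hops:
#           print(status, hop_url)
```

Printing the hops shows which request in the chain stalls or returns 403. If the chain completes this way but fails once a fixed Host is forced through headers=, that would support the idea that the forced header is being reused on the redirected request.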



1 Answer

Waiting for an expert to answer.
