I'm a paid WSJ subscriber, and I'm trying to scrape articles for an NLP project. I thought I was keeping the session alive with the code below.
import requests

# One session object, so cookies set by the login response are reused later.
rs = requests.session()
login_url = "https://sso.accounts.dowjones.com/login?client=5hssEAdMy0mJTICnJNvC9TXEw3Va7jfO&protocol=oauth2&redirect_uri=https%3A%2F%2Faccounts.wsj.com%2Fauth%2Fsso%2Flogin&scope=openid%20idp_id&response_type=code&nonce=18091b1f-2c73-4a93-ab10-77b0d4d4f9d3&connection=DJldap&ui_locales=en-us-x-wsj-3&mg=prod%2Faccounts-wsj&state=NfljSw-Gz-TnT_I6kLjnTa2yxy8akTui#!/signin"
payload = {
    "username": "xxx@email",
    "password": "myPassword",
}
# Post the credentials to the sign-in page URL.
result = rs.post(
    login_url,
    data=payload,
    headers=dict(referer=login_url),
)
This is the article I want to parse:
r = rs.get('https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y')
The HTML I get back is still the non-member version.
I also tried another method, using curl to save the cookies after logging in:
curl -c cookies.txt -I "https://www.wsj.com"
curl -v cookies.txt "https://www.wsj.com/articles/singapore-prime-minister-lee-rejects-claims-he-misused-state-powers-in-family-feud-1499094761?tesla=y" > test.html
The result is the same.
I'm not very familiar with how authentication works behind the browser. Can someone explain why both methods above fail, and how I should fix them to reach my goal? Thank you very much.
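For reference, here is a minimal sketch of how cookie-based sessions generally work, using only the Python standard library and a toy local server. It is not the Dow Jones login flow: `ToyHandler`, `SESSION_TOKEN`, `fetch`, `run_demo`, the `/login` and `/article` paths, and the "secret" password are all hypothetical names made up for this illustration. It only shows that member content comes back when the exact request the server expects has been sent and the cookie it returned is presented again. The real login URL above carries OAuth2 parameters (`protocol=oauth2`, `nonce`, `state`) and a `#!/signin` fragment, which is never sent to the server, so the request the browser actually makes on sign-in is probably different from a plain form POST to that URL. On the curl side, `-c` only saves cookies from an unauthenticated `-I` (HEAD) request, and `-v` takes no argument, so `cookies.txt` is treated as a second URL instead of being read with `-b`.

```python
import http.cookiejar
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

SESSION_TOKEN = "demo-session-token"  # hypothetical value for this toy server


class ToyHandler(BaseHTTPRequestHandler):
    """Toy site: POST /login sets a cookie, GET checks for it."""

    def _send(self, body, cookie=None):
        data = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        if cookie:
            self.send_header("Set-Cookie", cookie)
        self.end_headers()
        self.wfile.write(data)

    def do_POST(self):
        # Read the urlencoded form body and check the (made-up) password.
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode("utf-8"))
        if form.get("password") == ["secret"]:
            self._send("logged in", cookie=f"session={SESSION_TOKEN}; Path=/")
        else:
            self._send("bad credentials")

    def do_GET(self):
        # Member content is served only when the session cookie comes back.
        cookies = self.headers.get("Cookie", "")
        if f"session={SESSION_TOKEN}" in cookies:
            self._send("member article")
        else:
            self._send("non-member preview")

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(opener, url, form=None):
    """GET the URL, or POST it when form data is given; return the body text."""
    data = urllib.parse.urlencode(form).encode("ascii") if form else None
    with opener.open(url, data=data, timeout=5) as resp:
        return resp.read().decode("utf-8")


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), ToyHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    try:
        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),  # ignore any proxy settings
            urllib.request.HTTPCookieProcessor(jar),  # store and resend cookies
        )
        before = fetch(opener, base + "/article")
        login = fetch(opener, base + "/login", {"username": "u", "password": "secret"})
        after = fetch(opener, base + "/article")
        return before, login, after, sorted(c.name for c in jar)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

Against a real site, the same idea suggests a first check: after the login request, look at the response status, the final URL, and which cookies the session actually holds. If no session cookie was set, the login request itself was not what the server expected, and the browser's network tab is one place to compare against what a real sign-in sends.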