Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
5.9k views
in Technique by (71.8m points)

python - Scraping website with Beautiful Soup that requires login

I'm trying to scrape a website that requires login, using Python and Beautiful Soup. I want to scrape this page (clicking it redirects you to the login page): https://www.eurekalert.org/reporter/embargoed.php

This is the login page: https://www.eurekalert.org/login.php

The first page lists many news articles whose links look like this: https://www.eurekalert.org/emb_releases/2021-01/embl-ebn011121.php

So every 'href' looks like '/emb_releases/2021-01/embl-ebn011121.php'

The problem is that I cannot get the HTML of the page (first link) from which I want to extract the hrefs. The hrefs I want match the CSS selector 'article.post a'. This is my code:

from bs4 import BeautifulSoup
import requests

url = 'https://www.eurekalert.org/'
login = 'login'

headers = {'origin': url,
           'referer': url+login}

s = requests.session()

login_payload = {'login': 'xxx',
                 'password': 'xxx'}

# Every YouTube tutorial says this should be .post, but on my website the request is GET, not POST. I have tried both ways and the result is the same
login_req = s.post(url+login, headers=headers, data = login_payload)
print(login_req) # returns 200, if i try .get it also returns 200


login_response = s.get(url+'reporter/embargoed.php')
print(login_response) # returns 200
soup = BeautifulSoup(login_response.content, 'html.parser')
print(soup) # prints HTML but not the HTML that I want

I have also tried this, but I get the same result:

login_response = requests.get(url+'reporter/embargoed.php', auth = ('username', 'password'))
soup = BeautifulSoup(login_response.content, 'html.parser')
print(soup) # prints HTML but not the HTML that I want

This is the first time I'm trying to scrape a website that requires login, so there are probably some silly mistakes in my code. What am I doing wrong? I googled a lot and tried many different things, but I always failed.

Thanks for helping me out.


He who fights dragons for too long becomes a dragon himself; gaze into the abyss for too long and the abyss gazes back…

1 Answer

0 votes
by (71.8m points)

Go to the login page, enter your username and password, press F12, and start recording in the Network tab.

Then click the login button and copy the login request as cURL, as shown in the first image below. Search for a cURL-to-Python converter and paste the command in to get the code, as shown in the second image. The resulting code is also included below as an example.

1- [Image: copying the login request as cURL from the Network tab]

2- [Image: the code produced by the cURL-to-Python converter]

The code will look like this:

import requests

cookies = {
    '__utmt_8254f77d54ec9886070127029a0b81da': '1',
    '_fbp': 'fb.1.1610535613017.434450469',
    '__utmt': '1',
    '_ga': 'GA1.2.1008639424.1610535613',
    '_gid': 'GA1.2.56271763.1610535614',
    '__utma': '28029352.1008639424.1610535613.1610535864.1610535864.1',
    '__utmc': '28029352',
    '__utmz': '28029352.1610535864.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)',
    '__utmb': '28029352.1.10.1610535864',
    'sat_ppv': '84',
}

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'https://www.eurekalert.org',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://www.eurekalert.org/login.php',
    'Accept-Language': 'en-US,en;q=0.9',
}

data = {
  'frompage': '^',
  'username': 'Username',
  'password': 'Password'
}

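def loginToPage():
    # Perform the login by posting the form data captured from the browser
    response = requests.session().post('https://www.eurekalert.org/login.php', headers=headers, cookies=cookies, data=data)

    # response.ok only means the status code is below 400;
    # it does not prove that the credentials were accepted
    if response.ok:
        print('logged in successfully')
        return True
    else:
        print('failed to log in')
        return False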
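
The function above only reports whether the POST succeeded, and the session that received the login cookies is discarded as soon as it returns. To scrape the embargoed page, the same session has to make the follow-up request. Below is a minimal sketch of that idea, not a tested solution for this site. The names `extract_release_links` and `fetch_embargoed_links` are hypothetical. It assumes the form fields in `data` above are what the site expects, and the check for `login.php` in the final URL is only a heuristic.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.eurekalert.org/'


def extract_release_links(html, base_url=BASE_URL):
    """Return absolute URLs of all links matched by 'article.post a'."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for a in soup.select('article.post a[href]'):
        url = urljoin(base_url, a['href'])  # turn '/emb_releases/...' into a full URL
        if url not in links:                # keep page order, skip duplicates
            links.append(url)
    return links


def fetch_embargoed_links(login_data, headers=None):
    """Log in and fetch the embargoed page using one shared session."""
    with requests.Session() as session:
        # The session stores cookies set by the login response
        # and sends them automatically with the next request.
        login = session.post(urljoin(BASE_URL, 'login.php'),
                             data=login_data, headers=headers)
        login.raise_for_status()

        page = session.get(urljoin(BASE_URL, 'reporter/embargoed.php'))
        page.raise_for_status()

        # Heuristic (an assumption): ending up on login.php means the login failed.
        if 'login.php' in page.url:
            raise RuntimeError('redirected to the login page; check the form fields')
        return extract_release_links(page.text)
```

With the `data` and `headers` dictionaries defined above, calling `fetch_embargoed_links(data, headers)` would return the list of release URLs if the login is accepted. The parsing step is kept in its own function so it can be checked against saved HTML without logging in.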
