Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - scraping <a href> and title from some <div class = "xxx">

I am doing web scraping and have done this so far:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://abcdefgh.in')
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')
# Collect every element whose class is "p-list-sec"
all_p = soup.find_all(class_="p-list-sec")
print(all_p)

After doing this, I get something like this when I print all_p:

<div class="p-list-sec">
<ul> <li><a href="link1" title="tltle1">title1</a></li>
     <li><a href="link2" title="tltle2">title2</a></li>
     <li><a href="link3" title="tltle3">title3</a></li>
</ul>
</div>

<div class="p-list-sec">
<ul> <li><a href="link1" title="tltle1">title1</a></li>
     <li><a href="link2" title="tltle2">title2</a></li>
     <li><a href="link3" title="tltle3">title3</a></li>
</ul>
</div>

<div class="p-list-sec">
<ul> <li><a href="link1" title="tltle1">title1</a></li>
     <li><a href="link2" title="tltle2">title2</a></li>
     <li><a href="link3" title="tltle3">title3</a></li>
</ul>
</div>

and so on, up to around 40 such divs.

Now I want to extract the href and title of every <a> tag inside the p-list-sec divs and store them in a file. I know how to write to a file, but extracting the href and title from all of the p-list-sec divs is what is giving me trouble. I am using Python 3.9 with the requests and BeautifulSoup libraries on Windows 10, from the command prompt.

Thanks, akhi



1 Answer

If you want to avoid looping twice (once over the divs and again over the links inside each one), you can use a BeautifulSoup CSS selector that chains the class and the <a> tag. Take your soup and select like this:

soup.select('.p-list-sec a')

To shape the information for further processing, you can use a single for loop or a list comprehension on one line:

[{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]

Output

[{'url': 'link1', 'title': 'tltle1'},
 {'url': 'link2', 'title': 'tltle2'},
 {'url': 'link3', 'title': 'tltle3'},
 {'url': 'link1', 'title': 'tltle1'},
 {'url': 'link2', 'title': 'tltle2'},
 {'url': 'link3', 'title': 'tltle3'},
 {'url': 'link1', 'title': 'tltle1'},
 {'url': 'link2', 'title': 'tltle2'},
 {'url': 'link3', 'title': 'tltle3'}]
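
For reference, here is the same selection written as a single for loop that you can run without hitting the site. It is only a sketch: SAMPLE_HTML is a made-up stand-in for the real page, extract_links is a hypothetical helper name, and the fallback to the link text when an <a> has no title attribute is an assumption about what you would want (link['title'] would raise a KeyError in that case).

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page (assumption, not the actual site).
SAMPLE_HTML = """
<div class="p-list-sec">
  <ul>
    <li><a href="link1" title="tltle1">title1</a></li>
    <li><a href="link2">title2</a></li>
  </ul>
</div>
<div class="other">
  <a href="ignored" title="ignored">ignored</a>
</div>
"""


def extract_links(html):
    """Return one dict per <a href=...> found inside a .p-list-sec element."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # "a[href]" skips anchors that have no href attribute at all.
    for link in soup.select(".p-list-sec a[href]"):
        rows.append({
            "url": link["href"],
            # Fall back to the visible link text when title is missing.
            "title": link.get("title") or link.get_text(strip=True),
        })
    return rows


links = extract_links(SAMPLE_HTML)
print(links)
```

The link in the "other" div is not returned, because the selector only matches anchors that sit inside an element with the class p-list-sec.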

To store the result in a CSV file, you can use either pandas or the built-in csv module.

Pandas:

import pandas as pd

pd.DataFrame([{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]).to_csv('url.csv', index=False)

CSV:

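import csv

data_list = [{'url':link['href'], 'title':link['title']} for link in soup.select('.p-list-sec a')]

# newline='' stops the csv module from writing a blank row after every record on Windows
with open('url.csv', 'w', newline='', encoding='utf-8') as output_file:
    # Naming the columns explicitly also works when data_list is empty
    dict_writer = csv.DictWriter(output_file, fieldnames=['url', 'title'])
    dict_writer.writeheader()
    dict_writer.writerows(data_list)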
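
If you want to check the CSV output before touching the disk, you could wrap the writing step in a small function and point it at an in-memory buffer first. This is a sketch under assumptions: write_links is a hypothetical helper name, and the rows are the sample values from the output above rather than real scraped data.

```python
import csv
import io


def write_links(rows, output_file, fieldnames=("url", "title")):
    """Write a header plus one CSV row per dict to an already opened text file."""
    writer = csv.DictWriter(output_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# In-memory stand-in for a file; the rows are illustrative sample values.
buffer = io.StringIO()
write_links([{"url": "link1", "title": "tltle1"}], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For a real file, pass the handle from open('url.csv', 'w', newline='', encoding='utf-8') in place of the buffer.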
