I am trying to scrape information from this website: https://www.heiminfo.ch/institutionen. The HTML that holds the information I am looking for looks like this:
<article class="institution card pushed" data-name="HOF SPEICHER AG - (di Gallo)" data-institution-type="HIALTER HIEB CVAPPENZELLALTER" data-subscription="SILBER" data-zoom="15" data-track-content="" data-content-target="Huta5R8" data-lng="9.441113" data-group="Kurt di Gallo Holding AG" data-content-piece="Huta5R8" data-content-name="Institution View List" data-lat="47.41353" style="height: 249.95px;" data-ol-has-click-handler="">
<a href="/institution/hof-speicher-ag/Huta5R8" data-remote-url="" data-id="Huta5R8" data-ol-has-click-handler="">
<div class="img-container">
<img class=" lazyloaded" width="450" src="/filesystem/clientadditionportrait/2018/02/698FA7D4-F5A4-89B4-8CE87700B6C2D216/images/fit/Hof-Speicher1-w-450-hc19BB84B3.jpg" data-src="/filesystem/clientadditionportrait/2018/02/698FA7D4-F5A4-89B4-8CE87700B6C2D216/images/fit/Hof-Speicher1-w-450-hc19BB84B3.jpg" alt="HOF SPEICHER AG">
</div>
<div class="text-container" style="height: 114.99px;">
<div class="name-and-addition">
<h2 style=""><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">HOF SPEICHER AG </font></font></h2>
<p class="addition" style=""><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">(di Gallo)</font></font></p>
</div>
<p class="location">
<span class="canton"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">AR </font></font></span>
<span class="plz"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">9042 </font></font></span>
<span class="city"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">memory</font></font></span>
</p>
</div>
</a>
</article>
I've been able to obtain the first 500 institution names, city, PLZ and location using this code (courtesy of Arundeep Chohan):
import requests
import time
import pandas as pd
import csv
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from time import sleep
from random import randint
from bs4 import BeautifulSoup
from selenium import webdriver as wb
driver = wb.Chrome('chromedriver.exe')
driver.maximize_window()
driver.get('https://www.heiminfo.ch/institutionen')
# Click the button located by this absolute XPath (brittle if the page layout changes).
driver.find_element(By.XPATH, '/html/body/div[1]/main/div/section/form/div[1]/div[3]/div/button[3]').click()
wait = WebDriverWait(driver, 5)
total = 500  # number of institutions to collect
h = []
while True:
    try:
        # Re-parse the current page source on every pass of the loop.
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        item = soup.find(class_='institutions')
        lsh = item.find_all(class_="name-and-addition")
        if len(lsh) >= total:
            # Enough entries are present: store their text and stop.
            for e in lsh[:total]:
                h.append(e.text.strip())
            data = pd.DataFrame(h, columns=['Adult Homes'])
            print(data)
            break
        # Otherwise click the "next" button and give the page time to load more entries.
        wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".next.btn"))).click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
The remaining piece of information is the phone number, which is hidden behind the <a href="..."> link of each institution: I have to click the link to open the page that shows the telephone number. There are 1589 of these links in total. How can I write a scraper that iterates through all of them and obtains the hidden telephone number? The links look like this:
<a href="/institution/hof-speicher-ag/Huta5R8" data-remote-url="" data-id="Huta5R8" data-ol-has-click-handler="">
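One way I imagine the first step could work is to collect the href of every card link from the listing HTML and turn it into an absolute URL, so that each detail page can be opened afterwards. The following is only a minimal sketch of that link-collection step: `BASE_URL`, `InstitutionLinkParser` and `collect_detail_urls` are names made up for illustration, and it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: relative hrefs on the listing page resolve against this host.
BASE_URL = "https://www.heiminfo.ch"


class InstitutionLinkParser(HTMLParser):
    """Collects hrefs of <a> tags found inside <article class="institution ...">."""

    def __init__(self):
        super().__init__()
        self.in_card = False  # True while inside an institution card
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            classes = (attrs.get("class") or "").split()
            self.in_card = "institution" in classes
        elif tag == "a" and self.in_card and attrs.get("href"):
            # Turn the relative href into an absolute URL.
            self.links.append(urljoin(BASE_URL, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_card = False


def collect_detail_urls(html):
    """Return the absolute detail-page URLs in page order, without duplicates."""
    parser = InstitutionLinkParser()
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


if __name__ == "__main__":
    sample = (
        '<article class="institution card pushed">'
        '<a href="/institution/hof-speicher-ag/Huta5R8" data-id="Huta5R8"></a>'
        '</article>'
    )
    print(collect_detail_urls(sample))
```

With Selenium, `collect_detail_urls(driver.page_source)` could be called once all cards are loaded, and each returned URL opened with `driver.get(url)`. Reading the telephone number from the detail page is left out here, because the markup of that page is not shown above.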