Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


selenium - scraping hidden data within an a tag

I am trying to scrape information from https://www.heiminfo.ch/institutionen. The HTML where the information I am looking for is stored looks like this:

<article class="institution card pushed" data-name="HOF SPEICHER AG - (di Gallo)" data-institution-type="HIALTER HIEB CVAPPENZELLALTER" data-subscription="SILBER" data-zoom="15" data-track-content="" data-content-target="Huta5R8" data-lng="9.441113" data-group="Kurt di Gallo Holding AG" data-content-piece="Huta5R8" data-content-name="Institution View List" data-lat="47.41353" style="height: 249.95px;" data-ol-has-click-handler="">
            <a href="/institution/hof-speicher-ag/Huta5R8" data-remote-url="" data-id="Huta5R8" data-ol-has-click-handler="">
                <div class="img-container">
                    
                        <img class=" lazyloaded" width="450" src="/filesystem/clientadditionportrait/2018/02/698FA7D4-F5A4-89B4-8CE87700B6C2D216/images/fit/Hof-Speicher1-w-450-hc19BB84B3.jpg" data-src="/filesystem/clientadditionportrait/2018/02/698FA7D4-F5A4-89B4-8CE87700B6C2D216/images/fit/Hof-Speicher1-w-450-hc19BB84B3.jpg" alt="HOF SPEICHER AG">
                    
                </div>
                
                
                <div class="text-container" style="height: 114.99px;">
                    <div class="name-and-addition">
                        <h2 style=""><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">HOF SPEICHER AG </font></font></h2>
                        
                            <p class="addition" style=""><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">(di Gallo)</font></font></p>
                        
                    </div>
                    
                    <p class="location">
                        <span class="canton"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">AR </font></font></span>
                        <span class="plz"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">9042 </font></font></span>
                        <span class="city"><font style="vertical-align: inherit;"><font style="vertical-align: inherit;">memory</font></font></span>
                    </p>
                    
    
                </div>
            </a>
        </article>
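
Part of the information is already carried by the data-* attributes of the <article> tag itself (data-name, data-group, data-lat, data-lng), so it can be read without touching the nested <font> tags. The following is a minimal sketch using only the standard library; the class name InstitutionCardParser and the dictionary keys are made up for illustration, and the sample string is a shortened copy of the card above.

```python
from html.parser import HTMLParser


class InstitutionCardParser(HTMLParser):
    """Collect data-* attributes of each <article class="institution ..."> tag."""

    def __init__(self):
        super().__init__()
        self.cards = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "article" and "institution" in classes:
            # Keys on the left are illustrative names, not part of the site.
            self.cards.append({
                "name": attrs.get("data-name"),
                "group": attrs.get("data-group"),
                "lat": attrs.get("data-lat"),
                "lng": attrs.get("data-lng"),
                "id": attrs.get("data-content-piece"),
            })


# Shortened copy of the card from the question, used only as sample input.
sample = (
    '<article class="institution card pushed" '
    'data-name="HOF SPEICHER AG - (di Gallo)" data-lng="9.441113" '
    'data-group="Kurt di Gallo Holding AG" data-content-piece="Huta5R8" '
    'data-lat="47.41353"></article>'
)

parser = InstitutionCardParser()
parser.feed(sample)
cards = parser.cards
print(cards[0]["name"], cards[0]["lat"], cards[0]["lng"])
```

In the Selenium script, the same parser could be fed driver.page_source instead of the sample string.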

I have been able to obtain the first 500 institution names, city, plz and location using this code, courtesy of Arundeep Chohan:

    import time

    import pandas as pd
    from bs4 import BeautifulSoup
    from selenium import webdriver as wb
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = wb.Chrome('chromedriver.exe')
    driver.maximize_window()
    driver.get('https://www.heiminfo.ch/institutionen')
    # Click the third button of the search form (located by absolute XPath).
    driver.find_element(By.XPATH, '/html/body/div[1]/main/div/section/form/div[1]/div[3]/div/button[3]').click()

    wait = WebDriverWait(driver, 5)
    total = 500  # number of institutions to collect
    h = []
    while True:
        try:
            # Re-parse the page on every pass; the list is expected to grow
            # after each click on ".next.btn".
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            item = soup.find(class_='institutions')
            lsh = item.find_all(class_='name-and-addition')
            if len(lsh) >= total:
                # Enough cards are loaded: keep the first `total` names.
                for e in lsh[:total]:
                    h.append(e.text.strip())
                data = pd.DataFrame({'Adult Homes': h})
                print(data)
                break
            wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.next.btn'))).click()
            time.sleep(5)  # give the next batch time to load
        except Exception as e:
            print(e)
            break

The remaining piece of information is the phone number, which is hidden behind the href of each <a> tag: I have to click the link and open the page to see the telephone number. There are 1589 of these links in total. How can I write a scraper that iterates through all of them and obtains the hidden telephone number? The links look like this:

    <a href="/institution/hof-speicher-ag/Huta5R8" data-remote-url="" data-id="Huta5R8" data-ol-has-click-handler="">
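
One possible approach, sketched here under stated assumptions, is to collect every /institution/... href from the loaded list page, turn it into an absolute URL, and then open each detail page in turn. How the phone number appears on the detail page is not shown in the question; the sketch ASSUMES it is exposed as a tel: link, which has to be checked against the real page. The helper names (HrefCollector, institution_urls, phone_numbers) and the detail_html sample are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.heiminfo.ch"


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.hrefs.append(href)


def collect_hrefs(html):
    parser = HrefCollector()
    parser.feed(html)
    return parser.hrefs


def institution_urls(list_html):
    """Absolute detail-page URLs for every /institution/... link."""
    return [urljoin(BASE_URL, h) for h in collect_hrefs(list_html)
            if h.startswith("/institution/")]


def phone_numbers(detail_html):
    """Numbers from tel: links. ASSUMPTION: the detail page uses tel: links."""
    return [h[len("tel:"):] for h in collect_hrefs(detail_html)
            if h.startswith("tel:")]


# Link copied from the question.
list_html = '<a href="/institution/hof-speicher-ag/Huta5R8" data-id="Huta5R8"></a>'
# Made-up markup with a placeholder number, NOT taken from the real site.
detail_html = '<a href="tel:+41000000000">call</a>'

urls = institution_urls(list_html)
phones = phone_numbers(detail_html)
print(urls, phones)
```

With the Selenium driver from the script above, the loop would then be roughly: for each url in institution_urls(driver.page_source), call driver.get(url) and pass driver.page_source to phone_numbers, with a short pause between requests. If the number turns out not to be a tel: link, only phone_numbers needs to change.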


1 Answer

Waiting for an expert to answer.
