Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


java - Web scraping with jsoup and selenium

I want to extract some information from this dynamic website with Selenium and Jsoup. To get to that information I have to click the button "Details öffnen". The first picture shows the website before clicking the button and the second shows it after clicking. The information marked in red is what I want to extract.

[Image: the website before clicking the button]

[Image: the website after clicking the button, with the wanted information marked in red]

I first tried to extract the information with Jsoup alone, but I was told that Jsoup cannot handle dynamic content, so I am now trying to extract it with Selenium and Jsoup, as you can see in the source code. However, I am not sure whether Selenium is the right tool for this, so maybe there is a simpler way to get the information I need. It is important that it can be done in Java.

The next two pictures show the html code before clicking the button and after clicking the button.

[Image: HTML code before clicking the button]

[Image: HTML code after clicking the button]

public static void main(String[] args) {
    
    WebDriver driver = new FirefoxDriver(createFirefoxProfile());
    driver.get("http://www.seminarbewertung.de/seminar-bewertungen?id=3448");
    //driver.findElement(By.cssSelector("input[type='button'][value='Details öffnen']")).click();
    WebElement webElement = driver.findElement(By.cssSelector("input[type='submit'][value='Details öffnen'][rating_id='2318']"));
    JavascriptExecutor executor = (JavascriptExecutor)driver;
    executor.executeScript("arguments[0].click();", webElement);
    String html_content = driver.getPageSource();
    //driver.close();
    
    
    Document doc1 = Jsoup.parse(html_content);
    System.out.println("Hallo");
    
    Elements elements = doc1.getAllElements();
    for (Element element : elements) {
        System.out.println(element);
    }

}

private static FirefoxProfile createFirefoxProfile() {
    File profileDir = new File("/tmp/firefox-profile-dir");
    if (profileDir.exists())
        return new FirefoxProfile(profileDir);
    FirefoxProfile firefoxProfile = new FirefoxProfile();
    File dir = firefoxProfile.layoutOnDisk();
    try {
        profileDir.mkdirs();
        FileUtils.copyDirectory(dir, profileDir);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return firefoxProfile;
}

With this source code I cannot find the div element that contains the information I want to extract.

It would be really great, if somebody could help me with this.



1 Answer


It is true that Jsoup can't handle dynamic content that is generated by JavaScript, but in your case the button simply makes an Ajax request, and Jsoup can do that pretty well.

I'd suggest making one call to retrieve the buttons and their ids, and then making successive calls (Ajax POSTs) to retrieve the details (comments or whatever).

The code could be:

    Document document = Jsoup.connect("http://www.seminarbewertung.de/seminar-bewertungen?id=3448").get();
    //we retrieve the buttons
    Elements select = document.select("input.rating_expand");
    //we go for the first
    Element element = select.get(0);
    //we pick the id
    String ratingId = element.attr("rating_id");

    //the Ajax call
    Document document2 = Jsoup.connect("http://www.seminarbewertung.de/bewertungs-details-abfragen")
            .header("Accept", "*/*")
            .header("X-Requested-With", "XMLHttpRequest")
            .data("rating_id", ratingId)
            .post();

    //we find the comment
    //note that this selector is only a demo, feel free to adjust it to your needs
    Elements select2 = document2.select("div.ratingbox div.panel-body.text-center");
    //we are done!
    System.out.println(select2.text());

This code will print the desired comment:

Das Eingehen auf individuelle Bedürfnisse eines jeden einzelnen Teilnehmer scheint mir ein Markenzeichen von Fromm zu sein. Bei einem früheren Seminar habe ich dies auch schon so erlebt!
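To make it clearer what the Ajax call above actually sends, here is a rough sketch of the same POST built with the JDK's own `java.net.http` client (Java 11+) instead of Jsoup. The URL, the two headers and the `rating_id` field come from the code above; the class name `RatingDetailsRequest`, the helper names and the explicit `Content-Type` header are my own assumptions for illustration, and I have not checked what the server really requires.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class RatingDetailsRequest {

    // Endpoint taken from the Jsoup code above.
    static final String DETAILS_URL =
            "http://www.seminarbewertung.de/bewertungs-details-abfragen";

    // Builds (but does not send) the form-encoded POST for one rating id.
    static HttpRequest buildDetailsRequest(String ratingId) {
        String body = "rating_id=" + URLEncoder.encode(ratingId, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(DETAILS_URL))
                .header("Accept", "*/*")
                .header("X-Requested-With", "XMLHttpRequest")
                // Assumption: a plain HTML form encoding is what the server expects.
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    // Sends the request and returns the raw HTML fragment of the response.
    static String fetchDetails(String ratingId) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildDetailsRequest(ratingId), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            // Without an argument, only show what would be sent.
            HttpRequest request = buildDetailsRequest("2318"); // id from the question's selector
            System.out.println(request.method() + " " + request.uri());
            return;
        }
        // With a rating id as argument, actually call the site.
        System.out.println(fetchDetails(args[0]));
    }
}
```

The returned fragment could then be handed to `Jsoup.parse(...)` and queried with the same selector as above. To cover every rating, the idea would be to loop over all elements matched by `input.rating_expand` and build one such request per `rating_id`.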

I hope it will help.

Have a happy new year!

