Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


xpath - Scrapy + Splash: scraping element inside inner html

I'm using Scrapy + Splash to crawl webpages and extract data from Google ad banners and other ads, and I'm having difficulty getting Scrapy to follow the XPath into them.

I'm using the Scrapy-Splash API to render the pages so that their scripts and images load, and to take screenshots. It seems, however, that Google ad banners are created by JS scripts that insert their contents into a new HTML document within an iframe in the webpage. In the screenshot, the red area is the iframe container and the blue area shows the link I want to extract.

Splash makes sure the code is rendered, so I don't run into the usual problem Scrapy has with scripts, where it reads the script's content instead of its resulting HTML. But I can't find a way to write the XPath needed to reach the element nodes I need (the ad's href link).

If I inspect the element in Google Chrome and copy its XPath, it simply gives me //*[@id="aw0"], which I feel would work if the iframe's HTML were the whole document. But it returns empty no matter how I write it, and I feel that's probably because XPath doesn't elegantly handle HTML documents nested within HTML documents.

The XPath to the iframe that contains the Google ad code is //*[@id="google_ads_iframe_/87824813/hola/blogs/home_0"] (the numbers are constant).

Is there a way to stack these XPaths together to get Scrapy to follow the trail into the container I need? Or should I be parsing the Splash response object directly in some other way, because I can't rely on response.xpath/response.css for this?



1 Answer


The problem is that iframe content is not returned as part of the page HTML. You can either fetch the iframe content directly (by its src), or use the render.json endpoint with the iframes=1 option:

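import parsel
from scrapy_splash import SplashRequest

# ... (inside a spider method)
    # Ask Splash for JSON output that includes the HTML of the page and of its iframes.
    yield SplashRequest(url, self.parse_result, endpoint='render.json',
                        args={'html': 1, 'iframes': 1})

def parse_result(self, response):
    # childFrames holds one entry per iframe; this takes the first one.
    iframe_html = response.data['childFrames'][0]['html']
    sel = parsel.Selector(text=iframe_html)
    item = {
        'my_field': sel.xpath(...),
        # ...
    }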
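
Indexing childFrames[0] only works when the ad happens to be the first iframe on the page. A minimal sketch of a more defensive lookup follows, assuming that each entry in childFrames can itself carry an html string and a nested childFrames list. The helper names iter_frames and find_frame_html are hypothetical and not part of Splash or scrapy-splash; the code only walks the plain dict that response.data exposes.

```python
from typing import Any, Dict, Iterator


def iter_frames(frame: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield the given frame, then every nested child frame, depth-first."""
    yield frame
    # Assumption: nested iframes appear under the same 'childFrames' key.
    for child in frame.get("childFrames", []):
        yield from iter_frames(child)


def find_frame_html(data: Dict[str, Any], marker: str) -> Iterator[str]:
    """Yield the HTML of each frame whose markup contains `marker`."""
    for frame in iter_frames(data):
        html = frame.get("html") or ""  # tolerate a missing or null 'html'
        if marker in html:
            yield html


# Possible use inside parse_result (hypothetical, not run here):
#   for html in find_frame_html(response.data, 'id="aw0"'):
#       sel = parsel.Selector(text=html)
#       href = sel.xpath('//*[@id="aw0"]/@href').get()
```

Matching on a substring such as id="aw0" is a cheap pre-filter; the XPath from the question can then be applied to just the frames that pass it.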

The /execute endpoint does not support fetching iframe content as of Splash 2.3.3.
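
For the other option the answer mentions, fetching the iframe by its src, the src values first have to be turned into requestable URLs. The sketch below is one way to do that with the standard library only. The helper name absolute_iframe_urls is hypothetical, and the input is assumed to be the raw list of src attributes, for example from response.xpath('//iframe/@src').getall(). Script-generated ad iframes may have no usable src at all, in which case this route does not apply and the render.json approach is the one to use.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlparse


def absolute_iframe_urls(page_url: str, srcs: Iterable[str]) -> List[str]:
    """Resolve iframe src values against the page URL, keeping only http(s) URLs."""
    urls = []
    for src in srcs:
        src = (src or "").strip()
        if not src:
            continue  # iframe without a src attribute value
        absolute = urljoin(page_url, src)
        # Skip placeholders such as about:blank or javascript: pseudo-URLs.
        if urlparse(absolute).scheme in ("http", "https"):
            urls.append(absolute)
    return urls


# Possible use inside a spider callback (hypothetical, not run here):
#   for url in absolute_iframe_urls(response.url, response.xpath('//iframe/@src').getall()):
#       yield SplashRequest(url, self.parse_result)
```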

