Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How to make Selenium scripts work faster?

I use Python with Selenium and Scrapy to crawl a website, but my script is very slow:

Crawled 1 pages (at 1 pages/min)

I use CSS selectors instead of XPath to save time, and I changed the middlewares setting:

'tutorial.middlewares.MyCustomDownloaderMiddleware': 543,

Is Selenium itself too slow, or should I change something in the settings?

My code:

import time

from pyvirtualdisplay import Display
from scrapy import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def start_requests(self):
    yield Request(self.start_urls, callback=self.parse)
def parse(self, response):
    display = Display(visible=0, size=(800, 600))
    display.start()
    driver = webdriver.Firefox()
    driver.get("http://www.example.com")
    inputElement = driver.find_element_by_name("OneLineCustomerAddress")
    inputElement.send_keys("75018")
    inputElement.submit()
    catNums = driver.find_elements_by_css_selector("html body div#page div#main.content div#sContener div#menuV div#mvNav nav div.mvNav.bcU div.mvNavLk form.jsExpSCCategories ul.mvSrcLk li")
    #INIT
    driver.find_element_by_css_selector(".mvSrcLk>li:nth-child(1)>label.mvNavSel.mvNavLvl1").click()
    for catNumber in xrange(1, len(catNums) + 1):
        print "\nIN catnumber\n"
        driver.find_element_by_css_selector("ul#catMenu.mvSrcLk> li:nth-child(%s)> label.mvNavLvl1" % catNumber).click()
        time.sleep(5)
        self.parse_articles(driver)
        pages = driver.find_elements_by_xpath('//*[@class="pg"]/ul/li[last()]/a')

        if(pages):
            page = driver.find_element_by_xpath('//*[@class="pg"]/ul/li[last()]/a')

            checkText = (page.text).strip()
            if(len(checkText) > 0):
                pageNums = int(page.text)
                pageNums = pageNums  - 1
                for pageNumbers in range (pageNums):
                    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "waitingOverlay")))
                    driver.find_element_by_css_selector('.jsNxtPage.pgNext').click()
                    self.parse_articles(driver)
                    time.sleep(5)

def parse_articles(self,driver) :
    test = driver.find_elements_by_css_selector('html body div#page div#main.content div#sContener div#sContent div#lpContent.jsTab ul#lpBloc li div.prdtBloc p.prdtBDesc strong.prdtBCat')

def between(self, value, a, b):
    pos_a = value.find(a)
    if pos_a == -1: return ""
    pos_b = value.rfind(b)
    if pos_b == -1: return ""
    adjusted_pos_a = pos_a + len(a)
    if adjusted_pos_a >= pos_b: return ""
    return value[adjusted_pos_a:pos_b]


1 Answer


So your code has a few flaws here.

  1. You use selenium to parse the page contents when scrapy Selectors are faster and more efficient.
  2. You start a webdriver for every response.

This can be resolved very elegantly by using scrapy's downloader middlewares! You want to create a custom downloader middleware that downloads requests using selenium rather than the scrapy downloader.

For example I use this:

# middlewares.py
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumDownloader(object):
    def create_driver(self):
        """only start the driver if the middleware is ever called"""
        if not getattr(self, 'driver', None):
            self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        # this is called for every request, but we don't want to render
        # every request in selenium, so use a meta key for those we do want
        if not request.meta.get('selenium', False):
            return None  # fall through to scrapy's normal downloader
        self.create_driver()
        self.driver.get(request.url)
        return HtmlResponse(request.url, body=self.driver.page_source,
                            encoding='utf-8')

Activate your middleware:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumDownloader': 13,
}

Then in your spider you can specify which urls to download via the selenium driver by adding a meta argument.

# you can start with selenium
import scrapy

def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, meta={'selenium': True})

def parse(self, response):
    # this response is rendered by selenium!
    # other requests can skip selenium by omitting the meta key
    url = response.xpath("//a/@href").extract_first()
    yield scrapy.Request(url)

The advantage of this approach is that your driver is started only once and used only to download the page source; the rest is left to scrapy's proper asynchronous tools.
The disadvantage is that you cannot click buttons and the like, since you are not exposed to the driver. Most of the time you can reverse engineer what the buttons do via the network inspector, so you should rarely need to do any clicking with the driver itself.

