python - how to process all kinds of exceptions in a Scrapy project, in errback and callback?

I am currently working on a scraper project in which it is very important that EVERY request gets properly handled, i.e., either an error is logged or a successful result is saved. I've already implemented the basic spider, and I can now process 99% of the requests successfully, but I can still get errors like captchas, 50x or 30x responses, or even not enough fields in the result (in which case I'll try another website to find the missing fields).

At first, I thought it more "logical" to raise exceptions in the parsing callback and process them all in the errback; this could make the code more readable. But I tried it, only to find out that the errback can only trap errors from the downloader module, such as non-200 response statuses. If I raise a self-implemented ParseError in the callback, the spider just raises it and stops.
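For reference, this is how I'm attaching the errback right now (the URL is made up):

from scrapy.http import Request

def start_requests(self):
    yield Request(
        'http://www.example.com/some/page',   # made-up URL
        callback=self.parseRound1,
        errback=self.errHandler,   # only ever sees downloader failures,
                                   # never the exceptions I raise in parseRound1
    )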

Even if I have to process the parsing error directly in the callback, I don't know how to retry the request immediately from the callback in a clean fashion. You know, I may have to include a different proxy to send another request, or modify some request header.
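The closest thing I can think of is re-issuing a modified copy of the original request from inside the callback, like this (looks_like_captcha and the proxy address are made up; dont_filter=True should keep the duplicate filter from dropping the retry):

def parseRound1(self, response):
    if looks_like_captcha(response):          # made-up helper
        retry = response.request.replace(dont_filter=True)
        retry.meta['proxy'] = 'http://some.other.proxy:8080'   # made-up proxy
        retry.headers['User-Agent'] = 'some-other-user-agent'
        yield retry
        return
    # ... normal parsing continues here ...

But I'm not sure this is the clean way to do it.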

I admit I'm relatively new to Scrapy, but I've tried back and forth for days and still cannot get this working. I've checked every related question on SO and none matches; thanks in advance for the help.

UPDATE: I realize this could be a very complex question, so I'll try to illustrate the scenario in the following pseudocode. I hope this helps:

from scraper.myexceptions import *

def parseRound1(self, response):

    # ... some parsing routines ...
    if something_went_wrong:
        # this causes the spider to raise a SpiderException and stop
        raise CaptchaError
    # ...

    if not_enough_fields_scraped:
        raise ParseError(task, "not enough fields")
    else:
        return items

def parseRound2(self, response):
    # ... some other parsing routines ...

def errHandler(self, failure):
    # how to trap all the exceptions?
    r = failure.trap()
    # cannot trap ParseError here
    if r == CaptchaError:
        # how to re-enqueue the original request here?
        retry
    elif r == ParseError:
        if raised_from_parseRound1:
            new_request_for_round2
        else:
            some_other_retry_mechanism
    elif r == HTTPError:
        ignore_or_retry
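For what it's worth, the only things I can actually trap in errHandler are downloader-level failures, e.g.:

from twisted.internet.error import DNSLookupError, TimeoutError

def errHandler(self, failure):
    # this works for errors raised below the spider...
    if failure.check(DNSLookupError, TimeoutError):
        request = failure.request
        self.log('network error on %s' % request.url)
    # ...but failure.check(ParseError) never matches, because exceptions
    # raised inside a callback never reach the errback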

1 Answer


EDIT 16 Nov 2012: Scrapy >= 0.16 uses a different method to attach handlers to signals; an extra example has been added.

The simplest solution would be to write an extension in which you capture failures, using Scrapy signals. For example, the following extension will catch all errors and print a traceback.

You could do anything with the Failure (which is an instance of twisted.python.failure.Failure) - like saving it to your database, or sending an email.

For Scrapy versions up to 0.16:

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class FailLogger(object):
    def __init__(self):
        """
        Attach appropriate handlers to the signals.
        """
        dispatcher.connect(self.spider_error, signal=signals.spider_error)

    def spider_error(self, failure, response, spider):
        print("Error on {0}, traceback: {1}".format(response.url, failure.getTraceback()))

For Scrapy 0.16 and up:

from scrapy import signals

class FailLogger(object):

    @classmethod
    def from_crawler(cls, crawler):
        # instantiate the extension and hook it up to the spider_error signal
        ext = cls()
        crawler.signals.connect(ext.spider_error, signal=signals.spider_error)
        return ext

    def spider_error(self, failure, response, spider):
        print("Error on {0}, traceback: {1}".format(response.url, failure.getTraceback()))

You would enable the extension in the settings, with something like:

EXTENSIONS = {
    'spiders.extensions.faillog.FailLogger': 599,
}
