Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
469 views
in Technique [Technology] by (71.8m points)

python - Is it possible to remove requests from scrapy's scheduler queue?

Is it possible to remove requests from scrapy's scheduler queue? I have a working routine that limits crawling of a given domain to a set amount of time. It works in the sense that it stops yielding new links once the time limit is hit, but since the queue can already contain thousands of requests for that domain, I'd like to remove those from the scheduler queue as well once the limit is reached.
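The original poster's routine is not shown; a minimal sketch of what such a per-domain time limit might look like (the class and method names here are hypothetical, not from the question) is:

```python
import time


class TimeLimitedCrawl:
    """Track when each domain was first seen and report whether its
    time budget (in seconds) has been used up."""

    def __init__(self, limit_seconds):
        self.limit = limit_seconds
        self.started = {}  # domain -> timestamp of first request

    def allowed(self, domain, now=None):
        # Record the first time we see a domain, then compare elapsed
        # time against the budget.
        now = time.time() if now is None else now
        start = self.started.setdefault(domain, now)
        return (now - start) < self.limit
```

A spider would call `allowed()` before yielding new requests; the problem the question raises is that requests already sitting in the scheduler queue are unaffected by such a check.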



1 Answer

0 votes
by (71.8m points)

Okay so I ended up following the suggestion from @rickgh12hs and wrote my own Downloader Middleware:

from scrapy.exceptions import IgnoreRequest
import tldextract

class clearQueueDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # Reduce the request URL to its registered domain, e.g.
        # "sub.example.co.uk" -> "example.co.uk".
        domain_obj = tldextract.extract(request.url)
        just_domain = domain_obj.registered_domain
        if just_domain in spider.blocked:
            spider.logger.info("Blocked domain: %s (url: %s)", just_domain, request.url)
            # IgnoreRequest drops the request before it is downloaded.
            raise IgnoreRequest("URL blocked: %s" % request.url)

spider.blocked is a class-level list of blocked domains, which prevents any further downloads from those domains. Seems to work great, kudos to @rickgh12hs!
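For the middleware to run, it has to be registered in the project settings; the module path and priority below are assumptions based on a typical scrapy project layout, not taken from the answer:

```python
# settings.py (module path and priority value are assumptions)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.clearQueueDownloaderMiddleware": 543,
}
```

The spider then only needs to append a domain to its `blocked` list when its time budget expires; every queued request for that domain will subsequently be dropped by the middleware instead of being downloaded.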

