python - how to filter duplicate requests based on url in scrapy

Question

Welcome To Ask or Share your Answers For Others

python - how to filter duplicate requests based on url in scrapy

asked Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - how to filter duplicate requests based on url in scrapy

I am writing a crawler for a website using scrapy with CrawlSpider.

Scrapy provides an in-built duplicate-request filter which filters duplicate requests based on urls. Also, I can filter requests using rules member of CrawlSpider.

What I want to do is to filter requests like:

http:://www.abc.com/p/xyz.html?id=1234&refer=5678

If I have already visited

http:://www.abc.com/p/xyz.html?id=1234&refer=4567

NOTE: refer is a parameter that doesn't affect the response I get, so I don't care if the value of that parameter changes.

Now, if I have a set which accumulates all ids I could ignore it in my callback function parse_item (that's my callback function) to achieve this functionality.

But that would mean I am still at least fetching that page, when I don't need to.

So what is the way in which I can tell scrapy that it shouldn't send a particular request based on the url?

question from:https://stackoverflow.com/questions/12553117/how-to-filter-duplicate-requests-based-on-url-in-scrapy

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-06T17:19:38+0000

You can write custom middleware for duplicate removal and add it in settings

import os

from scrapy.dupefilter import RFPDupeFilter

class CustomFilter(RFPDupeFilter):
"""A dupe filter that considers specific ids in the url"""

    def __getid(self, url):
        mm = url.split("&refer")[0] #or something like that
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

Then you need to set the correct DUPFILTER_CLASS in settings.py

DUPEFILTER_CLASS = 'scraper.duplicate_filter.CustomFilter'

It should work after that

Categories

python - how to filter duplicate requests based on url in scrapy

python - how to filter duplicate requests based on url in scrapy

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags