scroll - How to scrape an infinite-scroll page with Scrapy when there is no next-page information

Question

asked Jan 27, 2021 in Technique by 深蓝 (71.8m points)

I need to scrape a URL using Scrapy, but I can't scroll down the page to load all the elements.

I tried to find the next-page information, but I couldn't find it.

My spider code is:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from appinformatica.items import appinformaticaItem

import w3lib.html

class appinformaticaSpider (CrawlSpider):
    name = 'appinformatica'
    item_count=0
    start_urls =['https://www.appinformatica.com/telefonos/moviles/']
    rules = {
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//*[@class="info-ficha"]/div[1]/a')),
             callback='parse_item', follow=False)
    }

    def parse_item(self, response):
        item = appinformaticaItem()
        self.item_count += 1
        item['Modelo'] = w3lib.html.remove_tags(response.xpath("//h1").get(default=''))
        item['Position'] = self.item_count
        item['Precio'] = w3lib.html.remove_tags(response.xpath('//*[@id="ficha-producto"]/div[2]/div[1]/div/div[1]').get(default=''))
        item['PrecioTienda'] = w3lib.html.remove_tags(response.xpath('//*[@id="ficha-producto"]/div[2]/div[1]/div/div[2]').get(default=''))
        item['Stock'] = w3lib.html.remove_tags(response.xpath('//*[@id="ficha-producto"]/div[2]/div[3]/p[3]').get(default=''))
        item['Submodelo'] = w3lib.html.remove_tags(response.xpath('//*[@id="ficha-producto"]/div[2]/div[3]/p[2]/strong[2]').get(default=''))
        item['Url'] = w3lib.html.remove_tags(response.url)
        yield item

Can anyone help me?

1 Answer

深蓝 · Answer 1 · 2021-01-27T03:53:49+0000

Change allow to allow=r'/moviles/.*\.html', set follow=True, and add your allowed_domains. With follow=True the spider also extracts links from every product page it visits, so it can reach products that the initially loaded listing does not show, provided other pages link to them. Try this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

import w3lib.html


class appinformaticaSpider(CrawlSpider):
    name = 'appinformatica'
    allowed_domains = ['appinformatica.com']  # keep the crawl on this site
    item_count = 0
    start_urls = ['https://www.appinformatica.com/telefonos/moviles/']
    rules = (
        # Match product pages under /moviles/ and keep following links found on them.
        Rule(LinkExtractor(allow=r'/moviles/.*\.html'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # A plain dict replaces appinformaticaItem so the spider runs standalone.
        item = {}
        self.item_count += 1
        item['Modelo'] = w3lib.html.remove_tags(response.xpath("//h1").get(default=''))
        item['Position'] = self.item_count
        item['Precio'] = w3lib.html.remove_tags(response.xpath('//*[@id="ficha-producto"]/div[2]/div[1]/div/div[1]').get(default=''))
        item['PrecioTienda'] = w3lib.html.remove_tags(response.xpath('//*[@id="ficha-producto"]/div[2]/div[1]/div/div[2]').get(default=''))
        item['Stock'] = w3lib.html.remove_tags(response.xpath('//*[@id="ficha-producto"]/div[2]/div[3]/p[3]').get(default=''))
        item['Submodelo'] = w3lib.html.remove_tags(response.xpath('//*[@id="ficha-producto"]/div[2]/div[3]/p[2]/strong[2]').get(default=''))
        item['Url'] = response.url  # already plain text, no tags to strip
        yield item
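
A side note on the allow pattern: it is a regular expression, so an unescaped dot matches any character, not just a literal ".". The sketch below compares r'/moviles/.*.html' with the escaped r'/moviles/.*\.html' using only the standard re module. It assumes that LinkExtractor tests allow patterns with a regex search against the absolute URL, which is my understanding of Scrapy's behaviour. The product and "odd" URLs are made up for illustration and are not taken from the site.

```python
import re

# Unescaped dot before "html": any character is accepted in that position.
LOOSE = re.compile(r'/moviles/.*.html')
# Escaped dot: only a literal ".html" is accepted.
STRICT = re.compile(r'/moviles/.*\.html')


def is_allowed(pattern, url):
    """Return True if the pattern is found anywhere in the URL (assumed allow check)."""
    return pattern.search(url) is not None


# The listing URL comes from the question; the other two are hypothetical.
LISTING = 'https://www.appinformatica.com/telefonos/moviles/'
PRODUCT = 'https://www.appinformatica.com/telefonos/moviles/example-phone.html'
ODD = 'https://www.appinformatica.com/telefonos/moviles/guia-xhtml'

if __name__ == '__main__':
    for url in (LISTING, PRODUCT, ODD):
        print(url, is_allowed(LOOSE, url), is_allowed(STRICT, url))
```

Both patterns skip the listing page itself and accept the product page. Only the unescaped one also accepts the "guia-xhtml" URL, which is why escaping the dot is the safer choice if the site has pages like that.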
