Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
599 views
in Technique by (71.8m points)

python - How to scrape dynamic content from a website?

I'm using Scrapy to scrape data from Amazon's books section, but I found out that some of the data is loaded dynamically. I want to know how dynamic data can be extracted from a website. Here's what I've tried so far:

import scrapy
from ..items import AmazonsItem

class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    start_urls = ['https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6']

    def parse(self, response):
        items = AmazonsItem()  # item container (not populated yet)
        # Titles are read from the initial HTML only; dynamically loaded data will not appear here
        products_name = response.css('.s-access-title::attr("data-attribute")').extract()
        for product_name in products_name:
            print(product_name)
        # Follow the "next page" link, if there is one
        next_page = response.css('li.a-last a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

I was using SelectorGadget to find the class I need to scrape, but it doesn't work on a dynamic website.

  1. How do I scrape a website that has dynamic content?
  2. What exactly is the difference between dynamic and static content?
  3. How do I extract other information, such as the price and image, from the website, and how do I find the specific classes, for example the one for a price?
  4. How would I know that data is dynamically created?

He who fights the dragon for too long becomes a dragon himself; gaze into the abyss for too long, and the abyss gazes back…

1 Answer

0 votes
by (71.8m points)

So how do I scrape a website which has dynamic content?

There are a few options:

  1. Use Selenium, which drives a real browser: open the page, let it render, then pull the resulting HTML source.
  2. Sometimes you can inspect the XHR requests and fetch the data directly (for example, from an API endpoint the page calls).
  3. Sometimes the data sits inside the <script> tags of the HTML source. You can search through those and call json.loads() once you've trimmed the text down to valid JSON.
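
Option 3 can be sketched roughly as follows. This is a minimal illustration using only the standard library, not a recipe for Amazon specifically: the sample HTML, the application/json script type, and the "products" key are all made up for the example, and a real page may need extra trimming before json.loads() accepts the text.

```python
import json
from html.parser import HTMLParser


class ScriptJSONParser(HTMLParser):
    """Collect the parsed contents of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.payloads.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # the script body was not valid JSON; skip it


def extract_script_json(html):
    """Return a list with one decoded object per JSON <script> tag."""
    parser = ScriptJSONParser()
    parser.feed(html)
    parser.close()
    return parser.payloads


# Hypothetical page source; real sites use their own structure and keys.
SAMPLE_HTML = """
<html><body>
<div id="results"></div>
<script type="application/json">
{"products": [{"title": "Example Book", "price": "199.00"}]}
</script>
</body></html>
"""

payloads = extract_script_json(SAMPLE_HTML)
for product in payloads[0]["products"]:
    print(product["title"], product["price"])
```

In a Scrapy spider, the string passed to extract_script_json would be response.text.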

What exactly is the difference between dynamic and static content?

Dynamic means the data is generated by a request made after the initial page request, typically by JavaScript running in the browser. Static means all the data is already there in the response to the original request.

How do I extract other information, such as the price and image, from the website, and how do I find the specific classes, for example the one for a price?

See the answer to your first question; the same options apply.
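
Once you have HTML that actually contains the data (the rendered source, for instance), pulling out fields such as the price and image comes down to matching tags and classes. Below is a small sketch using the standard library. The "price" class and the sample markup are hypothetical, so look up the real class names in the dev tools. It is also deliberately simple and does not handle spans nested inside the price element.

```python
from html.parser import HTMLParser


class ProductFieldParser(HTMLParser):
    """Collect text of <span class="..."> price elements and all <img> sources."""

    def __init__(self, price_class):
        super().__init__()
        self.price_class = price_class  # hypothetical class name to match
        self._in_price = False
        self.prices = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and self.price_class in classes:
            self._in_price = True
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


# Hypothetical rendered markup for two products.
SAMPLE_HTML = """
<div class="item">
  <img src="https://example.com/cover1.jpg">
  <span class="title">Example Book One</span>
  <span class="price">Rs. 199</span>
</div>
<div class="item">
  <img src="https://example.com/cover2.jpg">
  <span class="title">Example Book Two</span>
  <span class="price">Rs. 249</span>
</div>
"""

field_parser = ProductFieldParser(price_class="price")
field_parser.feed(SAMPLE_HTML)
field_parser.close()
print(field_parser.prices)
print(field_parser.images)
```

Inside a Scrapy callback the same idea is usually written with selectors, for example response.css('span.price::text').getall(), again assuming that class name exists on the page.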

How would I know that data is dynamically created?

You'll know it's dynamically created if you see it in the rendered page in the browser's dev tools but not in the HTML source returned by your first request. You can also check whether the data comes from additional requests by looking at Network -> XHR in the dev tools.
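
The comparison described above can be written down as a tiny check. This is only a sketch under some assumptions: raw_html stands for the body of the first response (response.text in Scrapy), rendered_html for the markup after JavaScript has run (copied from the dev tools, or page_source from Selenium), and marker for some text you can see on the page, such as a book title. The function name and sample strings are invented for the example.

```python
def classify_content(raw_html, rendered_html, marker):
    """Guess whether `marker` is static or dynamic content.

    "static"  -> present in the first response
    "dynamic" -> only present after the page has rendered
    "missing" -> found in neither
    """
    if marker in raw_html:
        return "static"
    if marker in rendered_html:
        return "dynamic"
    return "missing"


# Hypothetical page: the title ships in the first response,
# the price is filled in later by JavaScript.
RAW_HTML = '<h2>Example Book</h2><span id="price"></span>'
RENDERED_HTML = '<h2>Example Book</h2><span id="price">Rs. 199</span>'

print(classify_content(RAW_HTML, RENDERED_HTML, "Example Book"))
print(classify_content(RAW_HTML, RENDERED_HTML, "Rs. 199"))
```

A plain substring test is crude (whitespace or entity encoding can hide a match), but it is usually enough to tell whether a spider working on the raw response has any chance of finding the value.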

Lastly

Amazon does offer an API to access the data. Try looking into that as well.

