r - Rvest: Scrape multiple URLs

Question

Welcome To Ask or Share your Answers For Others

r - Rvest: Scrape multiple URLs

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

r - Rvest: Scrape multiple URLs

I am trying to scrape some IMDB data looping through a list of URLs. Unfortunately my output isn't exactly what I hoped for, never mind storing it in a dataframe.

I get URLs with

library(rvest)
topmovies <- read_html("http://www.imdb.com/chart/top")
links <- top250 %>%
  html_nodes(".titleColumn") %>%
  html_nodes("a") %>%
  html_attr("href")
links_full <- paste("http://imdb.com",links,sep="")
links_full_test <- links_full[1:10]

and then I could get content with

lapply(links_full_test, . %>% read_html() %>% html_nodes("h1") %>% html_text())

but it is a nested list and I don't know how to get it into a proper data.frame in R. Similarly, if I wanted to get another attribute, say

%>% read_html() %>% html_nodes("strong span") %>% html_text()

to retrieve the IMDB rating, I get the same nested-list output and most importantly I have to do read_html() twice ... which takes a lot of time. Is there a better way to do this? I guess for-loops, but I can't get it to work that way :(

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T17:51:27+0000

Here's one approach using purrr and rvest. The key idea is to save the parsed page, and then extract the bits you're interested in.

library(rvest)
library(purrr)

topmovies <- read_html("http://www.imdb.com/chart/top")
links <- topmovies %>%
  html_nodes(".titleColumn") %>%
  html_nodes("a") %>%
  html_attr("href") %>% 
  xml2::url_absolute("http://imdb.com") %>% 
  .[1:5] # for testing

pages <- links %>% map(read_html)

title <- pages %>% 
  map_chr(. %>% 
    html_nodes("h1") %>% 
    html_text()
  )
rating <- pages %>% 
  map_dbl(. %>% 
    html_nodes("strong span") %>% 
    html_text() %>% 
    as.numeric()
  )

Categories

r - Rvest: Scrape multiple URLs

r - Rvest: Scrape multiple URLs

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags