I am trying to scrape some IMDb data by looping through a list of URLs. Unfortunately my output isn't what I hoped for, let alone something I can store in a data frame.
I get URLs with
library(rvest)
topmovies <- read_html("http://www.imdb.com/chart/top")
links <- topmovies %>%
html_nodes(".titleColumn") %>%
html_nodes("a") %>%
html_attr("href")
links_full <- paste("http://imdb.com",links,sep="")
links_full_test <- links_full[1:10]
and then I could get content with
lapply(links_full_test, . %>% read_html() %>% html_nodes("h1") %>% html_text())
but it is a nested list and I don't know how to get it into a proper data.frame in R. Similarly, if I wanted to get another attribute, say
. %>% read_html() %>% html_nodes("strong span") %>% html_text()
to retrieve the IMDB rating, I get the same nested-list output and most importantly I have to do read_html() twice ... which takes a lot of time. Is there a better way to do this? I guess for-loops, but I can't get it to work that way :(
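One possible way to avoid the double read_html() is to parse each page once, keep the parsed document, and pull every field from it before binding the rows together. The following is only a minimal sketch under some assumptions: scrape_movie is a hypothetical helper name, the "h1" and "strong span" selectors are simply the ones from the question (IMDb's markup may differ today), and the two inline pages are made-up stand-ins so the example runs without network access.

```r
library(rvest)

# Hypothetical helper: takes an already parsed page and returns a one-row
# data frame. html_node() returns at most one match per page, so a page
# where a selector finds nothing yields NA instead of breaking the row count.
scrape_movie <- function(page) {
  data.frame(
    title  = html_text(html_node(page, "h1"), trim = TRUE),
    rating = as.numeric(html_text(html_node(page, "strong span"))),
    stringsAsFactors = FALSE
  )
}

# With the URLs from the question (needs network access), each page is
# downloaded and parsed exactly once:
# pages  <- lapply(links_full_test, read_html)
# movies <- do.call(rbind, lapply(pages, scrape_movie))

# Offline demonstration on two made-up pages with the assumed markup
pages <- list(
  read_html("<html><body><h1>Movie A</h1><strong><span>9.2</span></strong></body></html>"),
  read_html("<html><body><h1>Movie B</h1><strong><span>9.1</span></strong></body></html>")
)

# One row per page, one column per extracted field
movies <- do.call(rbind, lapply(pages, scrape_movie))
```

Here movies is a regular data.frame with a title and a rating column, so further fields can be added as extra columns inside scrape_movie without reading the page again.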