r - Harvest (rvest) multiple HTML pages from a list of urls

asked Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I have a dataframe that looks like this:

country <- c("Canada", "US", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", "http://en.wikipedia.org/wiki/Canada",
          "http://en.wikipedia.org/wiki/Japan", "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)

    country url
1   Canada  http://en.wikipedia.org/wiki/United_States
2   US      http://en.wikipedia.org/wiki/Canada
3   Japan   http://en.wikipedia.org/wiki/Japan
4   China   http://en.wikipedia.org/wiki/China

Using rvest I'd like to scrape the table of contents for each url and bind them to one single output.

This code extracts the table of contents for one url:

library(rvest)
# read_html() fetches a single page, so pass one URL
toc <- read_html(url[1]) %>%
  html_nodes(".toctext") %>%
  html_text()
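
To see what the selector step does without hitting the network, here is a minimal sketch that runs the same pipeline on an inline HTML snippet. The snippet is a hypothetical stand-in for a downloaded article; it only assumes that TOC entries carry the `toctext` class, as the selector above implies.

```r
library(rvest)

# Hypothetical stand-in for a downloaded article: two TOC entries marked
# with the same "toctext" class that the selector targets.
page <- read_html('
  <div id="toc">
    <ul>
      <li><span class="toctext">Etymology</span></li>
      <li><span class="toctext">History</span></li>
    </ul>
  </div>')

# Same steps as for a live page: select the nodes, then pull out their text.
toc <- page %>%
  html_nodes(".toctext") %>%
  html_text()

toc
```

The result is a plain character vector with one element per matched node, which is why it can later be dropped straight into a data frame column.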

Desired Output:

country toc
US      Etymology
        History
        Native American and European contact
        Settlements
        ...  
Canada  Etymology
        History
        Aboriginal peoples
        European colonization
        ...etc

1 Answer

深蓝 · Answer 1 · 2021-10-17T03:05:39+0000

This will scrape them into a full data frame (one row per TOC entry). Tedious-but-straightforward "print/output" code left to the OP:

library(rvest)
library(dplyr)

country <- c("Canada", "US", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", 
         "http://en.wikipedia.org/wiki/Canada",
         "http://en.wikipedia.org/wiki/Japan", 
         "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)

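# Fetch every page and collect its TOC entries, one row per entry
toc_entries <- bind_rows(lapply(url, function(x) {
  data.frame(url = x,
             toc_entry = read_html(x) %>%  # read the page for this URL
               html_nodes(".toctext") %>%
               html_text(),
             stringsAsFactors = FALSE)
}))

# Attach the country label to each entry
df <- toc_entries %>% left_join(df, by = "url")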

df[sample(nrow(df), 10),]

## Source: local data frame [10 x 3]
## 
##                                           url                            toc_entry country
## 1          http://en.wikipedia.org/wiki/Japan                   Government finance   Japan
## 2         http://en.wikipedia.org/wiki/Canada        Cold War and civil rights era      US
## 3  http://en.wikipedia.org/wiki/United_States                                 Food  Canada
## 4          http://en.wikipedia.org/wiki/Japan                               Sports   Japan
## 5         http://en.wikipedia.org/wiki/Canada                             Religion      US
## 6          http://en.wikipedia.org/wiki/China        Cold War and civil rights era   China
## 7          http://en.wikipedia.org/wiki/Japan Literature, philosophy, and the arts   Japan
## 8  http://en.wikipedia.org/wiki/United_States                           Population  Canada
## 9          http://en.wikipedia.org/wiki/Japan                          Settlements   Japan
## 10        http://en.wikipedia.org/wiki/Canada                             Military      US
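
For the "print/output" step, one possible sketch is below. It is not part of the original answer: `blank_repeats` is a hypothetical helper, and `toc_df` is illustrative data built from a few entries in the question's desired output, in the same one-row-per-entry shape as the scraped data frame. It assumes rows for the same country are already adjacent.

```r
# Illustrative data only: a few entries taken from the desired output in the
# question, one row per TOC entry.
toc_df <- data.frame(
  country   = c("US", "US", "Canada", "Canada"),
  toc_entry = c("Etymology", "History", "Etymology", "History"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: show each country once, on its first row only.
# Assumes rows for the same country are already adjacent.
blank_repeats <- function(d) {
  d$country <- as.character(d$country)  # guard against factor columns
  repeated <- c(FALSE, d$country[-1] == d$country[-nrow(d)])
  d$country[repeated] <- ""
  d[, c("country", "toc_entry")]
}

out <- blank_repeats(toc_df)
print(out, row.names = FALSE)
```

Applied to the joined data frame, the same helper would be called as `blank_repeats(df)`; blanking the labels is purely for display, so keep the original data frame for any further processing.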
