I have a dataframe that looks like this:
country <- c("Canada", "US", "Japan", "China")
url <- c("http://en.wikipedia.org/wiki/United_States", "http://en.wikipedia.org/wiki/Canada",
"http://en.wikipedia.org/wiki/Japan", "http://en.wikipedia.org/wiki/China")
df <- data.frame(country, url)
country link
1 Canada http://en.wikipedia.org/wiki/United_States
2 US http://en.wikipedia.org/wiki/Canada
3 Japan http://en.wikipedia.org/wiki/Japan
4 China http://en.wikipedia.org/wiki/China
Using rvest
I'd like to scrape the table of contents for each url and bind them to one single output.
This code extracts the table of contents for one url:
library(rvest)
toc <- html(url) %>%
html_nodes(".toctext") %>%
html_text()
Desired Output:
country toc
US Etymology
History
Native American and European contact
Settlements
...
Canada Etymology
History
Aboriginal peoples
European colonization
...etc
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…