A Simple Web Scraper in R

R Scraping Experiment

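A simple web-scraping experiment in R. Out of laziness, I took a shortcut for handling the JavaScript-driven pagination.
The main web-related package used is {rvest}; the rest are everyday packages.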

library(rvest)
library(stringr)
library(dplyr)
library(ggplot2)

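The test site is Bilibili (B站). The goal is to search by keyword and count how many videos each uploader (UP主) has in the results (admittedly a rather pointless exercise).

The first step is to type the search term into Bilibili and copy the resulting URL.

A Chinese search term is percent-encoded in the URL: the string %E5%A4%9A%E8%82%89 below is the UTF-8 byte sequence of the keyword 多肉 (succulent plants).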

url <- "https://search.bilibili.com/all?keyword=%E5%A4%9A%E8%82%89"
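The encoded keyword can also be built in R rather than copied from the browser. A minimal sketch using base R's utils::URLencode; the variable names keyword, encoded and search_url are illustrative and not part of the original script:

```r
# The search keyword, written with Unicode escapes so the script does not
# depend on the source file's encoding ("\u591a\u8089" is 多肉)
keyword <- "\u591a\u8089"

# Percent-encode the UTF-8 bytes of the keyword
encoded <- utils::URLencode(keyword, reserved = TRUE)

# Assemble the search URL
search_url <- paste0("https://search.bilibili.com/all?keyword=", encoded)
```

Writing the keyword this way makes it easy to rerun the experiment with a different search term.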

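The pagination shortcut is to skip the JavaScript entirely and request each results page directly through the page query parameter. That requires knowing how many pages there are in total, and this number is stored in the opening body tag of the page.

This test search returns 50 pages, which fortunately is not many.

Extract the maximum page number with a regular expression: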

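# Serialize <body> and keep only its opening tag (everything before the first ">")
body_title <- strsplit(as.character(html_nodes(read_html(url), xpath = "//body")), ">")[[1]][1]
# Pull the number out of the data-num_pages="..." attribute
page_number <- as.numeric(gsub(".*data-num_pages=\"([0-9]+)\".*", "\\1", body_title, perl = TRUE))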
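If the page count really is stored as a data-num_pages attribute on the body tag, as the regular expression above assumes, it could also be read directly with html_attr(). A hedged sketch on a small inline HTML fragment; the fragment and the names demo_page and demo_pages are made up for illustration, and the real page layout may differ:

```r
library(rvest)

# Hypothetical stand-in for the search page, with the attribute on <body>
demo_page <- read_html('<html><body data-num_pages="50"><p>results</p></body></html>')

# Select <body> and read the attribute as a number
demo_pages <- demo_page %>%
  html_node("body") %>%
  html_attr("data-num_pages") %>%
  as.numeric()
```

Reading the attribute through the parsed document avoids depending on how the body tag happens to be serialized.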

Next comes extracting the uploader names. They live in link elements matched by this XPath:

xpath = '//a[@class="up-name"]'

After that, paging is just a matter of following Bilibili's page URL rule.

up_name_vec <- c()
for (i in seq_len(page_number)) {
  # Build the URL of results page i, sorted by overall rank
  new_url <- paste0(url, "&page=", i, "&order=totalrank")
  info <- read_html(new_url) %>%
    html_nodes(xpath = '//a[@class="up-name"]') %>%
    html_text(trim = TRUE)
  up_name_vec <- c(up_name_vec, info)
}
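When requesting dozens of pages in a row, it can help to pause between requests and to skip pages that fail to load. One possible variant of the loop is sketched below; the function name collect_up_names, its fetch and delay arguments, and the one-second pause are assumptions added here for illustration, not part of the original script:

```r
library(rvest)

# Collect uploader names from a vector of page URLs.
# `fetch` downloads and parses one page (read_html by default);
# `delay` is the pause in seconds before each request.
collect_up_names <- function(page_urls, fetch = read_html, delay = 1) {
  names_per_page <- lapply(page_urls, function(page_url) {
    Sys.sleep(delay)
    tryCatch(
      fetch(page_url) %>%
        html_nodes(xpath = '//a[@class="up-name"]') %>%
        html_text(trim = TRUE),
      # On a failed request, warn and contribute no names for this page
      error = function(e) {
        warning("Skipping ", page_url, ": ", conditionMessage(e))
        character(0)
      }
    )
  })
  unlist(names_per_page)
}

# Possible usage with the variables defined earlier:
# up_name_vec <- collect_up_names(
#   paste0(url, "&page=", seq_len(page_number), "&order=totalrank")
# )
```

Passing the fetch function as an argument also makes the helper easy to try out on local HTML without touching the network.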

Finally, draw a simple bar plot with {ggplot2}, keeping only uploaders that appear at least 5 times.

# Count how many search results each uploader has
up_table <- table(up_name_vec)
# Keep only uploaders with at least 5 results
need_stat <- up_table[which(up_table >= 5)]
up_df <- data.frame(
  up_name = names(need_stat),
  up_num = as.vector(need_stat)
)
ggplot(data = up_df, aes(up_name, up_num)) +
  geom_bar(stat = "identity", aes(fill = up_name), show.legend = FALSE, width = 0.7) +
  theme_bw() +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_text(angle = 45, size = 9, vjust = 0.58, color = "black")
  )
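The counting and filtering step can also be written with {dplyr}, which is loaded at the top but not otherwise used. A small sketch on made-up names; demo_names is invented sample data standing in for up_name_vec, and demo_df is an illustrative name:

```r
library(dplyr)

# Made-up uploader names standing in for the scraped up_name_vec
demo_names <- c(rep("up_a", 6), rep("up_b", 5), rep("up_c", 2))

# Count occurrences per uploader and keep those with at least 5 results
demo_df <- tibble(up_name = demo_names) %>%
  count(up_name, name = "up_num") %>%
  filter(up_num >= 5)
```

The result has the same up_name and up_num columns as up_df, so it could be passed to the same ggplot() call.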
