Developing a Web Crawler with TypeScript
Install TypeScript globally:
npm install -g typescript
At the time of writing the current version is 2.0.3, which no longer needs the typings command. However, the TypeScript bundled with VS Code is 1.8, so a little configuration is required; the workaround is described below.
Test the tsc command:
tsc
Create the project folder:
mkdir test-typescript-spider
Enter the folder:
cd test-typescript-spider
Initialize the project:
npm init
Install the superagent and cheerio modules:
npm i --save superagent cheerio
Install the corresponding type declaration packages:
npm i --save @types/superagent
npm i --save @types/cheerio
Install TypeScript locally inside the project (this step is required):
npm i --save typescript
Open the project folder in VS Code. In that folder, create a tsconfig.json file and copy the following configuration into it:
{ "compilerOptions": { "target": "ES6", "module": "commonjs", "noEmitOnError": true, "noImplicitAny": true, "experimentalDecorators": true, "sourceMap": false, // "sourceRoot": "./", "outDir": "./out" }, "exclude": [ "node_modules" ] } |
In VS Code, open "File" - "Preferences" - "Workspace Settings".
Add the following to settings.json (without this setting, VS Code will prompt you to choose which TypeScript version to use every time the project is opened):
{
"typescript.tsdk": "node_modules/typescript/lib"
}
Create an api.ts file and copy the following code into it:
import superagent = require('superagent');
import cheerio = require('cheerio');

export const remote_get = function (url: string) {
    const promise = new Promise<superagent.Response>(function (resolve, reject) {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    console.log(err);
                    reject(err);
                }
            });
    });
    return promise;
}
Create an app.ts file with some test code:
import api = require('./api');

const go = async () => {
    let res = await api.remote_get('http://www.baidu.com/');
    console.log(res.text);
}
go();
Run the command:
tsc
Then run the compiled output (per the outDir setting, the compiled JavaScript lands in the out folder):
node out/app
Check whether the output is correct.
Now try to crawl the article links on the first page of http://cnodejs.org/.
Modify app.ts as follows:
import api = require('./api');
import cheerio = require('cheerio');

const go = async () => {
    const res = await api.remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    let titles: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        titles.push($(element).find('.topic_title').first().text().trim());
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    });
    console.log(titles, urls);
}
go();
Check the output: the article titles and links have all been retrieved.
Now try to go one level deeper and crawl the article content as well:
import api = require('./api');
import cheerio = require('cheerio');

const go = async () => {
    const res = await api.remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    $('.topic_title_wrapper').each(async (index, element) => {
        let url = 'http://cnodejs.org' + $(element).find('.topic_title').first().attr('href');
        const res_content = await api.remote_get(url);
        const $_content = cheerio.load(res_content.text);
        console.log($_content('.topic_content').first().text());
    });
}
go();
You will notice many 503 errors, because the requests hit the server far too aggressively.
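The underlying issue is that .each() does not wait for an async callback, so every content request starts almost simultaneously. A minimal sketch of the difference between the two request patterns (fetch_one and demo_urls here are hypothetical stand-ins, not part of the project code):

// Illustration only: fetch_one stands in for a request function such as api.remote_get.
declare function fetch_one(url: string): Promise<string>;
const demo_urls = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c'];

// Concurrent: every request starts immediately (what the .each(async ...) version effectively does).
const concurrent = async () => {
    await Promise.all(demo_urls.map(url => fetch_one(url)));
};

// Sequential: each request waits for the previous one to finish, which is far gentler on the server.
const sequential = async () => {
    for (const url of demo_urls) {
        await fetch_one(url);
    }
};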
Solution:
Add a helper.ts file:
export const wait_seconds = function (seconds: number) {
    return new Promise(resolve => setTimeout(resolve, seconds * 1000));
}
Modify api.ts to:
import superagent = require('superagent');
import cheerio = require('cheerio');

export const get_index_urls = async function () {
    const res = await remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    });
    return urls;
}

export const get_content = async function (url: string) {
    const res = await remote_get(url);
    const $ = cheerio.load(res.text);
    return $('.topic_content').first().text();
}

export const remote_get = function (url: string) {
    const promise = new Promise<superagent.Response>(function (resolve, reject) {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    console.log(err);
                    reject(err);
                }
            });
    });
    return promise;
}
Modify app.ts to:
import api = require('./api');
import helper = require('./helper');

const go = async () => {
    let urls = await api.get_index_urls();
    for (let i = 0; i < urls.length; i++) {
        // Wait one second between requests so the server is not flooded.
        await helper.wait_seconds(1);
        let text = await api.get_content(urls[i]);
        console.log(text);
    }
}
go();
From the output you can see that the program now waits one second before requesting the next content page.
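If stray 503s still appear, one possible extension (my own sketch, not part of the original code) is a retry wrapper that backs off after each failure, built from the existing remote_get and wait_seconds:

// Hypothetical retry helper (e.g. in a new retry.ts), not part of the original tutorial.
import api = require('./api');
import helper = require('./helper');
import superagent = require('superagent');

export const get_with_retry = async function (url: string, retries: number = 3): Promise<superagent.Response> {
    for (let attempt = 0; attempt < retries; attempt++) {
        try {
            return await api.remote_get(url);
        } catch (err) {
            // Back off a little longer after each failed attempt.
            await helper.wait_seconds(2 * (attempt + 1));
        }
    }
    throw new Error('Failed to fetch ' + url + ' after ' + retries + ' attempts');
}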
Now try to store the crawled data in a database.
Install the mongoose module and its type declarations:
npm i mongoose --save
npm i --save @types/mongoose
Then set up the Schema. First create a models folder:
mkdir models
Create index.ts in the models folder:
import * as mongoose from 'mongoose';

// Requires a local MongoDB instance listening on 127.0.0.1.
mongoose.connect('mongodb://127.0.0.1/cnodejs_data', { server: { poolSize: 20 } }, function (err) {
    if (err) {
        process.exit(1);
    }
});

// models
export const Article = require('./Article');
Create IArticle.ts in the models folder:
interface IArticle {
    title: string;
    url: string;
    text: string;
}
export = IArticle;
Create Article.ts in the models folder:
import mongoose = require('mongoose');
import IArticle = require('./IArticle');

interface IArticleModel extends IArticle, mongoose.Document { }

const ArticleSchema = new mongoose.Schema({
    title: { type: String },
    url: { type: String },
    text: { type: String },
});

const Article = mongoose.model<IArticleModel>("Article", ArticleSchema);
export = Article;
Modify api.ts to:
import superagent = require('superagent');
import cheerio = require('cheerio');
import models = require('./models');
const Article = models.Article;

export const get_index_urls = async function () {
    const res = await remote_get('http://cnodejs.org/');
    const $ = cheerio.load(res.text);
    let urls: string[] = [];
    $('.topic_title_wrapper').each((index, element) => {
        urls.push('http://cnodejs.org' + $(element).find('.topic_title').first().attr('href'));
    });
    return urls;
}

export const fetch_content = async function (url: string) {
    const res = await remote_get(url);
    const $ = cheerio.load(res.text);
    let article = new Article();
    article.text = $('.topic_content').first().text();
    // Strip the "置顶" (pinned) and "精华" (featured) tags that cnodejs prepends to titles.
    article.title = $('.topic_full_title').first().text().replace('置顶', '').replace('精华', '').trim();
    article.url = url;
    console.log('Fetched: ' + article.title);
    article.save();
}

export const remote_get = function (url: string) {
    return new Promise<superagent.Response>((resolve, reject) => {
        superagent.get(url)
            .end(function (err, res) {
                if (!err) {
                    resolve(res);
                } else {
                    reject(err);
                }
            });
    });
}
Modify app.ts to:
import api = require('./api');
import helper = require('./helper');

(async () => {
    try {
        let urls = await api.get_index_urls();
        for (let i = 0; i < urls.length; i++) {
            await helper.wait_seconds(1);
            await api.fetch_content(urls[i]);
        }
    } catch (err) {
        console.log(err);
    }
    console.log('Done!');
})();
Run tsc, then:
node out/app
Check the output, then look in the database.
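To verify from code instead of the mongo shell, a small sketch like the following could be used (check.ts is my own file name, not part of the original project; it reuses the same Article model):

// check.ts - hypothetical verification script, not part of the original tutorial.
import models = require('./models');
const Article = models.Article;

(async () => {
    // List what the crawler stored.
    const articles = await Article.find().exec();
    console.log('Stored articles: ' + articles.length);
    for (const article of articles) {
        console.log(article.title + ' -> ' + article.url);
    }
    process.exit(0);
})();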
You can see that the articles were saved to the database successfully!