Requirement analysis: In a recent project I needed to start from a news index page (the kind of page that is mostly a list of links), fetch the source of each news article page, and save that source into a local database. Because of network instability, the program kept hitting "404 page not found" errors even though the pages really do exist, and strangely these errors usually only appeared after the program had been running for a while. None of the solutions I found online worked for me. I have now solved the problem, so I am writing up my approach here to share and discuss it. The core of the solution: whenever this error occurs, the download function recursively calls itself to retry. The code is explained below:
// Required namespaces: System, System.IO, System.Net, System.Text, System.Text.RegularExpressions
public static string GetDataFromUrl(string url, int nRetryTimes)
{
    if (nRetryTimes <= 0)
        return string.Empty;

    string result = string.Empty;
    try
    {
        result = GetDataFromUrl(url);
    }
    catch (System.Exception exc)
    {
        // A page that actually exists sometimes comes back as "404"; retry it.
        if (exc.Message.IndexOf("404") != -1)
        {
            result = GetDataFromUrl(url, nRetryTimes - 1);
        }
    }
    return result;
}

Here nRetryTimes is the number of times the function calls itself after this error occurs; it can also be read as the termination condition of the recursion. The GetDataFromUrl(string url) function is shown below:
public static string GetDataFromUrl(string url)
{
    string str = string.Empty;
    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);

    // Set the HTTP request headers.
    request.AllowAutoRedirect = true;
    request.AllowWriteStreamBuffering = true;
    request.Referer = "";
    request.Timeout = 10000000;
    request.UserAgent = "";
    request.KeepAlive = false; // helps avoid time-out errors

    HttpWebResponse response = null;
    response = (HttpWebResponse)request.GetResponse();

    // Determine the encoding from the HTTP response headers.
    string characterSet = response.CharacterSet;
    Encoding encode;
    if (characterSet != "")
    {
        if (characterSet == "ISO-8859-1")
        {
            characterSet = "gb2312"; // many Chinese pages report ISO-8859-1 incorrectly
        }
        encode = Encoding.GetEncoding(characterSet);
    }
    else
    {
        encode = Encoding.Default;
    }

    // Buffer the HTTP response stream in a memory stream.
    Stream receiveStream = response.GetResponseStream();
    MemoryStream mStream = new MemoryStream();
    byte[] bf = new byte[255];
    int count = receiveStream.Read(bf, 0, 255);
    while (count > 0)
    {
        mStream.Write(bf, 0, count);
        count = receiveStream.Read(bf, 0, 255);
    }
    receiveStream.Close();
    mStream.Seek(0, SeekOrigin.Begin);

    // Read the text out of the memory stream.
    StreamReader reader = new StreamReader(mStream, encode);
    char[] buffer = new char[1024];
    count = reader.Read(buffer, 0, 1024);
    while (count > 0)
    {
        str += new String(buffer, 0, count);
        count = reader.Read(buffer, 0, 1024);
    }

    // Check the charset declared in the page itself. If it differs from the encoding in
    // the HTTP response headers, the page declaration wins: re-read the text from the
    // memory stream with that encoding.
    Regex reg = new Regex(@"<meta[\s\S]+?charset=(.*?)""[\s\S]+?>",
        RegexOptions.Multiline | RegexOptions.IgnoreCase);
    MatchCollection mc = reg.Matches(str);
    if (mc.Count > 0)
    {
        string tempCharSet = mc[0].Result("$1");
        if (string.Compare(tempCharSet, characterSet, true) != 0)
        {
            encode = Encoding.GetEncoding(tempCharSet);
            str = string.Empty;
            mStream.Seek(0, SeekOrigin.Begin);
            reader = new StreamReader(mStream, encode);
            buffer = new char[255];
            count = reader.Read(buffer, 0, 255);
            while (count > 0)
            {
                str += new String(buffer, 0, count);
                count = reader.Read(buffer, 0, 255);
            }
        }
    }

    reader.Close();
    mStream.Close();
    if (response != null)
        response.Close();

    return str;
}
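For context, here is a minimal sketch of how the retry wrapper might be called while walking the index page. The articleUrls collection, the retry count of 3, and the SaveArticleSource helper are assumptions for illustration, not part of the original code:

// Illustrative caller: articleUrls and SaveArticleSource are assumed helpers.
foreach (string articleUrl in articleUrls)
{
    // Retry up to 3 times when a spurious "404" is reported for an existing page.
    string pageSource = GetDataFromUrl(articleUrl, 3);
    if (pageSource != string.Empty)
    {
        SaveArticleSource(articleUrl, pageSource); // store ArticlePageUrl / ArticlePageSource in the database
    }
}

Matching exc.Message against "404" works in practice; an alternative is to catch WebException and inspect ((HttpWebResponse)exc.Response).StatusCode, which does not depend on the wording of the exception message.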
It is worth noting that even with this approach, when you inspect the database you will still find some article pages whose source was not downloaded. Take my table as an example; its columns are: ArticlePageId, the primary key; ArticlePageTitle, the news title; ArticlePageUrl, the URL of the article page; and ArticlePageSource, the page source downloaded from ArticlePageUrl. If ArticlePageSource is empty, the download failed. So I added a "patching" module that re-downloads those rows. The code is as follows:
    if (isModified)
    {
        adapter.Update(table);
    }
}
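Only the tail of that module is shown above. A minimal sketch of what the full re-download pass could look like, assuming the rows have been loaded into a DataTable named table by a SqlDataAdapter named adapter with an UpdateCommand configured, and using the column names listed earlier (the method name, the row filtering, and the retry count of 3 are illustrative):

// Sketch only. Requires: using System.Data; using System.Data.SqlClient;
public static void PatchMissingArticleSources(SqlDataAdapter adapter, DataTable table)
{
    bool isModified = false;
    foreach (DataRow row in table.Rows)
    {
        // An empty ArticlePageSource means the earlier download failed; try again.
        string source = row["ArticlePageSource"] as string;
        if (string.IsNullOrEmpty(source))
        {
            string url = (string)row["ArticlePageUrl"];
            string pageSource = GetDataFromUrl(url, 3);
            if (pageSource != string.Empty)
            {
                row["ArticlePageSource"] = pageSource;
                isModified = true;
            }
        }
    }
    if (isModified)
    {
        adapter.Update(table);
    }
}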
PS: I am a newcomer, and this is my first time posting on the front page to share a small finding and my views. If anything here is wrong, please point it out so that I do not lead others astray.