A Simple Crawler Using C# Sockets

Introduction

A web crawler (also known as a web spider or ant) is a program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a web site, such as checking links or validating HTML code. Crawlers can also be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

Crawler Overview

In this article, I will introduce a simple Web crawler with a simple interface, to describe the crawling story in a simple C# program. To keep things simple, my crawler mimics the input interface of an Internet browser: the user just has to enter the URL to be crawled in the navigation bar and click "Go".

The crawler has a URL queue that is equivalent to the URL server in any large scale search engine. The crawler works with multiple threads to fetch URLs from the crawler queue. Then the retrieved pages are saved in a storage area as shown in the figure.

The fetched URLs are requested from the Web using C# sockets, to avoid the locking problems found in other C# libraries. The retrieved pages are parsed to extract new URL references, which are put back into the crawler queue, again up to a certain depth defined in the Settings.
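
To make the flow concrete, the work of one crawler thread can be pictured roughly as in the following sketch. This is not the article's actual code: crawling, DownloadPage, SavePage, ExtractLinks, and the MyUri(string) constructor are placeholder names for the steps described above (the queue helpers appear later in the article).

// Rough per-thread loop; helper names are placeholders for the steps described above
while(crawling)
{
    MyUri uri = DequeueUri();               // take the next URL from the crawler queue
    if(uri == null)
    {
        Thread.Sleep(1000);                 // queue empty: wait and try again
        continue;
    }
    string page = DownloadPage(uri);        // request the page over a raw socket
    SavePage(uri, page);                    // store it in the download folder
    foreach(string link in ExtractLinks(page))
        EnqueueUri(new MyUri(link));        // new references go back into the queue
}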

In the next sections, I will describe the program views, and discuss some technical points related to the interface.

Crawler Views

My simple crawler contains three views that let you follow the crawling process, check request details, and view crawling errors.

Threads view

Threads view is just a window that displays all the threads' activity to the user. Each thread takes a URI from the URIs queue and starts connection processing to download the URI object, as shown in the figure.

Requests view

Requests view displays a list of the recent requests downloaded by the threads, as in the following figure:

This view enables you to watch each request header, like:

GET / HTTP/1.0
Host: www.cnn.com
Connection: Keep-Alive

You can watch each response header, like:

HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:39:05 GMT
Content-Length: 65730
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:40:05 GMT
Cache-Control: max-age=60, private
Connection: keep-alive
Proxy-Connection: keep-alive
Server: Apache
Last-Modified: Sun, 19 Mar 2006 19:38:58 GMT
Vary: Accept-Encoding,User-Agent
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)

And a list of the URLs found in the downloaded page is available:

Parsing page ...
Found: 356 ref(s)
http://www.cnn.com/
http://www.cnn.com/search/
http://www.cnn.com/linkto/intl.html
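
The parsing code itself is not shown here; as a rough illustration, references can be pulled out of the downloaded HTML with a simple regular expression like the following sketch (a real parser also has to handle relative links, quoting, and malformed markup; pageText is a placeholder variable):

// Illustrative sketch only: extract href values from the downloaded page text.
// Requires System.Text.RegularExpressions.
MatchCollection matches = Regex.Matches(pageText,
    "href\\s*=\\s*[\"']?([^\"'\\s>]+)", RegexOptions.IgnoreCase);
foreach(Match match in matches)
    Console.WriteLine(match.Groups[1].Value);   // one candidate reference per match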

Crawler Settings

Crawler settings are not complicated; they are options selected from many working crawlers on the market, such as supported MIME types, the download folder, the number of working threads, and so on.

MIME types

MIME types are the content types that the crawler is allowed to download, and the crawler includes a set of default types. The user can add, edit, and delete MIME types, or choose to allow all MIME types, as in the following figure:
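
Independently of the figure, a rough sketch of how such a filter might be applied is shown below; the variable names are illustrative assumptions, not the article's code.

// Hypothetical MIME filter; variable names are illustrative only
bool allowAllMimeTypes = false;                          // the "allow all MIME types" option
string[] allowedMimeTypes = { "text/html", "text/plain" };

string contentType = "text/html; charset=iso-8859-1";    // value of the Content-Type header
string mimeType = contentType.Split(';')[0].Trim().ToLower();
bool allowed = allowAllMimeTypes ||
    Array.IndexOf(allowedMimeTypes, mimeType) >= 0;      // download only if allowed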

Output

Output settings include the download folder, and the number of requests to keep in the requests view for reviewing request details.

Connections

Connection settings contain the following (a minimal sketch of a corresponding settings object follows the list):

  • Thread count: the number of concurrent working threads in the crawler.
  • Thread sleep time when the refs queue is empty: the time that each thread sleeps when the refs queue is empty.
  • Thread sleep time between two connections: the time that each thread sleeps after handling any request, which is a very important value to prevent hosts from blocking the crawler due to heavy loads.
  • Connection timeout: represents the send and receive timeout for all crawler sockets.
  • Navigate through pages to a depth of: represents the depth of navigation in the crawling process.
  • Keep same URL server: limits the crawling process to the same host as the original URL.
  • Keep connection alive: keeps the socket connection open for subsequent requests, to avoid reconnection time.
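
A minimal sketch of how these options might be held and applied follows; the field names and default values are assumptions for illustration, except the 10-thread default mentioned later in the article.

// Hypothetical settings holder; names and defaults are illustrative only
class ConnectionSettings
{
    public int ThreadCount = 10;                // concurrent working threads
    public int SleepQueueEmpty = 2000;          // ms to sleep when the refs queue is empty
    public int SleepBetweenConnections = 1000;  // ms to sleep between two connections
    public int ConnectionTimeout = 30000;       // ms, send/receive timeout for sockets
    public int NavigationDepth = 3;             // "Navigate through pages to a depth of"
    public bool KeepSameUrlServer = false;      // limit crawling to the original host
    public bool KeepConnectionAlive = true;     // reuse sockets between requests
}

// Applying the timeout to a crawler socket (sketch):
// socket.SendTimeout    = settings.ConnectionTimeout;
// socket.ReceiveTimeout = settings.ConnectionTimeout;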

Advanced

The advanced settings contain:

  • Code page used to encode the downloaded text pages.
  • A user-defined list of restricted words, to let the user filter out unwanted pages.
  • A user-defined list of restricted host extensions, to avoid being blocked by those hosts.
  • A user-defined list of restricted file extensions, to avoid parsing non-text data.

Points of Interest

  1. Keep Alive connection:

    Keep-Alive is a request from the client to the server to keep the connection open after the response is finished, for use by subsequent requests. That is done by adding an HTTP header to the request sent to the server, as in the following request:

    GET /CNN/Programs/nancy.grace/ HTTP/1.0
    Host: www.cnn.com
    Connection: Keep-Alive

    The "Connection: Keep-Alive" tells the server to not close the connection, but the server has the option to keep it opened or close it, but it should reply to the client socket regarding its decision. So the server can keep telling the client that it will keep it open, by including "Connection: Keep-Alive" in its reply, as follows:

    HTTP/1.0 200 OK
    Date: Sun, 19 Mar 2006 19:38:15 GMT
    Content-Length: 29025
    Content-Type: text/html
    Expires: Sun, 19 Mar 2006 19:39:15 GMT
    Cache-Control: max-age=60, private
    Connection: keep-alive
    Proxy-Connection: keep-alive
    Server: Apache
    Vary: Accept-Encoding,User-Agent
    Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
    Via: 1.1 webcache (NetCache NetApp/6.0.1P3)

    Or it can tell the client that it refuses, as follows:

    HTTP/1.0 200 OK
    Date: Sun, 19 Mar 2006 19:38:15 GMT
    Content-Length: 29025
    Content-Type: text/html
    Expires: Sun, 19 Mar 2006 19:39:15 GMT
    Cache-Control: max-age=60, private
    Connection: Close
    Server: Apache
    Vary: Accept-Encoding,User-Agent
    Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
    Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
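
    As a small sketch of how the client side can honor that decision (assumed code, reusing the Header and socket names from the code shown later in point 2), the reply header can be checked before deciding whether to reuse the socket:

    // Sketch: reuse the connection only if the server agreed to keep it alive
    bool serverKeepsAlive =
        Header.IndexOf("Connection: Keep-Alive", StringComparison.OrdinalIgnoreCase) >= 0;
    if(!serverKeepsAlive)
        socket.Close();    // the server replied "Connection: Close", so reconnect next time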
  2. WebRequest and WebResponse problems:

    When I started the code for this article, I was using the WebRequest and WebResponse classes, as in the following code:

    WebRequest request = WebRequest.Create(uri);
    WebResponse response = request.GetResponse();
    Stream streamIn = response.GetResponseStream();
    BinaryReader reader = new BinaryReader(streamIn, TextEncoding);
    byte[] RecvBuffer = new byte[10240];
    int nBytes, nTotalBytes = 0;
    while((nBytes = reader.Read(RecvBuffer, 0, 10240)) > 0)
    {
        nTotalBytes += nBytes;
        ...
    }
    reader.Close();
    streamIn.Close();
    response.Close();

    This code works well, but it has a very serious problem: the WebRequest function GetResponse locks access for all other calls to WebRequest.GetResponse until the retrieved response is closed, as in the last line of the previous code. So I noticed that only one thread was ever downloading while the others waited in GetResponse. To solve this serious problem, I implemented two classes of my own, MyWebRequest and MyWebResponse.

    MyWebRequest and MyWebResponse use the Socket class to manage connections. They are similar to WebRequest and WebResponse, but they support concurrent responses. In addition, MyWebRequest has a built-in flag, KeepAlive, to support Keep-Alive connections.

    So, my new code would be like:

    request = MyWebRequest.Create(uri, request/*to Keep-Alive*/, KeepAlive);
    MyWebResponse response = request.GetResponse();
    byte[] RecvBuffer = new byte[10240];
    int nBytes, nTotalBytes = 0;
    while((nBytes = response.socket.Receive(RecvBuffer, 0,
        10240, SocketFlags.None)) > 0)
    {
        nTotalBytes += nBytes;
        ...
        if(response.KeepAlive && nTotalBytes >= response.ContentLength
            && response.ContentLength > 0)
            break;
    }
    if(response.KeepAlive == false)
        response.Close();

    Just replace GetResponseStream with direct access to the socket member of the MyWebResponse class. To make the socket's next read start right after the reply header, I used a simple trick: the header is read one byte at a time until its end is detected, as in the following code:

    /* reading response header */
    Header = "";
    byte[] bytes = new byte[10];
    while(socket.Receive(bytes, 0, 1, SocketFlags.None) > 0)
    {
        Header += Encoding.ASCII.GetString(bytes, 0, 1);
        if(bytes[0] == '\n' && Header.EndsWith("\r\n\r\n"))
            break;
    }

    So, the user of the MyWebResponse class will just continue receiving from the first position of the page.
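
    Since the receive loop above compares against response.ContentLength, a hypothetical way to obtain that value from the header string just read could look like this (not the article's exact code):

    // Sketch: pull Content-Length out of the raw header text
    int contentLength = 0;
    foreach(string line in Header.Split('\n'))
    {
        if(line.StartsWith("Content-Length:", StringComparison.OrdinalIgnoreCase))
        {
            contentLength = int.Parse(line.Substring("Content-Length:".Length).Trim());
            break;
        }
    }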

  3. Thread management:

    The number of threads in the crawler is user-defined through the settings. Its default value is 10 threads, but it can be changed from the Settings tab, Connections. The crawler code handles this change using the ThreadCount property, as in the following code:

    // number of running threads
    private int nThreadCount;
    private int ThreadCount
    {
        get { return nThreadCount; }
        set
        {
            Monitor.Enter(this.listViewThreads);
            try
            {
                for(int nIndex = 0; nIndex < value; nIndex++)
                {
                    // check if the thread is not created or not suspended
                    if(threadsRun[nIndex] == null ||
                        threadsRun[nIndex].ThreadState != ThreadState.Suspended)
                    {
                        // create a new thread
                        threadsRun[nIndex] = new Thread(new ThreadStart(ThreadRunFunction));
                        // set the thread name equal to its index
                        threadsRun[nIndex].Name = nIndex.ToString();
                        // start the thread working function
                        threadsRun[nIndex].Start();
                        // check if the thread hasn't been added to the view yet
                        if(nIndex == this.listViewThreads.Items.Count)
                        {
                            // add a new line in the view for the new thread
                            ListViewItem item = this.listViewThreads.Items.Add(
                                (nIndex+1).ToString(), 0);
                            string[] subItems = { "", "", "", "0", "0%" };
                            item.SubItems.AddRange(subItems);
                        }
                    }
                    // check if the thread is suspended
                    else if(threadsRun[nIndex].ThreadState == ThreadState.Suspended)
                    {
                        // get the thread item from the list
                        ListViewItem item = this.listViewThreads.Items[nIndex];
                        item.ImageIndex = 1;
                        item.SubItems[2].Text = "Resume";
                        // resume the thread
                        threadsRun[nIndex].Resume();
                    }
                }
                // change the thread count value
                nThreadCount = value;
            }
            catch(Exception)
            {
            }
            Monitor.Exit(this.listViewThreads);
        }
    }

    If ThreadCount is increased by the user, the code creates new threads or resumes suspended ones. Otherwise, the system leaves it to the extra working threads to suspend themselves, as follows: each working thread has a name equal to its index in the thread array, and if that index value is greater than ThreadCount, the thread finishes its current job and goes into suspension.
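
    A minimal sketch of that self-suspension check inside the thread function might look like the following (assumed code; the article does not list ThreadRunFunction here):

    // Sketch: each thread reads its index from its name and parks itself if surplus
    int nIndex = int.Parse(Thread.CurrentThread.Name);
    if(nIndex >= ThreadCount)      // assuming indexes start at 0
    {
        // finish the current job first, then suspend until ThreadCount grows again
        Thread.CurrentThread.Suspend();
    }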

  4. Crawling depth:

    This is the depth to which the crawler goes in the navigation process. Each URL has a depth equal to its parent's depth plus one, with a depth of 0 for the first URL inserted by the user. URLs fetched from any page are inserted at the end of the URL queue, which gives a "first in, first out" operation, and all threads can insert into the queue at any time, as shown in the following code:

    void EnqueueUri(MyUri uri)
    {
        Monitor.Enter(queueURLS);
        try
        {
            queueURLS.Enqueue(uri);
        }
        catch(Exception)
        {
        }
        Monitor.Exit(queueURLS);
    }

    And each thread can retrieve the first URL in the queue to request it, as shown in the following code:

    MyUri DequeueUri()
    {
        Monitor.Enter(queueURLS);
        MyUri uri = null;
        try
        {
            uri = (MyUri)queueURLS.Dequeue();
        }
        catch(Exception)
        {
        }
        Monitor.Exit(queueURLS);
        return uri;
    }
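
    A hypothetical sketch of how the depth limit could be applied when queuing a found reference (the Depth property, the MyUri(string) constructor, and the MaxDepth name are assumptions based on the description above):

    // Sketch: queue a found link only while the depth limit is not exceeded
    MyUri child = new MyUri(link);      // link found in the parent page
    child.Depth = parent.Depth + 1;     // child depth = parent depth + 1
    if(child.Depth <= MaxDepth)         // "Navigate through pages to a depth of"
        EnqueueUri(child);              // appended at the end: first in, first out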

References

  1. Web crawler from Wikipedia, the free encyclopedia.
  2. RFC766.

Thanks to...

Thank God!

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here


鲜花

握手

雷人

路过

鸡蛋
该文章已有0人参与评论

请发表评论

全部评论

专题导读
上一篇:
C#操作XML方法详解发布时间:2022-07-13
下一篇:
c#内部类的使用发布时间:2022-07-13
热门推荐
热门话题
阅读排行榜

扫描微信二维码

查看手机版网站

随时了解更新最新资讯

139-2527-9053

在线客服(服务时间 9:00~18:00)

在线QQ客服
地址:深圳市南山区西丽大学城创智工业园
电邮:jeky_zhao#qq.com
移动电话:139-2527-9053

Powered by 互联科技 X3.4© 2001-2213 极客世界.|Sitemap