ACCEPTED CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION 2001
Rapid Object Detection using a Boosted Cascade of Simple Features
Paul Viola, Mitsubishi Electric Research Labs, 201 Broadway, 8th FL, Cambridge, MA 02139, [email protected]
Michael Jones, Compaq CRL, One Cambridge Center, Cambridge, MA 02142, [email protected]
Abstract
This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers [6]. The third contribution is a method for combining increasingly more complex classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object-specific focus-of-attention mechanism which, unlike previous approaches, provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.
1. Introduction
This paper brings together new algorithms and insights to construct a framework for robust and extremely rapid object detection. This framework is demonstrated on, and in part motivated by, the task of face detection. Toward this end we have constructed a frontal face detection system which achieves detection and false positive rates which are equivalent to the best published results [16, 12, 15, 11, 1]. This face detection system is most clearly distinguished from previous approaches in its ability to detect faces extremely rapidly. Operating on 384 by 288 pixel images, faces are detected at 15 frames per second on a conventional 700 MHz Intel Pentium III. In other face detection systems, auxiliary information, such as image differences in video sequences, or pixel color in color images, has been used to achieve high frame rates. Our system achieves high frame rates working only with the information present in a single grey scale image. These alternative sources of information can also be integrated with our system to achieve even higher frame rates.
There are three main contributions of our object detection framework. We will introduce each of these ideas briefly below and then describe them in detail in subsequent sections.
The first contribution of this paper is a new image representation called an integral image that allows for very fast feature evaluation. Motivated in part by the work of Papageorgiou et al. our detection system does not work directly with image intensities [10]. Like these authors we use a set of features which are reminiscent of Haar basis functions (though we will also use related filters which are more complex than Haar filters). In order to compute these features very rapidly at many scales we introduce the integral image representation for images. The integral image can be computed from an image using a few operations per pixel. Once computed, any one of these Haar-like features can be computed at any scale or location in constant time.
The second contribution of this paper is a method for constructing a classifier by selecting a small number of important features using AdaBoost [6]. Within any image sub-window the total number of Haar-like features is very large, far larger than the number of pixels. In order to ensure fast classification, the learning process must exclude a large majority of the available features, and focus on a small set of critical features. Motivated by the work of Tieu and Viola, feature selection is achieved through a simple modification of the AdaBoost procedure: the weak learner is constrained so that each weak classifier returned can depend on only a single feature [2]. As a result each stage of the boosting process, which selects a new weak classifier, can be viewed as a feature selection process. AdaBoost provides an effective learning algorithm and strong bounds on generalization performance [13, 9, 10].
The third major contribution of this paper is a method for combining successively more complex classifiers in a cascade structure which dramatically increases the speed of the detector by focusing attention on promising regions of the image. The notion behind focus of attention approaches is that it is often possible to rapidly determine where in an image an object might occur [17, 8, 1]. More complex processing is reserved only for these promising regions. The key measure of such an approach is the “false negative” rate of the attentional process. It must be the case that all, or almost all, object instances are selected by the attentional filter.
We will describe a process for training an extremely simple and efficient classifier which can be used as a “supervised” focus of attention operator. The term supervised refers to the fact that the attentional operator is trained to detect examples of a particular class. In the domain of face detection it is possible to achieve fewer than 1% false negatives and 40% false positives using a classifier constructed from two Haar-like features. The effect of this filter is to reduce by over one half the number of locations where the final detector must be evaluated.
Those sub-windows which are not rejected by the initial classifier are processed by a sequence of classifiers, each slightly more complex than the last. If any classifier rejects the sub-window, no further processing is performed. The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Geman and colleagues [1, 4].
An extremely fast face detector will have broad practical applications. These include user interfaces, image databases, and teleconferencing. In applications where rapid frame rates are not necessary, our system will allow for significant additional post-processing and analysis. In addition our system can be implemented on a wide range of small low power devices, including hand-helds and embedded processors. In our lab we have implemented this face detector on the Compaq iPaq handheld and have achieved detection at two frames per second (this device has a low power 200 MIPS StrongARM processor which lacks floating point hardware).
The remainder of the paper describes our contributions and a number of experimental results, including a detailed description of our experimental methodology. Discussion of closely related work takes place at the end of each section.
2. Features
Our object detection procedure classifies images based on the value of simple features. There are many motivations for using features rather than the pixels directly. The most common reason is that features can act to encode ad-hoc domain knowledge that is difficult to learn using a finite quantity of training data. For this system there is also a second critical motivation for features: the feature-based system operates much faster than a pixel-based system.
The simple features used are reminiscent of Haar basis functions which have been used by Papageorgiou et al. [10]. More specifically, we use three kinds of features. The value of a two-rectangle feature is the difference between the sum of the pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent (see Figure 1). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature computes the difference between diagonal pairs of rectangles.
Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete (see footnote 1).
Figure 1: Example rectangle features shown relative to the enclosing detection window. The sum of the pixels which lie within the white rectangles is subtracted from the sum of pixels in the grey rectangles. Two-rectangle features are shown in (A) and (B). Figure (C) shows a three-rectangle feature, and (D) a four-rectangle feature.
2.1. Integral Image
Rectangle features can be computed very rapidly using an intermediate representation for the image which we call the integral image (see footnote 2). The integral image at location $x, y$ contains the sum of the pixels above and to the left of $x, y$, inclusive:
$$ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y')$$
Footnote 1: A complete basis has no linear dependence between basis elements and has the same number of elements as the image space, in this case 576. The full set of over 180,000 features is many times overcomplete.
Footnote 2: There is a close relation to “summed area tables” as used in graphics [3]. We choose a different name here in order to emphasize its use for the analysis of images, rather than for texture mapping.
Figure 2: The sum of the pixels within rectangle D can be computed with four array references. The value of the integral image at location 1 is the sum of the pixels in rectangle A. The value at location 2 is A+B, at location 3 is A+C, and at location 4 is A+B+C+D. The sum within D can be computed as 4+1-(2+3).
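where $ii(x, y)$ is the integral image and $i(x, y)$ is the original image. Using the following pair of recurrences:
$$s(x, y) = s(x, y-1) + i(x, y)$$
$$ii(x, y) = ii(x-1, y) + s(x, y)$$
(where $s(x, y)$ is the cumulative row sum, $s(x, -1) = 0$, and $ii(-1, y) = 0$) the integral image can be computed in one pass over the original image.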
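As an illustration, the one-pass computation described here can be sketched in plain Python. The function name `integral_image` and the list-of-rows image layout (`img[y][x]`) are assumptions made for this sketch, not part of the paper.

```python
def integral_image(img):
    """Return ii with ii[y][x] = sum of img[y2][x2] for all y2 <= y and x2 <= x.

    img is a list of rows (img[y][x]); this layout is an assumption of the sketch.
    """
    height = len(img)
    width = len(img[0]) if height else 0
    ii = [[0] * width for _ in range(height)]
    for x in range(width):
        s = 0  # cumulative sum s(x, y), starting from s(x, -1) = 0
        for y in range(height):
            s += img[y][x]                         # s(x, y) = s(x, y-1) + i(x, y)
            left = ii[y][x - 1] if x > 0 else 0    # boundary: ii(-1, y) = 0
            ii[y][x] = left + s                    # ii(x, y) = ii(x-1, y) + s(x, y)
    return ii


image = [[1, 2, 3],
         [4, 5, 6]]
print(integral_image(image))  # [[1, 3, 6], [5, 12, 21]]
```

Each pixel is visited once and costs two additions, which matches the claim that only a few operations per pixel are needed.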
Using the integral image any rectangular sum can be computed in four array references (see Figure 2). Clearly the difference between two rectangular sums can be computed in eight references. Since the two-rectangle features defined above involve adjacent rectangular sums they can be computed in six array references, eight in the case of the three-rectangle features, and nine for four-rectangle features.
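The reference counts above can be made concrete with a small sketch. It assumes a zero-padded integral image (one extra row and column of zeros) so that no boundary checks are needed; the names `padded_integral`, `rect_sum` and `two_rect_feature`, and the choice of "left minus right" as the sign convention, are illustrative assumptions.

```python
def padded_integral(img):
    """ii[y][x] = sum of img rows < y and columns < x (row 0 and column 0 are zero)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum over the rectangle with top-left (x, y), width w, height h: four references."""
    return ii[y + h][x + w] + ii[y][x] - ii[y][x + w] - ii[y + h][x]


def two_rect_feature(ii, x, y, w, h):
    """Left w-by-h rectangle minus the adjacent right one, using six references.

    The two rectangles share the corners on their common edge, so those two
    references are reused (each with weight 2) instead of being read twice.
    """
    a, b, c = ii[y][x], ii[y][x + w], ii[y][x + 2 * w]
    d, e, f = ii[y + h][x], ii[y + h][x + w], ii[y + h][x + 2 * w]
    return a - 2 * b + c - d + 2 * e - f


img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
ii = padded_integral(img)
print(rect_sum(ii, 1, 1, 2, 2))          # 34  (6 + 7 + 10 + 11)
print(two_rect_feature(ii, 0, 0, 2, 4))  # -16 (60 - 76)
```

The cost of `rect_sum` does not depend on the rectangle size, which is why a feature can be evaluated at any scale in constant time.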
2.2. Feature Discussion
Rectangle features are somewhat primitive when compared with alternatives such as steerable filters [5, 7]. Steerable filters, and their relatives, are excellent for the detailed analysis of boundaries, image compression, and texture analysis. In contrast rectangle features, while sensitive to the presence of edges, bars, and other simple image structure, are quite coarse. Unlike steerable filters the only orientations available are vertical, horizontal, and diagonal. The set of rectangle features does, however, provide a rich image representation which supports effective learning. In conjunction with the integral image, the efficiency of the rectangle feature set provides ample compensation for their limited flexibility.
3. Learning Classification Functions
Given a feature set and a training set of positive and negative images, any number of machine learning approaches could be used to learn a classification function. In our system a variant of AdaBoost is used both to select a small set of features and train the classifier [6]. In its original form, the AdaBoost learning algorithm is used to boost the classification performance of a simple (sometimes called weak) learning algorithm. There are a number of formal guarantees provided by the AdaBoost learning procedure. Freund and Schapire proved that the training error of the strong classifier approaches zero exponentially in the number of rounds. More importantly a number of results were later proved about generalization performance [14]. The key insight is that generalization performance is related to the margin of the examples, and that AdaBoost achieves large margins rapidly.
Recall that there are over 180,000 rectangle features associated with each image sub-window, a number far larger than the number of pixels. Even though each feature can be computed very efficiently, computing the complete set is prohibitively expensive. Our hypothesis, which is borne out by experiment, is that a very small number of these features can be combined to form an effective classifier. The main challenge is to find these features.
In support of this goal, the weak learning algorithm is designed to select the single rectangle feature which best separates the positive and negative examples (this is similar to the approach of [2] in the domain of image database retrieval). For each feature, the weak learner determines the optimal threshold classification function, such that the minimum number of examples are misclassified. A weak classifier $h_j(x)$ thus consists of a feature $f_j$, a threshold $\theta_j$ and a parity $p_j$ indicating the direction of the inequality sign:
$$h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \theta_j \\ 0 & \text{otherwise} \end{cases}$$
Here $x$ is a 24x24 pixel sub-window of an image. See Table 1 for a summary of the boosting process.
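A minimal sketch of such a single-feature weak learner follows. It assumes the feature has already been evaluated on every training example, and it uses a simple exhaustive search over candidate thresholds; the search strategy and the names `weak_classify` and `train_stump` are illustrative choices, not taken from the paper.

```python
def weak_classify(f_value, theta, parity):
    """Return 1 if parity * f(x) < parity * theta, else 0."""
    return 1 if parity * f_value < parity * theta else 0


def train_stump(values, labels, weights):
    """Find (weighted_error, theta, parity) with minimum weighted error for one feature.

    values[i] is the feature value on example i, labels[i] is 0 or 1,
    weights[i] is the example weight. Exhaustive search, for clarity only.
    """
    distinct = sorted(set(values))
    # Candidate thresholds: below the minimum, between neighbours, above the maximum.
    thresholds = [distinct[0] - 1.0]
    thresholds += [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    thresholds.append(distinct[-1] + 1.0)

    best = (float("inf"), None, None)
    for theta in thresholds:
        for parity in (1, -1):
            error = sum(w for v, y, w in zip(values, labels, weights)
                        if weak_classify(v, theta, parity) != y)
            if error < best[0]:
                best = (error, theta, parity)
    return best


values = [1.0, 2.0, 3.0, 8.0, 9.0]
labels = [1, 1, 1, 0, 0]
weights = [0.2] * 5
print(train_stump(values, labels, weights))  # (0, 5.5, 1)
```

The parity lets the same feature vote for the positive class either when its value is small (parity 1) or when it is large (parity -1).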
In practice no single feature can perform the classification task with low error. Features which are selected in early rounds of the boosting process had error rates between 0.1 and 0.3. Features selected in later rounds, as the task becomes more difficult, yield error rates between 0.4 and 0.5.
3.1. Learning Discussion
Many general feature selection procedures have been proposed (see chapter 8 of [18] for a review). Our final application demanded a very aggressive approach which would discard the vast majority of features. For a similar recognition problem Papageorgiou et al. proposed a scheme for feature selection based on feature variance [10]. They demonstrated good results selecting 37 features out of a total of 1734 features.
Roth et al. propose a feature selection process based on the Winnow exponential perceptron learning rule [11]. The Winnow learning process converges to a solution where many of these weights are zero. Nevertheless a very large number of features are retained (perhaps a few hundred or thousand).
Table 1: The AdaBoost algorithm for classifier learning. Each round of boosting selects one feature from the 180,000 potential features.
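Since only the caption of Table 1 survives here, the following is a hedged sketch of a boosting loop in which each round selects one feature. It follows the standard discrete AdaBoost reweighting with 0/1 labels and a final threshold at half the total vote; the initial class-balanced weights, the clamp on the error, and all function names are assumptions of this sketch rather than a transcription of the table.

```python
import math


def stump_predict(value, theta, parity):
    return 1 if parity * value < parity * theta else 0


def best_stump(column, labels, weights):
    """Exhaustive search for the lowest weighted-error threshold and parity."""
    vals = sorted(set(column))
    thetas = ([vals[0] - 1.0]
              + [(a + b) / 2.0 for a, b in zip(vals, vals[1:])]
              + [vals[-1] + 1.0])
    best = (float("inf"), 0.0, 1)
    for theta in thetas:
        for parity in (1, -1):
            err = sum(w for v, y, w in zip(column, labels, weights)
                      if stump_predict(v, theta, parity) != y)
            if err < best[0]:
                best = (err, theta, parity)
    return best


def adaboost(features, labels, rounds):
    """features[j][i] = value of feature j on example i; labels[i] in {0, 1}.

    Returns a list of (alpha, feature_index, theta, parity), one per round.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Assumed initialisation: each class carries half of the total weight.
    weights = [1.0 / (2 * n_pos) if y == 1 else 1.0 / (2 * n_neg) for y in labels]
    strong = []
    for _ in range(rounds):
        total = sum(weights)
        weights = [w / total for w in weights]          # normalise to a distribution
        err, j, theta, parity = float("inf"), -1, 0.0, 1
        for index, column in enumerate(features):       # one stump per feature
            e, t, p = best_stump(column, labels, weights)
            if e < err:
                err, j, theta, parity = e, index, t, p
        eps = min(max(err, 1e-10), 1 - 1e-10)            # clamp to avoid log(0)
        beta = eps / (1.0 - eps)
        # Down-weight the examples this round classified correctly.
        weights = [w * beta if stump_predict(v, theta, parity) == y else w
                   for w, v, y in zip(weights, features[j], labels)]
        strong.append((math.log(1.0 / beta), j, theta, parity))
    return strong


def strong_classify(strong, x):
    """x[j] is the value of feature j on one sub-window."""
    score = sum(a * stump_predict(x[j], t, p) for a, j, t, p in strong)
    return 1 if score >= 0.5 * sum(a for a, _, _, _ in strong) else 0


features = [[1.0, 2.0, 3.0, 8.0, 9.0],   # feature 0 separates the classes
            [5.0, 1.0, 4.0, 2.0, 3.0]]   # feature 1 is uninformative
labels = [1, 1, 1, 0, 0]
model = adaboost(features, labels, rounds=3)
print([strong_classify(model, [features[0][i], features[1][i]]) for i in range(5)])
```

Because every weak classifier depends on exactly one feature, the list of selected feature indices in `model` doubles as the result of feature selection.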
3.2. Learning Results
While details on the training and performance of the final system are presented in Section 5, several simple results merit discussion. Initial experiments demonstrated that a frontal face classifier constructed from 200 features yields a detection rate of 95% with a false positive rate of 1 in 14084. These results are compelling, but not sufficient for many real-world tasks. In terms of computation, this classifier is probably faster than any other published system, requiring 0.7 seconds to scan a 384 by 288 pixel image. Unfortunately, the most straightforward technique for improving detection performance, adding features to the classifier, directly increases computation time.
For the task of face detection, the initial rectangle features selected by AdaBoost are meaningful and easily interpreted. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks (see Figure 3). This feature is relatively large in comparison with the detection sub-window, and should be somewhat insensitive to size and location of the face. The second feature selected relies on the property that the eyes are darker than the bridge of the nose.
Figure 3: The first and second features selected by AdaBoost. The two features are shown in the top row and then overlaid on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose.
4. The Attentional Cascade
This section describes an algorithm for constructing a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances (i.e. the threshold of a boosted classifier can be adjusted so that the false negative rate is close to zero). Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates.
The overall form of the detection process is that of a degenerate decision tree, what we call a “cascade” (see Figure 4). A positive result from the first classifier triggers the evaluation of a second classifier which has also been adjusted to achieve very high detection rates. A positive result from the second classifier triggers a third classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window.
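The control flow of such a cascade is short enough to sketch directly. The stages below are hypothetical toy tests on a flat list of pixel values, chosen only to show early rejection; in the detector each stage would be a boosted classifier over rectangle features.

```python
def run_cascade(stages, window):
    """Apply the stages in order and stop at the first rejection.

    Returns (accepted, stages_evaluated).
    """
    for index, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, index      # negative outcome: reject immediately
    return True, len(stages)         # every stage gave a positive result


# Hypothetical stages, cheapest first.
def stage_one(window):
    return max(window) - min(window) > 10            # some contrast is present


def stage_two(window):
    half = len(window) // 2
    return sum(window[:half]) < sum(window[half:])   # first half is darker


stages = [stage_one, stage_two]
print(run_cascade(stages, [50, 50, 50, 50]))   # (False, 1): rejected by the first stage
print(run_cascade(stages, [90, 80, 20, 10]))   # (False, 2): rejected by the second stage
print(run_cascade(stages, [10, 20, 80, 90]))   # (True, 2): passed every stage
```

The second return value makes the cost model visible: a window rejected early pays only for the stages it actually reached.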
Stages in the cascade are constructed by training classifiers using AdaBoost and then adjusting the threshold to minimize false negatives. Note that the default AdaBoost threshold is designed to yield a low error rate on the training data. In general a lower threshold yields higher detection rates and higher false positive rates.
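One simple way to perform this adjustment, sketched under the assumption that a stage outputs a real-valued score and accepts a window when the score is at least the threshold, is to pick the threshold from the scores of validation positives. The function names and the toy scores are made up for illustration.

```python
import math


def threshold_for_detection_rate(positive_scores, target_rate):
    """Largest threshold t such that at least target_rate of positives score >= t."""
    ordered = sorted(positive_scores, reverse=True)
    needed = math.ceil(target_rate * len(ordered))
    needed = max(1, min(needed, len(ordered)))
    return ordered[needed - 1]


def rates(positive_scores, negative_scores, threshold):
    """Return (detection_rate, false_positive_rate) at the given threshold."""
    detected = sum(s >= threshold for s in positive_scores)
    false_pos = sum(s >= threshold for s in negative_scores)
    return detected / len(positive_scores), false_pos / len(negative_scores)


# Hypothetical validation scores.
pos = [0.9, 0.8, 0.7, 0.4]
neg = [0.6, 0.5, 0.3, 0.2, 0.1]

t_all = threshold_for_detection_rate(pos, 1.0)
print(t_all, rates(pos, neg, t_all))        # 0.4 (1.0, 0.4)

t_some = threshold_for_detection_rate(pos, 0.75)
print(t_some, rates(pos, neg, t_some))      # 0.7 (0.75, 0.0)
```

Lowering the threshold from 0.7 to 0.4 raises the detection rate from 75% to 100% and the false positive rate from 0% to 40%, which is the trade-off described above.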
Figure 4: Schematic depiction of the detection cascade. A series of classifiers are applied to every sub-window. The initial classifier eliminates a large number of negative examples with very little processing. Subsequent layers eliminate additional negatives but require additional computation. After several stages of processing the number of sub-windows has been reduced radically. Further processing can take any form such as additional stages of the cascade (as in our detection system) or an alternative detection system.
For example an excellent first stage classifier can be constructed from a two-feature strong classifier by reducing the threshold to minimize false negatives. Measured against a validation training set, the threshold can be adjusted to detect 100% of the faces with a false positive rate of 40%. See Figure 3 for a description of the two features used in this classifier.
Computation of the two-feature classifier amounts to about 60 microprocessor instructions. It seems hard to imagine that any simpler filter could achieve higher rejection rates. By comparison, scanning a simple image template, or a single layer perceptron, would require at least 20 times as many operations per sub-window.
The structure of the cascade reflects the fact that within any single image an overwhelming majority of sub-windows are negative. As such, the cascade attempts to reject as many negatives as possible at the earliest stage possible. While a positive instance will trigger the evaluation of every classifier in the cascade, this is an exceedingly rare event.
Much like a decision tree, subsequent classifiers are trained using those examples which pass through all the previous stages. As a result, the second classifier faces a more difficult task than the first. The examples which make it through the first stage are “harder” than typical examples. The more difficult examples faced by deeper classifiers push the entire receiver operating characteristic (ROC) curve downward. At a given detection rate, deeper classifiers have correspondingly higher false positive rates.
4.1. Training a Cascade of Classifiers
The cascade training process involves two types of tradeoffs. In most cases classifiers with more features will achieve higher detection rates and lower false positive rates. At the same time classifiers with more features require more time to compute. In principle one could define an optimization framework in which: i) the number of classifier stages, ii) the number of features in each stage, and iii) the threshold of each stage, are traded off in order to minimize the expected number of evaluated features. Unfortunately finding this optimum is a tremendously difficult problem.
In practice a very simple framework is used to produce an effective classifier which is highly efficient. Each stage in the cascade reduces the false positive rate and decreases the detection rate. A target is selected for the minimum reduction in false positives and the maximum decrease in detection. Each stage is trained by adding features until the target detection and false positive rates are met (these rates are determined by testing the detector on a validation set). Stages are added until the overall target for false positive and detection rate is met.
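To see how per-stage targets add up to an overall target, the following worked arithmetic may help. It assumes that each stage's rates are measured on the windows that reach it, so that the overall rates are products of the per-stage rates; the per-stage values (50% false positives, 99.5% detection) and the overall target are hypothetical numbers chosen for illustration, not figures from the paper.

```python
def stages_needed(per_stage_fp, target_fp):
    """Count the stages required before the product of per-stage false
    positive rates drops to target_fp or below."""
    count, overall = 0, 1.0
    while overall > target_fp:
        overall *= per_stage_fp
        count += 1
    return count, overall


# Hypothetical targets: every stage passes 50% of negatives and 99.5% of faces.
n_stages, overall_fp = stages_needed(per_stage_fp=0.5, target_fp=1e-6)
overall_detection = 0.995 ** n_stages

print(n_stages)                       # 20
print(overall_fp)                     # 0.5 ** 20, just under 1e-6
print(round(overall_detection, 3))    # about 0.905
```

Under these assumed numbers, stages that individually look weak (each lets through half of the negatives) combine into a very low overall false positive rate, while the detection rate degrades only slowly because each stage keeps almost every face.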
4.2. Detector Cascade Discussion
The complete face detection cascade has 38 stages with over 6000 features. Nevertheless the cascade structure results in fast average detection times. On a difficult dataset, containing 507 faces and 75 million sub-windows, faces are detected using an average of 10 feature evaluations per sub-window. In comparison, this system is about 15 times faster than an implementation of the detection system constructed by Rowley et al. [12] (see footnote 3).
A notion similar to the cascade appears in the face detection system described by Rowley et al. in which two detection networks are used [12]. Rowley et al. used a faster yet less accurate network to prescreen the image in order to find candidate regions for a slower more accurate network. Though it is difficult to determine exactly, it appears that Rowley et al.’s two-network face system is the fastest existing face detector (see footnote 4).
The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Amit and Geman [1]. Unlike techniques which use a fixed detector, Amit and Geman propose an alternative point of view where unusual co-occurrences of simple image features are used to trigger the evaluation of a more complex detection process. In this way the full detection process need not be evaluated at many of the potential image locations and scales. While this basic insight is very valuable, in their implementation it is necessary to first evaluate some feature detector at every location. These features are then grouped to find unusual co-occurrences. In practice, since the form of our detector and the features that it uses are extremely efficient, the amortized cost of evaluating our detector at every scale and location is much lower than that of finding and grouping edges throughout the image.
In recent work Fleuret and Geman have presented a face detection technique which relies on a “chain” of tests in order to signify the presence of a face at a particular scale and location [4]. The image properties measured by Fleuret and Geman, disjunctions of fine scale edges, are quite different from rectangle features which are simple, exist at all scales, and are somewhat interpretable. The two approaches also differ radically in their learning philosophy. The motivation for Fleuret and Geman’s learning process is density estimation and density discrimination, while our detector is purely discriminative. Finally the false positive rate of Fleuret and Geman’s approach appears to be higher than that of previous approaches like Rowley et al. and this approach. Unfortunately the paper does not report quantitative results of this kind. The included example images each have between 2 and 10 false positives.
5. Results
A 38 layer cascaded classifier was trained to detect frontal upright faces. To train the detector, a set of face and non-face training images were used. The face training set consisted of 4916 hand labeled faces scaled and aligned to a base resolution of 24 by 24 pixels. The faces were extracted from images downloaded during a random crawl of the world wide web. Some typical face examples are shown in Figure 5. The non-face sub-windows used to train the detector come from 9544 images which were manually inspected and found to not contain any faces. There are about 350 million sub-windows within these non-face images.
The number of features in the first five layers of the detector is 1, 10, 25, 25 and 50 features respectively. The remaining layers have increasingly more features. The total number of features in all layers is 6061.
Each classifier in the cascade was trained with the 4916 training faces (plus their vertical mirror images for a total of 9832 training faces) and 10,000 non-face sub-windows (also of size 24 by 24 pixels) using the AdaBoost training procedure. For the initial one-feature classifier, the non-face training examples were collected by selecting random sub-windows from a set of 9544 images which did not contain faces. The non-face examples used to train subsequent layers were obtained by scanning the partial cascade across the non-face images and collecting false positives. A maximum of 10,000 such non-face sub-windows were collected for each layer.
Figure 5: Example of frontal upright face images used for training
Speed of the Final Detector
The speed of the cascaded detector is directly related to the number of features evaluated per scanned sub-window. Evaluated on the MIT+CMU test set [12], an average of 10 features out of a total of 6061 are evaluated per sub-window. This is possible because a large majority of sub-windows are rejected by the first or second layer in the cascade. On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about 0.067 seconds (using a starting scale of 1.25 and a step size of 1.5, described below). This is roughly 15 times faster than the Rowley-Baluja-Kanade detector [12] and about 600 times faster than the Schneiderman-Kanade detector [15].
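The early rejection that keeps the average feature count low can be sketched as follows. This is a minimal illustration, not the authors' implementation: the layer representation (a list of weighted 0/1 features plus a threshold), the function name `evaluate_cascade`, and the toy layers are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed shapes, for illustration only: a weak classifier is (feature_fn, weight),
# where feature_fn maps a sub-window to 0 or 1; a layer is (weak_classifiers, threshold).
Weak = Tuple[Callable[[Sequence[float]], int], float]
Layer = Tuple[List[Weak], float]


def evaluate_cascade(window: Sequence[float], layers: List[Layer]) -> Tuple[bool, int]:
    """Return (accepted, number_of_features_evaluated) for one sub-window."""
    evaluated = 0
    for weak_classifiers, threshold in layers:
        score = 0.0
        for feature_fn, weight in weak_classifiers:
            score += weight * feature_fn(window)
            evaluated += 1
        if score < threshold:
            # Rejected here: the later, larger layers are never evaluated.
            return False, evaluated
    return True, evaluated


# Toy cascade with 1 and 2 features per layer (hypothetical values).
layers = [
    ([(lambda w: int(w[0] > 0.5), 1.0)], 0.5),
    ([(lambda w: int(w[1] > 0.5), 1.0), (lambda w: int(w[2] > 0.5), 1.0)], 1.5),
]
```

A sub-window rejected by the first layer costs a single feature evaluation, which is why the average over all sub-windows can stay far below the total of 6061.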
Image Processing
All example sub-windows used for training were variance normalized to minimize the effect of different lighting conditions. Normalization is therefore necessary during detection as well. The variance of an image sub-window can be computed quickly using a pair of integral images. Recall that $\sigma^2 = \frac{1}{N}\sum x^2 - m^2$, where $\sigma$ is the standard deviation, $m$ is the mean, $x$ is the pixel value within the sub-window, and $N$ is the number of pixels in the sub-window. The mean of a sub-window can be computed using the integral image. The sum of squared pixels is computed using an integral image of the image squared (i.e. two integral images are used in the scanning process). During scanning the effect of image normalization can be achieved by post-multiplying the feature values rather than pre-multiplying the pixels.
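The two-integral-image computation of the sub-window variance can be sketched as below. This is an illustrative pure-Python version under assumed conventions: the integral image is padded with a leading row and column of zeros, and the helper names (`integral_image`, `rect_sum`, `window_std`) are hypothetical rather than taken from the paper.

```python
from math import sqrt
from typing import List

Image = List[List[float]]


def integral_image(img: Image) -> Image:
    """Padded integral image: ii[y][x] is the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: Image, x: int, y: int, w: int, h: int) -> float:
    """Sum over the rectangle with top-left (x, y), width w, height h, using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def window_std(ii: Image, ii_sq: Image, x: int, y: int, w: int, h: int) -> float:
    """Standard deviation of a sub-window from the plain and squared integral images."""
    n = w * h
    mean = rect_sum(ii, x, y, w, h) / n
    variance = rect_sum(ii_sq, x, y, w, h) / n - mean * mean
    return sqrt(max(variance, 0.0))  # clamp tiny negative values caused by rounding


img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
ii = integral_image(img)
ii_sq = integral_image([[p * p for p in row] for row in img])
```

Because a rectangle feature is a linear combination of pixel sums, dividing the feature value by this standard deviation has the same effect as scaling every pixel of the sub-window first, which is why the normalization can be applied after the feature is computed.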
Scanning the Detector
The final detector is scanned across the image at multiple scales and locations. Scaling is achieved by scaling the detector itself, rather than scaling the image. This process makes sense because the features can be evaluated at any scale with the same cost. Good results were obtained using a set of scales a factor of 1.25 apart.
The detector is also scanned across location. Subsequent locations are obtained by shifting the window some number of pixels Δ. This shifting process is affected by the scale of the detector: if the current scale is S, the window is shifted by [SΔ], where [] is the rounding operation.
The choice of Δ affects both the speed and the accuracy of the detector. The results we present are for Δ = 1.0. We can achieve a significant speedup by setting Δ = 1.5 with only a slight decrease in accuracy.
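The scanning procedure described above can be sketched as a generator of sub-windows. This is a rough illustration under several assumptions not stated in the text: the window size at each scale is the rounded product of the 24-pixel base and the scale, the shift is never allowed to fall below one pixel, and Python's built-in `round` stands in for the rounding operation (it rounds exact halves to the nearest even integer, which may differ from the rounding the authors used). The function name `scan_windows` is hypothetical.

```python
from typing import Iterator, Tuple


def scan_windows(width: int, height: int, base: int = 24, start_scale: float = 1.0,
                 scale_factor: float = 1.25, delta: float = 1.0
                 ) -> Iterator[Tuple[int, int, int]]:
    """Yield (x, y, size) for every sub-window position at every scale."""
    scale = start_scale
    while True:
        size = int(round(base * scale))  # the detector is scaled, not the image
        if size > width or size > height:
            break
        step = max(1, int(round(scale * delta)))  # shift of [S * delta] pixels
        for y in range(0, height - size + 1, step):
            for x in range(0, width - size + 1, step):
                yield x, y, size
        scale *= scale_factor


# Example: a 26 by 24 image only fits the base scale, at three horizontal offsets.
windows = list(scan_windows(26, 24))
```

Raising `delta` from 1.0 to 1.5 enlarges the shift at every scale and so reduces the number of sub-windows visited, which is the source of the speedup mentioned above.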
Integration of Multiple Detections
Since the final detector is insensitive to small changes in translation and scale, multiple detections will usually occur around each face in a scanned image. The same is often true of some types of false positives. In practice it often makes sense to return one final detection per face. Toward this end it is useful to postprocess the detected sub-windows in order to combine overlapping detections into a single detection.
In these experiments detections are combined in a very simple fashion. The set of detections is first partitioned into disjoint subsets. Two detections are in the same subset if their bounding regions overlap. Each partition yields a single final detection. The corners of the final bounding region are the average of the corners of all detections in the set.
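The merging step above can be sketched with a small union-find structure. This is one possible reading, not the authors' code: it assumes that the disjoint subsets are the connected components of the overlap relation (so two detections that do not overlap directly still merge if a third overlaps both), and the names `overlaps` and `merge_detections` are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # corners as (x1, y1, x2, y2)


def overlaps(a: Box, b: Box) -> bool:
    """True if the two bounding regions share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_detections(boxes: List[Box]) -> List[Box]:
    """Group overlapping detections into disjoint subsets and average each subset's corners."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    # Union every pair of detections whose bounding regions overlap.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    # One final detection per subset: the average of each corner coordinate.
    return [tuple(sum(b[k] for b in group) / len(group) for k in range(4))
            for group in groups.values()]
```

For example, `merge_detections([(0, 0, 10, 10), (2, 2, 12, 12), (50, 50, 60, 60)])` yields two boxes: the first two detections collapse into `(1.0, 1.0, 11.0, 11.0)` and the isolated third one is returned unchanged.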
Experiments on a Real-World Test Set
We tested our system on the MIT+CMU frontal face test set [12]. This set consists of 130 images with 507 labeled frontal faces. A ROC curve showing the performance of our detector on this test set is shown in Figure 6. To create the ROC curve the threshold of the final layer classifier is adjusted from -∞ to +∞. Adjusting the threshold to +∞ will yield a detection rate of 0.0 and a false positive rate of 0.0. Adjusting the threshold to -∞, however, increases both the detection rate and false positive rate, but only to a certain point. Neither rate can be higher than the rate of the detection cascade minus the final layer. In effect, a threshold of -∞ is equivalent to removing that layer. Further increasing the detection and false positive rates requires decreasing the threshold of the next classifier in the cascade. Thus, in order to construct a complete ROC curve, classifier layers are removed. We use the number of false positives as opposed to the rate of false positives for the x-axis of the ROC curve to facilitate comparison with other systems. To compute the false positive rate, simply divide by the total number of sub-windows scanned. In our experiments, the number of sub-windows scanned is 75,081,800.
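The conversion from a false positive count to a false positive rate is a single division, sketched below. The count of 50 is a hypothetical value chosen only to illustrate the arithmetic, and the function name `false_positive_rate` is an assumption.

```python
TOTAL_SUBWINDOWS = 75_081_800  # sub-windows scanned, as reported in the text


def false_positive_rate(num_false_positives: int, total: int = TOTAL_SUBWINDOWS) -> float:
    """Convert a false positive count into a per-sub-window rate."""
    return num_false_positives / total


# Hypothetical count, used only to show the conversion.
rate = false_positive_rate(50)
```

With these numbers the rate is about one false positive per 1.5 million sub-windows scanned.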
Unfortunately, most previously published results on face detection have only included a single operating regime (i.e. a single point on the ROC curve). To make comparison with our detector easier we have listed our detection rate for the false positive rates reported by the other systems. Table 2 lists the detection rate for various numbers of false detections for our system as well as other published systems. For the Rowley-Baluja-Kanade results [12], a number of different versions of their detector were tested, yielding a number of different results; they are all listed under the same heading. For the Roth-Yang-Ahuja detector [11], the authors reported their result on the MIT+CMU test set minus 5 images containing line-drawn faces.
Figure 6: ROC curve for our face detector on the MIT+CMU test set. The detector was run using a step size of 1.0 and starting scale of 1.0 (75,081,800 sub-windows scanned).
Figure 7 shows the output of our face detector on some test images from the MIT+CMU test set.
Figure 7: Output of our face detector on a number of test images from the MIT+CMU test set.
A simple voting scheme to further improve results
In Table 2 we also show results from running three detectors (the 38 layer one described above plus two similarly trained detectors) and outputting the majority vote of the three detectors. This improves the detection rate and also eliminates more false positives. The improvement would be greater if the detectors were more independent. Because their errors are correlated, the improvement over the best single detector is modest.