Deep Learning Working Group: R-CNN
Includes slides from Josef Sivic, Andrew Zisserman, and many others
Nicolas Gonthier
February 1, 2018
Recognition Tasks
- Image Classification: Does the image contain an aeroplane? (last lecture)
- Object Class Detection/Localization: Where are the aeroplanes (if any)?
- Object Class Segmentation: Which pixels are part of an aeroplane (if any)?
Classification vs. Detection
[Figure: example images labeled "Dog"]
Problem formulation
Classes: { airplane, bird, motorbike, person, sofa }
[Figure: input image and desired output, with detections labeled "person" and "motorbike"]
Region proposals: Selective Search
1. Merge the two most similar regions, based on the similarity measure S.
2. Update the similarities between the new region and its neighbors.
3. Go back to step 1 until the whole image is a single region.
[K. van de Sande, J. Uijlings, T. Gevers, and A. Smeulders, ICCV 2011]
Region proposals: Selective Search
Take the bounding boxes of all generated regions and treat them as possible object locations.
[K. van de Sande, J. Uijlings, T. Gevers, and A. Smeulders, ICCV 2011]
Region proposals: Selective Search [K. van de Sande, J. Uijlings, T. Gevers, and A. Smeulders, ICCV 2011]
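The grouping loop on the Selective Search slides can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the region representation (a dict with "box", "mean" and "size"), the function names, and above all the similarity S (negative difference of mean intensity) are assumptions made for the example. The similarities are recomputed on every pass instead of being cached.

```python
def merge_boxes(a, b):
    """Smallest box (x1, y1, x2, y2) enclosing boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def hierarchical_grouping(regions, neighbors):
    """Greedy bottom-up grouping.

    regions:   dict int id -> {"box": (x1, y1, x2, y2), "mean": float, "size": int}
    neighbors: iterable of (id_a, id_b) pairs of adjacent regions
    Returns the bounding boxes of all regions generated along the way.
    """
    regions = {k: dict(v) for k, v in regions.items()}
    neighbors = {frozenset(p) for p in neighbors}
    proposals = [r["box"] for r in regions.values()]
    next_id = max(regions) + 1

    def similarity(pair):
        # Toy stand-in for S (assumption): closer mean intensity = more similar.
        a, b = pair
        return -abs(regions[a]["mean"] - regions[b]["mean"])

    while neighbors:
        # Step 1: merge the two most similar neighbouring regions.
        best = max(neighbors, key=similarity)
        a, b = best
        ra, rb = regions.pop(a), regions.pop(b)
        size = ra["size"] + rb["size"]
        regions[next_id] = {
            "box": merge_boxes(ra["box"], rb["box"]),
            "mean": (ra["mean"] * ra["size"] + rb["mean"] * rb["size"]) / size,
            "size": size,
        }
        proposals.append(regions[next_id]["box"])
        # Step 2: the new region inherits the neighbours of both merged regions.
        updated = set()
        for pair in neighbors:
            if pair == best:
                continue
            if a in pair or b in pair:
                (other,) = pair - best
                updated.add(frozenset((next_id, other)))
            else:
                updated.add(pair)
        neighbors = updated
        next_id += 1
    # Step 3 is the loop itself: it stops when no neighbouring pair is left.
    return proposals
```

If the adjacency graph is connected, the last box returned covers the whole image; every intermediate box is kept as a candidate object location.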
Test: Non-maximum suppression (NMS)
Scanning-window detectors typically produce multiple responses for the same object.
[Figure: detection example, Conf = .9]
To remove multiple responses, a simple greedy procedure called non-maximum suppression is applied:
1. Sort all detections by detector confidence.
2. Choose the most confident detection d_i; remove all d_j such that overlap(d_i, d_j) > T.
3. Repeat step 2 on the remaining detections until none are left.
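The greedy NMS procedure can be sketched as follows. The slide does not say how overlap is measured; intersection-over-union (IoU) is assumed here, and the function names and the default threshold T = 0.5 are illustrative choices.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.5):
    """Return the indices of the detections kept by greedy NMS."""
    # Step 1: sort detections by confidence, most confident first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        # Step 2: keep the most confident remaining detection d_i ...
        best = order.pop(0)
        keep.append(best)
        # ... and drop every d_j whose overlap with d_i exceeds the threshold.
        order = [j for j in order if iou(boxes[best], boxes[j]) <= threshold]
    # Step 3: the loop repeats until no detection is left.
    return keep
```

In a multi-class detector this would typically be run once per class, so that overlapping detections of different classes do not suppress each other.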
Putting it together: R-CNN
Slide credit: Ross Girshick; Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 8, 1 Feb 2016
[Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR 2014]
Region-based Convolutional Networks (R-CNNs)
[Figure: mean Average Precision (mAP) by year, 2006 to 2015]
- 17%: DPM
- 23%: DPM, HOG + BOW
- 28%: DPM, MKL
- 37%: DPM++
- 41%: DPM++, MKL, Selective Search
- 41%: Selective Search, DPM++, MKL
- 53%: R-CNN v1
- 62%: R-CNN v2
[R-CNN. Girshick et al. CVPR 2014]
[Figure: mean Average Precision (mAP) by year, 2006 to 2015, extended to 76% with ResNet; the plot is annotated "~5 years" and "~1 year"]
R-CNN Problems
1. Slow at test time: a full forward pass of the CNN must be run for each region proposal.
2. SVMs and regressors are post-hoc: the CNN features are not updated in response to the SVMs and regressors.
3. Complex multistage training pipeline.
R-CNN Problem #1: Slow at test time due to independent forward passes of the CNN
Solution: Share computation of convolutional layers between proposals for an image
[Girshick, Fast R-CNN, ICCV 2015]
R-CNN Problem #2: Post-hoc training: CNN not updated in response to final classifiers and regressors
R-CNN Problem #3: Complex training pipeline
Solution: Just train the whole system end-to-end all at once!
Slide credit: Ross Girshick
Fast R-CNN: Region of Interest Pooling
- Hi-res input image: 3 x 800 x 600, with region proposal
- Convolution and pooling give hi-res conv features: C x H x W, with region proposal
- Max-pool within each grid cell of the proposal to get RoI conv features: C x h x w for the region proposal
- Fully-connected layers expect low-res conv features: C x h x w
Fast R-CNN: Region of Interest Pooling (continued)
RoI pooling maps the C x H x W conv features of a region proposal to C x h x w RoI features, and gradients can be backpropagated through it in the same way as through max pooling.
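A minimal sketch of RoI max pooling on plain nested lists, to make the C x H x W to C x h x w step concrete. It assumes the proposal has already been projected onto the feature map and is given in feature-map coordinates with exclusive upper bounds; the function name and the floor/ceil binning rule are choices made for this example, and other binning schemes exist.

```python
import math


def roi_max_pool(features, roi, out_h, out_w):
    """Max-pool one region of interest to a fixed out_h x out_w grid.

    features: nested list indexed as features[c][y][x] (C x H x W)
    roi:      (x1, y1, x2, y2) in feature-map coordinates, x2/y2 exclusive,
              assumed to be at least one cell wide and high
    Returns a nested list of shape C x out_h x out_w.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for channel in features:
        out = []
        for i in range(out_h):
            # Row range of grid cell i (floor/ceil keeps every cell non-empty).
            ys = y1 + math.floor(i * roi_h / out_h)
            ye = y1 + math.ceil((i + 1) * roi_h / out_h)
            row = []
            for j in range(out_w):
                xs = x1 + math.floor(j * roi_w / out_w)
                xe = x1 + math.ceil((j + 1) * roi_w / out_w)
                # Max-pool within the grid cell.
                row.append(max(channel[y][x]
                               for y in range(ys, ye)
                               for x in range(xs, xe)))
            out.append(row)
        pooled.append(out)
    return pooled
```

Because each output value is the maximum of a known set of inputs, the gradient flows back only to the input position that attained the maximum, exactly as in ordinary max pooling.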
Fast R-CNN Results
                        R-CNN        Fast R-CNN
Training time           84 hours     9.5 hours       Faster!
  (Speedup)             1x           8.8x
Test time per image     47 seconds   0.32 seconds    FASTER!
  (Speedup)             1x           146x
mAP (VOC 2007)          66.0         66.9            Better!
Using VGG-16 CNN on the Pascal VOC 2007 dataset
Fast R-CNN Problem: Test-time speeds don't include region proposals
                                            R-CNN        Fast R-CNN
Test time per image                         47 seconds   0.32 seconds
  (Speedup)                                 1x           146x
Test time per image with Selective Search   50 seconds   2 seconds
  (Speedup)                                 1x           25x
Solution: Just make the CNN do region proposals too!
Faster R-CNN:
Insert a Region Proposal Network (RPN) after the last convolutional layer.
The RPN is trained to produce region proposals directly; no need for external region proposals!
After the RPN, use RoI pooling and a downstream classifier and bbox regressor, just like Fast R-CNN.
[Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015]
Slide credit: Ross Girshick
Faster R-CNN: Region Proposal Network
Slide a small window on the feature map.
Build a small network for:
- classifying object or not-object, and
- regressing bbox locations
[Figure: network diagram with three 1 x 1 conv layers]
The position of the sliding window provides localization information with reference to the image.
Box regression provides finer localization information with reference to this sliding window.
Slide credit: Kaiming He
Faster R-CNN: Region Proposal Network
- Use N anchor boxes at each location.
- Anchors are translation invariant: use the same ones at every location.
- Regression gives offsets from anchor boxes.
- Classification gives the probability that each (regressed) anchor shows an object.
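A small sketch of the anchor mechanism. The function names and the (cx, cy, w, h) box format are assumptions for this example. The default scales and aspect ratios (3 x 3, so N = 9) and the offset parameterization (centre shifts scaled by the anchor size, log-space width and height) follow the Faster R-CNN paper, but implementations differ in their details.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return N = len(scales) * len(ratios) anchors (cx, cy, w, h) at one location."""
    anchors = []
    for s in scales:
        for r in ratios:
            # r is height / width; each anchor keeps an area of about s * s.
            w = s * math.sqrt(1.0 / r)
            h = s * math.sqrt(r)
            anchors.append((cx, cy, w, h))
    return anchors


def decode(anchor, offsets):
    """Apply regressed offsets (tx, ty, tw, th) to an anchor (cx, cy, w, h)."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = offsets
    return (xa + tx * wa,        # centre shift, in units of anchor width
            ya + ty * ha,        # centre shift, in units of anchor height
            wa * math.exp(tw),   # log-space width scaling
            ha * math.exp(th))   # log-space height scaling
```

Translation invariance shows up directly: `make_anchors` produces the same widths and heights at every location, and only the centre changes.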
Faster R-CNN: Training
In the paper:
- Use alternating optimization to train RPN, then Fast R-CNN with RPN proposals, etc.
- More complex than it has to be
Since publication: Joint training! One network, four losses:
- RPN classification (anchor good / bad)
- RPN regression (anchor -> proposal)
- Fast R-CNN classification (over classes)
- Fast R-CNN regression (proposal -> box)
Slide credit: Ross Girshick
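Written out, joint training minimizes a single objective that combines the four terms listed above. The symbol names and the unweighted sum are assumptions for illustration; in practice the terms may be given different weights.

```latex
L_{\text{total}} =
    L_{\text{RPN-cls}}   % anchor good / bad
  + L_{\text{RPN-reg}}   % anchor -> proposal
  + L_{\text{det-cls}}   % Fast R-CNN classification over classes
  + L_{\text{det-reg}}   % Fast R-CNN regression, proposal -> box
```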
Faster R-CNN: Results
                                       R-CNN        Fast R-CNN   Faster R-CNN
Test time per image (with proposals)   50 seconds   2 seconds    0.2 seconds
  (Speedup)                            1x           25x          250x
mAP (VOC 2007)                         66.0         66.9         66.9
Detection without proposals: YOLO / SSD
- Input image: 3 x H x W
- Divide the image into a 7 x 7 grid.
- Imagine a set of base boxes centered at each grid cell. Here B = 3.
[Redmon et al., You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016]
[Liu et al., SSD: Single-Shot MultiBox Detector, ECCV 2016]
Slide credit: L. Fei-Fei, J. Johnson, S. Yeung, http://cs231n.stanford.edu/
Detection without proposals: YOLO / SSD
Divide the image into a 7 x 7 grid and imagine a set of base boxes centered at each grid cell (here B = 3).
Within each grid cell:
- Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
- Predict scores for each of C classes (including background as a class)
Output: 7 x 7 x (5 * B + C)
From input image to scores with a single network. Faster, but not as accurate as R-CNN.
See also: Lin et al., Focal loss for dense object detection, ICCV 2017.
Slide credit: L. Fei-Fei, J. Johnson, S. Yeung, http://cs231n.stanford.edu/
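The output size 7 x 7 x (5 * B + C) can be checked with a few lines of arithmetic. The value C = 21 (for example 20 object classes plus background), the function names, and the layout of a cell vector (all box numbers first, then the class scores) are assumptions for this sketch; the slide only fixes the total size.

```python
def output_shape(grid=7, num_boxes=3, num_classes=21):
    """Shape of the prediction tensor: grid x grid x (5 * B + C)."""
    return (grid, grid, 5 * num_boxes + num_classes)


def split_cell(cell, num_boxes, num_classes):
    """Split one grid-cell vector into B boxes and C class scores.

    Assumed layout: B groups of (dx, dy, dh, dw, confidence), then C scores.
    """
    if len(cell) != 5 * num_boxes + num_classes:
        raise ValueError("cell vector has the wrong length")
    boxes = [tuple(cell[5 * b:5 * b + 5]) for b in range(num_boxes)]
    class_scores = list(cell[5 * num_boxes:])
    return boxes, class_scores
```

With B = 3 and the assumed C = 21, each grid cell carries 5 * 3 + 21 = 36 numbers, so the whole output has 7 * 7 * 36 = 1764 values.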
Mask R-CNN: object detection and segmentation
Mask R-CNN = Faster R-CNN with an FCN on RoIs:
1. Object detector using Faster R-CNN
2. Fully convolutional network (FCN) on each region of interest (RoI)
Slide credit: K. He, instancetutorial.github.io
References for object detection

R-CNN
- B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. TPAMI, 2012.
- I. Endres and D. Hoiem. Category independent object proposals. In ECCV, 2010.
- J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.
- A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
- X. Wang, M. Yang, S. Zhu, and Y. Lin. Regionlets for generic object detection. In ICCV, 2013.
- R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

Fast R-CNN
- K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
- R. Girshick. Fast R-CNN. In ICCV, 2015.

Faster R-CNN
- D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
- P. O. Pinheiro, R. Collobert, and P. Dollar. Learning to segment object candidates. In NIPS, 2015.
- S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

Segmentation
- J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- He et al. Mask R-CNN. In ICCV, 2017.
Summary of object detection / segmentation
- Basic idea: train a sliding-window classifier from training data
- Pre-CNN: histogram of oriented gradients (HOG) + linear SVM; jittering and hard negative mining to improve accuracy; region proposals
- R-CNN: combine region proposals and CNN features
- Fast(er) R-CNN: end-to-end training; region proposals and object classification can be trained jointly
- Deeper networks (ResNet-101) improve accuracy
- Mask R-CNN: object detection + segmentation
  - Fully convolutional networks (FCN) for segmentation
  - Loss: segmentation, classification, and bounding box prediction