Research in the field of visual tracking




Many studies over the past decades have shown that about 90-95% of car crashes are caused by human error. Following these observations, researchers in mathematics and computer science began to develop new technologies to help reduce human mistakes.


In the context of automation and the rise of artificial intelligence, we propose to tackle the problem of visual tracking. The goal of the algorithm is to detect and classify the different objects in a given scene. The task can be separated into two “easier” tasks: first, detecting the position and shape of the object; then, classifying it as a known type (pedestrian, dog, car…).


The main difficulty of visual tracking lies in predicting the position and shape of the same object through time, under every possible condition (different lighting, different orientation, separation of bounding…). The algorithm must also be more reliable than a human.


Many solutions exist, but the recent surge of machine learning in visual recognition, and the excellent precision these models achieve on image classification, have made deep learning methods the most relevant for visual tracking problems.

We focused on two models: on the one hand, the Faster R-CNN model by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun [1]; on the other hand, the Single Shot MultiBox Detector (SSD) by Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg [2].

Both methods solve the problem effectively, but they produce their results in different ways.


The main difference between the two models is that one performs localisation and classification at the same time, while the other splits these two outputs into two separate problems.

We will compare the architecture and performance of these models in order to decide which one to use in this project.







Faster R-CNN

The R-CNN method [1] trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly acts as a classifier and does not predict object bounds (except for refinement by bounding-box regression). Its accuracy depends on the performance of the region proposal module.
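The bounding-box regression mentioned above can be illustrated with the offset parameterisation commonly used in the R-CNN family, where the network predicts four offsets relative to a proposal box. The sketch below is only an illustration under stated assumptions: the function name `apply_box_deltas` and the centre/size box format are hypothetical choices made for this example, not the authors' code.

```python
import math

def apply_box_deltas(box, deltas):
    """Refine a proposal box with predicted regression offsets.

    box    : (cx, cy, w, h) proposal, centre coordinates plus width/height
    deltas : (tx, ty, tw, th) offsets predicted by the regressor
    """
    cx, cy, w, h = box
    tx, ty, tw, th = deltas
    # Centre shifts are expressed as fractions of the proposal size.
    new_cx = cx + tx * w
    new_cy = cy + ty * h
    # Width and height are rescaled in log space, so they stay positive.
    new_w = w * math.exp(tw)
    new_h = h * math.exp(th)
    return (new_cx, new_cy, new_w, new_h)

# Hypothetical proposal and offsets, chosen only for the example.
proposal = (50.0, 40.0, 20.0, 10.0)
refined = apply_box_deltas(proposal, (0.1, -0.2, 0.0, 0.0))
print(refined)
```

With all four offsets equal to zero, the proposal is returned unchanged, which is why the regression only acts as a refinement step.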

The architecture of Faster R-CNN is composed of two modules:

    - The first module is a deep fully convolutional network that proposes regions

    - The second module is the Fast R-CNN detector that uses the proposed regions


The problem is thus separated into two sub-problems: one focused on spatial localisation, the other on classification.
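The two-stage flow described above can be sketched as a toy pipeline. Both stages here are deliberately trivial stand-ins (fixed proposals, and a brightness threshold instead of a learned classifier); the names `propose_regions`, `classify_region` and `detect_two_stage` are hypothetical and only show how the output of the first module feeds the second.

```python
def propose_regions(image):
    """Stage 1 stand-in: return candidate boxes as (x_min, y_min, x_max, y_max).

    A real region proposal network would score many candidate boxes;
    here two fixed boxes are returned purely for illustration.
    """
    height, width = len(image), len(image[0])
    return [
        (0, 0, width // 2, height // 2),
        (width // 2, height // 2, width, height),
    ]

def classify_region(image, box):
    """Stage 2 stand-in: label a region by its mean pixel intensity."""
    x0, y0, x1, y1 = box
    pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    mean = sum(pixels) / len(pixels)
    return "object" if mean > 0.5 else "background"

def detect_two_stage(image):
    """Run localisation first, then classification on each proposed region."""
    return [(box, classify_region(image, box)) for box in propose_regions(image)]

# Toy 4x4 "image": bright top-left quadrant, dark elsewhere.
image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
detections = detect_two_stage(image)
print(detections)
```

The point of the sketch is the dependency: the classifier only ever sees the regions handed over by the proposal stage, so a missed proposal cannot be recovered later.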




The table below shows the precision of Faster R-CNN on the PASCAL VOC 2007 data set.






SSD

SSD [2] is a single-shot detector for multiple categories that is faster than the previous state of the art for single-shot detectors (YOLO) and significantly more accurate, in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN). The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps. To achieve high detection accuracy, it produces predictions of different scales from feature maps of different scales, and explicitly separates predictions by aspect ratio. These design features lead to simple end-to-end training and high accuracy, even on low-resolution input images, further improving the speed vs accuracy trade-off.
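The fixed set of default boxes can be made concrete with a small sketch. It assumes the linear scale rule described in [2], with scales running from 0.2 to 0.9 of the image size across the feature maps, and boxes of width scale·√ratio and height scale/√ratio centred on each cell. The function names and the reduced set of three aspect ratios are assumptions made for this example, and the extra box that [2] adds for aspect ratio 1 is left out for brevity.

```python
import math

def layer_scales(num_layers, s_min=0.2, s_max=0.9):
    """Linearly spaced box scales, one per feature map (fractions of image size)."""
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Default boxes (cx, cy, w, h) for a square feature map, in [0, 1] coordinates."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Each box is centred on its feature-map cell.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ratio in aspect_ratios:
                # Same area for every ratio, only the shape changes.
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                boxes.append((cx, cy, w, h))
    return boxes

scales = layer_scales(num_layers=6)
boxes = default_boxes(fmap_size=3, scale=scales[-1])
print(len(boxes))  # 3 * 3 cells * 3 aspect ratios = 27 boxes
print(boxes[0])
```

Because the boxes are fixed in advance, the network only has to predict, for each of them, a set of category scores and four offsets.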


The model is based exclusively on convolutional and pooling layers.

The first part of the network (the base network) is based on the VGG-16 model for visual classification, truncated before the fully connected layers. Like all the others, this part produces a fixed set of detection predictions using a set of convolutional filters.

Then convolutional feature layers are added to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales.
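To give an idea of what multi-scale prediction means in numbers, the sketch below counts the default boxes of the SSD300 configuration reported in [2] (six square feature maps, with 4 or 6 boxes per location). The constant and function names are hypothetical; the 21 classes used in the example correspond to the 20 PASCAL VOC categories plus background.

```python
# (feature map side, default boxes per location), as reported for SSD300 in [2].
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

def count_default_boxes(layers):
    """Total number of default boxes over all square feature maps."""
    return sum(side * side * per_cell for side, per_cell in layers)

def count_outputs(layers, num_classes):
    """Each default box gets num_classes scores plus 4 box offsets."""
    return count_default_boxes(layers) * (num_classes + 4)

total_boxes = count_default_boxes(SSD300_LAYERS)
total_outputs = count_outputs(SSD300_LAYERS, num_classes=21)
print(total_boxes)    # 8732
print(total_outputs)  # 218300
```

Most of the boxes come from the largest feature map (38 × 38 × 4 = 5776), which handles small objects, while the 1 × 1 map contributes only 4 boxes covering nearly the whole image.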

The key difference between training SSD and training a typical detector that uses region proposals is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs. Training also involves choosing the set of default boxes and scales for detection, as well as the hard negative mining and data augmentation strategies.
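The assignment of ground truth to specific outputs can be sketched as an overlap-based matching. The example assumes the strategy described in [2]: a default box is positive when its intersection over union (IoU) with a ground truth box exceeds 0.5, and every ground truth additionally claims its best-overlapping default box. The function names and the corner-coordinate box format are assumptions made for this illustration; hard negative mining and data augmentation are not shown.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_default_boxes(default_boxes, ground_truths, threshold=0.5):
    """Map each default box index to a ground-truth index, or None (negative)."""
    matches = [None] * len(default_boxes)
    if not default_boxes:
        return matches
    # Any default box overlapping a ground truth above the threshold is positive.
    for d, dbox in enumerate(default_boxes):
        overlaps = [iou(dbox, gt) for gt in ground_truths]
        best = max(range(len(overlaps)), key=overlaps.__getitem__, default=None)
        if best is not None and overlaps[best] > threshold:
            matches[d] = best
    # Every ground truth also claims its best-overlapping default box.
    for g, gt in enumerate(ground_truths):
        best_d = max(range(len(default_boxes)),
                     key=lambda d: iou(default_boxes[d], gt))
        matches[best_d] = g
    return matches

# Hypothetical boxes in normalised image coordinates.
defaults = [(0.0, 0.0, 0.5, 0.5), (0.4, 0.4, 1.0, 1.0), (0.0, 0.5, 0.5, 1.0)]
truths = [(0.45, 0.45, 0.95, 0.95)]
matched = match_default_boxes(defaults, truths)
print(matched)
```

Default boxes left at `None` are the negatives; since they vastly outnumber the positives, only a subset of them is typically kept during training, which is the role of hard negative mining.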


A particularity of this model is that each feature map produces detections, meaning that the training loss is backpropagated from every one of them.

Each feature map location represents a window in which the probability that an object appears in a fixed-size box is computed.

Unlike Faster R-CNN, this model does not use previously proposed locations: it performs both localisation and classification at each layer.



The model has very good precision, even better than that of Faster R-CNN.

Below is a comparison of the accuracy of Faster R-CNN and SSD on the PASCAL VOC 2007 data set. The SSD models are clearly more accurate on the detection of all the classes of the data set.






New technologies are being developed continuously. To limit human error and to automate various tasks, new algorithms are being implemented. In this document, we were interested in visual tracking. We focused our analysis on two models: SSD (Single Shot MultiBox Detector) and Faster R-CNN (Region-based Convolutional Neural Network). The comparison between the two models has shown that SSD performs better in many different situations, as it beats Faster R-CNN on every data set considered.


The Faster R-CNN model was the first to achieve good performance. Its 60% mean error showed that the research was heading in the right direction, but the results remained too low to be satisfactory. Its architecture, based on two phases, was a first approach to the visual tracking problem. The separation allowed two very different objectives to be tackled separately, but the model also appeared very complex for the performance it delivers.


It is in this context of active research that Liu et al. [2] proposed a new approach: instead of splitting the task into two subtasks, the algorithm performs both localisation and classification at each level of abstraction. Thus, the higher levels use the information provided by the lower ones. This slight difference allows the model to outperform Faster R-CNN by 5% on average.

References:


[1] Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91-99).


[2] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, Cham.