Project report : Pedestrian Detection with TensorFlow
I ) Introduction
During the Computer Vision course we had to work on a project, and we chose the Pedestrian Detection subject. The purpose of this task is to detect and locate the pedestrians in the field of view of the camera.
For that, we used an existing algorithm, SSD (Single Shot MultiBox Detector). The implementation is written in Python and uses the TensorFlow library.
In this report, we will first explain the VGG16 network. After that, we will introduce the SSD model. Then, we will run some tests in order to see how it works.
II ) The VGG16 network
The SSD network is based on the VGG16 architecture, so in this section we will explain this architecture.
Figure 1 : VGG16 network
This network is called VGG16 because it has 16 weight layers.
To build this network we begin with two convolutions, each followed by the ReLU activation function, and then compress the result with max pooling. We repeat this a second time. Next we add three convolutions, again with ReLU and max pooling; this block of three is repeated three times. We then add three fully connected layers, still using ReLU, and finish with a softmax classifier.
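As an illustration of this stack, the following plain-Python sketch tracks only the spatial size of the activations through the network (assuming the standard 224x224 input of VGG16; this is a counting exercise, not a framework implementation):

```python
# VGG16 layer stack as described above: (number of 3x3 convolutions, channels),
# each block followed by a 2x2 max pooling with stride 2.
blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

size = 224          # assumed input width/height (the standard VGG16 input)
weight_layers = 0
for n_convs, channels in blocks:
    weight_layers += n_convs   # the 3x3 convolutions keep the spatial size
    size //= 2                 # max pooling halves width and height
weight_layers += 3             # the three fully connected layers at the end

print(size)           # 7  : spatial size entering the fully connected part
print(weight_layers)  # 16 : 13 convolutions + 3 fully connected layers
```

The spatial size is halved by each of the five pooling steps (224 down to 7), and the 13 convolutions plus the 3 fully connected layers give the 16 weight layers of the name.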
Convolution : The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of learnable filters. Each filter is convolved across the width and height of the input volume, computing the dot product between the entries of the filter and the input and producing an activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.
Stacking several convolutions allows each neuron to look at a small region of the input.
Activation function : An activation function of a node defines the output of that node given an input or a set of inputs.
ReLU function : The ReLU function is defined as f(x) = max(0, x).
Pooling layer : The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters and amount of computation in the network.
Max pooling : Max pooling is a pooling layer. The most common form uses filters of size 2x2 applied with a stride of 2, which downsamples every depth slice of the input by 2 along both width and height.
Figure 2 : Max pooling with a 2x2 filter and stride = 2
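The 2x2, stride-2 max pooling of Figure 2 can be reproduced in a few lines of NumPy (a toy sketch on a single 2D slice, not the library implementation):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a (H, W) array (H and W even)."""
    h, w = x.shape
    # group the non-overlapping 2x2 windows, then take the max of each
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(x))
# [[6 8]
#  [3 4]]
```

Each 2x2 window is replaced by its maximum, which is why each pooling step divides the width and height by two while leaving the depth unchanged.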
Fully connected layers : After several convolutional and max pooling layers, we need to add fully connected layers. Neurons in a fully connected layer have connections to all activations in the previous layer.
The loss layer : It specifies how training penalizes the deviation between the predicted and true labels and is normally the final layer.
Softmax loss : The softmax function is often used in the final layer of a neural network-based classifier. This loss is used for predicting a single class of K mutually exclusive classes.
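For illustration, here is a minimal NumPy version of the softmax function (subtracting the maximum before exponentiating is a standard trick to avoid overflow):

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; subtracting the max avoids overflow."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])  # raw scores for K = 3 exclusive classes
p = softmax(scores)
print(p)  # ≈ [0.659 0.242 0.099], probabilities that sum to 1
```

The output can be read as a probability distribution over the K mutually exclusive classes, with the largest raw score receiving the largest probability.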
III ) The SSD model
SSD is a fast single-shot object detector for multiple categories.
The network of the SSD model is the following :
Figure 3 : SSD model
We can see that it is composed of the VGG16 network, followed by several additional convolutions. At the end there is a non-maximum suppression algorithm in order to eliminate multiple detections: the detection with the highest score is selected and all the overlapping ones are removed from the detection set.
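The greedy non-maximum suppression step can be sketched as follows (a simplified version on [x1, y1, x2, y2] boxes with an assumed IoU threshold of 0.45; the downloaded code may differ in its details):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it."""
    order = scores.argsort()[::-1]      # indices sorted by decreasing score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of box i with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # keep low-overlap boxes only
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the near-duplicate of box 0 is removed
```

The highest-scoring box is kept, every remaining box that overlaps it too much (IoU above the threshold) is discarded, and the process repeats on what is left.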
The SSD network that we downloaded is already trained. To restore it, we need to run the following instructions :
ckpt_filename is the file where the trained model is saved.
If we want to use it, we have to feed an image to the network.
The function process_image gives this image to the network and returns the classes, the scores and the positions of the boxes.
The function visualization.plt_bboxes returns the initial image with the boxes plotted, the number of the class and its score.
We have seen in the code that the total number of classes is 21.
IV ) Network tests
In this part we will examine the performance of this network by giving it different images.
First we will give it an image from the Demo database.
Figure 4 : Example of object detection in an image
Figure 4 shows the image that the algorithm returns. We can see that it is able to recognize pedestrians and also other objects such as cars and bikes. In the image we have a colored box around each detected object. Each class has a box in a specific color with the number that represents this class. In these boxes we also have the confidence score, i.e. the estimated probability that the object is correctly detected.
We can notice that the pedestrians have the number fifteen. We remark that the pedestrians in the foreground of the image are better detected than those in the background.
With this example we can say that pedestrians are well detected.
Now we will take another image, one that does not come from the Demo database, to compare.
Figure 5 : Example of pedestrian detection in an image.
In this picture there are eight people and the algorithm detects only four of them.
For the box with the score 0.975 we do not know which person is detected. It could be the girl or the man lying horizontally.
Therefore, even if we have a high score, it does not mean that the algorithm has detected the object correctly.
To conclude, we presented the VGG16 network. Then, we introduced the SSD network, which uses the VGG architecture. After that, we did some tests to evaluate the efficiency of this model. Thanks to these tests we noticed that pedestrians are not always well detected, even if the score is high.
The next part of this project will be to modify the code in order to make it able to identify only pedestrians.
Research on the field of visual tracking
Many studies in the past decades have shown that about 90-95% of car crashes are caused by human error. From those observations, the fields of mathematics and computer science began to think about new technologies to help reduce human mistakes.
In the context of automation and with the rise of artificial intelligence, we propose to tackle the problem of visual tracking. The goal of the algorithm is to detect and classify different objects in a given situation. The task can be separated into two "easier" tasks : first the detection of the position and the shape of the object, then the classification as a known type (pedestrian, dog, car…).
The main problems of visual tracking reside in predicting the position and the shape of the same desired object through time, in every possible condition (different lighting, different orientation, separation of bounding…). The algorithm must also be more reliable than any human.
Many solutions exist, but the recent soaring of machine learning in visual recognition, and the excellent precision those models reach on image classification, have exposed deep learning methods as the most relevant to tackle visual tracking problems.
We focused on two models : on the one hand the Faster-RCNN model by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, on the other hand the Single Shot MultiBox Detector by Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg.
Both methods are efficient at solving the problem, but they produce different results.
The main difference between those two models is that one does the localisation and the classification at the same time while the other splits those two outputs into two different problems.
We will compare the architecture and the performances of those models to finally conclude on the model that we will use during this project.
The R-CNN method trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly acts as a classifier, and it does not predict object bounds (except for refinement by bounding-box regression). Its accuracy depends on the performance of the region proposal module.
The architecture of the model is composed of two modules :
- The first module is a deep fully convolutional network that proposes regions
- The second module is the Fast R-CNN detector that uses the proposed regions
The problem is then separated into two sub-problems : one that focuses on spatial localisation and the other on classification.
In the table below we can see the precision of the Faster R-CNN on the PASCAL VOC 2007 data set.
SSD is a single-shot detector for multiple categories that is faster than the previous state-of-the-art single-shot detector (YOLO), and significantly more accurate, in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN).
The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps. To achieve high detection accuracy it produces predictions at different scales from feature maps of different scales, and explicitly separates predictions by aspect ratio. These design features lead to simple end-to-end training and high accuracy, even on low-resolution input images, further improving the speed vs accuracy trade-off.
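The idea of a fixed set of default boxes per feature-map cell can be illustrated with a toy generator (the scales and aspect ratios below are made up for the example; they are not the exact values used by SSD):

```python
import itertools

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Default boxes (cx, cy, w, h) in relative image coordinates
    for every cell of an fmap_size x fmap_size feature map."""
    boxes = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx = (j + 0.5) / fmap_size        # center of the cell
        cy = (i + 0.5) / fmap_size
        for ar in aspect_ratios:
            # same area for every aspect ratio at a given scale
            boxes.append((cx, cy, scale * ar ** 0.5, scale / ar ** 0.5))
    return boxes

# a coarse map gives few large boxes, a fine map gives many small ones
coarse = default_boxes(fmap_size=3, scale=0.5)
fine = default_boxes(fmap_size=8, scale=0.1)
print(len(coarse), len(fine))  # 27 192
```

A coarse feature map yields a few large boxes while a fine feature map yields many small ones, which is how predictions at different scales are obtained.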
The model is based exclusively on convolutional and pooling layers.
The first part (base network) of the architecture is based on the VGG-16 model for visual classification, truncated before the fully connected layers. This part, like all the others, produces a fixed set of detection predictions using a set of convolutional filters.
Then convolutional feature layers are added to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales.
The key difference between training SSD and training a typical detector that uses region proposals is that ground-truth information needs to be assigned to specific outputs in the fixed set of detector outputs. Training also involves choosing the set of default boxes and scales for detection, as well as the hard negative mining and data augmentation strategies.
The particularity of this model is that each feature map produces detections, meaning that backpropagation is applied to every feature map.
Each feature represents a window in which the probability that the object appears in a fixed-size box is computed.
Unlike the Faster R-CNN network, this model does not use previously proposed locations. It performs the localisation and the classification tasks at each layer level.
The model has very good precision, even better than the Faster R-CNN.
Below you can see the comparison, on the PASCAL VOC 2007 data set, of the Faster R-CNN accuracy and the SSD accuracy. We clearly see that the SSD models are more accurate on the detection of all the classes of the data set.
The development of new technologies grows continuously. In order to limit human errors and to automate several tasks, new algorithms are implemented. In this document, we were interested in visual tracking. We focused our analysis on two models : SSD (Single Shot MultiBox Detector) and Faster R-CNN (Region-based Convolutional Neural Network). The comparison between the two models has shown that SSD performs better in many different situations, for it beats the Faster R-CNN on every dataset.
The Faster R-CNN model was the first to achieve good performance. Having a 60% mean error showed that the research was heading in the right direction, but the results remained too low to be satisfying. Its architecture, based on two phases, was a first approach to the visual tracking problem. The separation allowed the user to tackle two very different objectives separately, but it also appeared to be very complex for its performance.
It is in this context of active research that Liu et al. proposed a new approach : instead of splitting the task into two subtasks, the algorithm performs both the localisation and the classification at each level of abstraction. Thus, the higher levels use the information provided by the lower ones. This slight difference allows the model to outperform the Faster R-CNN by 5% on average.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (pp. 91-99).
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, Cham.
In the context of a project held by Clermont Auvergne University and the engineering school SIGMA, we are studying the possibility of creating a 3D printing system for concrete. The goal of this system is to deposit the concrete from a print head placed at the end of the cable of a hoist. All of this is done with the help of a vision camera whose sensor will be positioned on the print head to follow people's activities on the ground. From this, we can extract 4 main tasks, which are :
- Calibration of the camera, detection of a sight, location of the camera relative to the ground, and detection of people.
My team and I decided to work on the detection of a sight. We have to find a way for the camera to detect the sight in the image/video and to follow it. To do that, we have to detect interest points in a sight reference image (many examples are available in the OpenCV library) and to find these same interest points in the current image.
II) How does it work?
As we said before, our goal is, given a query image of our sight and the image/video taken by the camera, to find our sight and follow its position.
In Illustrations 1 and 2, we can see our sight and the image where we want to find it (a very basic example).
1) Detection of interest points in our reference sight
The first step of the algorithm is to detect interest points (also called keypoints) in our reference sight.
An interest point in a picture is a point which has a clear position and which is stable under perturbations such as rotation, translation, scaling, illumination, etc. The goal is to detect these points, at first in our reference sight and then in every frame of the video. Multiple algorithms exist to find such points; one of the most famous is the Harris detector, a corner detection algorithm.
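To make the idea concrete, here is a toy NumPy version of the Harris response R = det(M) - k * trace(M)^2, using a flat 3x3 window instead of the usual Gaussian weighting (real code would simply call OpenCV's cv2.cornerHarris):

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)**2 at every pixel."""
    iy, ix = np.gradient(img.astype(float))   # image gradients
    ixx, iyy, ixy = ix * ix, iy * iy, ix * iy

    def window_sum(a):
        # sum over a 3x3 neighbourhood (flat window instead of a Gaussian)
        out = np.zeros_like(a)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out += np.roll(np.roll(a, dy, axis=0), dx, axis=1)
        return out

    sxx, syy, sxy = window_sum(ixx), window_sum(iyy), window_sum(ixy)
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace ** 2

# toy image: a white square on a black background
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
r = harris_response(img)
# positive response at a corner of the square, negative on a straight edge
print(r[4, 4] > 0, r[4, 8] < 0)  # True True
```

The response is large and positive where the gradient varies in both directions (a corner), negative on a straight edge, and near zero in flat regions, which is exactly the stability property we want for interest points.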
2) Matching of these points
Now that we know how to detect interest points, we need to match the interest points of our reference sight with the interest points of the video. To do that, we need a feature descriptor, which will be used to compare and match the different interest points.
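Matching can be sketched as a brute-force nearest-neighbour search over descriptor vectors, with Lowe's ratio test to discard ambiguous matches (the descriptors below are made-up 2D vectors for the example; real descriptors such as SIFT have 128 dimensions):

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Brute-force matching: for each descriptor of desc_a, find its nearest
    neighbour in desc_b; keep the match only if the best distance is clearly
    smaller than the second best (Lowe's ratio test)."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # distance to all candidates
        j, j2 = np.argsort(dists)[:2]               # best and second best
        if dists[j] < ratio * dists[j2]:
            matches.append((i, int(j)))
    return matches

ref = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # reference descriptors
cur = np.array([[0.0, 1.1], [0.9, 0.1], [0.5, 0.4]])  # current-frame descriptors
print(match_descriptors(ref, cur))  # [(0, 1), (1, 0), (2, 2)]
```

Each reference descriptor is paired with its closest descriptor in the current frame, and matches whose best and second-best distances are too similar are rejected as unreliable.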
These two previous steps are the core of every image detection algorithm.
Now we will study the program itself, used for our project (it was available in OpenCV).
III) Algorithm utilisation examples
For the moment, the algorithm we use is not the final one, because in the version we describe now we have to select the initial position of the sight manually; the sight is then followed in the video.
First, when we execute the algorithm, we have to select (with a green rectangle) the area of the object we want to follow. In our example, we are trying to follow the card, so we selected it, and all interest points will be found in this area.
After the selection, when we release the click, we can see that a blue "house" surrounds the selected area.
Then, as the video continues, we can see that the "house" keeps following the selected object.
As long as no other area (a new object to follow) is selected with the green rectangle, the algorithm will keep tracking it.
In the next part we will explain the different sequences of this program.
IV) Implementation with OpenCv
This process is fulfilled by using the OpenCV library in the Python language.
We use the preprogrammed script plane_ar.py, which proposes to select a rectangular part of a moving image and tries to follow it in its movements. Moreover, it builds a 3D-looking "house" above the selected shape which follows it along its movements.
This program is composed of an initialization sequence __init__ that creates the main window and a trackbar that allows the user to adapt the focal length of the image. It also creates plane zone selectors thanks to common.RectSelector(). This function needs the defined function on_rect(), which adds the selected target to a tracking list using tracker.add_target().
class App:
    def __init__(self, src):
        self.cap = video.create_capture(src, presets['book'])
        self.frame = None
        self.paused = False
        self.tracker = PlaneTracker()
        cv2.namedWindow('plane')
        cv2.createTrackbar('focal', 'plane', 25, 50, common.nothing)
        self.rect_sel = common.RectSelector('plane', self.on_rect)

    def on_rect(self, rect):
        self.tracker.add_target(self.frame, rect)
We now define a function run() that, as its name suggests, runs the App itself.
Firstly, everything is set up to make the gif or the video play, and to pause it while a rectangular zone is being selected. Then, using the OpenCV functions polylines and circle, we draw the house-shaped cube above the selected zone and we circle the interest points used for detection.
After everything is drawn and shown thanks to the draw() and imshow() functions of OpenCV, we set two interactions with the window: press "Space" to pause the video/gif and "c" to clear all the trackers.
playing = not self.paused and not self.rect_sel.dragging
for tr in tracked:
    cv2.polylines(vis, [np.int32(tr.quad)], True, (255, 255, 255), 2)
    for (x, y) in np.int32(tr.p1):
        cv2.circle(vis, (x, y), 2, (255, 255, 255))
ch = cv2.waitKey(1)
if ch == ord(' '):
    self.paused = not self.paused
if ch == ord('c'):
    self.tracker.clear()
Finally, we define a function draw_overlay() in which everything is set up to put the 3D reconstruction of the house-shaped cube in place on the tracked points. Indeed, it builds a camera matrix K and creates a 3D projection thanks to the OpenCV function projectPoints().
def draw_overlay(self, vis, tracked):
    x0, y0, x1, y1 = tracked.target.rect
    quad_3d = np.float32([[x0, y0, 0], [x1, y0, 0], [x1, y1, 0], [x0, y1, 0]])
    fx = 0.5 + cv2.getTrackbarPos('focal', 'plane') / 50.0
    h, w = vis.shape[:2]
    K = np.float64([[fx * w, 0, 0.5 * (w - 1)],
                    [0, fx * w, 0.5 * (h - 1)],
                    [0, 0, 1]])
    dist_coef = np.zeros(4)
    _ret, rvec, tvec = cv2.solvePnP(quad_3d, tracked.quad, K, dist_coef)
    verts = ar_verts * [(x1 - x0), (y1 - y0), -(x1 - x0) * 0.3] + (x0, y0, 0)
    verts = cv2.projectPoints(verts, rvec, tvec, K, dist_coef)[0].reshape(-1, 2)
    for i, j in ar_edges:
        (x0, y0), (x1, y1) = verts[i], verts[j]
        cv2.line(vis, (int(x0), int(y0)), (int(x1), int(y1)), (255, 255, 0), 2)
Then the main function is implemented. It simply runs the main App and sets a default image if none is submitted by the user when calling the main program.
V) Test using our own materials
Encountering problems with the OpenCV library, we decided to cut the videos into multiple images thanks to the filezigzag.com website. We now input the treated video as a sequence of consecutive images to avoid the error, using the command cv2.VideoCapture('im/%08d.ppm'). Thanks to this process we managed to read our own videos. We could then launch the plane_ar.py program with customized videos to test the recognition of our sight. (Unfortunately, we cannot take screenshots of our video because of problems with the virtual machine.)
The aim will now be to find a way to give the program our card as a reference instead of selecting a rectangular zone manually. This way, it will be able to spot the wished shape in the image automatically and to follow it through its movements. To do so, we are thinking about removing the RectSelector() function and finding a way to pass our reference image as an argument.
Project n°4: Person detection
The goal of the project is to detect moving persons in a video using deep learning methods. We train the neural network to detect persons in images by learning from an annotated database. More specifically, we will use the SSD method thanks to the TensorFlow module in Python.
I) Convolutional networks
This method uses convolutional neural networks, that is to say neural networks that have convolutional layers:
A convolutional layer is composed of groups of neurons; each group only processes a certain window of the initial image. A backpropagation algorithm is used to modify the filters that each neuron applies to the window it is assigned to.
A pooling (aggregation) layer is often applied after a convolutional layer to reduce the dimension of the output. The outputs are then connected to a fully connected layer (a layer in which each neuron is connected to every output of the previous layer) to allow an overall processing of the information.
Scheme of a convolutional layer (in blue) applied on an image (in pink)
This type of network is particularly relevant for image recognition, as it processes information at the local scale (which is ideal for edge recognition, for example), allowing much faster and better-adapted processing.
II) VGG network
The VGG network is a multi-layer convolutional network that aims to predict the probability of presence of object classes in the image. Convolutional layers are applied to the input image, followed by a pooling layer; then convolutional layers are applied again, and so on. After several iterations, each reducing the dimension of the output, fully connected layers are applied and finally a classification layer gives the output probability for each class of object.
Scheme of a VGG network
This model of network is one of the most efficient for image recognition: it managed to attain more than 92% successful recognition on the ImageNet database.
III) SSD network
The SSD network, standing for Single Shot MultiBox Detector, is a method for detecting objects in an image using a single deep neural network. It is part of the family of networks which predict the bounding boxes of objects in a given image. It is a simple, end-to-end single network, removing many steps involved in other networks that try to achieve the same task.
The SSD network uses the VGG architecture as a base. But instead of trying to classify the image after it went through the VGG, we remove the fully connected layers at the end of the VGG. Then we apply several convolutional layers; the output of the VGG as well as the outputs of every following layer (of decreasing dimensions) are all connected to detection layers that compute all this information…
An SSD scheme
IV) Code explanation
The lines below explain some parts of the code of the SSD framework :
The demo folder contains a set of images for testing the SSD algorithm in the main file.
The Notebook folder contains a minimal example of the SSD TensorFlow pipeline. Basically, the detection is made of two main steps:
1) Running the SSD network on the image
2) Post-processing the output (drawing a rectangle around the detected object with a number which designates the class to which the object belongs).