Determine Anchor Boxes for Object Detection

wsh
3 min read · Dec 28, 2018

YOLO is an object detection algorithm that runs much faster than most other object detection algorithms. As its name ("You Only Look Once") suggests, YOLO performs a single feed-forward pass per image, which is why it outpaces region-based models.

The second version of YOLO, called YOLOv2, runs faster than YOLO and introduces several new techniques to make its predictions both more precise and faster. One of them is anchor boxes.

Anchor boxes are predetermined boxes that give a model such as YOLOv2 prior assumptions about the shapes and sizes of bounding boxes. They are derived from the training data and fed to the model, as a list of constants, before training and prediction. For example, if we feed five anchor boxes to the network, the list contains 2 × 5 = 10 numbers, where each consecutive pair represents the width and height of one anchor box. Anchor boxes do not by themselves improve the precision of predictions, but they can make training much faster by constraining the shapes and sizes of candidate bounding boxes.
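As a minimal sketch, here is how five (width, height) pairs become the flat list of ten numbers described above. The specific anchor values are hypothetical, for illustration only:

```python
# Five anchor boxes, each a (width, height) pair in grid-cell units.
# These values are made up for illustration.
anchor_pairs = [(0.57, 0.68), (1.87, 2.06), (3.34, 5.47),
                (7.88, 3.53), (9.77, 9.17)]

# Flatten into the list of 2 x 5 = 10 numbers the network receives.
anchors = [v for pair in anchor_pairs for v in pair]

print(len(anchors))  # 10
print(anchors[:2])   # width and height of the first anchor
```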

Let's say that we want to predict bounding boxes for humans with YOLOv2. A human usually fits a vertical rectangle, and square or horizontal boxes are improbable. So, by using anchor boxes in training and prediction, we shift the question from "what is the shape of the box?" to "how large is the box compared to an anchor box?". This makes training faster.

The other day, I used YOLOv2 to detect terrorists in CS:GO, an online PC game, in real time. To make it practical, I had to run prediction at over 30 or 40 fps. Although that seemed difficult to achieve in Python, I came across an object detection API called darkflow, and also a very fast screen-capture library called python-mss.
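A capture-and-measure loop with python-mss might look like the sketch below; the monitor index and frame count are assumptions to adjust for your setup:

```python
import time

def fps(frame_count, elapsed_seconds):
    """Frames per second achieved over a timed capture loop."""
    return frame_count / elapsed_seconds

# Capture loop sketch using python-mss (requires a display,
# so it is shown as a comment here):
#
#   import mss, numpy as np
#   with mss.mss() as sct:
#       region = sct.monitors[1]                # primary monitor
#       start, frames = time.time(), 0
#       while frames < 100:
#           frame = np.array(sct.grab(region))  # BGRA ndarray
#           frames += 1                         # feed frame to YOLOv2 here
#       print(fps(frames, time.time() - start))
```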

darkflow is a Python implementation of YOLOv2, and it is quite easy to train a model on custom training data with it.
First, I collected training images and annotated them with .xml files.
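darkflow's trainer reads Pascal VOC-style .xml annotations. A minimal sketch of generating one such file with the standard library follows; the filename, label, and pixel coordinates are made-up examples:

```python
import xml.etree.ElementTree as ET

def voc_annotation(filename, width, height, boxes):
    """Build a minimal Pascal VOC-style annotation string.
    `boxes` is a list of (label, xmin, ymin, xmax, ymax) tuples."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        box = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"),
                              (xmin, ymin, xmax, ymax)):
            ET.SubElement(box, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical frame with one labeled box:
xml_text = voc_annotation("frame_0001.png", 1280, 720,
                          [("terrorist", 400, 200, 520, 560)])
```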

You can see that terrorists fit into vertical rectangles. So we can expect the anchor boxes to be vertical rectangles of various sizes.

To see this more concretely, I analyzed the whole training set and made the following plot:

The widths and heights of the bounding boxes are expressed relative to the image size and scaled to YOLOv2's 13 × 13 grid, so they lie between 0 and 13. You can see that the scatter points (blue) roughly fit a line with slope 4.0.
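The conversion described above can be sketched in a couple of lines; the pixel sizes below are made-up numbers:

```python
def to_grid_units(box_w, box_h, img_w, img_h, grid=13):
    """Convert a pixel-space box size to grid-cell units:
    divide by the image size, multiply by the 13x13 YOLOv2 grid."""
    return box_w / img_w * grid, box_h / img_h * grid

# A hypothetical 120x480-pixel person box in a 1280x720 frame:
w, h = to_grid_units(120, 480, 1280, 720)
# h / w is the box's aspect ratio in grid units
```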

From this scatter plot, I read off five anchor boxes (red points):

For example, the pair (0.38, 1.8) defines one anchor box of relative width 0.38 and relative height 1.8. The number of anchor boxes can vary depending on the training data; darkflow uses five by default.
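Instead of reading anchors off a plot by hand, they can also be computed automatically. The YOLOv2 authors used k-means clustering with a 1 − IoU distance on the training boxes' (width, height) pairs; a small sketch of that alternative (the sample sizes are made up):

```python
import random

def iou_wh(a, b):
    """IoU of two boxes aligned at the origin, compared only by (w, h)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(sizes, k=5, iters=20, seed=0):
    """Cluster (w, h) pairs, assigning each box to the center
    it overlaps most (highest IoU), and return k anchor sizes."""
    random.seed(seed)
    centers = random.sample(sizes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for s in sizes:
            best = max(range(k), key=lambda i: iou_wh(s, centers[i]))
            clusters[best].append(s)
        for i, c in enumerate(clusters):
            if c:  # keep the old center if a cluster empties out
                centers[i] = (sum(w for w, _ in c) / len(c),
                              sum(h for _, h in c) / len(c))
    return sorted(centers)

# Made-up person-shaped sizes in grid units:
sizes = [(0.4, 1.6), (0.5, 2.0), (1.0, 4.0), (1.2, 4.6), (2.0, 8.0)]
anchors = kmeans_anchors(sizes, k=2)
```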
