This is the third blog post in the Object Detection with YOLO series. This post discusses YOLOv2's model architecture. I will use PASCAL VOC2012 data. This blog assumes that the readers have read the previous two blog posts - Part 1, Part 2.
Andrew Ng's YOLO lectures¶
- C4W3L05 Bounding Box Predictions
- C4W3L06 Intersection Over Union
- C4W3L07 Nonmax Suppression
- C4W3L08 Anchor Boxes
- C4W3L09 YOLO Algorithm
Reference¶
- YOLO: Real-Time Object Detection (https://pjreddie.com/darknet/yolo/)
- experiencor/keras-yolo2 (https://github.com/experiencor/keras-yolo2)
- allanzelener/YAD2K (https://github.com/allanzelener/YAD2K)
Reference in my blog¶
- Part 1 Object Detection using YOLOv2 on Pascal VOC2012 - anchor box clustering
- Part 2 Object Detection using YOLOv2 on Pascal VOC2012 - input and output encoding
- Part 3 Object Detection using YOLOv2 on Pascal VOC2012 - model
- Part 4 Object Detection using YOLOv2 on Pascal VOC2012 - loss
- Part 5 Object Detection using YOLOv2 on Pascal VOC2012 - training
- Part 6 Object Detection using YOLOv2 on Pascal VOC 2012 data - inference on image
- Part 7 Object Detection using YOLOv2 on Pascal VOC 2012 data - inference on video
My GitHub repository¶
This repository contains all the IPython notebooks in this blog series and the supporting functions (see backend.py).
import matplotlib.pyplot as plt
import numpy as np
import os
import sys
print(sys.version)
%matplotlib inline
Define anchor boxes¶
ANCHORS defines the number of anchor boxes and the shape of each anchor box.
The choice of the anchor box specialization is already discussed in Part 1 Object Detection using YOLOv2 on Pascal VOC2012 - anchor box clustering.
Based on the K-means analysis in the previous blog post, I select 4 anchor boxes with the following widths and heights. The widths and heights are rescaled to the grid-cell scale, assuming a 13 by 13 grid. See Part 2 Object Detection using YOLOv2 on Pascal VOC2012 - input and output encoding to learn how I rescale the anchor box shapes into the grid-cell scale.
With 4 anchor boxes and a 13 by 13 grid, every image gets 4 x 13 x 13 = 676 bounding box predictions.
ANCHORS = np.array([1.07709888, 1.78171903, # anchor box 1, width , height
2.71054693, 5.12469308, # anchor box 2, width, height
10.47181473, 10.09646365, # anchor box 3, width, height
5.48531347, 8.11011331]) # anchor box 4, width, height
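As a quick sanity check, we can convert these grid-scale shapes back to pixels: one grid cell covers 416 / 13 = 32 pixels, so multiplying by 32 recovers each anchor's size in the input image.
# sanity check: convert the grid-scale anchor shapes back to pixels
# (one grid cell covers IMAGE_W / GRID_W = 416 / 13 = 32 pixels)
for ibox in range(len(ANCHORS)//2):
    print("anchor box {}: width = {:6.1f} px, height = {:6.1f} px".format(
          ibox + 1, ANCHORS[2*ibox]*32, ANCHORS[2*ibox + 1]*32))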
Define the label vector containing the 20 object class names.¶
LABELS = ['aeroplane', 'bicycle', 'bird', 'boat', 'bottle',
'bus', 'car', 'cat', 'chair', 'cow',
'diningtable','dog', 'horse', 'motorbike', 'person',
'pottedplant','sheep', 'sofa', 'train', 'tvmonitor']
YOLOv2 Model Architecture¶
While YOLO's input and output encodings are complex, and its loss function is quite complex too (it will be discussed very soon), the model architecture is simple: it repeatedly stacks Convolution + Batch Normalization + Leaky ReLU blocks until the feature map shrinks to the grid size (13 by 13). Here is the model definition, extracted from experiencor/keras-yolo2.
from keras.models import Sequential, Model
from keras.layers import Reshape, Activation, Conv2D, Input, MaxPooling2D, BatchNormalization, Flatten, Dense, Lambda
from keras.layers.advanced_activations import LeakyReLU
from keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard
from keras.optimizers import SGD, Adam, RMSprop
from keras.layers.merge import concatenate
import keras.backend as K
import tensorflow as tf
# the function to implement the reorg layer (thanks to github.com/allanzelener/YAD2K)
def space_to_depth_x2(x):
return tf.space_to_depth(x, block_size=2)
def ConvBatchLReLu(x,filters,kernel_size,index,trainable):
    # a single Convolution -> Batch Normalization -> Leaky ReLU block
x = Conv2D(filters, kernel_size, strides=(1,1),
padding='same', name='conv_{}'.format(index),
use_bias=False, trainable=trainable)(x)
x = BatchNormalization(name='norm_{}'.format(index), trainable=trainable)(x)
x = LeakyReLU(alpha=0.1)(x)
return(x)
def ConvBatchLReLu_loop(x,index,convstack,trainable):
for para in convstack:
x = ConvBatchLReLu(x,para["filters"],para["kernel_size"],index,trainable)
index += 1
return(x)
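Before using space_to_depth_x2 in the model, here is a minimal shape check (a sketch, assuming TensorFlow 1.x graph tensors): the function rearranges 2 x 2 spatial blocks into channels, so the 26 x 26 x 512 skip connection becomes 13 x 13 x 2048 and can later be concatenated with the 13 x 13 x 1024 main branch.
demo = tf.zeros((1, 26, 26, 512))     # stand-in for the skip-connection tensor
print(space_to_depth_x2(demo).shape)  # (1, 13, 13, 2048)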
def define_YOLOv2(IMAGE_H,IMAGE_W,GRID_H,GRID_W,TRUE_BOX_BUFFER,BOX,CLASS, trainable=False):
convstack3to5 = [{"filters":128, "kernel_size":(3,3)}, # 3
{"filters":64, "kernel_size":(1,1)}, # 4
{"filters":128, "kernel_size":(3,3)}] # 5
convstack6to8 = [{"filters":256, "kernel_size":(3,3)}, # 6
{"filters":128, "kernel_size":(1,1)}, # 7
{"filters":256, "kernel_size":(3,3)}] # 8
convstack9to13 = [{"filters":512, "kernel_size":(3,3)}, # 9
{"filters":256, "kernel_size":(1,1)}, # 10
{"filters":512, "kernel_size":(3,3)}, # 11
{"filters":256, "kernel_size":(1,1)}, # 12
{"filters":512, "kernel_size":(3,3)}] # 13
convstack14to20 = [{"filters":1024, "kernel_size":(3,3)}, # 14
{"filters":512, "kernel_size":(1,1)}, # 15
{"filters":1024, "kernel_size":(3,3)}, # 16
{"filters":512, "kernel_size":(1,1)}, # 17
{"filters":1024, "kernel_size":(3,3)}, # 18
{"filters":1024, "kernel_size":(3,3)}, # 19
{"filters":1024, "kernel_size":(3,3)}] # 20
input_image = Input(shape=(IMAGE_H, IMAGE_W, 3),name="input_image")
true_boxes = Input(shape=(1, 1, 1, TRUE_BOX_BUFFER , 4),name="input_hack")
# Layer 1
x = ConvBatchLReLu(input_image,filters=32,kernel_size=(3,3),index=1,trainable=trainable)
x = MaxPooling2D(pool_size=(2, 2),name="maxpool1_416to208")(x)
# Layer 2
x = ConvBatchLReLu(x,filters=64,kernel_size=(3,3),index=2,trainable=trainable)
x = MaxPooling2D(pool_size=(2, 2),name="maxpool1_208to104")(x)
# Layer 3 - 5
x = ConvBatchLReLu_loop(x,3,convstack3to5,trainable)
x = MaxPooling2D(pool_size=(2, 2),name="maxpool1_104to52")(x)
# Layer 6 - 8
x = ConvBatchLReLu_loop(x,6,convstack6to8,trainable)
x = MaxPooling2D(pool_size=(2, 2),name="maxpool1_52to26")(x)
# Layer 9 - 13
x = ConvBatchLReLu_loop(x,9,convstack9to13,trainable)
skip_connection = x
x = MaxPooling2D(pool_size=(2, 2),name="maxpool1_26to13")(x)
# Layer 14 - 20
x = ConvBatchLReLu_loop(x,14,convstack14to20,trainable)
# Layer 21
skip_connection = ConvBatchLReLu(skip_connection,filters=64,
kernel_size=(1,1),index=21,trainable=trainable)
skip_connection = Lambda(space_to_depth_x2)(skip_connection)
x = concatenate([skip_connection, x])
# Layer 22
x = ConvBatchLReLu(x,filters=1024,kernel_size=(3,3),index=22,trainable=trainable)
# Layer 23
x = Conv2D(BOX * (4 + 1 + CLASS), (1,1), strides=(1,1), padding='same', name='conv_23')(x)
output = Reshape((GRID_H, GRID_W, BOX, 4 + 1 + CLASS),name="final_output")(x)
# small hack to allow true_boxes to be registered when Keras builds the model
# for more information: https://github.com/fchollet/keras/issues/2790
output = Lambda(lambda args: args[0],name="hack_layer")([output, true_boxes])
model = Model([input_image, true_boxes], output)
return(model, true_boxes)
IMAGE_H, IMAGE_W = 416, 416
GRID_H, GRID_W = 13 , 13
TRUE_BOX_BUFFER = 50
BOX = int(len(ANCHORS)/2)
CLASS = len(LABELS)
## true_boxes is the tensor that takes "b_batch"
model, true_boxes = define_YOLOv2(IMAGE_H,IMAGE_W,GRID_H,GRID_W,TRUE_BOX_BUFFER,BOX,CLASS,
trainable=False)
model.summary()
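As a minimal sanity check of the input and output encodings from Part 2, we can feed a dummy image and a dummy true_boxes tensor through the network (a sketch, using zeros as stand-ins):
dummy_image = np.zeros((1, IMAGE_H, IMAGE_W, 3))          # one black image
dummy_boxes = np.zeros((1, 1, 1, 1, TRUE_BOX_BUFFER, 4))  # empty "b_batch" tensor
y_dummy = model.predict([dummy_image, dummy_boxes])
print(y_dummy.shape)  # (1, GRID_H, GRID_W, BOX, 4 + 1 + CLASS) = (1, 13, 13, 4, 25)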
Load pre-trained YOLOv2 weights¶
Following the instructions at YOLO: Real-Time Object Detection, we download the pre-trained weights using wget:
wget https://pjreddie.com/media/files/yolov2.weights
The weights are saved at:
path_to_weight = "./yolov2.weights"
The following code is extracted from keras-yolo2/Yolo Step-by-Step.ipynb.
class WeightReader:
    # code from https://github.com/experiencor/keras-yolo2/blob/master/Yolo%20Step-by-Step.ipynb
    def __init__(self, weight_file):
        # the first 4 values (16 bytes) are the Darknet header; start reading after them
        self.offset = 4
        self.all_weights = np.fromfile(weight_file, dtype='float32')
    def read_bytes(self, size):
        # return the next `size` float32 values and advance the offset
        self.offset = self.offset + size
        return self.all_weights[self.offset-size:self.offset]
    def reset(self):
        self.offset = 4
weight_reader = WeightReader(path_to_weight)
print("all_weights.shape = {}".format(weight_reader.all_weights.shape))
Assign pre-trained weights to the layers conv_i and norm_i, for i = 1, 2, ..., 22.
These layers do not depend on the number of object classes or the number of anchor boxes.
def set_pretrained_weight(model,nb_conv, path_to_weight):
    weight_reader = WeightReader(path_to_weight)
    weight_reader.reset()
    for i in range(1, nb_conv+1):
        conv_layer = model.get_layer('conv_' + str(i)) ## convolutional layer
        ## every layer conv_1, ..., conv_22 is followed by batch normalization,
        ## so its parameters must always be read from the file
        ## (the original experiencor code used nb_conv=23 and guarded the last
        ## layer, which has no batch norm; here we only load layers 1-22)
        norm_layer = model.get_layer('norm_' + str(i)) ## batch normalization layer
        size = np.prod(norm_layer.get_weights()[0].shape)
        beta = weight_reader.read_bytes(size)
        gamma = weight_reader.read_bytes(size)
        mean = weight_reader.read_bytes(size)
        var = weight_reader.read_bytes(size)
        norm_layer.set_weights([gamma, beta, mean, var])
        if len(conv_layer.get_weights()) > 1: ## with bias
            bias = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[1].shape))
            kernel = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[0].shape))
            ## Darknet stores kernels as (out_dim, in_dim, height, width);
            ## reshape and transpose into Keras' (height, width, in_dim, out_dim)
            kernel = kernel.reshape(list(reversed(conv_layer.get_weights()[0].shape)))
            kernel = kernel.transpose([2,3,1,0])
            conv_layer.set_weights([kernel, bias])
        else: ## without bias
            kernel = weight_reader.read_bytes(np.prod(conv_layer.get_weights()[0].shape))
            kernel = kernel.reshape(list(reversed(conv_layer.get_weights()[0].shape)))
            kernel = kernel.transpose([2,3,1,0])
            conv_layer.set_weights([kernel])
    return(model)
nb_conv = 22
model = set_pretrained_weight(model,nb_conv, path_to_weight)
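To verify that the assignment ran (a quick check, not part of the weight-loading logic), we can inspect the first convolution kernel; its shape and statistics should reflect the loaded Darknet weights:
kernel_1 = model.get_layer("conv_1").get_weights()[0]
print(kernel_1.shape)                   # (3, 3, 3, 32)
print(kernel_1.mean(), kernel_1.std())  # summary statistics of the loaded kernel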
Initialize the 23rd layer¶
def initialize_weight(layer,sd):
    # re-initialize the layer's kernel and bias with N(0, sd^2) random values
weights = layer.get_weights()
new_kernel = np.random.normal(size=weights[0].shape, scale=sd)
new_bias = np.random.normal(size=weights[1].shape, scale=sd)
layer.set_weights([new_kernel, new_bias])
layer = model.layers[-4] # the last convolutional layer, conv_23
# use small random weights: scale = 1/(GRID_H*GRID_W), following experiencor/keras-yolo2
initialize_weight(layer, sd=1./(GRID_H*GRID_W))
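The 23rd layer is the only layer whose shape depends on BOX and CLASS, which is why it cannot reuse the pre-trained weights (those were trained with a different number of anchors and classes). A quick shape check:
print(model.get_layer("conv_23").get_weights()[0].shape)
# (1, 1, 1024, 100) since BOX * (4 + 1 + CLASS) = 4 * 25 = 100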
So that's how we define the YOLOv2 model! The next blog post will discuss the loss function used to train this model's parameters.
FairyOnIce/ObjectDetectionYolo contains this IPython notebook and all the functions that I defined in this notebook.