I will revisit Driver's facial keypoint detection. In this blog, I will improve the landmark detection model's performance with data augmentation. The ImageDataGenerator for landmark detection is implemented on my GitHub account and discussed in my previous blog, Data augmentation for facial keypoint detection.
## Import the usual libraries
import matplotlib.pyplot as plt
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
import keras, sys, time, os, warnings, cv2
import numpy as np
import pandas as pd
warnings.filterwarnings("ignore")
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.95
config.gpu_options.visible_device_list = "4"
set_session(tf.Session(config=config))
print("python {}".format(sys.version))
print("keras version {}".format(keras.__version__)); del keras
print("tensorflow version {}".format(tf.__version__))
dir_data = "DrivFace/"
Read in the annotated data
This step is the same as in the previous analysis.
I will read drivPoints.txt into a pandas DataFrame that contains the facial keypoints and the driver's information for each image. The data contain not only the facial keypoints but also the bounding box.
labels = open(dir_data + "drivPoints.txt").read()
labels = labels.split("\r\n")
lines = [line.split(",") for line in labels]
df_label = pd.DataFrame(lines[1:],columns=lines[0])
cols = list(set(df_label.columns) - set(["fileName"]))
df_label[cols] = df_label[cols].apply(pd.to_numeric, errors='coerce')
## remove the rows with NA
print(df_label.shape)
df_label = df_label.dropna()
print(df_label.shape)
df_label.head(3)
The annotated landmarks are:
- Right Eye (RE)
- Left Eye (LE)
- Nose (N)
- Right Mouth (RM)
- Left Mouth (LM)
In the pandas DataFrame above, the (x,y) coordinates of these landmarks are recorded in the columns named "x" + landmark abbreviation and "y" + landmark abbreviation (for example, xRE and yRE for the right eye).
landmarks = ["RE","LE","N","RM","LM"]
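As a quick sanity check, the coordinate column names can be assembled from this list; the snippet below is a minimal sketch (coord_cols is just an illustrative name, not used later) that prints the coordinate columns and peeks at their first rows.
coord_cols = [prefix + lm for lm in landmarks for prefix in ("x", "y")]
print(coord_cols)
df_label[coord_cols].head(3)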
Extract image data
The image data is extracted in the same order as the rows of df_label.
In the following code, I create a list object imgs such that:
- imgs[i] contains a NumPy array of the image corresponding to df_label.iloc[i,:].
from keras.preprocessing.image import img_to_array, load_img
imgs = []
count = 0
for jpg in df_label["fileName"]:
if count % 100 == 0:
print(count)
try:
img = img_to_array(load_img(dir_data + "/DrivImages/" + jpg +".jpg"))
except:
img = []
pass
imgs.append(img)
count += 1
assert len(imgs) == df_label.shape[0]
Bounding box of varying size is available
Our data luckily provide a bounding box for every image. The width and height of the box vary across images, but they are always more than 90 pixels. The plots below show histograms of the widths and heights of the bounding boxes.
In our analysis, we assume that the bounding box is given, as in the previous analysis. The model performance is then assessed on the landmark detection accuracy within the "down-sized" bounding box. Here, "down-sized" bounding box means that the original image is cropped to the bounding box and the cropped image is then rescaled to a reduced size: width = 90 and height = 90.
So, during training, we could crop the original image to the bounding box, resize the cropped image to (90, 90), and then translate the resized image into various augmented images using the ImageDataGenerator for landmark detection.
However, resizing the image before the image translation reduces the number of potential augmented images that the translation can produce, compared with doing the image translation first and resizing afterwards. Therefore, for training, I will first crop the image to a bounding box of a common size (without resizing). As the bounding box provided in the data has varying width and height, I adjust both to 150 by extending the box while keeping its center the same as the original (see the short sketch below). The cropped (150, 150) image is then translated into various augmented images using the ImageDataGenerator for landmark detection, and the augmented images are finally resized to (90, 90).
For the test images, we will use the "down-sized" bounding box of size (90, 90) so that the model performance is comparable to the previous analysis.
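To make the center-preserving expansion concrete, here is a minimal numeric sketch (the box values are made up for illustration; the expand_bbox helper used in the actual analysis is defined further below).
## minimal sketch: expand an (x, y, w, h) box to 150 x 150 while keeping its center fixed
x, y, w, h = 100, 60, 120, 110   # a hypothetical bounding box
adjw, adjh = 150, 150
x_new = x - (adjw - w) // 2      # shift the top-left corner left by half of the added width
y_new = y - (adjh - h) // 2      # and up by half of the added height
print(x_new, y_new, adjw, adjh)  # (85, 40, 150, 150)
The original box center is at (x + w/2, y + h/2) = (160, 115), and the expanded box keeps it: (85 + 75, 40 + 75) = (160, 115).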
fig = plt.figure(figsize=(15,7))
for count, label in enumerate(["wF","hF"],1):
ax = fig.add_subplot(2,1,count)
ml = int(np.max(df_label[label]))
ax.hist(df_label[label])
ax.set_title("bounding box in data: {}, Max={}".format(label,ml))
ax.set_xlim([80,160])
plt.show()
Define the input image size for CNN.
target_shape = (90,90,3)
Prepare data in two ways:
- training data
  - trim using bounding box
  - upsize the image to (150,150)
  - this data will be passed to ImageDataGenerator and used as the original image for the data augmentation (after data augmentation, the image will be resized to (90,90) and the resized image is then passed to our deep learning model)
- evaluation data
  - trim using bounding box
  - downsize the image to (90,90)
  - this data will be used for evaluation
def get_bbox(img,row):
'''
row : df_label.iloc[i,:]
use the bounding box defined in dataframe
'''
faces = (int(row["xF"]),
int(row["yF"]),
int(row["wF"]),
int(row["hF"]))
return(faces)
def adjust_loc(rows,x_topleft=0,y_topleft=0):
'''
adjust the landmark coordinates with respect to bbox
output:
[(xRE,yRE),
(xLE,yLE),
(xN,yN),
(xRM,yRM),
(xLM,yLM)]
with respect to the bounding box
'''
out = []
for lm in landmarks:
out.append((int(rows["x" + lm]) - x_topleft,
int(rows["y" + lm]) - y_topleft))
return(out)
def adjust_xy(y,shape_orig,shape_new):
'''
y : [x1,y1,x2,y2,...]
'''
y[0::2] = y[0::2]*shape_new[1]/float(shape_orig[1])
y[1::2] = y[1::2]*shape_new[0]/float(shape_orig[0])
return y
def expand_bbox(faces,
adjw = 150,
adjh = 150):
(x, y, w, h) = faces
winc = int(adjw - w)
hinc = int(adjh - h)
x -= int(winc/2.0)
y -= int(hinc/2.0)
return(x, y, adjw, adjh)
## expand each bounding box to a fixed size (150, 150), keeping its center unchanged
bd_shape = (150, 150)
imgs_bd, lms_bd = [], []
imgs_bd_test, lms_bd_test = [], []
count = 0
for i, img in enumerate(imgs):
row = df_label.iloc[i,:]
faces = get_bbox(img,row)
(x, y, w, h) = faces
ys = np.array(adjust_loc(row,x,y)).flatten()
_img = img[y:(y+h),x:(x+w)]
imgr = cv2.resize(_img,target_shape[:2])
ys = adjust_xy(ys,_img.shape,imgr.shape)
lms_bd_test.append(ys)
imgs_bd_test.append(imgr)
if row["subject"] != 4:
(x, y, w, h) = expand_bbox(faces,*bd_shape)
lms_bd.append(adjust_loc(row,x,y))
imgs_bd.append(img[y:(y+h),x:(x+w)])
assert len(imgs_bd_test) == df_label.shape[0]
print(" {} training images".format(len(lms_bd)))
print(" {} evaluation images".format(len(lms_bd_test)))
Training data
Let's look at the training images before the image translation.
for i in [100,200,400]:
img, img_bd,lm_bd = imgs[i], imgs_bd[i], lms_bd[i]
row = df_label.iloc[i,:]
fig = plt.figure(figsize=(15,4))
fig.subplots_adjust ( hspace = 0, wspace = 0 )
## ------------------- ##
## Original image
## ------------------- ##
ax = fig.add_subplot(1,3,1)
ax.imshow(img/255.0)
ax.set_title("original image")
ax.axis("off")
for (x,y) in adjust_loc(row):
ax.scatter(x,y)
## ------------------- ##
## Original bbox
## ------------------- ##
ax = fig.add_subplot(1,3,2)
(x, y, w, h) = get_bbox(img,row)
ax.imshow(img[y:(y+h),x:(x+w)]/255.0)
ax.set_title("original bounding box")
for (x,y) in adjust_loc(row,x,y):
ax.scatter(x,y)
## ------------------- ##
## Expanded bbox
## ------------------- ##
ax = fig.add_subplot(1,3,3)
ax.imshow(img_bd/255.0)
for (x,y) in lm_bd:
ax.scatter(x,y)
ax.set_title("Resized bounding box with shape = {}".format(bd_shape))
plt.show()
Instantiate the ImageDataGenerator_landmarks object developed in Data augmentation for facial keypoint detection. The class definition is available on GitHub. Just download the ImageDataGenerator_landmarks.py file into the current directory and import the module. I will use a wide zooming range.
from keras.preprocessing.image import ImageDataGenerator, img_to_array, load_img
import ImageDataGenerator_landmarks as idg
reload(idg)
y_scale = np.min(target_shape[:2])
print("y_scale={}".format(y_scale))
## pre-processing function is applied AFTER image translation and rescaling to target_shape
def scaley(y):
my = float(y_scale)/2.0
y = (y - my) / float(y_scale)
return(y)
def preprocessing(x,y):
x = x / 255.0
y = scaley(y)
return(x,y)
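Because scaley maps pixel coordinates from [0, y_scale] to roughly [-0.5, 0.5], the model's predictions must be mapped back to pixels before they can be drawn on an image. A minimal sketch of the inverse is below (unscaley is my own helper name, not part of the original code); it is exactly the back-transformation applied to the predictions later on.
def unscaley(y):
    ## invert scaley: map values in roughly [-0.5, 0.5] back to pixel coordinates in [0, y_scale]
    my = float(y_scale) / 2.0
    return y * float(y_scale) + my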
datagen = ImageDataGenerator(rotation_range=0,
width_shift_range=0,
height_shift_range=0,
## Float. Shear Intensity (Shear angle in counter-clockwise direction in degrees)
shear_range=0,
## zoom_range: Float or [lower, upper].
## Range for random zoom. If a float,
## [lower, upper] = [1-zoom_range, 1+zoom_range]
zoom_range=[0.7, 1.001],
fill_mode='nearest',
#cval=-2,
horizontal_flip=True,
vertical_flip=False)
generator = idg.ImageDataGenerator_landmarks(datagen,
preprocessing_function=preprocessing,
ignore_horizontal_flip=False,
loc_xRE=0,
loc_xLE=2,
target_shape=target_shape,
flip_indicies = ((0,2), # xRE <-> xLE
(1,3), # yRE <-> yLE
(6,8), # xRM <-> xLM
(7,9)) # yRM <-> yLM
)
For training images, create y_mask from the landmarks' (x,y) coordinates
xy = []
for img_bd, lm_bd in zip(imgs_bd,lms_bd):
xy.append(generator.get_ymask(img_bd,lm_bd))
xy_train = np.array(xy)
print("xy_train.shape={}".format(xy_train.shape))
assert xy_train.shape[0] == len(imgs_bd)
Let's see some example translated images
def sp(ax,ys):
my = y_scale/2.0
for x , y in zip(ys[0::2],ys[1::2]):
ax.scatter(x*y_scale + my,
y*y_scale + my)
for xs, ys in generator.flow(xy_train,batch_size=600):
break
print("**translated images**")
print("x.shape={}".format(xs.shape))
print("x: min={:4.3f}, max={:4.3f}".format(np.min(xs),np.max(xs)))
print("y.shape={}".format(ys.shape))
print("y: min={:4.3f}, max={:4.3f}".format(np.min(ys),np.max(ys)))
Nrow, Ncol, count = 5, 7, 1
fig = plt.figure(figsize=(15,10))
for irow in range(0,500,10):
ax = fig.add_subplot(Nrow,Ncol,count)
ax.imshow(xs[irow])
sp(ax,ys[irow])
count += 1
if count > Nrow*Ncol:
break
plt.show()
For the test images, we do not need to make y_mask.
xx = np.array(imgs_bd_test)
yy = np.array(lms_bd_test)
print("xx.shape={}".format(xx.shape))
print("yy.shape={}".format(yy.shape))
xs, ys = xx/255.0, scaley(yy)
print("**translated images**")
print("x.shape={}".format(xs.shape))
print("x: min={:4.3f}, max={:4.3f}".format(np.min(xs),np.max(xs)))
print("y.shape={}".format(ys.shape))
print("y: min={:4.3f}, max={:4.3f}".format(np.min(ys),np.max(ys)))
Nrow, Ncol, count = 3, 7, 1
fig = plt.figure(figsize=(15,6))
for irow in range(0,600,20):
ax = fig.add_subplot(Nrow,Ncol,count)
ax.imshow(xs[irow])
sp(ax,ys[irow])
count += 1
if count > Nrow*Ncol:
break
plt.show()
Model
Here, I will use a vanilla CNN loosely based on the state-of-the-art model used in Facial Landmark Detection with Tweaked Convolutional Neural Networks. The difference from the paper is that our input dimension is (90,90,3) rather than (40,40,3).
from keras.layers import Conv2D, MaxPooling2D,Flatten, Dropout, Activation, Dense
from keras.models import Sequential
batch_size = 64
num_channels = 3
def StandardCNN(input_shape = (150, 150, 3)):
'''
A vanilla CNN for facial landmark detection.
The final dense layer outputs the 10 scaled (x,y) landmark coordinates.
'''
model = Sequential()
# channels-last (TensorFlow) image ordering; the input size is fixed by input_shape
model.add(Conv2D(16, kernel_size=(5,5),
name="CL1",
input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2),strides=(2,2)))
model.add(Conv2D(48, kernel_size=(3, 3),name="CL2"))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2),strides=(2,2)))
model.add(Conv2D(64, kernel_size=(3, 3),name="CL3"))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2),strides=(2,2)))
model.add(Conv2D(64,kernel_size=(2, 2),name="CL4"))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(100,name="FC5"))
model.add(Activation('relu'))
model.add(Dense(10,name="FC6"))
model.compile(loss='mse', optimizer='adam')
return(model)
model = StandardCNN(input_shape = target_shape)
model.summary()
Training starts here
batch_size = xy_train.shape[0]
Nepochs = 200
iepoch = 1
hists = []
for xs, ys in generator.flow(xy_train,batch_size=batch_size):
hist = model.fit(xs,ys,epochs=1,verbose=False)
h = hist.history["loss"][0]
hists.append(h)
if iepoch % 10 == 0:
print("Epoch {:03.0f} - {:8.7f}".format(iepoch,h))
if iepoch > Nepochs:
break
iepoch += 1
Plot of loss over epochs
plt.plot(hists)
plt.xlabel("epoch")
plt.ylabel("loss")
plt.show()
Model performance on testing data
pick = (df_label["subject"]==4).values
## ============= ##
## training data
## ============= ##
x_tr, y_tr = xx[~pick], yy[~pick]
y_pred0 = model.predict(x_tr/255.0)
print("Training MSE={:7.6f}".format(np.mean((y_pred0 - scaley(y_tr))**2)))
## ============= ##
## testing data
## ============= ##
x_test, y_test = xx[pick], yy[pick]
y_pred0 = model.predict(x_test/255.0)
print("Testing MSE={:7.6f}".format(np.mean((y_pred0 - scaley(y_test))**2)))
y_pred = y_pred0*y_scale + (y_scale/2.0)
assert np.all(y_pred <= np.max(target_shape))
assert np.all(y_pred >=0)
In the previous post, Driver's facial keypoint detection, the model performance was assessed on the 4th driver using the normalized mean Euclidean distance between the true facial keypoints and the estimated ones within the bounding box.
df_label["IPD"] = np.sqrt((df_label["xRE"] - df_label["xLE"])**2 + (df_label["yRE"] - df_label["yLE"])**2)
for ii, facialkp in enumerate(landmarks):
i = ii*2
## Model prediction
df_label["Model - data augmentation_" + facialkp] = np.NaN
xterm = (y_pred[:,i] - y_test[:,i])**2
yterm = (y_pred[:,i+1] - y_test[:,i+1])**2
## save it for the test subject (driver 4); assign via a single .loc call to avoid chained indexing
df_label.loc[df_label["subject"]==4, "Model - data augmentation_" + facialkp] = np.sqrt(xterm + yterm)
Model performance on testing data summary
As a reminder, without data augmentation, the model performance in the previous analysis was:
Landmark | Median normalized ED | % (normalized ED < 10 percent) |
---|---|---|
LE | 4.184403 | 95.555556 |
RE | 4.128279 | 97.777778 |
N | 5.725722 | 87.777778 |
RM | 3.794306 | 95.555556 |
LM | 3.349447 | 100.000000 |
Comparing this with the summary table computed below, the model performance clearly improved by using data augmentation!
def proplessthan(vec):
values = [np.median(vec["value"]),
np.mean(vec["value"] < 10)*100]
return(pd.Series(values,index=["Median normalized ED",
"% (normalized ED < 10%)"] ))
collabels = []
for nm in landmarks:
collabels.extend(["Model - data augmentation_"+ nm])
df_eval = df_label[collabels]
df_eval = df_eval.dropna() ## NA is recorded for the training image
## un-pivot so that there is a single column containing box plot values
## this un-pivot is necessary for seaborn boxplot
df_boxplot = pd.melt(df_eval, value_vars=collabels)
v = np.array([ term.split("_") for term in df_boxplot["variable"]])
df_boxplot["procedure"] = v[:,0]
df_boxplot["keypoints"] = v[:,1]
df_boxplot_summary = df_boxplot.groupby(["keypoints","procedure"]).apply(proplessthan).reset_index()
df_boxplot_summary
Visualization of the model performance
dir_image = 'driver_data_augmentation/'
try:
os.mkdir(dir_image)
except:
pass
def create_gif(gifname,dir_image,duration=1):
import imageio
filenames = np.sort(os.listdir(dir_image))
filenames = [ fnm for fnm in filenames if ".png" in fnm]
with imageio.get_writer(dir_image + '/' + gifname + '.gif',
mode='I',duration=duration) as writer:
for filename in filenames:
image = imageio.imread(dir_image + filename)
writer.append_data(image)
for irow in range(x_test.shape[0]):
img = x_test[irow]
ys = y_pred[irow]
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1,1,1)
ax.axis("off")
ax.imshow(img/255.0)
for x, y in zip(ys[0::2],ys[1::2]):
ax.scatter(x,y,c="green",s=500,marker="X")
plt.savefig(dir_image + "/fig{:04.0f}.png".format(irow),
bbox_inches='tight',pad_inches=0)
plt.close(fig) ## close each figure so that dozens of figures are not kept open in memory
create_gif("driver_data_augmentation",dir_image,duration=0.5)
plt.close('all')