Yumi's Blog

Data augmentation for facial keypoint detection

[gif: examples of augmented images with adjusted landmark coordinates]

The Python class ImageDataGenerator_landmarks is available at my GitHub account. This blog post explains this class.

Why data augmentation?

Deep learning models are data hungry, and their performance can be surprisingly bad when the testing images differ substantially from the training images. Data augmentation is an essential technique for making the most of a limited number of training images. In my previous blog post, I observed poor performance of a deep learning model when the testing images contained translations of the training images. However, the model performance improved when the training data also contained translated images. See Assess the robustness of CapsNet. This experiment shows that it is essential to increase the data size using data augmentation in order to develop a robust deep learning model.

Keras's ImageDataGenerator and its limits

Data augmentation can increase the number of training images substantially, which may raise a storage problem. Keras has a powerful API called ImageDataGenerator that resolves this problem: the generator creates augmented images from the training images on the fly.
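As a quick reminder, here is a minimal sketch of the standard classification workflow with ImageDataGenerator. The arrays x_train and y_train are hypothetical placeholders for images and class labels:

from keras.preprocessing.image import ImageDataGenerator
import numpy as np

## hypothetical training data: 100 RGB images of size 96 x 96 with 2 classes
x_train = np.random.rand(100, 96, 96, 3)
y_train = np.eye(2)[np.random.randint(0, 2, 100)]

datagen = ImageDataGenerator(width_shift_range=0.1,
                             height_shift_range=0.1)
for x_batch, y_batch in datagen.flow(x_train, y_train, batch_size=32):
    ## x_batch is randomly shifted on the fly; y_batch is passed through unchanged
    break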

This generator has been used in many of my previous blog posts.

Although it is a powerful and popular API, it is limited to image classification problems, where the target does not depend on how the image is transformed. For example, an image of a dog is still an image of a dog even if the image is shifted by 3 pixels, so the target label "dog" does not need to be adjusted.

In landmark detection or facial keypoint detection, the target values also need to change when an image is transformed. If the image of a face is shifted by 3 pixels, the (x,y) coordinates of the eye locations also need to be shifted by 3 pixels.

I was looking for an existing API that can transform both images and coordinates, but I couldn't find one. In my previous blog post Achieving Top 23% in Kaggle's Facial Keypoints Detection with Keras + Tensorflow, I implemented a Python class that can flip the image horizontally and shift the image along both the horizontal and vertical axes while adjusting the landmark coordinates. But there are so many other transformations that I want to do; e.g., shearing, zooming, or all of them at once! And I do not want to code the rotation matrices myself!
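For reference, the coordinate bookkeeping for those two simple transformations looks roughly like this (a minimal sketch, not the actual class from that post):

def shift_landmarks(xys, dx, dy):
    ## shifting the image by (dx, dy) pixels shifts every landmark by the same amount
    return [(x + dx, y + dy) for (x, y) in xys]

def hflip_landmarks(xys, img_width):
    ## a horizontal flip mirrors the x coordinate about the vertical center line
    return [(img_width - 1 - x, y) for (x, y) in xys]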

Keras's ImageDataGenerator for the facial keypoint detection problem.

I came up with a rather simple approach that takes full advantage of Keras's ImageDataGenerator. Although this is probably not the most optimized approach, it is very simple, and it allows us to use all of Keras's ImageDataGenerator functionality for the landmark detection problem.

Simple idea

The idea is simple: I create a mask of the same size as the image. The pixels of the mask corresponding to the landmarks are indexed. The original image is augmented with this mask as the 4th channel (assuming that the image has 3 channels). Then we pass this 4-channel image to Keras's ImageDataGenerator and find where the indexed landmarks end up after the image is transformed.
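To make the idea concrete, here is a toy sketch on a 5 x 5 image with a single landmark; np.fliplr stands in for whatever transformation ImageDataGenerator would apply:

import numpy as np

img  = np.zeros((5, 5, 3))        ## toy 3-channel image
mask = np.full((5, 5, 1), -1.0)   ## -1 everywhere except at the landmarks
x, y = 1, 2                       ## toy landmark at column 1, row 2
mask[y, x] = 0                    ## mark the pixel with the landmark index 0
stacked = np.dstack([img, mask])  ## 4-channel image

flipped = np.fliplr(stacked)      ## stand-in for an ImageDataGenerator transform
iy, ix = np.where(flipped[:, :, 3] == 0)
print("landmark 0 moved to ({}, {})".format(int(ix[0]), int(iy[0])))  ## -> (3, 2)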

The code below shows how I implemented this approach.

In [1]:
## Import usual libraries
import matplotlib.pyplot as plt
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
import keras, sys, time, os, warnings 
import numpy as np
import pandas as pd
import cv2
warnings.filterwarnings("ignore")



os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID" 
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.025
config.gpu_options.visible_device_list = "4" 
set_session(tf.Session(config=config))   



print("python {}".format(sys.version))
print("keras version {}".format(keras.__version__)); del keras
print("tensorflow version {}".format(tf.__version__))
Using TensorFlow backend.
python 2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
keras version 2.1.3
tensorflow version 1.5.0

Extract a single image with landmarks and a bounding box

This image is extracted from the CVC11 DrivFace data, and I have previously analyzed images from this data; see Driver's facial keypoint detection. I will extract only a single image from this data to demonstrate my data augmentation routine.

In [2]:
from keras.preprocessing.image import img_to_array, load_img
dir_data = "DrivFace/"

## For this data, we have annotations for the right eye, left eye, nose, right mouth tip and left mouth tip
landmarks = ["RE","LE","N","RM","LM"]


img = img_to_array(load_img(dir_data + "/DrivImages/20130529_01_Driv_001_f .jpg"))
row_name = ["xF",  "yF", "wF", "hF", 
            "xRE", "yRE","xLE","yLE","xN","yN","xRM","yRM","xLM","yLM"]
row = [272,  # xF: (xF, yF) : top left corner of the bounding box  292
       189,  # yF:                                                 209
       140,  # wF: width of the bounding box                       100
       152,  # hF: height of the bounding box                      112
       323,  # xRE: (xRE,yRE)  the (x,y) coordinate of right eye 
       232,  # yRE:
       367,  # xLE: (xLE, yLE) the (x,y) coordinate of left eye 
       231,  # yLE:
       353,  # xN : (xN,  yN)  the (x,y) coordinate of the nose 
       254,  # yN :
       332,  # xRM: (xRM, yRM) the (x,y) coordinate of the right mouth tip
       278,  # yRM:
       361,  # xLM: (xLM, yLM) the (x,y) coordinate of the left mouth tip
       278]  # yLM: 


row = pd.DataFrame(row).T
row.columns = row_name
row
Out[2]:
   xF   yF   wF   hF  xRE  yRE  xLE  yLE   xN   yN  xRM  yRM  xLM  yLM
0 272  189  140  152  323  232  367  231  353  254  332  278  361  278

Let's look at the original image


In [3]:
fig = plt.figure(figsize=(5,5))
## original image
ax = fig.add_subplot(1,1,1)
ax.imshow(img/255.0)
for landmark in landmarks:
    ax.scatter(row["x"+landmark],row["y"+landmark])
plt.show()

Create a function to extract a bounding box

In practice, a deep learning model often does not take the original wide-view image as the input of a landmark detection model. Instead, the image is usually trimmed to a smaller region that focuses on the face by using a face detection algorithm. This data already provides a bounding box, so we will use it to reduce the image size.

I will demonstrate my data augmentation routine using the trimmed image within bounding box.

In [4]:
def get_bbox(row):
    '''
    extract bounding box from the dataframe
    '''  
    faces  = (int(row["xF"]),
              int(row["yF"]),
              int(row["wF"]),
              int(row["hF"]))  
    return(faces)

(x, y, w, h) = get_bbox(row)

As the recorded (x,y) coordinates of the landmarks are with respect to the original image, I adjust the landmark coordinates to the bounding box.

In [5]:
def adjust_loc(rows,x_topleft=0,y_topleft=0):
    '''
    adjust the landmark coordinates so that they are relative to the top-left corner of the bounding box
    '''
    
    out = []
    for lm in landmarks:
        out.append((int(rows["x" + lm]) - x_topleft,
                    int(rows["y" + lm]) - y_topleft))
    return(out)

landmark_bd = adjust_loc(row,x_topleft=x,y_topleft=y)
landmark_bd
Out[5]:
[(51, 43), (95, 42), (81, 65), (60, 89), (89, 89)]
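As a sanity check, the right eye at (323, 232) in the original image minus the bounding box's top-left corner (272, 189) indeed gives (51, 43).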

The image in the bounding box.

I will use this single image to demonstrate my data augmentation routine.

In [6]:
## image in the bounding box
## note: w (not h) is used for the row slice, giving a square w x w crop
img_bd = img[y:(y+w),x:(x+w)]

fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(1,1,1)
ax.imshow(img_bd/255.0)
for (xl,yl) in landmark_bd:
    ax.scatter(xl,yl)

plt.show()

Data augmentation

Step 1: create a mask that records the locations of the landmarks

An image is augmented with a mask as the 4th channel.

In [7]:
def get_ymask(img, xys):
    '''
    img : (height, width, channels) array of an image
    xys : A list containing tuples of (x,y) landmark coordinates. For example:
    
    xys = [(x1,y1),
           (x2,y2),
           (x3,y3),
           (x4,y4),
           ...] 
    '''
    yimg = np.zeros((img.shape[0],img.shape[1],1))
    yimg[:] = -1
    for iland, (ix,iy) in enumerate(xys):
        yimg[iy,ix] = iland
    return(np.dstack([img,yimg]))

yimg = get_ymask(img_bd,landmark_bd)
print("The dimension of the original image {} -> masked image {}".format(img_bd.shape,yimg.shape))

plt.figure(figsize=(6,6))
plt.imshow(yimg[:,:,3])
plt.title("The mask receives non-negative values at landmarks")
plt.show()
The dimension of the original image (140, 140, 3) -> masked image (140, 140, 4)
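As a quick sanity check (a small sketch using the yimg created above), the indexed pixels can be recovered with np.where; note that np.where returns (row, column) indices, i.e. (y, x):

iy, ix = np.where(yimg[:,:,3] == 0)  ## pixels indexed with 0, i.e. the right eye
print("landmark 0 is at ({}, {})".format(int(ix[0]), int(iy[0])))  ## -> (51, 43)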

Step 2: define Keras's ImageDataGenerator with the parameters of your choice.

In [8]:
from keras.preprocessing.image import ImageDataGenerator,  img_to_array, load_img
datagen = ImageDataGenerator(rotation_range=20,
                             width_shift_range=10.0, 
                             height_shift_range=10.0,  
                             ## Float. Shear Intensity (Shear angle in counter-clockwise direction in degrees)
                             shear_range=5.0,
                             ## zoom_range: Float or [lower, upper]. 
                             ## Range for random zoom. If a float, 
                             ## [lower, upper] = [1-zoom_range, 1+zoom_range]
                             zoom_range=[0.6, 1.2], 
                             fill_mode='nearest', 
                             #cval=-2, 
                             horizontal_flip=True, 
                             vertical_flip=False)

Step 3: Define a class ImageDataGenerator_landmarks

  • The class assumes that get_ymask is used before the flow method.
  • After the image is transformed, you can resize it via the target_shape parameter.
    • Transforming at the original resolution and then downsizing gives a wider variety of samples than transforming the downsized images.
  • The class definition of ImageDataGenerator_landmarks is available from my GitHub account.
In [9]:
class ImageDataGenerator_landmarks(object):
    def __init__(self,
                 datagen,
                 preprocessing_function= lambda x,y: (x,y),
                 loc_xRE=None, 
                 loc_xLE=None,
                 flip_indicies=None,
                 target_shape=None,
                 ignore_horizontal_flip=True):
        '''
        datagen : Keras's ImageDataGenerator
        preprocessing_function : The function that will be applied to each batch.
                                 The function runs after the images are augmented and resized.
                                 It should take two arguments (a batch of images and a batch of
                                 targets) and return the pair, possibly modified.
        ignore_horizontal_flip : if False, whether a horizontal flip happened is checked
                                 using loc_xRE and loc_xLE, and
                                 if the flip happened,
                                 each pair of the flip_indicies is swapped.
                                 if True, then loc_xRE,
                                 loc_xLE and flip_indicies do not need to be specified.

        target_shape            : If target_shape is not None,
                                  a transformed image is resized to target_shape.
                                  Why? Transforming at the original resolution and then downsizing
                                       gives a wider range of modified images than transforming the downsized images.
    
        For example,
        
        Suppose the landmarks are 
        
        - right eye (RE) 
        - left eye (LE)
        - nose (N)
        - right mouth edge (RM)
        - left mouth edge (LM)
        
        then there are 5 x 2 coordinates to predict:
        
        xRE, yRE, xLE, yLE, xN, yN, xRM, yRM, xLM, yLM
        
        When the horizontal flip happens, RE becomes LE and RM becomes LM.
        So we need to change the target variables accordingly.
        
        If a horizontal flip happens, xRE > xLE.
        xRE is stored at position 0 and xLE at position 2, so loc_xRE = 0, loc_xLE = 2.
        
        In this case, our flip_indicies are:
        
        self.flip_indicies =  ((0,2), # xRE <-> xLE
                               (1,3), # yRE <-> yLE
                               (6,8), # xRM <-> xLM
                               (7,9)) # yRM <-> yLM

        '''
        self.datagen = datagen
        self.ignore_horizontal_flip = ignore_horizontal_flip
        self.target_shape = target_shape
        # positions of xRE and xLE in the target vector, used to detect a horizontal flip
        self.loc_xRE, self.loc_xLE = loc_xRE, loc_xLE
        
        self.flip_indicies = flip_indicies
        ## the channel that records the mask
        self.loc_mask = 3

        self.preprocessing_function = preprocessing_function
        
    def flow(self,imgs,batch_size=20):
        '''
        imgs: the numpy image array : (batch, height, width, image channels + 1)
              the (self.loc_mask)th channel must contain the mask
        '''
        
        generator = self.datagen.flow(imgs,batch_size=batch_size)
        while 1:
            ## 
            N = 0
            x_bs, y_bs = [], [] 
            while N < batch_size:
                yimgs = generator.next() 
                ## yimgs.shape = (bsize, height, width, channels + 1)
                ## where bsize = min(batch_size, imgs.shape[0])
                x_batch ,y_batch = self._keep_only_valid_image(yimgs)
                if len(x_batch) == 0:
                    continue
                x_batch ,y_batch = self.preprocessing_function(x_batch,y_batch)
                x_bs.append(x_batch)
                y_bs.append(y_batch)
                N += x_batch.shape[0]
            x_batch , y_batch = np.vstack(x_bs), np.vstack(y_bs)
            yield ([x_batch, y_batch])


    def _keep_only_valid_image(self,yimg):
        '''
        Transform the mask to (x,y)-coordinates.
        Depending on the transformation, a landmark may "disappear".
        For example, if the image is excessively zoomed in,
        the mask may lose the index of a landmark.
        Such a transformed image is discarded.
        
        x_train and y_train could be empty arrays
        if the landmarks of all the transformed images are lost, i.e.
        np.array([])
        '''
        x_train, y_train = [], []
        for irow in range(yimg.shape[0]):
            x     = yimg[irow,:,:,:self.loc_mask]
            ymask = yimg[irow,:,:,self.loc_mask]
            y     = self._findindex_from_mask(ymask)
            # if some landmarks disappear, do not use the transformed image
            if y is None:
                continue
            x, y  = self._resize_image(x, y)    
            x_train.append(x)
            y_train.append(y)
        x_train = np.array(x_train)
        y_train = np.array(y_train)
        return(x_train,y_train)
    
    def _resize_image(self,x,y):
        '''
        this function is useful for down scaling the resolution
        '''
        if self.target_shape is not None:
            shape_orig = x.shape
            ## cv2.resize expects dsize as (width, height); target_shape is square here
            x = cv2.resize(x,self.target_shape[:2])
            
            y = self.adjust_xy(y,
                               shape_orig,
                               self.target_shape)
        return x,y
    def adjust_xy(self,y,shape_orig,shape_new):
        '''
        y : [x1,y1,x2,y2,...]
        '''
        y[0::2] = y[0::2]*shape_new[1]/float(shape_orig[1])
        y[1::2] = y[1::2]*shape_new[0]/float(shape_orig[0])
        return y

    def _findindex_from_mask(self,ymask):
        '''
        ymask : a mask of shape (height, width)
        '''
        
        ys = []
        for i in range(self.Nlandmarks):
            ## np.where returns (row, column) indices, i.e. (y, x)
            iy, ix = np.where(ymask==i)
            if len(ix) == 0:
                return(None)
            ys.extend([np.mean(ix),
                       np.mean(iy)])
        ys = np.array(ys)
        ys = self._adjustLR_horizontal_flip(ys)
        return(ys)

    def _adjustLR_horizontal_flip(self,ys):
        '''
        if a horizontal flip happens,
        the right eye becomes the left eye and
        the right mouth edge becomes the left mouth edge.
        So we need to flip the target coordinates accordingly.
        '''
        if self.ignore_horizontal_flip:
            return(ys)
        
        if ys[self.loc_xRE] > ys[self.loc_xLE]: ## True if a flip happened
            # the x-coord of RE is greater than the x-coord of LE:
            # a horizontal flip happened!
            for a, b in self.flip_indicies:
                ys[a],ys[b] = (ys[b],ys[a])
        return(ys)

    def get_ymask(self,img, xys):
        '''
        img : (height, width, channels) array of image
        xys : A list containing tuples of (x,y) landmark coordinates. For example:

        xys = [(x0,y0),
               (x1,y1),
               (x2,y2),
               (x3,y3),
               (x4,y4),
               ...] 
        output:
        
        mask : A numpy array of size (height, width, channels + 1); the input
               image with the mask appended as the last channel.
               All locations without a landmark are recorded as -1.
               The pixel at (x0, y0) is recorded as 0.
               The pixel at (x1, y1) is recorded as 1.
               ...
        
        '''
        yimg = np.zeros((img.shape[0],img.shape[1],1))
        yimg[:] = -1
        for iland, (ix,iy) in enumerate(xys):
            yimg[iy,ix] = iland
        
        self.Nlandmarks = len(xys)
        self.loc_mask   = img.shape[2] 
        return(np.dstack([img,yimg]))

Instantiate the class

To instantiate the class you need to provide

  • datagen : Keras's ImageDataGenerator
  • ignore_horizontal_flip : if False, whether a horizontal flip happened is checked using loc_xRE and loc_xLE, and if a flip happened, each pair of coordinates in flip_indicies is swapped. If True, then loc_xRE, loc_xLE and flip_indicies do not need to be specified.
  • loc_xRE : the position where the x coordinate of the right eye is stored in the target variable.
  • loc_xLE : the position where the x coordinate of the left eye is stored in the target variable.
  • flip_indicies : the pairs of positions whose x and y coordinates are swapped when a horizontal flip happens.

What are they?

Consider our scenario:

Our landmarks are:

    - right eye (RE) 
    - left eye (LE)
    - nose (N)
    - right mouth edge (RM)
    - left mouth edge (LM)

then there are 5 x 2 coordinates to predict:

    xRE, yRE, xLE, yLE, xN, yN, xRM, yRM, xLM, yLM

When the horizontal flip happens, RE becomes LE and RM becomes LM. So we need to change the target variables accordingly.

If a horizontal flip happens, we see xRE > xLE (the right eye is now to the right of the left eye!).

In this case, we must swap the roles of RE and LE as well as RM and LM. So our flip_indicies are:

    self.flip_indicies =  ((0,2), # xRE <-> xLE
                           (1,3), # yRE <-> yLE
                           (6,8), # xRM <-> xLM
                           (7,9)) # yRM <-> yLM
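As a toy check with made-up numbers, swapping the pairs in flip_indicies restores the RE/LE and RM/LM roles after a flip:

import numpy as np

flip_indicies = ((0,2), (1,3), (6,8), (7,9))
## hypothetical target vector after a horizontal flip: note xRE > xLE
ys = np.array([60., 43., 16., 42., 30., 65., 51., 89., 22., 89.])
for a, b in flip_indicies:
    ys[a], ys[b] = ys[b], ys[a]
print(ys)  ## now ys[0] (xRE) < ys[2] (xLE) again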
In [10]:
generator = ImageDataGenerator_landmarks(datagen,
                                         ignore_horizontal_flip=False,
                                         target_shape=(90,90,3),
                                         loc_xRE=0, 
                                         loc_xLE=2,
                                         flip_indicies =  ((0,2), # xRE <-> xLE
                                                          (1,3), # yRE <-> yLE
                                                          (6,8), # xRM <-> xLM
                                                          (7,9)) # yRM <-> yLM
                                        )
xy = np.array([generator.get_ymask(img_bd,landmark_bd)])

Let's visualize!

Notice that RE is always to the left of LE.

In [11]:
plt.close('all')
def singleplot(ax,x,y):
    ax.imshow(x/255.0)
    colors = ['b','g','r','c','m']
    for i, marker,c in zip(range(0,len(y),2),
                           landmarks,
                           colors):        
        ax.annotate(marker,
                    (y[i],y[i+1]),
                    color=c)
        
def pannelplot(figID=0,dir_image=None, nrow_plot = 6,ncol_plot = 6, fignm="fig",save=True):
    
    fig = plt.figure(figsize=(15,15))
    #fig.subplots_adjust(hspace=0,wspace=0)
    count = 1
    ## generator.flow yields batches forever; keep plotting until the panel is full
    for x_train,y_train in generator.flow(xy,batch_size=1):
        if len(x_train) == 1:
            ax = fig.add_subplot(nrow_plot,ncol_plot,count)
            #ax.axis("off")
            singleplot(ax,x_train[0],y_train[0])
            if count == nrow_plot * ncol_plot:
                break
            count += 1
    if save:
        plt.savefig(dir_image + "/fig{:04.0f}.png".format(figID),
                    bbox_inches='tight',pad_inches=0)
        plt.close(fig)  ## close the figure so repeated calls do not accumulate open figures
    else:
        plt.show()

pannelplot(save=False)

Make a gif

In [12]:
def create_gif(gifname,dir_image,duration=1):
    import imageio
    filenames = np.sort(os.listdir(dir_image))
    filenames = [ fnm for fnm in filenames if ".png" in fnm]

    with imageio.get_writer(dir_image + '/' + gifname + '.gif', 
                            mode='I',duration=duration) as writer:
        for filename in filenames:
            image = imageio.imread(dir_image + filename)
            writer.append_data(image)

dir_image = 'data_augmentation/'
plt.close('all')

for count in range(100):
    pannelplot(count,dir_image, nrow_plot = 6,ncol_plot = 6, fignm="fig",save=True)            
create_gif("example",dir_image,duration=0.5)

plt.close('all')
