The Python class ImageDataGenerator_landmarks is available at my GitHub account. This blog post explains this class.
Why data augmentation?¶
Deep learning models are data greedy, and a model's performance can be surprisingly bad when the testing images differ substantially from the training images. Data augmentation is an essential technique for making the most of a limited amount of training images. In my previous blog post, I observed poor performance of a deep learning model when the testing images contained translated versions of the training images. However, the model's performance improved when the training data also contained translated images. See Assess the robustness of CapsNet. This experiment shows that increasing the data size with data augmentation is essential for developing a robust deep learning model.
Keras's ImageDataGenerator and its limits¶
Data augmentation can increase the number of training images substantially, which could raise a storage problem. Keras has a powerful API called ImageDataGenerator that resolves this problem: the generator creates augmented images from the training images on the fly.
This generator has been used in many of my previous blog posts.
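For image classification, the typical usage pattern looks like this (a generic sketch of my own; x_demo and y_demo are hypothetical placeholder data):
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
## Hypothetical data: 8 images of shape 32x32x3 with binary class labels
x_demo = np.random.rand(8, 32, 32, 3)
y_demo = np.array([0, 1, 0, 1, 0, 1, 0, 1])
datagen_demo = ImageDataGenerator(rotation_range=20, width_shift_range=0.1)
for x_batch, y_batch in datagen_demo.flow(x_demo, y_demo, batch_size=4):
    ## the images are randomly transformed on the fly,
    ## but y_batch is passed through untouched
    print(x_batch.shape, y_batch.shape)
    break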
Although it is a powerful and popular API, it is limited to image classification problems where the target does not depend on how the image is transformed. For example, an image of a dog is still an image of a dog even if the image is shifted by 3 pixels, so the target label "dog" does not need to change.
In landmark detection or facial keypoint detection, however, the target values also need to change when an image is transformed. If the image of a face is shifted by 3 pixels, the (x,y) coordinates of the eye locations also need to be shifted.
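To make this concrete, here is a minimal numpy sketch (my own illustration, not part of the routine below): when an image is shifted by (dx, dy), the keypoint coordinates must be shifted by the same amount.
import numpy as np
## Hypothetical example: a dummy image and a keypoint shifted together
img_demo = np.zeros((96, 96, 3))
eye_xy = (30, 40)                 # hypothetical (x, y) location of an eye
dx, dy = 3, 0                     # shift 3 pixels to the right
img_shifted = np.roll(img_demo, shift=(dy, dx), axis=(0, 1))
eye_xy_shifted = (eye_xy[0] + dx, eye_xy[1] + dy)
print(eye_xy_shifted)             # (33, 40): the target must move with the image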
I was looking for an existing API that can transform both images and coordinates, but I couldn't find one. In my previous blog post Achieving Top 23% in Kaggle's Facial Keypoints Detection with Keras + Tensorflow, I implemented a Python class that can flip the image horizontally and shift it along both the horizontal and vertical axes while adjusting the landmark coordinates. But there are so many other transformations that I want to do, e.g., shearing, zooming, or all of them at once! And I do not want to code rotation matrices by myself!
Keras's ImageDataGenerator for the facial keypoint detection problem¶
I came up with a rather simple approach that takes full advantage of Keras's ImageDataGenerator. Although it is probably not the most optimized approach, it is very simple, and it lets us use all of ImageDataGenerator's functionality for landmark detection problems.
Simple idea¶
The idea is simple: I will create a mask of the same size as the image. Each pixel of the mask that corresponds to a landmark is assigned that landmark's index. The original image is augmented with this mask as the 4th channel (assuming that the image has 3 channels). Then we pass this 4-channel image to Keras's ImageDataGenerator and find where the indexed landmarks end up after the image is transformed.
The code below shows how I implemented this approach.
## Import usual libraries
import matplotlib.pyplot as plt
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
import keras, sys, time, os, warnings
import numpy as np
import pandas as pd
import cv2
warnings.filterwarnings("ignore")
## Restrict TensorFlow to one GPU and a small fraction of its memory
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.025
config.gpu_options.visible_device_list = "4"
set_session(tf.Session(config=config))
print("python {}".format(sys.version))
print("keras version {}".format(keras.__version__)); del keras
print("tensorflow version {}".format(tf.__version__))
Extract a single image with a landmark and bounding box¶
This image is extracted from CVC11, and I have previously analyzed images from this dataset; see Driver's facial keypoint detection. I will extract only a single image from this data to demonstrate my data augmentation routine.
from keras.preprocessing.image import img_to_array, load_img
dir_data = "DrivFace/"
## For this data, we have annotations for the right eye, left eye, nose, right mouth corner and left mouth corner
landmarks = ["RE","LE","N","RM","LM"]
img = img_to_array(load_img(dir_data + "/DrivImages/20130529_01_Driv_001_f .jpg"))
row_name = ["xF", "yF", "wF", "hF",
"xRE", "yRE","xLE","yLE","xN","yN","xRM","yRM","xLM","yLM"]
row = [272, # xF: (xF, yF) : top left corner of the bounding box
       189, # yF:
       140, # wF: width of the bounding box
       152, # hF: height of the bounding box
323, # xRE: (xRE,yRE) the (x,y) coordinate of right eye
232, # yRE:
367, # xLE: (xLE, yLE) the (x,y) coordinate of left eye
231, # yLE:
353, # xN : (xN, yN) the (x,y) coordinate of the nose
254, # yN :
332, # xRM: (xRM, yRM) the (x,y) coordinate of the right mouth tip
278, # yRM:
361, # xLM: (xLM, yLM) the (x,y) coordinate of the left mouth tip
278] # yLM:
row = pd.DataFrame(row).T
row.columns = row_name
row
Let's look at the original image¶
Driver image!
fig = plt.figure(figsize=(5,5))
## original image
ax = fig.add_subplot(1,1,1)
ax.imshow(img/255.0)
for landmark in landmarks:
ax.scatter(row["x"+landmark],row["y"+landmark])
plt.show()
Create a function to extract a bounding box¶
In practice, a landmark detection model often does not take the original wide-view image as input. Instead, the image is usually cropped to the region around the face using a face detection algorithm. This dataset already provides a bounding box, so we will use it to reduce the image size.
I will demonstrate my data augmentation routine using the cropped image within the bounding box.
def get_bbox(row):
'''
extract bounding box from the dataframe
'''
faces = (int(row["xF"]),
int(row["yF"]),
int(row["wF"]),
int(row["hF"]))
return(faces)
(x, y, w, h) = get_bbox(row)
As the recorded (x,y) coordinates of the landmarks are with respect to the original image, I adjust the landmark coordinates to the bounding box.
def adjust_loc(rows,x_topleft=0,y_topleft=0):
'''
    adjust the landmark coordinates so that they are relative to the
    top left corner of the bounding box
'''
out = []
for lm in landmarks:
out.append((int(rows["x" + lm]) - x_topleft,
int(rows["y" + lm]) - y_topleft))
return(out)
landmark_bd = adjust_loc(row,x_topleft=x,y_topleft=y)
landmark_bd
The image in the bounding box¶
I will use this single image to demonstrate my data augmentation routine.
## image in bounding box
img_bd = img[y:(y+h), x:(x+w)]
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(1,1,1)
ax.imshow(img_bd/255.0)
for (x,y) in landmark_bd:
ax.scatter(x,y)
plt.show()
Step 1: create a mask that indexes the landmark locations¶
def get_ymask(img, xys):
    '''
    img : (height, width, channels) array of image
    xys : A list containing tuples of (x,y) landmark coordinates. For example:
          xys = [(x1,y1),
                 (x2,y2),
                 (x3,y3),
                 (x4,y4),
                 ...]
    '''
yimg = np.zeros((img.shape[0],img.shape[1],1))
yimg[:] = -1
for iland, (ix,iy) in enumerate(xys):
yimg[iy,ix] = iland
return(np.dstack([img,yimg]))
yimg = get_ymask(img_bd,landmark_bd)
print("The dimension of the original image {} -> masked image {}".format(img_bd.shape,yimg.shape))
plt.figure(figsize=(6,6))
plt.imshow(yimg[:,:,3])
plt.title("The mask receives non-negative values at landmarks")
plt.show()
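As a quick sanity check (my own addition to the walkthrough), we can recover the (x, y) coordinates back from the mask; this inverse lookup is exactly what the class below performs after each transformation. Note that np.where returns (row, column) indices, i.e. (y, x).
## recover the landmark coordinates from the mask channel
for iland in range(len(landmark_bd)):
    iy, ix = np.where(yimg[:, :, 3] == iland)  # iy: rows (y), ix: columns (x)
    print("{}: recovered ({}, {}) vs original {}".format(
        landmarks[iland], int(np.mean(ix)), int(np.mean(iy)), landmark_bd[iland]))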
Step 2: define Keras's ImageDataGenerator with the parameters of your choice¶
from keras.preprocessing.image import ImageDataGenerator, img_to_array, load_img
datagen = ImageDataGenerator(rotation_range=20,
width_shift_range=10.0,
height_shift_range=10.0,
## Float. Shear Intensity (Shear angle in counter-clockwise direction in degrees)
shear_range=5.0,
## zoom_range: Float or [lower, upper].
## Range for random zoom. If a float,
## [lower, upper] = [1-zoom_range, 1+zoom_range]
zoom_range=[0.6, 1.2],
fill_mode='nearest',
#cval=-2,
horizontal_flip=True,
vertical_flip=False)
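Before wrapping this generator in a class, we can check that ImageDataGenerator happily accepts the 4-channel array and transforms the mask channel together with the image channels (a minimal sketch of my own, reusing yimg from above and assuming, as the class below does, that the transform preserves the mask values):
## draw one randomly transformed 4-channel sample
batch = next(datagen.flow(np.array([yimg]), batch_size=1, seed=1))
print(batch.shape)                       # (1, height, width, 4)
print(np.where(batch[0, :, :, 3] >= 0))  # landmark pixel locations after the transform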
Step 3: Define a class ImageDataGenerator_landmarks¶
- The class assumes that get_ymask is applied to each image before the flow method is called.
- After the image is transformed, you can resize it via the target_shape parameter.
- Transforming at the original resolution and then downsizing gives a wider variety of samples than transforming already downsized images.
- The class definition of ImageDataGenerator_landmarks is available from my GitHub account.
class ImageDataGenerator_landmarks(object):
def __init__(self,
datagen,
preprocessing_function= lambda x,y: (x,y),
loc_xRE=None,
loc_xLE=None,
flip_indicies=None,
target_shape=None,
ignore_horizontal_flip=True):
'''
datagen : Keras's ImageDataGenerator
        preprocessing_function : The function that will be applied to each batch.
                                 The function runs after the images are transformed and resized.
                                 It should take two arguments (x_batch, y_batch)
                                 and return the processed (x_batch, y_batch).
        ignore_horizontal_flip : if False, whether a horizontal flip happened is checked
                                 using loc_xRE and loc_xLE, and
                                 if a flip happened,
                                 each pair of coordinates in flip_indicies is swapped.
                                 if True, then loc_xRE, loc_xLE and flip_indicies
                                 do not need to be specified.
        target_shape : If target_shape is not None,
                       a transformed image is resized to target_shape.
                       Why? Transforming at the original resolution and then downsizing
                       gives a wider range of modified images than transforming
                       already downsized images.
        For example,
        suppose the landmarks are
        - right eye (RE)
        - left eye (LE)
        - nose (N)
        - right mouth edge (RM)
        - left mouth edge (LM)
        then there are 5 x 2 coordinates to predict:
        xRE, yRE, xLE, yLE, xN, yN, xRM, yRM, xLM, yLM
        When a horizontal flip happens, RE becomes LE and RM becomes LM,
        so we need to change the target variables accordingly.
        If a horizontal flip happened, then xRE > xLE,
        so loc_xRE = 0 and loc_xLE = 2.
        In this case, our flip indices are:
        self.flip_indicies = ((0,2), # xRE <-> xLE
                              (1,3), # yRE <-> yLE
                              (6,8), # xRM <-> xLM
                              (7,9)) # yRM <-> yLM
'''
self.datagen = datagen
self.ignore_horizontal_flip = ignore_horizontal_flip
self.target_shape = target_shape
# check if x-cord of landmark1 is less than x-cord of landmark2
self.loc_xRE, self.loc_xLE = loc_xRE, loc_xLE
self.flip_indicies = flip_indicies
        ## the channel that records the mask
self.loc_mask = 3
self.preprocessing_function = preprocessing_function
def flow(self,imgs,batch_size=20):
'''
        imgs: the numpy image array of shape (batch, height, width, image channels + 1);
              the (self.loc_mask)th channel must contain the mask
'''
generator = self.datagen.flow(imgs,batch_size=batch_size)
while 1:
##
N = 0
x_bs, y_bs = [], []
while N < batch_size:
yimgs = generator.next()
                ## yimgs.shape = (bsize, height, width, channels + 1)
                ## where bsize = min(batch_size, imgs.shape[0])
x_batch ,y_batch = self._keep_only_valid_image(yimgs)
if len(x_batch) == 0:
continue
x_batch ,y_batch = self.preprocessing_function(x_batch,y_batch)
x_bs.append(x_batch)
y_bs.append(y_batch)
N += x_batch.shape[0]
x_batch , y_batch = np.vstack(x_bs), np.vstack(y_bs)
yield ([x_batch, y_batch])
def _keep_only_valid_image(self,yimg):
        '''
        Transform the mask to (x,y)-coordinates.
        Depending on the transformation, a landmark may "disappear".
        For example, if the image is excessively zoomed in,
        the mask may lose the index of a landmark.
        Such transformed images are discarded.
        x_train and y_train could be empty arrays, i.e. np.array([]),
        if landmarks are lost in all the transformed images.
        '''
x_train, y_train = [], []
for irow in range(yimg.shape[0]):
x = yimg[irow,:,:,:self.loc_mask]
ymask = yimg[irow,:,:,self.loc_mask]
y = self._findindex_from_mask(ymask)
            # if some landmarks disappear, do not use the transformed image
if y is None:
continue
x, y = self._resize_image(x, y)
x_train.append(x)
y_train.append(y)
x_train = np.array(x_train)
y_train = np.array(y_train)
return(x_train,y_train)
def _resize_image(self,x,y):
'''
        this function is useful for downscaling the resolution
'''
if self.target_shape is not None:
shape_orig = x.shape
x = cv2.resize(x,self.target_shape[:2])
y = self.adjust_xy(y,
shape_orig,
self.target_shape)
return x,y
def adjust_xy(self,y,shape_orig,shape_new):
'''
y : [x1,y1,x2,y2,...]
'''
y[0::2] = y[0::2]*shape_new[1]/float(shape_orig[1])
y[1::2] = y[1::2]*shape_new[0]/float(shape_orig[0])
return y
def _findindex_from_mask(self,ymask):
        '''
        ymask : a mask of shape (height, width)
        '''
ys = []
        for i in range(self.Nlandmarks):
            ## np.where returns (row, column) indices,
            ## so ix holds the y-indices and iy holds the x-indices
            ix, iy = np.where(ymask==i)
            if len(ix) == 0:
                return(None)
            ys.extend([np.mean(iy),   ## x coordinate
                       np.mean(ix)])  ## y coordinate
ys = np.array(ys)
ys = self._adjustLR_horizontal_flip(ys)
return(ys)
def _adjustLR_horizontal_flip(self,ys):
        '''
        if a horizontal flip happens,
        the right eye becomes the left eye and
        the right mouth edge becomes the left mouth edge,
        so we need to flip the target coordinates accordingly
        '''
if self.ignore_horizontal_flip:
return(ys)
        if ys[self.loc_xRE] > ys[self.loc_xLE]:
            # the x-coordinate of RE is greater than that of LE:
            # a horizontal flip happened!
for a, b in self.flip_indicies:
ys[a],ys[b] = (ys[b],ys[a])
return(ys)
def get_ymask(self,img, xys):
        '''
        img : (height, width, channels) array of image
        xys : A list containing tuples of (x,y) landmark coordinates. For example:
              xys = [(x0,y0),
                     (x1,y1),
                     (x2,y2),
                     (x3,y3),
                     (x4,y4),
                     ...]
        output:
        mask : A numpy array of size (height, width, channels + 1).
               All locations without a landmark are recorded as -1.
               The pixel at (x0, y0) is recorded as 0,
               the pixel at (x1, y1) is recorded as 1,
               ...
        '''
yimg = np.zeros((img.shape[0],img.shape[1],1))
yimg[:] = -1
for iland, (ix,iy) in enumerate(xys):
yimg[iy,ix] = iland
self.Nlandmarks = len(xys)
self.loc_mask = img.shape[2]
return(np.dstack([img,yimg]))
Instantiate the class¶
To instantiate the class you need to provide
- datagen : Keras's ImageDataGenerator
- ignore_horizontal_flip : if False, whether a horizontal flip happened is checked using loc_xRE and loc_xLE, and if a flip happened, each pair of coordinates in flip_indicies is swapped. If True, then loc_xRE, loc_xLE and flip_indicies do not need to be specified.
- loc_xRE : the position where the x coordinate of the right eye is stored in the target variable.
- loc_xLE : the position where the x coordinate of the left eye is stored in the target variable.
- flip_indicies : the pairs of positions whose coordinates are swapped when a horizontal flip happens.
What are they?¶
Consider our scenario. Our landmarks are:
- right eye (RE)
- left eye (LE)
- nose (N)
- right mouth edge (RM)
- left mouth edge (LM)
so there are 5 x 2 coordinates to predict:
xRE, yRE, xLE, yLE, xN, yN, xRM, yRM, xLM, yLM
When a horizontal flip happens, RE becomes LE and RM becomes LM, so we need to change the target variables accordingly.
If a horizontal flip happened, we see xRE > xLE (the right eye is on the right of the left eye!).
In this case, we must swap the roles of RE and LE as well as RM and LM. So our flip indices are:
self.flip_indicies = ((0,2), # xRE <-> xLE
(1,3), # yRE <-> yLE
(6,8), # xRM <-> xLM
(7,9)) # yRM <-> yLM
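As a toy check (my own sketch, separate from the class code), this is what the swap does to a coordinate vector after a horizontal flip:
## hypothetical coordinate vector after a horizontal flip:
## xRE, yRE, xLE, yLE, xN, yN, xRM, yRM, xLM, yLM
ys = np.array([60., 30., 40., 30., 50., 45., 58., 60., 42., 60.])
if ys[0] > ys[2]:  # xRE > xLE: a horizontal flip must have happened
    for a, b in ((0, 2), (1, 3), (6, 8), (7, 9)):
        ys[a], ys[b] = ys[b], ys[a]
print(ys)  # the RE coordinates are again to the left of the LE coordinates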
generator = ImageDataGenerator_landmarks(datagen,
ignore_horizontal_flip=False,
target_shape=(90,90,3),
loc_xRE=0,
loc_xLE=2,
flip_indicies = ((0,2), # xRE <-> xLE
(1,3), # yRE <-> yLE
(6,8), # xRM <-> xLM
(7,9)) # yRM <-> yLM
)
xy = np.array([generator.get_ymask(img_bd,landmark_bd)])
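Before plotting panels of augmented images, here is a single draw from the generator (a quick check of my own):
## one batch: images resized to target_shape and 10 recovered coordinates
x_check, y_check = next(generator.flow(xy, batch_size=1))
print(x_check.shape)  # (1, 90, 90, 3)
print(y_check.shape)  # (1, 10): xRE, yRE, xLE, yLE, xN, yN, xRM, yRM, xLM, yLM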
Let's visualize!¶
Notice that RE is always to the left of LE.
plt.close('all')
def singleplot(ax,x,y):
ax.imshow(x/255.0)
colors = ['b','g','r','c','m']
for i, marker,c in zip(range(0,len(y),2),
landmarks,
colors):
ax.annotate(marker,
(y[i],y[i+1]),
color=c)
def pannelplot(figID=0,dir_image=None, nrow_plot = 6,ncol_plot = 6, fignm="fig",save=True):
fig = plt.figure(figsize=(15,15))
#fig.subplots_adjust(hspace=0,wspace=0)
count = 1
for x_train,y_train in generator.flow(xy,batch_size=1):
if len(x_train) == 1:
ax = fig.add_subplot(nrow_plot,ncol_plot,count)
#ax.axis("off")
singleplot(ax,x_train[0],y_train[0])
if count == nrow_plot * ncol_plot:
break
count += 1
if save:
plt.savefig(dir_image + "/fig{:04.0f}.png".format(figID),
bbox_inches='tight',pad_inches=0)
else:
plt.show()
pannelplot(save=False)
Make a gif¶
def create_gif(gifname,dir_image,duration=1):
import imageio
filenames = np.sort(os.listdir(dir_image))
filenames = [ fnm for fnm in filenames if ".png" in fnm]
with imageio.get_writer(dir_image + '/' + gifname + '.gif',
mode='I',duration=duration) as writer:
for filename in filenames:
image = imageio.imread(dir_image + filename)
writer.append_data(image)
dir_image = 'data_augmentation/'
plt.close('all')
for count in range(100):
pannelplot(count,dir_image, nrow_plot = 6,ncol_plot = 6, fignm="fig",save=True)
create_gif("example",dir_image,duration=0.5)
plt.close('all')