Yumi's Blog

Use pretrained YOLO network for object detection, SJSU data science night



Detecting objects in an image is an important task for machines:

While humans can easily detect and identify the objects in an image, this was a very difficult task for machines until deep learning came into the field.

First, look at this cool YouTube video that you are going to reproduce today.

To motivate all of you, the YouTube video below shows the performance of a state-of-the-art object detection deep learning model on a baby compilation video.


The model used here is called YOLO (You Only Look Once). We will get into its details today.

For each frame of the video, a YOLO deep learning model detects

  • the location of each object, defined by the center coordinates of its bounding box and the box's width and height: $$ \begin{bmatrix} x_{\textrm{center of bounding box}}\\ y_{\textrm{center of bounding box}}\\ \textrm{width}_{\textrm{bounding box}}\\ \textrm{height}_{\textrm{bounding box}} \end{bmatrix} \in R^{4} $$

  • the object's class (a person, a cat, etc.)
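
To make the four-number box representation concrete, here is a minimal sketch that converts a center-format box into corner coordinates, which is the form most drawing functions expect. The names `Box` and `to_corners` are hypothetical helpers made up for illustration; they are not part of YOLO's code, and the pixel values are invented.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """A bounding box in center format (hypothetical helper for illustration)."""
    x_center: float
    y_center: float
    width: float
    height: float


def to_corners(box: Box) -> Tuple[float, float, float, float]:
    """Convert a center-format box to (x_min, y_min, x_max, y_max)."""
    half_w = box.width / 2.0
    half_h = box.height / 2.0
    return (box.x_center - half_w, box.y_center - half_h,
            box.x_center + half_w, box.y_center + half_h)


# Example: a 100 x 50 box centered at pixel (320, 180)
print(to_corners(Box(320, 180, 100, 50)))  # (270.0, 155.0, 370.0, 205.0)
```

Both formats carry the same information, so which one to use is mostly a matter of what the downstream code expects.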

In this workshop, your final goal is to learn how to use YOLO's pretrained model and reproduce this video. To achieve this goal, there are a lot of milestones along the way:

  • Learn how to use the ipython notebook.
  • Learn how to import a video into the ipython notebook, extract frames, and edit each frame.
  • Learn the basic concepts of object detection.
  • Learn how to create an mp4 file and a gif in Mac OSX.
In [1]:
from IPython.display import IFrame
IFrame("https://www.youtube.com/embed/MASPcgZ-rCE", width=990, height=800)

Some observations:

It is better to watch at 0.75x speed to understand what is really going on.


  • "person" is detected pretty well, whether the person is a baby or an adult.
  • "dog" is also detected pretty well.


  • "cat" is often detected as a dog or a person.
  • "cat" cannot be detected at all when it stands (see the 4th baby video).


  • The algorithm can recognize a cat as a cat when its face is visible.

This blog post assumes that you have already installed Python 3.6, TensorFlow 1.9.0, Keras 2.1.2 and OpenCV 3.4.2. If you have not done so, please follow the previous blog post to set up your virtual environment.

Then, download the workshop-specific code by following the instructions in Download workshop specific codes.

Open your ipython notebook in the downloaded GitHub repository. Step_by_step_DataScience_Night_Complete.ipynb contains this script.

Check the path of the current working directory (the ipython notebook accepts Linux commands!)
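
If you prefer to stay in Python rather than use a shell command, the same check can be sketched with the standard `os` module. This is an alternative for illustration, not necessarily what the workshop notebook runs; the variable name `cwd` is made up here.

```python
import os

cwd = os.getcwd()  # absolute path of the current working directory
print("Current working directory: {}".format(cwd))

# List the entries in the working directory, sorted by name
for name in sorted(os.listdir(cwd)):
    print(name)
```

The printed path should be the folder of the downloaded repository; if it is not, open the notebook from that folder.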

In [2]:

Here is what I see in my current working directory.

[Screenshot: contents of my current working directory]

Import python libraries

First, we import python libraries. The following code should work if your virtual environment is correctly set up.

In [3]:
import os, sys
import imageio                  # library for reading and writing images and videos
import matplotlib.pyplot as plt # python's standard visualization library
import cv2                      # opencv, common computer vision library
import numpy as np              # python's matrix calculation library
import keras                    # high level deep learning library
import tensorflow               # low level deep learning library
print(">>My python version is {}".format(sys.version))
print(">>My opencv version is {} (opencv 3.4.2 also works)".format(cv2.__version__))
print(">>My keras version is {}".format(keras.__version__))
print(">>My tensorflow version is {}".format(tensorflow.__version__))
%matplotlib inline
/Users/yumikondo/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
>>My python version is 3.6.3 |Anaconda, Inc.| (default, Oct  6 2017, 12:04:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
>>My opencv version is 3.2.0 (opencv 3.4.2 also works)
>>My keras version is 2.1.2
>>My tensorflow version is 1.9.0

Read in a YouTube video

Now we select a video on which to run the YOLO deep learning algorithm. I chose this ~3 min baby compilation.

You can choose your own video but make sure that it is not too long.

You can convert a ~3-minute video into a 360-pixel mp4 using the online tool YOUTUBEMP4. Save the downloaded video into your current working directory.

Now your folder should look like this, with an additional YouTube video named "videoplayback.mp4": [Screenshot: working directory containing videoplayback.mp4]

Let's read videoplayback.mp4 into the ipython notebook.

In [4]:
video_name   = "videoplayback.mp4"
## instantiate the video reader object 
video_reader = cv2.VideoCapture(video_name)

A video consists of a sequence of frames (i.e., a bunch of pictures). Running video_reader.read() once gives you the first frame of the video.

video_reader.read() returns two outputs, a boolean (True/False) flag and a frame: if the frame is read correctly, the flag is True. So you can detect the end of the video by checking this return value. Let's check the output of video_reader.read().
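
The "read until the flag turns False" pattern can be sketched without a real video file. In the sketch below, `FakeReader` is a hypothetical stand-in that only mimics the (flag, frame) return value of `read()`; `count_frames` is likewise a made-up helper. Any object with the same `read()` behaviour, such as a `cv2.VideoCapture`, could be passed in instead.

```python
class FakeReader:
    """Hypothetical stand-in that mimics the (flag, frame) return of read()."""

    def __init__(self, n_frames):
        self.remaining = n_frames

    def read(self):
        if self.remaining == 0:
            return False, None   # no more frames: the flag is False
        self.remaining -= 1
        return True, "frame"     # placeholder for an image array


def count_frames(reader):
    """Call read() until the flag turns False and count the frames read."""
    n = 0
    while True:
        ret, frame = reader.read()
        if not ret:
            break
        n += 1
    return n


print(count_frames(FakeReader(90)))  # 90
```

Note that counting frames this way consumes the reader, so a fresh reader is needed to go through the frames again.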

In [5]:
ret, frame = video_reader.read()
print("successfully read? {}".format(ret))
print(frame)
successfully read? True
[[[228 113 203]
  [228 113 203]
  [228 113 203]
  ...
  [204 215  75]
  [204 215  75]
  [204 215  75]]

 [[228 113 203]
  [228 113 203]
  [228 113 203]
  ...
  [204 215  75]
  [204 215  75]
  [204 215  75]]

 [[228 113 203]
  [228 113 203]
  [228 113 203]
  ...
  [204 215  75]
  [204 215  75]
  [204 215  75]]

 ...

 [[189 145 214]
  [194 145 214]
  [202 145 213]
  ...
  [252 187 131]
  [252 187 131]
  [252 187 131]]

 [[237 139 218]
  [237 140 217]
  [239 140 215]
  ...
  [252 187 131]
  [252 187 131]
  [252 187 131]]

 [[255 137 219]
  [255 138 218]
  [253 139 216]
  ...
  [252 187 131]
  [252 187 131]
  [252 187 131]]]
  • Lots of numbers between 0 and 255 show up.
  • In fact, an image is read as a numpy array of dimension (height, width, 3).
    • 3 is the number of channels. Here, the three color channels are Blue, Green and Red.
  • To see this, check the shape attribute of the numpy array.
In [6]:
print(frame.shape)
(360, 640, 3)
  • frame has 360 pixels along height and 640 pixels along width
  • We can also visualize this entire image by using a matplotlib function.
  • plt.imshow assumes that the colors are ordered as Red, Green and Blue. So I need to reorder the frame.

TODO: Check how the color changes by running plt.imshow on frame.

In [7]:
# reorder the frame so that the 3 channels are ordered as R, G and then B.
frameRGB = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
plt.imshow(frameRGB)
plt.show()
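
For a 3-channel image, the BGR-to-RGB reordering can also be pictured as reversing the last axis of the array. The sketch below uses a tiny synthetic 2 x 2 "frame" with made-up values instead of a real video frame, and the names `frame_bgr` and `frame_rgb` are chosen only for this example.

```python
import numpy as np

# A tiny synthetic 2 x 2 "frame" in BGR order (values are made up)
frame_bgr = np.array([[[255, 0, 0], [0, 255, 0]],
                      [[0, 0, 255], [10, 20, 30]]], dtype=np.uint8)

# Reverse the last axis: (B, G, R) -> (R, G, B)
frame_rgb = frame_bgr[:, :, ::-1]

print(frame_bgr.shape)   # (2, 2, 3) = (height, width, channels)
print(frame_bgr[1, 1])   # [10 20 30]
print(frame_rgb[1, 1])   # [30 20 10]
```

The slice returns a view rather than a copy, so if a later step needs an independent array, a `.copy()` is the safer choice.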

So we successfully read in the first frame of our baby compilation video. How can we get the rest of the video? We can simply run video_reader.read() again, and that will give us the next frame. But how frequently are the frames recorded in this video? To find the frames per second, you can run the following code:

In [8]:
fps = video_reader.get(cv2.CAP_PROP_FPS)
print("{} frames per second".format(fps))
30.0 frames per second

The following code shows the first 5 seconds of the video. During these 5 seconds, there are 30 x 5 = 150 frames, which is a bit too many to plot in one ipython notebook. So I will plot only 1 frame per second.
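
The frame arithmetic can be sketched on its own. The helper `frames_to_sample` below is hypothetical and made up for illustration: it lists which frame counts fall closest to each whole second. Rounding is used so that the idea also works when the frame rate is not a whole number (29.97 is an assumed example value, not this video's rate).

```python
def frames_to_sample(fps, seconds):
    """Return the 1-based frame counts closest to t = 1, 2, ..., seconds."""
    return [int(round(fps * t)) for t in range(1, seconds + 1)]


fps = 30.0
seconds = 5
print(int(fps * seconds))                 # 150 frames in total
print(frames_to_sample(fps, seconds))     # [30, 60, 90, 120, 150]
print(frames_to_sample(29.97, seconds))   # [30, 60, 90, 120, 150]
```

With a whole-number rate such as 30.0, a simple modulo test on the frame count selects the same frames.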

TODO: increase secToSample

In [9]:
video_reader = cv2.VideoCapture(video_name)
secToSample = 5
count = 0 
while count < secToSample*fps:
    count += 1
    ret, frame = video_reader.read()
    if not ret:  # stop if the video ends or a frame cannot be read
        break
    if count % fps == 0:
        frameRGB = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        plt.title("{}th frame, t = {:3.2f} sec".format(count,count/fps))