Yumi's Blog

Download all images from a Google Images search query using Python

In this blog post, I describe how I download a lot of images from Google Images. I followed pyimagesearch's blog post, so please give credit to his blog. His method has two steps:

  • Step 1: Gather the URLs of the images that appear in Google Images when you enter a query. pyimagesearch's blog post does this with JavaScript. After running his ~10 lines of JavaScript code, you will download a text file named urls.txt that contains the URLs of the images.
  • Step 2: Download the image at each URL using Python.

I followed Step 1 and downloaded urls.txt. However, I wrote my own script for Step 2, because his Python script needs quite a few imports:

  • from imutils import paths
  • import argparse
  • import cv2
  • import requests, os

I could not pip install imutils in my environment, so here I write a script that only requires requests and os (os is part of the Python standard library).

First, let's take a look at the urls.txt file. I renamed it to "urls - Hunter x Hunter anime.txt" because "Hunter x Hunter anime" was my Google Images query. (Hunter x Hunter is my favorite anime show.) If you are interested in my .txt file, I pushed it to my GitHub.

In [11]:
path_text = "urls - Hunter x Hunter anime.txt"

First, let's look at the number of lines in urls.txt and at its first 10 lines.

In [12]:
## read the whole file; the with-block closes it automatically
with open(path_text, "r") as o:
    url0 = o.read()

## list of image URLs, one entry per whitespace-separated token
urls = url0.split()
print("The number of urls: {}".format(len(urls)))
print("____________________________")
for url in urls[:10]:
    print(url)
The number of urls: 614
____________________________
http://img1.ak.crunchyroll.com/i/spire3/cbb55a6382682bf71e91f685c6473c5a1487736090_full.jpg
https://geekandsundry.com/wp-content/uploads/2016/01/JPEG-Promo-1.png
http://cdn1.theouterhaven.net/wp-content/uploads/2017/10/Hunter_x_Hunter.png
http://media.comicbook.com/2017/11/hunter-x-hunter-1019647-1062187-1280x0.jpg
https://myanimelist.cdn-dena.com/s/common/uploaded_files/1456110286-e4719dbe229cff118ffb4c6c2a05bfd6.png
https://d37x086vserhlm.cloudfront.net/wp-content/uploads/2017/03/27180359/hunter-x-hunter.jpg
https://i1.wp.com/eclipsemagazine.com/wp-content/uploads/2015/12/Hunter-X-Hunter.jpg
https://i.ytimg.com/vi/INQTyrlurJE/hqdefault.jpg
http://4hdwallpapers.com/wp-content/uploads/2013/09/Hunter-x-Hunter-Anime.jpg
https://myanimelist.cdn-dena.com/s/common/uploaded_files/1484292524-ba145fd5de7c1c852334fa88ed95b0a0.jpeg
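
The list printed above is used exactly as it comes out of urls.txt. I have not checked whether my file contains repeated links or entries that are not http(s) links, but if yours does, a small cleaning step before downloading could look like the sketch below. The function name clean_urls is my own choice for illustration and is not part of pyimagesearch's code.

```python
from urllib.parse import urlparse

def clean_urls(text):
    """Return the unique http(s) URLs found in whitespace-separated text.

    Hypothetical helper: keeps the first occurrence of each URL and
    preserves the original order.
    """
    seen = set()
    cleaned = []
    for candidate in text.split():
        parsed = urlparse(candidate)
        # skip anything that is not a plain http(s) link with a host name
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        # skip links that were already collected
        if candidate in seen:
            continue
        seen.add(candidate)
        cleaned.append(candidate)
    return cleaned

sample = "http://a.com/1.jpg https://b.com/2.png http://a.com/1.jpg notaurl"
print(clean_urls(sample))
```

With the variables from the cell above, this would be called as clean_urls(url0) in place of url0.split().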

Next, download the image at each URL using requests.get(). I wrapped each download in try/except because some requests fail with an error message. Running the following script creates a folder named data in the current directory if it does not already exist, and saves every image under a .jpg file name, whatever its original format was.

In [14]:
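import requests,os

loc_data = "./data/"
## create the output folder; do nothing if it already exists
os.makedirs(loc_data, exist_ok=True)
iimage = 0
for url in urls:
    try:
        ## download first so that a failed request does not leave an empty file;
        ## the 10-second timeout is an arbitrary choice that keeps a dead server
        ## from hanging the loop
        content = requests.get(url, timeout=10).content
        with open(loc_data + 'image{:05d}.jpg'.format(iimage),'wb') as f:
            f.write(content)
        iimage += 1
    except Exception as e:
        print("\n{} {}".format(e,url))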
HTTPConnectionPool(host='www.m5zn.com', port=80): Max retries exceeded with url: /uploads2/2012/2/17/photo/021712160250i32w9kaiuvv4y8iqq0bmzm25.png (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 65] No route to host',)) http://www.m5zn.com/uploads2/2012/2/17/photo/021712160250i32w9kaiuvv4y8iqq0bmzm25.png
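
The download loop above names every file .jpg, although the URL list clearly contains .png and .jpeg links too. If you would rather keep the real format in the file name, one option is to look at the Content-Type header that the server sends back. The sketch below is only an illustration under that assumption: the helper name filename_for and the table of extensions are my own, and some servers may send a missing or wrong Content-Type.

```python
# Hypothetical lookup table: Content-Type value -> file extension.
IMAGE_EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}

def filename_for(index, content_type):
    """Return a name such as 'image00003.png', or None for non-image types."""
    if not content_type:
        return None
    # drop parameters such as "; charset=utf-8" and normalize the case
    mime = content_type.split(";")[0].strip().lower()
    extension = IMAGE_EXTENSIONS.get(mime)
    if extension is None:
        return None
    return "image{:05d}{}".format(index, extension)

print(filename_for(3, "image/png"))                 # image00003.png
print(filename_for(4, "text/html; charset=utf-8"))  # None
```

Inside the loop, the header value would come from the response object, for example r = requests.get(url) followed by r.headers.get("Content-Type"). A result of None is a hint that the server returned something other than an image, such as an HTML error page, and the URL can be skipped.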

That's it for downloading the images. You do not need the code below if your sole purpose is to download the data, but if you want to take a look at some of the images, here is the code for peeking at the first 9 images.

In [16]:
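from keras.preprocessing.image import load_img 
import matplotlib.pyplot as plt

## os.listdir returns names in arbitrary order, so sort to get the first 9 images
fnames = sorted(os.listdir(loc_data))
fig = plt.figure(figsize=(10,10))
for count, fnm in enumerate(fnames[:9], start=1):
    ## resize every image to 400 x 400 so the 3 x 3 grid is uniform
    img = load_img(loc_data + fnm,target_size=(400,400))
    ax = fig.add_subplot(3,3,count)
    ax.imshow(img)
    ax.axis("off")
plt.show()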
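
The plotting cell above will most likely stop with an error if one of the files in the data folder is not really an image, for example an HTML error page that was saved under a .jpg name. pyimagesearch's script uses cv2 to weed out such files. As a rough alternative that needs only the standard library, the sketch below checks the first bytes of each file against the well-known signatures of JPEG, PNG, GIF and WebP. The function names are my own, and the check only looks at the header, so a truncated image would still pass.

```python
import os

# File signatures ("magic bytes") at the start of common image formats.
SIGNATURES = {
    "jpeg": (b"\xff\xd8\xff",),
    "png": (b"\x89PNG\r\n\x1a\n",),
    "gif": (b"GIF87a", b"GIF89a"),
}

def sniff_image_type(path):
    """Return 'jpeg', 'png', 'gif' or 'webp' based on the file header, else None."""
    with open(path, "rb") as f:
        header = f.read(12)
    for name, prefixes in SIGNATURES.items():
        if header.startswith(prefixes):
            return name
    # WebP files are RIFF containers with "WEBP" at byte offset 8
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None

def find_non_images(folder):
    """Return the sorted paths of files in folder whose header is not a known image type."""
    bad = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and sniff_image_type(path) is None:
            bad.append(path)
    return bad
```

Calling find_non_images(loc_data) returns the suspicious files, which you can inspect or delete before plotting.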
