CNN on Google Cloud Computing

Differentiating Images of Dogs and Cats using a Convolutional Neural Network and Google Cloud's Deep Learning VM Image

Mark Dodd and Michael Ellsworth

Abstract

This project will explore methods for building a Convolutional Neural Network (CNN) to differentiate between images of cats and dogs. This somewhat trivial task for humans is a complicated and time-consuming process for a desktop computer. Exploiting the resources available on Google Cloud, this project will test a number of different CNNs in order to achieve a target accuracy of 90% or greater.

Packages

This project will lean heavily on the open-source CNN infrastructure available via TensorFlow's high-level API, Keras. Keras gives us the ability to build and train deep learning models such as CNNs, which will ultimately be used to differentiate images of cats and dogs. In addition to Keras, the typical data science Python stack will be used, including pandas, numpy and matplotlib.

In [2]:
import os
import glob
import pathlib
import itertools
import pickle
import time
import math
from PIL import Image
import multiprocessing as mp

import pandas as pd
import numpy as np
from statistics import mean, median, mode
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix, roc_curve, precision_recall_curve

import tensorflow as tf
from tensorflow import keras
import tensorflow.keras.backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Dropout, MaxPooling2D, BatchNormalization, Activation, GlobalAveragePooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator
AUTOTUNE = tf.data.experimental.AUTOTUNE

%matplotlib inline
sns.set()
mpl.rcParams['figure.dpi'] = 100
mpl.rcParams['axes.titlesize'] = 18
mpl.rcParams['axes.labelsize'] = 14

TRAIN_PATH = r'data/train/'

Data Set

Background

The dataset used to train and test the CNNs in this project is publicly available at Kaggle. As with the majority of Kaggle datasets, it consists of a training set and a testing set; however, this project will train and test the CNNs using the training set only, as the testing set is unlabelled and intended for submitting final predictions to Kaggle for judging. The unlabelled testing set therefore cannot serve the purpose of a typical machine learning testing set.

The full dataset is approximately 850 MB (the training images alone occupy roughly 595 MB on disk, as measured below), which, in our opinion, is sizeable enough to constitute a Big Data problem, especially considering that once the data is read into memory, the array for each image takes up much more space than the compressed .jpg file does. This project will test that assumption by running a handful of CNNs on a local machine and comparing the training duration against that of a cluster of machines at Google. This topic will be explored further in later sections of the project.
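
To make the memory-expansion claim concrete, the sketch below compares the on-disk size of one compressed .jpg with the size of its decoded pixel array. The file name cat.0.jpg is an assumed example; any image from the set would do.

In [ ]:
# A minimal sketch of the memory-expansion claim; 'cat.0.jpg' is an assumed
# example file name under TRAIN_PATH (any image from the set would do).
sample = TRAIN_PATH + 'cat.0.jpg'
on_disk = os.path.getsize(sample)                # compressed .jpg bytes
decoded = np.array(Image.open(sample)).nbytes    # decoded uint8 array bytes
print('{:.0f} KB on disk vs {:.0f} KB decoded ({:.1f}x larger)'.format(
    on_disk / 1024, decoded / 1024, decoded / on_disk))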

A Quick Exploration of the Data

We want to take a moment to illustrate the actual size of our data set and the distribution of image dimensions, which we visualize with a histogram and density plot of the x and y dimensions. We use multiprocessing to perform this processing, cycling through one to eight cores to ascertain the benefit of multiprocessing for this task.

In [3]:
def image_information(file):
    img = np.array(Image.open(file))
    return img.shape[0], img.shape[1], img.nbytes
In [10]:
results = {'cores': [], 'chunksize': [], 'time': [], 'size': []}
for cpu in range(1, mp.cpu_count() + 1):
    for cs in [1, 8, 16, 24, 32]:
        x_dims = []
        y_dims = []
        num_bytes = 0
        data_dir = pathlib.Path(TRAIN_PATH)
        files = data_dir.glob('*.jpg')

        begin = time.perf_counter()
        with mp.Pool(processes = cpu) as p:
            for x, y, n in p.imap_unordered(image_information, files, chunksize = cs):
                x_dims.append(x)
                y_dims.append(y)
                num_bytes += n
        end = time.perf_counter()

        r = (cs, end - begin)
        print('Number of processors in use is {} of {}. '.format(cpu, mp.cpu_count()) + 'chunksize = {}, time = {:.4f} s'.format(*r) )
        results['cores'].append(cpu)
        results['chunksize'].append(cs)
        results['time'].append(r[1])
        results['size'].append(num_bytes)

# collect the timing results into a DataFrame once all runs complete
results_df = pd.DataFrame(results)
print("mode x = {}, mean x = {:.2f}, median x = {}".format(mode(x_dims), mean(x_dims), median(x_dims)))
print("mode y = {}, mean y = {:.2f}, median y = {}".format(mode(y_dims), mean(y_dims), median(y_dims)))
Number of processors in use is 1 of 8. chunksize = 1, time = 35.2806 s
Number of processors in use is 1 of 8. chunksize = 8, time = 33.4802 s
Number of processors in use is 1 of 8. chunksize = 16, time = 33.4808 s
Number of processors in use is 1 of 8. chunksize = 24, time = 33.4829 s
Number of processors in use is 1 of 8. chunksize = 32, time = 33.2797 s
Number of processors in use is 2 of 8. chunksize = 1, time = 18.1611 s
Number of processors in use is 2 of 8. chunksize = 8, time = 16.9562 s
Number of processors in use is 2 of 8. chunksize = 16, time = 16.8560 s
Number of processors in use is 2 of 8. chunksize = 24, time = 16.8528 s
Number of processors in use is 2 of 8. chunksize = 32, time = 16.8544 s
Number of processors in use is 3 of 8. chunksize = 1, time = 12.2567 s
Number of processors in use is 3 of 8. chunksize = 8, time = 11.5560 s
Number of processors in use is 3 of 8. chunksize = 16, time = 11.3529 s
Number of processors in use is 3 of 8. chunksize = 24, time = 11.4552 s
Number of processors in use is 3 of 8. chunksize = 32, time = 11.4517 s
Number of processors in use is 4 of 8. chunksize = 1, time = 9.5606 s
Number of processors in use is 4 of 8. chunksize = 8, time = 8.8560 s
Number of processors in use is 4 of 8. chunksize = 16, time = 8.7560 s
Number of processors in use is 4 of 8. chunksize = 24, time = 8.7566 s
Number of processors in use is 4 of 8. chunksize = 32, time = 8.6543 s
Number of processors in use is 5 of 8. chunksize = 1, time = 8.8663 s
Number of processors in use is 5 of 8. chunksize = 8, time = 8.2619 s
Number of processors in use is 5 of 8. chunksize = 16, time = 8.1637 s
Number of processors in use is 5 of 8. chunksize = 24, time = 8.1634 s
Number of processors in use is 5 of 8. chunksize = 32, time = 8.2635 s
Number of processors in use is 6 of 8. chunksize = 1, time = 8.3733 s
Number of processors in use is 6 of 8. chunksize = 8, time = 7.7717 s
Number of processors in use is 6 of 8. chunksize = 16, time = 7.6675 s
Number of processors in use is 6 of 8. chunksize = 24, time = 7.6691 s
Number of processors in use is 6 of 8. chunksize = 32, time = 7.7702 s
Number of processors in use is 7 of 8. chunksize = 1, time = 7.9810 s
Number of processors in use is 7 of 8. chunksize = 8, time = 7.3785 s
Number of processors in use is 7 of 8. chunksize = 16, time = 7.2767 s
Number of processors in use is 7 of 8. chunksize = 24, time = 7.2794 s
Number of processors in use is 7 of 8. chunksize = 32, time = 7.2747 s
Number of processors in use is 8 of 8. chunksize = 1, time = 7.6906 s
Number of processors in use is 8 of 8. chunksize = 8, time = 6.9827 s
Number of processors in use is 8 of 8. chunksize = 16, time = 6.9831 s
Number of processors in use is 8 of 8. chunksize = 24, time = 6.9817 s
Number of processors in use is 8 of 8. chunksize = 32, time = 6.8831 s
mode x = 374, mean x = 360.48, median x = 374.0
mode y = 500, mean y = 404.10, median y = 447.0
In [11]:
!du -sh data/train
595M	data/train
In [12]:
fig, ax = plt.subplots(figsize=(12,8))
sns.barplot(data=results_df, x='cores', y='time', hue='chunksize', ax=ax, palette=sns.color_palette("Blues_d"), edgecolor = 'k')
plt.title("Time to Process 25,000 Images vs. Core and Chunksize", size = 18)
plt.annotate("Files are 595MB on drive, {:.1f} GB in memory.\nReadings taken after files stored in disk cache.".format(num_bytes/(1024**3)), xy = (4,10))
plt.xlabel("Number of Cores", size = 14)
plt.ylabel("Time (s)")
plt.show()
In [7]:
fig, ax = plt.subplots(figsize=(12,8))
sns.distplot(x_dims, bins = 30, label = 'x-dim')
sns.distplot(y_dims, bins = 30, label = 'y-dim')
plt.xlabel('Size (pixels)', size = 14)
plt.title('Distribution of Image Sizes in Dog vs. Cats', size = 18)
plt.xlim((0,600))
plt.legend()
plt.show()

From the above plot we see a wide, left-skewed distribution of image sizes in our data set. We decided to rescale each image to 150x150 pixels to reduce both the data size and the model size, making the models easier to build and train.
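
A little arithmetic motivates that choice; the figures below are estimates based on the median dimensions reported above, not measurements.

In [ ]:
# Rough memory arithmetic behind the 150x150 choice (estimates, not measurements).
resized = 150 * 150 * 3 * 4       # height x width x RGB channels x sizeof(float32)
native = 374 * 447 * 3 * 4        # median x and y dimensions reported above
print('resized: {:.1f} GB, native-median: {:.1f} GB for 25,000 images'.format(
    25000 * resized / 1024**3, 25000 * native / 1024**3))
# roughly 6.3 GB vs 47 GB, i.e. about a sevenfold reduction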

Structuring the training data

The training data available at Kaggle consists of 25,000 images: 12,500 of dogs and 12,500 of cats. Each image file is named either dog.n.jpg or cat.n.jpg, with n numbered from 0 to 12,499. In order to effectively train a CNN, the 25,000 training images need to be split into three subsets: training, validation and testing. In this project, it was decided that effective training would also require an equal number of cat and dog images in each of these subsets. The following set of functions splits the training data into the three subsets with an equal number of dog and cat images in each.

In [8]:
# Function 1 - train_val_test
# Creates a function to split a list into the three image subsets; training, validation and testing

def train_val_test(img_list, test_size = 0.1, validation_size = 0.15, random_state = 42):
    ''' Split a list into a training, validation and test set
        Parameters: 
            img_list - a list of img file paths
            test_size - 0-1.0 - defines the test size as percentage of the list size
            validation_size - 0-1.0 - defines the validation size as percentage of the list size
        Returns:
            train_list
            validation_list
            test_list 
    '''
    np.random.seed(random_state)
    imgs_shuffled = img_list.copy()
    np.random.shuffle(imgs_shuffled)

    train_size = 1 - test_size - validation_size
    train_ind = int(len(imgs_shuffled)*train_size)
    val_ind = int(len(imgs_shuffled)*validation_size) + train_ind

    return imgs_shuffled[:train_ind], imgs_shuffled[train_ind:val_ind], imgs_shuffled[val_ind:]

# Function 2 - train_val_test_combined
# Creates a function that will combine the three image subsets for dogs and cats
# This function is required to reconstruct the 25,000 images with an equal amount of dogs and cats in each subset

def train_val_test_combined(dogs, cats, test_size = 0.1, validation_size = 0.15, random_state = 42):
    ''' Split a list into a training, validation and test set
        Parameters: 
            dogs - a list of img file paths
            cats - a list of img file paths
            test_size - 0-1.0 - defines the test size as percentage of the list size
            validation_size - 0-1.0 - defines the validation size as percentage of the list size
        Returns:
            train_list - combined dog and cat list
            validation_list - combined dog and cat list
            test_list - combined dog and cat list
    '''
    dog_train, dog_validation, dog_test = train_val_test(dogs, test_size, validation_size, random_state)
    cat_train, cat_validation, cat_test = train_val_test(cats, test_size, validation_size, random_state)

    train = dog_train + cat_train
    val = dog_validation + cat_validation
    test = dog_test + cat_test

    np.random.seed(random_state)
    np.random.shuffle(train)
    np.random.shuffle(val)
    np.random.shuffle(test)

    return train, val, test

# Function 3 - create_train_val_test
# Creates a function to separate the cats and dogs files and combine them into the three image subsets using
# Function 1 and Function 2

def create_train_val_test(train_path):
    train_cat_files = glob.glob(train_path + 'cat*')
    train_dog_files = glob.glob(train_path + 'dog*')

    return train_val_test_combined(train_dog_files, train_cat_files)
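
As a quick sanity check on the split proportions, the toy example below (with hypothetical file names) should produce a 75 / 15 / 10 split under the default sizes.

In [ ]:
# Toy check of the default 75/15/10 split; the file names are hypothetical.
toy = ['dog.{}.jpg'.format(i) for i in range(100)]
tr, va, te = train_val_test(toy)
print(len(tr), len(va), len(te))  # expect 75 15 10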

Data Cleaning

There are a number of data cleaning steps that are required prior to feeding the images into Keras to create a CNN. These data cleaning steps include:

  • Assigning a label to each image, in this case, 1 being a dog and 0 being a cat
  • Decoding the .jpg file into a 3D uint8 tensor that assigns three numbers to each of the image's pixels based on the amount of red, green and blue in that pixel
  • Converting the colour values from the typical 0 to 255 scale to a 0 to 1 scale.

Additionally, an important issue to note about the dataset is that the images are inconsistent in size. Since Keras takes the pixel height and width as features of the image, if the images are left as is, Keras would be unable to train a CNN because the features of each image would be inconsistent. This would be equivalent to fitting a linear regression model where each observation has a different number of features. As a result, the images need to be re-sized to a consistent height and width; this project resizes them to 150 by 150 pixels.

The following set of functions was built to complete the aforementioned data cleaning steps; the code was adapted from a TensorFlow tutorial. Once the functions are created, the three subsets of image data can be constructed.

In [9]:
# Constants used to define the image height and width to resize each image consistently
IMG_HEIGHT = 150
IMG_WIDTH = 150
BATCH_SIZE = 32

# Function 1 - get_label
# Creates a function to extract the label from each .jpg file in the Kaggle dataset

def get_label(file_path):
    # convert the path to a list of path components
    parts = tf.strings.split(file_path, os.path.sep)

    # get the cat / dog component of the file name
    cat_or_dog = tf.strings.split(parts[-1], '.')[0]

    # cast the boolean comparison to an integer label (1 = dog, 0 = cat);
    # tf.cast keeps the function traceable when mapped over a tf.data dataset
    return tf.cast(cat_or_dog == 'dog', tf.int32)

# Function 2 - decode_img
# Creates a function to decode the .jpg file into a 3D uint8 tensor with a consistent height and width

def decode_img(img):
    # convert the compressed string to a 3D uint8 tensor
    img = tf.image.decode_jpeg(img, channels=3)

    # Use `convert_image_dtype` to convert to floats in the [0,1] range.
    img = tf.image.convert_image_dtype(img, tf.float32)

    # resize the image to the desired size.
    return tf.image.resize(img, [IMG_HEIGHT, IMG_WIDTH])

# Function 3 - process_path
# Creates a function to extract the label using the get_label function and assign it to a tensor using
# the decode_img function

def process_path(file_path):
    label = get_label(file_path)
    img = tf.io.read_file(file_path)
    img = decode_img(img)
    return img, label

# Function 4 - build_dataset
# Creates a function to construct the dataset of images from a list of files.
# This function runs the process_path function in parallel

def build_dataset(file_list):
    # convert a list of file paths into a dataset of (image, label) pairs
    ds = tf.data.Dataset.from_tensor_slices(file_list)
    ds = ds.map(process_path, num_parallel_calls=AUTOTUNE) # parallel routine
    return ds

# Function 5 - ds_len
# Creates a straightforward function to pull the length of a tensorflow dataset

def ds_len(ds):
    ''' get length of a tensorflow dataset '''
    return tf.data.experimental.cardinality(ds).numpy()
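
Before building the full datasets, a single file can be pushed through process_path to confirm the shape, dtype and scaling; this check assumes at least one .jpg is present under TRAIN_PATH.

In [ ]:
# Sanity check on one image (assumes the training images are present).
sample_file = glob.glob(TRAIN_PATH + '*.jpg')[0]
img, label = process_path(sample_file)
print(img.shape, img.dtype, float(tf.reduce_max(img)), int(label))
# expect (150, 150, 3), float32, a max pixel value <= 1.0, and a 0/1 label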

Data Augmentation

We write a few functions to perform data augmentation. These augmentations randomly perturb each of the images in our data set, which, in a sense, artificially expands our training data set.

In [10]:
# functions modified from and inspired by:
#    https://www.wouterbulten.nl/blog/tech/data-augmentation-using-tensorflow-data-dataset/

def img_flip(img, label):
    return tf.image.random_flip_left_right(img), label


def img_color(img, label):
    img = tf.image.random_hue(img, 0.08)
    img = tf.image.random_saturation(img, 0.6, 1.6)
    img = tf.image.random_brightness(img, 0.05)
    img = tf.image.random_contrast(img, 0.7, 1.3)
    return img, label


# Chris Deotte - https://www.kaggle.com/cdeotte/rotation-augmentation-gpu-tpu-0-96
def get_mat(rotation, shear, height_zoom, width_zoom, height_shift, width_shift):
    # returns a 3x3 transform matrix which transforms indices

    # CONVERT DEGREES TO RADIANS
    rotation = math.pi * rotation / 180.
    shear = math.pi * shear / 180.

    # ROTATION MATRIX
    c1 = tf.math.cos(rotation)
    s1 = tf.math.sin(rotation)
    one = tf.constant([1],dtype='float32')
    zero = tf.constant([0],dtype='float32')
    rotation_matrix = tf.reshape( tf.concat([c1,s1,zero, -s1,c1,zero, zero,zero,one],axis=0),[3,3] )

    # SHEAR MATRIX
    c2 = tf.math.cos(shear)
    s2 = tf.math.sin(shear)
    shear_matrix = tf.reshape( tf.concat([one,s2,zero, zero,c2,zero, zero,zero,one],axis=0),[3,3] )

    # ZOOM MATRIX
    zoom_matrix = tf.reshape( tf.concat([one/height_zoom,zero,zero, zero,one/width_zoom,zero, zero,zero,one],axis=0),[3,3] )

    # SHIFT MATRIX
    shift_matrix = tf.reshape( tf.concat([one,zero,height_shift, zero,one,width_shift, zero,zero,one],axis=0),[3,3] )

    return K.dot(K.dot(rotation_matrix, shear_matrix), K.dot(zoom_matrix, shift_matrix))


# Chris Deotte - https://www.kaggle.com/cdeotte/rotation-augmentation-gpu-tpu-0-96
def transform(image,label):
    # input image - is one image of size [dim,dim,3] not a batch of [b,dim,dim,3]
    # output - image randomly rotated, sheared, zoomed, and shifted
    DIM = IMG_HEIGHT
    XDIM = DIM%2 #fix for size 331

    rot = 15. * tf.random.normal([1],dtype='float32')
    shr = 5. * tf.random.normal([1],dtype='float32')
    h_zoom = 1.0 + tf.random.normal([1],dtype='float32')/10.
    w_zoom = 1.0 + tf.random.normal([1],dtype='float32')/10.
    h_shift = 16. * tf.random.normal([1],dtype='float32')
    w_shift = 16. * tf.random.normal([1],dtype='float32')

    # GET TRANSFORMATION MATRIX
    m = get_mat(rot,shr,h_zoom,w_zoom,h_shift,w_shift)

    # LIST DESTINATION PIXEL INDICES
    x = tf.repeat( tf.range(DIM//2,-DIM//2,-1), DIM )
    y = tf.tile( tf.range(-DIM//2,DIM//2),[DIM] )
    z = tf.ones([DIM*DIM],dtype='int32')
    idx = tf.stack( [x,y,z] )

    # ROTATE DESTINATION PIXELS ONTO ORIGIN PIXELS
    idx2 = K.dot(m,tf.cast(idx,dtype='float32'))
    idx2 = K.cast(idx2,dtype='int32')
    idx2 = K.clip(idx2,-DIM//2+XDIM+1,DIM//2)

    # FIND ORIGIN PIXEL VALUES           
    idx3 = tf.stack( [DIM//2-idx2[0,], DIM//2-1+idx2[1,]] )
    d = tf.gather_nd(image,tf.transpose(idx3))

    return tf.reshape(d,[DIM,DIM,3]),label


def ds_augment(ds):
    # create a list of augmentation functions
    augmentations = [img_flip, img_color, transform]

    # map each augmentation function to the dataset in parallel
    for f in augmentations:
        ds = ds.map(f, num_parallel_calls=AUTOTUNE)

    # Make sure that the values are still in [0, 1]
    ds = ds.map(lambda x, label: (tf.clip_by_value(x, 0, 1), label), num_parallel_calls=AUTOTUNE)

    return ds
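
Because the colour augmentations can push pixel values outside [0, 1] before the final clipping step, a cheap self-contained check is worthwhile; the random tensor below merely stands in for a real 150x150 training image.

In [ ]:
# Self-contained check that ds_augment keeps pixel values in [0, 1];
# the random uniform image is a stand-in for a real training image.
fake = tf.data.Dataset.from_tensors(
    (tf.random.uniform([IMG_HEIGHT, IMG_WIDTH, 3]), 1))
for img, label in ds_augment(fake):
    print(float(tf.reduce_min(img)), float(tf.reduce_max(img)))  # both in [0, 1]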

Build Datasets

The following functions use the previously defined helpers to convert the file lists into training / validation / test TensorFlow datasets. For the training data we create both an augmented and a non-augmented version so we can compare the performance gains realized by augmenting the dataset.

In [11]:
# Function 1 - prepare_for_training
# Creates a function to feed data into the CNN in a random order
#   - cache the dataset in memory (or to a file if a path is given)
#   - shuffle it fully
#   - repeat the dataset as needed across epochs
#   - feed data out at a batch size
#   - prefetch the next batch and have it ready when required to speed processing

def prepare_for_training(ds,
                         cache = True,
                         shuffle = False,
                         augment = False,
                         repeat = True,
                         prefetch = True,
                         batch_size = BATCH_SIZE):

    # we always want to cache, but this is kept generic in case caching is not desired
    if cache:
        if isinstance(cache, str):
            ds = ds.cache(cache)
        else:
            ds = ds.cache()

    # we don't always want to shuffle (validation / test)
    if shuffle:
        ds = ds.shuffle(buffer_size=ds_len(ds), reshuffle_each_iteration = True)

    if repeat:
        ds = ds.repeat() # always repeat

    # we will only augment the training set
    if augment:
        ds = ds_augment(ds)

    ds = ds.batch(batch_size)

    if prefetch:
        ds = ds.prefetch(buffer_size=AUTOTUNE) # fetch a batch in the background

    return ds


# Function 2 - build_labelled_datasets
# Using the create_train_val_test function from the previous code chunk, build_labelled_datasets
# creates a function to convert the 25,000 .jpg images into a Keras readable format and split them
# into training, validation and testing data subsets

def build_labelled_datasets(path):
    train_files, validation_files, test_files = create_train_val_test(path)

    labeled_train_ds = build_dataset(train_files)
    labeled_val_ds = build_dataset(validation_files)
    labeled_test_ds = build_dataset(test_files)

    # display the lengths of the three sets
    print('Train size = {}, Validation Size = {}, Test Size = {}'
          .format(ds_len(labeled_train_ds), ds_len(labeled_val_ds), ds_len(labeled_test_ds)))

    return labeled_train_ds, labeled_val_ds, labeled_test_ds

# Function 3 - prepare_datasets
# Creates a function to run prepare_for_training on each subset, producing both an augmented
# and a non-augmented training pipeline along with the validation and test pipelines

def prepare_datasets(train, val, test):

    train_ds = prepare_for_training(train, shuffle = True, augment = False)    # unaugmented training set
    train_aug_ds = prepare_for_training(train, shuffle = True, augment = True) # augmented training set
    val_ds = prepare_for_training(val)      # no need to shuffle
    test_ds = prepare_for_training(test,
                                   cache = True,
                                   shuffle = False,
                                   augment = False,
                                   repeat = False,
                                   prefetch = False,
                                   batch_size = ds_len(test)) # whole test set as one batch

    return train_ds, train_aug_ds, val_ds, test_ds
In [12]:
# Create the training, validation and testing TensorFlow datasets for input into Keras
labeled_train_ds, labeled_val_ds, labeled_test_ds = build_labelled_datasets(TRAIN_PATH)
train_ds, train_aug_ds, val_ds, test_ds = prepare_datasets(labeled_train_ds, labeled_val_ds, labeled_test_ds)
Train size = 18750, Validation Size = 3750, Test Size = 2500
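
As a usage note, the cache argument of prepare_for_training also accepts a file path, in which case tf.data spills the decoded tensors to disk rather than holding them in RAM; this may help if the decoded set outgrows memory. The cache path below is hypothetical, and its directory is assumed to exist.

In [ ]:
# Hypothetical example: cache decoded tensors to a disk file instead of RAM.
# tf.data writes files with this prefix on the first pass through the data;
# the 'cache/' directory is assumed to exist.
disk_cached_train_ds = prepare_for_training(labeled_train_ds,
                                            cache='cache/train.tfcache',
                                            shuffle=True)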

Visualizing the data

After constructing the three TensorFlow datasets, we can now visualize what these images look like. Looking at 3 images from each of the subsets, it can be concluded that the data has been re-sized appropriately and is ready for input into Keras.

In [13]:
# Function 1 - plot_n_imgs_tf
# Creates a function to view images in a dataset

def plot_n_imgs_tf(ds, n = 6, title = None):
    ncols = 3
    nrows = (n - 1) // ncols + 1
    figw = 20
    fig, axs = plt.subplots(nrows,
                            ncols,
                            figsize=(figw, figw / ncols * nrows), squeeze=False)
    imgs = [img for img, lab in ds.take(n)]
    for i, (ax, img) in enumerate(zip(axs.flatten(), imgs)):
        ax.grid(False)
        ax.set_xticks([])
        ax.set_yticks([])
        ax.imshow(img, interpolation='bilinear')
        if title is not None and i % ncols == 1: # title the middle column
            ax.set_title(title, color = 'dimgrey', size = 18)
In [14]:
# View 3 images from the training dataset
plot_n_imgs_tf(labeled_train_ds, 3, "Training Set")

# View 3 images from the validation dataset
plot_n_imgs_tf(labeled_val_ds, 3, "Validation Set")

# View 3 images from the testing dataset
plot_n_imgs_tf(labeled_test_ds, 3, "Test Set")