Digit Sequence Recognition using Deep Learning

Skills: Python, Keras, Deep Learning, CNN, Computer Vision
Github

In this project, we will design and implement a deep learning model that learns to recognize sequences of digits. We will train the model using synthetic data generated by concatenating character images from MNIST.

When producing the synthetic sequences of digits, we will limit them to at most five digits and use five classifiers on top of the deep network, one per digit position. We will add an extra ‘blank’ class to account for shorter sequences.

We will use Keras to implement the model. You can read more about Keras at keras.io.

Implementation

Let’s start by importing the modules we’ll require for this project.

#Module Imports
from __future__ import print_function
import random
from os import listdir
import glob

import numpy as np
from scipy import misc
import tensorflow as tf
import h5py

from keras.datasets import mnist
from keras.utils import np_utils

import matplotlib.pyplot as plt
%matplotlib inline
Using TensorFlow backend.
#Setting the random seed so that the results are reproducible. 
random.seed(101)

#Setting variables for MNIST image dimensions
mnist_image_height = 28
mnist_image_width = 28
#Import MNIST data from keras
(X_train, y_train), (X_test, y_test) = mnist.load_data()
#Checking the downloaded data
print("Shape of training dataset: {}".format(np.shape(X_train)))
print("Shape of test dataset: {}".format(np.shape(X_test)))


plt.figure()
plt.imshow(X_train[0], cmap='gray')

print("Label for image: {}".format(y_train[0]))
Shape of training dataset: (60000, 28, 28)
Shape of test dataset: (10000, 28, 28)
Label for image: 5

Building synthetic data

The MNIST dataset is very popular for beginner Deep Learning projects. So, to add a twist to the tale, we’re going to recognize images that contain 1 to 5 digits. We’ll have to change the architecture of our deep learning model for this, but first we need to generate the dataset.

To generate the synthetic training data, we start by randomly picking 1 to 5 individual digits from the MNIST training set. The individual images are then stacked side by side, and blank images are appended to pad the sequence to 5 positions whenever fewer than 5 digits were picked. Because the digits are recombined at random, this approach can produce as many training examples as we like; we’ll build 60,000 of them.
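
As a rough illustration of the stitching step, the sketch below stacks placeholder arrays rather than real MNIST digits. The `fake_digits` list and its fill values are made up for the example; only the 28x28 tile size and the padding to five positions come from the description above.

```python
import numpy as np

# Hypothetical stand-ins for three 28x28 MNIST digits (constant-valued images).
fake_digits = [np.full((28, 28), value, dtype=np.uint8) for value in (50, 100, 150)]

# Pad with all-zero "blank" tiles until there are five tiles in total.
blank = np.zeros((28, 28), dtype=np.uint8)
tiles = fake_digits + [blank] * (5 - len(fake_digits))

# Stack the tiles side by side: five 28-pixel-wide tiles give a 28x140 strip.
strip = np.hstack(tiles)
print(strip.shape)  # (28, 140)
```

The strip always has the same shape regardless of how many real digits it holds, which is what lets every synthetic image be resized to one fixed input size afterwards.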

While concatenating the images, we’ll also build the label for each synthetic image: the labels of its individual digits, arranged in a tuple of 5. Labels 0-9 will be used for digits 0-9, and 10 will be used to indicate a blank.
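
The labelling scheme can be sketched in a few lines. The helper name `pad_labels` and the constants `BLANK` and `MAX_DIGITS` are hypothetical names chosen for this illustration; the values 10 and 5 are the ones described above.

```python
BLANK = 10      # label reserved for an empty position
MAX_DIGITS = 5  # every label tuple has exactly this many entries

def pad_labels(digits, max_digits=MAX_DIGITS, blank=BLANK):
    """Return a fixed-length label tuple, padding short sequences with blanks."""
    if not 1 <= len(digits) <= max_digits:
        raise ValueError("expected between 1 and {} digits".format(max_digits))
    return tuple(digits) + (blank,) * (max_digits - len(digits))

print(pad_labels([7, 1, 9, 7]))  # (7, 1, 9, 7, 10)
```

Because blanks are only ever appended, a blank label can never be followed by a real digit in the training data.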

The same approach will be used to build the test data, but using the MNIST test set for individual digits, for 10,000 synthetic test images.

Let’s write a function that does this.

def build_synth_data(data,labels,dataset_size):
    
    #Define synthetic image dimensions
    synth_img_height = 64
    synth_img_width = 64
    
    #Define synthetic data
    synth_data = np.ndarray(shape=(dataset_size,synth_img_height,synth_img_width),
                           dtype=np.float32)
    
    #Define synthetic labels
    synth_labels = [] 
    
    #Build one synthetic image (and its label) per iteration
    for i in range(0,dataset_size):

        #Pick a random number of digits (1 to 5) for this image
        num_digits = random.randint(1,5)

        #Randomly sample indices, used to extract both the digits and their labels
        s_indices = [random.randint(0,len(data)-1) for p in range(0,num_digits)]

        #Stitch the images together, indexing into the arrays passed to the function
        new_image = np.hstack([data[index] for index in s_indices])
        #Collect the matching labels
        new_label =  [labels[index] for index in s_indices]


        #Append (5 - num_digits) blank images and blank labels to pad the sequence
        for j in range(0,5-num_digits):
            new_image = np.hstack([new_image,np.zeros(shape=(mnist_image_height,
                                                                   mnist_image_width))])
            new_label.append(10) #10 is the label for a blank

        #Resize the stitched 28x140 strip to the synthetic image dimensions
        new_image = misc.imresize(new_image,(synth_img_height,synth_img_width))

        #Assign the image to synth_data
        synth_data[i,:,:] = new_image

        #Append the label tuple to synth_labels
        synth_labels.append(tuple(new_label))
        
    
    #Return the synthetic dataset
    return synth_data,synth_labels
#Building the training dataset
X_synth_train,y_synth_train = build_synth_data(X_train,y_train,60000)
#Building the test dataset
X_synth_test,y_synth_test = build_synth_data(X_test,y_test,10000)
#checking a sample
plt.figure()
plt.imshow(X_synth_train[232], cmap='gray')

y_synth_train[232]
(7, 1, 9, 7, 10)
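
The resizing step inside build_synth_data relies on scipy.misc.imresize, which was removed in SciPy 1.3. For readers on a newer SciPy, the sketch below shows one possible stand-in using only NumPy. It is an assumption-laden substitute, not the original method: the helper name `resize_nearest` is hypothetical, and it uses nearest-neighbour sampling, so the pixel values will not match what imresize produced.

```python
import numpy as np

def resize_nearest(image, out_height, out_width):
    """Resize a 2-D array by nearest-neighbour sampling (no interpolation)."""
    in_height, in_width = image.shape
    # For each output row/column, pick the source row/column it maps onto.
    rows = np.arange(out_height) * in_height // out_height
    cols = np.arange(out_width) * in_width // out_width
    return image[rows[:, None], cols]

# A 28x140 strip whose first tile is fully white, standing in for a stitched image.
strip = np.zeros((28, 140), dtype=np.uint8)
strip[:, :28] = 255
resized = resize_nearest(strip, 64, 64)
print(resized.shape)  # (64, 64)
```

Whichever resizing routine is used, squeezing a 28x140 strip into 64x64 stretches the digits vertically and compresses them horizontally, so each digit ends up much narrower than it is tall.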

Looks like things work as we expect them to. Let’s prepare the dataset and labels so that Keras can handle them.

Preparatory Preprocessing

Preprocessing Labels for model

The labels are going to be encoded as “One Hot” arrays, to make them compatible with Keras. Note that, as our Deep Learning model will have 5 classifiers, we’ll need 5 such One Hot arrays, one for each digit position in the image.
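
To make the target layout concrete, here is a small NumPy-only sketch of the same encoding. The two label tuples are made up, and indexing an identity matrix is just one convenient way to one-hot encode; it is shown here as an illustration rather than as the approach used in this project.

```python
import numpy as np

num_classes = 11  # digits 0-9 plus the blank class (10)

# Two made-up label tuples in the same format as the synthetic labels.
labels = np.array([(7, 1, 9, 7, 10),
                   (3, 10, 10, 10, 10)])

# Indexing the identity matrix with the labels one-hot encodes every entry,
# giving an array of shape (num_images, 5, num_classes).
one_hot = np.eye(num_classes)[labels]

# Split by digit position: five arrays, each of shape (num_images, num_classes).
per_position = [one_hot[:, pos, :] for pos in range(labels.shape[1])]
print(len(per_position), per_position[0].shape)  # 5 (2, 11)
```

Each of the five arrays becomes the training target for one of the five classifiers.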

#Converting labels to One-hot representations of shape (set_size,digits,classes)
possible_classes = 11

def convert_labels(labels):

    #As per Keras conventions, multiple outputs need a list of arrays: [array_digit0,...,array_digit4]
    #Each array has shape (number of images, possible_classes), e.g. (60000,11) for the training set

    #Convert the list of label tuples to an array of shape (number of images, 5)
    labels = np.asarray(labels)

    #Using np_utils from keras to OHE each digit position (column) in one call
    return [np_utils.to_categorical(labels[:,position],possible_classes)
            for position in range(labels.shape[1])]
train_labels = convert_labels(y_synth_train)
test_labels = convert_labels(y_synth_test)
#Checking the shape of the OHE array for the first digit position
np.shape(train_labels[0])
(60000, 11)
np_utils.to_categorical(y_synth_train[234][0],11)
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.]])

Preprocessing Images for model

The function below will pre-process the images so that they can be handled by Keras.

def prep_data_keras(img_data):
    
    #Reshaping data for keras, with tensorflow as backend
    img_data = img_data.reshape(len(img_data),64,64,1)
    
    #Converting everything to floats
    img_data = img_data.astype('float32')
    
    #Normalizing values between 0 and 1
    img_data /= 255
    
    return img_data
train_images = prep_data_keras(X_synth_train)
test_images = prep_data_keras(X_synth_test)
np.shape(train_images)
(60000, 64, 64, 1)
np.shape(test_images)
(10000, 64, 64, 1)

Model Building

#Importing relevant keras modules
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, Activation, Flatten, Input
from keras.layers import Convolution2D, MaxPooling2D

We’re going to use a Convolutional Neural Network (CNN) as our model.

The network starts with two 2D convolutional layers, each followed by a ReLU activation.

After the second convolutional layer and its ReLU, we’ll add 2D max pooling and dropout to make the model more robust to overfitting. A flattening layer then prepares the feature maps for a fully connected layer of 128 units with its own dropout. On top of it sit five classification layers, one per digit position: each is a Dense layer with as many units as there are classes (11 for us), activated using softmax to give the probability of each class.
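
It can help to tally the layer sizes by hand before training. The sketch below does that arithmetic using the hyperparameters from the model code (64x64 single-channel input, 32 filters of size 3x3, 2x2 pooling, a 128-unit dense layer, five 11-way outputs). It assumes stride-1 convolutions, with 'same' padding on the first and no padding on the second; the variable names are chosen for this illustration.

```python
# Shape bookkeeping for the network (stride-1 convolutions assumed).
height = width = 64
filters = 32
kernel = 3
num_classes = 11

# First convolution uses 'same' padding, so the spatial size is unchanged.
conv1_params = kernel * kernel * 1 * filters + filters

# Second convolution has no padding, trimming kernel - 1 pixels per dimension.
height, width = height - (kernel - 1), width - (kernel - 1)
conv2_params = kernel * kernel * filters * filters + filters

# 2x2 max pooling halves each spatial dimension.
height, width = height // 2, width // 2
flat_size = height * width * filters

# Fully connected layer, then five softmax heads sharing its 128 features.
dense_params = flat_size * 128 + 128
head_params = 5 * (128 * num_classes + num_classes)

total_params = conv1_params + conv2_params + dense_params + head_params
print(flat_size, total_params)  # 30752 3953047
```

Under these assumptions almost all of the roughly 3.95 million weights sit in the 128-unit dense layer, while the five classification heads together account for only about 7,000.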

#Building the model

batch_size = 128
nb_classes = 11
nb_epoch = 12

#image input dimensions
img_rows = 64
img_cols = 64
img_channels = 1

#number of convolution filters to use
nb_filters = 32
# size of pooling area for max pooling
pool_size = (2, 2)
# convolution kernel size
kernel_size = (3, 3)

#defining the input
inputs = Input(shape=(img_rows,img_cols,img_channels))

#Model taken from the Keras example. It worked well for single digits; not yet verified for sequences
cov = Convolution2D(nb_filters,kernel_size[0],kernel_size[1],border_mode='same')(inputs)
cov = Activation('relu')(cov)
cov = Convolution2D(nb_filters,kernel_size[0],kernel_size[1])(cov)
cov = Activation('relu')(cov)
cov = MaxPooling2D(pool_size=pool_size)(cov)
cov = Dropout(0.25)(cov)
cov_out = Flatten()(cov)


#Dense Layers
cov2 = Dense(128, activation='relu')(cov_out)
cov2 = Dropout(0.5)(cov2)



#Prediction layers
c0 = Dense(nb_classes, activation='softmax')(cov2)
c1 = Dense(nb_classes, activation='softmax')(cov2)
c2 = Dense(nb_classes, activation='softmax')(cov2)
c3 = Dense(nb_classes, activation='softmax')(cov2)
c4 = Dense(nb_classes, activation='softmax')(cov2)

#Defining the model
model = Model(input=inputs,output=[c0,c1,c2,c3,c4])

#Compiling the model
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

#Fitting the model
model.fit(train_images,train_labels,batch_size=batch_size,nb_epoch=nb_epoch,verbose=1,
          validation_data=(test_images, test_labels))
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
60000/60000 [==============================] - 944s - loss: 3.7553 - dense_4_loss: 0.9496 - dense_5_loss: 0.9061 - dense_6_loss: 0.7953 - dense_7_loss: 0.6590 - dense_8_loss: 0.4454 - dense_4_acc: 0.6825 - dense_5_acc: 0.7028 - dense_6_acc: 0.7378 - dense_7_acc: 0.7884 - dense_8_acc: 0.8626 - val_loss: 1.0372 - val_dense_4_loss: 0.2358 - val_dense_5_loss: 0.2332 - val_dense_6_loss: 0.2181 - val_dense_7_loss: 0.1906 - val_dense_8_loss: 0.1596 - val_dense_4_acc: 0.9471 - val_dense_5_acc: 0.9511 - val_dense_6_acc: 0.9551 - val_dense_7_acc: 0.9600 - val_dense_8_acc: 0.9570
Epoch 2/12
60000/60000 [==============================] - 923s - loss: 2.3002 - dense_4_loss: 0.5586 - dense_5_loss: 0.5510 - dense_6_loss: 0.4836 - dense_7_loss: 0.4172 - dense_8_loss: 0.2899 - dense_4_acc: 0.8151 - dense_5_acc: 0.8167 - dense_6_acc: 0.8378 - dense_7_acc: 0.8628 - dense_8_acc: 0.9041 - val_loss: 0.6792 - val_dense_4_loss: 0.1599 - val_dense_5_loss: 0.1576 - val_dense_6_loss: 0.1411 - val_dense_7_loss: 0.1201 - val_dense_8_loss: 0.1005 - val_dense_4_acc: 0.9607 - val_dense_5_acc: 0.9678 - val_dense_6_acc: 0.9688 - val_dense_7_acc: 0.9757 - val_dense_8_acc: 0.9795
...
...
Epoch 11/12
60000/60000 [==============================] - 989s - loss: 0.9621 - dense_4_loss: 0.2213 - dense_5_loss: 0.2163 - dense_6_loss: 0.2024 - dense_7_loss: 0.1902 - dense_8_loss: 0.1319 - dense_4_acc: 0.9194 - dense_5_acc: 0.9226 - dense_6_acc: 0.9249 - dense_7_acc: 0.9315 - dense_8_acc: 0.9522 - val_loss: 0.2832 - val_dense_4_loss: 0.0675 - val_dense_5_loss: 0.0698 - val_dense_6_loss: 0.0516 - val_dense_7_loss: 0.0445 - val_dense_8_loss: 0.0498 - val_dense_4_acc: 0.9801 - val_dense_5_acc: 0.9806 - val_dense_6_acc: 0.9847 - val_dense_7_acc: 0.9871 - val_dense_8_acc: 0.9885
Epoch 12/12
60000/60000 [==============================] - 1003s - loss: 0.9082 - dense_4_loss: 0.2061 - dense_5_loss: 0.2059 - dense_6_loss: 0.1902 - dense_7_loss: 0.1798 - dense_8_loss: 0.1262 - dense_4_acc: 0.9246 - dense_5_acc: 0.9250 - dense_6_acc: 0.9273 - dense_7_acc: 0.9345 - dense_8_acc: 0.9545 - val_loss: 0.2760 - val_dense_4_loss: 0.0670 - val_dense_5_loss: 0.0719 - val_dense_6_loss: 0.0525 - val_dense_7_loss: 0.0404 - val_dense_8_loss: 0.0442 - val_dense_4_acc: 0.9810 - val_dense_5_acc: 0.9812 - val_dense_6_acc: 0.9846 - val_dense_7_acc: 0.9888 - val_dense_8_acc: 0.9901

<keras.callbacks.History at 0x1228eaa90>
predictions = model.predict(test_images)
np.shape(predictions)
(5, 10000, 11)
len(predictions[0])
10000
np.shape(test_labels)
(5, 10000, 11)
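
The training log above reports one loss per output layer alongside an overall loss. When no loss weights are passed to compile, Keras adds the per-output losses with equal weight, and this can be checked against the logged Epoch 1 figures. The snippet is only a sanity check on those numbers; the variable names are made up for it.

```python
# Per-output training losses copied from the Epoch 1 line of the log.
head_losses = [0.9496, 0.9061, 0.7953, 0.6590, 0.4454]
reported_total = 3.7553

# Same check for the validation losses of Epoch 1.
val_head_losses = [0.2358, 0.2332, 0.2181, 0.1906, 0.1596]
reported_val_total = 1.0372

# The sums agree with the reported totals up to rounding in the log.
print(round(sum(head_losses), 4), reported_total)
print(round(sum(val_head_losses), 4), reported_val_total)
```

A practical consequence is that the optimizer treats all five digit positions as equally important, even though the later positions are blank more often and are therefore easier.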

We’ll define a custom function to calculate the accuracy of predicting individual digits, as well as of predicting complete sequences of digits.

def calculate_acc(predictions,real_labels):
    
    individual_counter = 0
    global_sequence_counter = 0
    for i in range(0,len(predictions[0])):
        #Reset sequence counter at the start of each image
        sequence_counter = 0 
        
        for j in range(0,5):
            if np.argmax(predictions[j][i]) == np.argmax(real_labels[j][i]):
                individual_counter += 1
                sequence_counter +=1
        
        if sequence_counter == 5:
            global_sequence_counter += 1
         
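    #Normalise by the actual number of images instead of hard-coded constants
    num_images = len(predictions[0])
    ind_accuracy = individual_counter/(5.0*num_images)
    global_accuracy = global_sequence_counter/float(num_images)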
    
    return ind_accuracy,global_accuracy
ind_acc,glob_acc = calculate_acc(predictions,test_labels)
print("The individual accuracy is {} %".format(ind_acc*100))
print("The sequence prediction accuracy is {} %".format(glob_acc*100))
The individual accuracy is 98.514 %
The sequence prediction accuracy is 92.86 %
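
The same two metrics can also be computed without explicit loops. The function below is a vectorised sketch, not the code that produced the figures above; the name `sequence_accuracies` and the toy inputs are assumptions made for the example.

```python
import numpy as np

def sequence_accuracies(predictions, real_labels):
    """Vectorised per-position and whole-sequence accuracy.

    Both arguments are lists of five arrays of shape (num_images, num_classes).
    """
    # argmax over the class axis, then transpose to (num_images, 5).
    predicted = np.argmax(np.asarray(predictions), axis=-1).T
    actual = np.argmax(np.asarray(real_labels), axis=-1).T
    matches = predicted == actual
    # Mean over all positions, and fraction of rows where every position matches.
    return matches.mean(), matches.all(axis=1).mean()

# Toy check: two images, one wrong digit in the first image.
eye = np.eye(11)
toy_actual = np.array([[0, 9, 7, 10, 10], [7, 6, 3, 2, 9]])
toy_predicted = np.array([[0, 8, 7, 10, 10], [7, 6, 3, 2, 9]])
toy_labels = [eye[toy_actual[:, pos]] for pos in range(5)]
toy_preds = [eye[toy_predicted[:, pos]] for pos in range(5)]

toy_ind_acc, toy_seq_acc = sequence_accuracies(toy_preds, toy_labels)
print(toy_ind_acc, toy_seq_acc)  # 0.9 0.5
```

In the toy data nine of the ten positions are correct but only one of the two sequences is fully correct, which shows how a single wrong digit costs the whole sequence.
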
#Printing some examples of real and predicted labels
for i in random.sample(range(0,10000),5):
    
    actual_labels = []
    predicted_labels = []
    
    for j in range(0,5):
        actual_labels.append(np.argmax(test_labels[j][i]))
        predicted_labels.append(np.argmax(predictions[j][i]))
        
    print("Actual labels: {}".format(actual_labels))
    print("Predicted labels: {}\n".format(predicted_labels))
    
Actual labels: [0, 9, 7, 10, 10]
Predicted labels: [0, 8, 7, 10, 10]

Actual labels: [0, 5, 7, 10, 10]
Predicted labels: [0, 5, 7, 10, 10]

Actual labels: [3, 8, 1, 10, 10]
Predicted labels: [3, 8, 1, 10, 10]

Actual labels: [6, 8, 1, 10, 10]
Predicted labels: [6, 8, 1, 10, 10]

Actual labels: [7, 6, 3, 2, 9]
Predicted labels: [7, 6, 3, 2, 9]
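
To turn a list of per-position labels like the ones printed above into a readable number, the blank class just needs to be dropped. The helper `labels_to_string` is a hypothetical name for this illustration.

```python
BLANK = 10  # class index used for an empty position

def labels_to_string(labels, blank=BLANK):
    """Join per-position class indices into a digit string, skipping blanks."""
    return "".join(str(label) for label in labels if label != blank)

print(labels_to_string([0, 9, 7, 10, 10]))  # 097
print(labels_to_string([7, 6, 3, 2, 9]))    # 76329
```

This sketch silently skips a blank wherever it occurs. A prediction with a blank between two digits never appears in the training data, so it may be worth flagging such cases rather than decoding them.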

We can see that the model achieved good accuracy: around 98.5% for identifying individual digits or blanks, and around 92.9% for identifying whole sequences.
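
The gap between the two figures is roughly what we would expect if the five positions failed independently of one another. That independence is an assumption made here for a back-of-the-envelope check, not something measured in this project.

```python
# Accuracies reported above, as fractions.
per_position_accuracy = 0.98514
sequence_accuracy = 0.9286

# If errors at the five positions were independent, a whole sequence would be
# correct with probability equal to the per-position accuracy to the fifth power.
expected_sequence_accuracy = per_position_accuracy ** 5
print(round(expected_sequence_accuracy, 4))  # 0.9279
```

The estimate of about 92.8% is close to the measured 92.86%. Keep in mind that the per-position figure includes blank positions, so it is not a pure digit-recognition accuracy.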