Yumi's Blog

Create deep learning calculators based on Encoder-Decoder RNN using Keras

Google search is great. You can ask any question you have and give you links to the potential solutions. It can sometimes give you the solution itself when the question is simple enough. One of such example is that google search can act as a calcualator. You can ask "1 + 1" or "1 + 1" to get the value 2. It somehow knows that the spaces are nuicense and return correct value. You can even ask the calculation with some strings. See pic below:

In [3]:
from keras.preprocessing.image import ImageDataGenerator,  img_to_array, load_img
Using TensorFlow backend.

Somehow, google understand that what I am asking is "32+123" and do the calculations. Can we do the same using deep learning model that takes a string as input?

I was fascinated by one of the Keras's examples in Github called addition rnn. This script shows the implementation of sequence to sequence learning for performing addition. The script considers the summation of two 3-digit numbers, for example, 123+420=543. The input of the model is a string "123+429" and output is "543".

In this blog post, we try to understand how the RNNs learn calculations. We will first consider very VERY simple example of summation of two 1-digits, then we will work on more complex senario.

In [2]:
import matplotlib.pyplot as plt
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
import keras
import sys
import warnings
print("python {}".format(sys.version))
print("keras version {}".format(keras.__version__))
print("tensorflow version {}".format(tf.__version__))
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.95
config.gpu_options.visible_device_list = "0"
#### 2 GPU1
#### 0 GPU3
#### 4 GPU4
#### 3 GPU2

def set_seed(sd=123):
    from numpy.random import seed
    from tensorflow import set_random_seed
    import random as rn
    ## numpy random seed
    ## core python's random number 
    ## tensor flow's random number
python 2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
keras version 2.0.6
tensorflow version 1.2.1

Create one-hot encoders that takes string and return one-hot encoded matrix of size as many as the number of characters in the string

This class is bollowed from addition rnn with no change.

In [3]:
class CharacterTable(object): 
    """Given a set of characters: 
     + Encode them to a one hot integer representation 
     + Decode the one hot integer representation to their character output 
     + Decode a vector of probabilities to their character output 
    def __init__(self, chars): 
        """Initialize character table. 
         # Arguments 
             chars: Characters that can appear in the input. 
        self.chars = sorted(set(chars)) 
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars)) 
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars)) 
    def encode(self, C, num_rows): 
        """One hot encode given string C. 
         # Arguments 
             num_rows: Number of rows in the returned one hot encoding. This is 
                 used to keep the # of rows for each data the same. 
        x = np.zeros((num_rows, len(self.chars))) 
        for i, c in enumerate(C): 
             x[i, self.char_indices[c]] = 1 
        return x 
    def decode(self, x, calc_argmax=True): 
        if calc_argmax: 
            x = x.argmax(axis=-1) 
        return ''.join(self.indices_char[x] for x in x) 
class colors: 
    ok = '\033[92m' 
    fail = '\033[91m' 
    close = '\033[0m' 

RNN model for summation of two single digits.

Generate 10,000 samples.

As we are only considering the summation of two single digits, there are only 10 x 10 = 100 possible samples. Sampling 10,000 samples means we are sampling identical samples 100 times on average.

In [4]:
import numpy as np

# Parameters for the model and dataset. 

# Maximum length of input is 'int + int' (e.g., '345+678'). Maximum length of 
# int is DIGITS. 

# All the numbers, plus sign and space for padding. 
chars = '0123456789+ ' 
ctable = CharacterTable(chars) 
f = lambda: int(''.join(np.random.choice(list('0123456789')) 
            for i in range(np.random.randint(1, DIGITS + 1)))) 
questions = [] 
expected = [] 
print('Generating data...') 
while len(questions) < TRAINING_SIZE: 
    if len(questions) % 1000 == 0:
           print("{} samples are generated...".format(len(questions)))
    a, b = f(), f()  
    q = '{}+{}'.format(a, b) 
    query = q + ' ' * (MAXLEN - len(q)) 
    ans = str(a + b) 
    # Answers can be of maximum size DIGITS + 1. 
    ans += ' ' * (DIGITS + 1 - len(ans))  
print('Total addition questions:{}'.format( len(questions)))
Generating data...
0 samples are generated...
1000 samples are generated...
2000 samples are generated...
3000 samples are generated...
4000 samples are generated...
5000 samples are generated...
6000 samples are generated...
7000 samples are generated...
8000 samples are generated...
9000 samples are generated...
Total addition questions:10000

Example data strings look like:

In [5]:
for q,a in zip(questions,expected)[:10]:
    print("{:3} = {:4}".format(q,a))
2+2 = 4   
6+1 = 7   
3+9 = 12  
6+1 = 7   
0+1 = 1   
9+0 = 9   
0+9 = 9   
3+4 = 7   
0+0 = 0   
4+1 = 5   

Using the class previously created, we one-hot encode each character.


There are 3 characters per sequence and each character could potential take 12 values ("0123456789+ "). The input of the model is 3 one-hot encoded vectors each of which represents a single character. Hence one sample sequence is represented as 3 by 12. The example of the one-hot encoded vectors ($\boldsymbol{x}_{i,t} \in R^{12}$ The time index take values: $t=1,2,3$.) are:

  • $\boldsymbol{x}_{i,t_1}=[0,0,0,0,0,0,1,0,0,0,0,0]$ indicates "6" as the 6th position (position counting starting from 0) is nonzero.

  • $\boldsymbol{x}_{i,t_2}=[0,0,0,0,0,0,0,0,0,0,1,0]$ indicates "+".

  • $\boldsymbol{x}_{i,t_3}=[1,0,0,0,0,0,0,0,0,0,0,0]$ indicates "0".

  • $[\boldsymbol{x}_{i,t_1},\boldsymbol{x}_{i,t_2},\boldsymbol{x}_{i,t_3}]$ together represents a single sentence "6+0"

  • $\boldsymbol{x}_{i,t_4}=[0,0,0,0,0,0,0,0,0,0,0,1]$ indicates empty space i.e., " "


The targets are 2 one-hot encoded vectors, and each one-hot encoded vector must have length 12. If the solution is a single digit (for example, the solution to "0+6" is "6" and "6" is a single digit), then solution has to be "6 ", which can be represetned as $[\boldsymbol{x}_{i,t_1}, \boldsymbol{x}_{i,t_4}]$ using the notation above. In practice we use $y$ to represent the target vectors.

$ \boldsymbol{y}_{i,k} \in [0,1]^{12}, k=1,2 $

In [6]:
def one_hot_encoder(expected,questions,x_dim,y_dim,chars,ctable):
    x = np.zeros((len(questions), x_dim, len(chars)), dtype=np.bool) 
    y = np.zeros((len(questions), y_dim, len(chars)), dtype=np.bool) 
    for i, sentence in enumerate(questions): 
         x[i] = ctable.encode(sentence, x_dim) 
    for i, sentence in enumerate(expected): 
         y[i] = ctable.encode(sentence, y_dim)
x, y = one_hot_encoder(expected,questions,MAXLEN,DIGITS + 1,chars,ctable)

Split between training and testing

In [7]:
def split_train_test(x,y):
    # Explicitly set apart 10% for validation data that we never train over. 
    split_at = len(x) - len(x) // 10 
    (x_train, x_val) = x[:split_at], x[split_at:] 
    (y_train, y_val) = y[:split_at], y[split_at:] 
    return (x_train, x_val),(y_train, y_val)
(x_train, x_val),(y_train, y_val) = split_train_test(x,y)
print('Training Data:') 
Training Data:
(9000, 3, 12)
(9000, 2, 12)

Model Definition

We consider encoder-decoder RNN models. The RNN is previously discussed here. The encoder-decoder model is often used in the field of machine translation, and it has been explained by many blogs. So I will not give you a nice smooth introduction, but interested readers should read (or at least see the graph) Peeking into the neural network architecture used for Google's Neural Machine Translation or How Does Attention Work in Encoder-Decoder Recurrent Neural Networks.

As the model's name suggests, from a high-level, the model is comprised of two sub-models: an encoder and a decoder.

  • Encoder: The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed length vector called a context vector.
  • Decoder: The decoder is responsible for stepping through the output time steps while reading from the context vector.

The original addition rnn considers the solutions to the summation of two 3-or-less-digit numbers. The model used was an encoder-decoder RNN with 208,529 parameters. As there are 1 million potential combination sample (1,000 x 1,000), this number of parameters may be necessary considering the complexity of the problem.

However, in my example, I only consider the summatoin of two single digit numbers, meaning that there are only 10 x 10 = 100 unique samples. The function with 100 if-else statements can do the perfect jobs in predicting the outcome. So I should be able to create a model with less than 100 parameters! Simpler model is preferable also because it is easier to understand.

So for now, we consider an encoder-decoder model with 77 parameters. The model defenition follows:

In [9]:

Encoder layer:

In total there are 48 parameters ( = 12x1 + 1x1 + 1 = 12 + 1 + 1 = 14)

$ \boldsymbol{w}_{e} \in R^{\textrm{HIDDEN_ENCODER x 12}},\boldsymbol{w}_{h_{e}} \in R^{\textrm{HIDDEN_ENCODER x HIDDEN_ENCODER}}, \boldsymbol{b}_e \in R^{\textrm{HIDDEN_ENCODER}} $

layer definition

$ \boldsymbol{e}_{i,0}=0\in R^{\textrm{HIDDEN_ENCODER}}\\ \boldsymbol{e}_{i,t} = \textrm{tanh}(\boldsymbol{x}_{i,t}^T \boldsymbol{w}_{e} + \boldsymbol{e}_{i,t-1}^T \boldsymbol{w}_{e} +\boldsymbol{b}_e) \\ $

Decoder layer:

In total there are 21 parameters ( = 3x1 + 3x3 + 3 = 3 + 9 + 3 = 15)

$ \boldsymbol{w}_{d} \in R^{\textrm{HIDDEN_DECODER x HIDDEN_ENCODER}},\boldsymbol{w}_{h_{d}} \in R^{\textrm{HIDDEN_DECODER x HIDDEN_DECODER}}, \boldsymbol{b}_d \in R^{\textrm{HIDDEN_DECODER}} $

layer definition


$ \boldsymbol{d}_{i,0}=0\in R^{\textrm{HIDDEN_DECODER}}\\ \boldsymbol{d}_{i,k} = \textrm{tanh}(\boldsymbol{e}_{i,3}^T \boldsymbol{w}_{d} + \boldsymbol{d}_{i,k-1}^T \boldsymbol{w}_{d} +\boldsymbol{b}_d) \\ $

Time-distributed Dense layer:

In total there are 48 parameters ( = 12x3 + 12 = 48)

$ \boldsymbol{w}_{dense}\in R^{\textrm{12 x HIDDEN_DECODER}}\boldsymbol{b}_{dense}\in R^{12}\\ $

layer defenition

$ \boldsymbol{y}_{i,k} = \textrm{softmax}(\boldsymbol{d}_{i,k}^T \boldsymbol{w}_{dense} + \boldsymbol{b}_{dense}) $

In [8]:
import keras.layers as layers
from keras.models import Model
from keras.layers.core import Dense

print('Build model...')


def define_model(MAXLEN_x,MAXLEN_y,chars,
    def scalarToList(a):
        if not isinstance(a,list):
            a = [a]
        lena = len(a)
    def return_seq(nlayer,lenEncoder):
        return sequence must be False if this layer is the final one
        return(False if nlayer == lenENCODER else True)


    inp = layers.Input(batch_shape=(None, MAXLEN_x, len(chars)),name="Input")  
    # "Encode" the input sequence using an RNN, producing an output of HIDDEN_SIZE. 
    encoder = RNN(HIDDEN_ENCODER[0], 
    nlayer = 1
    if lenENCODER > 1:
            for HE in HIDDEN_ENCODER[1:]:
                nlayer +=1
                encoder = RNN(HE, 
    rep_encoder = layers.RepeatVector(MAXLEN_y,name="repeat_vector")(encoder)
    # By setting return_sequences to True, return not only the last output but 
    # all the outputs so far in the form of (num_samples, timesteps, 
    # output_dim). This is necessary as TimeDistributed in the below expects 
    # the first dimension to be the timesteps.
    decoder = RNN(HIDDEN_DECODER[0],
    if lenDECODER > 1:
            for HE in HIDDEN_DECODER[1:]:
                decoder = RNN(HE, 
    # Apply a dense layer to the every temporal slice of an input. For each of step 
    # of the output sequence, decide which character should be chosen. 
    #out = layers.Dense(len(chars))(decoder)
    out = layers.TimeDistributed(layers.Dense(len(chars),
    model = Model(inputs=[inp],outputs=[out])
    encoder = Model(inputs=[inp],outputs=[rep_encoder])

model, encoder = define_model(MAXLEN,DIGITS + 1,chars,

Build model...
Layer (type)                 Output Shape              Param #   
Input (InputLayer)           (None, 3, 12)             0         
encoder1 (SimpleRNN)         (None, 1)                 14        
repeat_vector (RepeatVector) (None, 2, 1)              0         
decoder1 (SimpleRNN)         (None, 2, 3)              15        
time_distributed_dense (Time (None, 2, 12)             48        
Total params: 77
Trainable params: 77
Non-trainable params: 0

Define printing functions for model training

In [9]:
def print_predicted(x,y,model,ctable,colors):
    for i in range(x.shape[0]): 
        rowx, rowy = x[[i]], y[[i]]
        preds = model.predict(rowx, verbose=0) 
        preds = preds.argmax(axis=2)
        q = ctable.decode(rowx[0]) 
        correct = ctable.decode(rowy[0]) 
        guess = ctable.decode(preds[0], calc_argmax=False) 
        print('  Q:{}'.format(q)), 
        print('T:{}'.format( correct)), 
        if correct == guess: 
             print(colors.ok + '☑' + colors.close) 
             print(colors.fail + '☒' + colors.close) 

Model Training

In [10]:
def train(model,ctable,
    history = {'acc':[],'loss':[],'val_acc':[],'val_loss':[]}
    for iteration in range(1, nb_epochs): 
        hist = model.fit(x_train, y_train, 
                         validation_data=(x_val, y_val))
        ## printing 
        if iteration % print_every ==0:
            print('-' * 50)
            print('Iteration {}'.format(iteration))

        for key in hist.history.keys():
            if iteration % print_every ==0:

        if iteration % print_every ==0:    

        # Select 10 samples from the validation set at random so we can visualize 
        # errors. 
        if iteration % print_every ==0: 
            index = np.random.randint(0, len(x_val),10)
In [11]:

history = train(model,ctable,
((9000, 3, 12), (9000, 2, 12), (1000, 3, 12), (1000, 2, 12))
Iteration 400
acc:0.909 loss:0.435 val_acc:0.908 val_loss:0.437 
  Q:6+0 T:6  Model:6  
  Q:6+1 T:7  Model:7  
  Q:3+9 T:12 Model:12 
  Q:2+8 T:10 Model:10 
  Q:2+9 T:11 Model:11 
  Q:5+9 T:14 Model:13 
  Q:8+6 T:14 Model:13 
  Q:6+9 T:15 Model:12 
  Q:0+0 T:0  Model:5  
  Q:0+6 T:6  Model:6  
Iteration 800
acc:0.958 loss:0.233 val_acc:0.950 val_loss:0.234 
  Q:4+2 T:6  Model:6  
  Q:9+2 T:11 Model:11 
  Q:8+4 T:12 Model:12 
  Q:3+9 T:12 Model:12 
  Q:2+7 T:9  Model:9  
  Q:0+3 T:3  Model:4  
  Q:3+3 T:6  Model:6  
  Q:6+0 T:6  Model:6  
  Q:6+3 T:9  Model:9  
  Q:6+9 T:15 Model:15 
Iteration 1200
acc:0.985 loss:0.142 val_acc:0.985 val_loss:0.142 
  Q:3+2 T:5  Model:5  
  Q:1+7 T:8  Model:8  
  Q:7+4 T:11 Model:11 
  Q:6+7 T:13 Model:13 
  Q:3+1 T:4  Model:4  
  Q:5+8 T:13 Model:13 
  Q:9+1 T:10 Model:10 
  Q:6+5 T:11 Model:11 
  Q:5+7 T:12 Model:12 
  Q:1+2 T:3  Model:3  
Iteration 1600
acc:0.993 loss:0.103 val_acc:0.985 val_loss:0.105 
  Q:3+4 T:7  Model:7  
  Q:5+6 T:11 Model:11 
  Q:4+7 T:11 Model:11 
  Q:0+4 T:4  Model:4  
  Q:1+4 T:5  Model:5  
  Q:3+9 T:12 Model:12 
  Q:5+6 T:11 Model:11 
  Q:3+5 T:8  Model:8  
  Q:8+8 T:16 Model:16 
  Q:2+0 T:2  Model:3  
Iteration 2000
acc:1.000 loss:0.082 val_acc:1.000 val_loss:0.083 
  Q:2+7 T:9  Model:9  
  Q:3+3 T:6  Model:6  
  Q:2+5 T:7  Model:7  
  Q:1+2 T:3  Model:3  
  Q:1+2 T:3  Model:3  
  Q:9+6 T:15 Model:15 
  Q:3+7 T:10 Model:10 
  Q:5+2 T:7  Model:7  
  Q:2+3 T:5  Model:5  
  Q:8+4 T:12 Model:12 

The validation loss and the training loss

This small model performs surprisingly well. Since both training and testing data contain all possible samples, the model performance in the two data is the same.

In [12]:
def plot_loss_acc(history):
    fig = plt.figure(figsize=(20,8))
    labels = [["acc","val_acc"],["loss","val_loss"]]

    count = 1
    for ilab in range(len(labels)):
        ax = fig.add_subplot(2,1,count)
        count += 1
        for label in labels[ilab]:

Plot encoders

My encoder only has 1 dimention.

I plot the values of this encoder for every possible sample values.

The plot shows that 1 + 3, 2 + 2 and 3 + 1 receive the same encoder value. Similarly, 1 + 5, 2 + 4, 3 + 3, 4 + 2 and 1 + 5 receive the same encoder value.

In fact, plot shows that the ecoder seems to record the (scaled) values of summation. Encoders discard the infomation of what original digits were and remember only the summation values.

In [13]:
import seaborn as sns
import pandas as pd 

hidden = {"encoder": [],"num1":[],"num2":[]}
for num1 in range(10):
    for num2 in range(10):
        string = "{}+{}".format(num1,num2)
        myx = ctable.encode(string,MAXLEN)
        e = encoder.predict(myx.reshape(1,myx.shape[0],myx.shape[1]))[0]
        ## notice that the predicted encoders are duplicated twice k = 1 , 2 
        ## because of repeatVector
        ## we only extract one of the vector.
        if ~ np.all(e[0] == e[1]): ## should never be TRUE!
        h = e[0]
hidden = pd.DataFrame(hidden)
plt.title("encoder\n num1 + num2")

Deep learning calculators for more complex calculations

Now I consider creating a deep learning calculator for more complex calculations:

  • Mupltiple operations: summation (+), negation (-), multiplication (*) and devision (/).
  • Number of digits up to 3 (i.e., 0 - 999) as an input
  • Allow " " to appear in the input calculation e.g., "1 23 + 3 12" = "123+312". However, output is clean: it does not contain space in between digits.

This results in more characters in my character dictionary. In addition to '0123456789+ ', now I have '-*/Na'. Na arises because when a number is devided by 0 the output sould be "NaN ".

I consider integer devision. This means, for example, 1/3 = 0, 5/3=1.

I tried to train a model with the trainig data having the same size, i.e., 10,000. But it seemd that 10,000 samples is not enough to create a generalizable model. This makes sense because previously we only have 100 combinatiosn of problems (=10x10). But now, there are 4 million (4,000,000 = 4*1,000^2) combinations of problems (even excluding the spaces that can randomly appear in the left hand side of the equation). The problem is much more complex! I demonstrate the data training with 100,000 samples.

In [20]:
## "N" and "a" are  

chars2 = '0123456789+-*/Na ' 
print("The number of characters in dictionary: {}".format(len(chars2)))
ctable2 = CharacterTable(chars2) 

## Maximum length of the output would occur 
## when 99999x99999=9,999,800,001 which has length 10
MAXLEN_y = int(np.ceil(np.log10( ((10**DIGITS)-1)**2 )))

def frand(): 
    st = ''
    for i in range(DIGITS):
        st += str(np.random.choice(list('0123456789 ')))
    if st==' '*DIGITS:
          st = frand()          
    return st
## arithmetic operation
foperation = lambda: np.random.choice(list('+/*-')) 

def get_answer(a,b,oper):
    clean = lambda a: int(a.replace(" ","")) 
    a = clean(a)
    b = clean(b)
    if oper=="+":
        out = a + b
    elif oper=="-":
        out = a - b
    elif oper=="*":
        out = a*b
    elif oper=="/":
        if b == 0:
            out = "NaN"
            out = a/b

questions = [] 
expected = [] 
print('Generating data...') 
while len(questions) < TRAINING_SIZE: 
    if len(questions) % 10000 == 0:
           print("{} samples are generated...".format(len(questions)))

    a, b, oper = frand(), frand(), foperation() 
    # Pad the data with spaces such that it is always MAXLEN. 
    q = '{}{}{}'.format(a,oper, b) 
    query = q + ' ' * (MAXLEN_x - len(q)) 
    ans = str(get_answer(a,b,oper)) 
    # Answers can be of maximum size MAXLEN_y. 
    ans += ' ' * (MAXLEN_y - len(ans))  
print('Total addition questions:{}'.format( len(questions)))
The number of characters in dictionary: 17
Generating data...
0 samples are generated...
10000 samples are generated...
20000 samples are generated...
30000 samples are generated...
40000 samples are generated...
50000 samples are generated...
60000 samples are generated...
70000 samples are generated...
80000 samples are generated...
90000 samples are generated...
Total addition questions:100000

Example data strings look like:

In [15]:
for q, e in zip(questions,expected)[:20]:
    print("{} = {}".format(q,e))
226-13  = 213   
961+019 = 980   
093+400 = 493   
173*247 = 42731 
480+793 = 1273  
461/562 = 0     
83 *502 = 41666 
 62/446 = 0     
3 0-647 = -617  
671/ 57 = 11    
248/121 = 2     
359*081 = 29079 
335*979 = 327965
333-869 = -536  
639/666 = 0     
 34/310 = 0     
868-910 = -42   
134+761 = 895   
337/686 = 0     
447+009 = 456   

One-hot encoding and split between training and testing.

In [16]:
x, y = one_hot_encoder(expected,questions,MAXLEN_x,MAXLEN_y,chars2,ctable2)
(x_train, x_val),(y_train, y_val) = split_train_test(x,y)

Define Model

This time, I consider LSTM with more nodes and more layers for encoders!

Note we also consider the encoder structure with bottole neck encoder dimention to be 1. This model was converging prohibitably slow: e.g., at epoch 500, it was still about validation accuracy was still as low as 0.6. Due to the computational resource, we consider complex model that converges faster.

In [17]:

model2, encoder2 = define_model(MAXLEN_x,MAXLEN_y,chars2,

Layer (type)                 Output Shape              Param #   
Input (InputLayer)           (None, 7, 17)             0         
encoder1 (LSTM)              (None, 256)               280576    
repeat_vector (RepeatVector) (None, 6, 256)            0         
decoder1 (LSTM)              (None, 6, 256)            525312    
time_distributed_dense (Time (None, 6, 17)             4369      
Total params: 810,257
Trainable params: 810,257
Non-trainable params: 0


In [18]:
history2 = train(model2,ctable2,
Iteration 10
acc:0.799 loss:0.504 val_acc:0.801 val_loss:0.498 
  Q:405*934 T:378270 Model:359600 
  Q:799+472 T:1271   Model:1271   
  Q:33 *952 T:31416  Model:30966  
  Q:283+926 T:1209   Model:1211   
  Q:232+59  T:291    Model:399    
  Q: 88/154 T:0      Model:0      
  Q:412*280 T:115360 Model:110880 
  Q:787+074 T:861    Model:861    
  Q:2 3+ 39 T:62     Model:50     
  Q:1 7/397 T:0      Model:0      
Iteration 20
acc:0.904 loss:0.256 val_acc:0.886 val_loss:0.296 
  Q:077/558 T:0      Model:0      
  Q:765* 60 T:45900  Model:46300  
  Q:037+3 8 T:75     Model:75     
  Q: 54-721 T:-667   Model:-667   
  Q:308-666 T:-358   Model:-358   
  Q:976/619 T:1      Model:1      
  Q: 19/127 T:0      Model:0      
  Q:982+899 T:1881   Model:1881   
  Q:426/669 T:0      Model:0      
  Q:8  -5 2 T:-44    Model:-44    
Iteration 30
acc:0.929 loss:0.197 val_acc:0.894 val_loss:0.293 
  Q:236*520 T:122720 Model:123720 
  Q:525/402 T:1      Model:1      
  Q:012-869 T:-857   Model:-857   
  Q:115*615 T:70725  Model:66925  
  Q:419-66  T:353    Model:353    
  Q:838+4 3 T:881    Model:881    
  Q:932*006 T:5592   Model:6632   
  Q: 58*619 T:35902  Model:35742  
  Q:144/353 T:0      Model:0      
  Q:898* 04 T:3592   Model:3992   
Iteration 40
acc:0.952 loss:0.139 val_acc:0.889 val_loss:0.359 
  Q:923*709 T:654407 Model:655407 
  Q:709- 43 T:666    Model:666    
  Q:64 /269 T:0      Model:0      
  Q:172+618 T:790    Model:790    
  Q:234- 08 T:226    Model:226    
  Q:152-313 T:-161   Model:-161   
  Q:62 *513 T:31806  Model:32346  
  Q:354-1 7 T:337    Model:337    
  Q:535+604 T:1139   Model:1139   
  Q:934*7 2 T:67248  Model:68768  
Iteration 50
acc:0.968 loss:0.099 val_acc:0.889 val_loss:0.424 
  Q:3 2/525 T:0      Model:0      
  Q:592/864 T:0      Model:0      
  Q:616*016 T:9856   Model:9034   
  Q:53 -08  T:45     Model:55     
  Q:92 +145 T:237    Model:237    
  Q:488*419 T:204472 Model:205772 
  Q:020/312 T:0      Model:0      
  Q:151*607 T:91657  Model:96957  
  Q:507-537 T:-30    Model:-30    
  Q:2 8*649 T:18172  Model:17812  
Iteration 60
acc:0.983 loss:0.058 val_acc:0.889 val_loss:0.505 
  Q:192*998 T:191616 Model:188778 
  Q:290+24  T:314    Model:314    
  Q:540*034 T:18360  Model:18360  
  Q:368*910 T:334880 Model:330080 
  Q: 07-417 T:-410   Model:-410   
  Q:0 1*642 T:642    Model:642    
  Q: 99/219 T:0      Model:0      
  Q:01 *872 T:872    Model:872    
  Q:8 0*394 T:31520  Model:31780  
  Q:178/654 T:0      Model:0      
Iteration 70
acc:0.989 loss:0.039 val_acc:0.883 val_loss:0.623 
  Q: 22*841 T:18502  Model:17422  
  Q:298-465 T:-167   Model:-167   
  Q: 94-619 T:-525   Model:-525   
  Q:3 4+035 T:69     Model:69     
  Q:367*213 T:78171  Model:77231  
  Q:6  +0 8 T:14     Model:14     
  Q:215+211 T:426    Model:426    
  Q:035+695 T:730    Model:730    
  Q:3 3*615 T:20295  Model:20765  
  Q:845+416 T:1261   Model:1261   
Iteration 80
acc:0.990 loss:0.037 val_acc:0.884 val_loss:0.670 
  Q:121+944 T:1065   Model:1065   
  Q: 08+933 T:941    Model:941    
  Q:368+626 T:994    Model:994    
  Q:751-895 T:-144   Model:-144   
  Q:913/495 T:1      Model:1      
  Q:305/381 T:0      Model:0      
  Q:511+  7 T:518    Model:528    
  Q: 19/187 T:0      Model:0      
  Q:293/ 11 T:26     Model:26     
  Q:040-889 T:-849   Model:-849   
Iteration 90
acc:0.993 loss:0.027 val_acc:0.884 val_loss:0.710 
  Q:731/15  T:48     Model:48     
  Q:013*567 T:7371   Model:7471   
  Q:008*013 T:104    Model:144    
  Q:424*260 T:110240 Model:100460 
  Q: 95*529 T:50255  Model:58135  
  Q: 67*294 T:19698  Model:20058  
  Q:33 *952 T:31416  Model:32756  
  Q:142+201 T:343    Model:343    
  Q:806/559 T:1      Model:1      
  Q:648-055 T:593    Model:593    

Plot validation and accuracy over epochs

The validation loss starts increasing after 30 epochs indicating that the model overfits. On the other hand, the overfitting behavior cannot be seen from the validation accuracy (it remains the same (and not decreasing) after the 40 epochs.) This is probably because the accuracy is a more "corse" measure than the "softmax-categorical_crossentropy" which was used as our loss.

I would increase training sample size or change the model structure to improve the model performance.

In [19]:

Model performance

The code block below shows the randomly selected 100 validation samples. It is clear that the model performance is reasonable in many problems but the performance becomes poor when the solution requires more than 4 digits.

In [30]:
random_index = np.random.choice(x_val.shape[0],100,replace=False)
  Q:478+629 T:1107   Model:1107   
  Q:383*249 T:95367  Model:944977 
  Q:231-94  T:137    Model:137    
  Q:841/232 T:3      Model:3      
  Q:151+448 T:599    Model:597    
  Q:539-661 T:-122   Model:-122   
  Q:9 5*474 T:45030  Model:44070  
  Q:801+180 T:981    Model:981    
  Q:767/260 T:2      Model:2      
  Q:079/983 T:0      Model:0      
  Q:180+926 T:1106   Model:1106   
  Q:411+515 T:926    Model:926    
  Q: 78/983 T:0      Model:0      
  Q:067+029 T:96     Model:96     
  Q:595+ 76 T:671    Model:671    
  Q:182/691 T:0      Model:0      
  Q:  5+800 T:805    Model:805    
  Q:238-617 T:-379   Model:-379   
  Q: 05- 64 T:-59    Model:-59    
  Q:734-59  T:675    Model:675    
  Q:53 +037 T:90     Model:80     
  Q:326-157 T:169    Model:169    
  Q:681+695 T:1376   Model:1376   
  Q:1 0/6 9 T:0      Model:0      
  Q:825/518 T:1      Model:1      
  Q:3 4/809 T:0      Model:0      
  Q:097+312 T:409    Model:409    
  Q:435*892 T:388020 Model:388180 
  Q:699+420 T:1119   Model:1109   
  Q:2 8/985 T:0      Model:0      
  Q:574+725 T:1299   Model:1299   
  Q:35 -581 T:-546   Model:-546   
  Q: 98/171 T:0      Model:0      
  Q:679-237 T:442    Model:442    
  Q:088+401 T:489    Model:489    
  Q:390/007 T:55     Model:52     
  Q:577/091 T:6      Model:6      
  Q:887*909 T:806283 Model:811773 
  Q:225+324 T:549    Model:549    
  Q:165/0 0 T:NaN    Model:NaN    
  Q:551- 1  T:550    Model:550    
  Q:15 -747 T:-732   Model:-732   
  Q: 63+520 T:583    Model:583    
  Q:251+158 T:409    Model:309    
  Q:698-518 T:180    Model:180    
  Q:3 1+980 T:1011   Model:1011   
  Q:970/902 T:1      Model:1      
  Q:183*53  T:9699   Model:1669   
  Q:294-213 T:81     Model:71     
  Q:707+400 T:1107   Model:1107   
  Q:402+846 T:1248   Model:1248   
  Q:533/971 T:0      Model:0      
  Q:2 7+549 T:576    Model:576    
  Q:654-805 T:-151   Model:-151   
  Q:974/609 T:1      Model:1      
  Q:783+540 T:1323   Model:1323   
  Q:01 *835 T:835    Model:835    
  Q:068*650 T:44200  Model:44400  
  Q:071-566 T:-495   Model:-495   
  Q:390/821 T:0      Model:0      
  Q:794+954 T:1748   Model:1748   
  Q: 17-323 T:-306   Model:-306   
  Q:386-042 T:344    Model:344    
  Q:797*354 T:282138 Model:270618 
  Q:1  /44  T:0      Model:0      
  Q:531+299 T:830    Model:820    
  Q:357*6 8 T:24276  Model:22716  
  Q:862*608 T:524096 Model:528016 
  Q:7 5*36  T:2700   Model:2600   
  Q:326/ 13 T:25     Model:20     
  Q:538+214 T:752    Model:752    
  Q:713/ 54 T:13     Model:13     
  Q:126+013 T:139    Model:139    
  Q:906* 2  T:1812   Model:1972   
  Q:39 /0 7 T:5      Model:5      
  Q:664*5 7 T:37848  Model:37348  
  Q:790+914 T:1704   Model:1704   
  Q:909*1 6 T:14544  Model:14564  
  Q:49 *529 T:25921  Model:25981  
  Q:299+308 T:607    Model:607    
  Q:040+503 T:543    Model:543    
  Q:47 -088 T:-41    Model:-41    
  Q:525/3 1 T:16     Model:17     
  Q:479+981 T:1460   Model:1460   
  Q:731-358 T:373    Model:373    
  Q:96 *856 T:82176  Model:81536  
  Q: 65-541 T:-476   Model:-476   
  Q:806+323 T:1129   Model:1129   
  Q:099* 73 T:7227   Model:7287   
  Q:481-06  T:475    Model:475    
  Q: 89/6 4 T:1      Model:1      
  Q:080/919 T:0      Model:0      
  Q:917-741 T:176    Model:176    
  Q:449-565 T:-116   Model:-116   
  Q:851-67  T:784    Model:794    
  Q:427/083 T:5      Model:5      
  Q:76 *00  T:0      Model:0      
  Q:101+30  T:131    Model:131    
  Q:961*179 T:172019 Model:168879 
  Q:6 2+ 5  T:67     Model:67