Once you create a cool deep learning application, the next step is to deploy it so that anyone in the world can use it. In this blog post and the next one, I will explore simple ways to deploy deep learning models to a public cloud platform. This post focuses on the model development phase.
A quick Google search shows that there are various public clouds that let us deploy our model, for example AWS, Google Cloud, and Heroku.
The great advantage of AWS and Google Cloud is that they offer GPU instances. However, they require DevOps work to set up a server, i.e., it takes longer to get started. While Heroku does not offer GPU instances, it is VERY easy to set up and lets you host up to five applications FOR FREE. Heroku seems like a good place to learn deployment for the first time, so this blog post will use Heroku.
In this first blog post, I will create a simple deep learning model for sentiment analysis using TensorFlow. The model takes a tweet or a short text as input and returns the likelihood of happiness. Although it is relatively easy to deploy a prediction model to Heroku, this comes with some restrictions: for example, the maximum slug size is 500 MB, and a request must be processed by a worker within 30 seconds. Because of these restrictions, my focus in this post is to keep the model simple and the model weights small.
In the next post, I will discuss how to deploy this model on Heroku. If you are not interested in the model development, I recommend jumping straight to the second phase.
To motivate the readers, click here to see my deployed web app.
Sentiment analysis model development
Social media is a great place to share your thoughts about anything with anyone, and it has become an important source for learning public opinion. Twitter in particular is considered one of the most important platforms for this purpose because of its scale, the diversity of its topics, and the public access to its contents.
SemEval provides a public tweet dataset for its shared task on Sentiment Analysis in Twitter. The task ran in 2013, 2014, 2015, and 2016, attracting more than 40 participating teams in each edition. I will use their public labeled data to develop a model that tells the happiness level of a tweet.
I will first create a single-layer LSTM model. Later, I will try transfer learning using GloVe weights.
To start, I import necessary modules.
import matplotlib.pyplot as plt
import sys, time, os, warnings
import numpy as np
import pandas as pd
from collections import Counter
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # e.g. "2,3" would make only GPUs 2 and 3 visible
import tensorflow as tf
warnings.filterwarnings("ignore")
print("python {}".format(sys.version))
print("tensorflow version {}".format(tf.__version__))
def set_seed(sd=123):
    from numpy.random import seed
    from tensorflow import set_random_seed
    import random as rn
    ## numpy random seed
    seed(sd)
    ## core python's random number
    rn.seed(sd)
    ## tensorflow's random number
    set_random_seed(sd)
dir_data = "../Sentiment/2017_English_final/GOLD/Subtask_A/"
ls $dir_data
Load dataset
I will only use the training data because only the training data contains labels.
def extractData(path_data):
    file = open(path_data,'r')
    text = file.read()
    file.close()
    texts = text.split("\n")
    data = []
    for line in texts:
        cols = line.split("\t")
        data.append(cols)
    data = pd.DataFrame(data)
    return(data)
d = {}
for i in [3,5,6]:
    path_data = dir_data + "twitter-201{}train-A.txt".format(i)
    d[path_data] = extractData(path_data)
print("combine the data")
data = pd.concat(d)
data = data.reset_index()
data = data[[1,2]]
data.columns = ["class","text"]
## remove NaN
data = data.loc[~data["class"].isnull(),:]
Descriptive analysis
The distribution of the positive, neutral, and negative tweets is shown below; the numbers of positive and negative tweets are roughly the same.
c = Counter(data["class"].values)
tot = float(np.sum(list(c.values())))
labels = [str(k) + " " + str(np.round(i*100/tot, 3)) + "%" for k, i in c.items()]
x = range(len(c))
plt.bar(x, list(c.values()))
plt.xticks(x,labels)
plt.show()
Example tweets
Let's take a look at 20 randomly selected raw tweets.
Ntweet = 20
index = np.random.choice(range(len(data)),Ntweet)
for i in index:
    row = data.iloc[i,:]
    print("{:6}: {:}".format(row.iloc[0],row.iloc[1]))
Give integer labels to indicate the negative, neutral, and positive tweets
To simplify the problem and focus only on whether the tweet is happy, I will combine the negative and neutral tweets.
- Negative : 0
- Neutral : 0
- Positive : 1
This means that 42% of the tweets are labeled 0 and 58% are labeled 1.
classes = []
for cl in data["class"]:
    if cl == "negative":
        classes.append(0)
    elif cl == "positive":
        classes.append(1)
    elif cl == "neutral":
        classes.append(0)
    else:
        print("SHOULD NOT BE HERE")
data["class"] = classes
Text cleaning
I have to admit that this text cleaning section needs serious improvement... For example, I should remove URLs and hashtags (a rough sketch of that is included after the cleaning code below). I should also merge grammatically equivalent words, e.g., "I've", "I have", and "I hav" should all be treated the same. Serious text cleaning would improve the model performance substantially.
For now, I just do the following:
- get rid of quotes from the text
- insert a single space between a word and !, so that ! is treated as a single word, and do the same for ?
import re
from copy import copy
def clean_text(texts_original):
    texts = []
    for text in texts_original:
        otext = copy(text)
        text = otext.replace("!"," !").replace("?"," ?")
        text = text.replace('"',"")
        text = text.replace('“',"")
        text = text.replace('”',"")
        if text == "":
            print(otext)
        texts.append(text)
    return(texts)
data["text"] = clean_text(data["text"].values)
Tokenizer
Here I use Keras's preprocessing tokenizer to create a dictionary that maps each word string to an index ID. I will only keep the 5000 most common words; everything else is ignored.
from tensorflow.contrib.keras import preprocessing
nb_words = 5000
tokenizer = preprocessing.text.Tokenizer(nb_words)
tokenizer.fit_on_texts(data["text"].values)
vocab_size = nb_words + 1
index_word = {v: k for k,v in tokenizer.word_index.items()}
print("The sentence has {} unique words".format(len(np.unique(tokenizer.word_index.keys()))))
print("> vocab_size={}".format(vocab_size ))
Record the tokenizer
The tokenizer takes a sentence string and creates a list of index IDs. The same tokenizer also needs to be used during prediction to process the input sentence, so I need to save it. In general, a tokenizer object can be saved and loaded using pickle, as discussed on Stack Overflow:
import pickle
# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
However, this creates a 1.5 MB pickle object. This may or may not be too large for a Heroku application, which only allows a 500 MB slug and 30 seconds for each backend calculation. In practice, this tokenizer is quite simple: it just maps between a word and its word index. So instead of saving the tokenizer object itself, I will simply record the words in order of their IDs and save them to a csv file. The resulting file is as small as 33 KB.
mytokenizer = []
for i in range(1,nb_words+1):
    mytokenizer.append(index_word[i])
pd.DataFrame({"tokenizer":mytokenizer}).to_csv("tokenizer.csv",index=False)
The following code extracts the word_index and index_word dictionaries from the csv.
def get_word_index_from_csv():
    tokenizer = pd.read_csv("tokenizer.csv")["tokenizer"].values
    word_index = {}
    index_word = {}
    for index,word in enumerate(tokenizer,1):
        word_index[word]=index
        index_word[index]=word
    return(word_index,index_word)
word_index, index_word = get_word_index_from_csv()
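As a quick sanity check (this check is my own addition, not part of the original notebook), the csv round trip should reproduce the Keras tokenizer's mapping for every kept word:
for word in ["happy", "sad", "day"]:
    if word in word_index:
        ## the index read back from the csv must match the index assigned by the Keras tokenizer
        assert word_index[word] == tokenizer.word_index[word]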
Now using word_index, encode each sentence.
def texts_to_sequences(line):
    out = []
    for l in line.split():
        llower = l.lower()
        if llower in word_index:
            out.append(word_index[llower])
    return(out)
N = data.shape[0]
prop_train = 0.8
Ntrain = int(N*prop_train)
Ntest = N - Ntrain
sequences, index_train, index_test = [], [], []
count = 0
for irow,line in enumerate(data["text"]):
    encoded = texts_to_sequences(line)
    sequences.append(encoded)
    if irow < Ntrain:
        index_train.append(count)
    else:
        index_test.append(count)
    count += 1
print('Total Sequences: %d' % (len(sequences)))
Let's look at the example encoded tweets.
def print_text(encoded):
    '''
    encoded : a list containing index e.g. [1, 300, 2]
    index_word : dictionary {0 : "am", 1 : "I", 2 : ".",..}
    '''
    for k in encoded:
        print("{}({})".format(index_word[k],k)),
    print("")
set_seed(1)
random_index = np.random.choice(data.shape[0],20)
for irow,line in enumerate(data["text"].iloc[random_index]):
    encoded = texts_to_sequences(line)
    print("irow={}".format(random_index[irow]))
    print_text(encoded)
    print("")
Add zero padding to the sentences that are shorter than the maximum length. This zero-padding function will also be necessary during deployment.
X
- X has shape (N tweets, max tweet length)
- X is zero-padded when the original tweet is shorter than the max tweet length
def pad_pre_sequences(arr,maxlen):
    lines = []
    for iline in range(len(arr)):
        oline = arr[iline]
        lo = len(oline)
        if maxlen > lo:
            line = [0]*(maxlen - lo) + list(oline)
        else:
            line = oline[:maxlen]
        lines.append(line)
        if len(line) != maxlen:
            print(maxlen)
            print(line)
    lines = np.array(lines)
    return(lines)
max_length = max([len(seq) for seq in sequences])
X = pad_pre_sequences(sequences, maxlen=max_length)
print('Max Sequence Length: %d' % max_length)
One-hot encoding to create y
- y[i,0] == 1 if the i^th tweet is negative or neutral otherwise 0
- y[i,1] == 1 if the i^th tweet is positive otherwise 0
from keras.utils import to_categorical
y = to_categorical(data["class"].values, num_classes=2)
Split between training and testing data
X_train, y_train, X_test, y_test = X[index_train], y[index_train],X[index_test], y[index_test]
Model definition
Here, I define a deep learning model with an embedding layer and a single-layer LSTM followed by a single dense layer. Notice that I am importing Keras's models and layers through tensorflow. I could have imported these modules directly from Keras, but somehow Heroku gave me a warning when I tried to import keras, while importing tensorflow did not. This is why I use tensorflow throughout this blog post.
from tensorflow.contrib.keras import models
from tensorflow.contrib.keras import layers
def define_model(input_length,Embedding,dim_out=2):
    hidden_unit_LSTM = 4
    main_input = layers.Input(shape=(input_length,),dtype='int32',name='main_input')
    embedding = Embedding(main_input)
    x = layers.LSTM(hidden_unit_LSTM)(embedding)
    main_output = layers.Dense(dim_out, activation='softmax')(x)
    model = models.Model(inputs=[main_input],
                         outputs=[main_output])
    # compile network
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    return(model)
dim_dense_embedding = 50
Embedding1 = layers.Embedding(vocab_size,
dim_dense_embedding)
model1 = define_model(X.shape[1],Embedding1)
Training starts here:
import time
start = time.time()
hist1 = model1.fit(X_train,y_train,
validation_data=(X_test,y_test),
epochs=10,batch_size=64,
verbose=2)
end = time.time()
print("Time took: {:3.2f}MIN".format((end - start)/60.0))
Validation loss and validation accuracy over epochs
The validation loss increases very quickly: the model is overfitting. (One common mitigation is sketched after the plot below.)
def plot_loss(hist1):
    for label in hist1.history.keys():
        plt.plot(hist1.history[label],label=label)
    plt.legend()
    plt.show()
plot_loss(hist1)
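One standard way to fight this kind of overfitting (not used in the training runs of this post, so treat it as an optional sketch that assumes the contrib.keras API mirrors Keras's callbacks module) is to stop training once the validation loss stops improving:
from tensorflow.contrib.keras import callbacks
early_stop = callbacks.EarlyStopping(monitor='val_loss', patience=2)
## passing callbacks=[early_stop] to model.fit(...) halts training
## once val_loss has not improved for two consecutive epochs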
Attempt 2: GloVe pre-trained weights
The model performance was not very good. Let's try to improve it with transfer learning.
The researchers behind the GloVe method provide a suite of pre-trained word embeddings on their website, released under a public domain license. See: https://nlp.stanford.edu/projects/glove/
The smallest package of embeddings is called "glove.6B.zip".
The weights are trained on 6 billion tokens (words) with a vocabulary of 400,000 words. The embedding vectors come in four dimensions: 50d, 100d, 200d, and 300d.
wget 'nlp.stanford.edu/data/glove.6B.zip'
Unzipping it gives four .txt files:
- 164M Aug 4 2014 glove.6B.50d.txt
- 332M Aug 4 2014 glove.6B.100d.txt
- 662M Aug 4 2014 glove.6B.200d.txt
- 990M Aug 27 2014 glove.6B.300d.txt
dim_embedding = 50
path_GloVe = "../output/glove.6B.{}d.txt".format(dim_embedding)
def GloVe_embedding(path_GloVe):
    embeddings_index = {}
    with open(path_GloVe) as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    print("{} vocabs ".format(len(embeddings_index)))
    return(embeddings_index)
GloVe_embedding = GloVe_embedding(path_GloVe)
Let's see 50 example tokens in GloVe
Some tokens contain digits, and all tokens are in lower case.
set_seed(10)
isamples = np.random.choice(len(GloVe_embedding),50)
def print_words(words):
    count = 1
    for word in words:
        print("{:20}".format(word)),
        if count % 5 == 0:
            print("")
        count += 1
print_words(np.array(GloVe_embedding.keys())[isamples])
Now let's look at the words that do not appear in GloVe but do appear among the 5000 most frequent words in my SemEval data.
word_not_in_GloVe= np.array(list(set(word_index.keys()) - set(GloVe_embedding.keys())))
Nvoc = len(word_index)
print("Out of the {} most frequent words in the original data, {} exist in the GloVe vocabulary".format(
    Nvoc, Nvoc - len(word_not_in_GloVe)))
print("The following {} tokens ({:4.2f}%) do not exist in GloVe!!".format(
    len(word_not_in_GloVe), len(word_not_in_GloVe)*100/float(Nvoc)))
print("-"*100)
isamples = np.random.choice(len(word_not_in_GloVe),50)
print_words(word_not_in_GloVe[isamples])
Not all of the embedding vectors are necessary for training. I will simply extract the embedding vectors for the words that appear in our Twitter data.
# prepare embedding matrix
num_words = len(word_index) + 1
embedding_matrix = np.zeros((num_words, dim_embedding))
count = 0
for word, i in tokenizer.word_index.items():
    if i >= num_words:
        continue
    embedding_vector = GloVe_embedding.get(word)
    if embedding_vector is None:
        # words not found in the GloVe embedding index will remain all-zeros
        count += 1
    else:
        embedding_matrix[i] = embedding_vector
print(" {} tokens did not exist".format(count))
Define the model
This time, I provide the embedding matrix from GloVe and freeze its parameters. Notice that the model summary shows only 890 trainable weights; these are the weights of the LSTM and Dense layers (a quick arithmetic check follows the code below).
dim_dense_embedding = embedding_matrix.shape[1]
Embedding2 = layers.Embedding(vocab_size, dim_dense_embedding,
weights=[embedding_matrix],
trainable=False)
model2 = define_model(X.shape[1],Embedding2)
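As a sanity check on that 890 figure, here is my own back-of-the-envelope arithmetic (assuming hidden_unit_LSTM = 4 and dim_dense_embedding = 50 as above):
## LSTM: 4 gates x (input weights + recurrent weights + biases)
n_lstm  = 4 * (50 * 4 + 4 * 4 + 4)   # = 880
## Dense: 4 LSTM outputs x 2 classes + 2 biases
n_dense = 4 * 2 + 2                  # = 10
print(n_lstm + n_dense)              # 890 trainable weights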
Model training
start = time.time()
hist2 = model2.fit(X_train,y_train,
epochs=10,batch_size=64,
validation_data=(X_test,y_test),
verbose=2)
end = time.time()
print("Time took: {:3.1f}MIN".format((end - start)/60.0))
Validation loss and validation accuracy over epochs
The validation loss decreases more when using the pre-trained weights from GloVe.
plot_loss(hist2)
Save model 2
Save the model weights. During deployment, the model weights, together with the tokenizer's word index, are loaded at prediction time.
model2.save_weights('sentiment_weights.h5')
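Since the whole motivation for keeping the model small is Heroku's 500 MB slug limit (see the introduction), it is worth checking the sizes of the saved artifacts. A minimal check, assuming the two files above were written to the current working directory:
for fname in ['sentiment_weights.h5', 'tokenizer.csv']:
    size_mb = os.path.getsize(fname) / 1e6
    print("{:25} {:8.3f} MB".format(fname, size_mb))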
Model validation with example tweets
Hmm, the happiness levels make some sense?! In the next blog post, I will deploy this model as a web app.
def predict(line,model):
    encoded = texts_to_sequences(line)
    sequences = pad_pre_sequences([encoded], maxlen=max_length)
    probs = model.predict(sequences)[0]
    ## the probability of a positive tweet is recorded in the 1st position
    return(probs[1])
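## Rebuild the model, reload the saved weights, and re-read the word index below,
## mimicking what the deployed app will do at prediction time.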
model = define_model(X.shape[1],Embedding1)
model.load_weights('sentiment_weights.h5')
word_index,_ = get_word_index_from_csv()
texts = ["I feel happy!",
"What a great day!",
"Going to work.",
"I broke up with my boyfriend.",
"happy happy",
"happy happy happy happy so happy",
"Life sucks",
"I want to kill myself",
"kill",
"kill hate hate hate kill"]
for text in texts:
print("Prob(Happy Tweet)={:5.3f}, {:20}".format(predict(text,model2),text))