The goal of this blog is to explore the functionality of Keras for language-processing applications.
In the first section, I create a very simple one-word-in, one-word-out model based on a single sentence. With this application, I make sure that the model works in the simplest possible scenario and can correctly predict the next word given the current word for this training sentence.
In the second section, we apply a similar model, with more hidden units, to more than one sentence. Here I try to create an AI that tweets like President Trump. I will use President Trump's ~3,000 latest tweets to train the model. The data extraction procedure using tweepy was discussed in a previous post.
import sys
print(sys.version)
import keras
print("keras {}".format(keras.__version__))
import tensorflow as tf
print("tensorflow {}".format(tf.__version__))
import numpy as np
print("numpy {}".format(np.__version__))
import matplotlib.pyplot as plt
from keras.backend.tensorflow_backend import set_session
print(tf.__version__)
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.95
config.gpu_options.visible_device_list = "0"
#### visible_device_list picks the physical GPU on this machine: 1 = GPU1, 2 = GPU2, 0 = GPU3, 4 = GPU4
set_session(tf.Session(config=config))
Very, very simple example: one-word-in one-word-out model
The first training data is a single sentence: 11 words followed by an ellipsis and "YES!". I will create a simple LSTM model using this single sentence. The model should be able to overfit and reproduce this sentence!
# source text
data = """I want my deep learning model to guess this sentence perfectly ... YES!"""
First, create a Tokenizer object. Tokenizer creates a mapping between words and indices. The mapping is recorded in a dictionary: key = word, value = index. The dictionary can be accessed via "tokenizer.word_index".
Observations
- The word_index drops special characters, e.g., "..." or "!"
- All the words are converted to lower case
- The indexing starts from '1' NOT zero!
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
for key , value in tokenizer.word_index.items():
print("key:{:10} value:{:4}".format(key,value))
tokenizer.texts_to_sequences converts a string into a list of indices taken from tokenizer.word_index. You can see that the indices follow the order in which the words appear in the sentence.
encoded = tokenizer.texts_to_sequences([data])[0] ## [0] to extract the first sentence
print(encoded)
print("The sentence has {} words".format(len(encoded)))
Define the vocabulary size as the number of unique words + 1. This vocab_size is used as the number of classes to predict. The +1 is necessary to include class "0": no word in tokenizer.word_index is mapped to 0, so we can reserve class 0 as a placeholder (i.e., padding) class.
## + 1 for potential padding
vocab_size = len(tokenizer.word_index) + 1
print("The sentence has {} unique words".format(len(np.unique(tokenizer.word_index.keys()))))
print("> vocab_size={}".format(vocab_size ))
Prepare the training data X, y. As we are creating a one-word-in one-word-out model,
- X.shape = (N words in the sentence - 1,)
- y.shape = (N words in the sentence - 1,)
sequences = []
n_lookback = 1
for i in range(n_lookback,len(encoded)):
sequences.append(encoded[(i-n_lookback):i+1])
# split into X and y elements
sequences = np.array(sequences)
X, y = sequences[:,0],sequences[:,1]
print(X.shape,y.shape)
One-hot encode the outputs. Notice that num_classes is set to vocab_size, which is the number of unique words + 1. Printing y shows that no row has a 1 in the 0th or 1st column. This is because:
- no word in our original sentence "data" belongs to class 0.
- the word indexed with 1 appears only at the start of the sentence, hence it never appears in the target y.
from keras.utils import to_categorical
y = to_categorical(y, num_classes=vocab_size)
print(y.shape)
print(y)
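As a quick sanity check of the two points above (a small snippet I added, using the y defined here):
## classes 0 (reserved for padding) and 1 (the sentence-initial word) never occur as targets,
## so the first two columns of y should sum to zero
print(y[:, 0].sum(), y[:, 1].sum())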
Define the model and train it
The model is simple: one embedding layer, followed by one LSTM layer and then a feed-forward layer.
"Word embeddings" are a family of natural language processing techniques that aim at mapping the semantic meaning of each word into a geometric space.
Parameters
- Embedding layer: for each word, create a continuous vector of length 10 to represent it
- 130 parameters = "vocab_size" x 10
- LSTM layer: 10 hidden units, each with 4 gates
- 840 parameters = 10 hidden LSTM units x 4 (3 gates and 1 candidate state) x ((10 inputs + 1 bias) + 10 hidden LSTM units)
- Feed-forward layer:
- 143 parameters = (10 hidden LSTM units + 1 bias) x 13 classes (see the quick arithmetic check below)
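These counts can be reproduced with a little arithmetic. A quick check (my addition, using the layer sizes above):
dim_embedding, n_lstm, n_class = 10, 10, vocab_size   ## vocab_size = 13 here
print(n_class * dim_embedding)                        ## embedding: 13 x 10 = 130
print(4 * n_lstm * ((dim_embedding + 1) + n_lstm))    ## LSTM: 4 x 10 x 21 = 840
print((n_lstm + 1) * n_class)                         ## dense: 11 x 13 = 143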
from keras.models import Model
from keras.layers import Input, Dense, Activation, Embedding, LSTM
def define_model(vocab_size,
input_length=1,
dim_dense_embedding=10,
hidden_unit_LSTM=5):
main_input = Input(shape=(input_length,),dtype='int32',name='main_input')
embedding = Embedding(vocab_size, dim_dense_embedding,
input_length=input_length)(main_input)
x = LSTM(hidden_unit_LSTM)(embedding)
main_output = Dense(vocab_size, activation='softmax')(x)
model = Model(inputs=[main_input],
outputs=[main_output])
print(model.summary())
return(model)
model = define_model(vocab_size,
input_length=1,
dim_dense_embedding=10,
hidden_unit_LSTM=10)
# compile network
model.compile(loss='categorical_crossentropy',
optimizer='adam', metrics=['accuracy'])
# fit network
hist = model.fit(X, y, epochs=500, verbose=False)
The training accuracy is perfect, indicating that the model can predict the training sentence perfectly.
plt.plot(hist.history["acc"])
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.show()
Now check whether my model can correctly generate the training sentence. Generate a 13-word sequence starting with 'I'. It successfully reproduces the original sentence. The original sentence has 12 words, so the 13th word, predicted after "yes", could be anything. In this case the word after "yes" was predicted to be "to", but this value changes if you train with different initial values.
def predict_sentence(model,tokenizer,in_text,n_words):
index_word = {v: k for k,v in tokenizer.word_index.items()}
encoded = tokenizer.texts_to_sequences([in_text])[0]
encoded = np.array(encoded)
words = [in_text]
for _ in range(n_words-1):
## encoded is index
probs = model.predict(encoded, verbose=0)
encoded = np.argmax(probs,axis=1)[0]
word = index_word[encoded]
encoded = np.array([encoded])
words.append(word)
return(words)
n_words = 13
in_text = 'I'
pred_sentence = predict_sentence(model,tokenizer,in_text,n_words)
print("Predicted sentence with {} words:".format(n_words))
print(" ".join(pred_sentence))
Look at the probability distribution of the next word given the previous word.
# words is a list of vocabulary words ordered by index, so that words[i] is the word with index i (index 0 is the padding placeholder)
words = np.array(list(tokenizer.word_index.keys()))
words = list(words[np.argsort(list(tokenizer.word_index.values()))])
words = ["padding"] + words
print(words)
The estimated probability distribution for each word other than 'yes' has a large peak and is flat elsewhere. The peak is located at the next word; for example, the distribution for the word after 'deep' peaks at 'learning'. The distribution for the word after 'yes', however, is rather flat.
choice_words = ["yes",'deep','my']
for choice_word in choice_words:
encoded = tokenizer.texts_to_sequences([choice_word])[0]
encoded = np.array([encoded])
probs = model.predict(encoded).flatten()
y_pos = range(len(probs))
plt.figure(figsize=(14,1))
plt.bar(y_pos,probs)
plt.xticks(y_pos,words)
plt.title("The probability distribution: word after '{}'".format(choice_word))
plt.show()
Train an NLP model with President Trump's tweets
Now let's move on to a more complex application. In the previous example, we had only a single sentence to train the model. I will now use ~3,000 tweets from President Trump to train a deep learning model.
In the previous post, I showed how to extract President Trump's ~3,000 latest tweets using tweepy. I use this data to create a simple deep learning model that tweets like he does!
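For completeness, here is a rough sketch of the kind of tweepy extraction used in that post (my outline under stated assumptions, not the exact code; the credentials are placeholders and the created_at column is optional):
import tweepy
import pandas as pd
## placeholder credentials; substitute your own Twitter API keys
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)
## user_timeline only reaches back ~3,200 tweets, hence the ~3,000 tweets
rows = [{"created_at": status.created_at, "text": status.text}
        for status in tweepy.Cursor(api.user_timeline, screen_name="realDonaldTrump").items(3000)]
pd.DataFrame(rows).to_csv("data/realDonaldTrump_tweets.csv", index=False)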
import pandas as pd
data = pd.read_csv("data/realDonaldTrump_tweets.csv")
data.head(5)
Let's look at 10 randomly selected tweets from my dataframe. They show that the tweets contain many terms that appear only once, or terms that are not interesting for prediction. So I will first clean the texts.
random_index = np.random.choice(data.shape[0],10)
for index,k in enumerate(data["text"].iloc[random_index]):
print("")
print("irow={}".format(random_index[index]))
print(k)
Tweet cleaning strategies:
- remove quotes
- Ideally I would treat the opening and closing quotation marks as their own tokens. However, I found that Tokenizer does not always handle them that way: in a tweet like “I say so”, the leading quote stays attached to the first word (“I is treated as a single word) while the closing quote ” becomes its own token. For simplicity, I will simply drop quotes in this blog.
- remove URLs, hashtags (#...), and mentions (@...). Most of these appear only once, so including them substantially decreases the model's performance on the validation set.
import re
texts = []
for text in data["text"].values:
text = text.replace('"',"")
text = text.replace('“',"")
text = text.replace('”',"")
## remove link
text = re.sub(r'https?://.*', '', text, flags=re.MULTILINE)
## remove hashtag
text = re.sub(r'#.*', '', text, flags=re.MULTILINE)
## remove mention
text = re.sub(r'@.*', '', text, flags=re.MULTILINE)
texts.append(text)
data["text"] = texts
I found that this cleaning is very important for creating a meaningful model. Without it, the model's training accuracy did not rise above 0.05. I tried to solve this problem by substantially increasing the complexity of the model, but without much success. It seems that removing infrequently appearing words is a very useful approach.
This makes sense because removing these infrequently appearing words reduces the size of Tokenizer.word_index by more than 20% (1 - 5689/7300 ≈ 0.22).
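Just as a reference for that number, here is a small check I added; it assumes a copy of the uncleaned tweets was kept beforehand in a list raw_texts (e.g., raw_texts = list(data["text"]) taken before the cleaning loop above):
raw_tokenizer = Tokenizer()
raw_tokenizer.fit_on_texts(raw_texts)               ## vocabulary before cleaning (~7,300 here)
clean_tokenizer = Tokenizer()
clean_tokenizer.fit_on_texts(data["text"].values)   ## vocabulary after cleaning (~5,689 here)
reduction = 1.0 - float(len(clean_tokenizer.word_index)) / len(raw_tokenizer.word_index)
print("vocabulary reduced by {:.0%}".format(reduction))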
Now we create a mapping between words and indices. Tokenizer nicely filters out special characters.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data["text"].values)
vocab_size = len(tokenizer.word_index) + 1
index_word = {v: k for k,v in tokenizer.word_index.items()}
print("The sentence has {} unique words".format(len(np.unique(tokenizer.word_index.keys()))))
print("> vocab_size={}".format(vocab_size ))
Using the Tokenizer's word_index dictionary, each sentence can be represented just by its word indices. Let's see how the sentences are represented with word indices.
def print_text(ks):
for k in ks:
print("{}({})".format(index_word[k],k)),
print("")
for irow,line in enumerate(data["text"].iloc[random_index]):
encoded = tokenizer.texts_to_sequences([line])[0]
print("irow={}".format(random_index[irow]))
print_text(encoded)
print("")
- Restructure the sentence data.
- Currently each row is a sentence.
- We will change it so that every row corresponds to a single word to predict. For example, if there are two sentences "Make America Great Again" and "Thanks United States", this will create 5 rows: "- - Make America", "- Make America Great", "Make America Great Again", "- - Thanks United", "- Thanks United States".
- Divide the sentences into training and testing datasets.
- I make sure that all sub-sentences from the same original sentence go to the same dataset, so that no information leaks between training and testing.
N = data.shape[0]
prop_train = 0.8
Ntrain = int(N*prop_train)
Ntest = N - Ntrain
sequences, index_train, index_test = [], [], []
count = 0
for irow,line in enumerate(data["text"]):
encoded = tokenizer.texts_to_sequences([line])[0]
for i in range(1, len(encoded)):
sequence = encoded[:i+1]
sequences.append(sequence)
if irow < Ntrain:
index_train.append(count)
else:
index_test.append(count)
count += 1
print('Total Sequences: %d' % (len(sequences)))
The sequence lengths differ across the data. We prepend "0"s to make every sequence the same length; see pad_sequences.
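For illustration, here is how pre-padding behaves on a couple of toy index sequences (values made up just for this example):
from keras.preprocessing.sequence import pad_sequences
print(pad_sequences([[3, 7], [2, 5, 9]], maxlen=4, padding='pre'))
## [[0 0 3 7]
##  [0 2 5 9]]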
Transform the target variables into one-hot encoded vectors.
from keras.preprocessing.sequence import pad_sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = np.array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
X_train, y_train, X_test, y_test = X[index_train], y[index_train],X[index_test], y[index_test]
Model training starts here!
I made the model more complex than in the previous example by increasing the dimension of the dense embedding vector and increasing the number of hidden units in the LSTM.
The training accuracy keeps increasing, but the validation accuracy does not increase as much. This is reasonable considering the small training data size: the model is overfitting.
model = define_model(vocab_size,
input_length=X.shape[1],
dim_dense_embedding=30,
hidden_unit_LSTM=64)
# compile network
model.compile(loss='categorical_crossentropy',
optimizer='adam', metrics=['accuracy'])
# fit network
hist = model.fit(X_train, y_train,
validation_data = (X_test,y_test),
epochs=50, verbose=2,batch_size=128)
Plot validation accuracy and training accuracy
plt.figure(figsize=(8,8))
plt.plot(hist.history["val_acc"],label="val_acc")
plt.plot(hist.history["acc"],label="acc")
plt.legend()
plt.show()
Check the dimensions of the available weights
count_layer = 0
for layer in model.layers:
ws = layer.get_weights()
c = 0
for w in ws:
print("layer{}_{} {:20} {}".format(count_layer,
c,
layer.name,w.shape))
c += 1
count_layer += 1
Reduce the dimension of the word vectors using PCA and visualize their distribution in 2D
weight = model.layers[1].get_weights()[0]
print(weight.shape)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
y_embed_pca = pca.fit_transform(weight )
print(y_embed_pca.shape)
Observations
- Seemingly similar words are clustered in the same area:
- kim, saudi, radical, differently
- north, south
- estimates, statistically
- knives, dialogs
- independent, united
- thank, honor
fig, ax = plt.subplots(figsize=(25,25))
ax.scatter(y_embed_pca[:,0],y_embed_pca[:,1],c="white")
for txt, irow in tokenizer.word_index.items():
try:
ax.annotate(txt,
(y_embed_pca[irow,0],y_embed_pca[irow,1]))
except:
pass
ax.set_xlabel("pca embedding 1")
ax.set_ylabel("pca embedding 2")
plt.show()
Let President Trump's AI talk about some topics
I feed in the first few words, and let's see what the opinion of President Trump's AI is!! I randomly sample the next word according to the estimated probability distribution; this way, the outputs are different every time I feed in the same initial words.
Hmmm, the predicted tweets make sense, kind of?
Observations
- When "Make America" is provided as the first 2 words, the AI almost always predicts "great again" as the next words.
- When "North" is provided, the next word is almost always "Korea" followed often by some negative sentence.
- Sentences starting with "Obama is" tend to have a negative meaning.
def sample(probs):
return(np.random.choice(range(len(probs)),p=probs))
def predict_sentence(in_text,n_words,tokenizer,model,max_length):
words = []
for _ in range(n_words):
# encode the text as integer
enc = tokenizer.texts_to_sequences([in_text])[0]
# pre-pad sequences to a fixed length
enc_padding = pad_sequences([enc], maxlen=max_length-1, padding='pre')
probs = model.predict(enc_padding, verbose=0).flatten()
index = sample(probs)
word = index_word[index]
in_text += ' ' + word
return(in_text)
print(predict_sentence("North",n_words,tokenizer,model,max_length))
print(predict_sentence("America",n_words,tokenizer,model,max_length))
print(predict_sentence("I'm",n_words,tokenizer,model,max_length))
print(predict_sentence("I won't",n_words,tokenizer,model,max_length))
print(predict_sentence("MAKE AMERICA",n_words,tokenizer,model,max_length))
print(predict_sentence("american jobs",n_words,tokenizer,model,max_length))
print(predict_sentence("Obama is",n_words,tokenizer,model,max_length))
print(predict_sentence("Universal healthcare",n_words,tokenizer,model,max_length))
print(predict_sentence("H1B visa",n_words,tokenizer,model,max_length))
print(predict_sentence("Women",n_words,tokenizer,model,max_length))
print(predict_sentence("fat",n_words,tokenizer,model,max_length))
Next Steps
- Try transfer learning
- Use more tweets for training
- Predict the number of retweets/likes for a given tweet