Yumi's Blog

The first deep learning model for NLP - Let AI tweet like President Trump -

The goal of this blog is to learn the functionalities of Keras for language processing applications.

In the first section, I create a very simple one-word-in, one-word-out model based on a single sentence. With this toy application, I make sure that the model works in the simplest possible scenario and can correctly predict the next word given the current word for this training sentence.

In the second section, we apply a similar but larger model to more than one sentence. Here I try to create an AI that tweets like President Trump. I will use President Trump's latest ~3,000 tweets to train the model. The data extraction procedure using tweepy was discussed in a previous post.

In [1]:
import sys 
print(sys.version)
import keras 
print("keras {}".format(keras.__version__))
import tensorflow as tf
print("tensorflow {}".format(tf.__version__))
import numpy as np
print("numpy {}".format(np.__version__))
import matplotlib.pyplot as plt
2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Using TensorFlow backend.
keras 2.0.6
tensorflow 1.2.1
numpy 1.11.3
In [2]:
from keras.backend.tensorflow_backend import set_session
print(tf.__version__)
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.95
config.gpu_options.visible_device_list = "0"
#### visible_device_list mapping on this machine: "1" -> GPU1, "2" -> GPU2, "0" -> GPU3, "4" -> GPU4
set_session(tf.Session(config=config))
1.2.1

Very very simple example: One-word-in one-word-out model

The first training data is a single sentence with 11 words, followed by an ellipsis and "YES!". I will create a simple LSTM model using this single sentence. The model should be able to overfit and reproduce this sentence!

In [3]:
# source text
data = """I want my deep learning model to guess this sentence perfectly ... YES!"""

First, create a Tokenizer object. The Tokenizer creates a mapping between words and indices. The mapping is recorded in a dictionary: key = word, value = index. The dictionary can be accessed via "tokenizer.word_index".

Observations

  • The word_index drops special characters, e.g., ... or !
  • All words are converted to lower case
  • The indexing starts from 1, NOT zero!
In [4]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
for key , value in tokenizer.word_index.items():
    print("key:{:10} value:{:4}".format(key,value))
key:this       value:   9
key:guess      value:   8
key:want       value:   2
key:sentence   value:  10
key:i          value:   1
key:deep       value:   4
key:to         value:   7
key:learning   value:   5
key:model      value:   6
key:perfectly  value:  11
key:my         value:   3
key:yes        value:  12

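These behaviors come from the Tokenizer's default arguments; a quick way to inspect them (a small check added here; filters and lower are attributes of Keras' Tokenizer):

## The default filters string strips punctuation such as "." and "!",
## and lower=True lowercases every word before it is indexed.
print(tokenizer.filters)
print(tokenizer.lower)
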
tokenizer.texts_to_sequences converts a string to a list of indices. The indices come from tokenizer.word_index. You can see that the indices appear in the order of the words in the sentence.

In [5]:
encoded = tokenizer.texts_to_sequences([data])[0] ## [0] to extract the first sentence 
print(encoded)
print("The sentence has {} words".format(len(encoded)))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
The sentence has 12 words

Set the vocabulary size to the number of unique words + 1. This vocab_size is used as the number of classes to predict. The +1 is necessary to include class "0": tokenizer.word_index contains no word with index 0, so we can use class 0 as a placeholder class (i.e., a padding class).

In [6]:
## + 1 for potential padding
vocab_size = len(tokenizer.word_index) + 1
print("The sentence has {} unique words".format(len(np.unique(tokenizer.word_index.keys()))))
print("> vocab_size={}".format(vocab_size ))
The sentence has 12 unique words
> vocab_size=13

Prepare the training data X, y. As we are creating a one-word-in, one-word-out model,

  • X.shape = (N words in sentence - 1,)
  • y.shape = (N words in sentence - 1,)
In [7]:
sequences = []
n_lookback = 1
for i in range(n_lookback,len(encoded)):
    sequences.append(encoded[(i-n_lookback):i+1])
# split into X and y elements
sequences = np.array(sequences)
X, y = sequences[:,0],sequences[:,1]
print(X.shape,y.shape)
((11,), (11,))

One-hot encode the outputs. Notice that num_classes is set to vocab_size, which is the number of unique words + 1. The printout of y shows that no row has a one in the 0th or 1st column. This is because:

  • no word in our original sentence "data" belongs to class 0.
  • the word indexed with 1 ("i") appears only at the start of the sentence, hence it never appears in the target y.
In [8]:
from keras.utils import to_categorical
y = to_categorical(y, num_classes=vocab_size)
print(y.shape)
print(y)
(11, 13)
[[ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.]]

Define the model and train it

The model is simple: one embedding layer, followed by one LSTM layer and then a feed-forward (dense) layer.

"Word embeddings" are a family of natural language processing techniques aiming at mapping semantic meaning of each word into a geometric space.

Parameters

  • Embedding layer: for each word, create a continuous vector of length 10 to represent it
    • 130 parameters = vocab_size (13) x 10
  • LSTM layer: 10 hidden units, each with 4 gates
    • 840 parameters = 4 (3 gates and 1 cell state) x 10 hidden units x (10 embedding inputs + 10 recurrent units + 1 bias)
  • Feed-forward layer:
    • 143 parameters = (10 hidden LSTM units + 1 bias) x 13 classes
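These counts can be verified with a few lines of plain arithmetic (a quick check added here, using the layer sizes above; it is not part of the original notebook):

# Sanity check of the parameter counts above (plain arithmetic, no Keras needed)
vocab_size_toy = 13   # 12 unique words + 1 padding class
dim_embedding = 10    # embedding vector length
n_lstm = 10           # LSTM hidden units
print(vocab_size_toy * dim_embedding)               # embedding: 130
print(4 * n_lstm * (dim_embedding + n_lstm + 1))    # LSTM: 840
print((n_lstm + 1) * vocab_size_toy)                # dense: 143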
In [9]:
from keras.models import Model
from keras.layers import Input, Dense, Activation, Embedding, LSTM


def define_model(vocab_size,
                 input_length=1,
                 dim_dense_embedding=10,
                 hidden_unit_LSTM=5):
    main_input = Input(shape=(input_length,),dtype='int32',name='main_input')
    embedding = Embedding(vocab_size, dim_dense_embedding, 
                         input_length=input_length)(main_input)
    x = LSTM(hidden_unit_LSTM)(embedding)
    main_output = Dense(vocab_size, activation='softmax')(x)
    model = Model(inputs=[main_input],
                  outputs=[main_output])
    print(model.summary())
    return(model)



model = define_model(vocab_size,
                       input_length=1,
                       dim_dense_embedding=10,
                       hidden_unit_LSTM=10)
# compile network
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', metrics=['accuracy'])
# fit network
hist = model.fit(X, y, epochs=500, verbose=False)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
main_input (InputLayer)      (None, 1)                 0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1, 10)             130       
_________________________________________________________________
lstm_1 (LSTM)                (None, 10)                840       
_________________________________________________________________
dense_1 (Dense)              (None, 13)                143       
=================================================================
Total params: 1,113
Trainable params: 1,113
Non-trainable params: 0
_________________________________________________________________
None

The training accuracy reaches 1.0, indicating that the model can predict the training sentence perfectly.

In [10]:
plt.plot(hist.history["acc"])
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.show()

Now check whether the model can correctly regenerate the training sentence. Generate a 13-word sentence starting with 'I'. It successfully reproduces the original sentence. The original sentence has 12 words, so the 13th word, predicted after "yes", could be anything. In this case the word after "yes" was predicted to be "to", but this value changes if you train with different initial values.

In [11]:
def predict_sentence(model,tokenizer,in_text,n_words):
    ## reverse mapping: index -> word
    index_word = {v: k for k,v in tokenizer.word_index.items()}
    ## encode the seed word as its index
    encoded = tokenizer.texts_to_sequences([in_text])[0]
    encoded = np.array(encoded)
    words = [in_text]
    for _ in range(n_words-1):
        ## predict the class probabilities of the next word given the current index
        probs = model.predict(encoded, verbose=0)
        ## greedily pick the most likely next word
        encoded = np.argmax(probs,axis=1)[0]
        word = index_word[encoded] 
        encoded = np.array([encoded])
        words.append(word)
    return(words)

n_words = 13
in_text = 'I'
pred_sentence = predict_sentence(model,tokenizer,in_text,n_words)
print("Predicted sentence with {} words:".format(n_words))
for k in pred_sentence:
    print(k),
Predicted sentence with 13 words:
I want my deep learning model to guess this sentence perfectly yes to

Look at the probability distribution of the next word given the previous word.

In [12]:
# words: a list of vocabulary words ordered by index, so that words[idx] is the word with index idx ("padding" fills index 0)
words = tokenizer.word_index.keys()
words = np.array(words)
words = list(words[np.argsort(tokenizer.word_index.values())])
words = ["padding"] + words 
print(words)
['padding', 'i', 'want', 'my', 'deep', 'learning', 'model', 'to', 'guess', 'this', 'sentence', 'perfectly', 'yes']

The estimated probability distribution of the next word has a sharp peak for every word except 'yes', and is nearly flat elsewhere. The peak is located at the next word in the training sentence. For example, the distribution of the word after 'deep' peaks at 'learning'. However, the distribution of the word after 'yes' is rather flat.

In [13]:
choice_words = ["yes",'deep','my']
for choice_word in choice_words:
    encoded = tokenizer.texts_to_sequences([choice_word])[0]
    encoded = np.array([encoded])
    probs = model.predict(encoded).flatten()
    y_pos = range(len(probs))
    plt.figure(figsize=(14,1))
    plt.bar(y_pos,probs)
    plt.xticks(y_pos,words)
    plt.title("The probability distribution: word after '{}'".format(choice_word))
    plt.show()

Train an NLP model with President Trump's Tweets

Now let's move on to a more complex application. In the previous example, we had only a single sentence to train the model. I will now use ~3,000 tweets from President Trump to train a deep learning model.

In the previous post, I presented how to extract President Trump's ~3,000 latest tweets using tweepy. I use this data to create a simple deep learning model that tweets like he does!

In [14]:
import pandas as pd 
data = pd.read_csv("data/realDonaldTrump_tweets.csv")
data.head(5)
Out[14]:
id created_at favorite_count retweet_count text
0 952540700683497472 2018-01-14 13:59:35 63773 14402 ...big unnecessary regulation cuts made it all...
1 952538350333939713 2018-01-14 13:50:14 81577 18816 “President Trump is not getting the credit he ...
2 952530515894169601 2018-01-14 13:19:06 112532 28970 I, as President, want people coming into our C...
3 952528011869478912 2018-01-14 13:09:09 91606 21864 DACA is probably dead because the Democrats do...
4 952526145064505345 2018-01-14 13:01:44 66552 13551 ...and they knew exactly what I said and meant...

Let's look at 10 randomly selected tweets from the dataframe. They show that the tweets contain many terms that appear only once, as well as terms that are not interesting for prediction. So I will first clean the texts.

In [15]:
random_index = np.random.choice(data.shape[0],10)
for index,k in enumerate(data["text"].iloc[random_index]):
    print("")
    print("irow={}".format(random_index[index]))
    print(k)
irow=1846
Getting ready to meet President al-Sisi of Egypt. On behalf of the United States, I look forward to a long and wonderful relationship.

irow=2153
Christians in the Middle-East have been executed in large numbers. We cannot allow this horror to continue!

irow=962
Thank you, our great honor! https://t.co/StrciEwuWs

irow=394
Will be leaving the Philippines tomorrow after many days of constant mtgs & work in order to #MAGA! My promises are rapidly being fulfilled.

irow=488
It is finally happening for our great clean coal miners! https://t.co/suAnjs6Ccz

irow=611
Great news on the 2018 budget @SenateMajLdr McConnell - first step toward delivering MASSIVE tax cuts for the American people! #TaxReform https://t.co/aBzQR7KR0c

irow=1276
My son Donald openly gave his e-mails to the media & authorities whereas Crooked Hillary Clinton deleted (& acid washed) her 33,000 e-mails!

irow=246
MAKE AMERICA GREAT AGAIN!

irow=457
Getting ready to land in Hawaii. Looking so much forward to meeting with our great Military/Veterans at Pearl Harbor!

irow=2238
.@FoxNews "Outgoing CIA Chief, John Brennan, blasts Pres-Elect Trump on Russia threat. Does not fully understand." Oh really, couldn't do...

Tweet cleaning strategies:

  • Remove quotes
    • Ideally I would treat each of the quote characters (", “, ”) as a single word. However, I found that Tokenizer does not always treat them as single words. For example, for a sentence "I say so", "I is treated as a single word while the ending quote " is treated as another word. For simplicity, I will simply remove quotes in this blog.
  • Remove URLs, hashtags (#) and mentions (@). Most of these appear only once, so including them decreases the model performance on the validation set substantially.
In [16]:
import re
texts = []
for text in data["text"].values:
    text = text.replace('"',"")
    text = text.replace('“',"")
    text = text.replace('”',"")
    ## remove links (everything from "http(s)://" to the end of the line)
    text = re.sub(r'https?://.*', '', text, flags=re.MULTILINE)
    ## remove hashtags (everything from "#" to the end of the line)
    text = re.sub(r'#.*', '', text, flags=re.MULTILINE)
    ## remove mentions (everything from "@" to the end of the line)
    text = re.sub(r'@.*', '', text, flags=re.MULTILINE)
    texts.append(text)

data["text"] = texts

I found that this cleaning is very important for creating a meaningful model. Without the cleaning, the model's training accuracy did not increase beyond 0.05. I tried to solve this problem by substantially increasing the complexity of the model, but without much success. It seems that removing infrequently appearing words is a very useful approach.

This makes sense because removing these infrequently appearing words reduces the size of Tokenizer.word_index by more than 20% (1 - 5689/7300 ≈ 0.22).
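As a rough way to verify that reduction (a sketch added here, not in the original notebook; it assumes the uncleaned tweets were saved beforehand in a hypothetical variable raw_texts, since data["text"] is overwritten above):

## Compare vocabulary sizes before and after cleaning.
## raw_texts: hypothetical copy of the original, uncleaned tweets.
tok_raw, tok_clean = Tokenizer(), Tokenizer()
tok_raw.fit_on_texts(raw_texts)
tok_clean.fit_on_texts(data["text"].values)
n_raw, n_clean = len(tok_raw.word_index), len(tok_clean.word_index)
print("raw vocab: {}, cleaned vocab: {}, reduction: {:.0f}%".format(
    n_raw, n_clean, 100.0 * (1 - float(n_clean) / n_raw)))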

Now, we create a mapping between words and indices. Tokenizer nicely filters out special characters.

In [17]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data["text"].values)
vocab_size = len(tokenizer.word_index) + 1

index_word = {v: k for k,v in tokenizer.word_index.items()}

print("The sentence has {} unique words".format(len(np.unique(tokenizer.word_index.keys()))))
print("> vocab_size={}".format(vocab_size ))
The sentence has 5688 unique words
> vocab_size=5689

Using the Tokenizer's word index dictionary, we can represent each sentence just by its word indices. Let's see how the sentences look when represented with word indices.

In [18]:
def print_text(ks):
        for k in ks:
            print("{}({})".format(index_word[k],k)),
        print("")
        

for irow,line in enumerate(data["text"].iloc[random_index]):
    encoded = tokenizer.texts_to_sequences([line])[0]    
    print("irow={}".format(random_index[irow]))
    print_text(encoded)
    print("")
irow=1846
getting(182) ready(321) to(2) meet(561) president(54) al(1042) sisi(2117) of(4) egypt(1158) on(12) behalf(489) of(4) the(1) united(95) states(91) i(10) look(180) forward(236) to(2) a(6) long(159) and(3) wonderful(171) relationship(674) 

irow=2153
christians(5059) in(5) the(1) middle(348) east(595) have(22) been(87) executed(5060) in(5) large(741) numbers(282) we(15) cannot(469) allow(568) this(28) horror(1582) to(2) continue(456) 

irow=962
thank(30) you(20) our(14) great(11) honor(106) 

irow=394
will(9) be(13) leaving(359) the(1) philippines(1590) tomorrow(150) after(97) many(62) days(305) of(4) constant(3653) mtgs(2491) amp(18) work(177) in(5) order(184) to(2) 

irow=488
it(21) is(7) finally(355) happening(501) for(8) our(14) great(11) clean(1941) coal(1609) miners(2532) 

irow=611
great(11) news(36) on(12) the(1) 2018(676) budget(589) 

irow=1276
my(26) son(666) donald(319) openly(2811) gave(364) his(109) e(565) mails(935) to(2) the(1) media(69) amp(18) authorities(1828) whereas(4405) crooked(141) hillary(77) clinton(81) deleted(826) amp(18) acid(2393) washed(2394) her(272) 33(762) 000(157) e(565) mails(935) 

irow=246
make(66) america(42) great(11) again(63) 

irow=457
getting(182) ready(321) to(2) land(1447) in(5) hawaii(2517) looking(181) so(34) much(72) forward(236) to(2) meeting(151) with(16) our(14) great(11) military(117) veterans(432) at(23) pearl(1139) harbor(1330) 

irow=2238


  • Restructure the sentence data
    • Currently each row is a sentence.
    • We will change it so that every row corresponds to a single word to predict. For example, if there are two sentences "Make America Great Again" and "Thanks United States", this creates 5 rows: "- - Make America", "- Make America Great", "Make America Great Again", "- - Thanks United", "- Thanks United States", where "-" marks padding (see the toy sketch after this list).
  • Divide the sentences into training and testing datasets.
    • I make sure that all sub-sentences from the same original sentence go into the same dataset.
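Here is a minimal toy sketch of the restructuring described above (added for illustration; the toy sentences and variable names are not from the original notebook):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

toy_texts = ["Make America Great Again", "Thanks United States"]
toy_tok = Tokenizer()
toy_tok.fit_on_texts(toy_texts)
toy_seqs = []
for line in toy_texts:
    enc = toy_tok.texts_to_sequences([line])[0]
    for i in range(1, len(enc)):
        toy_seqs.append(enc[:i+1])   # growing prefixes of the sentence
# pre-pad with 0 so all rows have the same length;
# the last column is the word to predict, the rest is the input.
print(pad_sequences(toy_seqs, padding='pre'))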
In [19]:
N = data.shape[0]
prop_train = 0.8
Ntrain = int(N*prop_train)
Ntest = N - Ntrain

sequences, index_train, index_test = [], [], [] 
count = 0
for irow,line in enumerate(data["text"]):
    encoded = tokenizer.texts_to_sequences([line])[0]    
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
        
        if irow < Ntrain:
            index_train.append(count)
        else:
            index_test.append(count)
        count += 1
print('Total Sequences: %d' % (len(sequences)))
Total Sequences: 50854

The sequence lengths differ across the data, so we pad the front of each sequence with "0" to make them all the same length; see pad_sequences.

Transform the target variable into one-hot encoded vectors.

In [20]:
from keras.preprocessing.sequence import pad_sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)

# split into input and output elements
sequences = np.array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)

X_train, y_train, X_test, y_test = X[index_train], y[index_train],X[index_test], y[index_test]
Max Sequence Length: 55

Model training starts here!

I made the model more complex than in the previous example by increasing the dimension of the dense embedding vectors and the number of hidden units in the LSTM.

The training accuracy keeps increasing, but the validation accuracy does not increase as much. This is reasonable considering the small training data size; the model is overfitting.

In [21]:
model = define_model(vocab_size,
                               input_length=X.shape[1],
                               dim_dense_embedding=30,
                               hidden_unit_LSTM=64)

# compile network

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', metrics=['accuracy'])
# fit network
hist = model.fit(X_train, y_train, 
                 validation_data = (X_test,y_test),
                 epochs=50, verbose=2,batch_size=128)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
main_input (InputLayer)      (None, 54)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 54, 30)            170670    
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                24320     
_________________________________________________________________
dense_2 (Dense)              (None, 5689)              369785    
=================================================================
Total params: 564,775
Trainable params: 564,775
Non-trainable params: 0
_________________________________________________________________
None
Train on 42933 samples, validate on 7921 samples
Epoch 1/50
42s - loss: 7.0381 - acc: 0.0457 - val_loss: 7.1778 - val_acc: 0.0376
Epoch 2/50
43s - loss: 6.7187 - acc: 0.0468 - val_loss: 7.1333 - val_acc: 0.0489
Epoch 3/50
42s - loss: 6.5754 - acc: 0.0561 - val_loss: 7.1526 - val_acc: 0.0518
Epoch 4/50
42s - loss: 6.4653 - acc: 0.0620 - val_loss: 7.1248 - val_acc: 0.0542
Epoch 5/50
42s - loss: 6.3519 - acc: 0.0712 - val_loss: 7.0637 - val_acc: 0.0636
Epoch 6/50
41s - loss: 6.2376 - acc: 0.0831 - val_loss: 7.0578 - val_acc: 0.0659
Epoch 7/50
42s - loss: 6.1370 - acc: 0.0893 - val_loss: 7.0022 - val_acc: 0.0819
Epoch 8/50
42s - loss: 6.0357 - acc: 0.0985 - val_loss: 6.9901 - val_acc: 0.0889
Epoch 9/50
42s - loss: 5.9305 - acc: 0.1113 - val_loss: 6.9461 - val_acc: 0.0971
Epoch 10/50
42s - loss: 5.8292 - acc: 0.1210 - val_loss: 6.9221 - val_acc: 0.1023
Epoch 11/50
41s - loss: 5.7368 - acc: 0.1259 - val_loss: 6.9087 - val_acc: 0.1050
Epoch 12/50
41s - loss: 5.6531 - acc: 0.1313 - val_loss: 6.8999 - val_acc: 0.1057
Epoch 13/50
41s - loss: 5.5744 - acc: 0.1361 - val_loss: 6.8956 - val_acc: 0.1066
Epoch 14/50
41s - loss: 5.4992 - acc: 0.1419 - val_loss: 6.8832 - val_acc: 0.1117
Epoch 15/50
40s - loss: 5.4262 - acc: 0.1450 - val_loss: 6.8832 - val_acc: 0.1146
Epoch 16/50
41s - loss: 5.3554 - acc: 0.1498 - val_loss: 6.8813 - val_acc: 0.1197
Epoch 17/50
42s - loss: 5.2873 - acc: 0.1535 - val_loss: 6.8838 - val_acc: 0.1179
Epoch 18/50
41s - loss: 5.2193 - acc: 0.1569 - val_loss: 6.8880 - val_acc: 0.1199
Epoch 19/50
41s - loss: 5.1511 - acc: 0.1615 - val_loss: 6.8848 - val_acc: 0.1225
Epoch 20/50
41s - loss: 5.0858 - acc: 0.1661 - val_loss: 6.8957 - val_acc: 0.1240
Epoch 21/50
41s - loss: 5.0204 - acc: 0.1709 - val_loss: 6.9047 - val_acc: 0.1278
Epoch 22/50
41s - loss: 4.9554 - acc: 0.1759 - val_loss: 6.9067 - val_acc: 0.1260
Epoch 23/50
40s - loss: 4.8918 - acc: 0.1811 - val_loss: 6.9166 - val_acc: 0.1283
Epoch 24/50
40s - loss: 4.8282 - acc: 0.1861 - val_loss: 6.9308 - val_acc: 0.1299
Epoch 25/50
40s - loss: 4.7659 - acc: 0.1903 - val_loss: 6.9423 - val_acc: 0.1321
Epoch 26/50
39s - loss: 4.7034 - acc: 0.1951 - val_loss: 6.9518 - val_acc: 0.1327
Epoch 27/50
38s - loss: 4.6431 - acc: 0.1998 - val_loss: 6.9729 - val_acc: 0.1312
Epoch 28/50
38s - loss: 4.5823 - acc: 0.2061 - val_loss: 6.9828 - val_acc: 0.1324
Epoch 29/50
38s - loss: 4.5236 - acc: 0.2094 - val_loss: 7.0053 - val_acc: 0.1329
Epoch 30/50
37s - loss: 4.4663 - acc: 0.2153 - val_loss: 7.0225 - val_acc: 0.1328
Epoch 31/50
38s - loss: 4.4097 - acc: 0.2198 - val_loss: 7.0357 - val_acc: 0.1357
Epoch 32/50
40s - loss: 4.3553 - acc: 0.2236 - val_loss: 7.0553 - val_acc: 0.1348
Epoch 33/50
39s - loss: 4.3003 - acc: 0.2296 - val_loss: 7.0785 - val_acc: 0.1360
Epoch 34/50
40s - loss: 4.2490 - acc: 0.2343 - val_loss: 7.0932 - val_acc: 0.1396
Epoch 35/50
40s - loss: 4.1977 - acc: 0.2382 - val_loss: 7.1158 - val_acc: 0.1384
Epoch 36/50
39s - loss: 4.1474 - acc: 0.2444 - val_loss: 7.1318 - val_acc: 0.1385
Epoch 37/50
40s - loss: 4.0999 - acc: 0.2499 - val_loss: 7.1532 - val_acc: 0.1381
Epoch 38/50
40s - loss: 4.0533 - acc: 0.2543 - val_loss: 7.1659 - val_acc: 0.1409
Epoch 39/50
40s - loss: 4.0083 - acc: 0.2604 - val_loss: 7.1852 - val_acc: 0.1389
Epoch 40/50
37s - loss: 3.9632 - acc: 0.2663 - val_loss: 7.2059 - val_acc: 0.1393
Epoch 41/50
41s - loss: 3.9196 - acc: 0.2719 - val_loss: 7.2269 - val_acc: 0.1386
Epoch 42/50
40s - loss: 3.8774 - acc: 0.2764 - val_loss: 7.2385 - val_acc: 0.1386
Epoch 43/50
41s - loss: 3.8377 - acc: 0.2825 - val_loss: 7.2597 - val_acc: 0.1408
Epoch 44/50
41s - loss: 3.7978 - acc: 0.2874 - val_loss: 7.2728 - val_acc: 0.1391
Epoch 45/50
41s - loss: 3.7604 - acc: 0.2923 - val_loss: 7.2909 - val_acc: 0.1372
Epoch 46/50
41s - loss: 3.7229 - acc: 0.2977 - val_loss: 7.3075 - val_acc: 0.1387
Epoch 47/50
41s - loss: 3.6861 - acc: 0.3024 - val_loss: 7.3216 - val_acc: 0.1403
Epoch 48/50
39s - loss: 3.6515 - acc: 0.3062 - val_loss: 7.3379 - val_acc: 0.1384
Epoch 49/50
40s - loss: 3.6179 - acc: 0.3112 - val_loss: 7.3497 - val_acc: 0.1401
Epoch 50/50
42s - loss: 3.5841 - acc: 0.3150 - val_loss: 7.3721 - val_acc: 0.1372

Plot validation accuracy and training accuracy

In [34]:
plt.figure(figsize=(8,8))
plt.plot(hist.history["val_acc"],label="val_acc")
plt.plot(hist.history["acc"],label="acc")
plt.legend()
plt.show()

Check the dimensions of the available weights

In [22]:
count_layer = 0
for layer in model.layers:
    ws = layer.get_weights()
    c = 0 
    for w in ws: 
        print("layer{}_{} {:20} {}".format(count_layer,
                                           c,
                                           layer.name,w.shape))
        c += 1 
    count_layer += 1
layer1_0 embedding_2          (5689, 30)
layer2_0 lstm_2               (30, 256)
layer2_1 lstm_2               (64, 256)
layer2_2 lstm_2               (256,)
layer3_0 dense_2              (64, 5689)
layer3_1 dense_2              (5689,)
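The 256 appearing in the LSTM weight shapes is simply 4 gates x 64 hidden units; as a quick check (added here, plain arithmetic), the LSTM parameter count from model.summary() can be reproduced the same way:

n_lstm, dim_embed = 64, 30
print(4 * n_lstm)                                # 256: columns of the LSTM weight matrices above
print(4 * n_lstm * (dim_embed + n_lstm + 1))     # 24320: LSTM parameters in model.summary()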

Reduce the dimension of the word embedding vectors using PCA and visualize their distribution in 2D.

In [23]:
weight = model.layers[1].get_weights()[0]
print(weight.shape)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
y_embed_pca = pca.fit_transform(weight )
print(y_embed_pca.shape)
(5689, 30)
(5689, 2)

Observations

  • Seemingly similar words are clustered in the same area:
    • kim, saudi, radical, differently
    • north, south
    • estimates, statistically
    • knives, dialogs
    • independent, united
    • thank, honor
In [27]:
fig, ax = plt.subplots(figsize=(25,25))
ax.scatter(y_embed_pca[:,0],y_embed_pca[:,1],c="white")
for txt, irow in tokenizer.word_index.items():
    try:
        ax.annotate(txt,
                (y_embed_pca[irow,0],y_embed_pca[irow,1]))
    except:
        pass
ax.set_xlabel("pca embedding 1")
ax.set_ylabel("pca embedding 2")
plt.show()

Let President Trump's AI talk about some topics

I feed in the first few words, and let's see what the opinion of President Trump's AI is!! I sample the next word at random according to the estimated probability distribution. This way, the output is different every time I feed in the same initial words.

Hmmm, the predicted tweets make sense, kind of?

Observations

  • When "Make America" is provided as the first 2 words, the AI almost always predicts "great again" as the next words.
  • When "North" is provided, the next word is almost always "Korea" followed often by some negative sentence.
  • The sentence starting with "Omaga is" tends to have negative meaning.
In [37]:
def sample(probs):
    return(np.random.choice(range(len(probs)),p=probs))

def predict_sentence(in_text,n_words,tokenizer,model,max_length):
    words = []
    for _ in range(n_words):
        # encode the text as integer
        enc = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        enc_padding = pad_sequences([enc], maxlen=max_length-1, padding='pre')
        probs = model.predict(enc_padding, verbose=0).flatten()
        index = sample(probs)
        
        word = index_word[index] 
        in_text += ' ' + word
        
    return(in_text)
print(predict_sentence("North",n_words,tokenizer,model,max_length))

print(predict_sentence("America",n_words,tokenizer,model,max_length))

print(predict_sentence("I'm",n_words,tokenizer,model,max_length))

print(predict_sentence("I won't",n_words,tokenizer,model,max_length))

print(predict_sentence("MAKE AMERICA",n_words,tokenizer,model,max_length))

print(predict_sentence("american jobs",n_words,tokenizer,model,max_length))

print(predict_sentence("Obama is",n_words,tokenizer,model,max_length))

print(predict_sentence("Universal healthcare",n_words,tokenizer,model,max_length))

print(predict_sentence("H1B visa",n_words,tokenizer,model,max_length))

print(predict_sentence("Women",n_words,tokenizer,model,max_length))

print(predict_sentence("fat",n_words,tokenizer,model,max_length))
North korea is so out with my the story on crooked hillary investigation as
America is time blames came played many agencies before he voted out this is
I'm so is dead they got china failing big crowd amp yates of historic
I won't been would be a great nation the u s made what a grateful
MAKE AMERICA great again to have forced to be a last legs in alabama uttered
american jobs growth and now elections great healthcare with mine that is many fake news
Obama is also way china media about representatives if it is a great days or
Universal healthcare and now we will the u s demand manchester should do real james
H1B visa person jobs in louvre powell to stand for what on the roosevelt room
Women individuals our great healthcare amp tax cuts is approved hopefully terminate bonuses in
fat other in china like killed presidential support that the rulers of she which

Next Steps

  • Try transfer learning
  • Train on more tweets
  • Predict the number of retweets/likes given a tweet
