This blog post is a reminder to myself of the basic usage of tweepy. I will extract someone's past tweets using tweepy and create a .csv file that can be used to train machine learning models. I created the scripts by referencing the following blog posts:
Import the necessary Python packages.
In [28]:
## credentials.py contains:
# consumer_key = "XXX"
# consumer_secret = "XXX"
# access_token = "XXX"
# access_token_secret = "XXX"
from credentials import *
import tweepy
print(tweepy.__version__)
Select the userID (the screen name of the target account).
In [2]:
userID = "realDonaldTrump"
Step 1:
- Extract the latest 200 tweets using api.user_timeline
In [31]:
# Authorize our Twitter credentials
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

tweets = api.user_timeline(screen_name=userID,
                           # 200 is the maximum allowed count
                           count=200,
                           include_rts=False,
                           # Necessary to keep the full text;
                           # otherwise only the first 140 characters are returned
                           tweet_mode='extended')
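One practical note: repeated calls to api.user_timeline can run into Twitter's rate limits. A minimal sketch of a workaround (wait_on_rate_limit is a real tweepy.API option, though companion flags differ slightly between tweepy versions):

# Sketch: let tweepy sleep through rate-limit windows automatically.
# wait_on_rate_limit exists in tweepy 3.x and 4.x; tweepy 3.x additionally
# accepts wait_on_rate_limit_notify=True to print a message while waiting.
api = tweepy.API(auth, wait_on_rate_limit=True)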
Show the 3 latest extracted tweets. Note that info.id is larger for more recent tweets.
In [32]:
for info in tweets[:3]:
    print("ID: {}".format(info.id))
    print(info.created_at)
    print(info.full_text)
    print("\n")
Step 2:
Extract as many past tweets as possible by repeatedly calling api.user_timeline with max_id set just below the oldest tweet seen so far. Note that the standard Twitter API returns at most roughly the 3,200 most recent tweets of a user.
In [33]:
all_tweets = []
all_tweets.extend(tweets)
oldest_id = tweets[-1].id
while True:
    tweets = api.user_timeline(screen_name=userID,
                               # 200 is the maximum allowed count
                               count=200,
                               include_rts=False,
                               # Request only tweets older than the ones we already have
                               max_id=oldest_id - 1,
                               # Necessary to keep the full text;
                               # otherwise only the first 140 characters are returned
                               tweet_mode='extended')
    if len(tweets) == 0:
        break
    oldest_id = tweets[-1].id
    all_tweets.extend(tweets)
    print('N of tweets downloaded till now {}'.format(len(all_tweets)))
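As an aside, tweepy also ships a Cursor helper that does this max_id bookkeeping internally. A minimal sketch of the same download, assuming the api and userID defined above:

# Sketch: tweepy.Cursor handles the pagination for us
cursor = tweepy.Cursor(api.user_timeline,
                       screen_name=userID,
                       count=200,
                       include_rts=False,
                       tweet_mode='extended')
all_tweets_alt = [status for status in cursor.items()]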
Step 3:
Save the tweets into a CSV file.
In [34]:
# Transform the tweepy tweets into a 2D array that will populate the CSV
from pandas import DataFrame

outtweets = [[tweet.id_str,
              tweet.created_at,
              tweet.favorite_count,
              tweet.retweet_count,
              # full_text is already a unicode str, so no re-encoding is needed
              tweet.full_text]
             for tweet in all_tweets]
df = DataFrame(outtweets, columns=["id", "created_at", "favorite_count", "retweet_count", "text"])
df.to_csv('%s_tweets.csv' % userID, index=False)
df.head(3)
Out[34]:
The data is saved in the current working directory as:
In [23]:
ls *.csv
In [24]:
cat *.csv | head -4
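For downstream use (e.g. as training data), the file can be read straight back into pandas. A minimal sketch, assuming the file name produced above:

import pandas as pd

# Sketch: reload the saved tweets; parse_dates restores created_at as datetimes
df_loaded = pd.read_csv('realDonaldTrump_tweets.csv', parse_dates=["created_at"])
print(df_loaded.shape)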
Preliminary analysis of President Trump's tweets
Let's look at how the favorite counts and retweet counts change over time.
- There are a few extraordinarily popular tweets.
- The plot shows that the extracted tweets reach back to October 2016.
In [25]:
import matplotlib.pyplot as plt

ylabels = ["favorite_count", "retweet_count"]
fig = plt.figure(figsize=(13, 3))
fig.subplots_adjust(hspace=0.01, wspace=0.01)
n_row = len(ylabels)
n_col = 1
for count, ylabel in enumerate(ylabels):
    ax = fig.add_subplot(n_row, n_col, count + 1)
    ax.plot(df["created_at"], df[ylabel])
    ax.set_ylabel(ylabel)
plt.show()
Let's look at the most popular tweets themselves. Here, a tweet counts as "most popular" if favorite_count > 400,000 and retweet_count > 200,000.
- The 1st peak: the tweets President Trump posted when he was elected President.
- The 2nd peak: President Trump's response to CNN. This tweet includes a video in which President Trump body-slams a man whose face is covered with the text "CNN".
- The 3rd peak: President Trump's response to Kim Jong-un.
In [26]:
df_sub = df.loc[(df["favorite_count"] > 400000) & (df["retweet_count"] > 200000), :]
for irow in range(df_sub.shape[0]):
    df_row = df_sub.iloc[irow, :]
    print(df_row["created_at"])
    print("favorite_count={:6} retweet_count={:6}".format(df_row["favorite_count"],
                                                          df_row["retweet_count"]))
    print(df_row["text"])
    print("\n")