Quick intro
Getting familiar with the new tidytext package was a great weekend project. This example follows the structure of the Introduction to tidytext article by the authors of the package, Julia Silge and David Robinson.
The text for this example comes from tweets, specifically tweets with the rstats hashtag. This project is also an attempt to learn something about the community.
Twitter data import
The twitteR package opens the Twitter API to R users. The config package is used to keep the credentials out of the code.
Because these tweets were pulled at a specific point in time, anyone recreating this analysis may get different results.
library(twitteR)
token <- config::get()
setup_twitter_oauth(token$key, token$secret, token$token, token$tsecret)
## [1] "Using direct authentication"
tweets <- searchTwitter("#rstats", 10000)
## Warning in doRppAPICall("search/tweets", n, params = params,
## retryOnRateLimit = retryOnRateLimit, : 10000 tweets were requested but the
## API can only return 9050
Tidy data
The searchTwitter() function returns a rather complex list object. Using the packages in the tidyverse, the complex list is converted to a tidy table, retweets are removed, and a usable date field is added.
library(tidyverse)
# Extract one field per tweet into its own column
tidy_tweets <- tibble(
  screen_name = tweets %>% map_chr(~.x$screenName),
  tweetid = tweets %>% map_chr(~.x$id),
  created_timestamp = seq_len(length(tweets)) %>% map_chr(~as.character(tweets[[.x]]$created)),
  is_retweet = tweets %>% map_chr(~.x$isRetweet),
  text = tweets %>% map_chr(~.x$text)
) %>%
  mutate(created_date = as.Date(created_timestamp)) %>%
  # Drop flagged retweets and manual retweets whose text starts with "RT"
  filter(is_retweet == FALSE,
         substr(text, 1, 2) != "RT")
tidy_tweets
## # A tibble: 2,389 x 6
## screen_name tweetid created_timestamp is_retweet
## <chr> <chr> <chr> <chr>
## 1 kearneymw 904169552526934017 2017-09-03 02:29:54 FALSE
## 2 ahmad_m_mobin 904167193428074496 2017-09-03 02:20:32 FALSE
## 3 SanghaChick 904166809083027457 2017-09-03 02:19:00 FALSE
## 4 SavranWeb 904165843533213696 2017-09-03 02:15:10 FALSE
## 5 nibrivia 904165791972687872 2017-09-03 02:14:58 FALSE
## 6 Cruz_Julian_ 904162499343376385 2017-09-03 02:01:53 FALSE
## 7 zabormetrics 904157256408813569 2017-09-03 01:41:03 FALSE
## 8 AriLamstein 904156509755637760 2017-09-03 01:38:04 FALSE
## 9 o_gonzales 904151758754238464 2017-09-03 01:19:12 FALSE
## 10 Rbloggers 904149400884215809 2017-09-03 01:09:50 FALSE
## # ... with 2,379 more rows, and 2 more variables: text <chr>,
## # created_date <date>
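The list-to-table step can be tried without Twitter access. The sketch below uses base R only and a hypothetical fake_tweets list of plain lists standing in for the objects returned by searchTwitter(). The field names mirror the ones used above, but the values are made up for illustration.

```r
# Hypothetical stand-in for the list returned by searchTwitter();
# the values are invented for illustration only.
fake_tweets <- list(
  list(screenName = "user_a", id = "1", isRetweet = FALSE,
       text = "Loving #rstats"),
  list(screenName = "user_b", id = "2", isRetweet = TRUE,
       text = "RT @user_a: Loving #rstats"),
  list(screenName = "user_c", id = "3", isRetweet = FALSE,
       text = "RT manual retweet #rstats")
)

# Pull one field out of every list element, one column per field
flat <- data.frame(
  screen_name = vapply(fake_tweets, function(x) x$screenName, character(1)),
  tweetid     = vapply(fake_tweets, function(x) x$id, character(1)),
  is_retweet  = vapply(fake_tweets, function(x) x$isRetweet, logical(1)),
  text        = vapply(fake_tweets, function(x) x$text, character(1)),
  stringsAsFactors = FALSE
)

# Drop flagged retweets and texts that start with "RT"
originals <- flat[!flat$is_retweet & substr(flat$text, 1, 2) != "RT", ]
```

In this sketch is_retweet is kept as a logical column rather than a character one, which makes the filter condition a little more direct.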
tidytext, transform!
Word tokens
The unnest_tokens() command from the tidytext package easily transforms the existing tidy table, with one row (observation) per tweet, into a table with one row (token) per word inside the tweet.
library(tidytext)
tweet_words <- tidy_tweets %>%
  select(tweetid,
         screen_name,
         text,
         created_date) %>%
  unnest_tokens(word, text)
tweet_words
## # A tibble: 37,846 x 4
## tweetid screen_name created_date word
## <chr> <chr> <date> <chr>
## 1 904169552526934017 kearneymw 2017-09-03 new
## 2 904169552526934017 kearneymw 2017-09-03 hex
## 3 904169552526934017 kearneymw 2017-09-03 sticker
## 4 904169552526934017 kearneymw 2017-09-03 for
## 5 904169552526934017 kearneymw 2017-09-03 rtweet
## 6 904169552526934017 kearneymw 2017-09-03 rstats
## 7 904169552526934017 kearneymw 2017-09-03 https
## 8 904169552526934017 kearneymw 2017-09-03 t.co
## 9 904169552526934017 kearneymw 2017-09-03 dfvl8xj2x6
## 10 904167193428074496 ahmad_m_mobin 2017-09-03 pretty
## # ... with 37,836 more rows
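To make the reshaping from one row per tweet to one row per word concrete, here is a rough base R sketch on two made-up tweets. It is not how unnest_tokens() is implemented, and the splitting rule is a deliberate simplification. The names tweets_df and word_rows are hypothetical.

```r
# Made-up tweets for illustration
tweets_df <- data.frame(
  tweetid = c("1", "2"),
  text    = c("New hex sticker for rtweet", "Pretty cool #rstats plot"),
  stringsAsFactors = FALSE
)

# Lowercase, then split on any run of characters that is not a letter,
# a digit, or an apostrophe. This is a simplification of real tokenizing.
tokens <- strsplit(tolower(tweets_df$text), "[^a-z0-9']+")

# Remove empty strings that can appear when a text starts with a separator
tokens <- lapply(tokens, function(w) w[w != ""])

# Repeat each tweetid once per word to get one row per token
word_rows <- data.frame(
  tweetid = rep(tweets_df$tweetid, lengths(tokens)),
  word    = unlist(tokens),
  stringsAsFactors = FALSE
)
```

Each tweetid is repeated as many times as its tweet has words, which is the same shape as the tweet_words table above.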
Stop words
The stop_words table is part of tidytext. It contains common words that can be discarded from an analysis. This is the kind of list that analysts usually have to find online and then clean up manually.
stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
A small custom stop word list is put together to reduce the noise caused by terms common in tweets.
my_stop_words <- tibble(
  word = c(
    "https",
    "t.co",
    "rt",
    "amp",
    "rstats",
    "gt"
  ),
  lexicon = "twitter"
)
The combined list of stop words is then used to remove such words from the words in the tweets. An additional filter is added to remove words that are numbers.
all_stop_words <- stop_words %>%
  bind_rows(my_stop_words)
# as.numeric() yields NA (with a warning) for non-numeric words,
# so keeping only the NA cases drops the words that are numbers
suppressWarnings({
  no_numbers <- tweet_words %>%
    filter(is.na(as.numeric(word)))
})
no_stop_words <- no_numbers %>%
  anti_join(all_stop_words, by = "word")
tibble(
  total_words = nrow(tweet_words),
  after_cleanup = nrow(no_stop_words)
)
## # A tibble: 1 x 2
## total_words after_cleanup
## <int> <int>
## 1 37846 17964
More than half of the words in the tweets were removed as stop words or numbers. Here is a quick look at the words that are currently at the top, based on occurrence:
top_words <- no_stop_words %>%
  group_by(word) %>%
  tally() %>%
  arrange(desc(n)) %>%
  head(10)
top_words
## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 datascience 449
## 2 data 284
## 3 cran 201
## 4 package 196
## 5 machinelearning 108
## 6 rstudio 102
## 7 python 93
## 8 updates 92
## 9 code 88
## 10 y5w2ntksxt 86
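The two cleanup steps in this section, the number filter and the stop word removal, can be checked on a toy vector. The sketch below uses base R only; the words vector and toy_stop_words are made up for illustration, and %in% plays the role that anti_join() plays in the pipeline above.

```r
# Made-up tokens for illustration
words <- c("the", "2017", "package", "3.4", "https", "data", "v2")

# as.numeric() returns NA (with a warning) for anything it cannot parse
# as a number, so is.na() is TRUE for the words that should be kept
not_number <- suppressWarnings(is.na(as.numeric(words)))
no_numbers <- words[not_number]

# Hypothetical mini stop word list, standing in for all_stop_words
toy_stop_words <- c("the", "https")

# Keep only the words that have no match in the stop word list
kept <- no_numbers[!no_numbers %in% toy_stop_words]
```

Note that a mixed token such as "v2" survives the number filter, because as.numeric() cannot parse it.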
Sentiment matching
The get_sentiments() function in tidytext makes it really easy to match words against different lexicons (vocabularies). The NRC lexicon was chosen for this analysis. Because get_sentiments() returns a data frame, a simple table join makes the lexicon part of the analysis.
nrc_words <- no_stop_words %>%
  inner_join(get_sentiments("nrc"), by = "word")
nrc_words
## # A tibble: 4,126 x 5
## tweetid screen_name created_date word sentiment
## <chr> <chr> <date> <chr> <chr>
## 1 904167193428074496 ahmad_m_mobin 2017-09-03 pretty anticipation
## 2 904167193428074496 ahmad_m_mobin 2017-09-03 pretty joy
## 3 904167193428074496 ahmad_m_mobin 2017-09-03 pretty positive
## 4 904167193428074496 ahmad_m_mobin 2017-09-03 pretty trust
## 5 904167193428074496 ahmad_m_mobin 2017-09-03 cool positive
## 6 904166809083027457 SanghaChick 2017-09-03 fun anticipation
## 7 904166809083027457 SanghaChick 2017-09-03 fun joy
## 8 904166809083027457 SanghaChick 2017-09-03 fun positive
## 9 904165843533213696 SavranWeb 2017-09-03 script positive
## 10 904165791972687872 nibrivia 2017-09-03 start anticipation
## # ... with 4,116 more rows
It is worth mentioning that in the NRC lexicon, one word may have multiple sentiments. For example, the word wait has both a negative and an anticipation classification. From the data joining perspective, this means multiple matches for words that have more than one sentiment.
nrc_words %>%
  group_by(sentiment) %>%
  tally() %>%
  arrange(desc(n))
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 positive 1225
## 2 trust 770
## 3 anticipation 512
## 4 joy 385
## 5 negative 381
## 6 fear 190
## 7 sadness 180
## 8 disgust 173
## 9 surprise 171
## 10 anger 139
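The effect of multiple sentiments per word on the join can be seen on a toy example. The sketch below uses base R merge(), which performs an inner join by default, with a hypothetical three-row lexicon. The rows for wait and fun follow examples shown in this post; everything else is made up.

```r
# Hypothetical mini-lexicon: a tiny stand-in for get_sentiments("nrc")
toy_lexicon <- data.frame(
  word      = c("wait", "wait", "fun"),
  sentiment = c("negative", "anticipation", "joy"),
  stringsAsFactors = FALSE
)

# Made-up tokens, one row per word
toy_words <- data.frame(
  tweetid = c("1", "1", "2"),
  word    = c("wait", "plot", "fun"),
  stringsAsFactors = FALSE
)

# Inner join: words without a lexicon entry ("plot") drop out,
# and "wait" appears once per sentiment it carries
matched <- merge(toy_words, toy_lexicon, by = "word")
```

Three input words become three output rows here, but only because one word was dropped and another was duplicated. Counts per sentiment therefore count word and sentiment pairs, not words.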
Removing that many words from the analysis may mean that some tweets had no words that matched NRC. A quick count of the unique tweetid values provides the answer. In this case, 1,362 of the 2,389 tweets in tidy_tweets had at least one word that matched the NRC list.
nrc_words %>%
  group_by(tweetid) %>%
  tally() %>%
  ungroup() %>%
  count() %>%
  pull()
## [1] 1362
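The same count, and the share of tweets it represents, can be sketched in base R. The ids below are made up for illustration; all_ids stands in for the tweetid column of tidy_tweets and matched_ids for the tweetid column of nrc_words.

```r
# Made-up ids for illustration
all_ids     <- c("1", "2", "3", "4")        # one entry per tweet
matched_ids <- c("1", "1", "1", "3", "3")   # one entry per word/sentiment match

# Tweets with at least one lexicon match
n_matched <- length(unique(matched_ids))

# Share of all tweets that had at least one match
coverage <- n_matched / length(all_ids)
```

With the counts shown in this post (1,362 matched ids out of 2,389 rows in tidy_tweets), the same ratio comes to about 57%.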
Visualize results
The first visualization is a joyplot using ggplot2 and the new ggjoy extension created by Claus O. Wilke.
library(ggjoy)
ggplot(nrc_words) +
  geom_joy(aes(
    x = created_date,
    y = sentiment,
    fill = sentiment),
    rel_min_height = 0.01,
    alpha = 0.7,
    scale = 3) +
  theme_joy() +
  labs(title = "Twitter #rstats sentiment analysis",
       x = "Tweet Date",
       y = "Sentiment") +
  scale_fill_discrete(guide = "none")
## Picking joint bandwidth of 0.683
Joyful words
The influence of words classified as “joy” by NRC is visualized with the wordcloud() function from the wordcloud package. The words love and create come out on top.
library(RColorBrewer)
library(wordcloud)
set.seed(10)
joy_words <- nrc_words %>%
  filter(sentiment == "joy") %>%
  group_by(word) %>%
  tally()
joy_words %>%
  with(wordcloud(word, n, max.words = 50, colors = c("#56B4E9", "#E69F00")))
Because a tweet is short, it made sense to find out what words surround joyful words. The next wordcloud uses tweets with at least one word considered joyful. The joyful words are removed, as well as the top 10 overall words, to get a better picture of the surrounding words unique to this sentiment.
other_words <- nrc_words %>%
  filter(sentiment == "joy") %>%
  group_by(tweetid) %>%
  tally() %>%
  ungroup() %>%
  inner_join(no_stop_words, by = "tweetid") %>%
  anti_join(joy_words, by = "word") %>%
  anti_join(top_words, by = "word") %>%
  group_by(word) %>%
  count()
# The joined table already has a column named n, so the count lands in nn
other_words %>%
  with(wordcloud(word, nn, max.words = 30, colors = c("#56B4E9", "#E69F00")))
## Warning in wordcloud(word, nn, max.words = 30, colors = c("#56B4E9",
## "#E69F00")): proposals could not be fit on page. It will not be plotted.
Conclusion
The commentary on the results of the visualizations was limited because I am not a text mining expert. Personally, I found the results for the “joyful” words a bit inspirational.
A more objective conclusion is that the tidyverse packages, which it seems will soon include tidytext, make getting started with text mining easy and, actually, fun!