The Power of Social Media Analytics: Twitter Text Mining Using R

8 min read · Jul 23, 2019

Introduction

Imagine you are a data scientist or data enthusiast, and you are curious about what people online are saying about a product you just launched. You most likely introduced a hashtag for it, so anyone who clicks the hashtag is taken to a feed of the most recent tweets that contain it.

The main question then becomes: how do you extract tweets in R? Several tools can analyze Twitter text, but I chose R because its packages cover the whole workflow, from collecting tweets to cleaning and visualizing them.

Prerequisites

R and RStudio installed (https://cran.r-project.org/ and https://www.rstudio.com/products/rstudio/download/#download respectively).
A Twitter account (https://twitter.com/login) and a Twitter developer app (https://developer.twitter.com/en/apps). The app gives us API credentials, which let us request the tweets related to the hashtag that we specify in our API call.

Resources/ Book Recommendations

Text Mining with R: A Tidy Approach, by Julia Silge and David Robinson (https://www.tidytextmining.com/).

and any other resources you might find helpful:)

Let’s Get Started

Step 1. Twitter Authentication for extracting tweets

1. If you opened https://developer.twitter.com/en/apps, you should be on a page that looks like this:

2. Click “Create an app”. You should then see the form shown below:

3. Fill in the app details and give a simple description (this step should not worry you). A short GIF walks through the steps.

A few things to note here:

a) You can leave every field blank for now except “Website URL (required)”. Enter the URL of a website you own. It could be a medium.com site, a wordpress.com site or any other working website.

b) Once you have entered the URL under “Website URL (required)”, scroll down to the “Create” button, which should change from inactive (faded blue) to active (solid blue).

c) If everything looks right, click “Create”. You will then get this message: “Review our Developer Terms. As a reminder, you have agreed to our Developer Agreement and Policy. Please be mindful of the following restricted use cases etc.”. After reading and understanding the terms, click “Create” again.

Congratulations! You have now created your Twitter app.

From this step, you can definitely do a “happy dance” for your small but awesome achievement.

4. After following the above steps, you will get to a page like this:

5. Click “Keys and tokens” to access your Consumer API keys (API key and API secret key). Then click “Create” under “Access token & access token secret” to generate your access token and access token secret.

For the security of my app, I will not show the access token or consumer keys that Twitter provided in any image.

Please save the API key, API secret key, access token and access token secret somewhere safe, as they will be used in R later.

Once the Twitter application is ready, we can move on to programming in R to extract data from Twitter.

Step 2: R Programming

Install and Load the Libraries

I will use the ‘rtweet’ package, written and maintained by Michael W. Kearney, to collect the Twitter data. Thank you Michael!

Other packages in use:

tidyverse — For data cleaning and data visualization

tidytext — Text mining

In case you don’t have any of these packages installed, use the function:

install.packages("package name")
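As a rough sketch of the same idea, the check below lists which of the packages used in this article are not yet installed. The helper name missing_packages() is made up for illustration, and the install call is left as a comment so that nothing is installed without your say-so.

```r
# Packages used in this article
needed <- c("rtweet", "ggplot2", "dplyr", "tidytext")

# Hypothetical helper: return the packages that cannot be found locally
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```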

Code

# load twitter library - the rtweet library is recommended now over twitteR
library(rtweet)
# plotting and pipes - tidyverse!
library(ggplot2)
library(dplyr)
# text mining library
library(tidytext)
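The four credentials saved in Step 1 have to reach rtweet before any search will work. The sketch below is one possible way to do that, not the only one: it assumes the keys were stored in environment variables (the variable names and the app name are made up for illustration) and that you run an rtweet 0.x release, where create_token() is the documented way to authenticate. Newer rtweet releases deprecate that function, so check the help pages of your installed version.

```r
# Hypothetical helper: read the four credentials from environment variables
# so that they never appear in the script itself.
read_twitter_keys <- function() {
  keys <- c(
    consumer_key    = Sys.getenv("TWITTER_API_KEY"),
    consumer_secret = Sys.getenv("TWITTER_API_SECRET"),
    access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
    access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET")
  )
  missing <- names(keys)[keys == ""]
  if (length(missing) > 0) {
    stop("Missing credentials: ", paste(missing, collapse = ", "))
  }
  keys
}

# Not run here because it needs real keys (assumes rtweet 0.x):
# keys  <- read_twitter_keys()
# token <- rtweet::create_token(
#   app             = "my_app_name",  # hypothetical: the name you gave your app
#   consumer_key    = keys[["consumer_key"]],
#   consumer_secret = keys[["consumer_secret"]],
#   access_token    = keys[["access_token"]],
#   access_secret   = keys[["access_secret"]]
# )
```

Keeping the keys in environment variables (for example in a .Renviron file) means the script can be shared without leaking them.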

In this example, let’s find #YouthSDGs tweets. So what are Sustainable Development Goals? The Sustainable Development Goals are a collection of 17 global goals set by the United Nations General Assembly in 2015 for the year 2030. The SDGs are part of Resolution 70/1 of the United Nations General Assembly, the 2030 Agenda. The image below also explains it all.

Search the Tweets

Code

youth_tweets <- search_tweets(q = "#YouthSDGs", n = 10000,
                                lang = "en",
                                include_rts = FALSE)

Let’s look at the results

# check data to see if there are emojis
head(youth_tweets$text)
[1] "Happy Tuesday! We can all protect the environment by doing simple things like carrying our own reusable water bottles. #BeatPlasticPollution #greenrwanda #greengrowth #Rwanda #Rwot #youthsdgs https://t.co/3eGG9AvwTY"                                                                  
[2] "#youngpeople are key, the #future of our country and the future of our #communities \n\n#Peace #SDGs #sustainable #youth #youthsdgs #WEAREHERE  #Youth2030 #Youth4Peace \n\n@ConnectSDGs @IYCM @Agenda2030MX @UNYouthEnvoy @UN4Youth https://t.co/jy0RYDG5yh"                             
[3] "Guest blog: TO A DIFFERENT KIND OF HLPF – from a Youthful Soul https://t.co/fjKlgw6a4z @ICLEI_advocacy  @TaboadaDaniela  @GlblCtzn @hlpf2019 @SDGActors  #SDG #YOUTH #youthsdgs #GlobalGoals #SDGs"                                                                                       
[4] "@YouthSDGs Firstly, raising awareness among the youth about what are the #SDGs and why it matters for the youth to be fully involved and committed in the process of achieving all 17 #SDGs across the world \U0001f30e! #YouthSDGs"                                                               
[5] "@YouthSDGs I did YOUTH &amp; SUSTAINABLE DEVELOPMENT as a subject of Bachelor of Arts in Youth Development Work. Sustainable development main goal is all about securing the livelihoods of the future generations of which the youth belong to automatically! #SDGs #YouthSDGs"          
[6] "- Promote young innovators\n- Give young people the platform to actively participate in dialogues about #SDGs \n- Let young people lead in implementing #SDGs projects. We have the energy and skill to make it happen\n\n#LeaveNoOneBehind #youthsdgs @YouthSDGs https://t.co/qQ61KhZ6jR"

Data Cleaning

Looking at the data above, it becomes clear that social media data needs a lot of clean-up.

First, there are URLs in your tweets. If you want to do a text analysis to figure out which words are most common in your tweets, the URLs won’t be helpful. Let’s remove those.

Code

# remove URLs (http and https); \\S+ stops at the first space,
# so any text that follows a link is kept
youth_tweets$text <- gsub("https?://\\S+", "", youth_tweets$text)
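The choice of pattern matters here. A pattern such as "http.*" is greedy: it removes everything from the first link to the end of the tweet, including any hashtags and mentions that follow it. The small comparison below uses a shortened, made-up tweet modelled on the sample output above.

```r
# Made-up tweet with text after the link
tweet <- "Guest blog from a Youthful Soul https://t.co/fjKlgw6a4z @GlblCtzn #youthsdgs"

# Greedy pattern: everything after the link is lost as well
greedy <- gsub("http.*", "", tweet)

# Pattern that stops at the first whitespace: only the link is removed
targeted <- gsub("https?://\\S+", "", tweet)

print(greedy)
print(targeted)
```

With the second pattern, the mention and the hashtag after the link remain available for the word counts later on.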

You can use the unnest_tokens() function from the tidytext package to clean up your text. When you use this function, the following happens:

The text is converted to lowercase: each word is lowercased so that you don’t get duplicate words due to variation in capitalization.
Punctuation is removed: all instances of periods, commas etc. are removed from your list of words.
Each word gets its own row: the other columns of the data.frame (for example a tweet id, if you keep one) are repeated on every row, so each word can still be traced back to its tweet.

Here, unnest_tokens() needs two arguments:

The name of the new column where each word will be stored.
The name of the column in your data.frame that you want to pull the words from.

In your case, you want to use the text column, which is where your cleaned-up tweet text is stored.

# remove punctuation, convert to lowercase, one row per word
youth_tweets_clean <- youth_tweets %>%
  dplyr::select(text) %>%
  unnest_tokens(word, text)
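To see what unnest_tokens() does without calling the Twitter API, here is a small sketch on a made-up two-row data.frame. The id column is an assumption added for illustration; it shows that columns you keep are repeated for every word.

```r
library(tidytext)

# Made-up data: two short "tweets" with an id column
toy <- data.frame(
  id   = c(1, 2),
  text = c("Happy Tuesday! #SDGs rock", "Youth lead, youth act."),
  stringsAsFactors = FALSE
)

# One row per word, lowercased, punctuation (including '#') dropped
tokens <- unnest_tokens(toy, word, text)
print(tokens)
```

Because the article's code selects only the text column before tokenizing, its result contains the word column alone. Keep an id column in the select() call if you need to trace words back to tweets.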

Now, let’s plot the data:

Plot the top 20 words

# Now you can plot your data. What do you notice?

# plot the top 20 words
youth_tweets_clean %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  # labels are set on the original aesthetics: x holds the words, y the counts
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in #YouthSDGs tweets")
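The counting step that feeds the plot can be tried on its own. The sketch below uses a made-up word list in place of the tweet data, and passes the weighting column to top_n() explicitly so that dplyr does not have to guess it.

```r
library(dplyr)

# Made-up word list standing in for youth_tweets_clean
toy_words <- data.frame(
  word = c("youth", "sdgs", "youth", "future", "sdgs", "youth"),
  stringsAsFactors = FALSE
)

top_words <- toy_words %>%
  count(word, sort = TRUE) %>%  # one row per word, with its frequency in n
  top_n(2, n)                   # keep the 2 most frequent words

print(top_words)
```

Note that top_n() keeps ties, so on real data it can return more rows than the number you ask for.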

Output

Your plot of unique words contains some words that may not be useful, for instance “a”, “to” and “in”. In the world of text mining, these are called ‘stop words’. You want to remove them from your analysis because they are fillers used to compose a sentence.

Lucky for us, the tidytext package includes a dataset that will help us clean up stop words! To use it, you:

Load the stop_words data included with tidytext. This data is simply a list of words that you may want to remove in natural language analysis.
Then you use anti_join to remove all stop words from your analysis.

On we go, let’s give it a try!

# load list of stop words - from the tidytext package
data("stop_words")
# view first 6 words
head(stop_words)
# A tibble: 6 x 2
# word      lexicon
# <chr>     <chr>
# 1 a         SMART
# 2 a's       SMART
# 3 able      SMART
# 4 about     SMART
# 5 above     SMART
# 6 according SMART

nrow(youth_tweets_clean)
## [1] 232

# remove stop words from your list of words
youth_tweets_words <- youth_tweets_clean %>%
  anti_join(stop_words)

# there should be fewer words now
nrow(youth_tweets_words)
## [1] 132
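A small sketch of the same anti_join step on made-up words shows which rows survive. Passing by = "word" names the join column explicitly, which also silences the message dplyr prints when it has to guess the column.

```r
library(dplyr)
library(tidytext)

data("stop_words")

# Made-up word list: three content words mixed with three stop words
toy_words <- data.frame(
  word = c("sdgs", "to", "rwanda", "in", "a", "plastic"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose word does not appear in stop_words
kept <- anti_join(toy_words, stop_words, by = "word")
print(kept)
```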

Now that the last data cleaning step is done, let’s see our code and output:

Code

# plot the top 10 words -- notice any issues?
youth_tweets_words %>%
  count(word, sort = TRUE) %>%
  top_n(10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(y = "Count",
       x = "Unique words",
       title = "Count of top 10 unique words found in #YouthSDGs tweets",
       subtitle = "Stop words removed from the list")

Output

Step 3: Let’s try a Comparison Word Cloud

The comparison.cloud() function in the wordcloud package plots the most common words split by sentiment, here using the positive and negative words of the “bing” sentiment lexicon.

Code

library(wordcloud)
library(reshape2)

youth_tweets_words %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("blue", "purple"),
                   max.words = 150)
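Before drawing the cloud, it can help to inspect the table that the join produces. The sketch below runs the first two steps of the pipeline on a made-up word list. Words that are not in the “bing” lexicon (here the hashtag word) are dropped by inner_join, so they never reach the cloud.

```r
library(dplyr)
library(tidytext)

# Made-up word list standing in for youth_tweets_words
toy_words <- data.frame(
  word = c("happy", "bad", "sdgs", "happy"),
  stringsAsFactors = FALSE
)

toy_sentiment <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keeps lexicon words only
  count(word, sentiment, sort = TRUE)

print(toy_sentiment)
```

acast() then reshapes this table into a matrix with one row per word and one column per sentiment, which is the input that comparison.cloud() expects.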

Output