The Power of Social Media Analytics: Twitter Text Mining Using R
Imagine you are a data scientist or data enthusiast who is curious about what people online are saying about a product you have just launched. You most likely introduced a hashtag for it; anyone who clicks the hashtag is taken to a feed of the most recent tweets that contain it.
The main question then becomes: how do you extract tweets in R? Several software tools can analyze Twitter text, but I chose R because it offers a wide variety of packages for collecting, cleaning and visualizing text data. Before you start, you need:
- R and RStudio installed (https://cran.r-project.org/ and https://www.rstudio.com/products/rstudio/download/#download respectively).
- A Twitter account (https://twitter.com/login) and a Twitter developer app (https://developer.twitter.com/en/apps). The app's API credentials let us query Twitter and collect the tweets that contain the hashtag we specify in our API call.
Resources/ Book Recommendations
Text Mining with R: A Tidy Approach, by Julia Silge and David Robinson (https://www.tidytextmining.com/),
and any other resources you might find helpful :)
Let’s Get Started
Step 1. Twitter Authentication for extracting tweets
1. If you have opened https://developer.twitter.com/en/apps, you should be on a page that looks like this:
2. Click on “Create an app”; you should then see what is shown below:
3. Fill in the app details and give a simple description (this step should not worry you). For directed steps, I will guide you using a short GIF.
A few things to note here:
a) You can leave all the other fields blank for now except “Website URL (required)”. Here, insert your website's URL. Hopefully, you have one; it could be a medium.com site, a wordpress.com site or any other functional website you own.
b) Once you have filled in “Website URL (required)”, scroll down to the “Create” button, which should change from inactive (faded blue) to active (solid blue).
c) If everything looks right, click “Create”. You will then get this message: “Review our Developer Terms. As a reminder, you have agreed to our Developer Agreement and Policy. Please be mindful of the following restricted use cases etc.”. After reading and understanding the terms, click “Create” again.
Congratulations! You have now created your Twitter app.
From this step, you can definitely do a “happy dance” for your small but awesome achievement.
4. After following the above steps, you will get to a page like this;
5. Click on “Keys and tokens” to access your Consumer API keys (API key and API secret key). Then click “Create” under “Access token & access token secret” to generate your access token and access token secret.
For the security of my app, I will not show the access tokens and consumer keys that Twitter provided in any image.
Please note down the API key, API secret key, access token and access token secret, as rtweet needs them to authenticate from R later.
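One way to keep the four credentials out of your scripts is to store them as environment variables (for example in your `~/.Renviron` file) and read them at run time. The sketch below is a minimal illustration under stated assumptions, not the only way to do it: the variable names, the two helper functions and the app name `my_twitter_app` are made up for this example. `create_token()` is the helper from rtweet releases before 1.0; newer releases changed the authentication helpers, so check the documentation for your installed version.

```r
# Names of the environment variables (hypothetical; pick your own).
key_names <- c("TWITTER_API_KEY", "TWITTER_API_SECRET",
               "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")

# Read the four credentials and stop early if any of them is missing.
read_twitter_keys <- function(vars = key_names) {
  keys <- Sys.getenv(vars, unset = "", names = TRUE)
  missing <- names(keys)[keys == ""]
  if (length(missing) > 0) {
    stop("Missing environment variables: ", paste(missing, collapse = ", "))
  }
  as.list(keys)
}

# Build an rtweet token from the keys (assumes the rtweet < 1.0 interface).
make_twitter_token <- function(keys, app_name = "my_twitter_app") {
  rtweet::create_token(
    app             = app_name,
    consumer_key    = keys$TWITTER_API_KEY,
    consumer_secret = keys$TWITTER_API_SECRET,
    access_token    = keys$TWITTER_ACCESS_TOKEN,
    access_secret   = keys$TWITTER_ACCESS_SECRET
  )
}
```

Once the variables are set, `token <- make_twitter_token(read_twitter_keys())` would create the token in one line, and the keys never appear in the script itself.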
Once the Twitter Application is ready we can now move forward towards programming in R to extract data from Twitter.
Step 2: R Programming
Install and Load the Libraries
I will use the ‘rtweet’ package, authored and maintained by Michael W. Kearney, to collect Twitter data. Thank you, Michael!
Other packages in use:
tidyverse: data cleaning and data visualization
tidytext: text mining
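In case you don’t have any of these packages installed, install them first with install.packages(c("rtweet", "tidyverse", "tidytext")), then load them:
# load twitter library - the rtweet library is recommended now over twitteR
library(rtweet)
# plotting and pipes - tidyverse!
library(tidyverse)
# text mining library
library(tidytext)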
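If you are not sure which of these packages are already on your machine, a small check can save you from reinstalling everything. This is only a sketch: the helper name `missing_packages()` is made up for this example, and the installation is deliberately limited to interactive sessions (such as RStudio) so that a script never downloads anything silently.

```r
# Packages used in this walkthrough.
needed <- c("rtweet", "tidyverse", "tidytext")

# Return the subset of `pkgs` that is not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)

# Install only when running interactively.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```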
In this example, let’s find #YouthSDGs tweets. So what are Sustainable Development Goals? The Sustainable Development Goals are a collection of 17 global goals set by the United Nations General Assembly in 2015 for the year 2030. The SDGs are part of Resolution 70/1 of the United Nations General Assembly, the 2030 Agenda. The image below also explains it all.
Search the Tweets
youth_tweets <- search_tweets(q = "#YouthSDGs", n = 10000,
lang = "en",
include_rts = FALSE)
Let’s look at the results
# check data to see if there are emojis
head(youth_tweets$text)
 "Happy Tuesday! We can all protect the environment by doing simple things like carrying our own reusable water bottles. #BeatPlasticPollution #greenrwanda #greengrowth #Rwanda #Rwot #youthsdgs https://t.co/3eGG9AvwTY"
 "#youngpeople are key, the #future of our country and the future of our #communities \n\n#Peace #SDGs #sustainable #youth #youthsdgs #WEAREHERE #Youth2030 #Youth4Peace \n\n@ConnectSDGs @IYCM @Agenda2030MX @UNYouthEnvoy @UN4Youth https://t.co/jy0RYDG5yh"
 "Guest blog: TO A DIFFERENT KIND OF HLPF – from a Youthful Soul https://t.co/fjKlgw6a4z @ICLEI_advocacy @TaboadaDaniela @GlblCtzn @hlpf2019 @SDGActors #SDG #YOUTH #youthsdgs #GlobalGoals #SDGs"
 "@YouthSDGs Firstly, raising awareness among the youth about what are the #SDGs and why it matters for the youth to be fully involved and committed in the process of achieving all 17 #SDGs across the world \U0001f30e! #YouthSDGs"
 "@YouthSDGs I did YOUTH & SUSTAINABLE DEVELOPMENT as a subject of Bachelor of Arts in Youth Development Work. Sustainable development main goal is all about securing the livelihoods of the future generations of which the youth belong to automatically! #SDGs #YouthSDGs"
 "- Promote young innovators\n- Give young people the platform to actively participate in dialogues about #SDGs \n- Let young people lead in implementing #SDGs projects. We have the energy and skill to make it happen\n\n#LeaveNoOneBehind #youthsdgs @YouthSDGs https://t.co/qQ61KhZ6jR"
Looking at the data above, it becomes clear that there is a lot of clean-up associated with social media data.
First, there are URLs in your tweets. If you want to do a text analysis to figure out which words are most common in your tweets, the URLs won’t be helpful. Let’s remove those.
# remove URLs (http and https); \\S+ stops at the next space, so text after a link is kept
youth_tweets$text <- gsub("https?://\\S+", "", youth_tweets$text)
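The choice of pattern matters more than it might seem. The sketch below uses a made-up tweet to compare a greedy pattern, `http.*`, with a narrower one that stops at the next space. It uses base R only; the example text is invented for illustration.

```r
# Made-up tweet with a URL in the middle of the text.
tweet <- "Join us today https://t.co/abc123 and share your ideas #YouthSDGs"

# Greedy pattern: deletes the URL and everything after it.
greedy <- gsub("http.*", "", tweet)

# Narrower pattern: deletes only the URL (http or https, up to the next space).
narrow <- gsub("https?://\\S+", "", tweet)

# Collapse the double space left behind and trim the ends.
narrow <- trimws(gsub("\\s+", " ", narrow))

print(greedy)
print(narrow)
```

With the greedy pattern, the words and the hashtag that follow the link are lost, which would lower their counts in any later word-frequency analysis.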
You can use the unnest_tokens() function in the tidytext package to clean up your text. When you use this function, the following happens to the text:
- Text is converted to lowercase: each word found in the text will be converted to lowercase to ensure that you don’t get duplicate words due to variation in capitalization.
- Punctuation is removed: all instances of periods, commas etc. will be removed from your list of words.
- The other columns of the tweet, including its unique id, are repeated for each occurrence of a word.
The unnest_tokens() function takes two arguments:
- The name of the column where the unique words will be stored, and
- The name of the column in the data.frame that you want to pull unique words from.
In your case, you want to use the text column, which is where you have your cleaned-up tweet text stored.
# remove punctuation, convert to lowercase, add id for each tweet!
youth_tweets_clean <- youth_tweets %>%
  unnest_tokens(word, text)
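To see what unnest_tokens() does without touching the Twitter data, here is a small sketch on two made-up tweets. The data frame `toy_tweets` and its `id` column are invented for illustration; the default word tokenizer is assumed.

```r
library(dplyr)
library(tidytext)

# Two made-up tweets; `id` stands in for any other column you keep.
toy_tweets <- tibble(
  id   = c(1, 2),
  text = c("Youth lead the #SDGs!", "Protect the environment, today.")
)

# One row per word: `word` is the new column, `text` is the source column.
toy_words <- toy_tweets %>%
  unnest_tokens(word, text)

print(toy_words)
```

Each row of the result holds a single lowercase word together with the `id` of the tweet it came from, which is the one-token-per-row shape that the counting and plotting steps rely on.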
Now, let’s plot the data;
Plot the top 20 words
# Now you can plot your data. What do you notice?
# plot the top 20 words
youth_tweets_clean %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in #YouthSDGs tweets")
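Before the bars are drawn, the pipeline reduces the tokens to a small frequency table. The sketch below shows that counting step on its own, using made-up tokens and keeping only the two most frequent words; `toy_words` is invented for illustration.

```r
library(dplyr)

# Made-up tokens, one word per row, in the shape unnest_tokens() returns.
toy_words <- tibble(
  word = c("youth", "the", "sdgs", "the", "youth", "the", "peace")
)

# Count each word, most frequent first, then keep the two most frequent.
top_words <- toy_words %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 2, with_ties = FALSE)

print(top_words)
```

Note that a filler word tops this toy table, which is the same pattern you should expect in the real plot.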
Your plot of unique words contains some words that may not be useful, for instance “a”, “to” and “in”. In the world of text mining, these are called ‘stop words’. You want to remove them from your analysis, as they are fillers used to compose a sentence.
Luckily for us, the tidytext package has a built-in list that will help us clean up stop words! To use it, you:
- Load the stop_words data included with tidytext. This data is simply a list of words that you may want to remove in natural language analysis.
- Then use anti_join() to remove all stop words from your analysis.
On we go, let’s give it a try!
# load list of stop words - from the tidytext package
data("stop_words")
# view first 6 words
head(stop_words)
# A tibble: 6 x 2
#   word      lexicon
#   <chr>     <chr>
# 1 a         SMART
# 2 a's       SMART
# 3 able      SMART
# 4 about     SMART
# 5 above     SMART
# 6 according SMART
nrow(youth_tweets_clean)
## [1] 232
# remove stop words from your list of words
youth_tweets_words <- youth_tweets_clean %>%
  anti_join(stop_words, by = "word")
# there should be fewer words now
nrow(youth_tweets_words)
## [1] 132
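The same anti_join() idea works with your own list of words to drop. One case where this may help: the search hashtag itself appears in nearly every tweet, so it can crowd out more informative words. The sketch below uses made-up tokens and a hand-made stop list; `toy_words` and `my_stop_words` are hypothetical names for this example.

```r
library(dplyr)

# Made-up tokens, one word per row.
toy_words <- tibble(
  word = c("youth", "the", "sdgs", "to", "youthsdgs", "in", "peace")
)

# Hand-made stop list (hypothetical); "youthsdgs" is the search term itself.
my_stop_words <- tibble(word = c("a", "the", "to", "in", "youthsdgs"))

# anti_join() keeps the rows of toy_words with no match in my_stop_words.
kept <- toy_words %>%
  anti_join(my_stop_words, by = "word")

print(kept$word)
```

In a real analysis you could combine such a custom list with the stop_words data, for example with bind_rows(), before calling anti_join().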
Since we have performed the last step of the data cleaning process, let’s see our code and output;
# plot the top 10 words -- notice any issues?
youth_tweets_words %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(y = "Count",
       x = "Unique words",
       title = "Count of top 10 unique words found in #YouthSDGs tweets",
       subtitle = "Stop words removed from the list")
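Step 3: Let’s Try a Comparison Word Cloud
The comparison.cloud() function in the wordcloud package splits the most common words into those found in the positive and negative sentiment dictionaries. Each word first needs a sentiment label; here that label comes from joining the words to a sentiment lexicon (the Bing lexicon via get_sentiments("bing") is assumed).
library(reshape2)   # acast()
library(wordcloud)  # comparison.cloud()
youth_tweets_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # assumed lexicon
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("blue", "purple"),
                   max.words = 150)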
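It can help to look at the table that comparison.cloud() receives before drawing anything. The sketch below builds that word-by-sentiment matrix from made-up tokens and a tiny hand-made lexicon; both `toy_words` and `toy_lexicon` are hypothetical stand-ins, not a real sentiment dictionary.

```r
library(dplyr)
library(reshape2)

# Made-up tokens and a tiny hand-made sentiment lexicon (both hypothetical).
toy_words <- tibble(
  word = c("happy", "protect", "pollution", "happy",
           "peace", "pollution", "crisis")
)
toy_lexicon <- tibble(
  word      = c("happy", "protect", "peace", "pollution", "crisis"),
  sentiment = c("positive", "positive", "positive", "negative", "negative")
)

# Tag each word with a sentiment, count, and spread into a word x sentiment matrix.
sentiment_matrix <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0)

print(sentiment_matrix)
```

Each row is a word and each column a sentiment, with zeros where a word never carries that label. Passing a matrix of this shape to wordcloud::comparison.cloud() draws one half of the cloud per column.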
There’s so much more one can do with Twitter Text mining and sentiment analytics. Thanks for reading!
For more ideas, check;
Feel free to contact me and share feedback on Twitter (https://twitter.com/magwanjiru).