
Building a simple sentiment classifier in R using Trump's tweets

Over the past few years, tasks involving text and speech processing have become a hot trend. Among the many research areas within Natural Language Processing and Machine Learning, sentiment analysis ranks especially high. Sentiment analysis identifies and extracts subjective information from source data using data analysis and visualization, ML models for classification, and text mining. It helps to understand public opinion on a subject, which is why sentiment analysis is widely used in business and politics, and is usually conducted on social networks.

Social networks as the main resource for sentiment analysis

Nowadays, social networks and forums are the main stage for people sharing opinions, which is why they are so interesting to researchers trying to figure out attitudes toward one subject or another. Sentiment analysis takes on the problem of analyzing textual data created by users on social networks, microblogging platforms, and forums, as well as on business platforms, with regard to the opinions users hold about a product, service, person, idea, and so on. In the most common approach, text is classified into two classes (binary sentiment classification): positive and negative; however, sentiment analysis can involve many more classes as a multi-class problem. Sentiment analysis can also process hundreds or thousands of texts in a short period of time. This is another reason for its popularity: while people need many hours to do the same work, sentiment analysis finishes it in a few seconds.

Common approaches for classifying sentiments

Sentiment analysis of text data can be done via three commonly used approaches: machine learning, dictionary-based, and hybrid.

Learning-based approach

The machine learning approach is one of the most popular nowadays. Using ML techniques, you can build a classifier that identifies different sentiment expressions in text.

Dictionary-based approach

The main concept of this approach is using a bag of words with polarity scores that help establish whether a word has a positive, negative, or neutral connotation. Such an approach doesn't require any training set, which makes it possible to classify even a small amount of data. However, many words and expressions are still not included in sentiment dictionaries.
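As a minimal sketch of this idea (using a tiny made-up lexicon rather than a real sentiment dictionary), a dictionary-based scorer can be as simple as:

```r
# Toy polarity lexicon (hypothetical scores, for illustration only)
lexicon <- c(good = 1, great = 1, bad = -1, terrible = -1)

# Strip punctuation, tokenize on whitespace, and sum the polarity
# scores of the words that appear in the lexicon
score_sentence <- function(sentence, lexicon) {
  cleaned <- gsub("[[:punct:]]", "", tolower(sentence))
  words <- unlist(strsplit(cleaned, "\\s+"))
  sum(lexicon[words], na.rm = TRUE)
}

score_sentence("What a great day", lexicon)      # 1 (positive)
score_sentence("A terrible, bad idea", lexicon)  # -2 (negative)
```

A real dictionary such as Bing works the same way, just with thousands of scored words instead of four.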

Hybrid approach

As is evident from the title, this approach combines machine learning and lexicon-based techniques. Although it is not yet widely used, the hybrid approach often shows more promising and valuable results than either technique used separately.

In this article, we will implement a dictionary-based approach, so let's dig into its basics.

Dictionary- (or lexicon-) based sentiment analysis uses special dictionaries and lexicons, many of which are available for calculating sentiment in text. The main ones available through the tidytext package are:

  • AFINN
  • Bing
  • NRC

All three are sentiment dictionaries that help evaluate the valence of textual data by searching for words that describe emotion or opinion.

Things to do before sentiment analysis

Before we start building the sentiment analyzer, a few steps must be taken. First of all, we need to state the problem we are going to explore and understand its objective. Since we will use data from Donald Trump's Twitter account, let's define our objective as analyzing what connotation his latest tweets have.

With the problem outlined, we need to prepare our data for examination.

Data preprocessing is basically the initial step in text and sentiment classification. Depending on the input data, various techniques can be applied to make the data more comprehensible and to improve the effectiveness of the analysis. The most common steps in data preprocessing are:

  • removing numbers
  • removing stopwords
  • removing punctuation and so on.
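These steps can be sketched with base R regular expressions alone (the tweet-like string below is made up for illustration; stopword removal is handled later with tidytext's stop_words):

```r
# A made-up tweet-like string for illustration
raw <- "Stocks up almost 50% since 2016!!! Details: https://t.co/abcdefg"

clean <- tolower(raw)                        # normalize case
clean <- gsub("https?://\\S+", "", clean)    # remove links (before punctuation!)
clean <- gsub("[0-9]+", "", clean)           # remove numbers
clean <- gsub("[[:punct:]]+", "", clean)     # remove punctuation
clean <- gsub("\\s+", " ", trimws(clean))    # collapse extra whitespace

clean  # "stocks up almost since details"
```

Note that links are stripped before punctuation: removing punctuation first would mangle the URLs and make them impossible to match.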

Building sentiment classifier

The first step in building our classifier is installing the needed packages. We will need the following packages, each of which can be installed with install.packages("package name") directly from the development environment:

  • twitteR
  • dplyr
  • splitstackshape
  • tidytext
  • purrr

As soon as all the packages are installed, let's initialize them.

In [6]:
library(twitteR)
library(dplyr)
library(splitstackshape)
library(tidytext)
library(purrr)

We're going to get tweets from Donald Trump's account directly from Twitter, and that is why we need to provide Twitter API credentials.

In [26]:
api_key <- "----"
api_secret <- "----"
access_token <- "----"
access_token_secret <- "----"
In [27]:
setup_twitter_oauth(api_key,api_secret,access_token,access_token_secret)
[1] "Using direct authentication"

And now it's time to get tweets from Donald Trump's account and convert the data into a dataframe.

In [114]:
TrumpTweets <- userTimeline("realDonaldTrump", n = 3200)
In [115]:
TrumpTweets <- tbl_df(map_df(TrumpTweets, as.data.frame))

Here is how our initial dataframe looks:

In [116]:
head(TrumpTweets)
 
text favorited favoriteCount replyToSN created truncated replyToSID id replyToUID statusSource screenName retweetCount isRetweet retweeted longitude latitude
It was my great honor to host Canadian Prime Minister @JustinTrudeau at the @WhiteHouse today!πŸ‡ΊπŸ‡ΈπŸ‡¨πŸ‡¦ https://t.co/orlejZ9FFs FALSE 17424 NA 2019-06-20 17:49:52 FALSE NA 1141765119929700353 NA <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> realDonaldTrump 3540 FALSE FALSE NA NA
Iran made a very big mistake! FALSE 127069 NA 2019-06-20 14:15:04 FALSE NA 1141711064305983488 NA <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> realDonaldTrump 38351 FALSE FALSE NA NA
“The President has a really good story to tell. We have unemployment lower than we’ve seen in decades. We have peop… https://t.co/Pl2HsZbiRK FALSE 36218 NA 2019-06-20 14:14:13 TRUE NA 1141710851617034240 NA <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> realDonaldTrump 8753 FALSE FALSE NA NA
S&P opens at Record High! FALSE 43995 NA 2019-06-20 13:58:53 FALSE NA 1141706991464849408 NA <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> realDonaldTrump 9037 FALSE FALSE NA NA
Since Election Day 2016, Stocks up almost 50%, Stocks gained 9.2 Trillion Dollars in value, and more than 5,000,000… https://t.co/nOj2hCnU11 FALSE 62468 NA 2019-06-20 00:12:31 TRUE NA 1141499029727121408 NA <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> realDonaldTrump 16296 FALSE FALSE NA NA
Congratulations to President Lopez Obrador — Mexico voted to ratify the USMCA today by a huge margin. Time for Congress to do the same here! FALSE 85219 NA 2019-06-19 23:01:59 FALSE NA 1141481280653209600 NA <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> realDonaldTrump 20039 FALSE FALSE NA NA

To prepare our data for classification, let's get rid of links and reshape the dataframe so that there is only one word per line.

In [117]:
TrumpTweets <- TrumpTweets[-(grep('t.co', TrumpTweets$'text')),]
In [118]:
TrumpTweets$'tweet' <- 'tweet'
TrumpTweets <- TrumpTweets[ , c('text', 'tweet')]
TrumpTweets <- unnest_tokens(TrumpTweets, words, text)

And this is how our dataframe looks now:

In [119]:
tail(TrumpTweets)
tweet words
tweet most
tweet of
tweet their
tweet people
tweet from
tweet venezuela
In [120]:
head(TrumpTweets)
tweet words
tweet iran
tweet made
tweet a
tweet very
tweet big
tweet mistake

Obviously, the dataframe still contains various words without useful content, so it's a good idea to get rid of them.

In [121]:
TrumpTweets <- anti_join(TrumpTweets, stop_words, by = c('words' = 'word'))

And here's the result:

In [122]:
tail(TrumpTweets)
tweet words
tweet harassment
tweet russia
tweet informed
tweet removed
tweet people
tweet venezuela
In [123]:
head(TrumpTweets)
tweet words
tweet iran
tweet mistake
tweet amp
tweet record
tweet congratulations
tweet president

Much better, isn't it?

Let's see how many times each word appears in Donald Trump's tweets.

In [124]:
word_count <- dplyr::count(TrumpTweets, words, sort = TRUE)
In [125]:
head(word_count)
words n
day 2
democrats 2
enjoy 2
florida 2
iran 2
live 2

Now it's time to create a dataframe with sentiments that will be used for classifying the tweets. We will use the Bing dictionary, although you can easily use any other source.

In [126]:
sentiments <- get_sentiments("bing")
sentiments <- dplyr::select(sentiments, word, sentiment)
In [127]:
TrumpTweets_sentiments <- merge(word_count, sentiments, by.x = c('words'), by.y = c('word'))

Above, we did a simple classification of the words in Trump's tweets using our sentiment bag of words. This is how the result looks:

In [128]:
TrumpTweets_sentiments
words n sentiment
beautiful 1 positive
burning 1 negative
congratulations 1 positive
defy 1 negative
enjoy 2 positive
harassment 1 negative
hell 1 negative
limits 1 negative
mistake 1 negative
scandal 1 negative
strong 1 positive
trump 1 positive

Let's look at the number of occurrences per sentiment in tweets.

In [131]:
sentiments_count <- dplyr::count(TrumpTweets_sentiments, sentiment, sort = TRUE)
In [132]:
sentiments_count
sentiment n
negative 7
positive 5

We may also want to know the total count and the percentage of each sentiment.

In [133]:
sentiments_sum <- sum(sentiments_count$'n')
In [134]:
sentiments_count$'percentage' <- sentiments_count$'n' / sentiments_sum

Let's now create an ordered dataframe for plotting counts of sentiments.

In [135]:
sentiment_count <- sentiments_count
In [136]:
sentiment_count <- sentiment_count[order(sentiment_count$sentiment), ]
In [137]:
sentiment_count
sentiment n percentage
negative 7 0.5833333
positive 5 0.4166667

And now it's time for the visualization. We will plot the results of our classifier.

In [144]:
sentiment_count$'colour' <- as.integer(4)
In [145]:
barplot(sentiment_count$'n', names.arg = sentiment_count$'sentiment', col = sentiment_count$'colour', cex.names = .5)
In [146]:
barplot(sentiment_count$'percentage', names.arg = sentiment_count$'sentiment', col = sentiment_count$'colour', cex.names = .5)
 
 

Conclusion

Sentiment analysis is a great way to explore emotions and opinions among people. Today we explored the most common and easiest way to do sentiment analysis, one that is still great in its simplicity and gives quite informative results. However, it should be noted that different sentiment analysis methods and lexicons work better or worse depending on the problem and the text corpus.

The result of the dictionary-based approach also depends heavily on the match between the dictionary used and the textual data to be classified. Still, users can create their own dictionaries, which can be a good solution. Even so, simple dictionary-based methods often hold up surprisingly well against more complex techniques.
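To illustrate that last point about custom dictionaries: a lexicon is just a data frame of words and sentiment labels in the same shape as get_sentiments("bing"), so it can be dropped straight into the merge step used above. The words and counts here are invented for the example:

```r
# A hypothetical custom lexicon in the same shape as the Bing dictionary
my_lexicon <- data.frame(
  word = c("winning", "hoax", "record", "scandal"),
  sentiment = c("positive", "negative", "positive", "negative"),
  stringsAsFactors = FALSE
)

# Example word counts in the same shape as word_count in the article
word_count <- data.frame(
  words = c("hoax", "record", "florida"),
  n = c(2, 1, 3),
  stringsAsFactors = FALSE
)

# The same merge as in the article: keeps only words found in the lexicon
merge(word_count, my_lexicon, by.x = "words", by.y = "word")
```

Words missing from the lexicon ("florida" here) are simply dropped, which is exactly the coverage limitation discussed above.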