Getting the Data

This blog assumes that you were able to follow along with the bag of words organization articulated in the previous blog. That means you should be able to load a simple CSV file, clean and preprocess it, and then create either a term document matrix or document term matrix. If you struggled with these concepts, then review them once more before proceeding. This blog moves rapidly from file to document matrix, so that explanations are kept to the additional manipulations needed to create the visualizations.

Simple Exploration: Term Frequency, Associations and Word Networks

Sometimes merely looking at frequent terms can be an interesting and insightful endeavor. On some occasions, frequent terms are expected within a text mining project. However, unusual words or (later in the book as you explore multi‐gram tokenization) phrases can yield a previously unknown relationship.

This section of the blog constructs simple visualizations from term frequency as a means to reinforce what would probably already be known in the case of DeltaAssist customer service tweets. It then goes a step further to look at a specific term association. Text mining association is similar to the statistical concept of correlation.

That is to say, as the frequency of a single word occurs, how correlated or associated is it with another word? The exploration of the term association can yield interesting relationships among a large set of terms. Without also coupling association with word frequency, this may actually be misleading and become a fishing expedition, because the number of individual terms can be immense. Lastly, this section adds a word network using the qdap package.
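As a rough illustration of the idea (toy data, not the DeltaAssist corpus), association can be pictured as a correlation between term-presence vectors across documents:

```r
# Toy illustration: presence (1) or absence (0) of two terms
# across six hypothetical documents
text.v   <- c(1, 1, 0, 1, 0, 1)
mining.v <- c(1, 1, 0, 1, 0, 0)

# Terms that tend to appear in the same documents correlate highly
cor(text.v, mining.v)
```

Note that while `cor` can return negative values, text mining association scores (such as those from tm's `findAssocs`, used later in this blog) range only from zero to one.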

This is another way in which to explore association and connected terms. Those familiar with social network analysis will be equally familiar with the concept of a word network. This relationship between words is captured in a special matrix called an adjacency matrix, similar to the individuals of a social network. A word network differs from word association. A word network explores multiple word linkages simultaneously.

For example, the words “text,” “mining” and “book” can all be graphed at the same time in a word network. The word network will have scores for pairs “text” to “mining,” “text” to “book” and “mining” to “book”. In contrast, word association scores represent the relationships of a single word to others, such as “text” to “mining” and “text” to “book”. This contrasts because there is no score for the pair “mining” and “book.”
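The adjacency matrix behind a word network can be sketched directly from a term document matrix. The tiny matrix below is made up for illustration; with real data you would start from the `as.matrix` form of your tm term document matrix:

```r
# Toy term-document matrix: 3 terms x 4 documents (1 = term appears)
tdm.m <- matrix(c(1, 1, 0, 1,
                  1, 0, 1, 1,
                  0, 1, 1, 1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("text", "mining", "book"),
                                paste0("doc", 1:4)))

# Multiplying the matrix by its transpose yields a term-by-term
# adjacency matrix; each off-diagonal cell counts the documents
# in which that pair of terms co-occurs
adjacency <- tdm.m %*% t(tdm.m)
adjacency["text", "mining"]
```

Every pair of terms gets a cell at once, which is exactly the "multiple word linkages simultaneously" property that distinguishes a word network from single-word association.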

Term Frequency

Although not aesthetically interesting, a bar plot can convey amounts in a quick and easy‐to‐digest manner. So let’s create a bar plot of the most frequent terms and see if anything surprising shows up. To do so you will load the package ggthemes. This package has predefined themes and color palettes for ggplot2 visualizations. As a result, we do not have to specify them all explicitly, which saves time compared to using the popular ggplot2 package alone. There are other visualization packages within the R ecosystem, but ggplot2 is both popular and adequate in most cases.

You need to bring in a corpus and then organize it. To do so you ultimately need to get back to a cleaned term document matrix. After applying the last blog’s clean.corpus custom function, you need to make the matrix and then, as before, get the row sums into an ordered data frame. The code below should look very familiar as it redoes the same steps as the previous blog, ending in an ordered data frame of term frequencies. However, at this point, we go a step beyond the tabled data and create a simple bar plot.







library(tm)
library(ggplot2)
library(ggthemes)

tryTolower <- function(x){
  y = NA
  try_error = tryCatch(tolower(x), error = function(e) e)
  if (!inherits(try_error, 'error'))
    y = tolower(x)
  return(y)
}

custom.stopwords <- c(stopwords('english'), 'lol', 'smh', 'delta', 'amp')

clean.corpus <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(tryTolower))
  corpus <- tm_map(corpus, removeWords, custom.stopwords)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeNumbers)
  return(corpus)
}

# tweets is the data frame loaded from the previous blog's CSV
meta.data.reader <- readTabular(mapping=list(content="text", id="ID"))
corpus <- VCorpus(DataframeSource(tweets),
                  readerControl=list(reader=meta.data.reader))
corpus <- clean.corpus(corpus)

tdm <- TermDocumentMatrix(corpus, control=list(weighting=weightTf))
tdm.tweets.m <- as.matrix(tdm)
term.freq <- rowSums(tdm.tweets.m)
freq.df <- data.frame(word=names(term.freq), frequency=term.freq)
freq.df <- freq.df[order(freq.df[,2], decreasing=T),]

All of the above code is needed to create the freq.df data frame object. This becomes the data used in the ggplot2 code below that constructs the bar plot. In order to have ggplot2 sort the bars by value, the unique words have to be changed from a string to a factor with unique levels. Then you actually call on ggplot2 to make your bar plot.

The first input to the ggplot function is the data to reference. Here you specify only the first 20 words so that the visual will not be too cluttered. The freq.df[1:20,] below can be adjusted to add more or fewer bars in the visualization.

freq.df$word <- factor(freq.df$word,
                       levels=unique(as.character(freq.df$word)))

ggplot(freq.df[1:20,], aes(x=word, y=frequency)) +
  geom_bar(stat="identity", fill='darkred') +
  coord_flip() +
  theme_gdocs() +
  geom_text(aes(label=frequency), colour="white", hjust=1.25, size=5.0)

The gg within ggplot stands for the grammar of graphics, the framework the package uses to create visualizations. The ggplot code uses a structured and layered method to create visualizations. First, we pass the data object to be used in the visual, freq.df, indexing only the first 20 terms. Next, we define the aesthetics with the x and y axes referencing the column names of the data object. Once this is done we add a layer using the plus sign.

The new layer is to contain bars, and so it uses the geom_bar function. It creates one‐dimensional rectangles whose height is mapped to the value referenced. Further, geom_bar must be told how to statistically handle the values. Here you specify the identity so that each bar represents a unique identity and is not transformed in another manner. The code also fills the bars with dark red. You can specify various colors, including hexadecimal colors, as part of the fill parameter.

The next layer is again added with a plus sign and simply rotates the x and y of the graph. This is an empty function call and can be removed if it suits the analyst making the visual. The next layer is actually from ggthemes and represents an entire predefined style. In this case, it is meant to mimic Google document visualizations.

You can change the style manually using many parameters, leave the default ggplot2 style or use a ggtheme as you have done here. Lastly, another layer is added on top of the bars. The geom_text layer represents the white numerical text labels at the end of each bar. On this last layer, you can change color, adjust the position, and even adjust it to the size you desire.


In this rudimentary view, you can see that many of the tweets from Delta are apologies and discussions about flight and confirmation numbers.

There is nothing overly surprising or insightful in this view, but sometimes this simple approach can yield unexpected insights. Later you will use two-word pairs instead of single-word tokens. In my experience, changing the tokenization can enrich the insights found through a basic term frequency visual. The website has an extra Twitter data set called chardonnay.csv in which this approach can show an unexpected yet frequent result as you adjust the stopwords.

Notice that all the words are lowercase, and there is even a mistakenly parsed word “amp.” This is the result of the character encoding not being recognized properly. Character encoding is the process of converting text to bytes that represent characters. R needs to read each character and encode it to a corresponding byte. There are numerous encoding types of languages, characters, and symbols, and as a result, mistakes can occur.

The same issue can occur with emoticons, as those are parsed into completely different characters and byte strings than would be necessary for R to be able to make sense of them. In subsequent scripts you will add more lines of code to specify an encoding, thereby changing the “amp” to the ampersand “&.”
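As a preview of that fix, one simple approach is to substitute the mis-parsed HTML entity back to its intended character after the fact (a minimal sketch; the later scripts address this more robustly by specifying the encoding when the file is read):

```r
# The HTML entity "&amp;" can survive parsing as the stray token "amp";
# a direct substitution restores the intended ampersand
raw.text <- c("Flight delayed &amp; rebooked",
              "Thanks &amp; safe travels")
fixed.text <- gsub("&amp;", "&", raw.text, fixed=TRUE)
fixed.text
```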

Based on this initial visualization, you can dive further into the analysis. In the bar chart, apologies are mentioned many times. Sometimes it makes sense to explore unexpected individual words and find the words most associated with them. In this simple example, you can explore associated terms with apologies to hopefully understand what DeltaAssist is apologizing for. In your own text mining analysis, the surprising words in the bar plot or frequency analysis are ripe for the following additional exploration called “association.”

Word Associations

In text mining, association is similar to correlation. That is, when term x appears, how strongly is term y associated with it? Association is not directly related to frequency but instead refers to term pairings. Unlike statistical correlation, the range is between zero and one, instead of negative one to one. For example, in this blog, the term “text” would have a high association with “mining,” as we often refer to text mining together.

The next code explores the word associations with the term “apologies” within the DeltaAssist corpus. The word “apologies” was chosen after first reviewing the frequent terms for unexpected items, or in this case, to learn about the behavior of customer service agents.

Since the association analysis is limited to specific interesting words from the frequency analysis, you are hopefully not fishing for associations that would yield a non‐insightful outcome. Since all words have some associated word, looking at outliers alone may not be appropriate, and thus the frequency analysis is usually performed first. In the next code, we look for words with associations greater than 0.11, but you will likely have to adjust the threshold in your own analysis.



The code itself creates a data frame of factors for each term and the corresponding association value. The new data frame carries the same information as the associations matrix but is a data frame class that ggplot2 can use. The row names are copied into their own terms column, and the terms are changed from strings to categorical factors. These steps may seem redundant, but this approach makes the data explicit and easy to follow when it is used in ggplot.

associations <- findAssocs(tdm, 'apologies', 0.11)
associations <- as.data.frame(associations)
associations$terms <- row.names(associations)
associations$terms <- factor(associations$terms, levels=associations$terms)

Once you have a clean data frame of highly associated words and their corresponding values, you can use it for another visual. Again building layer by layer, ggplot pulls data from the associations data frame. You are setting the y-axis to be the terms and the x-axis to be the values. Instead of geom_bar, you are using geom_point and setting the size explicitly.

Next, you use the predefined gdocs theme. Then you add a layer of labels for the values in dark red. Lastly, you change the default gdoc theme by increasing the y-axis label’s size and removing the y-axis title.

ggplot(associations, aes(y=terms)) +
  geom_point(aes(x=apologies), data=associations, size=5) +
  theme_gdocs() +
  geom_text(aes(x=apologies, label=apologies),
            colour="darkred", hjust=-.25, size=8) +
  theme(text=element_text(size=20), axis.title.y=element_blank())

Again, notice some poor parsing of the text done by R. Instead of “you’re”, R has interpreted the word to include some foreign characters and even a trademark abbreviation! You will finally learn how to clean this up in the word cloud section, but for now, focus on the meaning of association and basic visualization.

In this case, these words confirm what you likely already know, that airline customer service personnel have to apologize for late arrivals and delays. However, in other instances, this type of analysis can be useful. Consider a corpus with technical customer reviews and complaints about laptops. Performing a simple word frequency and association analysis may yield the exact cause of poor reviews.

You could find common words – e.g. “screen problem” – within the corpus. And reviewing the associated words with screen and problem may yield highly associated terms like HDMI and cable or driver. It is often the case that term frequency and word association alone can yield some surprising results that can lead to insight or confirm an existing belief. In this simplistic case, the following tweet confirms the word frequency and association conclusion that agents apologize for delays.

“@kitmoni At the moment there appears to be a delay of slightly over an hour. My apologies for today’s experience with us. *RB”

What is Sentiment Analysis?

At first thought, sentiment analysis may appear easy: it means distilling an author’s emotional intent into distinct classes such as happy, frustrated or surprised. As it turns out, sentiment analysis is very difficult to do well. It borrows from disciplines such as linguistics, psychology and, of course, natural language processing.

Sentiment analysis is the process of extracting an author’s emotional intent from the text.

Sentiment analysis challenges arise not only from its interdisciplinary foundation but also from cultural and demographic differences between authors. Another reason is that there are hundreds of related emotional states which are part of the human condition. It is hard to quantify the difference between happy and elated, or along the spectrum from bored to uninterested to interested. In fact, without the author’s explicit emotional tone being captured at the time of writing, all sentiment analysis may be undermined by analyst or modeling bias.

Further compounding sentiment analysis difficulties may be feature‐specific sentiment. This occurs when the topic being written about may have more than one sentiment by a feature within the overall topic. For example, a restaurant review on Yelp may state that the prices are great but the food is average. So overall, the review may be decent, but the review itself contains two distinct emotional states (great and average) applied to a specific restaurant feature. Analyzing this type of layered nuanced sentiment is extremely challenging.

There are numerous emotional frameworks that can be used for sentiment analysis. Some are proprietary for commercial applications and others are from academia. A popularized framework was created by Robert Plutchik in the 1980s. Plutchik was a psychologist who created a classification system for emotion. He believed that there are eight evolutionarily created emotions:

1) anger

2) fear

3) sadness

4) disgust

5) surprise

6) anticipation

7) trust

8) joy

He believed that the eight primary emotions have been the basis of survival in humans and animals. As a result, each is foundational to the psyche created over eons. He believed that the eight primary emotions helped to improve survivability over time and were passed on from generation to generation. For example, surprise allowed early humans to make quick assessments as to whether to fight or flee. In this framework, the eight emotions each have a polar opposite. For example, ecstasy is the opposite of grief. To Plutchik, any emotional states outside of these primary eight are amalgamations of the original eight and are therefore subordinate. Lastly, each primary and derivative emotion can be felt to varying degrees.

If you were to create a sentiment model based on Plutchik’s framework, then each of the labeled emotions in the figure could be a document class in a training set, while the document text n‐grams could be the independent variables. Then a machine learning algorithm such as Naïve Bayes can be trained and applied to new documents.

The end result would be new documents and their corresponding probability for each emotional state, and another model for substates. You can start to understand why sentiment analysis is difficult when you consider that Plutchik’s approach is just one of many frameworks and that labeling emotions in the training set is fraught with bias. So it is important that you note methodologies and biases when doing sentiment analysis yourself or when consuming sentiment analysis from others.
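To make the classification idea concrete, here is a deliberately tiny, hand-rolled sketch of Naïve Bayes-style scoring on bag-of-words counts. The class labels and words are invented toy data, and real work would use a proper modeling package rather than this stripped-down arithmetic:

```r
# Toy training "documents" labeled with two of Plutchik's classes
train <- list(joy   = c("great", "happy", "great", "love"),
              anger = c("awful", "hate", "awful", "delay"))

score <- function(doc.words, class.words, vocab){
  # Laplace-smoothed word probabilities for one class,
  # then the log-probability of the document under that class
  counts <- table(factor(class.words, levels = vocab))
  probs  <- (counts + 1) / (length(class.words) + length(vocab))
  sum(log(probs[doc.words]))
}

vocab  <- unique(unlist(train))
doc    <- c("great", "love")
scores <- sapply(train, score, doc.words = doc, vocab = vocab)
names(which.max(scores))  # the winning emotional class
```

A real Naïve Bayes implementation adds class priors and handles unseen words more carefully, but the mechanics are the same: each class competes on the probability it assigns to the document's word counts.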

Beyond sentiment analysis for emotional states, an easier approach is to merely state whether a document is positive or negative. This is referred to as the polarity of a document. Polarity can be more accurate because there are only two distinct classes, and they are easier to tell apart. For example, surprise can be both positive and negative.

A positive surprise may be, “I just found out I won the lottery,” while a negative surprise may be, “I was just hit by a bus.” Rather than analyzing the nuanced differences of an emotional state like surprise, scoring the polarity of the document is often easier.

This blog will show the archived sentiment package for R, which performs basic sentiment analysis. Next, the qdap package’s polarity function will be explained. Finally, a sentiment scoring approach from the tidytext package will be illustrated.


Sentiment Scoring: Parlor Trick or Insightful?

In commercial text mining applications and in many academic papers, considerable time and effort have been devoted to sentiment analysis. Despite this effort, the results do not always have tangible value. The salespeople of some of these commercial organizations try to impress upon the decision‐maker a sophisticated approach such as using state‐of‐the‐art deep learning neural nets and truly big data sets as the training corpus. Even so, the value to the enterprise may be limited.

For example, understanding a survey respondent’s emotional state is less valuable than getting a recommendation about making a change to improve operation. Some marketers track sentiment over time to attempt to understand the effectiveness of marketing efforts. However, the sentiment scores can be misleading, non‐normal or lagged indicators of marketing success, and so the sentiment data should only be accepted with supporting marketing data.

In the end, many sentiment analysis vendors do not create an actionable insight that can be used within an operation, whether to improve marketing or change a process. It is less valuable to say “that was negative,” than it is to state, “That was negative because of X, Y or Z.” The latter requires some subject matter expertise to enrich the sentiment analysis. Still, it is impressive to have sophisticated approaches applied to millions of documents resulting in 80% or better polarity accuracy. But the question remains: to what end?

Despite these limitations, let’s embark on an example use case and follow the text mining process outlined in this book to answer a question and thereby reach some conclusions.

Suppose for a moment you have an apartment in Boston that you would like to rent out through Airbnb, a service for people to list, find and rent lodging, which is used by millions of people throughout the world. You hope to make some extra money by renting your apartment, but you want to make sure that your apartment has the qualities of a good rental.

1) Define the problem and specific goals. What property qualities are listed in positive or negative comments?

2) Identify the text that needs to be collected. After a stay, an Airbnb renter can leave comments about the property. These comments are public and inform new renters’ decisions about the specific property listing. You decide to analyze the comments for properties in Boston.

3) Organize the text. The corpus contains 1000 randomly selected Boston Airbnb listings. You will clean the comments and organize them into frequency matrices.

4) Extract features. Once it is organized, you will need to calculate various sentiment and polarity scores.

5) Analyze. The sentiment and polarity scores will be used to subset the comments so that you can analyze the terms used distinctly in positive or negative comments.

6) Reach an insight or recommendation. By the end of the case study you hope to answer the question from step 1: What property qualities are listed in positive or negative comments? This will help inform you as to whether or not your property has the qualities of a positive Airbnb listing.

Polarity: Simple Sentiment Scoring

Polarity, the measure of positive or negative intent in a writer’s tone, can be calculated by sophisticated or fairly straightforward methods. The qdap library provides a polarity function which is surprisingly accurate and uses basic arithmetic for scoring.

The resulting polarity calculation is a number that is negative to represent a negative tone, zero to represent a neutral tone and positive to represent a positive tone. Although the resulting polarity score is easy to understand, it is best to understand the underlying calculation and how to customize it for your specific needs by adjusting the subjectivity lexicon.
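The core arithmetic can be sketched in a few lines of base R. This is a deliberately stripped-down illustration with made-up word lists, not qdap's actual polarity function, which also weighs amplifiers and negators from its subjectivity lexicon:

```r
# Minimal polarity sketch: tag words against small positive and
# negative word lists, then scale the net count by the square root
# of the document's word count
pos.words <- c("good", "great", "perfect")
neg.words <- c("bad", "awful", "terrible")

simple.polarity <- function(text){
  words <- tolower(strsplit(text, "\\s+")[[1]])
  hits  <- sum(words %in% pos.words) - sum(words %in% neg.words)
  hits / sqrt(length(words))
}

simple.polarity("the flight was great")     # positive score
simple.polarity("an awful terrible delay")  # negative score
```

The square-root denominator keeps a long rambling document from scoring higher than a short, emphatic one simply because it contains more words.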

Subjectivity Lexicons

The polarity function of qdap is based on subjectivity lexicons. A subjectivity lexicon is a list of words associated with a particular emotional state. For example, the words bad, awful and terrible can all reasonably be associated with a negative state. In contrast, perfect and ideal can be connected with a positive state.

Researchers at the University of Pittsburgh provide a freely available subjectivity lexicon that is very popular. It contains information on more than 8000 words that have been found to have either a positive or negative polarity. The polarity designation has been captured by various methods and in multiple academic research studies, so it stands to reason that this particular subjectivity lexicon is broadly acceptable.
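A subjectivity lexicon is easy to picture as a two-column table of words and scores. The handful of words below is purely illustrative, not the Pittsburgh lexicon itself:

```r
# A toy subjectivity lexicon: each word carries a polarity score
lexicon <- data.frame(word  = c("bad", "awful", "terrible", "perfect", "ideal"),
                      score = c(-1, -1, -1, 1, 1),
                      stringsAsFactors = FALSE)

# Score a document by summing the scores of matched lexicon words;
# words absent from the lexicon match to NA and are dropped
doc.words <- c("the", "room", "was", "perfect", "not", "terrible")
sum(lexicon$score[match(doc.words, lexicon$word)], na.rm = TRUE)
```

Notice that “not terrible” nets out incorrectly under plain lookup, which hints at why qdap's polarity function also accounts for valence shifters rather than relying on the lexicon alone.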


What Tools Do I Need to Get Started with This?

To get started in text mining you need a few tools. You should have access to a laptop or workstation with at least 4GB of RAM. All of the examples in this book have been tested on Microsoft’s Windows operating system. RAM is important because R’s processing is done “in memory.” This means that the objects being analyzed must be held in RAM.

Also, having a high-speed internet connection will aid in downloading the scripts, R library packages and example text data and for gathering text from various web pages. Lastly, the computer needs to have an installation of R and R Studio. The operating system of the computer should not matter because R has an installation for Microsoft, Linux, and Mac.

A Simple Example

Online customer reviews can be beneficial to understanding customer perspectives about a product or service. Further, reviewers can sometimes leave feedback anonymously, allowing authors to be candid and direct. While this may lead to accurate portrayals of a product, it may also lead to “keyboard courage” or extremely biased opinions. I consider it a form of selection bias, meaning that the people who leave feedback may have strong convictions not indicative of the overall product or service’s public perception.

Text mining allows an enterprise to benchmark their product reviews and develop a more accurate understanding of some public perceptions. Approaches like topic modeling and polarity (positive and negative scoring) which are covered later in this book may be applied in this context. Scoring methods can be normalized across different mediums such as forums or print, and when done against a competing product, the results can be compelling.

Suppose you are a Nike employee and you want to know about how consumers are viewing the Nike Men’s Roshe Run Shoes. The text mining steps to follow are:

1) Define the problem and specific goals. Using online reviews, identify overall positive or negative views. For negative reviews, identify a consistent cause of the poor review to be shared with the product manager and manufacturing personnel.

2) Identify the text that needs to be collected. There are running websites providing expert reviews, but since the shoes are mass market, a larger collection of general use reviews would be preferable. New additions come out annually, so old reviews may not be relevant to the current release. Thus, a shopping website like Amazon could provide hundreds of reviews, and since there is a timestamp on each review, the text can be limited to a particular timeframe.

3) Organize the text. Even though Amazon reviewers rate products with a number of stars, reviews with three or fewer stars may yield opportunities to improve. Web scraping all reviews into a simple CSV with one review per row and the corresponding timestamp and number of stars in the next columns will allow the analyst to subset the corpus by these added dimensions.

4) Extract features. Reviews will need to be cleaned so that text features can be analyzed. For this simple example, this may mean removing common words with little benefit like “shoe” or “Nike,” running spellcheck and making all text lowercase.

5) Analyze. A very simple way to analyze clean text, discussed in an early blog, is to scan for a specific group of keywords. The text‐mining analyst may want to scan for words given their subject matter expertise. Since the analysis is about shoe problems one could scan for “fit,” “rip” or “tear,” “narrow,” “wide,” “sole,” or any other possible quality problem from reviews. Then summing each could provide an indication of the most problematic feature. Keep in mind that this is an extremely simple example and the blogs build in complexity and analytical rigor beyond this illustration.

6) Reach an insight or recommendation. Armed with this frequency analysis, a text miner could present findings to the product manager and manufacturing personnel that the top consumer issue could be “narrow” and “fit.” In practical application, it is best to offer more methodologies beyond keyword frequency, as support for a finding.
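The keyword scan in step 5 can be sketched in a few lines of base R. The reviews and keywords below are invented for illustration:

```r
# Toy reviews standing in for scraped Amazon review text
reviews <- c("The fit is narrow and the sole feels thin",
             "Great shoe but runs narrow",
             "Started to tear at the toe after a week")

# Quality-related keywords chosen by the subject matter expert
keywords <- c("fit", "tear", "narrow", "wide", "sole")

# Count how many reviews mention each keyword, then rank them
counts <- sapply(keywords,
                 function(k) sum(grepl(k, reviews, ignore.case = TRUE)))
sort(counts, decreasing = TRUE)
```

In a real analysis you would want word-boundary patterns (so “wide” does not match “worldwide”) and the added timestamp and star-rating columns to subset before counting.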

A Real World Use Case

It is regularly the case that marketers learn best practices from each other. Unlike in other professions, many marketing efforts are available outside of the enterprise, and competitors can see the efforts easily. As a result, competitive intelligence in this space is rampant. It is also another reason why novel ideas are often copied and reused, and then the novel idea quickly loses salience with its intended audience. Text mining offers a quick way to understand the basics of a competitor’s text‐based public efforts.

When I worked at Amazon creating the social customer service team, we were obsessed with how others were doing it. We regularly read and reviewed other companies’ replies and learned from their missteps. This was early 2012, so customer service in social media was considered an emerging practice, let alone being at one of the largest retailers in the world.

At the time, the belief was that it was fraught with risk. Amazon’s legal counsel, channel marketers in charge of branding and even customer service leadership were wary of publicly acknowledging any shortcomings or service issues. The legal department was involved to understand if we were going to set undeliverable expectations or cause any tax implications on a state‐by‐state basis. Further, each brand owner, such as Amazon Prime, Amazon Mom, Amazon MP3, Amazon Video on Demand, and Amazon Kindle had cultivated their own style of communicating through their social media properties.

Lastly, customer service leadership had made multiple promises that reached all the way to Jeff Bezos, the CEO, about flawless execution and servicing in this channel demonstrating customer centricity. The mandate was clear: proceed, but do so cautiously and do not expand faster than could be reasonably handled to maintain quality set by all these internal parties.

The initial channels we covered were the two “Help” forums on the site, then retail and Kindle Facebook pages, and lastly, Twitter. We had our own missteps. I remember the email from Jeff that came down through the ranks with a simple “?” concerning an inappropriate video briefly posted to the Facebook wall. That told me our efforts were constantly under review and that we had to be as good as or better than other companies.

Text mining proved to be an important part of the research that was done to understand how others were doing social media customer service. We had to grasp simple items like the length of a reply by channel, the basic language used, typical agent workload, and if adding similar links repeatedly made sense.

My initial thought was that it was redundant to repeatedly post the same link, for example to our “contact us” form. Further, we didn’t know what types of help links were best to post. Should they be informative pages or forms or links to outside resources? We did not even know how many people should be on the team and what an average workload for a customer service representative was.

In short, the questions basic text mining can help with are:

1) What is the average length of a social customer service reply?

2) What links were referenced most often?

3) How many people should be on the team? How many social replies is reasonable for a customer service representative to handle?
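Question 1, for instance, reduces to a one-liner once replies are collected. The example tweets below are made up for illustration:

```r
# Hypothetical social customer service replies
replies <- c("@user Sorry for the delay! *RB",
             "@user Please DM your confirmation number. *AA")

# Average character length of a reply
mean(nchar(replies))
```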

Channel by channel we would find text of some companies already providing public support. We would identify and analyze attributes that would help us answer these questions. In the next blog, covering basic text mining, we will actually answer these questions on real customer service tweets and go through the six‐step process to do so.

Looking back, the answers to these questions seem like common sense, but that is after running that team for a year. Now social media customer service has expanded to be the norm. In 2012, we were creating something new at a fast-growing Fortune 50 company with many opinions on the matter, including “do not bother!”

At the time, I considered Wal‐Mart, Dell and Delta Airlines to be best in class social customer service. Basic text mining allowed me to review their respective replies in an automated fashion. We spoke with peers at Expedia but it proved more helpful to perform basic text mining and read a small sample of replies to help answer our questions.