How to Generate Word Clouds in R

2024-11-25

Simple Steps on How and When to Use Them

The 4 Main Steps to Create Word Clouds

In the following section, I show you 4 simple steps to follow if you want to generate a word cloud with R.

STEP 1: Retrieving the data and loading the packages

To generate word clouds, you need to install the wordcloud package in R as well as the RColorBrewer package for the colours. Note that there is also a wordcloud2 package, with a slightly different design and some fun applications. I will show you how to use both packages.

install.packages("wordcloud")
library(wordcloud)
install.packages("RColorBrewer")
library(RColorBrewer)
install.packages("wordcloud2")
library(wordcloud2)

Most often, word clouds are used to analyse Twitter data or a corpus of text. If you’re analysing Twitter data, simply load your data using the rtweet package (see this article for more info on this). If you’re working on a speech, article or any other type of text, make sure to load your text data as a corpus. A useful way to do this is to use the tm package.

install.packages("tm")
library(tm)
# Create a vector containing only the text
text <- data$text
# Create a corpus
docs <- Corpus(VectorSource(text))

STEP 2: Clean the text data

Cleaning is an essential step to take before you generate your word cloud. For your analysis to bring useful insights, you may want to remove special characters, numbers or punctuation from your text. In addition, you should remove common stop words, so that very frequent words such as “I” or “the” do not dominate the word cloud.

If you’re working with tweets, use the following lines of code to clean your text.

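tweets$text <- gsub("https\\S*", "", tweets$text)   # remove URLs
tweets$text <- gsub("@\\S*", "", tweets$text)       # remove @mentions
tweets$text <- gsub("&amp;", "", tweets$text)       # remove HTML-escaped ampersands (a bare "amp" would also hit words like "example")
tweets$text <- gsub("[\r\n]", "", tweets$text)      # remove line breaks
tweets$text <- gsub("[[:punct:]]", "", tweets$text) # remove punctuation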
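To see what these substitutions do, here is a minimal sketch that applies them to a made-up tweet using base R only. The helper name clean_tweet and the sample text are my own assumptions, not part of any package. The sketch also departs from the lines above in three small ways: it matches the full HTML entity &amp; rather than the bare letters "amp", it replaces line breaks with a space so that neighbouring words are not glued together, and it collapses repeated spaces at the end.

```r
# Hypothetical helper: applies the tweet-cleaning substitutions in order.
clean_tweet <- function(x) {
  x <- gsub("https\\S*", "", x)    # drop URLs
  x <- gsub("@\\S*", "", x)        # drop @mentions
  x <- gsub("&amp;", "", x)        # drop the HTML-escaped ampersand
  x <- gsub("[\r\n]", " ", x)      # line breaks become a space (deviation from the article)
  x <- gsub("[[:punct:]]", "", x)  # drop punctuation
  x <- gsub("\\s+", " ", x)        # collapse runs of whitespace (extra step)
  trimws(x)                        # trim leading and trailing spaces
}

# Made-up example tweet
sample_tweet <- "Great talk by @rstats_fan on text mining &amp; word clouds!\nSlides: https://example.com/slides"
cleaned <- clean_tweet(sample_tweet)
print(cleaned)
```

Because gsub() is vectorised, the same helper could be applied to a whole column, for example tweets$text <- clean_tweet(tweets$text).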

If you’re working with a corpus, there are several packages you can use to clean your text. The following lines of code show you how to do this using the tm package.

library(magrittr) # provides the %>% pipe used below
docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))

STEP 3: Create a document-term-matrix

The next step is to build a data frame with each word in the first column and its frequency in the second column.

This can be done by creating a document-term matrix with the TermDocumentMatrix function from the tm package.

dtm <- TermDocumentMatrix(docs) 
matrix <- as.matrix(dtm) 
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)
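If the shape of the resulting data frame is unclear, the following sketch builds the same kind of word/frequency table on three made-up sentences. It deliberately uses base R (strsplit and table) instead of tm, so it only illustrates the target structure and is not a replacement for TermDocumentMatrix. The names texts and df_toy are assumptions chosen for this example.

```r
# Three made-up "documents"
texts <- c("peace and security",
           "peace for all nations",
           "security for all")

# Lower-case the text and split it into words on whitespace
tokens <- unlist(strsplit(tolower(texts), "\\s+"))

# Count each word and sort from most to least frequent
counts <- sort(table(tokens), decreasing = TRUE)

# One row per word: the word in the first column, its count in the second
df_toy <- data.frame(word = names(counts),
                     freq = as.vector(counts),
                     stringsAsFactors = FALSE)
print(df_toy)
```

The word and freq columns here play the same role as those in df above, which is what the plotting functions in the next step take as input.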

Alternatively, and especially if you’re using tweets, you can use the tidytext package.

library(dplyr)    # select(), count() and the %>% pipe
library(tidytext) # unnest_tokens()
tweets_words <- tweets %>%
  select(text) %>%
  unnest_tokens(word, text)
words <- tweets_words %>% count(word, sort = TRUE)

STEP 4: Generate the word cloud

The wordcloud package is the most classic way to generate a word cloud. The following line of code shows you how to properly set the arguments. As an example, I chose to work with the speeches given by US Presidents at the United Nations General Assembly.

set.seed(1234) # for reproducibility
wordcloud(words = df$word, freq = df$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

It may happen that your word cloud crops certain words or simply doesn’t show them. If this happens, make sure to add the argument scale=c(3.5,0.25) and play around with the numbers to make the word cloud fit.

Another common mistake with word clouds is to show too many words that have a low frequency. If this is the case, make sure to adjust the minimum frequency argument (min.freq=…) in order to render your word cloud more meaningful.
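One way to pick a sensible minimum frequency is to check how many words would survive each candidate threshold before plotting. The sketch below does this in base R on a made-up frequency table. The names toy_df, thresholds and words_kept and all the numbers are assumptions for illustration only; with real data you would use your own word/frequency data frame instead.

```r
# Made-up frequency table standing in for the word/frequency data frame
toy_df <- data.frame(
  word = c("peace", "nations", "security", "world", "today", "indeed", "perhaps"),
  freq = c(40, 32, 25, 11, 3, 1, 1)
)

# Candidate values for the min.freq argument
thresholds <- c(1, 2, 5, 10)

# For each threshold, count the words whose frequency is at least that high
words_kept <- sapply(thresholds, function(k) sum(toy_df$freq >= k))

print(data.frame(min.freq = thresholds, words_kept = words_kept))
```

A threshold after which the count stops dropping sharply is often a reasonable starting point, which you can then pass as min.freq to wordcloud().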

The wordcloud2 package is a bit more fun to use, allowing us to do some more advanced visualisations. For instance, you can choose your word cloud to appear in a specific shape or even a letter (see this vignette for a useful tutorial). As an example, I used the same corpus of UN speeches and generated the two word clouds shown below. Cool, right?

wordcloud2(data=df, size=1.6, color='random-dark')

wordcloud2(data=df, size = 0.7, shape = 'pentagon')