Category Archives: Text Mining

Allusions to parents in autobiographies (or reading 118 books in a few seconds)

If I keep holding out, will the light shine through? (Come Back, Pearl Jam)

Imagine that you are writing the story of your life. You will almost surely make allusions to your parents, but will both of them have the same prominence in your biography, or will you spend more words on one of them? In that case, which one will be more prominent: your father or your mother?

This experiment analyses 118 autobiographies from Project Gutenberg and counts how many times authors make allusions to their fathers and mothers. This is what I’ve done:

  • Download all works from Project Gutenberg containing the word autobiography in their titles (there are 118 in total).
  • Count how many times the bigrams my father and my mother appear in each text; this is what I call allusions to father and mother respectively (a sketch of this step is shown below).
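
As a rough idea of how the counting step can be done, here is a minimal sketch (it uses stringr::str_count instead of tidytext's bigram tokenization for brevity; the function name and the ids vector are illustrative, not the actual code of the experiment):

library(gutenbergr)
library(stringr)
library(purrr)

# Illustrative sketch: count the two bigrams in one work
count_allusions <- function(id) {
  text <- tolower(paste(gutenberg_download(id)$text, collapse = " "))
  data.frame(gutenberg_id = id,
             father = str_count(text, "my father"),
             mother = str_count(text, "my mother"))
}

# ids would contain the gutenberg_id of each of the 118 autobiographies
# allusions <- map_df(ids, count_allusions)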

The number of allusions that I measure is a lower bound on the exact amount, since the calculation has some limitations:

  • Maybe the author refers to them by their names.
  • After referring to them as my father or my mother, subsequent sentences may refer to them as he or she.

Anyway, I think these constraints do not introduce any bias into the calculation, since they should affect fathers and mothers equally. Here you can find the dataset I created after downloading all the autobiographies and measuring the number of allusions to each parent.

Some results:

  • 64% of autobiographies have more allusions to the father than the mother.
  • 24% of autobiographies have more allusions to the mother than the father.
  • 12% allude to them equally.

Most of the works make more allusions to the father than to the mother. As visual proof of this fact, the next plot is a histogram of the difference between the number of allusions to father and mother across the 118 works (# allusions to father - # allusions to mother):
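
To give an idea, a histogram like that can be drawn with ggplot2 (a sketch only, assuming a data frame allusions with one row per work and the columns father and mother from the previous snippet):

library(ggplot2)

# Histogram of (# allusions to father - # allusions to mother), one value per work
ggplot(allusions, aes(x = father - mother)) +
  geom_histogram(binwidth = 5, fill = "gold", colour = "white") +
  labs(x = "# allusions to father - # allusions to mother", y = "number of works")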

The distribution is clearly right skewed, which supports the previous results. Another way to see this is the last plot, which places each autobiography in a scatter plot, where the X-axis is the number of allusions to the father and the Y-axis the number of allusions to the mother. It is interactive, so you can navigate through it to see the details of each point (work):
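
The interactive version can be sketched with highcharter along these lines (again an illustrative sketch over the same hypothetical allusions data frame):

library(highcharter)

# Interactive scatter plot: allusions to father (x) vs mother (y), one point per work
hchart(allusions, "scatter", hcaes(x = father, y = mother)) %>%
  hc_xAxis(title = list(text = "# allusions to father")) %>%
  hc_yAxis(title = list(text = "# allusions to mother"))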

Most of the points (works) are below the diagonal, which means that they contain more allusions to the father than to the mother. Here you can find a full version of the previous plot.

I don’t have any explanation for this fact, just some simple hypotheses:

  • Fathers and mothers influence their children differently.
  • Fathers star in more anecdotes than mothers.
  • This is the effect of patriarchy (72% of the authors were born in the 19th century).

Whatever the explanation is, this experiment shows how easy it is to do text mining with R. Special mention to purrr (to iterate efficiently over the set of work IDs), tidytext (to count the number of appearances of bigrams), highcharter (to do the interactive plot) and gutenbergr (to download the books). You can find the code here.

A Shiny App to Create Sentimental Tweets Based on Project Gutenberg Books

There was something about them that made me uneasy, some longing and at the same time some deadly fear – Dracula (Stoker, Bram)

Twitter is a very good source of inspiration. Some days ago I came across this:

The tweet refers to a presentation (in Spanish) available here, which is a very concise and well-illustrated document about the state of the art of text mining in R. I discovered several libraries there that I will try to use in the future. In this experiment I have used one of them: the syuzhet package. As can be read in the documentation:

this package extracts sentiment and sentiment-derived plot arcs from text using three sentiment dictionaries conveniently packaged for consumption by R users. Implemented dictionaries include syuzhet (default) developed in the Nebraska Literary Lab, afinn developed by Finn Arup Nielsen, bing developed by Minqing Hu and Bing Liu, and nrc developed by Mohammad, Saif M. and Turney, Peter D.

You can find a complete explanation of the package in its vignette. A very interesting application of these techniques is the Sentiment Graph of a book, which represents how sentiment changes over time. This is the Sentiment Graph of Romeo and Juliet, by William Shakespeare, taken from Project Alexandria:

The darkest sentiments can be seen at the end of the book, where the tragedy reaches its highest level. It is also nice to see how sentiments are cyclical. These graphs can be very useful for people who only want to read books with happy endings (my sister is one of them).
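
Just to give an idea, a sentiment arc like that one can be sketched with syuzhet in a few lines (this is only an illustrative sketch, not the code behind Project Alexandria; the gutenberg_id is an example):

library(gutenbergr)
library(syuzhet)

# Illustrative sketch: sentiment arc of a Project Gutenberg book
text <- paste(gutenberg_download(1513)$text, collapse = " ")  # 1513 is an example id
sentences_v <- get_sentences(text)
sentiment_v <- get_sentiment(sentences_v, method = "syuzhet")

# Smooth the sentence-level scores into an arc over narrative time
plot(get_dct_transform(sentiment_v), type = "l",
     xlab = "narrative time", ylab = "sentiment")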

Inspired by this analysis, I have done another experiment in which I download a book from Project Gutenberg and measure the sentiment of all its sentences. Based on this measurement, I filter the top 5% of sentences (by positive or negative sentiment) to build tweets. I have built a Shiny app where all these steps are explained. The app is available here.

From a technical point of view, I used the selectize JavaScript library to filter books in a flexible way. I also customized the appearance with a Bootstrap CSS theme from Bootswatch, as explained here.

This is the code of the experiment.

UI.R:

library(shiny)

fluidPage(theme = "bootstrap.css",

  titlePanel(h1("Sentimental Tweets from Project Gutenberg Books", align="center"),
             windowTitle="Tweets from Project Gutenberg"),
  sidebarLayout(
      sidebarPanel(

        selectInput(
          'book', 'Choose a book:', 
          multiple=FALSE,
          selectize = TRUE,
          choices=c("Enter some words of title or author" = "", gutenberg_works$searchstr)
          ),
        
        radioButtons(inputId = "sent",
             label = "Choose sentiment:",
             choices = c("Dark"="1", "Bright"="20"),
             selected="1",
             inline=TRUE),
        
        radioButtons(inputId = "meth",
                    label = "Choose a method to measure sentiment:",
                    choices = c("syuzhet", "bing", "afinn", "nrc"),
                    selected="syuzhet",
                    inline=TRUE),
        
        radioButtons(inputId = "char",
                     label = "Number of characters (max):",
                     choices = list("140", "280"),
                     inline=TRUE),

        checkboxInput(inputId = "auth",
                      label = "Add author",
                      value=FALSE),
        
        checkboxInput(inputId = "titl",
                      label = "Add title",
                      value=FALSE),
        
        checkboxInput(inputId = "post",
                      label="Add link to post (thanks!)",
                      value=TRUE),
        
        textInput(inputId = "adds",
                  label="Something else?",
                  placeholder="Maybe a #hastag?"),
        
        actionButton('do','Go!', 
                     class="btn btn-success action-button", 
                     css.class="btn btn-success")
  ),
  

  
  mainPanel(
     tags$br(),
     p("First of all, choose a book entering some keywords of its 
        title or author and doing dropdown navigation. Books are 
        downloaded from Project Gutenberg. You can browse the complete 
        catalog", tags$a(href = "https://www.gutenberg.org/catalog/", "here.")),

     p("After that, choose the sentiment of tweets you want to generate. 
        There are four possible methods that can return slightly different results. 
        All of them assess the sentiment of each word of a sentence and sum up the 
        result to give a scoring for it. The more negative is this scoring, 
        the", em("darker") ,"is the sentiment. The more positive, the ", em("brighter."), 
        " You can find a nice explanation of these techniques",
        tags$a(href = "http://www.matthewjockers.net/2017/01/12/resurrecting/", "here.")),
        
        p("Next parameters are easy: you can add the title and author of the book where 
          sentence is extracted as well as a link to my blog and any other string you want. 
          Clicking on the lower button you will get after some seconds a tweet below. 
          Click as many times you want until you like the result."),
     
     p("Finally, copy, paste and tweet. ",strong("Enjoy it!")),
     tags$br(),
     tags$blockquote(textOutput("tweet1")),
     tags$br()

)))

Server.R:

library(shiny)

function(input, output) {
  
  values <- reactiveValues(default = 0)
  
  observeEvent(input$do,{
    values$default <- 1
  })

  book <- eventReactive(input$do, {
    GetTweet(input$book, input$meth, input$sent, input$char,
             input$auth, input$titl, input$post, input$adds)
  })
  
  output$tweet1 <- renderText({
    if(values$default == 0){
      "Your tweet will appear here ..."
    }
    else{
      book()
    }
  })
}

Global.R:

library(gutenbergr)
library(dplyr)
library(stringr)
library(syuzhet)

x <- tempdir()

# Read the Project Gutenberg catalog and filter English works. I also create a column with
# title and author to make searches easier
gutenberg_metadata %>%
  filter(has_text, language=="en", gutenberg_id>0, !is.na(author)) %>%
  mutate(searchstr=ifelse(is.na(author), title, paste(title, author, sep= " - "))) %>%
  mutate(searchstr=str_replace_all(searchstr, "[\r\n]" , "")) %>%
  group_by(searchstr) %>%
  summarize(gutenberg_id=min(gutenberg_id)) %>%
  ungroup() %>%
  na.omit() %>%
  filter(str_length(searchstr)<100)-> gutenberg_works

# This function generates a tweet according to the UI settings (book, method, sentiment and
# number of characters). It also appends some optional strings at the end
GetTweet = function (string, method, sentim, characters,
                     author, title, link, hastag)
 {
  # Obtain gutenberg_id from book 
  gutenberg_works %>%
     filter(searchstr == string) %>%
     select(gutenberg_id) %>% .$gutenberg_id -> result
  
  # Download text, divide into sentences and score sentiment. Save results to do it once and
  # optimize performance
  if(!file.exists(paste0(x,"/","book",result,"_",method,".RDS")))
  {
    book=gutenberg_download(result)
    book[,2] %>% 
      as.data.frame() %>% 
      .$text %>% 
      paste(collapse=" ") -> text
    
    sentences_v <- get_sentences(text)
    sentiment_v <- get_sentiment(sentences_v, method=method)
    data.frame(sentence=sentences_v, sentiment=sentiment_v) %>% 
      mutate(length=str_length(sentence)) -> results
    saveRDS(results, paste0(x,"/","book",result,"_",method,".RDS"))
  }
   
  results=readRDS(paste0(x,"/","book",result,"_",method,".RDS"))
  book_info=gutenberg_metadata %>% filter(gutenberg_id==result)
  
  # Paste optional strings to append at the end
  post=""
  if (title)  post=paste("-", book_info[,"title"], post, sep=" ")
  if (author) post=paste0(post, " (", str_trim(book_info[,"author"]), ")")
  if (link)   post=paste(post, "https://wp.me/p7VZWY-16S", sep=" ")
  post=paste(post, hastag, sep=" ")
  length_post=nchar(post)

  # Calculate 5% quantiles
  results %>% 
    filter(length<=(as.numeric(characters)-length_post)) %>%
     mutate(sentiment=jitter(sentiment)) %>% 
     mutate(group = cut(sentiment, 
                        include.lowest = FALSE,
                        labels = FALSE,
                        breaks = quantile(sentiment, probs = seq(0, 1, 0.05)))) -> results
   
  # Obtain a sample sentence according to sentiment and append the optional strings to create the tweet
  results %>% 
     filter(group==as.numeric(sentim)) %>% 
     sample_n(1) %>% 
     select(sentence) %>% 
     .$sentence %>% 
     as.character() %>% 
     str_replace_all("[.]", "") %>% 
    paste(post, sep=" ") -> tweet
  
  return(tweet)

 }

Silhouettes

Romeo, Juliet, balcony in silhouette, makin o’s with her cigarette, it’s juliet (Flapper Girl, The Lumineers)

Two weeks ago I published this post, for which I designed two different visualizations. In the end, I decided to place the words on the map of the United States. The discarded visualization was this other one, where I placed the words over the silhouette of each state:

States In Two Words v1

I do not want to set aside this chart because I really like it and also because I think it is a nice example of the possibilities one has working with R.

Here you have the code. It replaces the fragment of code headed by “Visualization” in the original post:

library(ggplot2)
library(maps)
library(grid)      # needed for grid.newpage, viewport and grid.text
library(gridExtra)
library(extrafont)
opt=theme(legend.position="none",
             panel.background = element_blank(),
             panel.grid = element_blank(),
             axis.ticks=element_blank(),
             axis.title=element_blank(),
             axis.text =element_blank(),
             plot.title = element_text(size = 28))
vplayout=function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
grid.newpage()
jpeg(filename = "States In Two Words.jpeg", width = 1200, height = 600, quality = 100)
pushViewport(viewport(layout = grid.layout(6, 8)))
for (i in 1:nrow(table))
{
  wd=subset(words, State==as.character(table$"State name"[i]))
  p=ggplot() + geom_polygon( data=subset(map_data("state"), region==tolower(table$"State name"[i])), aes(x=long, y=lat, group = group), colour="white", fill="gold", alpha=0.6, linetype=0 )+opt
  print(p, vp = vplayout(floor((i-1)/8)+1, i%%8+(i%%8==0)*8))
  txt=paste(as.character(table$"State name"[i]),"\n is", wd$word1,"\n and", wd$word2, sep=" ")
  grid.text(txt, gp=gpar(font=1, fontsize=16, col="midnightblue", fontfamily="Humor Sans"), vp = viewport(layout.pos.row = floor((i-1)/8)+1, layout.pos.col = i%%8+(i%%8==0)*8))
}
dev.off()

The United States In Two Words

Sweet home Alabama, Where the skies are so blue; Sweet home Alabama, Lord, I’m coming home to you (Sweet home Alabama, Lynyrd Skynyrd)

This is the second post I have written to show the abilities of the twitteR package, and also the second post I have written for KDnuggets. In this case my goal is to get an insight into what people tweet about American states. To do this, I look for tweets containing the exact phrase “[STATE NAME] is” for every state. Once I have the set of tweets for each state, I do some simple text mining: cleaning, standardizing, removing stop words and crossing with these sentiment lexicons. Then I choose the two most common words to describe each state. You can read the original post here. This is the visualization I produced to show the result of the algorithm:

States In Two Words v2

Since the right side of the map is a little bit messy, in the original post you can see a table with the pair of words describing each state. This is just an experiment to show how to use and combine some interesting tools of R. If you don’t like what Twitter says about your state, don’t take it too seriously.

This is the code I wrote for this experiment:

# Do this if you have not registered your R app in Twitter
library(twitteR)
library(RCurl)
setwd("YOUR-WORKING-DIRECTORY-HERE")
if (!file.exists('cacert.perm'))
{
  download.file(url = 'http://curl.haxx.se/ca/cacert.pem', destfile='cacert.perm')
}
requestURL="https://api.twitter.com/oauth/request_token"
accessURL="https://api.twitter.com/oauth/access_token"
authURL="https://api.twitter.com/oauth/authorize"
consumerKey = "YOUR-CONSUMER_KEY-HERE"
consumerSecret = "YOUR-CONSUMER-SECRET-HERE"
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=requestURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo=system.file("CurlSSL", "cacert.pem", package="RCurl"))
save(Cred, file="twitter authentification.Rdata")
# Start here if you have already your twitter authentification.Rdata file
library(twitteR)
library(RCurl)
library(XML)
load("twitter authentification.Rdata")
registerTwitterOAuth(Cred)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
#Read state names from wikipedia
webpage=getURL("http://simple.wikipedia.org/wiki/List_of_U.S._states")
table=readHTMLTable(webpage, which=1)
table=table[!(table$"State name" %in% c("Alaska", "Hawaii")), ]
#Extract tweets for each state
results=data.frame()
for (i in 1:nrow(table))
{
  tweets=searchTwitter(searchString=paste("'\"", table$"State name"[i], " is\"'",sep=""), n=200, lang="en")
  tweets.df=twListToDF(tweets)
  results=rbind(cbind(table$"State name"[i], tweets.df), results)
}
results=results[,c(1,2)]
colnames(results)=c("State", "Text")
library(tm)
#Lexicons
pos = scan('positive-words.txt',  what='character', comment.char=';')
neg = scan('negative-words.txt',  what='character', comment.char=';')
posneg=c(pos,neg)
results$Text=tolower(results$Text)
results$Text=gsub("[[:punct:]]", " ", results$Text)
# Extract most important words for each state
words=data.frame(Abbreviation=character(0), State=character(0), word1=character(0), word2=character(0), word3=character(0), word4=character(0))
for (i in 1:nrow(table))
{
  doc=subset(results, State==as.character(table$"State name"[i]))
  doc.vec=VectorSource(doc[,2])
  doc.corpus=Corpus(doc.vec)
  stopwords=c(stopwords("english"), tolower(unlist(strsplit(as.character(table$"State name"), " "))), "like")
  doc.corpus=tm_map(doc.corpus, removeWords, stopwords)
  TDM=TermDocumentMatrix(doc.corpus)
  TDM=TDM[Reduce(intersect, list(rownames(TDM),posneg)),]
  v=sort(rowSums(as.matrix(TDM)), decreasing=TRUE)
  words=rbind(words, data.frame(Abbreviation=as.character(table$"Abbreviation"[i]), State=as.character(table$"State name"[i]),
                                   word1=attr(head(v, 4),"names")[1],
                                   word2=attr(head(v, 4),"names")[2],
                                   word3=attr(head(v, 4),"names")[3],
                                   word4=attr(head(v, 4),"names")[4]))
}
# Visualization
require("sqldf")
statecoords=as.data.frame(cbind(x=state.center$x, y=state.center$y, abb=state.abb))
#To make names of right side readable
texts=sqldf("SELECT a.abb,
            CASE WHEN a.abb IN ('DE', 'NJ', 'RI', 'NH') THEN a.x+1.7
            WHEN a.abb IN ('CT', 'MA') THEN a.x-0.5  ELSE a.x END as x,
            CASE WHEN a.abb IN ('CT', 'VA', 'NY') THEN a.y-0.4 ELSE a.y END as y,
            b.word1, b.word2 FROM statecoords a INNER JOIN words b ON a.abb=b.Abbreviation")
texts$col=rgb(sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),max=255)
library(maps)
jpeg(filename = "States In Two Words v2.jpeg", width = 1200, height = 600, quality = 100)
map("state", interior = FALSE, col="gray40", fill=FALSE)
map("state", boundary = FALSE, col="gray", add = TRUE)
text(x=as.numeric(as.character(texts$x)), y=as.numeric(as.character(texts$y)), apply(texts[,4:5] , 1 , paste , collapse = "\n" ), cex=1, family="Humor Sans", col=texts$col)
dev.off()

How Do Cities Feel?

If you are lost and feel alone, circumnavigate the globe (For You, Coldplay)

You cannot consider yourself an R-blogger until you do an analysis of Twitter using the twitteR package. Everybody knows it. So here I go.

Inspired by the fabulous work of Jonathan Harris, I decided to compare the emotions of people living (or tweeting, in this case) in different cities. My plan was to analyse tweets generated in different locations of the USA and the UK with one thing in common: all of them must contain the string “I FEEL”. These are the main steps I followed:

  • Locate the cities I want to analyze using the world cities database of the maps package
  • Download tweets around these locations using the searchTwitter function of the twitteR package
  • Cross the tweets with positive and negative lists of words and calculate a simple scoring for each tweet as number of positive words – number of negative words
  • Calculate how many tweets have a non-zero scoring; since these tweets put some emotion into words, I call them sentimental tweets
  • Represent the cities in a bubble chart where the x-axis is the percentage of sentimental tweets, the y-axis is the average scoring and the size of the bubble is the population

This is the result of my experiment:

HowDoCitiesFeel3

These are my conclusions (please, do not take it seriously):

  • USA cities seem to have better vibrations and are more sentimental than UK ones
  • The capital city is the happiest one in both countries
  • San Francisco (USA) is the most sentimental city of the analysis; on the other hand, Liverpool (UK) is the coldest one
  • The more sentimental, the better vibrations

From my point of view, this analysis has some important limitations:

  • It strongly depends on particular events (e.g. the local football team winning the championship)
  • I have no idea what kind of people are behind the tweets
  • In my experience, searchTwitter only works well for a small number of searches (no more than 300); for larger numbers of tweets it tends to give a malformed JSON response error from the server

Anyway, I hope it will serve as a starting point for other analyses in the future. At least, I learned interesting things about R by doing it.

Here you have the code:

library(twitteR)
library(RCurl)
library(maps)
library(plyr)
library(stringr)
library(bitops)
library(scales)
#Register
if (!file.exists('cacert.perm'))
{
  download.file(url = 'http://curl.haxx.se/ca/cacert.pem', destfile='cacert.perm')
}
requestURL="https://api.twitter.com/oauth/request_token"
accessURL="https://api.twitter.com/oauth/access_token"
authURL="https://api.twitter.com/oauth/authorize"
consumerKey = "YOUR CONSUMER KEY HERE"
consumerSecret = "YOUR CONSUMER SECRET HERE"
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=requestURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo=system.file("CurlSSL", "cacert.pem", package="RCurl"))
#Save credentials
save(Cred, file="twitter authentification.Rdata")
load("twitter authentification.Rdata")
registerTwitterOAuth(Cred)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
#Cities to analyze
cities=data.frame(
  CITY=c('Edinburgh', 'London', 'Glasgow', 'Birmingham', 'Liverpool', 'Manchester',
         'New York', 'Washington', 'Las Vegas', 'San Francisco', 'Chicago','Los Angeles'),
  COUNTRY=c("UK", "UK", "UK", "UK", "UK", "UK", "USA", "USA", "USA", "USA", "USA", "USA"))
data(world.cities)
cities2=world.cities[which(!is.na(match(
str_trim(paste(world.cities$name, world.cities$country.etc, sep=",")),
str_trim(paste(cities$CITY, cities$COUNTRY, sep=","))
))),]
cities2$SEARCH=paste(cities2$lat, cities2$long, "10mi", sep = ",")
cities2$CITY=cities2$name
#Download tweets
tweets=data.frame()
for (i in 1:nrow(cities2))
{
  tw=searchTwitter("I FEEL", n=400, geocode=cities2[i,]$SEARCH)
  tweets=rbind(merge(cities[i,], twListToDF(tw),all=TRUE), tweets)
}
#Save tweets
write.csv(tweets, file="tweets.csv", row.names=FALSE)
#Import csv file
city.tweets=read.csv("tweets.csv")
#Download lexicon from http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
hu.liu.pos = scan('lexicon/positive-words.txt',  what='character', comment.char=';')
hu.liu.neg = scan('lexicon/negative-words.txt',  what='character', comment.char=';')
#Function to clean and score tweets
score.sentiment=function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  scores=laply(sentences, function(sentence, pos.words, neg.words) {
    sentence=gsub('[[:punct:]]','',sentence)
    sentence=gsub('[[:cntrl:]]','',sentence)
    sentence=gsub('\\d+','',sentence)
    sentence=tolower(sentence)
    word.list=str_split(sentence, '\\s+')
    words=unlist(word.list)
    pos.matches=match(words, pos.words)
    neg.matches=match(words, neg.words)
    pos.matches=!is.na(pos.matches)
    neg.matches=!is.na(neg.matches)
    score=sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress=.progress)
  scores.df=data.frame(score=scores, text=sentences)
  return(scores.df)
}
cities.scores=score.sentiment(city.tweets[1:nrow(city.tweets),], hu.liu.pos, hu.liu.neg, .progress='text')
cities.scores$pos2=apply(cities.scores, 1, function(x) regexpr(",",x[2])[1]-1)
cities.scores$CITY=apply(cities.scores, 1, function(x) substr(x[2], 1, x[3]))
cities.scores=merge(x=cities.scores, y=cities, by='CITY')
df1=aggregate(cities.scores["score"], by=cities.scores[c("CITY")], FUN=length)
names(df1)=c("CITY", "TWEETS")
cities.scores2=cities.scores[abs(cities.scores$score)>0,]
df2=aggregate(cities.scores2["score"], by=cities.scores2[c("CITY")], FUN=length)
names(df2)=c("CITY", "TWEETS.SENT")
df3=aggregate(cities.scores2["score"], by=cities.scores2[c("CITY")], FUN=mean)
names(df3)=c("CITY", "TWEETS.SENT.SCORING")
#Data frame with results
df.result=join_all(list(df1,df2,df3,cities2), by = 'CITY', type='full')
#Plot results
radius <- sqrt(df.result$pop/pi)
symbols(100*df.result$TWEETS.SENT/df.result$TWEETS, df.result$TWEETS.SENT.SCORING, circles=radius,
        inches=0.85, fg="white", bg="gold", xlab="Sentimental Tweets", ylab="Scoring Of Sentimental Tweets (Average)",
        main="How Do Cities Feel?")
text(100*df.result$TWEETS.SENT/df.result$TWEETS, df.result$TWEETS.SENT.SCORING, paste(df.result$CITY, df.result$country.etc, sep="-"), cex=1, col="gray50")

Shakespeare Is More Monkey-Friendly Than Cervantes

Ford, there is an infinite number of monkeys outside who want to talk to us about this script for Hamlet they have worked out (from Episode 2 of The Hitchhiker’s Guide to the Galaxy by Douglas Adams)

Some days ago I was talking with a friend about the infinite monkey theorem, which is a funny interpretation of what thinking in the infinite can produce. The same day, in my weekly English class, my teacher said that Anglo-Saxon words tend to be short, very often monosyllabic: function words such as to, of, from, and everyday words such as go, see, run, eat.

Both things made me think that a monkey would find it easier to type a Shakespeare text than a Cervantes one. I cannot imagine a definitive way to demonstrate this, but this experiment supports my hypothesis. After simulating random words of 2, 3, 4 and 5 characters, I look for them in English(1) and Spanish(2) dictionaries, which I previously downloaded from here. Result: I find more random words in the English one. These are the results of my experiment:

monkey_typewriter

For example, around 38% of two-character words match the English dictionary and only 9% the Spanish one. This is why I think that, in the infinite, it would be easier for a monkey to replicate a Shakespeare text than a Cervantes one.

Here you have the code:

library(ggplot2)
library(scales)
esp.dic=data.frame(LANG="ESP", WORD=readLines("ES.dic"))
eng.dic=data.frame(LANG="ENG", WORD=readLines("UK.dic"))
df.lang=do.call("rbind", list(esp.dic, eng.dic))
df.lang$WORD=tolower(iconv(df.lang$WORD, to="ASCII//TRANSLIT"))
df.lang=unique(df.lang)
results=data.frame(LANG=character(0), OCCURRENCES=numeric(0), SIZE=numeric(0), LENGTH=numeric(0))
for (i in 2:5)
{
df.monkey=data.frame(WORD=replicate(20000, paste(sample(c(letters), i, replace = TRUE), collapse='')))
results=rbind(results, data.frame(setNames(aggregate(WORD ~ ., data = merge(df.lang, df.monkey, by="WORD"), FUN=length), c("LANG","OCCURRENCES")), SIZE=20000, LENGTH=i))
}
opt=theme(panel.background = element_rect(fill="gray92"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(color="white", size=1.5),
plot.title = element_text(size = 35),
axis.title = element_text(size = 20, color="gray35"),
axis.text = element_text(size=16),
axis.ticks = element_blank(),
axis.line = element_line(colour = "white"))
ggplot(data=results, aes(x=LENGTH, y=OCCURRENCES/SIZE, colour=LANG))+
geom_line(size = 2)+
scale_colour_discrete(guide = FALSE) +
geom_point(aes(fill=LANG),size=10, colour="gray92",pch=21)+
scale_x_continuous("word length", labels=c("two chars", "three chars", "four chars", "five chars"))+
scale_y_continuous("probability of existence", limits=c(0, 0.4), labels = percent)+
labs(title = "What if you put a monkey in front of a typewriter?")+
opt + scale_fill_discrete(name="Dictionary", breaks=c("ESP", "ENG"), labels=c("Spanish", "English"))

(1) The English dictionary was originally compiled from public domain sources
for the amSpell spell-checker by Erik Frambach e-mail: e.h.m.frambach@eco.rug.nl
(2) The Spanish dictionary has been elaborated by Juan L. Varona, Dpto. de Matematicas y Computacion, Universidad de La Rioja, Calle Luis de Ulloa s/n, 26004 SPAIN e-mail: jvarona@siur.unirioja.es

The mnemoneitoR

AND I HAVE A GREAT REJOICING DAY (mnemonic rule generated by mnemoneitoR for first 7 digits of Pi according to The Wonderful Wizard Of Oz)

Is there some number impossible to memorize? Do not worry, here comes mnemoneitoR: the tool you were always looking for! With mnemoneitoR you can translate any number into an easy-to-remember phrase inspired by your favorite book. It is very easy: choose a book, enter the number and mnemoneitoR will show you as many possibilities as you want. Just choose the one you like most!

There are many websites about mnemonics on the Internet, like this one. One of my favourite mnemonic devices for Pi is:

HOW I WANT A DRINK, ALCOHOLIC OF COURSE, AFTER THE HEAVY LECTURES INVOLVING QUANTUM MECHANICS

The number of letters in each word gives the respective number in the sequence (i.e., 3.14159265358979).
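
As a quick check, this little snippet (not part of mnemoneitoR) recovers the digits from that phrase by counting the letters of each word:

phrase <- "HOW I WANT A DRINK, ALCOHOLIC OF COURSE, AFTER THE HEAVY LECTURES INVOLVING QUANTUM MECHANICS"
# Letters per word give the digit sequence: 3 1 4 1 5 9 2 6 5 3 5 8 9 7 9
nchar(gsub("[^A-Z]", "", unlist(strsplit(phrase, " "))))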

For professional purposes, I am learning how to manage texts in R and I discovered a very useful package called stringr. This is the only one I need for this experiment. The process is simple: I download a book from the Project Gutenberg site, clean and split the text, and simulate on the fly a Markov chain generated from the words of the book. Step by step:

  • Downloading the book is quite simple. You search for the one you want, copy the URL into the code (after the line "CHOOSE YOUR FAVORITE BOOK HERE") and that's it.
  • After loading the text, some easy tasks are needed: remove header and footer lines, split the text into words, turn them into uppercase, remove non-text characters … typical things when working with texts.
  • After reading the number you want to translate, I choose a word by sampling from all words with the same number of letters as the first digit, with probability proportional to their number of appearances. This is how I initialize the phrase. The next word is chosen from the set of words that follow the first one and have the same number of letters as the second digit, again with probability proportional to the number of appearances, and so on. This is an on-the-fly simulation of a Markov chain because I do not have to calculate the chain explicitly.
  • I always translate zero with the same word you choose. I like using “OZ” for zero.

Most of the phrases do not make any sense but are quite funny. A few of them make some sense and maybe, with a small tweak, can turn into meaningful sentences. Here you have some samples of the output of mnemoneitoR:

mnemoneitoR

I like how the phrases smell like the original book. I will try to improve mnemoneitoR in the future, but I can already imagine some uses of this current version: a message generator for fortune cookies, a cool way to translate your telephone number into a sentence …

Here you have the code. If you discover nice outputs in your experiments, please let me know:

library(stringr)
# CHOOSE YOUR FAVORITE BOOK HERE (Currently "The Wonderful Wizard of Oz")
TEXTFILE = "data/pg55.txt"
if (!file.exists(TEXTFILE)) {download.file("http://www.gutenberg.org/cache/epub/55/pg55.txt", destfile = TEXTFILE)}
textfile <- readLines(TEXTFILE)
# Remove header and footer, concatenate all of the lines, remove non-text characters and double spaces, and convert to uppercase
textfile = textfile[(grep('START OF THIS PROJECT', textfile, value=FALSE)+1):(grep('END OF THIS PROJECT', textfile, value=FALSE)-1)]
textfile <- paste(textfile, collapse = " ")
textfile <- gsub("[^a-zA-Z ]","", textfile)
textfile <- toupper(textfile)
textfile <- gsub("^ *|(?<= ) | *$", "", textfile, perl=T)
# Split file into words
textfile.words <- strsplit(textfile," ")
textfile.words.freq <- as.data.frame(table(textfile.words));
names(textfile.words.freq) <- c("word", "freq")
textfile.words.freq$length <- apply(data.frame(textfile.words.freq[,c("word")]), 1, function(x) nchar(x))
# ENTER YOUR NUMBER HERE!!!!!!
number <- 3.1415926
number <- gsub("[^0-9]","", as.character(number))
# Define the word representing Zero
zero.word = "OZ"
fg <- as.integer(substr(number, 1, 1))
df <- textfile.words.freq[textfile.words.freq$length==fg,]
wd <- sample(df$word, size=1, prob=df$freq)
phrase <- c(as.character(wd))
for (j in 2:nchar(number))
{
fg <- as.integer(substr(number, j, j))
if (fg>0)
{
lc <- as.data.frame(str_locate_all(textfile, as.vector(paste(wd, " ", sep = ""))))
lc$char <- apply(lc, 1, function(x) substr(textfile, as.integer(x[2])+1+fg, as.integer(x[2])+1+fg))
fq <- as.data.frame(table(apply(lc[lc$char==" ",], 1, function(x) substr(textfile, as.integer(x[2])+1, as.integer(x[2])+fg))))
if (nrow(fq)==0) fq <- data.frame(word= character(0), freq= integer(0))
names(fq) <- c("word", "freq")
fq$length <- apply(fq, 1, function(x) nchar(gsub(" ","", x[1])))
fq <- fq[fq$length==fg,]
wd <- if(nrow(fq)>0) sample(fq$word, size=1, prob=fq$freq)
else
{
df <- textfile.words.freq[textfile.words.freq$length==fg,]
wd <- sample(df$word, size=1, prob=df$freq)
}
}
else wd <- zero.word
phrase <- c(phrase, as.character(wd))
}
print(paste(phrase, collapse = " "))