Category Archives: Collaborations

The Somnambulist and Pi

How wary we are of something warm and unborn. Something calmly by zero will divide (Unbegotten, The Somnambulist)

Some time ago, I assumed the mission to draw a plot for the cover of the new album of The Somnambulist, a music band from Berlin. They wanted a circlization of Pi, which is a graphic where numbers are represented in a circular layout. The idea is connecting each digit of Pi to its successive digit with links to the position of the numerically corresponding external sectors. I used a color palette composed by 10 nuances of the visible spectrum as a tribute for Planck, as Marco (the vocalist) requested me. After a number of attempts:

somnambulist

The album is named Unbegotten, a german word which means archaic. As Marco told me, in theology it also means kind of eternal because of being never born and so never dying. I like how π is integrated into the title to substitute the string “tt” in the middle. Pi is also eternal so the association is genuine.

The music of The Somnambulist is intense, dark and powerful and is waiting for you here to listen it. My favorite song is the one that gives name to the album.

If you want to know more about circlizong numbers, you can visit this post, where you also can see the code I used as starting point to do this plot.

Visualizing the Gender of US Senators With R and Highmaps

I wake up every morning in a house that was built by slaves (Michelle Obama)

Some days ago I was invited by the people of Highcharts to write a post in their blog. What I have done is a simple but revealing map of women senators of the United States of America. Briefly, this is what I’ve done to generate it:

  • read from the US senate website a XML file with senators info
  • clean and obtain gender of senators from their first names
  • summarize results by state
  • join data with a US geojson dataset to create the highmap

You can find details and R code here.

It is easy creating a highcharts using highcharter, an amazing library as genderizeR, the one I use to obtain gender names. I like them a lot.

Discovering Shiny

It is not an experiment if you know it is going to work (Jeff Bezos)

From time to time, I discover some of my experiments translated into Shiny Apps, like this one. Some days ago, I discovered one of these translations and I contacted the author, who was a guy from Vietnam called Vu Anh. I asked him to do a Shiny App from this experiment. Vu was enthusiastic with the idea. We defined some parameters to play with shape, number, width and alpha of lines as well as background color and I received a perfect release of the application in just a few hours. With just a handful of parameters, possible outputs are almost infinite. Following you can find some of them:

SinyCollageI think the code is a nice example to take the first steps in Shiny. If you are not used to Markdown files, you can follow this instructions to run the code.

Vu is a talented guy, who loves maths and programming. He represents the future of our nice profession and I predict a successful future for him. Do not miss his brand new blog. I am sure you will find amazing things there.

This is the code of the app:

---
title: "Maths, Music and Merkbar"
author: "Brother Rain"
date: "18/03/2015"
output: html_document
runtime: shiny
---
 
## Load Data
 
```{r}
library(circlize)
library(scales)
factors = as.factor(0:9)
lines = 2000 #Number of lines to plot in the graph
alpha = 0.4  #Alpha for color lines
colors0=c(
    rgb(239,143,121, max=255),
    rgb(126,240,188, max=255),
    rgb(111,228,235, max=255),
    rgb(127,209,249, max=255),
    rgb( 74,106,181, max=255),
    rgb(114,100,188, max=255),
    rgb(181,116,234, max=255),
    rgb(226,135,228, max=255),
    rgb(239,136,192, max=255),
    rgb(233,134,152, max=255)
)
# You can find the txt file here:
# http://www.goldennumber.net/wp-content/uploads/2012/06/Phi-To-100000-Places.txt
phi=readLines("data/Phi-To-100000-Places.txt")[5]
```
 
## Visualization
 
```{r, echo=FALSE}
fluidPage(
  fluidRow(
    column(width = 4,
        sidebarPanel(
            sliderInput("lines", "Number of lines:", min=100, max=100000, step=100, value=500), 
            sliderInput("alpha", "Alpha:", min=0.01, max=1, step=0.01, value=0.4),
            sliderInput("lwd", "Line width", min=0, max=1, step=0.05, value=0.2),
            selectInput("background", "Background:",
                c("Purple" = "mediumpurple4", "Gray" = "gray25", "Orange"="orangered4", 
                  "Red" = "red4", "Brown"="saddlebrown", "Blue"="slateblue4", 
                  "Violet"="palevioletred4", "Green"="forestgreen", "Pink"="deeppink"), selected="Purple"),
            sliderInput("h0", "h0:", min=0, max=0.4,
                    step=0.0005, value=0.1375),
           sliderInput("h1", "h1:", min=0, max=0.4,
                step=0.0005, value=0.1125),
            width=12
        )
    ),
    column(width = 8,
        renderPlot({
            # get data
            phi=gsub("\\.","", substr(phi,1,input$lines))
            phi=gsub("\\.","", phi)
            position=1/(nchar(phi)-1)
             
            # create circos
            circos.clear()
            par(mar = c(1, 1, 1, 1), lwd = 0.1,
                cex = 0.7, bg=alpha(input$background, 1))
            circos.par(
                "cell.padding"=c(0.01,0.01),
                "track.height" = 0.025,
                "gap.degree" = 3
            )
            circos.initialize(factors = factors, xlim = c(0, 1))
            circos.trackPlotRegion(factors = factors, ylim = c(0, 1))
            ## create first region
            for (i in 0:9) {
                circos.updatePlotRegion(
                    sector.index = as.character(i),
                    bg.col = alpha(input$background, 1),
                    bg.border=alpha(colors0[i+1], 1)
                )
            }
            for (i in 1:(nchar(phi)-1)) {
                m=min(as.numeric(substr(phi, i, i)), as.numeric(substr(phi, i+1, i+1)))
                M=max(as.numeric(substr(phi, i, i)), as.numeric(substr(phi, i+1, i+1)))
                d=min((M-m),((m+10)-M))
                col=t(col2rgb(colors0[(as.numeric(substr(phi, i, i))+1)]))
                for(index in 1:3){
                    col[index] = max(min(255, col[index]), 0)
                }
                if (d>0) {
                    circos.link(
                        substr(phi, i, i), position*(i-1),
                        substr(phi, i+1, i+1), position*i,
                        h = input$h0 * d + input$h1,
                        lwd=input$lwd,
                        col=alpha(rgb(col, max=255), input$alpha), rou = 0.92
                    )
                }
            }
            }, width=600, height=600, res=192)
    )
  )
)
 
 
```

Silhouettes

Romeo, Juliet, balcony in silhouette, makin o’s with her cigarette, it’s juliet (Flapper Girl, The Lumineers)

Two weeks ago I published this post for which designed two different visualizations. At the end, I decided to place words on the map of the United States. The discarded visualization was this other one, where I place the words over the silhouette of each state:

States In Two Words v1

I do not want to set aside this chart because I really like it and also because I think it is a nice example of the possibilities one have working with R.

Here you have the code. It substitutes the fragment of the code headed by “Visualization” of the original post:

library(ggplot2)
library(maps)
library(gridExtra)
library(extrafont)
opt=theme(legend.position="none",
             panel.background = element_blank(),
             panel.grid = element_blank(),
             axis.ticks=element_blank(),
             axis.title=element_blank(),
             axis.text =element_blank(),
             plot.title = element_text(size = 28))
vplayout=function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
grid.newpage()
jpeg(filename = "States In Two Words.jpeg", width = 1200, height = 600, quality = 100)
pushViewport(viewport(layout = grid.layout(6, 8)))
for (i in 1:nrow(table))
{
  wd=subset(words, State==as.character(table$"State name"[i]))
  p=ggplot() + geom_polygon( data=subset(map_data("state"), region==tolower(table$"State name"[i])), aes(x=long, y=lat, group = group), colour="white", fill="gold", alpha=0.6, linetype=0 )+opt
  print(p, vp = vplayout(floor((i-1)/8)+1, i%%8+(i%%8==0)*8))
  txt=paste(as.character(table$"State name"[i]),"\n is", wd$word1,"\n and", wd$word2, sep=" ")
  grid.text(txt, gp=gpar(font=1, fontsize=16, col="midnightblue", fontfamily="Humor Sans"), vp = viewport(layout.pos.row = floor((i-1)/8)+1, layout.pos.col = i%%8+(i%%8==0)*8))
}
dev.off()

The United States In Two Words

Sweet home Alabama, Where the skies are so blue; Sweet home Alabama, Lord, I’m coming home to you (Sweet home Alabama, Lynyrd Skynyrd)

This is the second post I write to show the abilities of twitteR package and also the second post I write for KDnuggets. In this case my goal is to have an insight of what people tweet about american states. To do this, I look for tweets containing the exact phrase “[STATE NAME] is” for every states. Once I have the set of tweets for each state I do some simple text mining: cleaning, standardizing, removing empty words and crossing with these sentiment lexicons. Then I choose the two most common words to describe each state. You can read the original post here. This is the visualization I produced to show the result of the algorithm:

States In Two Words v2

Since the right side of the map is a little bit messy, in the original post you can see a table with the couple of words describing each state. This is just an experiment to show how to use and combine some interesting tools of R. If you don’t like what Twitter says about your state, don’t take it too seriously.

This is the code I wrote for this experiment:

# Do this if you have not registered your R app in Twitter
library(twitteR)
library(RCurl)
setwd("YOUR-WORKING-DIRECTORY-HERE")
if (!file.exists('cacert.perm'))
{
  download.file(url = 'http://curl.haxx.se/ca/cacert.pem', destfile='cacert.perm')
}
requestURL="https://api.twitter.com/oauth/request_token"
accessURL="https://api.twitter.com/oauth/access_token"
authURL="https://api.twitter.com/oauth/authorize"
consumerKey = "YOUR-CONSUMER_KEY-HERE"
consumerSecret = "YOUR-CONSUMER-SECRET-HERE"
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=requestURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo=system.file("CurlSSL", "cacert.pem", package="RCurl"))
save(Cred, file="twitter authentification.Rdata")
# Start here if you have already your twitter authentification.Rdata file
library(twitteR)
library(RCurl)
library(XML)
load("twitter authentification.Rdata")
registerTwitterOAuth(Cred)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
#Read state names from wikipedia
webpage=getURL("http://simple.wikipedia.org/wiki/List_of_U.S._states")
table=readHTMLTable(webpage, which=1)
table=table[!(table$"State name" %in% c("Alaska", "Hawaii")), ]
#Extract tweets for each state
results=data.frame()
for (i in 1:nrow(table))
{
  tweets=searchTwitter(searchString=paste("'\"", table$"State name"[i], " is\"'",sep=""), n=200, lang="en")
  tweets.df=twListToDF(tweets)
  results=rbind(cbind(table$"State name"[i], tweets.df), results)
}
results=results[,c(1,2)]
colnames(results)=c("State", "Text")
library(tm)
#Lexicons
pos = scan('positive-words.txt',  what='character', comment.char=';')
neg = scan('negative-words.txt',  what='character', comment.char=';')
posneg=c(pos,neg)
results$Text=tolower(results$Text)
results$Text=gsub("[[:punct:]]", " ", results$Text)
# Extract most important words for each state
words=data.frame(Abbreviation=character(0), State=character(0), word1=character(0), word2=character(0), word3=character(0), word4=character(0))
for (i in 1:nrow(table))
{
  doc=subset(results, State==as.character(table$"State name"[i]))
  doc.vec=VectorSource(doc[,2])
  doc.corpus=Corpus(doc.vec)
  stopwords=c(stopwords("english"), tolower(unlist(strsplit(as.character(table$"State name"), " "))), "like")
  doc.corpus=tm_map(doc.corpus, removeWords, stopwords)
  TDM=TermDocumentMatrix(doc.corpus)
  TDM=TDM[Reduce(intersect, list(rownames(TDM),posneg)),]
  v=sort(rowSums(as.matrix(TDM)), decreasing=TRUE)
  words=rbind(words, data.frame(Abbreviation=as.character(table$"Abbreviation"[i]), State=as.character(table$"State name"[i]),
                                   word1=attr(head(v, 4),"names")[1],
                                   word2=attr(head(v, 4),"names")[2],
                                   word3=attr(head(v, 4),"names")[3],
                                   word4=attr(head(v, 4),"names")[4]))
}
# Visualization
require("sqldf")
statecoords=as.data.frame(cbind(x=state.center$x, y=state.center$y, abb=state.abb))
#To make names of right side readable
texts=sqldf("SELECT a.abb,
            CASE WHEN a.abb IN ('DE', 'NJ', 'RI', 'NH') THEN a.x+1.7
            WHEN a.abb IN ('CT', 'MA') THEN a.x-0.5  ELSE a.x END as x,
            CASE WHEN a.abb IN ('CT', 'VA', 'NY') THEN a.y-0.4 ELSE a.y END as y,
            b.word1, b.word2 FROM statecoords a INNER JOIN words b ON a.abb=b.Abbreviation")
texts$col=rgb(sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),max=255)
library(maps)
jpeg(filename = "States In Two Words v2.jpeg", width = 1200, height = 600, quality = 100)
map("state", interior = FALSE, col="gray40", fill=FALSE)
map("state", boundary = FALSE, col="gray", add = TRUE)
text(x=as.numeric(as.character(texts$x)), y=as.numeric(as.character(texts$y)), apply(texts[,4:5] , 1 , paste , collapse = "\n" ), cex=1, family="Humor Sans", col=texts$col)
dev.off()

Visualizing Home Ownership With Small Multiples And R

If everybody had an ocean, across the U.S.A., then everybody’d be surfin’ like California (Beach Boys, Surfin’ U.S.A.)

home_ownership

I was invited to write a post for Domino Data Lab, a company based in California which provides a cloud-based machine learning platform which enables companies to use the power of the cloud to build analytical projects. I also discovered recently this book which support the premises of companies like Domino Data Lab which are leading the change in the way of doing data science. How I wish to forget in the future expressions like execution time, update versions and memory limit!

Since I like a lot Small multiples, I decided to plot the evolution of homeownership across the United States (the more I use GridExtra package the more I like it). You can read the post here (code included).

By the way, if you want to go to Gigaom Structure Data 2015 for free, Domino Data Lab is giving away 2 tickets here.

Simple Data Science Of Global Warming In KDnuggets

Would love to get a post from you for KDnuggets (Gregory Piatetsky, KDnuggets President)

logoSome days ago, Gregory Piatetsky invited me to write a post for KDnuggets. I couldn’t say no. He suggested to me some topics and I decided to experiment around climate change to demonstrate how easy is to see some evidences of this alarming menace. You can read the post here.

This is the code I wrote to do this experiment:

require(sqldf)
require(googleVis)
require(ggplot2)
require(ggthemes)
setwd("YOUR WORKING DIRECTORY HERE") #This line doen's work until you type a valid path
#Data Avaliable in http://eca.knmi.nl/utils/downloadfile.php?file=download/ECA_blend_tg.zip
files=list.files(pattern = "TG_STAID")
results=data.frame(staid=character(0), trnd=numeric(0), nobs=numeric(0))
#Loop to calculate linear trends
for (i in 1:length(files))
{
  table=read.table(files[i], header=TRUE, sep = ',', skip=20)
  table=table[table$Q_TG==0,]
  table$date=as.Date(as.character(table$DATE), "%Y%m%d")
  results=rbind(data.frame(staid=files[i], trnd=lm.fit (x = matrix(table$date), y = table$TG)$coefficients, nobs=nrow(table)), results)
}
write.csv(results, file="results.csv", row.names = FALSE)#Save your work
results=read.csv(file="results.csv")#Read your work. You can start here if you stop your work further this line
#Remove outliers
results=results[!results$trnd %in% boxplot.stats(results$trnd, coef = 4)$out,]
#Histogram
hist(x=results$trnd, breaks = 50,
     col = "orange",
     border = "pink",
     freq=TRUE,
     xlim=c(-0.03, 0.03),
     ylim=c(0, 300),
     xlab="Historical trend of daily mean temperatures",
     ylab="Number of stations",
     main="Evolution of temperatures in Europe and the Mediterranean",
     sub="Source: European Climate Assessment & Dataset project")
results$staid2=as.numeric(gsub("[^0-9]","",results$staid)) #To join results with geographical coordinates
#Read table of geographical coordinates
staids=read.table("http://www.ecad.eu/download/ECA_all_stations.txt", header=TRUE, sep = ',', skip=17)
#Right tail of the distribution
hotpoints=sqldf("SELECT a.staid, a.staid2, a.trnd, a.nobs, b.staname, b.lat, b.lon
      FROM results a INNER JOIN staids b ON (a.staid2=b.staid) WHERE TRND >= 0.02")
#Convert degrees:minutes:seconds to decimal coordinates
hotpoints=within(hotpoints, {
  dms=do.call(rbind, strsplit(as.character(LAT), ":"))
  lat=sign(as.numeric(dms[,1]))*(abs(as.numeric(dms[,1]))+(as.numeric(dms[,2]) + as.numeric(dms[,3])/60)/60);rm(dms)
})
hotpoints=within(hotpoints, {
  dms=do.call(rbind, strsplit(as.character(LON), ":"))
  lon=sign(as.numeric(dms[,1]))*(abs(as.numeric(dms[,1]))+(as.numeric(dms[,2]) + as.numeric(dms[,3])/60)/60);rm(dms)
})
#To make readable tha name of the station in the map
hotpoints$staname=gsub("^\\s+|\\s+$", "", hotpoints$STANAME)
#To prepare coordinates to gvis function
hotpoints$LatLong=with(hotpoints, paste(lat, lon, sep=":"))
#The amazing gvisMap function of googleVis package
hotpointsMap=gvisMap(hotpoints, "LatLong" , "STANAME",
        options=list(showTip=TRUE,
                     showLine=TRUE,
                     enableScrollWheel=TRUE,
                     mapType='terrain',
                     useMapTypeControl=TRUE))
plot(hotpointsMap)
#The plot one of this hot stations
InAmenas=read.table("TG_STAID000312.txt", header=TRUE, sep = ',', skip=20)
InAmenas=InAmenas[InAmenas$Q_TG==0,]
InAmenas$date=as.Date(as.character(InAmenas$DATE), "%Y%m%d")
ggplot(InAmenas, aes(date, TG)) +
  geom_line(color="red", alpha=0.8) +
  xlab("") +
  ylab("Mean temperature in 0.1C")+
  ggtitle("Mean temperature in IN-AMENAS (ALGERIA) 1958- 1998")+
  geom_smooth(method = "lm", se=FALSE, color="red", lwd=2)+
  theme_economist(base_size = 20, base_family = "sans", horizontal = TRUE,
                  dkpanel = FALSE, stata = FALSE)