Tag Archives: Rstats

Simple Data Science Of Global Warming In KDnuggets

January 20, 2015CollaborationsClimate change, Global warming, Gregory Piatetsky, KDnuggets, R, Rstats@aschinchon

Would love to get a post from you for KDnuggets (Gregory Piatetsky, KDnuggets President)

Some days ago, Gregory Piatetsky invited me to write a post for KDnuggets. I couldn’t say no. He suggested to me some topics and I decided to experiment around climate change to demonstrate how easy is to see some evidences of this alarming menace. You can read the post here.

This is the code I wrote to do this experiment:

require(sqldf)
require(googleVis)
require(ggplot2)
require(ggthemes)
setwd("YOUR WORKING DIRECTORY HERE") #This line doen's work until you type a valid path
#Data Avaliable in http://eca.knmi.nl/utils/downloadfile.php?file=download/ECA_blend_tg.zip
files=list.files(pattern = "TG_STAID")
results=data.frame(staid=character(0), trnd=numeric(0), nobs=numeric(0))
#Loop to calculate linear trends
for (i in 1:length(files))
{
  table=read.table(files[i], header=TRUE, sep = ',', skip=20)
  table=table[table$Q_TG==0,]
  table$date=as.Date(as.character(table$DATE), "%Y%m%d")
  results=rbind(data.frame(staid=files[i], trnd=lm.fit (x = matrix(table$date), y = table$TG)$coefficients, nobs=nrow(table)), results)
}
write.csv(results, file="results.csv", row.names = FALSE)#Save your work
results=read.csv(file="results.csv")#Read your work. You can start here if you stop your work further this line
#Remove outliers
results=results[!results$trnd %in% boxplot.stats(results$trnd, coef = 4)$out,]
#Histogram
hist(x=results$trnd, breaks = 50,
     col = "orange",
     border = "pink",
     freq=TRUE,
     xlim=c(-0.03, 0.03),
     ylim=c(0, 300),
     xlab="Historical trend of daily mean temperatures",
     ylab="Number of stations",
     main="Evolution of temperatures in Europe and the Mediterranean",
     sub="Source: European Climate Assessment & Dataset project")
results$staid2=as.numeric(gsub("[^0-9]","",results$staid)) #To join results with geographical coordinates
#Read table of geographical coordinates
staids=read.table("http://www.ecad.eu/download/ECA_all_stations.txt", header=TRUE, sep = ',', skip=17)
#Right tail of the distribution
hotpoints=sqldf("SELECT a.staid, a.staid2, a.trnd, a.nobs, b.staname, b.lat, b.lon
      FROM results a INNER JOIN staids b ON (a.staid2=b.staid) WHERE TRND >= 0.02")
#Convert degrees:minutes:seconds to decimal coordinates
hotpoints=within(hotpoints, {
  dms=do.call(rbind, strsplit(as.character(LAT), ":"))
  lat=sign(as.numeric(dms[,1]))*(abs(as.numeric(dms[,1]))+(as.numeric(dms[,2]) + as.numeric(dms[,3])/60)/60);rm(dms)
})
hotpoints=within(hotpoints, {
  dms=do.call(rbind, strsplit(as.character(LON), ":"))
  lon=sign(as.numeric(dms[,1]))*(abs(as.numeric(dms[,1]))+(as.numeric(dms[,2]) + as.numeric(dms[,3])/60)/60);rm(dms)
})
#To make readable tha name of the station in the map
hotpoints$staname=gsub("^\\s+|\\s+$", "", hotpoints$STANAME)
#To prepare coordinates to gvis function
hotpoints$LatLong=with(hotpoints, paste(lat, lon, sep=":"))
#The amazing gvisMap function of googleVis package
hotpointsMap=gvisMap(hotpoints, "LatLong" , "STANAME",
        options=list(showTip=TRUE,
                     showLine=TRUE,
                     enableScrollWheel=TRUE,
                     mapType='terrain',
                     useMapTypeControl=TRUE))
plot(hotpointsMap)
#The plot one of this hot stations
InAmenas=read.table("TG_STAID000312.txt", header=TRUE, sep = ',', skip=20)
InAmenas=InAmenas[InAmenas$Q_TG==0,]
InAmenas$date=as.Date(as.character(InAmenas$DATE), "%Y%m%d")
ggplot(InAmenas, aes(date, TG)) +
  geom_line(color="red", alpha=0.8) +
  xlab("") +
  ylab("Mean temperature in 0.1C")+
  ggtitle("Mean temperature in IN-AMENAS (ALGERIA) 1958- 1998")+
  geom_smooth(method = "lm", se=FALSE, color="red", lwd=2)+
  theme_economist(base_size = 20, base_family = "sans", horizontal = TRUE,
                  dkpanel = FALSE, stata = FALSE)

Maths, Music and Merkbar

January 15, 2015Drawingscirclize, Merkbar, music, Phi, R, Rstats@aschinchon

Control is what we already know. Control is where we have already ventured. Control is what helps us predict the future. (Merkbar)

Maths and music get along very well. Last December I received a mail from a guy called Jesper. He is one of the two members of Merkbar: a electronic music band from Denmark. As can be read in their website:

Merkbar is Jesper and David who are both interested in the psychedelic worlds and oriental spiritualism. They both studied Computer Music, where they’ve done research in sound synthesis, generative composition and the design of new digital instrument.

They asked me a front cover for their new album which will be released at the beginning of 2015. Why? Because they liked this post I did about circlizing numbers. To do this plot I circlized the Golden Ratio number (Phi). But in this case I changed ribbons (all equal pairs of consecutive numbers gather together) by lines (every pair of consecutive numbers form a different line). As I did before, I used circlize package, which implements in R the features of Circos, a software to create stunning circular visualizations.

The final plot represents the first 2.000 digits of Phi:

You can hear an advancement of their new album here, which is called “Phi”. Enjoy their sensitive and full-of-shades music: you will be delightfully surprised as I was.

This is the code to circlize Phi:

library(circlize)
library(scales)
factors = as.factor(0:9)
lines = 2000 #Number of lines to plot in the graph
alpha = 0.4  #Alpha for color lines
colors0=c(rgb(239,143,121, max=255), rgb(126,240,188, max=255), rgb(111,228,235, max=255),
          rgb(127,209,249, max=255), rgb( 74,106,181, max=255), rgb(114,100,188, max=255),
          rgb(181,116,234, max=255), rgb(226,135,228, max=255), rgb(239,136,192, max=255),
          rgb(233,134,152, max=255))
#You can find the txt file here: http://www.goldennumber.net/wp-content/uploads/2012/06/Phi-To-100000-Places.txt
phi=readLines("Phi-To-100000-Places.txt")[5]
phi=gsub("\\.","", substr(phi,1,lines))
phi=gsub("\\.","", phi)
position=1/(nchar(phi)-1)
#This code generates a pdf file in your working directory
pdf(file="CirclizePhi.pdf", width=25, height=25)
circos.clear()
par(mar = c(1, 1, 1, 1), lwd = 0.1, cex = 0.7, bg=alpha("black", 1))
circos.par("cell.padding"=c(0.01,0.01), "track.height" = 0.025, "gap.degree" = 3)
circos.initialize(factors = factors, xlim = c(0, 1))
circos.trackPlotRegion(factors = factors, ylim = c(0, 1))
for (i in 0:9) {circos.updatePlotRegion(sector.index = as.character(i), bg.col = alpha("black", 1), bg.border=alpha(colors0[i+1], 1))}
for (i in 1:(nchar(phi)-1)) {
  m=min(as.numeric(substr(phi, i, i)), as.numeric(substr(phi, i+1, i+1)))
  M=max(as.numeric(substr(phi, i, i)), as.numeric(substr(phi, i+1, i+1)))
  d=min((M-m),((m+10)-M))
  col=t(col2rgb(colors0[(as.numeric(substr(phi, i, i))+1)]))
  if (col[1]>255) col[1]=255;if (col[2]>255) col[2]=255;if (col[3]>255) col[3]=255
  if (col[1]<0) col[1]=0;if (col[2]<0) col[2]=0;if (col[3]<0) col[3]=0 if (d>0) circos.link(substr(phi, i, i), position*(i-1), substr(phi, i+1, i+1), position*i, h = 0.1375*d+0.1125, lwd=0, col=alpha(rgb(col, max=255), alpha), rou = 0.92)
}
dev.off()

First Anniversary Of Ripples

January 7, 2015MilestonesR, Rstats@aschinchon

If people do not believe that mathematics is simple, it is only because they do not realize how complicated life is (John von Neumann)

I started this blog one year ago and the experience has been better than I could have imagine:

Ripples has been viewed about 30.000 times
49 posts published (The Sound Of Mandelbrot Set was the most-viewed)
R-bloggers is my top referring site (thanks to Tal Galili for his support)
Visitors come from 141 countries: most of them came from The United States. Spain and U.K. were not far behind (below you can see my map of the empire)

I have met a lot of wonderful people along the way like Andrew Wyer with whom I wrote a post about 3D-Harmonographs, people of Merkbar, a musicians who asked me a plot for the cover of their new album (I will post about this project in the next days) and Gregory Piatetsky from KDnuggets (I expect also doing something with him in the close future).

I think this new year is going to be funny. I still have lot of experiments in my mind that I want to explore with R.

Happy 2015!

Mixing Benford, GoogleVis And On-Line Encyclopedia of Integer Sequences

December 23, 2014Curiosities, Going BananasBenford, Gauge, googleVis, OEIS, R, Rstats@aschinchon

The chess-board is the world; the pieces are the phenomena of the universe; the rules of the game are what we call the laws of Nature (T. H. Huxley)

One of the greatest packages I discovered recently is googleVis. While plotting with ggplot can be sometimes very arduous, doing plots with googleVis is extremely easy. Here you can find many examples of what you can do with this great package.

Not long ago, I also discovered The On-Line Encyclopedia of Integer Sequences (OEIS), a huge database of about 250.000 integer sequences where, for example, you can find the number of ways to lace a shoe that has n pairs of eyelets or the smallest number of stones in Tchoukaillon (or Mancala, or Kalahari) solitaire which make use of n-th hole. Many mathematicians, as Ralph Stephan, use this useful resource to develop their theories.

The third protagonist of this story is Frank Benford, who formulated in 1938 his famous law which states that considering different lists of numbers, 1 occurs as the leading digit about the 30% of time, while larger digits occur in that position less frequently.

In this experiment I read 20 random sequences from the OEIS. For each sequence, I obtain the distribution of first digit of the numbers and calculate the similarity with the theoretical distribution given by Benford’s Law so the more similar is the distribution, the closer is this number to 1. Sequences of OEIS are labeled with a seven characters code (an “A” followed by 6 digits). A nice way to show the output of this experiment is using the Gauge visualization of googleVis:

Sequence A001288 is the closest to the Benford’s Law. This sequence is the number of distinct 11-element subsets that can be formed from a n element set. Why is so close to the Benford’s Law? No idea further than binomial coefficients are related to some biological laws as number of descendants of a couple of rabbits.

I would like to wish you all a Merry Christmas and a Happy New Year:

library(plyr)
bendford=data.frame(first=0:9, freq=c(0,log10(1+1/(1:9))))
SequencesIds=formatC(sample(1:250000, 20, replace=FALSE), width = 6, format = "d", flag = "0")
results=data.frame(SEQID=character(0), BENDFORNESS=numeric(0))
for(i in 1:length(SequencesIds))
{
  SEQID = SequencesIds[i]
  TEXTFILE=paste("b", SEQID, ".txt", sep="")
  if (!file.exists(TEXTFILE)) download.file(paste("http://oeis.org/A",SEQID, "/b", SEQID, ".txt",sep=""), destfile = TEXTFILE)
  SEQ=readLines(paste("b", SEQID, ".txt", sep=""))
  SEQ=SEQ[SEQ != ""]
  SEQ=SEQ[unlist(gregexpr(pattern ='synthesized',SEQ))<0]
  m=t(sapply(SEQ, function(x) unlist(strsplit(x, " "))))
  df=data.frame(first=substr(gsub("[^0-9]","",m[,2]), 1, 1), row.names = NULL)
  df=count(df, vars = "first")
  df$freq=df$freq/sum(df$freq)
  df2=merge(x = bendford, y = df, by = "first", all.x=TRUE)
  df2[is.na(df2)]=0
  results=rbind(results, data.frame(SEQID=paste("A", SEQID, sep=""), BENDFORNESS=1-sqrt(sum((df2$freq.x - df2$freq.y) ^ 2))))
}
results$BENDFORNESS=as.numeric(format(round(results$BENDFORNESS, 2), nsmall = 2))
Gauge=gvisGauge(results, options=list(min=0, max=1, greenFrom=.75, greenTo=1, yellowFrom=.25, yellowTo=.75, redFrom=0, redTo=.25, width=400, height=300))
plot(Gauge)

The Awesome Parrondo’s Paradox

December 16, 2014Curiosities, Gamesggplot, ggplot2, paradox, Parrondo, R, Rstats, simulation@aschinchon

A technique succeeds in mathematical physics, not by a clever trick, or a happy accident, but because it expresses some aspect of physical truth (O. G. Sutton)

Imagine three unbalanced coins:

Coin 1: Probability of head=0.495 and probability of tail=0.505
Coin 2: Probability of head=0.745 and probability of tail=0.255
Coin 3: Probability of head=0.095 and probability of tail=0.905

Now let’s define two games using these coins:

Game A: You toss coin 1 and if it comes up head you receive 1€ but if not, you lose 1€
Game B: If your present capital is a multiple of 3, you toss coin 2. If not, you toss coin 3. In both cases, you receive 1€ if coin comes up head and lose 1€ if not.

Played separately, both games are quite unfavorable. Now let’s define Game A+B in which you toss a balanced coin and if it comes up head, you play Game A and play Game B otherwise. In other words, in Game A+B you decide between playing Game A or Game B randomly.

Starting with 0€, it is easy to simulate the three games along 500 plays. This is an example of one of these simulations:

Resulting profit of Game A+B after 500 plays is +52€ and is -9€ and -3€ for Games A and B respectively. Let’s do some more simulations (I removed legends and titles but colors of games are the same):

As you can see, Game A+B is the most profitable in almost all the previous simulations. Coincidence? Not at all. This is a consequence of the stunning Parrondo’s Paradox which states that two losing games can combine into a winning one.

If you still don’t believe in this brain-crashing paradox, following you can see the empirical distributions of final profits of three games after 1.000 plays:

After 1000 plays, mean profit of Game A is -13€, is -7€ for Game B and 17€ for Game A+B.

This paradox was discovered in the last nineties by the Spanish physicist Juan Parrondo and can help to explain, among other things, why investing in losing shares can result in obtaining big profits. Amazing:

require(ggplot2)
require(scales)
library(gridExtra)
opts=theme(
  legend.position = "bottom",
  legend.background = element_rect(colour = "black"),
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text.y = element_text(colour="gray25", size=15),
  axis.text.x = element_text(colour="gray25", size=15),
  text = element_text(size=20),
  plot.title = element_text(size = 35))
PlayGameA = function(profit, x, c) {if (runif(1) < c-x) profit+1 else profit-1}
PlayGameB = function(profit, x1, c1, x2, c2) {if (profit%%3>0) PlayGameA(profit, x=x1, c=c1) else PlayGameA(profit, x=x2, c=c2)}
####################################################################
#EVOLUTION
####################################################################
noplays=500
alpha=0.005
profit0=0
results=data.frame(Play=0, ProfitA=profit0, ProfitB=profit0, ProfitAB=profit0)
for (i in 1:noplays) {results=rbind(results, c(i,
    PlayGameA(profit=results[results$Play==(i-1),2], x =alpha, c =0.5),
    PlayGameB(profit=results[results$Play==(i-1),3], x1=alpha, c1=0.75, x2=alpha, c2=0.1),
    if (runif(1)<0.5) PlayGameA(profit=results[results$Play==(i-1),4], x =alpha, c =0.5) else PlayGameB(profit=results[results$Play==(i-1),4], x1=alpha, c1=0.75, x2=alpha, c2=0.1)
    ))}
results=rbind(data.frame(Play=results$Play, Game="A",   Profit=results$ProfitA),
              data.frame(Play=results$Play, Game="B",   Profit=results$ProfitB),
              data.frame(Play=results$Play, Game="A+B", Profit=results$ProfitAB))
ggplot(results, aes(Profit, x=Play, y=Profit, color = Game)) +
  scale_x_continuous(limits=c(0,noplays), "Plays")+
  scale_y_continuous(limits=c(-75,75), expand = c(0, 0), "Profit")+
  labs(title="Evolution of profit games along 500 plays")+
  geom_line(size=3)+opts
####################################################################
#DISTRIBUTION
####################################################################
noplays=1000
alpha=0.005
profit0=0
results2=data.frame(Play=numeric(0), ProfitA=numeric(0), ProfitB=numeric(0), ProfitAB=numeric(0))
for (j in 1:100) {results=data.frame(Play=0, ProfitA=profit0, ProfitB=profit0, ProfitAB=profit0)
  for (i in 1:noplays) {results=rbind(results, c(i,
      PlayGameA(profit=results[results$Play==(i-1),2], x =alpha, c =0.5),
      PlayGameB(profit=results[results$Play==(i-1),3], x1=alpha, c1=0.75, x2=alpha, c2=0.1),
      if (runif(1)<0.5) PlayGameA(profit=results[results$Play==(i-1),4], x =alpha, c =0.5)
      else PlayGameB(profit=results[results$Play==(i-1),4], x1=alpha, c1=0.75, x2=alpha, c2=0.1)))}
      results2=rbind(results2, results[results$Play==noplays, ])}
results2=rbind(data.frame(Game="A", Profit=results2$ProfitA),
data.frame(Game="B", Profit=results2$ProfitB),
data.frame(Game="A+B", Profit=results2$ProfitAB))
ggplot(results2, aes(Profit, fill = Game)) +
  scale_x_continuous(limits=c(-150,150), "Profit")+
  scale_y_continuous(limits=c(0,0.02), expand = c(0, 0), "Density", labels = percent)+
  labs(title=paste("Parrondo's Paradox (",as.character(noplays)," plays)",sep=""))+
  geom_density(alpha=.75)+opts

Going Bananas #1: Superman Was Born In Kentucky!

December 9, 2014Going Bananasggmap, Kentucky, Krypton, plotGoogleMaps, R, RCurl, Rstats, Superman, XML@aschinchon

Who knows? Could be the tropic heat or something that I eat, that makes me gonzo (I’m Going Bananas, Madonna)

This is the first post of a new category called “Going Bananas” in which I will post totally crazy experiments. My aim is to try some techniques without taking care of making serious things. These are experiments without rhyme or reason (or, as we say in Spanish, sin pies ni cabeza). So here we go.

Some days ago, my friend Juanjo show me how to download a webpage and to extract a table in R. He did an amazing script to locate on a map all spanish football stadiums in just a handful of lines. And this is what what I will do in this experiment with a slight difference. I will not locate stadiums: I will try to locate the 118 chemical elements discovered up today. How to do it? It is extremely easy: as I did it in this previous post, using geocode function of ggmap package you can locate any address, even if it is the name of a chemical element.

It must be taking into account that almost all names of the elements come from greek or from latin as you can see here. This is the result of locating in the world the 118 chemical elements:

Some conclusions:

Only 88 elements (a 75%) have a place on Earth.
40 of the elements with a place on Earth (a 45%) are located in the United States of America, very far from Greek-Latin cultures, by the way
Only 16 of them are located in Europe (a 16%)
South-Africa and Philipines have almost all the elements outside of EEUU and Europe.
And last but not least, Krypton is in Kentucky, as you can see here:

Clark Kent, Krypton, Kentucky … there are a lot of misteries surrounding Superman … and a lot of K letters!!!!

library(RCurl)
library(XML)
library(ggmap)
library (plotGoogleMaps)
webpage=getURL("http://en.wikipedia.org/wiki/List_of_elements", .opts = opts)
table=readHTMLTable(webpage)[[1]]
table=table[nchar(as.character(table$Element))>2,c("Sym", "Element")]
for (i in 1:nrow(table)) {
 valor=geocode(as.character(table$Element[i]))
 table$lon[i]=valor$lon;table$lat[i]=valor$lat}
table2paint=na.omit(table[, c("lat", "lon", "Sym", "Element")])
coordinates(table2paint)=~lon+lat
proj4string(table2paint)=CRS('+proj=longlat +datum=WGS84')
plotGoogleMaps(table2paint, col="yellow", control.width="0%", control.height="0%", mapTypeId='ROADMAP', legend=FALSE)

The World We Live In #3: Breastfeeding

December 2, 2014The World We Live InBreastfeeding, Fertility, ggplot, ggplot2, R, Rstats, UNdata@aschinchon

Facts are stubborn, but statistics are more pliable (Mark Twain)

According to World Health Organization, exclusive breastfeeding is recommended up to 6 months of age, with continued breastfeeding along with appropriate complementary foods up to two years of age or beyond. Thus, the defining characteristic of continued breastfeeding is that the infant between 6 months and 2 years of age receives at least some breast milk regardless of the quantity or the presence of other foods or liquids in the diet.

On the other hand, as can be read in The World Factbook of Central Intelligence Agency, the Total Fertility Rate (TFR) is the average number of children that would be born to a woman over her lifetime if she were to experience the exact current age-specific fertility rates through her lifetime and she were to survive from birth through the end of her reproductive life. It is obtained by summing the single-year age-specific rates at a given time.

This is how the world is arranged according to these two rates:

There are many differences between countries. Both rates are very low in some east European countries like Ukraine, Bosnia, Belarus and Moldova. On the other hand both of them are very high in Benin, Rwanda, Burkina Faso and Malawi, all of them African. Also African countries are Angola, Nigeria and Somalia where fertility rate is very high but breastfeeding is not very established (Timor-Leste in Asia belongs to this segment as well); and women in Nepal, Bangladesh, Sri-Lanka and India feed their moderate number of descendants with their own milk.

We live in a complex and beautiful world which cannot be measured only with averages nor standard deviations:

#Continued breastfeeding rate: http://data.un.org/Data.aspx?d=SOWC&f=inID%3a89
#Total fertility rate (TFR): http://data.un.org/Data.aspx?d=SOWC&f=inID%3a127
#Population: http://data.un.org/Data.aspx?d=SOWC&f=inID%3a105
require("sqldf")
require("ggplot2")
require("scales")
breastfeeding=read.csv("UNdata_Export_20141122_122134175.csv", nrows=124, header=T, row.names=NULL)
fertility=read.csv("UNdata_Export_20141122_122330581.csv", nrows=570, header=T, row.names=NULL)
population=read.csv("UNdata_Export_20141122_142359579.csv", nrows=999, header=T, row.names=NULL)
colnames(breastfeeding)[1]="Country"
colnames(fertility)[1]="Country"
colnames(population)[1]="Country"
data=sqldf("SELECT a.Country, a.Value as Pop, b.Value as Fertility, c.Value as Breastfeeding
           FROM population a inner join fertility b
           on (a.Country=b.Country) INNER JOIN breastfeeding c
           on (a.Country=c.Country)
           where a.Subgroup = 'Total' AND b.Year = 2011
           AND a.Country NOT IN ('World', 'South Asia',
           'Sub-Saharan Africa', 'Least Developed Countries/Territories', 'Eastern and Southern Africa',
           'East Asia and Pacific')")
opts=theme(
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text.y = element_text(colour="gray25", size=15),
  axis.text.x = element_text(colour="gray25", size=15),
  text = element_text(size=20),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 45))
ggplot(data, aes(x=Fertility, y=Breastfeeding/100, size=log(Pop), label=Country), guide=FALSE)+
  geom_point(colour="white", fill="darkorchid2", shape=21, alpha=.55)+
  scale_size_continuous(range=c(2,40))+
  scale_x_continuous(limits=c(1,7))+
  scale_y_continuous(limits=c(0,1), labels = percent)+
  labs(title="The World We Live In #3: Breastfeeding",
       x="Total fertility rate (TFR)",
       y="Continued breastfeeding rate")+
  geom_text(data=subset(data, Fertility>5 & (Breastfeeding>75|Breastfeeding<40)), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(data=subset(data, Fertility<3 & Breastfeeding>75), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(data=subset(data, Fertility<2 & Breastfeeding<12), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(aes(5, 0), colour="gray25", hjust=0, label="Source: United Nations (size of bubble depending on population)", size=4)+opts

Circlizing Numbers

November 25, 2014DrawingsChord Diagrams, circlize, R, RColorBrewer, Rmpfr, Rstats@aschinchon

She makes the sound the sea makes to calm me down (Dissolve Me, Alt-J)

Searching how to do draw chord diagrams in the Internet with ggplot2 I found a very-easy-to-use package called circlize which does exactly that. A chord diagram shows relationships between things so the input to draw it is simply a matrix with the intensity of these relations. In this experiment I use this package to circlize numbers in this way:

I take a number with many digits (Rmpfr package is very useful to obtain large numbers), I convert it to text and remove punctuation characters (necessary if number has decimals)
Function CreateAdjacencyMatrix creates a 10×10 matrix where the element [i,j] contains the number of times that number “i” precedes to number “j” in the previous string (i and j from 0 to 9); this is the input to create diagram.

These diagrams are the result of circlizing four famous constants: Pi (green), Gamma (purple), Catalan (blue) and Logarithm constants (red):

Just two conclusions of my own to end:

Circlize package is very easy to use and generates very nice diagrams
Chord diagrams remember me to dreamcatchers
The more I use RColorBrewer package the more I like it

This is the code to circlize numbers:

library(Rmpfr)
library(circlize)
library(RColorBrewer)
CreateAdjacencyMatrix = function(x) {
 s=gsub("\\.", "", x)
 m=matrix(0, 10, 10)
 for (i in 1:(nchar(s)-1)) m[as.numeric(substr(s, i, i))+1, as.numeric(substr(s, i+1, i+1))+1]=m[as.numeric(substr(s, i, i))+1, as.numeric(substr(s, i+1, i+1))+1]+1
 rownames(m) = 0:9;colnames(m) = 0:9
 m}
m1=CreateAdjacencyMatrix(formatMpfr(Const("pi",2000)))
m2=CreateAdjacencyMatrix(formatMpfr(Const("gamma",2000)))
m3=CreateAdjacencyMatrix(formatMpfr(Const("catalan",2000)))
m4=CreateAdjacencyMatrix(formatMpfr(Const("log2",2000)))
jpeg(filename = "Chords.jpg", width = 800, height = 800, quality = 100)
par(mfrow=c(2,2), mar = c(1, 1, 1, 1))
chordDiagram(m1, grid.col = "darkgreen",
 col = colorRamp2(quantile(m1, seq(0, 1, by = 0.25)), brewer.pal(5,"Greens")),
 transparency = 0.4, annotationTrack = c("name", "grid"))
chordDiagram(m2, grid.col = "mediumpurple4",
 col = colorRamp2(quantile(m2, seq(0, 1, by = 0.25)), brewer.pal(5,"Purples")),
 transparency = 0.4, annotationTrack = c("name", "grid"))
chordDiagram(m3, grid.col = "midnightblue",
 col = colorRamp2(quantile(m3, seq(0, 1, by = 0.25)), brewer.pal(5,"Blues")),
 transparency = 0.4, annotationTrack = c("name", "grid"))
chordDiagram(m4, grid.col = "red3",
 col = colorRamp2(quantile(m4, seq(0, 1, by = 0.25)), brewer.pal(5,"Reds")),
 transparency = 0.4, annotationTrack = c("name", "grid"))
dev.off()

Fronkonstin

Experiments in R