The Somnambulist and Pi

How wary we are of something warm and unborn. Something calmly by zero will divide (Unbegotten, The Somnambulist)

Some time ago, I assumed the mission to draw a plot for the cover of the new album of The Somnambulist, a music band from Berlin. They wanted a circlization of Pi, which is a graphic where numbers are represented in a circular layout. The idea is connecting each digit of Pi to its successive digit with links to the position of the numerically corresponding external sectors. I used a color palette composed by 10 nuances of the visible spectrum as a tribute for Planck, as Marco (the vocalist) requested me. After a number of attempts:

somnambulist

The album is named Unbegotten, a german word which means archaic. As Marco told me, in theology it also means kind of eternal because of being never born and so never dying. I like how π is integrated into the title to substitute the string “tt” in the middle. Pi is also eternal so the association is genuine.

The music of The Somnambulist is intense, dark and powerful and is waiting for you here to listen it. My favorite song is the one that gives name to the album.

If you want to know more about circlizong numbers, you can visit this post, where you also can see the code I used as starting point to do this plot.

Bayesian Blood

The fourth, the fifth, the minor fall and the major lift (Hallelujah, Leonard Cohen)

Next problem is extracted from MacKay’s Information Theory, Inference and Learning Algorithms:

Two people have left traces of their own blood at the scene of a crime. A suspect, Oliver, is tested and found to have type ‘O’ blood. The blood groups of the two traces are found to be of type ‘O’ (a common type in the local population, having frequency 60%) and of type ‘AB’ (a rare type, with frequency 1%). Do these data give evidence in favor of the proposition that Oliver was one of the people who left blood at the scene?

To answer the question, let’s first remember the probability form of Bayes theorem:

p(H|D)=\dfrac{p(H)p(D|H)}{p(D)}

where:

  • p(H) is the probability of the hypothesis H before we see the data, called the prior
  • p(H|D) is the probablity of the hyothesis after we see the data, called the posterior
  • p(D|H) is the probability of the data under the hypothesis, called the likelihood
  • p(D)is the probability of the data under any hypothesis, called the normalizing constant

If we have two hypothesis, A and B, we can write the ratio of posterior probabilities like this:

\dfrac{p(A|D)}{p(B|D)}=\dfrac{p(A)p(D|A)}{p(B)p(D|B)}

If p(A)=1-p(B) (what means that A and B are mutually exclusive and collective exhaustive), then we can rewrite the ratio of the priors and the ratio of the posteriors as odds. Writing o(A) for odds in favor of A, we get the odds form of Bayes theorem:

o(A|D)=o(A)\dfrac{p(D|A)}{p(D|B)}

Dividing through by o(A) we have:

\dfrac{o(A|D)}{o(A)}=\dfrac{p(D|A)}{p(D|B)}

The term on the left is the ratio of the posteriors and prior odds. The term on the right is the likelihood ratio, also called the Bayes factor. If it is greater than 1, that means that the data were more likely under A than under B. And since the odds ratio is also greater than 1, that means that the odds are greater, in light of the data, than they were before. If the Bayes factor is less than 1, that means the data were less likely under A than under B, so th odds in favor of A go down.

Let’s go back to our initial problem. If Oliver left his blood at the crime scene, the probability of the data is just the probability that a random member of the population has type ‘AB’ blood, which is 1%. If Oliver did not leave blood at the scene, what is the the chance of finding two people, one with type ‘O’ and one with type ‘AB’? There are two ways it might happen: the first person we choose might have type ‘O’ and the second ‘AB’, or the other way around. So the probability in this case is 2(0.6)(0.01)=1.2%. Dividing probabilities of both scenarios we obtain a Bayes factor of 0.83, and we conclude that the blood data is evidence against Oliver’s guilt.

Once I read this example, I decided to replicate it using real data of blood type distribution by country from here. After cleaning data, I have this nice data set to work with:

For each country, I get the most common blood type (the one which the suspect has) and the least common and replicate the previous calculations. For example, in the case of Spain, the most common type is ‘O+’ with 36% and the least one is ‘AB-‘ with 0.5%. The Bayes factor is 0.005/(2(0.36)(0.005))=1.39 so data support the hypothesis of guilt in this case. Next chart shows Bayes factor accross countries:

Just some comments:

  • Sometimes data consistent with a hypothesis are not necessarily in favor of the hypothesis
  • How different is the distribution of blood types between countries!
  • If you are a estonian ‘A+’ murderer, choose carefully your accomplice

This is the code of the experiment:

library(rvest)
library(dplyr)
library(stringr)
library(DT)
library(highcharter)

# Webscapring of the table with the distribution of blood types
url <- "http://www.rhesusnegative.net/themission/bloodtypefrequencies/"
blood <- url %>%
   read_html() %>%
   html_node(xpath='/html/body/center/table') %>%
   html_table(fill=TRUE)

# Some data cleansing
blood %>% slice(-c(66:68)) -> blood

blood[,-c(1:2)] %>% 
  sapply(gsub, pattern=",", replacement=".") %>% 
  as.data.frame %>% 
  sapply(gsub, pattern=".79.2", replacement=".79") %>% 
  as.data.frame-> blood[,-c(1:2)]

blood %>% 
  sapply(gsub, pattern="%|,", replacement="") %>% 
  as.data.frame -> blood

blood[,-1] = apply(blood[,-1], 2, function(x) as.numeric(as.character(x)))


blood[,-c(1:2)] %>% mutate_all(funs( . / 100)) -> blood[,-c(1:2)]

# And finally, we have a nice data set
datatable(blood, 
          rownames = FALSE,
          options = list(
          searching = FALSE,
          pageLength = 10)) %>% 
  formatPercentage(3:10, 2)

# Calculate the Bayes factor
blood %>% 
  mutate(factor=apply(blood[,-c(1,2)], 1, function(x) {min(x)/(2*min(x)*max(x))})) %>% 
  arrange(factor)-> blood

# Data Visualization
highchart() %>% 
     hc_chart(type = "column") %>% 
     hc_title(text = "Bayesian Blood") %>%
     hc_subtitle(text = "An experiment about the Bayes Factor") %>%  
     hc_xAxis(categories = blood$Country, 
             labels = list(rotation=-90, style = list(fontSize = "12px")))  %>% 
     hc_yAxis(plotBands = list(list(from = 0, to = 1, color = "rgba(255,215,0, 0.8)"))) %>% 
     hc_add_series(data = blood$factor,
                   color = "rgba(255, 0, 0, 0.5)",
                   name = "Bayes Factor")%>% 
  hc_yAxis(min=0.5) %>% 
  hc_tooltip(pointFormat = "{point.y:.2f}") %>% 
  hc_legend(enabled = FALSE) %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_chart(zoomType = "xy")

Visualizing the Daily Variability of Bitcoin with Quandl and Highcharts

Lay your dreams, little darling, in a flower bed; let that sunshine in your hair (Where the skies are blue, The Lumineers)

I discovered this nice visualization some days ago. The author is also the creator of Highcharter, an incredible R wrapper for Highcharts javascript libray and its modules. I am a big fan of him.

Inspired by his radial plot, I did a visualization of the daily evolution of Daily Bitcoin exchange rate (BTC vs. EUR) on Localbtc. Data is sourced from here and I used Quandl to obtain the data frame. Quandl is a marketplace for financial and economic data delivered in modern formats for today’s analysts. There is a package called Quandl to interact directly with the Quandl API to download data in a number of formats usable in R. You only need to locate the data you want in the Quandl site. In my case data are here.

After loading data, I do the folowing steps:

  • Filtering data to obtain last 12 complete months
  • Create a new variable with the difference between closing and opening price of Bitcoin (in Euros)
  • Create a color variable to distinguish between positive and negative differences
  • Create the graph using Fivethirtyeight theme for highcharts

This is the result:

Apart of its appealing, I think is a good way to to have a quick overview of the evolution of a stock price. This is the code to do the experiment:

library(Quandl)
library(dplyr)
library(highcharter)
library(lubridate)
bitcoin=Quandl("BCHARTS/LOCALBTCEUR")
bitcoin %>% 
  arrange(Date) %>% 
  mutate(tmstmp = datetime_to_timestamp(Date)) -> bitcoin
last_date=max(bitcoin$Date)
if (day(last_date+1)==1) date_to=last_date else 
  date_to=ymd(paste(year(last_date), month(last_date),1, sep="-"))-1
date_from=ymd(paste(year(date_to)-1, month(date_to)+1,1, sep="-"))
bitcoin %>% filter(Date>=date_from, Date<=date_to) -> bitcoin
var_bitcoin <- bitcoin %>% 
  mutate(Variation = Close - Open,
         color = ifelse(Variation>=0, "green", "red"),
         y = Variation) %>% 
  select(x = tmstmp,
         y,
         variation = Variation,
         name = Date,
         color,
         open = Open,
         close = Close) %>% 
  list.parse3()
x <- c("Open", "Close", "Variation")
y <- sprintf("{point.%s}", tolower(x))
tltip <- tooltip_table(x, y)
hc <- highchart() %>% 
  hc_title(text = "Bitcoin Exchange Rate (BTC vs. EUR)") %>% 
  hc_subtitle(text = "Daily Variation on Localbtc. Last 12 months")%>% 
  hc_chart(
    type = "column",
    polar = TRUE) %>%
  hc_plotOptions(
    series = list(
      stacking = "normal",
      showInLegend = FALSE)) %>% 
  hc_xAxis(
    gridLineWidth = 0.5,
    type = "datetime",
    tickInterval = 30 * 24 * 3600 * 1000,
    labels = list(format = "{value: %b}")) %>% 
  hc_yAxis(showFirstLabel = FALSE) %>% 
  hc_add_series(data = var_bitcoin) %>% 
  hc_add_theme(hc_theme_538()) %>% 
  hc_tooltip(useHTML = TRUE,
    headerFormat = as.character(tags$small("{point.x:%d %B, %Y}")),
    pointFormat = tltip)
hc

Chaotic Galaxies

Tell me, which side of the earth does this nose come from? Ha! (ALF)

Reading about strange attractors I came across with this book, where I discovered a way to generate two dimensional chaotic maps. The generic equation is pretty simple:

x_{n+1}= a_{1}+a_{2}x_{n}+a_{3}x_{n}^{2}+a_{4}x_{n}y_{n}+a_{5}y_{n}+a_{6}y_{n}^{2}
y_{n+1}= a_{7}+a_{8}x_{n}+a_{9}x_{n}^{2}+a_{10}x_{n}y_{n}+a_{11}y_{n}+a_{12}y_{n}^{2}

I used it to generate these chaotic galaxies:

Changing the vector of parameters you can obtain other galaxies. Do you want to try?

library(ggplot2)
library(dplyr)
#Generic function
attractor = function(x, y, z)
{
  c(z[1]+z[2]*x+z[3]*x^2+ z[4]*x*y+ z[5]*y+ z[6]*y^2, 
    z[7]+z[8]*x+z[9]*x^2+z[10]*x*y+z[11]*y+z[12]*y^2)
}
#Function to iterate the generic function over the initial point c(0,0)
galaxy= function(iter, z)
{
  df=data.frame(x=0,y=0)
  for (i in 2:iter) df[i,]=attractor(df[i-1, 1], df[i-1, 2], z)
  df %>% rbind(data.frame(x=runif(iter/10, min(df$x), max(df$x)), 
                          y=runif(iter/10, min(df$y), max(df$y))))-> df
  return(df)
}
opt=theme(legend.position="none",
          panel.background = element_rect(fill="#00000c"),
          plot.background = element_rect(fill="#00000c"),
          panel.grid=element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),
          axis.text=element_blank(),
          plot.margin=unit(c(-0.1,-0.1,-0.1,-0.1), "cm"))
#First galaxy
z1=c(1.0, -0.1, -0.2,  1.0,  0.3,  0.6,  0.0,  0.2, -0.6, -0.4, -0.6,  0.6)
galaxy1=galaxy(iter=2400, z=z1) %>% ggplot(aes(x,y))+
  geom_point(shape= 8, size=jitter(12, factor=4), color="#ffff99", alpha=jitter(.05, factor=2))+
  geom_point(shape=16, size= jitter(4, factor=2), color="#ffff99", alpha=jitter(.05, factor=2))+
  geom_point(shape=46, size= 0, color="#ffff00")+opt
#Second galaxy
z2=c(-1.1, -1.0,  0.4, -1.2, -0.7,  0.0, -0.7,  0.9,  0.3,  1.1, -0.2,  0.4)
galaxy2=galaxy(iter=2400, z=z2) %>% ggplot(aes(x,y))+
  geom_point(shape= 8, size=jitter(12, factor=4), color="#ffff99", alpha=jitter(.05, factor=2))+
  geom_point(shape=16, size= jitter(4, factor=2), color="#ffff99", alpha=jitter(.05, factor=2))+
  geom_point(shape=46, size= 0, color="#ffff00")+opt
#Third galaxy
z3=c(-0.3,  0.7,  0.7,  0.6,  0.0, -1.1,  0.2, -0.6, -0.1, -0.1,  0.4, -0.7)
galaxy3=galaxy(iter=2400, z=z3) %>% ggplot(aes(x,y))+
  geom_point(shape= 8, size=jitter(12, factor=4), color="#ffff99", alpha=jitter(.05, factor=2))+
  geom_point(shape=16, size= jitter(4, factor=2), color="#ffff99", alpha=jitter(.05, factor=2))+
  geom_point(shape=46, size= 0, color="#ffff00")+opt
#Fourth galaxy
z4=c(-1.2, -0.6, -0.5,  0.1, -0.7,  0.2, -0.9,  0.9,  0.1, -0.3, -0.9,  0.3)
galaxy4=galaxy(iter=2400, z=z4) %>% ggplot(aes(x,y))+
  geom_point(shape= 8, size=jitter(12, factor=4), color="#ffff99", alpha=jitter(.05, factor=2))+
  geom_point(shape=16, size= jitter(4, factor=2), color="#ffff99", alpha=jitter(.05, factor=2))+
  geom_point(shape=46, size= 0, color="#ffff00")+opt

Gummy Worms

Just keep swimming (Dory in Finding Nemo)

Inspired by this post, I decided to create gummy worms like this:

worm
Or these:

When I was young I used to eat them.

Do you want to try? This is the code:

library(rgl)
library(RColorBrewer)
t=seq(1, 6, by=.04)
f = function(a, b, c, d, e, f, t) exp(-a*t)*sin(t*b+c)+exp(-d*t)*sin(t*e+f)
v1=runif(6,0,1e-02)
v2=runif(6, 2, 3)
v3=runif(6,-pi/2,pi/2)
open3d()
spheres3d(x=f(v1[1], v2[1], v3[1], v1[4], v2[4], v3[4], t),
          y=f(v1[2], v2[2], v3[2], v1[5], v2[5], v3[5], t),
          z=f(v1[3], v2[3], v3[3], v1[6], v2[6], v3[6], t),
          radius=.3, color=sample(brewer.pal(8, "Dark2"),1))

Visualizing the Gender of US Senators With R and Highmaps

I wake up every morning in a house that was built by slaves (Michelle Obama)

Some days ago I was invited by the people of Highcharts to write a post in their blog. What I have done is a simple but revealing map of women senators of the United States of America. Briefly, this is what I’ve done to generate it:

  • read from the US senate website a XML file with senators info
  • clean and obtain gender of senators from their first names
  • summarize results by state
  • join data with a US geojson dataset to create the highmap

You can find details and R code here.

It is easy creating a highcharts using highcharter, an amazing library as genderizeR, the one I use to obtain gender names. I like them a lot.

Visualizing Stirling’s Approximation With Highcharts

I said, “Wait a minute, Chester, you know I’m a peaceful man”, He said, “That’s okay, boy, won’t you feed him when you can” (The Weight, The Band)

It is quite easy to calculate the probability of obtaining the same number of heads and tails when tossing a coin N times, and N is even. There are 2^{N} possible outcomes and only C_{N/2}^{N} are favorable so the exact probability is the quotient of these numbers (# of favorable divided by # of possible).

There is another way to approximate this number incredibly well: to use the Stirling’s formula, which is 1/\sqrt{\pi\cdot N/2}

The next plot represents both calculations for N from 2 to 200. Although for small values of N, Stirling’s approximation tends to overestimate probability, you can see hoy is extremely precise as N becomes bigger:

James Stirling published this amazing formula in 1730. It simplifies the calculus to the extreme and also gives a quick way to obtain the answer to a very interesting question: how many tosses are needed to be sure that the probability of obtaining the same number of heads and tails is under any given threshold? Just solve the formula for N and you will obtain the answer. And, also, the formula is another example of the presence of pi in the most unexpected places, as happens here.

Just another thing: the more I use highcharter package the more I like it.

This is the code:

library(highcharter)
library(dplyr)
data.frame(N=seq(from=2, by=2, length.out = 100)) %>%
  mutate(Exact=choose(N,N/2)/2**N, Stirling=1/sqrt(pi*N/2))->data
hc <- highchart() %>% 
  hc_title(text = "Stirling's Approximation") %>% 
  hc_subtitle(text = "How likely is getting 50% heads and 50% tails tossing a coin N times?") %>% 
  hc_xAxis(title = list(text = "N: Number of tosses"), categories = data$N) %>% 
  hc_yAxis(title = list(text = "Probability"), labels = list(format = "{value}%", useHTML = TRUE)) %>% 
  hc_add_series(name = "Stirling", data = data$Stirling*100,  marker = list(enabled = FALSE), color="blue") %>% 
  hc_add_series(name = "Exact", data = data$Exact*100,  marker = list(enabled = FALSE), color="lightblue") %>% 
  hc_tooltip(formatter = JS("function(){return ('<b>Number of tosses: </b>'+this.x+'
<b>Probability: </b>'+Highcharts.numberFormat(this.y, 2)+'%')}")) %>%
  hc_exporting(enabled = TRUE) %>%
  hc_chart(zoomType = "xy")
hc

Amazing Things That Happen When You Toss a Coin 12 Times

If there is a God, he’s a great mathematician (Paul Dirac)

Imagine you toss a coin 12 times and you count how many heads and tails you are obtaining after each throwing (the coin is equilibrated so the probability of head or tail is the same). At some point, it can happen that number of heads and number of tails are the same. For example, if you obtain the sequence T-H-T-T-H-T-H-H-T-T-H-H, after the second throwing, number of heads is equal to number of tails (and both equal to one). It happens again after the 8th throwing and after last one. In this example, the last throwing where equallity occurs is the number 12. Obviously, equallity can only be observed in even throwings.

If you repeat the experiment 10.000 times you will find something like this if you draw the relative frequency of the last throwing where cumulated number of heads is equal to the one of tails:

Coin12Times
From my point of view there are three amazing things in this plot:

  1. It is symmetrical, so prob(n)=prob(12-n)
  2. The least likely throwing to obtain the last equality is the central one.
  3. As a corollary, the most likely is not obtaining any equality (number of heads never are the same than number of tails) or obtaining last equality in the last throwing: two extremely different scenarios with the same chances to be observed.

Behind the simplicity of tossing coins there is a beautiful universe of mathematical surprises.

library(dplyr)
library(ggplot2)
library(scales)
tosses=12
iter=10000
results=data.frame(nmax=numeric(0), count=numeric(0), iter=numeric(0))
tmp=data.frame(nmax=numeric(0))
for (j in 1:iter)
{
data.frame(x=sample(c(-1,1), size=tosses, replace=TRUE)) %>%
add_rownames(var = "n") %>%
mutate(cumsum = cumsum(x)) %>% filter(cumsum==0) %>%
summarize(nmax=max(as.numeric(n))) %>% rbind(tmp)->tmp
}
tmp %>%
group_by(nmax) %>%
summarize(count=n()) %>%
mutate(nmax=ifelse(is.finite(nmax), nmax, 0), iter=iter) %>%
rbind(results)->results
opts=theme(
panel.background = element_rect(fill="darkolivegreen1"),
panel.border = element_rect(colour="black", fill=NA),
axis.line = element_line(size = 0.5, colour = "black"),
axis.ticks = element_line(colour="black"),
panel.grid.major = element_line(colour="white", linetype = 1),
panel.grid.minor = element_blank(),
axis.text.y = element_text(colour="black"),
axis.text.x = element_text(colour="black"),
text = element_text(size=20),
legend.key = element_blank(),
plot.title = element_text(size = 30)
)
ggplot(results, aes(x=nmax, y=count/iter)) +
geom_line(size=2, color="green4")+
geom_point(size=8, fill="green4", colour="darkolivegreen1",pch=21)+
scale_x_continuous(breaks = seq(0, tosses, by=2))+
scale_y_continuous(labels=percent, limits=c(0, .25))+
labs(title="What happens when you toss a coin 12 times?",
x="Last throwing where cumulated #tails = #heads",
y="Probability (estimated)")+opts

Playing With Julia (Set)

Viento, me pongo en movimiento y hago crecer las olas del mar que tienes dentro (Tercer Movimiento: Lo de Dentro, Extremoduro)

I really enjoy drawing complex numbers: it is a huge source of entertainment for me. In this experiment I play with the Julia Set, another beautiful fractal like this one. This is what I have done:

  • Choosing the function f(z)=exp(z3)-0.621
  • Generating a grid of complex numbers with both real and imaginary parts in [-2, 2]
  • Iterating f(z) over the grid a number of times so zn+1 = f(zn)
  • Drawing the resulting grid as I did here
  • Gathering all plots into a GIF with ImageMagick as I did in my previous post: each frame corresponds to a different number of iterations

This is the result:

julia

I love how easy is doing difficult things in R. You can play with the code changing f(z) as well as color palettes. Be ready to get surprised:

library(ggplot2)
library(dplyr)
library(RColorBrewer)
setwd("YOUR WORKING DIRECTORY HERE")
dir.create("output")
setwd("output")
f = function(z,c) exp(z^3)+c
# Grid of complex
z0 <- outer(seq(-2, 2, length.out = 1200),1i*seq(-2, 2, length.out = 1200),'+') %>% c()
opt <-  theme(legend.position="none",
              panel.background = element_rect(fill="white"),
              plot.margin=grid::unit(c(1,1,0,0), "mm"),
              panel.grid=element_blank(),
              axis.ticks=element_blank(),
              axis.title=element_blank(),
              axis.text=element_blank())
for (i in 1:35)
{
  z=z0
  # i iterations of f(z)
  for (k in 1:i) z <- f(z, c=-0.621) df=data.frame(x=Re(z0), y=Im(z0), z=as.vector(exp(-Mod(z)))) %>% na.omit() 
  p=ggplot(df, aes(x=x, y=y, color=z)) + 
    geom_tile() + 
    scale_x_continuous(expand=c(0,0))+
    scale_y_continuous(expand=c(0,0))+
    scale_colour_gradientn(colours=brewer.pal(8, "Paired")) + opt
  ggsave(plot=p, file=paste0("plot", stringr::str_pad(i, 4, pad = "0"),".png"), width = 1.2, height = 1.2)
}
# Place the exact path where ImageMagick is installed
system('"C:\\Program Files\\ImageMagick-6.9.3-Q16\\convert.exe" -delay 20 -loop 0 *.png julia.gif')
# cleaning up
file.remove(list.files(pattern=".png"))