Tag Archives: Rstats

Going Bananas With Hilbert

It seemed that everything is in ruins, and that all the basic mathematical concepts have lost their meaning (Naum Vilenkin, Russian mathematician, regarding the discovery of Peano’s curve)

In 1890, Giuseppe Peano found a way to draw a curve that fills the plane: a single line that completely covers a two-dimensional surface. Its discovery was a real earthquake in the traditional structure of mathematics. Peano’s curve was the first space-filling curve, but not the last: another of them was discovered by Hilbert and bears his name. It is really beautiful:
hilbert_n5

Hilbert’s curve can be created iteratively. These are the first six iterations of its construction:
hilbert

As you will see below, the R code to create Hilbert’s curve is extremely simple. It is also very easy to play with the curve by altering the order in which points are sorted: replacing the initial matrix(1) with some other number produces quite appealing variations:
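To see what is going on under the hood, this minimal sketch (base R only) prints the visit-order matrix after two iterations starting from matrix(1) with r=2; melting this matrix and sorting its cells by order is all the hilbert function below does:

m = matrix(1)
for (i in 1:2) {
  tmp = cbind(t(m), m + nrow(m)^2)                       # transpose on the left, shifted copy on the right
  m = rbind(tmp, (2*nrow(m))^2 - tmp[nrow(m):1,] + 1)    # reflected complement below
}
m
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    5    6
# [2,]    2    3    8    7
# [3,]   15   14    9   10
# [4,]   16   13   12   11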

Let’s go further. Changing the ggplot geometry from geom_path to geom_polygon generates some crazy pseudo-tessellations:

And what if you change the matrix exponent?

And what if you apply polar coordinates?

We started with a simple line and, with a few small changes, we have created fantastical images, all in black and white. Do you want to add some colors? Try the following code (if you draw something interesting, please let me know):

library(reshape2)
library(dplyr)
library(ggplot2)
# Theme to remove legend, grid, axes and background
opt=theme(legend.position="none",
          panel.background = element_rect(fill="white"),
          panel.grid=element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),
          axis.text=element_blank())
# hilbert() builds the order in which the curve visits each cell of a square grid:
# each of the n iterations replaces the matrix with four transformed copies of
# itself; melt() then converts it into (x, y) points sorted along the path
hilbert = function(m,n,r) {
  for (i in 1:n)
  {
    tmp=cbind(t(m), m+nrow(m)^2)
    m=rbind(tmp, (2*nrow(m))^r-tmp[nrow(m):1,]+1)
  }
  melt(m) %>% plyr::rename(c("Var1" = "x", "Var2" = "y", "value"="order")) %>% arrange(order)}
# Original
ggplot(hilbert(m=matrix(1), n=1, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=2, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=3, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=4, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=6, r=2), aes(x, y)) + geom_path()+ opt
# Changing order
ggplot(hilbert(m=matrix(.5), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(0), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(tan(1)), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(3), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(-1), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(log(.1)), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(-15), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(-0.001), n=5, r=2), aes(x, y)) + geom_path()+ opt
# Polygons
ggplot(hilbert(m=matrix(log(1)), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(.5), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(tan(1)), n=5, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-15), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-25), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(0), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(1000000), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-1), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-.00001), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
# Changing exponent
ggplot(hilbert(m=matrix(log(1)), n=4, r=-1), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(.5), n=4, r=-2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(tan(1)), n=4, r=6), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-15), n=3, r=sin(2)), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-25), n=4, r=-.0001), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(0), n=4, r=200), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(1000000), n=3, r=.5), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-1), n=4, r=sqrt(2)), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-.00001), n=4, r=52), aes(x, y)) + geom_polygon()+ opt
# Polar coordinates
ggplot(hilbert(m=matrix(1), n=4, r=2), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(-1), n=5, r=2), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(.1), n=2, r=.5), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(1000000), n=2, r=.1), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(.25), n=3, r=3), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(tan(1)), n=5, r=1), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(1), n=4, r=1), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(log(1)), n=3, r=sin(2)), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(-.0001), n=4, r=25), aes(x, y)) + geom_polygon()+ coord_polar()+opt

A Checkpoint Of Spanish Football League

I am an absolute beginner, but I am absolutely sane (Absolute Beginners, David Bowie)

Some time ago I wrote this post, where I correctly predicted the winner of the Spanish Football League several months before the end of the season. After thinking intensely about the risk of ruining my reputation by repeating the analysis, I said to myself: “No problem, Antonio, do it again: after all, you don’t have any reputation to keep”. So here we are.

From a technical point of view there are many differences between the two analyses. Now I use web scraping to download the data, dplyr and pipes to transform it, and interactive D3.js charts to show the results. I think my code is better now, and that makes me happy.

As I did the other time, the Bradley-Terry model gives an indicator of the strength of each team, called ability, which provides a natural mechanism for ranking teams.
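As a quick reminder of the mechanics (a minimal sketch, with invented ability values): the model turns a difference in abilities into a winning probability, which is exactly what the prob_BT function in the code below computes.

prob_BT = function(a_i, a_j) {exp(a_i - a_j) / (1 + exp(a_i - a_j))}  # P(team i beats team j)
prob_BT(0.9, 0.4)  # an ability gap of 0.5 translates into a winning probability of about 62%

This is the evolution of the abilities of each team during the championship (the last matchday was played this past weekend):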

liga1_ability2

Although it is a bit messy, the graph shows two main groups of teams: on the one hand, Barcelona, Atletico de Madrid, Real Madrid and Villarreal; on the other, the rest. Let’s have a closer look at the evolution of the abilities of the top four teams:

liga2_ability2

While Barcelona, Atletico de Madrid and Real Madrid move in parallel, Villarreal seems to have got a bit stuck in the last few matchdays; the gap between them and Real Madrid is widening little by little. Maybe it is the Zidane effect. It is also quite interesting to discover which teams are increasing their abilities: Malaga, Eibar and Getafe. They will probably finish the championship in a better position than they currently hold (Eibar could even reach fifth place):

liga3_ability2

What about Villarreal? Will they climb some positions? I don’t think so. This plot shows the evolution of their probability of beating any of the top three:

liga4_villareal2

As you can see, the probability is decreasing significantly. And what about Barcelona? Will they win? That is a very difficult question. They are almost tied with Atletico de Madrid, and only 5 and 8 points ahead of Real Madrid and Villarreal respectively. But it seems Barcelona keeps them all at bay. This plot shows the evolution of the probability of Barcelona being beaten by Atletico, Real Madrid and Villarreal:

liga5_Barcelona2

All probabilities are below 50% and decreasing (I assumed a 2-0 win for Barcelona in the match against Sporting from matchday 16, which was postponed to February 17th).

Data science is a profession for brave people, so it is time to make some predictions. These are mine, ordered by likelihood:

  • Barcelona will win, followed by Atletico (2), Real Madrid (3), Villarreal (4) and Eibar (5)
  • Malaga and Getafe will go up some positions
  • Next year I will do the analysis again

Here you have the code:

library(rvest)
library(stringr)
library(BradleyTerry2)
library(dplyr)
library(reshape)
library(rCharts)
nseasons=20
results=data.frame()
for (i in 1:nseasons)
{
  webpage=paste0("http://www.marca.com/estadisticas/futbol/primera/2015_16/jornada_",i,"/")
  read_html(webpage) %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table(header=FALSE, fill=TRUE) %>%
    mutate(X4=i) %>%
    rbind(results)->results
}
colnames(results)=c("home", "score", "visiting", "season")
results %>% 
  mutate(home     = iconv(home,     from="UTF8",to="ASCII//TRANSLIT"),
         visiting = iconv(visiting, from="UTF8",to="ASCII//TRANSLIT")) %>%
  #filter(grepl("-", score)) %>%
  mutate(score=replace(score, score=="18:30 - 17/02/2016", "0-2")) %>% # fake result for Barcelona's postponed match
  mutate(score_home     = as.numeric(str_split_fixed(score, "-", 2)[,1])) %>%
  mutate(score_visiting = as.numeric(str_split_fixed(score, "-", 2)[,2])) %>%
  mutate(points_home     =ifelse(score_home > score_visiting, 3, ifelse(score_home < score_visiting, 0, 1))) %>%
  mutate(points_visiting =ifelse(score_home > score_visiting, 0, ifelse(score_home < score_visiting, 3, 1))) -> data
prob_BT=function(x, y) {exp(x-y) / (1 + exp(x-y))}
BTabilities=data.frame()
for (i in 13:nseasons)
{
  data %>% filter(season<=i) %>%
    BTm(cbind(points_home, points_visiting), home, visiting, data=.) -> footballBTModel
  # BTabilities() from the BradleyTerry2 package extracts the estimated abilities
  BTabilities(footballBTModel) %>%
  as.data.frame()  -> tmp 
  cbind(tmp, as.character(rownames(tmp)), i) %>% 
  mutate(ability=round(ability, digits = 2)) %>%
  rbind(BTabilities) -> BTabilities
}
colnames(BTabilities)=c("ability", "s.e.", "team", "season")
sort(unique(BTabilities[,"team"])) -> teams
BTprobabilities=data.frame()
for (i in 13:nseasons)
{
  BTabilities[BTabilities$season==i,1] %>% outer( ., ., prob_BT) -> tmp
  colnames(tmp)=teams
  rownames(tmp)=teams  
  cbind(melt(tmp),i) %>% rbind(BTprobabilities) -> BTprobabilities
}
colnames(BTprobabilities)=c("team1", "team2", "probability", "season")
BTprobabilities %>% 
  filter(team1=="Villarreal") %>% 
  mutate(probability=round(probability, digits = 2)) %>%
  filter(team2 %in% c("R. Madrid", "Barcelona", "Atletico")) -> BTVillareal
BTprobabilities %>% 
  filter(team2=="Barcelona") %>% 
  mutate(probability=round(probability, digits = 2)) %>%
  filter(team1 %in% c("R. Madrid", "Villarreal", "Atletico")) -> BTBarcelona
AbilityPlot <- nPlot(
  ability ~ season, 
  data = BTabilities, 
  group = "team",
  type = "lineChart")
AbilityPlot$yAxis(axisLabel = "Estimated Ability", width = 62)
AbilityPlot$xAxis(axisLabel = "Season")
VillarealPlot <- nPlot(
  probability ~ season, 
  data = BTVillareal, 
  group = "team2",
  type = "lineChart")
VillarealPlot$yAxis(axisLabel = "Probability of beating", width = 62)
VillarealPlot$xAxis(axisLabel = "Season")
BarcelonaPlot <- nPlot(
  probability ~ season, 
  data = BTBarcelona, 
  group = "team1",
  type = "lineChart")
BarcelonaPlot$yAxis(axisLabel = "Probability of being beaten", width = 62)
BarcelonaPlot$xAxis(axisLabel = "Season")

Climatic Change At A Glance

Mmm. Lost a planet, Master Obi-Wan has. How embarrassing (Yoda, Attack Of The Clones)

Some time ago I published this post in KDnuggets, in which I analyzed historical temperatures to show how we are gradually heading toward a warmer planet. Simple data science to reach a simple (and increasingly accepted) conclusion: global warming is real. Although I was criticized for it, I still believe what I said then: you don’t have to be a climatologist to empirically confirm global warming.

This experiment is another example of that. It is even simpler, since it is based only on visual perception, but I think it is also quite conclusive. In this case, I represent U.S. temperature outliers from 1964 to 2013, one map per year. The dataset contains the station ID and name, the min/max temperatures, and the coordinates of each recorded anomaly. The original weather data was collected by NOAA and the anomalies analysis was done by Enigma. You can download the data here.

Anomalies are divided into four categories: Strong Hot, Weak Hot, Weak Cold and Strong Cold. For each station, I compare the number of Cold and Hot anomalies (regardless of their strength): blue bubbles mark stations where the total number of Cold anomalies during the year is greater than the total number of Hot ones, and red bubbles mark the opposite. The size of each bubble reflects the total number of anomalies recorded at the station.
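This toy version of the indicator (the counts are invented, but the logic matches the code below) shows how the colour of a bubble is decided:

Strong_Hot=1; Weak_Hot=3; Weak_Cold=16; Strong_Cold=10
score=sign(-Strong_Hot-Weak_Hot+Weak_Cold+Strong_Cold)      # +1 when cold anomalies dominate
ifelse(score==1, "Blue", ifelse(score==0, "White", "Red"))  # returns "Blue"

As an example, below you can see the map for 1975: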

tonopah
It seems 1975 was hot on the right side of the map and cold on the left. Specifically, the TONOPAH station (Nevada) registered 30 anomalies, and most of them (26) were due to cold temperatures: this is why its bubble is blue. This GIF shows the evolution of these maps from 1964 to 2013:

anomalies

Maybe it is just my personal feeling, but don’t you see how the red bubbles are gradually winning out over the blue ones? Or maybe I am just a demagogue.

This code generates a dynamic map for each year in html format:

library(data.table)
library(stringr)
library(leaflet)
library(RColorBrewer)
library(classInt)
library(dplyr)
library(htmlwidgets)
anomalies = fread("enigma-enigma.weather.anomalies.outliers-1964-2013-05ce047dbf3e67f83db9ae841545a333.csv")
anomalies %>%
  mutate(year=substr(date_str, 1, 4)) %>%
  group_by(year, longitude, latitude, id, station_name) %>%
  summarise(
    Strong_Hot=sum(str_count(type,"Strong Hot")),
    Weak_Hot=sum(str_count(type,"Weak Hot")),
    Weak_Cold=sum(str_count(type,"Weak Cold")),
    Strong_Cold=sum(str_count(type,"Strong Cold")),
    total=n()) %>%
  mutate(score=sign(-Strong_Hot-Weak_Hot+Weak_Cold+Strong_Cold)) %>% # +1 when cold anomalies dominate
  mutate(color=ifelse(score==1, "Blue",ifelse(score==0, "White", "Red"))) -> anomalies2
for (i in unique(anomalies2$year))
{
  anomalies2 %>%
    filter(year==i) %>%
    leaflet() %>%
    fitBounds(-124, 34, -62, 40) %>%
    addProviderTiles("Stamen.TonerLite") %>%
    addCircleMarkers(lng = ~longitude,
                     lat = ~latitude,
                     radius = ~ifelse(total < 20, 2, ifelse(total < 27, 4, 8)),
                     color= ~color,
                     stroke=FALSE,
                     fillOpacity = 0.5,
                     popup = ~paste(sep = "<br/>",
                                    paste0("<b>", station_name, "</b>"),
                                    paste0("Strong Hot: ", Strong_Hot),
                                    paste0("Weak Hot: ", Weak_Hot),
                                    paste0("Weak Cold: ", Weak_Cold),
                                    paste0("Strong Cold: ", Strong_Cold))) -> m
    saveWidget(m, file=paste0("m", i, ".html"))
}

Phyllotaxis By Shiny

Antonio, you don’t know what empathy is! (Cecilia, my beautiful wife)

Spirals are nice. In the wake of my previous post, I have built a Shiny app to explore the patterns generated by changing the angle, the shape and the number of points of Fermat’s spiral equation. You can obtain an almost infinite number of images. This is just an example:
phyllotaxis12
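The pattern behind the app, stripped to its essentials (a quick base R sketch): the n-th point of Fermat’s spiral sits at radius sqrt(n) and angle n*theta, and when theta is the golden angle (about 137.5 degrees) the points arrange themselves like sunflower seeds:

theta = pi*(3-sqrt(5))  # golden angle in radians
n = 1:300
plot(sqrt(n)*cos(n*theta), sqrt(n)*sin(n*theta),
     pch=16, asp=1, axes=FALSE, xlab="", ylab="")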

I like thinking of imaginary flowers. This is why I called this experiment Phyllotaxis.
More examples:


Just one comment about the code: I wrote the Shiny app in a single R file, as this guy suggested to me some time ago after this post.

This is the code. Make your own imaginary flowers:

library(shiny)
library(ggplot2)
# Draws nob points of Fermat's spiral; ang is interpreted in degrees
CreatePlot = function (ang=pi*(3-sqrt(5)), nob=150, siz=15, alp=0.8, sha=16, col="black", bac="white") {
  ggplot(data.frame(r=sqrt(1:nob), t=(1:nob)*ang*pi/180), aes(x=r*cos(t), y=r*sin(t)))+
    geom_point(colour=col, alpha=alp, size=siz, shape=sha)+
    scale_x_continuous(expand=c(0,0), limits=c(-sqrt(nob)*1.4, sqrt(nob)*1.4))+
    scale_y_continuous(expand=c(0,0), limits=c(-sqrt(nob)*1.4, sqrt(nob)*1.4))+
    theme(legend.position="none",
          panel.background = element_rect(fill=bac),
          panel.grid=element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),
          axis.text=element_blank())}
shinyApp(
  ui = fluidPage(
    titlePanel("Phyllotaxis by Shiny"),
    fluidRow(
      column(3,
             wellPanel(
               selectInput("col", label = "Colour of points:", choices = colors(), selected = "black"),
               selectInput("bac", label = "Background colour:", choices = colors(), selected = "white"),
               selectInput("sha", label = "Shape of points:",
                           choices = list("Empty squares" = 0, "Empty circles" = 1, "Empty triangles"=2,
                                          "Crosses" = 3, "Blades"=4, "Empty diamonds"=5,
                                          "Inverted empty triangles"=6, "Bladed squares"=7,
                                          "Asterisks"=8, "Crossed diamonds"=9, "Crossed circles"=10,
                                          "Stars"=11, "Cubes"=12, "Bladed circles"=13,
                                          "Filled squares" = 15, "Filled circles" = 16, "Filled triangles"=17,
                                          "Filled diamonds"=18), selected = 16),
               sliderInput("ang", label = "Angle (degrees):", min = 0, max = 360, value = 180*(3-sqrt(5)), step = .05),
               sliderInput("nob", label = "Number of points:", min = 1, max = 1500, value = 60, step = 1),
               sliderInput("siz", label = "Size of points:", min = 1, max = 60, value = 10, step = 1),
               sliderInput("alp", label = "Transparency:", min = 0, max = 1, value = .5, step = .01)
             )
      ),
      mainPanel(
        plotOutput("Phyllotaxis")
      )
    )
  ),
  server = function(input, output) {
    output$Phyllotaxis=renderPlot({
      CreatePlot(ang=input$ang, nob=input$nob, siz=input$siz, alp=input$alp, sha=as.numeric(input$sha), col=input$col, bac=input$bac)
    }, height = 650, width = 650 )}
)

Polar Circles

You cannot find peace by avoiding life (Virginia Woolf)

By combining polar coordinates, RColorBrewer palettes, ggplot2 and a simple trigonometric function to define the width of the tiles, it is easy to produce nice circular plots like these:

polar_flower4 polar_flower2 polar_flower1 polar_flower3

Do you want to try? Here you have the code:

library(ggplot2)
library(dplyr)
library(RColorBrewer)
n=500
m=50
# Tile widths follow |sin(x)| over seven half-periods
w=sapply(seq(from=-3.5*pi, to=3.5*pi, length.out=n), function(x) {abs(sin(x))})
# Horizontal position of each tile: advance by the mean of two consecutive widths
x=c(1)
for (i in 2:n) {x[i]=x[i-1]+1/2*(w[i-1]+w[i])}
expand.grid(x=x, y=1:m) %>%
  mutate(w=rep(w, m))-> df
opt=theme(legend.position="none",
          panel.background = element_rect(fill="white"),
          panel.grid=element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),
          axis.text=element_blank())
ggplot(df, aes(x=x,y=y))+geom_tile(aes(fill=x, width=w))+ 
  scale_fill_gradient(low=brewer.pal(9, "Greens")[1], high=brewer.pal(9, "Greens")[9])+
  coord_polar(start = runif(1, min = 0, max = 2*pi))+opt
ggplot(df, aes(x=x,y=y))+geom_tile(aes(fill=w, width=w))+ 
  scale_fill_gradient(low=brewer.pal(9, "Reds")[1], high=brewer.pal(9, "Reds")[9])+ 
  coord_polar(start = runif(1, min = 0, max = 2*pi))+opt
ggplot(df, aes(x=x,y=y))+geom_tile(aes(fill=y, width=w))+ 
  scale_fill_gradient(low=brewer.pal(9, "Purples")[1], high=brewer.pal(9, "Purples")[9])+ 
  coord_polar(start = runif(1, min = 0, max = 2*pi))+opt
ggplot(df, aes(x=x,y=y))+geom_tile(aes(fill=w*y, width=w))+ 
  scale_fill_gradient(low=brewer.pal(9, "Blues")[9], high=brewer.pal(9, "Blues")[1])+ 
  coord_polar(start = runif(1, min = 0, max = 2*pi))+opt

The Batman’s Ecosystem

If I weren’t crazy, I’d be insane! (Joker)

Today I present a dynamic D3.js plot in which I combine three things:

  • The Batman curve
  • A text-mining analysis to obtain the most common words from Batman’s Wikipedia page
  • A line plot made with the morris.js library of the rCharts package, where the point labels are the words obtained in the previous step

This is my particular homage to one of the most amazing superheroes ever, together with Daredevil:

The code:

require(ggplot2)
require(dplyr)
require(rCharts)
library(rvest)
library(tm)
# Piecewise functions tracing the Batman logo: fu() is the upper outline, fd() the lower
f1u <- function(x) {ifelse ((abs(x) >  3 & abs(x) <= 7), 3*sqrt(1-(x/7)^2), 0)}
f1d <- function(x) {ifelse ((abs(x) >= 4 & abs(x) <= 7), -3*sqrt(1-(x/7)^2), 0)}
f2u <- function(x) {ifelse ((abs(x) > 0.50 & abs(x) < 0.75),  3*abs(x)+0.75, 0)}
f2d <- function(x) {ifelse ((abs(x) > -4 & abs(x) < 4), abs(x/2)-(3*sqrt(33)-7)*x^2/112-3 + sqrt(1-(abs(abs(x)-2)-1)^2), 0)}
f3u <- function(x) {ifelse ((x > -0.5 & x < 0.5), 2.25, 0)}
f4u <- function(x) {ifelse ((abs(x) >  1 & abs(x) <= 3), 6 * sqrt(10)/7 + (1.5 - 0.5 * abs(x)) * sqrt(abs(abs(x)-1)/(abs(x)-1)) - 6 * sqrt(10) * sqrt(4-(abs(x)-1)^2)/14, 0)}
f5u <- function(x) {ifelse ((abs(x) >= 0.75 & abs(x) <= 1), 9-8*abs(x), 0)}
fu <- function (x) f1u(x)+f2u(x)+f3u(x)+f4u(x)+f5u(x)
fd <- function (x) f1d(x)+f2d(x)
batman <- function(r,x) {ifelse(r%%2==0, fu(x), fd(x))}
data.frame(x=seq(from=-7, to=7, by=0.125)) %>%
  mutate(y=batman(row_number(), x)) -> df
html("https://en.wikipedia.org/wiki/Batman") %>%
  html_nodes("#bodyContent")  %>%
  html_text() %>%
  VectorSource() %>%
  Corpus() %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%  
  tm_map(stripWhitespace) %>%
  tm_map(removeWords, c(stopwords(kind = "en"), "batman", "batmans")) %>%
  DocumentTermMatrix() %>%
  as.matrix() %>%
  colSums() %>%
  sort(decreasing=TRUE) %>%
  head(n=nrow(df)) %>%
  attr("names") -> df$word
m1=mPlot(x = "x",  y = "y",  data = df,  type = "Line")
m1$set(pointSize = 5,
       lineColors = c('black', 'black'),
       width = 900,
       height = 500,
       hoverCallback = "#! function(index, options, content)
                  { var row = options.data[index]
                  return '<b>' + row.word + '</b>'} !#",
       lineWidth = 2,
       grid=FALSE,
       axes=FALSE)
m1
m1$save('Batman.html', standalone = TRUE)

Visualising The Evolution Of Migration Flows With rCharts

Heaven we hope is just up the road (Atlas, Coldplay)

Following on from the analysis of migration flows, I have made the two visualizations below. These charts are called bump charts and are very suitable for representing rankings. This is what I have done:

  • Obtaining the top 20 countries of the world according to the % of migrants relative to their population
  • To do this, I divide the total number of migrants between 1960 and 2009 by the mean population over the same period
  • I do the same to obtain the top 20 countries according to the % of immigrants
  • In both cases, I only consider countries with a population greater than 2 million
  • For these countries, I calculate the % of migrants in each decade (60s, 70s, 80s, 90s and 00s), dividing the total number of migrants by the mean population of each decade
  • I do the same in the case of immigrants
  • Instead of representing the % of migrants and immigrants directly, I represent the ranking of countries according to these indicators by decade (a toy version of this ranking step is sketched right after this list)
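This sketch shows the idea with made-up numbers; note that the real code below ranks in ascending order and flips the labels in the axis formatter:

library(dplyr)
toy = data.frame(country    = rep(c("A", "B", "C"), 2),
                 decade     = rep(c(1960, 1970), each=3),
                 p_migrants = c(.05, .12, .07, .04, .09, .10))
toy %>% group_by(decade) %>% mutate(rank = dense_rank(desc(p_migrants)))  # rank 1 = highest share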

This is the bump chart for migrants:

And this is the one for immigrants:

Some comments:

  • There is a permanent exodus in Puerto Rico: in every decade (except the 70s) it holds the top position among countries with the most migrants relative to their population
  • Ireland is also living through a diaspora, although it lost some positions in the 00s
  • Albania, Georgia and Bosnia and Herzegovina are gaining positions. Is Eastern Europe gradually becoming uncomfortable?
  • Jamaica is also moving up in this sad competition
  • On the other hand, Hong Kong and Israel are persistent leaders as receivers
  • Saudi Arabia has shown impressive growth as a receiver of immigrants since the 70s
  • The United States does not appear in the immigrants ranking
  • Singapore is gaining positions: in the 00s it was the third-largest receiver
  • Also in the 00s, Switzerland was the first European country in the ranking, holding fifth position

I like using rCharts, as well as Enigma datasets, as I have done previously. This is the code:

library(data.table)
library(rCharts)
library(dplyr)
setwd("YOUR WORKING DIRECTORY HERE")
populflows = read.csv(file="enigma-org.worldbank.migration-remittances.migrants.migration-flow-c57405e33412118c8757b1052e8a1490.csv", stringsAsFactors=FALSE)
population = fread("enigma-org.worldbank.hnp.data-eaa31d1a34fadb52da9d809cc3bec954.csv")
population %>% 
  filter(indicator_name=="Population, total") %>% 
  as.data.frame %>% 
  mutate(decade=(year-year%%10)) %>% 
  group_by(country_name, country_code, decade) %>% 
  summarise(population=mean(value)) %>% 
  plyr::rename(., c("country_name"="country")) -> population2
populflows %>% filter(!is.na(total_migrants)) %>% 
  group_by(migration_year, destination_country) %>% 
  summarise(inmigrants = sum(total_migrants))  %>% 
  plyr::rename(., c("destination_country"="country", "migration_year"="decade"))   -> inmigrants
populflows %>% filter(!is.na(total_migrants)) %>% 
  group_by(migration_year, country_of_origin) %>% 
  summarise(migrants = sum(total_migrants)) %>%  
  plyr::rename(., c("country_of_origin"="country", "migration_year"="decade"))   -> migrants
# Join of data sets
migrants %>% 
  merge(inmigrants, by = c("country", "decade")) %>%
  merge(population2, by = c("country", "decade")) %>%
  mutate(p_migrants=migrants/population, p_inmigrants=inmigrants/population) -> populflows2
# Global Indicators
populflows2 %>% 
  group_by(country) %>% 
  summarise(migrants=sum(migrants), inmigrants=sum(inmigrants), population=mean(population)) %>% 
  mutate(p_migrants=migrants/population, p_inmigrants=inmigrants/population)  %>% 
  filter(population > 2000000)  %>%
  mutate(rank_migrants = dense_rank(desc(p_migrants)), rank_inmigrants = dense_rank(desc(p_inmigrants))) -> global
# Migrants dataset
global %>% 
  filter(rank_migrants<=20) %>% 
  select(country) %>% 
  merge(populflows2, by = "country") %>% 
  arrange(decade, p_migrants) %>%
  mutate(decade2=as.numeric(as.POSIXct(paste0(as.character(decade), "-01-01"), origin="1900-01-01"))) %>%
  plyr::ddply("decade", transform, rank = dense_rank(p_migrants)) -> migrants_rank
# Immigrants dataset
global %>% 
  filter(rank_inmigrants<=20) %>% 
  select(country) %>% 
  merge(populflows2, by = "country") %>% 
  arrange(decade, p_inmigrants) %>%
  mutate(decade2=as.numeric(as.POSIXct(paste0(as.character(decade), "-01-01"), origin="1900-01-01"))) %>%
  plyr::ddply("decade", transform, rank = dense_rank(p_inmigrants)) -> inmigrants_rank
# Function for plotting
plotBumpChart <- function(df){
  bump_chart = Rickshaw$new()
  mycolors = ggthemes::tableau_color_pal("tableau20")(20)
  bump_chart$layer(rank ~ decade2, group = 'country_code', data = df, type = 'line', interpolation = 'none', colors = mycolors)
  bump_chart$set(slider = TRUE, highlight = TRUE, legend=TRUE)
  bump_chart$yAxis(tickFormat = "#!  function(y) { if (y == 0) { return '' } else { return String((21-y)) } } !#")
  bump_chart$hoverDetail(yFormatter = "#! function(y){return (21-y)} !#")
  return(bump_chart)
}
plotBumpChart(migrants_rank)
plotBumpChart(inmigrants_rank)

A Visualization Of The 100 Greatest Love Songs ft. D3.js

What would you do? If my heart was torn in two (More Than Words, Extreme)

Playing with the rCharts package, I had the idea of representing the list of the 100 best love songs as a connected set of points forming a heart. Songs appear when you hover over each dot.
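The outline itself is simple (a bare-bones sketch): the upper branch of the heart is two semicircles and the lower branch an arccos cusp; the heart function in the code below alternates rows between the two branches so the 100 songs trace the full shape.

x = seq(-2, 2, length.out=400)
upper = sqrt(1-(abs(x)-1)^2)   # two semicircles centred at -1 and 1
lower = acos(1-abs(x))-pi      # cusp reaching -pi at x=0
plot(x, upper, type="l", asp=1, ylim=c(-pi, 1.2)); lines(x, lower)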

You can reproduce it with this simple code:

library(dplyr)
library(rCharts)
library(rvest)
setwd("YOUR WORKING DIRECTORY HERE")
heart <- function(r,x) {ifelse(abs(x)<2, ifelse(r%%2==0, sqrt(1-(abs(x)-1)^2), acos(1-abs(x))-pi), 0)}
data.frame(x=seq(from=-3, to=3, length.out=100)) %>% 
  mutate(y=jitter(heart(row_number(), x), amount=.1)) -> df
love_songs <- read_html("http://www.cs.ubc.ca/~davet/music/list/Best13.html")
love_songs %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(header=TRUE, fill = TRUE) %>%
  cbind(df) -> df
m1=mPlot(x = "x",  y = "y",  data = df,  type = "Line")
m1$set(pointSize = 5, 
       lineColors = c('red', 'red'),
       width = 850,
       height = 600,
       lineWidth = 2,
       hoverCallback = "#! function(index, options, content){
       var row = options.data[index]
       return '<b>' + row.ARTIST + '</b>' + '<br/>' + row.TITLE} !#",
       grid=FALSE,
       axes=FALSE)
m1$save('Top_100_Greatest_Love_Songs.html', standalone = TRUE)

A Segmentation Of The World According To Migration Flows ft. Leaflet

Up in the sky you just feel fine, there is no running out of time and you never cross a line (Up In The Sky, 77 Bombay Street)

In this post I analyze two datasets from Enigma:

  • Migration flows: every 10 years since 1960, the World Bank estimates migration flows worldwide (267,960 rows)
  • World population: population values and percentages for each nation from 1960 onwards, compiled by the World Bank’s Health, Nutrition and Population project (4,168,185 rows)

Since the second dataset is very large, I load it into R with the fread function of the data.table package, which is extremely fast. To filter the datasets, I also use dplyr and the pipes of the magrittr package (my life changed when I discovered them).

To build an indicator that is comparable across countries, I divide migration flows (from and to each country) by the mean population of each decade, since migration flows are aggregated by decade from 1960 onwards. For example, during the first decade of the 21st century, Argentina received 1,537,850 immigrants, which represents 3.99% of the country’s mean population in that decade. In the same period, immigration to Burundi represented only 0.67% of its mean population.
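A quick sanity check of the Argentina figure (the mean population here is my rough assumption for that decade, not a value taken from the dataset):

inmigrants = 1537850
mean_pop   = 38500000
round(100*inmigrants/mean_pop, 2)  # ~3.99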

What happened in the whole world in that decade? Around 166 million people moved to other countries, which represents 2.58% of the mean population of the world. I use this figure to divide countries into four groups (a minimal code sketch of the rule follows the list):

  • Isolated: countries with both % of immigrants and % of migrants under 2.58%
  • Emitter: countries with % of immigrants under 2.58% and % of migrants over 2.58%
  • Receiver: countries with % of immigrants over 2.58% and % of migrants under 2.58%
  • Transit: countries with both % of immigrants and % of migrants over 2.58%
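This is a minimal sketch of that rule (the shares are invented, and I use dplyr::case_when instead of the base R subsetting of the full code below; the column names follow the dataset):

library(dplyr)
avg = 0.0258
toy = data.frame(country         = c("A", "B", "C", "D"),
                 p.of.migrants   = c(.01, .05, .01, .06),
                 p.of.inmigrants = c(.01, .01, .05, .07))
toy %>% mutate(Group = case_when(
  p.of.migrants > avg & p.of.inmigrants > avg ~ "Transit",
  p.of.migrants > avg                         ~ "Emitter",
  p.of.inmigrants > avg                       ~ "Receiver",
  TRUE                                        ~ "Isolated"))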

To create the map I use the leaflet package, as I did in my previous post. The shapefile of the world can be downloaded here. This is what the world looks like according to this segmentation:

Migration Flows

Some conclusions:

  • There are just sixteen receiver countries: United Arab Emirates, Argentina, Australia, Bhutan, Botswana, Costa Rica, Djibouti, Spain, Gabon, The Gambia, Libya, Qatar, Rwanda, Saudi Arabia, United States and Venezuela
  • China and India (the two most populous countries in the world) are isolated
  • Transit countries are concentrated in the northern hemisphere, and most of them are located in cold latitudes
  • There are six emitter countries with more than 30% of emigrants between 2000 and 2009: Guyana, Tonga, Tuvalu, Jamaica, Bosnia and Herzegovina, and Albania

This is the code you need to reproduce the map:

library(data.table)
library(dplyr) 
library(leaflet)
library(rgdal)
library(RColorBrewer)
setwd("YOU WORKING DIRECTORY HERE")
populflows = read.csv(file="enigma-org.worldbank.migration-remittances.migrants.migration-flow-c57405e33412118c8757b1052e8a1490.csv", stringsAsFactors=FALSE)
population = fread("enigma-org.worldbank.hnp.data-eaa31d1a34fadb52da9d809cc3bec954.csv")
# Population
population %>% 
  filter(indicator_name=="Population, total") %>% 
  as.data.frame %>% 
  mutate(decade=(year-year%%10)) %>% 
  group_by(country_name, country_code, decade) %>% 
  summarise(avg_pop=mean(value)) -> population2
# Inmigrants by country
populflows %>% filter(!is.na(total_migrants)) %>% 
  group_by(migration_year, destination_country) %>% 
  summarise(inmigrants = sum(total_migrants))  %>% 
  merge(population2, by.x = c("destination_country", "migration_year"), by.y = c("country_name", "decade"))  %>% 
  mutate(p_inmigrants=inmigrants/avg_pop) -> inmigrants
# Migrants by country
populflows %>% filter(!is.na(total_migrants)) %>% 
  group_by(migration_year, country_of_origin) %>% 
  summarise(migrants = sum(total_migrants)) %>%  
  merge(population2, by.x = c("country_of_origin", "migration_year"), by.y = c("country_name", "decade"))  %>%
  mutate(p_migrants=migrants/avg_pop) -> migrants
# Join of data sets
migrants %>% 
  merge(inmigrants, by.x = c("country_code", "migration_year"), by.y = c("country_code", "migration_year")) %>%
  filter(migration_year==2000) %>% 
  select(country_of_origin, country_code, avg_pop.x, migrants, p_migrants, inmigrants, p_inmigrants) %>% 
  plyr::rename(., c("country_of_origin"="Country", 
                    "country_code"="Country.code", 
                    "avg_pop.x"="Population.mean",
                    "migrants"="Total.migrants",
                    "p_migrants"="p.of.migrants",
                    "inmigrants"="Total.inmigrants",
                    "p_inmigrants"="p.of.inmigrants")) -> populflows2000
# Threshold to create groups
populflows2000 %>% 
  summarise(x=sum(Total.migrants), y=sum(Total.inmigrants), z=sum(Population.mean)) %>% 
  mutate(m=y/z) %>% 
  select(m)  %>% 
  as.numeric -> avg
# Segmentation
populflows2000$Group="Receiver"
populflows2000[populflows2000$p.of.migrants>avg & populflows2000$p.of.inmigrants>avg, "Group"]="Transit"
populflows2000[populflows2000$p.of.migrants<avg & populflows2000$p.of.inmigrants<avg, "Group"]="Isolated"
populflows2000[populflows2000$p.of.migrants>avg & populflows2000$p.of.inmigrants<avg, "Group"]="Emitter"
#Loading shapefile from http://data.okfn.org/data/datasets/geo-boundaries-world-110m 
countries=readOGR("json/countries.geojson", "OGRGeoJSON") 
# Join shapefile and enigma information 
joined=merge(countries, populflows2000, by.x="wb_a3", by.y="Country.code", all=FALSE, sort = FALSE) 
joined$Group=as.factor(joined$Group) 
# To define one color by segment 
factpal=colorFactor(brewer.pal(4, "Dark2"), joined$Group) 
leaflet(joined) %>%
  addPolygons(stroke = TRUE, color="white", weight=1, smoothFactor = 0.2, fillOpacity = .8, fillColor = ~factpal(Group)) %>%
  addTiles() %>%
  addLegend(pal = factpal, values=c("Emitter", "Isolated", "Receiver", "Transit"))

A Simple Interactive Map Of US Prisons With Leaflet

The love of one’s country is a splendid thing. But why should love stop at the border? (Pablo Casals, Spanish cellist)

Some time ago, I discovered Enigma, an amazing open platform that unifies billions of records from thousands of government sources to make the world of public data universally accessible and useful. This is the first experiment I have done using data from Enigma. This is what I did:

  1. Create a free account, search and download data. Save the csv file in your working directory. The file contains information about all prison facilities in the United States (private and state run) as recorded by the Department of Corrections of each state: facility types, names, addresses (or lat/long coordinates) and ownership details. In total, there is information about 1,248 prison facilities.
  2. Since most of the prisons in the file do not include geographical coordinates, I obtain latitude and longitude using the geocode function from the ggmap package (see the sketch after this list). This step takes some time. I also remove closed facilities. In the end, I obtain a dataset with complete information for 953 prison facilities.
  3. After cleaning and filling out the data, generating the map is very easy with the leaflet package for R. I create a column named popup_info, pasting together the name and address to be shown in the popup. Instead of the default OpenStreetMap basemap, I use a CartoDB one.
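As an illustration of the geocoding step (the address is just an example; note that recent versions of ggmap require registering an API key before calling geocode):

library(ggmap)
# register_google(key = "YOUR_KEY")  # needed in current ggmap versions
geocode("1600 Pennsylvania Ave NW, Washington, DC, USA", output = "latlon")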

In my opinion, the resulting map is very appealing for such a minimal effort:

This plot could be a good example of visual correlation, because it depends on this. Here you have the code:

library(dplyr)
library(ggmap)
library(leaflet)
setwd("YOUR WORKING DIRECTORY HERE")
prisons = read.csv(file="enigma-enigma.prisons.all-facilities-bd6a927c4024c16d8ba9e21d52292b0f.csv", stringsAsFactors=FALSE)
prisons %>% 
  mutate(address=paste(facility_address1, city, state, zip, "EEUU", sep=", ")) %>%
  select(address) %>% 
  lapply(function(x){geocode(x, output="latlon")})  %>% 
  as.data.frame %>% 
  cbind(prisons) -> prisons
prisons %>%  
  mutate(popup_info=paste(sep = "<br/>", paste0("<b>", facility_name, "</b>"), facility_address1, city, state, zip)) %>% 
  mutate(lon=ifelse(is.na(longitude), address.lon, longitude),
         lat=ifelse(is.na(latitude),  address.lat, latitude)) %>%
  filter(!is.na(lon) & !grepl("CLOSED", facility_name)) -> prisons
leaflet(prisons) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(lng = ~lon, 
                   lat = ~lat, 
                   radius = 3, 
                   color = "red",
                   stroke=FALSE,
                   fillOpacity = 0.5,
                   popup = ~popup_info)