Category Archives: The World We Live In

Visualizing the Spanish Contribution to The Metropolitan Museum of Art

Well I walk upon the river like it’s easier than land
(Love is All, The Tallest Man on Earth)


The Metropolitan Museum of Art provides here a dataset with information on more than 450.000 artworks in its collection. You can do anything you want with these data: there are no restrictions of use. Each record contains information about the author, title, type of work, dimensions, date, culture and  geography of a particular piece.

I can imagine a bunch of things to do with these data but since I am a big fan of highcharter,  I have done a treemap, which is an artistic (as well as efficient) way to visualize hierarchical data. A treemap is useful to visualize frequencies. They can handle levels, allowing to navigate to go into detail about any category. Here you can find a good example of treemap.

To read data I use fread function from data.table package. I also use this package to do some data wrangling operations on the data set. After them, I filter it looking for the word SPANISH in the columns Artist Nationality and Culture and looking for the word SPAIN in the column Country. For me, any piece created by an Spanish artist (like this one), coming from Spanish culture (like this one) or from Spain (like this one) is Spanish (this is my very own definition and may do not match with any academical one). Once it is done, it is easy to extract some interesting figures:

  • There are 5.294 Spanish pieces in The Met, which means a 1,16% of the collection
  • This percentage varies significantly between departments: it raises to 9,01% in The Cloisters and to 4,83% in The Robert Lehman Collection; on the other hand, it falls to 0.52% in The Libraries and to 0,24% in Photographs.
  • The Met is home to 1.895 highlights and 44 of them (2,32%) are Spanish; It means that Spanish art is twice as important as could be expected (remember that represents a 1,16% of the entire collection)

My treemap represents the distribution of Spanish artworks by department (column Department) and type of work (column Classification). There are two important things to know before doing a treemap with highcharter:

  • You have to use treemap function from treemap package to create a list with your data frame that will serve as input for hctreemap function
  • hctreemap fails if some category name is the same as any of its subcategories. To avoid this, make sure that all names are distinct.

This is the treemap:

Here you can see a full size version of it.

There can be seen several things at a glance: most of the pieces are drawings and prints and european sculpture and decorative arts (in concrete, prints and textiles), there is also big number of costumes, arms and armor is a very fragmented department … I think treemap is a good way to see what kind of works owns The Met.

Mi favorite spanish piece in The Met is the stunning Portrait of Juan de Pareja by Velázquez, which illustrates this post: how nice would be to see it next to El Primo in El Museo del Prado!

Feel free to use my code to do your own experiments:

library(data.table)
library(dplyr)
library(stringr)
library(highcharter)
library(treemap)

file="MetObjects.csv"
# Download data
if (!file.exists(file)) download.file(paste0("https://media.githubusercontent.com/media/metmuseum/openaccess/master/", file), 
                                      destfile=file,
                                      mode='wb')
# Read data
data=fread(file, sep=",", encoding="UTF-8")

# Modify column names to remove blanks
colnames(data)=gsub(" ", ".", colnames(data))

# Clean columns to prepare for searching
data[,`:=`(Artist.Nationality_aux=toupper(Artist.Nationality) %>% str_replace_all("\\[\\d+\\]", "") %>% 
             iconv(from='UTF-8', to='ASCII//TRANSLIT'),
           Culture_aux=toupper(Culture) %>% str_replace_all("\\[\\d+\\]", "") %>% 
             iconv(from='UTF-8', to='ASCII//TRANSLIT'),
           Country_aux=toupper(Country) %>% str_replace_all("\\[\\d+\\]", "") %>% 
             iconv(from='UTF-8', to='ASCII//TRANSLIT'))]

# Look for Spanish artworks
data[Artist.Nationality_aux %like% "SPANISH" | 
       Culture_aux %like% "SPANISH" | 
       Country_aux %like% "SPAIN"] -> data_spain

# Count artworks by Department and Classification
data_spain %>% 
  mutate(Classification=ifelse(Classification=='', "miscellaneous", Classification)) %>% 
  mutate(Department=tolower(Department),
         Classification1=str_match(Classification, "(\\w+)(-|,|\\|)")[,2],
         Classification=ifelse(!is.na(Classification1), 
                               tolower(Classification1), 
                               tolower(Classification))) %>% 
  group_by(Department, Classification) %>% 
  summarize(Objects=n()) %>% 
  ungroup %>% 
  mutate(Classification=ifelse(Department==Classification, paste0(Classification, "#"), 
                               Classification)) %>% 
  as.data.frame() -> dfspain

# Do treemap without drawing
tm_dfspain <- treemap(dfspain, index = c("Department", "Classification"),
                      draw=F,
                      vSize = "Objects", 
                      vColor = "Objects",
                      type = "index")

# Do highcharter treemap 
hctreemap(
  tm_dfspain,
  allowDrillToNode = TRUE,
  allowPointSelect = T,
  levelIsConstant = F,
  levels = list(
    list(
      level = 1,
      dataLabels = list (enabled = T, color = '#f7f5ed', style = list("fontSize" = "1em")),
      borderWidth = 1
    ),
    list(
      level = 2,
      dataLabels = list (enabled = F,  align = 'right', verticalAlign = 'top', 
                         style = list("textShadow" = F, "fontWeight" = 'light', "fontSize" = "1em")),
      borderWidth = 0.7
    ) 
  )) %>% 
  hc_title(text = "Spanish Artworks in The Met") %>% 
  hc_subtitle(text = "Distribution by Department") -> plot

plot

The Cycling Accident Map of Madrid City

Far away, this ship has taken me far away (Starlight, Muse)

Madrid City has an Open Data platform where can be found around 300 data sets about a number of topics. One of these sets is the one I used for this experiment. It contains information about cycling accidents  happened in the city from January to July 2017. I have done a map to locate where the accidents took place. This experiment shows how R makes very easy to create professional maps with Leaflet (in this case I use Carto basemaps).

To locate accidents the data set only contains the address where they happened so the first thing I did is to obtain their geographical coordinates using geocode function from ggmap package. There were 431 accidents during the first 7 months of 2017 (such a big number!) and I got coordinates of 407 so I can locate 94% of the accidents.

Obviously, the amount of accidents in some place depend on how many bikers circulate there as well as on its infrastructure. None of these things can be seen in the map: It only shows number of accidents.

The categorization of accidents is:

  • Double collision (Colisión doble): Traffic accident occurred between two moving vehicles.
  • Multiple collision (Colisión múltiple): Traffic accident occurred between more than two moving vehicles.
  • Fixed object collision (Choque con objeto fijo): Accident occurred between a moving vehicle with a driver and an immovable object that occupies the road or separated area of ​​the same, whether parked vehicle, tree, street lamp, etc.
  • Accident (Atropello): Accident occurred between a vehicle and a pedestrian that occupies the road or travels by sidewalks, refuges, walks or zones of the public road not destined to the circulation of vehicles.
  • Overturn (Vuelco): Accident suffered by a vehicle with more than two wheels which by some circumstance loses contact with the road and ends supported on one side or on its roof.
  • Motorcycle fall (Caída motocicleta): Accident suffered by a motorcycle, which at some moment loses balance, because of the driver or due to the conditions of the road.
  • Moped fall (Caída ciclomotor): Accident suffered by a moped, which at some moment loses balance, because of the driver or due to the conditions of the road.
  • Bicycle fall (Caída bicicleta): Accident suffered by a bicycle, which at some moment loses balance, because of the driver or due to the conditions of the road.

These categories are redundant (e.g. Double and Multiple collision), difficult to understand (e.g. Overturn) or both things at the same time (e.g. Motorcycle fall and Moped fall). This categorization also forgets human damages incurred by the accident.

Taking all these things in mind, this is the map:

Here is a full-screen version of the map.

My suggestions to the city council of Madrid are:

  1. Add geographical coordinates to data (I guess many of the analysis will need them)
  2. Rethink the categorization to make it clearer and more informative
  3. Add more cycling data sets to the platform (detail of bikeways, traffic …) to understand accidents better
  4. Attending just to the number of accidents , put the focus around Parque del Retiro, specially on its west surroundings, from Plaza de Cibeles to Plaza de Carlos V: more warning signals, more  (or better) bikeways …

I add the code below to update the map (If someone ask it to me, I can do it myself regularly):

library(dplyr)
library(stringr)
library(ggmap)
library(leaflet)
# First, getting the data
download.file(paste0("http://datos.madrid.es/egob/catalogo/", file), 
              destfile="300110-0-accidentes-bicicleta.csv")

data=read.csv("300110-0-accidentes-bicicleta.csv", sep=";", skip=1)

# Prepare data for geolocation
data %>% 
  mutate(direccion=paste(str_trim(Lugar), str_trim(Numero), "MADRID, SPAIN", sep=", ") %>% 
           str_replace("NA, ", "") %>% 
           str_replace(" - ", " CON ")) -> data

# Geolocation (takes some time ...)
coords=c()
for (i in 1:nrow(data)) 
{
  coords %>% rbind(geocode(data[i,"direccion"])) -> coords
  Sys.sleep(0.5)
}
  
# Save data, just in case
data %>% cbind(coords) %>% saveRDS(file="bicicletas.RDS")

data=readRDS(file="bicicletas.RDS")

# Remove non-successfull geolocations
data %>% 
  filter(!is.na(lon)) %>% 
  droplevels()-> data

# Remove non-successfull geolocations
data %>% mutate(Fecha=paste0(as.Date(data$Fecha, "%d/%m/%Y"), " ", TRAMO.HORARIO),
                popup=paste0("<b>Dónde:</b>",
                             direccion,
                             "<b>Cuándo:</b>",
                             Fecha,
                             "<b>Qué pasó:</b>",
                             Tipo.Accidente)) -> data

# Do the map
data %>% split(data$Tipo.Accidente) -> data.df

l <- leaflet() %>% addProviderTiles(providers$CartoDB.Positron)

names(data.df) %>%
  purrr::walk( function(df) {
    l <<- l %>%
      addCircleMarkers(data=data.df[[df]],
                 lng=~lon, lat=~lat,
                 popup=~popup,
                 color="red",
                 stroke=FALSE,
                 fillOpacity = 0.8,
                 group = df,
                 clusterOptions = markerClusterOptions(removeOutsideVisibleBounds = F))
  })

l %>%
  addLayersControl(
    overlayGroups = names(data.df),
    options = layersControlOptions(collapsed = FALSE)
  )

Who is Alan Turing?

This government is committed to introducing posthumous pardons for people with certain historical sexual offence convictions who would be innocent of any crime now (British Government Spokesperson, September 2016)

Last September, the British government announced its intention to pursue what has become known as the Alan Turing law, offering exoneration to the tens of thousands of gay men convicted of historic charges.  The law was finally unveiled on 20 October 2016.

This plot shows the daily views of the Alan Turing’s wikipedia page during the last 365 days:

There are three huge peaks in May 27th, July 30th, and October 29th that can be easily detected using AnomalyDetection package:


After substituting these anomalies by a simple linear imputation, it is clear that the time series has suffered a significant impact since the last days of September:

To estimate the amount of incremental views since September 28th (this is the date I have chosen as starting point) I use CausalImpact package:


Last plot shows the accumulated effect. After 141 days, there have been around 1 million of incremental views to the Alan Turing’s wikipedia page (more than 7.000 per day) and it does not seem ephemeral.

Alan Turing has won another battle, this time posthumous. And thanks to it, there is a lot of people that have discovered his amazing legacy: long life to Alan Turing.

This is the code I wrote to do the experiment:

library(httr)
library(jsonlite)
library(stringr)
library(xts)
library(highcharter)
library(AnomalyDetection)
library(imputeTS)
library(CausalImpact)
library(dplyr)

# Views last 365 days
(Sys.Date()-365) %>% str_replace_all("[[:punct:]]", "") %>% substr(1,8) -> date_ini
Sys.time()       %>% str_replace_all("[[:punct:]]", "") %>% substr(1,8) -> date_fin
url="https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Alan%20Turing/daily"
paste(url, date_ini, date_fin, sep="/") %>% 
  GET  %>% 
  content("text") %>% 
  fromJSON %>% 
  .[[1]] -> wikistats

# To prepare dataset for highcharter
wikistats %>% 
  mutate(day=str_sub(timestamp, start = 1, end = 8)) %>% 
  mutate(day=as.POSIXct(day, format="%Y%m%d", tz="UTC")) -> wikistats

# Highcharts viz
rownames(wikistats)=wikistats$day
wikistats %>% select(views) %>% as.xts  %>% hchart

# Anomaly detection
wikistats %>% select(day, views) -> tsdf
tsdf %>%  
  AnomalyDetectionTs(max_anoms=0.01, direction='both', plot=TRUE)->res
res$plot

# Imputation of anomalies
tsdf[tsdf$day %in% as.POSIXct(res$anoms$timestamp, format="%Y-%m-%d", tz="UTC"),"views"]<-NA 
ts(tsdf$views, frequency = 365) %>% 
  na.interpolation() %>% 
  xts(order.by=wikistats$day) -> tscleaned
tscleaned %>% hchart

# Causal Impact from September 28th
x=sum(index(tscleaned)<"2016-09-28 UTC")
impact <- CausalImpact(data = tscleaned %>% as.numeric, 
                       pre.period = c(1,x),
                       post.period = c(x+1,length(tscleaned)), 
                       model.args = list(niter = 5000, nseasons = 7),
                       alpha = 0.05)
plot(impact)

How to Find Equidistant Coordinates Between Two Locations on Earth

Here’s to the ones who dream
foolish, as they may seem
(The Fools Who Dream, ‘La La Land’ OST)

One of the key points of The Meeting Point Locator is to obtain an orthogonal great circle to the bearing defined by any two given locations on Earth. A great circle is the intersection of the sphere and a plane that passes through the center point of the sphere. In other words, a great circle is a false meridian. The orthogonal great circle to the direction defined by any two given locations is the one which passes by all equidistant points to both of them (at least this is what I call orthogonal great circle). This was my first approach to obtain it:

  • Get the midpoint between the initial locations, let’s call it p1
  • Calculate the direction (bearing angle) between the initial locations, let’s call it α
  • Obtain a very close point to p1 (only 1 meter away) with bearing α+90, let’s call it p2
  • Calculate the great circle which passes through p1 and p2

This is the code I used in this first approach:

library(dplyr)
library(ggmap)
library(geosphere)
library(leaflet)
library(ggplot2)
library(scales)
library(extrafont)
windowsFonts(Garamond=windowsFont("Garamond"))

#Starting places
place1="Madrid, Spain"
place2="Toledo, Spain"

# Call to Google Maps API to obtain coordinates of Starting places
p1=geocode(place1, output = "latlon")
p2=geocode(place2, output = "latlon")

#Midpoint of p1 and p2
mid=midPoint(p1, p2)

#Direction between p1 and p2
bea=bearingRhumb(p1, p2)

# Great circle between midpoint and 1-meter separated point with bearing bea+90
points=greatCircle(destPoint(p=mid, b=bea+90, d=1), mid, n=100)

# Arrange the points dependning on the distance to the input locations
data.frame(dist2p1=apply(points, 1, function (x) distGeo(p1, x)),
           dist2p2=apply(points, 1, function (x) distGeo(p2, x))) %>% 
  cbind(points) -> points

opts=theme(
  panel.background = element_rect(fill="gray90"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="white", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text = element_text(colour="gray25", size=6, family = "Garamond"),
  axis.title = element_text(size=10, colour="gray10", family = "Garamond"),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 14, colour="gray10", family = "Garamond"),
  plot.subtitle = element_text(size = 10, colour="gray20", family = "Garamond"))

ggplot(points, aes(x=dist2p1, y=dist2p2), guide=FALSE)+
  geom_abline(intercept = 0, slope = 1, colour = "red", alpha=.25)+
  geom_point(colour="blue", fill="blue", shape=21, alpha=.8, size=1)+
  scale_x_continuous(label=scientific_format())+
  scale_y_continuous(label=scientific_format())+
  labs(title=paste(place1,"and" ,place2, sep=" "),
       subtitle="Equidistant points (2nd approach)",
       x=paste("Distance to" ,place1, "(Km)", sep=" "),
       y=paste("Distance to" ,place2, "(Km)", sep=" "))+opts

#Map
points %>% 
  leaflet() %>% 
  addTiles(urlTemplate = "https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png") %>% 
  addCircleMarkers(
    lng=points$lon, lat=points$lat,
    radius = 6,
    color = "blue",
    stroke = FALSE, fillOpacity = 0.5) %>% 
  addCircleMarkers(
    lng=c(p1$lon, p2$lon), lat=c(p1$lat, p2$lat),
    radius = 6,
    color = "red",
    stroke = FALSE, fillOpacity = 0.5)

I was pretty sure that all points of this last great circle must be equidistant to the initial locations but I was wrong. When the starting points are enough close, everything goes well. This is an example with Madrid and Toledo (separated only by 67 kilometers) as starting points. The following plot shows the distance to Madrid and Toledo of 100 points on the great circle obtained as I described before:


This map shows also these 100 points (in blue) as well as the starting ones (in red):

Quite convincent. But this is what happens when I choose Tokyo and New York (10.873 kms. away) as the starting points:


And the map:

To be honest, I do not know why this happens but, based on the success obtained using close starting points, the final solution was simple: bring the starting points closer preserving the original midpoint. This was my second (and definitive) try:


And the map:

Mission accomplished. The final code:

library(dplyr)
library(ggmap)
library(geosphere)
library(leaflet)
library(ggplot2)
library(scales)
library(extrafont)
windowsFonts(Garamond=windowsFont("Garamond"))

# Starting places
place1="Tokyo, Japan"
place2="New York, USA"

# Call to Google Maps API to obtain coordinates of Starting places
p1=geocode(place1, output = "latlon")
p2=geocode(place2, output = "latlon")

# Midpoint of p1 and p2
mid=midPoint(p1, p2)
# Distance between p1 and p2
dist=distGeo(p1, p2)
# A simple piece of code to bring the starting points closer preserving the original midpoint
 x=p1
 y=p2
 while(dist>1000000)
 {
   x=midPoint(mid, x)
   y=midPoint(mid, y)
   dist=distGeo(x, y)
}
# Direction between resulting (close) points
bea=bearingRhumb(x, y)
# Great circle between midpoint and 1-meter separated point with bearing bea+90
points=greatCircle(destPoint(p=mid, b=bea+90, d=1), mid, n=100)

# Arrange the points dependning on the distance to the input locations
data.frame(dist2p1=apply(points, 1, function (x) distGeo(p1, x)),
           dist2p2=apply(points, 1, function (x) distGeo(p2, x))) %>% 
  cbind(points) -> points

opts=theme(
  panel.background = element_rect(fill="gray90"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="white", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text = element_text(colour="gray25", size=6, family = "Garamond"),
  axis.title = element_text(size=10, colour="gray10", family = "Garamond"),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 14, colour="gray10", family = "Garamond"),
  plot.subtitle = element_text(size = 10, colour="gray20", family = "Garamond"))

ggplot(points, aes(x=dist2p1, y=dist2p2), guide=FALSE)+
  geom_abline(intercept = 0, slope = 1, colour = "red", alpha=.25)+
  geom_point(colour="blue", fill="blue", shape=21, alpha=.8, size=1)+
  scale_x_continuous(label=scientific_format())+
  scale_y_continuous(label=scientific_format())+
  labs(title=paste(place1,"and" ,place2, sep=" "),
       subtitle="Equidistant points (2nd approach)",
       x=paste("Distance to" ,place1, "(Km)", sep=" "),
       y=paste("Distance to" ,place2, "(Km)", sep=" "))+opts

points %>% 
  leaflet() %>% 
  addTiles(urlTemplate = "https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png") %>% 
  addCircleMarkers(
    lng=points$lon, lat=points$lat,
    radius = 6,
    color = "blue",
    stroke = FALSE, fillOpacity = 0.5) %>% 
  addCircleMarkers(
    lng=c(p1$lon, p2$lon), lat=c(p1$lat, p2$lat),
    radius = 6,
    color = "red",
    stroke = FALSE, fillOpacity = 0.5)

The Meeting Point Locator

Hi Hillary, It’s Donald, would you like to have a beer with me in La Cabra Brewing, in Berwyn, Pensilvania? (Hypothetical utilization of The Meeting Point Locator)

Finding a place to have a drink with someone may become a difficult task. It is quite common that one of them does not want to move to the other’s territory. I am sure you have faced to this situation many times. With The Meeting Point Locator this will be no longer an issue, since it will give you a list of equidistant bars and coffees to any two given locations. Let’s see an example.

I do not know if Hillary Clinton and Donald Trump have met each other after the recent elections in United States, but the will probably do. Let’s suppose Hillary doesn’t want to go to The White House and that Donald prefers another place instead Hillary’s home. No problem at all. According to this, Hillary lives in Chappaqua, New York and Donald will live in The White House, Washington (although he supposedly won’t do full time as he announced recently). These two locations are the only input that The Meeting Point Locator needs to purpose equidistant places where having a drink. This is how it works:

  • Generates a number of coordinates on the great circle which passes through the midpoint of the original locations and is orthogonal to the rhumb defined by them; the number of points depends on the distance between the original locations.
  • Arranges these coordinates according to the distance to the original locations, from the nearest to the most distant.
  • Depending also on the distance of the original locations, defines a radius to search around each point generated on the great circle (once calculated, this radius is constant for all searches).
  • Starting from the nearest point, looks for a number of places (20 by default) to have a drink using the radius calculated previously. To do this, it calls to the Google Places API. Once the number of locations is reached, the proccess stops.

This map shows the places purposed for Hillary and Donald (blue points) as well as the original locations (red ones). You can make zoom in for details:

These are the 20 closest places to both of them:

listHillaryTrumpDT

Some other examples of the utility of The Meeting Point Locator:

  • Pau Gasol (who lives in San Antonio, Texas) and Marc Gasol (in Memphis, Tennessee) can meet in The Draft Sports Bar, in Leesville (Louisiana) to have a beer while watching a NBA match. It is 537 kilometers far from both of them.
  • Bob Dylan (who lives in Malibu, California) and The Swedish Academy (placed in Stockholm, Sweden) can smooth things over drinking a caipirinha in Bar São João, in Tremedal (Brasil)only 9.810 kilometers far from both of them.
  • Spiderman (placed in New York City) and Doraemon (in Tokio, Japan) can meet in Andreyevskaya, in Stroitel (Russia) to have a have a hot drink. Since they are superheroes, they will cover the 9.810 kilometers of separation in no time at all.

I faced with two challenges to do this experiment: how to generate the orthogonal great circle from two given locations and how to define radius and number of points over this circle to do searchings. I will try to explain in depth both things in the future in another post.

You will find the code below. To make it work, do not forget to get your own key for Google Places API Web Service here. I hope this tool will be helpful for someone; if yes, do not hesitate to tell it to me.

library(httr)
library(jsonlite)
library(dplyr)
library(ggmap)
library(geosphere)
library(DT)
library(leaflet)

# Write both addresses here (input)
place1="Chappaqua, New York, United States of America"
place2="The White House, Washington DC, United States of America"

# Call to Google Maps API to obtain coordinates of previous addresses
p1=geocode(place1, output = "latlon")
p2=geocode(place2, output = "latlon")

# To do searchings I need a radius
radius=ifelse(distGeo(p1, p2)>1000000, 10000,
              ifelse(distGeo(p1, p2)>100000, 2500, 1000))

# And a number of points
npoints=ifelse(distGeo(p1, p2)>1000000, 2002,
               ifelse(distGeo(p1, p2)>100000, 7991, 19744))

# Place here the Google Places API Key
key="PLACE_YOUR_OWN_KEY_HERE"

# Build the url to look for bars and cafes with the previous key
url1="https://maps.googleapis.com/maps/api/place/nearbysearch/json?location=lat,lon&radius="
url2="&types=cafe|bar&key="
url=paste0(url1,radius,url2,key)

# This is to obtain the great circle orthogonal to the rhumb defined by input locations
# and which passes over the midpoint. I will explain this step in the future
mid=midPoint(p1, p2)
dist=distGeo(p1, p2)
x=p1
y=p2
while(dist>1000000)
{
  x=midPoint(mid, x)
  y=midPoint(mid, y)
  dist=distGeo(x, y)
}

bea=bearingRhumb(x, y)
points=greatCircle(destPoint(p=mid, b=bea+90, d=1), mid, n=npoints)

# Arrange the points dependning on the distance to the input locations
data.frame(dist2p1=apply(points, 1, function (x) distGeo(p1, x)),
           dist2p2=apply(points, 1, function (x) distGeo(p2, x))) %>% 
  mutate(order=apply(., 1, function(x) {max(x)})) %>% 
  cbind(points) %>% 
  arrange(order) -> points

# Start searchings
nlocs=0 # locations counter (by default stops when 20 is reached)
niter=1 # iterations counter (if greater than number of points on the great circle, stops)
results=data.frame()
while(!(nlocs>=20 | niter>npoints)) {
  print(niter)
  url %>% 
    gsub("lat", points[niter, 'lat'], .) %>% 
    gsub("lon", points[niter, 'lon'], .) %>% 
    GET %>% 
    content("text") %>% 
    fromJSON -> retrieve
  
  df=data.frame(lat=retrieve$results$geometry$location$lat,
                lng=retrieve$results$geometry$location$lng,
                name=retrieve$results$name, 
                address=retrieve$results$vicinity)
  results %>% rbind(df)->results
  
  nlocs=nlocs+nrow(df)
  niter=niter+1 
}

# I prepare results to do a Data Table
data.frame(dist2p1=apply(results, 1, function (x) round(distGeo(p1, c(as.numeric(x[2]), as.numeric(x[1])))/1000, digits=1)),
           dist2p2=apply(results, 1, function (x) round(distGeo(p2, c(as.numeric(x[2]), as.numeric(x[1])))/1000, digits=1))) %>% 
  mutate(mx=apply(., 1, function(x) {max(x)})) %>% 
  cbind(results) %>% 
  arrange(mx) %>% 
  mutate(rank=row_number()) %>% 
  select(-mx)-> resultsDT

# This is the Data table
datatable(resultsDT, 
          class = 'cell-border stripe',
          rownames = FALSE,
          options = list(pageLength = 5),
          colnames = c('Distance to A (Km)', 
                       'Distance to B (Km)', 
                       'Latitude', 
                       'Longitude',
                       'Name', 
                       'Address', 
                       'Rank'))

# Map with the locations using leaflet
resultsDT %>% 
  leaflet() %>% 
  addTiles() %>% 
  addCircleMarkers(
    lng=resultsDT$lng, lat=resultsDT$lat,
    radius = 8,
    color = "blue",
    stroke = FALSE, fillOpacity = 0.5,
    popup=paste(paste0("<b>", resultsDT$name, "</b>"), resultsDT$address, sep="
")
  ) %>% 
  addCircleMarkers(
    lng=p1$lon, lat=p1$lat,
    radius = 10,
    color = "red",
    stroke = FALSE, fillOpacity = 0.5,
    popup=paste("<b>Place 1</b>", place1, sep="
")
  )%>% 
  addCircleMarkers(
    lng=p2$lon, lat=p2$lat,
    radius = 10,
    color = "red",
    stroke = FALSE, fillOpacity = 0.5,
    popup=paste("<b>Place 2</b>", place2, sep="
")
  )

Bayesian Blood

The fourth, the fifth, the minor fall and the major lift (Hallelujah, Leonard Cohen)

Next problem is extracted from MacKay’s Information Theory, Inference and Learning Algorithms:

Two people have left traces of their own blood at the scene of a crime. A suspect, Oliver, is tested and found to have type ‘O’ blood. The blood groups of the two traces are found to be of type ‘O’ (a common type in the local population, having frequency 60%) and of type ‘AB’ (a rare type, with frequency 1%). Do these data give evidence in favor of the proposition that Oliver was one of the people who left blood at the scene?

To answer the question, let’s first remember the probability form of Bayes theorem:

p(H|D)=\dfrac{p(H)p(D|H)}{p(D)}

where:

  • p(H) is the probability of the hypothesis H before we see the data, called the prior
  • p(H|D) is the probablity of the hyothesis after we see the data, called the posterior
  • p(D|H) is the probability of the data under the hypothesis, called the likelihood
  • p(D)is the probability of the data under any hypothesis, called the normalizing constant

If we have two hypothesis, A and B, we can write the ratio of posterior probabilities like this:

\dfrac{p(A|D)}{p(B|D)}=\dfrac{p(A)p(D|A)}{p(B)p(D|B)}

If p(A)=1-p(B) (what means that A and B are mutually exclusive and collective exhaustive), then we can rewrite the ratio of the priors and the ratio of the posteriors as odds. Writing o(A) for odds in favor of A, we get the odds form of Bayes theorem:

o(A|D)=o(A)\dfrac{p(D|A)}{p(D|B)}

Dividing through by o(A) we have:

\dfrac{o(A|D)}{o(A)}=\dfrac{p(D|A)}{p(D|B)}

The term on the left is the ratio of the posteriors and prior odds. The term on the right is the likelihood ratio, also called the Bayes factor. If it is greater than 1, that means that the data were more likely under A than under B. And since the odds ratio is also greater than 1, that means that the odds are greater, in light of the data, than they were before. If the Bayes factor is less than 1, that means the data were less likely under A than under B, so th odds in favor of A go down.

Let’s go back to our initial problem. If Oliver left his blood at the crime scene, the probability of the data is just the probability that a random member of the population has type ‘AB’ blood, which is 1%. If Oliver did not leave blood at the scene, what is the the chance of finding two people, one with type ‘O’ and one with type ‘AB’? There are two ways it might happen: the first person we choose might have type ‘O’ and the second ‘AB’, or the other way around. So the probability in this case is 2(0.6)(0.01)=1.2%. Dividing probabilities of both scenarios we obtain a Bayes factor of 0.83, and we conclude that the blood data is evidence against Oliver’s guilt.

Once I read this example, I decided to replicate it using real data of blood type distribution by country from here. After cleaning data, I have this nice data set to work with:

For each country, I get the most common blood type (the one which the suspect has) and the least common and replicate the previous calculations. For example, in the case of Spain, the most common type is ‘O+’ with 36% and the least one is ‘AB-‘ with 0.5%. The Bayes factor is 0.005/(2(0.36)(0.005))=1.39 so data support the hypothesis of guilt in this case. Next chart shows Bayes factor accross countries:

Just some comments:

  • Sometimes data consistent with a hypothesis are not necessarily in favor of the hypothesis
  • How different is the distribution of blood types between countries!
  • If you are a estonian ‘A+’ murderer, choose carefully your accomplice

This is the code of the experiment:

library(rvest)
library(dplyr)
library(stringr)
library(DT)
library(highcharter)

# Webscapring of the table with the distribution of blood types
url <- "http://www.rhesusnegative.net/themission/bloodtypefrequencies/"
blood <- url %>%
   read_html() %>%
   html_node(xpath='/html/body/center/table') %>%
   html_table(fill=TRUE)

# Some data cleansing
blood %>% slice(-c(66:68)) -> blood

blood[,-c(1:2)] %>% 
  sapply(gsub, pattern=",", replacement=".") %>% 
  as.data.frame %>% 
  sapply(gsub, pattern=".79.2", replacement=".79") %>% 
  as.data.frame-> blood[,-c(1:2)]

blood %>% 
  sapply(gsub, pattern="%|,", replacement="") %>% 
  as.data.frame -> blood

blood[,-1] = apply(blood[,-1], 2, function(x) as.numeric(as.character(x)))


blood[,-c(1:2)] %>% mutate_all(funs( . / 100)) -> blood[,-c(1:2)]

# And finally, we have a nice data set
datatable(blood, 
          rownames = FALSE,
          options = list(
          searching = FALSE,
          pageLength = 10)) %>% 
  formatPercentage(3:10, 2)

# Calculate the Bayes factor
blood %>% 
  mutate(factor=apply(blood[,-c(1,2)], 1, function(x) {min(x)/(2*min(x)*max(x))})) %>% 
  arrange(factor)-> blood

# Data Visualization
highchart() %>% 
     hc_chart(type = "column") %>% 
     hc_title(text = "Bayesian Blood") %>%
     hc_subtitle(text = "An experiment about the Bayes Factor") %>%  
     hc_xAxis(categories = blood$Country, 
             labels = list(rotation=-90, style = list(fontSize = "12px")))  %>% 
     hc_yAxis(plotBands = list(list(from = 0, to = 1, color = "rgba(255,215,0, 0.8)"))) %>% 
     hc_add_series(data = blood$factor,
                   color = "rgba(255, 0, 0, 0.5)",
                   name = "Bayes Factor")%>% 
  hc_yAxis(min=0.5) %>% 
  hc_tooltip(pointFormat = "{point.y:.2f}") %>% 
  hc_legend(enabled = FALSE) %>% 
  hc_exporting(enabled = TRUE) %>%
  hc_chart(zoomType = "xy")

Visualizing the Gender of US Senators With R and Highmaps

I wake up every morning in a house that was built by slaves (Michelle Obama)

Some days ago I was invited by the people of Highcharts to write a post in their blog. What I have done is a simple but revealing map of women senators of the United States of America. Briefly, this is what I’ve done to generate it:

  • read from the US senate website a XML file with senators info
  • clean and obtain gender of senators from their first names
  • summarize results by state
  • join data with a US geojson dataset to create the highmap

You can find details and R code here.

It is easy creating a highcharts using highcharter, an amazing library as genderizeR, the one I use to obtain gender names. I like them a lot.

Climatic Change At A Glance

Mmm. Lost a planet, Master Obi-Wan has. How embarrassing (Yoda, Attack Of The Clones)

Some time ago I published this post in KDnuggets in which I analyze historical temperatures to show how we are gradually heading toward a warmer planet. Simple data science to obtain a simple (and increasingly accepted) conclusion: the global warming is real. Despite I was criticized I still believe what I said then: you don’t have to be a climatologist to empirically confirm global warming.

This experiment is another example of that. It is still simpler than that since it is only based on visual perception but I think is also quite conclusive. In this case, I represent U.S. temperature outliers from 1964 to 2013; a map per year. Dataset contains station ID, name, min/max temperature, as well as degree coordinates of the recorded weather. Original weather data collected from NOAA and anomalies analysis by Enigma. You can download data here.

Anomalies are divided into four categories: Strong Hot, Weak Hot, Weak Cold and Strong Cold. For each station, I represent difference between number of Cold and Hot anomalies (independently of the strength) so Blue bubbles represent stations where total number of Cold anomalies during the year is greater that total number of Hot ones and Red ones represent the opposite. Size of bubbles is also proportional to this indicator. As an example, following you can see the map of year 1975:

tonopah
It seems 1975 was hot in the right a cold on the left side. Concretely, in TONOPAH Station (Nevada) were registered 30 anomalies and most of them (26) where due to cold temperatures. This is why bubble is blue. This GIF shows the evolution of all these maps from 1964 to 2013:

anomalies

Maybe it is just my personal feeling but don’t you see how red bubbles are gradually winning to blue ones? Maybe I am a demagogue.

This code generates a dynamic map by year in html format:

library(data.table)
library(stringr)
library(leaflet)
library(RColorBrewer)
library(classInt)
library(dplyr)
library(htmlwidgets)
anomalies = fread("enigma-enigma.weather.anomalies.outliers-1964-2013-05ce047dbf3e67f83db9ae841545a333.csv")
anomalies %>%
  mutate(year=substr(date_str, 1, 4)) %>%
  group_by(year, longitude, latitude, id, station_name) %>%
  summarise(
    Strong_Hot=sum(str_count(type,"Strong Hot")),
    Weak_Hot=sum(str_count(type,"Weak Hot")),
    Weak_Cold=sum(str_count(type,"Weak Cold")),
    Strong_Cold=sum(str_count(type,"Strong Cold")),
    total=n()) %>%
  mutate(score=sign(-Strong_Hot-Weak_Hot+Weak_Cold+Strong_Cold)) %>%
  mutate(color=ifelse(score==1, "Blue",ifelse(score==0, "White", "Red"))) -> anomalies2
for (i in unique(anomalies2$year))
{
  anomalies2 %>%
    filter(year==i) %>%
    leaflet() %>%
    fitBounds(-124, 34, -62, 40) %>%
    addProviderTiles("Stamen.TonerLite") %>%
    addCircleMarkers(lng = ~longitude,
                     lat = ~latitude,
                     radius = ~ifelse(total < 20, 2, ifelse(total < 27, 4, 8)),
                     color= ~color,
                     stroke=FALSE,
                     fillOpacity = 0.5,
                     popup = ~paste(sep = "
", paste0("<b>", station_name, "</b>"),
                                    paste0("Strong Hot: ", Strong_Hot),
                                    paste0("Weak Hot: ", Weak_Hot),
                                    paste0("Weak Cold: ", Weak_Cold),
                                    paste0("Strong Cold: ", Strong_Cold))) -> m
    saveWidget(m, file=paste0("m", i, ".html"))
}

Visualising The Evolution Of Migration Flows With rCharts

Heaven we hope is just up the road (Atlas, Coldplay)

Following with the analysis of migration flows, I have done next two visualizations. These charts are called bump charts and are very suitable to represent rankings. This is what I have done:

  • Obtaining top 20 countries of the world according to % of migrants respect its population
  • To do this, I divide total number of migrants between 1960 and 2009 by the mean population in the same period.
  • I do the same to obtain top 20 countries of the world according to % of immigrants.
  • In both cases, I only consider countries with population greater than 2 million.
  • For these countries, I calculate % of migrants in each decade (60’s, 70’s, 80’s, 90’s and 00’s), dividing total number of migrants by mean population each decade
  • I do the same in the case of immigrants.
  • Instead of representing directly % of migrants and immigrants, I represent the ranking of countries according these indicators by decade

This is the bump chart of migrants:

And this is the one of immigrants:

Some comments:

  • There is a permanent exodus in Puerto Rico: all decades (except 70’s) is located in the top 1 of countries with most migrants respect its population
  • Ireland is also living a diaspora although in the 00’s decade has lost some positions
  • Albania, Georgia and Bosnia and Herzegovina are gaining positions. Is East Europe gradually becoming uncomfortable?
  • Jamaica is also moving up in this sad competition.
  • On the other hand, Hong Kong and Israel are persistently leaders as receivers
  • Saudi Arabia has presented an impressive growth receiving immigrants since 70’s
  • United States does not appear in the immigrants ranking
  • Singapore is gaining positions: in the 00’s decade is the third receiver country
  • Also in the 00s, Switzerland is the first European country in the ranking, holding the fifth position

I like using rCharts as well as using Enigma data sets, as I have done previously. This is the code:

library(data.table)
library(rCharts)
library(dplyr)
setwd("YOUR WORKING DIRECTORY HERE")
populflows = read.csv(file="enigma-org.worldbank.migration-remittances.migrants.migration-flow-c57405e33412118c8757b1052e8a1490.csv", stringsAsFactors=FALSE)
population = fread("enigma-org.worldbank.hnp.data-eaa31d1a34fadb52da9d809cc3bec954.csv")
population %>% 
  filter(indicator_name=="Population, total") %>% 
  as.data.frame %>% 
  mutate(decade=(year-year%%10)) %>% 
  group_by(country_name, country_code, decade) %>% 
  summarise(population=mean(value)) %>% 
  plyr::rename(., c("country_name"="country")) -> population2
populflows %>% filter(!is.na(total_migrants)) %>% 
  group_by(migration_year, destination_country) %>% 
  summarise(inmigrants = sum(total_migrants))  %>% 
  plyr::rename(., c("destination_country"="country", "migration_year"="decade"))   -> inmigrants
populflows %>% filter(!is.na(total_migrants)) %>% 
  group_by(migration_year, country_of_origin) %>% 
  summarise(migrants = sum(total_migrants)) %>%  
  plyr::rename(., c("country_of_origin"="country", "migration_year"="decade"))   -> migrants
# Join of data sets
migrants %>% 
  merge(inmigrants, by = c("country", "decade")) %>%
  merge(population2, by = c("country", "decade")) %>%
  mutate(p_migrants=migrants/population, p_inmigrants=inmigrants/population) -> populflows2
# Global Indicators
populflows2 %>% 
  group_by(country) %>% 
  summarise(migrants=sum(migrants), inmigrants=sum(inmigrants), population=mean(population)) %>% 
  mutate(p_migrants=migrants/population, p_inmigrants=inmigrants/population)  %>% 
  filter(population > 2000000)  %>%
  mutate(rank_migrants = dense_rank(desc(p_migrants)), rank_inmigrants = dense_rank(desc(p_inmigrants))) -> global
# Migrants dataset
global %>% 
  filter(rank_migrants<=20) %>% 
  select(country) %>% 
  merge(populflows2, by = "country") %>% 
  arrange(decade, p_migrants) %>%
  mutate(decade2=as.numeric(as.POSIXct(paste0(as.character(decade), "-01-01"), origin="1900-01-01"))) %>%
  plyr::ddply("decade", transform, rank = dense_rank(p_migrants)) -> migrants_rank
# Migrants dataset
global %>% 
  filter(rank_inmigrants<=20) %>% 
  select(country) %>% 
  merge(populflows2, by = "country") %>% 
  arrange(decade, p_inmigrants) %>%
  mutate(decade2=as.numeric(as.POSIXct(paste0(as.character(decade), "-01-01"), origin="1900-01-01"))) %>%
  plyr::ddply("decade", transform, rank = dense_rank(p_inmigrants)) -> inmigrants_rank
# Function for plotting
plotBumpChart <- function(df){
  bump_chart = Rickshaw$new()
  mycolors = ggthemes::tableau_color_pal("tableau20")(20)
  bump_chart$layer(rank ~ decade2, group = 'country_code', data = df, type = 'line', interpolation = 'none', colors = mycolors)
  bump_chart$set(slider = TRUE, highlight = TRUE, legend=TRUE)
  bump_chart$yAxis(tickFormat = "#!  function(y) { if (y == 0) { return '' } else { return String((21-y)) } } !#")
  bump_chart$hoverDetail(yFormatter = "#! function(y){return (21-y)} !#")
  return(bump_chart)
}
plotBumpChart(migrants_rank)
plotBumpChart(inmigrants_rank)

The World We Live In #5: Calories And Kilograms

I enjoy doing new tunes; it gives me a little bit to perk up, to pay a little bit more attention (Earl Scruggs, American musician)

I recently finished reading The Signal and the Noise, a book by Nate Silver, creator of the also famous FiveThirtyEight blog. The book is a very good reading for all data science professionals, and is a must in particular for all those who work trying to predict the future. The book praises the bayesian way of thinking as the best way to face and modify predictions and criticizes rigid ways of thinking with many examples of disastrous predictions. I enjoyed a lot the chapter dedicated to chess and how Deep Blue finally took over Kasparov. In a nutshell: I strongly recommend it.
One of the plots of Silver’s book present a case of false negative showing the relationship between obesity and calorie consumption across the world countries. The plot shows that there is no evidence of a connection between both variables. Since it seemed very strange to me, I decided to reproduce the plot by myself.

I compared these two variables:

  • Dietary Energy Consumption (kcal/person/day) estimated by the FAO Food Balance Sheets.
  • Prevalence of Obesity as percentage of defined population with a body mass index (BMI) of 30 kg/m2 or higher estimated by the World Health Organization

And this is the resulting plot:

Calories And KilogramsAs you can see there is a strong correlation between two variables. Why the experiment of Nate Silver shows the opposite? Obviously we did not plot the same data (although, in principle, both of us went to the same source). Anyway: to be honest, I prefer my plot because shows what all of we know: the more calories you eat, the more weight you will see in your bathroom scale. Some final thoughts seeing the plot:

  • I would like to be Japanese: they don’t gain weight!
  • Why US people are fatter than Austrian?
  • What happens in Samoa?

Here you have the code to do the plot:

library(xlsx)
library(dplyr)
library(ggplot2)
library(scales)
setwd("YOUR WORKING DIRECTORY HERE")
url_calories = "http://www.fao.org/fileadmin/templates/ess/documents/food_security_statistics/FoodConsumptionNutrients_en.xls"
download.file(url_calories, method="internal", destfile = "FoodConsumptionNutrients_en.xls", mode = "ab")
calories = read.xlsx(file="FoodConsumptionNutrients_en.xls", startRow = 4, colIndex = c(2,6), colClasses = c("character", "numeric"), sheetName="Dietary Energy Cons. Countries", stringsAsFactors=FALSE) 
colnames(calories)=c("Country", "Kcal")
url_population = "http://esa.un.org/unpd/wpp/DVD/Files/1_Excel%20(Standard)/EXCEL_FILES/1_Population/WPP2015_POP_F01_1_TOTAL_POPULATION_BOTH_SEXES.XLS"
download.file(url_population, method="internal", destfile = "Population.xls", mode = "ab")
population = read.xlsx(file="Population.xls", startRow = 17, colIndex = c(3,71), colClasses = c("character", "numeric"), sheetName="ESTIMATES", stringsAsFactors=FALSE) 
colnames(population)=c("Country", "Population")
# http://apps.who.int/gho/data/node.main.A900A?lang=en
url_obesity = "http://apps.who.int/gho/athena/data/xmart.csv?target=GHO/NCD_BMI_30A&profile=crosstable&filter=AGEGROUP:*;COUNTRY:*;SEX:*&x-sideaxis=COUNTRY&x-topaxis=GHO;YEAR;AGEGROUP;SEX&x-collapse=true"
obesity = read.csv(file=url_obesity, stringsAsFactors=FALSE)
obesity %>% select(matches("Country|2014.*Both")) -> obesity
colnames(obesity)=c("Country", "Obesity")
obesity %>% filter(Obesity!="No data") -> obesity
obesity %>% mutate(Obesity=as.numeric(substr(Obesity, 1, regexpr(pattern = "[[]", obesity$Obesity)-1))) -> obesity
population %>% inner_join(calories,by = "Country") %>% inner_join(obesity,by = "Country") -> data
opts=theme(
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text = element_text(colour="gray25", size=15),
  axis.title = element_text(size=18, colour="gray10"),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 40, colour="gray10"))
ggplot(data, aes(x=Kcal, y=Obesity/100, size=log(Population), label=Country), guide=FALSE)+
  geom_point(colour="white", fill="sandybrown", shape=21, alpha=.55)+
  scale_size_continuous(range=c(2,40))+
  scale_x_continuous(limits=c(1500,4100))+
  scale_y_continuous(labels = percent)+
  labs(title="The World We Live In #5: Calories And Kilograms",
       x="Dietary Energy Consumption (kcal/person/day)",
       y="% population with body mass index >= 30 kg/m2")+
  geom_text(data=subset(data, Obesity>35|Kcal>3700), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(data=subset(data, Kcal<2000), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(data=subset(data, Obesity<10 & Kcal>2600), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(aes(3100, .01), colour="gray25", hjust=0, label="Source: United Nations (size of bubble depending on population)", size=4.5)+opts