How Much Money Should Machines Earn?

Every inch of sky’s got a star
Every inch of skin’s got a scar

I think that a very good way to start with R is doing an interactive visualization of some open data because you will train many important skills of a data scientist: loading, cleaning, transforming and combinig data and performing a suitable visualization. Doing it interactive will give you an idea of the power of R as well, because you will also realise that you are able to handle indirectly other programing languages such as JavaScript.

That’s precisely what I’ve done today. I combined two interesting datasets:

• The probability of computerisation of 702 detailed occupations, obtained by Carl Benedikt Frey and Michael A. Osborne from the University of Oxford, using a Gaussian process classifier and published in this paper in 2013.
• Statistics of jobs from (employments, median annual wages and typical education needed for entry) from the US Bureau of Labor, available here.

Apart from using `dplyr` to manipulate data and `highcharter` to do the visualization, I used `tabulizer` package to extract the table of probabilities of computerisation from the `pdf`: it makes this task extremely easy.

This is the resulting plot:

If you want to examine it in depth, here you have a full size version.

These are some of my insights (its corresponding figures are obtained directly from the dataset):

• There is a moderate negative correlation between wages and probability of computerisation.
• Around 45% of US employments are threatened by machines (have a computerisation probability higher than 80%): half of them do not require formal education to entry.
• In fact, 78% of jobs which do not require formal education to entry are threatened by machines: 0% which require a master’s degree are.
• Teachers are absolutely irreplaceable (0% are threatened by machines) but they earn a 2.2% less then the average wage (unfortunately, I’m afraid this phenomenon occurs in many other countries as well).
• Don’t study for librarian or archivist: it seems a bad way to invest your time
• Mathematicians will survive to machines

The code of this experiment is available here.

The Pleasing Ratio Project

Music is a world within itself, with a language we all understand (Sir Duke, Stevie Wonder)

This serious man on the left is Gustav Theodor Fechner, a German philosopher, physicist and experimental psychologist who lived between 1801 and 1887. To be honest, I don’t know almost anything of his life or work exepct one thing: he did in the 1860s a thought-provoking experiment. It seems me interesting for two important reasons: he called into question something widely established and obtained experimental data by himself.

Fechner’s experiment was simple: he presented just ten rectangles to 82 students. Then he asked each of them to choose the most pleasing one and obtained revealing discoveries I will not explain here since would cause bias in my experiment. You can find more information about the original experiment here.

I have done a project inspired in Fechner’s one that I called The Pleasing Ratio Project. Once you enter in the App, you will see two rectangles. Both of them have the same area. They only vary in their length-to-width ratios. Then you will be asked to select the one that seems you most pleasing. You can do it as many times as you want (all responses are completely anonymous). Every game will confront a couple of ratios, which can vary from 1 to 3,5. In the Results section you will find the  percentage of winning games for each ratio. The one with the highest percentage will be named officially as The Most Pleasing Ratio of the World in the future by myself.

Although my experiment is absolutely inspired in Fechner’s one, there is a important difference: I can explore a bigger set of ratios doing an A/B test. This makes this one a bit richer.

The experiment has also some interesting technical features:

• the use of `shinydashboard` package to arrange the App
• the use of `shinyjs` package to add javaScript to refresh the page when use choose to play again
• to save votes in a text file
• to read it to visualize results

Will I obtain the same results as Fechner? This is a living project whose results will change over the time so you can check it regularly.

The code of the project is available in GitHub. Thanks a lot for your collaboration!

Fatal Journeys: Visualizing the Horror

In war, truth is the first casualty (Aeschylus)

I am not a pessimistic person. On the contrary, I always try to look at the bright side of life. I also believe that living conditions are now better than years ago as these plots show. But reducing the complexity of our world to just six graphs is riskily simplistic. Our world is quite far of being a fair place and one example is the immigration drama.

Last year there were 934 incidents around the world involving people looking for a better life where more than 5.300 people lost their life o gone missing, 60% of them in Mediterranean. Around 8 out of 100 were children.

The missing migrant project tracks deaths of migrants, including refugees and asylum-seekers, who have gone missing along mixed migration routes worldwide. You can find a huge amount of figures, plots and information about this scourge in their website. You can also download there a historical dataset with information of all these fatal journeys, including location, number of dead or missing people and information source from 2015 until today.

I this experiment I read the dataset and do some plots using `highcharter`; you can find a link to the R code at the end of the post.

This is the evolution of the amount of deaths or missing migrants in the process of migration towards an international destination from January 2015 to December 2017:

The Mediterranean is the zone with the most incidents. To see it more clearly, this plot compares Mediterranean with the rest of the world, grouping previous zones:

Is there any pattern in the time series of Mediterranean incidents? To see it, I have done a LOESS decomposition of the time series:

Good news: trend is decreasing for last 12 months. Regarding seasonal component, incidents increase in April and May. Why? I don’t know.

This is a map of the location of all incidents in 2017. Clicking on markers you will find information about each incident:

Every of us should try to make our world a better place. I don’t really know how to do it but I will try to make some experiments during this year to show that we have tons of work in front of us. Meanwhile, I hope this experiment is useful to give visibility to this humanitarian disaster. If someone wants to use the code, the complete project is available in GitHub.

Visualizing the Spanish Contribution to The Metropolitan Museum of Art

Well I walk upon the river like it’s easier than land
(Love is All, The Tallest Man on Earth)

The Metropolitan Museum of Art provides here a dataset with information on more than 450.000 artworks in its collection. You can do anything you want with these data: there are no restrictions of use. Each record contains information about the author, title, type of work, dimensions, date, culture and  geography of a particular piece.

I can imagine a bunch of things to do with these data but since I am a big fan of highcharter,  I have done a treemap, which is an artistic (as well as efficient) way to visualize hierarchical data. A treemap is useful to visualize frequencies. They can handle levels, allowing to navigate to go into detail about any category. Here you can find a good example of treemap.

To read data I use `fread` function from `data.table` package. I also use this package to do some data wrangling operations on the data set. After them, I filter it looking for the word `SPANISH` in the columns `Artist Nationality` and `Culture` and looking for the word `SPAIN` in the column `Country`. For me, any piece created by an Spanish artist (like this one), coming from Spanish culture (like this one) or from Spain (like this one) is Spanish (this is my very own definition and may do not match with any academical one). Once it is done, it is easy to extract some interesting figures:

• There are 5.294 Spanish pieces in The Met, which means a 1,16% of the collection
• This percentage varies significantly between departments: it raises to 9,01% in The Cloisters and to 4,83% in The Robert Lehman Collection; on the other hand, it falls to 0.52% in The Libraries and to 0,24% in Photographs.
• The Met is home to 1.895 highlights and 44 of them (2,32%) are Spanish; It means that Spanish art is twice as important as could be expected (remember that represents a 1,16% of the entire collection)

My treemap represents the distribution of Spanish artworks by department (column `Department`) and type of work (column `Classification`). There are two important things to know before doing a treemap with `highcharter`:

• You have to use `treemap` function from `treemap` package to create a list with your data frame that will serve as input for `hctreemap` function
• `hctreemap `fails if some category name is the same as any of its subcategories. To avoid this, make sure that all names are distinct.

This is the treemap:

Here you can see a full size version of it.

There can be seen several things at a glance: most of the pieces are drawings and prints and european sculpture and decorative arts (in concrete, prints and textiles), there is also big number of costumes, arms and armor is a very fragmented department … I think treemap is a good way to see what kind of works owns The Met.

Mi favorite spanish piece in The Met is the stunning Portrait of Juan de Pareja by Velázquez, which illustrates this post: how nice would be to see it next to El Primo in El Museo del Prado!

Feel free to use my code to do your own experiments:

```library(data.table)
library(dplyr)
library(stringr)
library(highcharter)
library(treemap)

file="MetObjects.csv"
destfile=file,
mode='wb')

# Modify column names to remove blanks
colnames(data)=gsub(" ", ".", colnames(data))

# Clean columns to prepare for searching
data[,`:=`(Artist.Nationality_aux=toupper(Artist.Nationality) %>% str_replace_all("\\[\\d+\\]", "") %>%
iconv(from='UTF-8', to='ASCII//TRANSLIT'),
Culture_aux=toupper(Culture) %>% str_replace_all("\\[\\d+\\]", "") %>%
iconv(from='UTF-8', to='ASCII//TRANSLIT'),
Country_aux=toupper(Country) %>% str_replace_all("\\[\\d+\\]", "") %>%
iconv(from='UTF-8', to='ASCII//TRANSLIT'))]

# Look for Spanish artworks
data[Artist.Nationality_aux %like% "SPANISH" |
Culture_aux %like% "SPANISH" |
Country_aux %like% "SPAIN"] -> data_spain

# Count artworks by Department and Classification
data_spain %>%
mutate(Classification=ifelse(Classification=='', "miscellaneous", Classification)) %>%
mutate(Department=tolower(Department),
Classification1=str_match(Classification, "(\\w+)(-|,|\\|)")[,2],
Classification=ifelse(!is.na(Classification1),
tolower(Classification1),
tolower(Classification))) %>%
group_by(Department, Classification) %>%
summarize(Objects=n()) %>%
ungroup %>%
mutate(Classification=ifelse(Department==Classification, paste0(Classification, "#"),
Classification)) %>%
as.data.frame() -> dfspain

# Do treemap without drawing
tm_dfspain <- treemap(dfspain, index = c("Department", "Classification"),
draw=F,
vSize = "Objects",
vColor = "Objects",
type = "index")

# Do highcharter treemap
hctreemap(
tm_dfspain,
allowDrillToNode = TRUE,
allowPointSelect = T,
levelIsConstant = F,
levels = list(
list(
level = 1,
dataLabels = list (enabled = T, color = '#f7f5ed', style = list("fontSize" = "1em")),
borderWidth = 1
),
list(
level = 2,
dataLabels = list (enabled = F,  align = 'right', verticalAlign = 'top',
style = list("textShadow" = F, "fontWeight" = 'light', "fontSize" = "1em")),
borderWidth = 0.7
)
)) %>%
hc_title(text = "Spanish Artworks in The Met") %>%
hc_subtitle(text = "Distribution by Department") -> plot

plot
```

Bayesian Blood

The fourth, the fifth, the minor fall and the major lift (Hallelujah, Leonard Cohen)

Next problem is extracted from MacKay’s Information Theory, Inference and Learning Algorithms:

Two people have left traces of their own blood at the scene of a crime. A suspect, Oliver, is tested and found to have type ‘O’ blood. The blood groups of the two traces are found to be of type ‘O’ (a common type in the local population, having frequency 60%) and of type ‘AB’ (a rare type, with frequency 1%). Do these data give evidence in favor of the proposition that Oliver was one of the people who left blood at the scene?

To answer the question, let’s first remember the probability form of Bayes theorem:

$p(H|D)=\dfrac{p(H)p(D|H)}{p(D)}$

where:

• p(H) is the probability of the hypothesis H before we see the data, called the prior
• p(H|D) is the probablity of the hyothesis after we see the data, called the posterior
• p(D|H) is the probability of the data under the hypothesis, called the likelihood
• p(D)is the probability of the data under any hypothesis, called the normalizing constant

If we have two hypothesis, A and B, we can write the ratio of posterior probabilities like this:

$\dfrac{p(A|D)}{p(B|D)}=\dfrac{p(A)p(D|A)}{p(B)p(D|B)}$

If p(A)=1-p(B) (what means that A and B are mutually exclusive and collective exhaustive), then we can rewrite the ratio of the priors and the ratio of the posteriors as odds. Writing o(A) for odds in favor of A, we get the odds form of Bayes theorem:

$o(A|D)=o(A)\dfrac{p(D|A)}{p(D|B)}$

Dividing through by o(A) we have:

$\dfrac{o(A|D)}{o(A)}=\dfrac{p(D|A)}{p(D|B)}$

The term on the left is the ratio of the posteriors and prior odds. The term on the right is the likelihood ratio, also called the Bayes factor. If it is greater than 1, that means that the data were more likely under A than under B. And since the odds ratio is also greater than 1, that means that the odds are greater, in light of the data, than they were before. If the Bayes factor is less than 1, that means the data were less likely under A than under B, so th odds in favor of A go down.

Let’s go back to our initial problem. If Oliver left his blood at the crime scene, the probability of the data is just the probability that a random member of the population has type ‘AB’ blood, which is 1%. If Oliver did not leave blood at the scene, what is the the chance of finding two people, one with type ‘O’ and one with type ‘AB’? There are two ways it might happen: the first person we choose might have type ‘O’ and the second ‘AB’, or the other way around. So the probability in this case is 2(0.6)(0.01)=1.2%. Dividing probabilities of both scenarios we obtain a Bayes factor of 0.83, and we conclude that the blood data is evidence against Oliver’s guilt.

Once I read this example, I decided to replicate it using real data of blood type distribution by country from here. After cleaning data, I have this nice data set to work with:

For each country, I get the most common blood type (the one which the suspect has) and the least common and replicate the previous calculations. For example, in the case of Spain, the most common type is ‘O+’ with 36% and the least one is ‘AB-‘ with 0.5%. The Bayes factor is 0.005/(2(0.36)(0.005))=1.39 so data support the hypothesis of guilt in this case. Next chart shows Bayes factor accross countries:

• Sometimes data consistent with a hypothesis are not necessarily in favor of the hypothesis
• How different is the distribution of blood types between countries!
• If you are a estonian ‘A+’ murderer, choose carefully your accomplice

This is the code of the experiment:

```library(rvest)
library(dplyr)
library(stringr)
library(DT)
library(highcharter)

# Webscapring of the table with the distribution of blood types
url <- "http://www.rhesusnegative.net/themission/bloodtypefrequencies/"
blood <- url %>%
html_node(xpath='/html/body/center/table') %>%
html_table(fill=TRUE)

# Some data cleansing
blood %>% slice(-c(66:68)) -> blood

blood[,-c(1:2)] %>%
sapply(gsub, pattern=",", replacement=".") %>%
as.data.frame %>%
sapply(gsub, pattern=".79.2", replacement=".79") %>%
as.data.frame-> blood[,-c(1:2)]

blood %>%
sapply(gsub, pattern="%|,", replacement="") %>%
as.data.frame -> blood

blood[,-1] = apply(blood[,-1], 2, function(x) as.numeric(as.character(x)))

blood[,-c(1:2)] %>% mutate_all(funs( . / 100)) -> blood[,-c(1:2)]

# And finally, we have a nice data set
datatable(blood,
rownames = FALSE,
options = list(
searching = FALSE,
pageLength = 10)) %>%
formatPercentage(3:10, 2)

# Calculate the Bayes factor
blood %>%
mutate(factor=apply(blood[,-c(1,2)], 1, function(x) {min(x)/(2*min(x)*max(x))})) %>%
arrange(factor)-> blood

# Data Visualization
highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "Bayesian Blood") %>%
hc_subtitle(text = "An experiment about the Bayes Factor") %>%
hc_xAxis(categories = blood\$Country,
labels = list(rotation=-90, style = list(fontSize = "12px")))  %>%
hc_yAxis(plotBands = list(list(from = 0, to = 1, color = "rgba(255,215,0, 0.8)"))) %>%
color = "rgba(255, 0, 0, 0.5)",
name = "Bayes Factor")%>%
hc_yAxis(min=0.5) %>%
hc_tooltip(pointFormat = "{point.y:.2f}") %>%
hc_legend(enabled = FALSE) %>%
hc_exporting(enabled = TRUE) %>%
hc_chart(zoomType = "xy")
```

Visualizing the Daily Variability of Bitcoin with Quandl and Highcharts

Lay your dreams, little darling, in a flower bed; let that sunshine in your hair (Where the skies are blue, The Lumineers)

I discovered this nice visualization some days ago. The author is also the creator of Highcharter, an incredible R wrapper for Highcharts javascript libray and its modules. I am a big fan of him.

Inspired by his radial plot, I did a visualization of the daily evolution of Daily Bitcoin exchange rate (BTC vs. EUR) on Localbtc. Data is sourced from here and I used Quandl to obtain the data frame. Quandl is a marketplace for financial and economic data delivered in modern formats for today’s analysts. There is a package called `Quandl` to interact directly with the Quandl API to download data in a number of formats usable in R. You only need to locate the data you want in the Quandl site. In my case data are here.

• Filtering data to obtain last 12 complete months
• Create a new variable with the difference between closing and opening price of Bitcoin (in Euros)
• Create a color variable to distinguish between positive and negative differences
• Create the graph using Fivethirtyeight theme for highcharts

This is the result:

Apart of its appealing, I think is a good way to to have a quick overview of the evolution of a stock price. This is the code to do the experiment:

```library(Quandl)
library(dplyr)
library(highcharter)
library(lubridate)
bitcoin=Quandl("BCHARTS/LOCALBTCEUR")
bitcoin %>%
arrange(Date) %>%
mutate(tmstmp = datetime_to_timestamp(Date)) -> bitcoin
last_date=max(bitcoin\$Date)
if (day(last_date+1)==1) date_to=last_date else
date_to=ymd(paste(year(last_date), month(last_date),1, sep="-"))-1
date_from=ymd(paste(year(date_to)-1, month(date_to)+1,1, sep="-"))
bitcoin %>% filter(Date>=date_from, Date<=date_to) -> bitcoin
var_bitcoin <- bitcoin %>%
mutate(Variation = Close - Open,
color = ifelse(Variation>=0, "green", "red"),
y = Variation) %>%
select(x = tmstmp,
y,
variation = Variation,
name = Date,
color,
open = Open,
close = Close) %>%
list.parse3()
x <- c("Open", "Close", "Variation")
y <- sprintf("{point.%s}", tolower(x))
tltip <- tooltip_table(x, y)
hc <- highchart() %>%
hc_title(text = "Bitcoin Exchange Rate (BTC vs. EUR)") %>%
hc_subtitle(text = "Daily Variation on Localbtc. Last 12 months")%>%
hc_chart(
type = "column",
polar = TRUE) %>%
hc_plotOptions(
series = list(
stacking = "normal",
showInLegend = FALSE)) %>%
hc_xAxis(
gridLineWidth = 0.5,
type = "datetime",
tickInterval = 30 * 24 * 3600 * 1000,
labels = list(format = "{value: %b}")) %>%
hc_yAxis(showFirstLabel = FALSE) %>%