I am an absolute beginner, but I am absolutely sane (Absolute Beginners, David Bowie)

Some time ago I wrote this post, where I predicted correctly the winner of the Spanish Football League several months before its ending. After thinking intensely about taking the risk of ruining my reputation repeating the analysis, I said *“no problem, Antonio, do it again: in the end you don’t have any reputation to keep”*. So here we are.

From a technical point of view there are many differences between both analysis. Now I use webscraping to download data, dplyr and pipes to do transformations and interactive D3.js graphs to show results. I think my code is better now and it makes me happy.

As I did the other time, Bradley-Terry Model gives an indicator of *the power* of each team, called *ability*, which provides a natural mechanism for ranking teams. This is the evolution of *abilities* of each team during the championship (last season was played during the past weekend):

Although it is a bit messy, the graph shows two main groups of teams: on the one hand, Barcelona, Atletico de Madrid, Real Madrid and Villareal; on the other hand, the rest. Let’s have a closer look to evolution of the *abilities* of the top 4 teams:

While Barcelona, Atletico de Madrid and Real Madrid *walk* in parallel, Villareal seems to be a bit stacked in the last seasons; the gap between them and Real Madrid is increasing little by little. Maybe is the *Zidane’s effect*. It is quite interesting discovering what teams are increasing their abilities: they are Malaga, Eibar and Getafe. They will probably finish the championship in a better position than they have nowadays (Eibar could reach fifth position):

What about Villareal? Will they go up some position? I don’t think so. This plot shows the probability of winning any of the top 3:

As you can see, probability is decreasing significantly. And what about Barcelona? Will win? It is a very difficult question. They are *almost tied* with Atletico de Madrid, and only 5 and 8 points above Real Madrid and Villareal. But it seems Barcelona *keep them at bay*. This plot shows the evolution of the probability of be beaten by Atletico, Real Madrid and Villareal:

All probabilities are under 50% and decreasing (I supposed a scoring of 2-0 for Barcelona in the match against Sporting of season 16 that was postponed to next February 17th).

Data science is a profession for brave people so it is time to do some predictions. These are mine, ordered by *likelihood*:

- Barcelona will win, followed by Atletico (2), Real Madrid (3), Villareal (4) and Eibar (5)
- Malaga and Getafe will go up some positions
- Next year I will do the analysis again

Here you have the code:

library(rvest) library(stringr) library(BradleyTerry2) library(dplyr) library(reshape) library(rCharts) nseasons=20 results=data.frame() for (i in 1:nseasons) { webpage=paste0("http://www.marca.com/estadisticas/futbol/primera/2015_16/jornada_",i,"/") html(webpage) %>% html_nodes("table") %>% .[[1]] %>% html_table(header=FALSE, fill=TRUE) %>% mutate(X4=i) %>% rbind(results)->results } colnames(results)=c("home", "score", "visiting", "season") results %>% mutate(home = iconv(home, from="UTF8",to="ASCII//TRANSLIT"), visiting = iconv(visiting, from="UTF8",to="ASCII//TRANSLIT")) %>% #filter(grepl("-", score)) %>% mutate(score=replace(score, score=="18:30 - 17/02/2016", "0-2")) %>% # resultado fake para el Barcelona mutate(score_home = as.numeric(str_split_fixed(score, "-", 2)[,1])) %>% mutate(score_visiting = as.numeric(str_split_fixed(score, "-", 2)[,2])) %>% mutate(points_home =ifelse(score_home > score_visiting, 3, ifelse(score_home < score_visiting, 0, 1))) %>% mutate(points_visiting =ifelse(score_home > score_visiting, 0, ifelse(score_home < score_visiting, 3, 1))) -> data prob_BT=function(x, y) {exp(x-y) / (1 + exp(x-y))} BTabilities=data.frame() for (i in 13:nseasons) { data %>% filter(season<=i) %>% BTm(cbind(points_home, points_visiting), home, visiting, data=.) -> footballBTModel BTabilities(footballBTModel) %>% as.data.frame() -> tmp cbind(tmp, as.character(rownames(tmp)), i) %>% mutate(ability=round(ability, digits = 2)) %>% rbind(BTabilities) -> BTabilities } colnames(BTabilities)=c("ability", "s.e.", "team", "season") sort(unique(BTabilities[,"team"])) -> teams BTprobabilities=data.frame() for (i in 13:nseasons) { BTabilities[BTabilities$season==i,1] %>% outer( ., ., prob_BT) -> tmp colnames(tmp)=teams rownames(tmp)=teams cbind(melt(tmp),i) %>% rbind(BTprobabilities) -> BTprobabilities } colnames(BTprobabilities)=c("team1", "team2", "probability", "season") BTprobabilities %>% filter(team1=="Villarreal") %>% mutate(probability=round(probability, digits = 2)) %>% filter(team2 %in% c("R. Madrid", "Barcelona", "Atletico")) -> BTVillareal BTprobabilities %>% filter(team2=="Barcelona") %>% mutate(probability=round(probability, digits = 2)) %>% filter(team1 %in% c("R. Madrid", "Villarreal", "Atletico")) -> BTBarcelona AbilityPlot <- nPlot( ability ~ season, data = BTabilities, group = "team", type = "lineChart") AbilityPlot$yAxis(axisLabel = "Estimated Ability", width = 62) AbilityPlot$xAxis(axisLabel = "Season") VillarealPlot <- nPlot( probability ~ season, data = BTVillareal, group = "team2", type = "lineChart") VillarealPlot$yAxis(axisLabel = "Probability of beating", width = 62) VillarealPlot$xAxis(axisLabel = "Season") BarcelonaPlot <- nPlot( probability ~ season, data = BTBarcelona, group = "team1", type = "lineChart") BarcelonaPlot$yAxis(axisLabel = "Probability of being beaten", width = 62) BarcelonaPlot$xAxis(axisLabel = "Season")