# Allusions to parents in autobiographies (or reading 118 books in a few seconds)

If I keep holding out, will the light shine through? (Come Back, Pearl Jam)

Imagine that you are writing the story of your life. Almost sure you will make allusions to your parents, but will both of them have the same prominence in your biography or will you spend more words in one of them? In that case, which one will have more relevance? Your father or your mother?

This experiment analyses 118 autobiographies from the Project Gutenberg and count how many times do authors make allusions to their fathers and mothers. This is what I’ve done:

• Download all works from Gutenberg Project containing the word autobiography in its title (there are 118 in total).
• Count how many times the bigrams my father and my mother appear in each text. This is what I call allusions to father and mother respectively.

The number of allusions that I measure is a lower bound of the exact amount of them since the calculus has some limitations:

• Maybe the author refers to them by their names.
• After referring to them as my father or my mother, subsequent sentences may refer them as He or She.

Anyway, I think these constrains do not introduce any bias in the calculus since may affect to fathers and mothers equally. Here you can find the dataset I created after downloading all autobiographies and measuring the number of allusions to each parent.

Some results:

• 64% of autobiographies have more allusions to the father than the mother.
• 24% of autobiographies have more allusions to the mother than the father.
• 12% allude them equally.

Most of the works make more allusions to father than to mother. As a visual proof of this fact, the next plot is a histogram of the difference between the amount of allusions to father and mother along the 118 works (# allusions to father# allusions to mother):

The distribution is clearly right skeweed, which supports our previous results. Another way to see this fact is this last plot, which situates each autobiography in a scatter plot, where X-axis is the amount of allusions to father and Y-axis to mother. It is interactive, so you can navigate through it to see the details of each point (work):

Most of the points (works) are below the diagonal, which means that they contain more allusions to father than mother. Here you can find a full version of the previous plot.

I don’t have any explanation to this fact, just some simple hypothesis:

• Fathers and mothers influence their children differently.
• Fathers star in more anecdotes than mothers.
• This is the effect of patriarchy (72% of authors was born in the XIX century)

Whatever it is the explanation, this experiment shows how easy is to do text mining with R. Special mention to purrr (to iterate eficiently over the set of works IDs), tidytext (to count the number of appearances of bigrams), highcharter (to do the interactive plot) and gutenbergr (to download the books). You can find the code here.

# The Pleasing Ratio Project

Music is a world within itself, with a language we all understand (Sir Duke, Stevie Wonder)

This serious man on the left is Gustav Theodor Fechner, a German philosopher, physicist and experimental psychologist who lived between 1801 and 1887. To be honest, I don’t know almost anything of his life or work exepct one thing: he did in the 1860s a thought-provoking experiment. It seems me interesting for two important reasons: he called into question something widely established and obtained experimental data by himself.

Fechner’s experiment was simple: he presented just ten rectangles to 82 students. Then he asked each of them to choose the most pleasing one and obtained revealing discoveries I will not explain here since would cause bias in my experiment. You can find more information about the original experiment here.

I have done a project inspired in Fechner’s one that I called The Pleasing Ratio Project. Once you enter in the App, you will see two rectangles. Both of them have the same area. They only vary in their length-to-width ratios. Then you will be asked to select the one that seems you most pleasing. You can do it as many times as you want (all responses are completely anonymous). Every game will confront a couple of ratios, which can vary from 1 to 3,5. In the Results section you will find the  percentage of winning games for each ratio. The one with the highest percentage will be named officially as The Most Pleasing Ratio of the World in the future by myself.

Although my experiment is absolutely inspired in Fechner’s one, there is a important difference: I can explore a bigger set of ratios doing an A/B test. This makes this one a bit richer.

The experiment has also some interesting technical features:

• the use of shinydashboard package to arrange the App
• the use of shinyjs package to add javaScript to refresh the page when use choose to play again
• to save votes in a text file
• to read it to visualize results

Will I obtain the same results as Fechner? This is a living project whose results will change over the time so you can check it regularly.

The code of the project is available in GitHub. Thanks a lot for your collaboration!

# Fatal Journeys: Visualizing the Horror

In war, truth is the first casualty (Aeschylus)

I am not a pessimistic person. On the contrary, I always try to look at the bright side of life. I also believe that living conditions are now better than years ago as these plots show. But reducing the complexity of our world to just six graphs is riskily simplistic. Our world is quite far of being a fair place and one example is the immigration drama.

Last year there were 934 incidents around the world involving people looking for a better life where more than 5.300 people lost their life o gone missing, 60% of them in Mediterranean. Around 8 out of 100 were children.

The missing migrant project tracks deaths of migrants, including refugees and asylum-seekers, who have gone missing along mixed migration routes worldwide. You can find a huge amount of figures, plots and information about this scourge in their website. You can also download there a historical dataset with information of all these fatal journeys, including location, number of dead or missing people and information source from 2015 until today.

I this experiment I read the dataset and do some plots using highcharter; you can find a link to the R code at the end of the post.

This is the evolution of the amount of deaths or missing migrants in the process of migration towards an international destination from January 2015 to December 2017:

The Mediterranean is the zone with the most incidents. To see it more clearly, this plot compares Mediterranean with the rest of the world, grouping previous zones:

Is there any pattern in the time series of Mediterranean incidents? To see it, I have done a LOESS decomposition of the time series:

Good news: trend is decreasing for last 12 months. Regarding seasonal component, incidents increase in April and May. Why? I don’t know.

This is a map of the location of all incidents in 2017. Clicking on markers you will find information about each incident:

Every of us should try to make our world a better place. I don’t really know how to do it but I will try to make some experiments during this year to show that we have tons of work in front of us. Meanwhile, I hope this experiment is useful to give visibility to this humanitarian disaster. If someone wants to use the code, the complete project is available in GitHub.

# Visualizing the Spanish Contribution to The Metropolitan Museum of Art

Well I walk upon the river like it’s easier than land
(Love is All, The Tallest Man on Earth)

The Metropolitan Museum of Art provides here a dataset with information on more than 450.000 artworks in its collection. You can do anything you want with these data: there are no restrictions of use. Each record contains information about the author, title, type of work, dimensions, date, culture and  geography of a particular piece.

I can imagine a bunch of things to do with these data but since I am a big fan of highcharter,  I have done a treemap, which is an artistic (as well as efficient) way to visualize hierarchical data. A treemap is useful to visualize frequencies. They can handle levels, allowing to navigate to go into detail about any category. Here you can find a good example of treemap.

To read data I use fread function from data.table package. I also use this package to do some data wrangling operations on the data set. After them, I filter it looking for the word SPANISH in the columns Artist Nationality and Culture and looking for the word SPAIN in the column Country. For me, any piece created by an Spanish artist (like this one), coming from Spanish culture (like this one) or from Spain (like this one) is Spanish (this is my very own definition and may do not match with any academical one). Once it is done, it is easy to extract some interesting figures:

• There are 5.294 Spanish pieces in The Met, which means a 1,16% of the collection
• This percentage varies significantly between departments: it raises to 9,01% in The Cloisters and to 4,83% in The Robert Lehman Collection; on the other hand, it falls to 0.52% in The Libraries and to 0,24% in Photographs.
• The Met is home to 1.895 highlights and 44 of them (2,32%) are Spanish; It means that Spanish art is twice as important as could be expected (remember that represents a 1,16% of the entire collection)

My treemap represents the distribution of Spanish artworks by department (column Department) and type of work (column Classification). There are two important things to know before doing a treemap with highcharter:

• You have to use treemap function from treemap package to create a list with your data frame that will serve as input for hctreemap function
• hctreemap fails if some category name is the same as any of its subcategories. To avoid this, make sure that all names are distinct.

This is the treemap:

Here you can see a full size version of it.

There can be seen several things at a glance: most of the pieces are drawings and prints and european sculpture and decorative arts (in concrete, prints and textiles), there is also big number of costumes, arms and armor is a very fragmented department … I think treemap is a good way to see what kind of works owns The Met.

Mi favorite spanish piece in The Met is the stunning Portrait of Juan de Pareja by Velázquez, which illustrates this post: how nice would be to see it next to El Primo in El Museo del Prado!

Feel free to use my code to do your own experiments:

library(data.table)
library(dplyr)
library(stringr)
library(highcharter)
library(treemap)

file="MetObjects.csv"
destfile=file,
mode='wb')

# Modify column names to remove blanks
colnames(data)=gsub(" ", ".", colnames(data))

# Clean columns to prepare for searching
data[,:=(Artist.Nationality_aux=toupper(Artist.Nationality) %>% str_replace_all("\$\\d+\$", "") %>%
iconv(from='UTF-8', to='ASCII//TRANSLIT'),
Culture_aux=toupper(Culture) %>% str_replace_all("\$\\d+\$", "") %>%
iconv(from='UTF-8', to='ASCII//TRANSLIT'),
Country_aux=toupper(Country) %>% str_replace_all("\$\\d+\$", "") %>%
iconv(from='UTF-8', to='ASCII//TRANSLIT'))]

# Look for Spanish artworks
data[Artist.Nationality_aux %like% "SPANISH" |
Culture_aux %like% "SPANISH" |
Country_aux %like% "SPAIN"] -> data_spain

# Count artworks by Department and Classification
data_spain %>%
mutate(Classification=ifelse(Classification=='', "miscellaneous", Classification)) %>%
mutate(Department=tolower(Department),
Classification1=str_match(Classification, "(\\w+)(-|,|\\|)")[,2],
Classification=ifelse(!is.na(Classification1),
tolower(Classification1),
tolower(Classification))) %>%
group_by(Department, Classification) %>%
summarize(Objects=n()) %>%
ungroup %>%
mutate(Classification=ifelse(Department==Classification, paste0(Classification, "#"),
Classification)) %>%
as.data.frame() -> dfspain

# Do treemap without drawing
tm_dfspain <- treemap(dfspain, index = c("Department", "Classification"),
draw=F,
vSize = "Objects",
vColor = "Objects",
type = "index")

# Do highcharter treemap
hctreemap(
tm_dfspain,
allowDrillToNode = TRUE,
allowPointSelect = T,
levelIsConstant = F,
levels = list(
list(
level = 1,
dataLabels = list (enabled = T, color = '#f7f5ed', style = list("fontSize" = "1em")),
borderWidth = 1
),
list(
level = 2,
dataLabels = list (enabled = F,  align = 'right', verticalAlign = 'top',
style = list("textShadow" = F, "fontWeight" = 'light', "fontSize" = "1em")),
borderWidth = 0.7
)
)) %>%
hc_title(text = "Spanish Artworks in The Met") %>%
hc_subtitle(text = "Distribution by Department") -> plot

plot


# Visualizing the Gender of US Senators With R and Highmaps

I wake up every morning in a house that was built by slaves (Michelle Obama)

Some days ago I was invited by the people of Highcharts to write a post in their blog. What I have done is a simple but revealing map of women senators of the United States of America. Briefly, this is what I’ve done to generate it:

• read from the US senate website a XML file with senators info
• clean and obtain gender of senators from their first names
• summarize results by state
• join data with a US geojson dataset to create the highmap

You can find details and R code here.

It is easy creating a highcharts using highcharter, an amazing library as genderizeR, the one I use to obtain gender names. I like them a lot.

# Visualizing Stirling’s Approximation With Highcharts

I said, “Wait a minute, Chester, you know I’m a peaceful man”, He said, “That’s okay, boy, won’t you feed him when you can” (The Weight, The Band)

It is quite easy to calculate the probability of obtaining the same number of heads and tails when tossing a coin N times, and N is even. There are $2^{N}$ possible outcomes and only $C_{N/2}^{N}$ are favorable so the exact probability is the quotient of these numbers (# of favorable divided by # of possible).

There is another way to approximate this number incredibly well: to use the Stirling’s formula, which is $1/\sqrt{\pi\cdot N/2}$

The next plot represents both calculations for N from 2 to 200. Although for small values of N, Stirling’s approximation tends to overestimate probability, you can see hoy is extremely precise as N becomes bigger:

James Stirling published this amazing formula in 1730. It simplifies the calculus to the extreme and also gives a quick way to obtain the answer to a very interesting question: how many tosses are needed to be sure that the probability of obtaining the same number of heads and tails is under any given threshold? Just solve the formula for N and you will obtain the answer. And, also, the formula is another example of the presence of $pi$ in the most unexpected places, as happens here.

Just another thing: the more I use highcharter package the more I like it.

This is the code:

library(highcharter)
library(dplyr)
data.frame(N=seq(from=2, by=2, length.out = 100)) %>%
mutate(Exact=choose(N,N/2)/2**N, Stirling=1/sqrt(pi*N/2))->data
hc <- highchart() %>%
hc_title(text = "Stirling's Approximation") %>%
hc_subtitle(text = "How likely is getting 50% heads and 50% tails tossing a coin N times?") %>%
hc_xAxis(title = list(text = "N: Number of tosses"), categories = data$N) %>% hc_yAxis(title = list(text = "Probability"), labels = list(format = "{value}%", useHTML = TRUE)) %>% hc_add_series(name = "Stirling", data = data$Stirling*100,  marker = list(enabled = FALSE), color="blue") %>%
hc_add_series(name = "Exact", data = data\$Exact*100,  marker = list(enabled = FALSE), color="lightblue") %>%
hc_tooltip(formatter = JS("function(){return ('<b>Number of tosses: </b>'+this.x+'
<b>Probability: </b>'+Highcharts.numberFormat(this.y, 2)+'%')}")) %>%
hc_exporting(enabled = TRUE) %>%
hc_chart(zoomType = "xy")
hc