Tag Archives: R

How Do We Draw a Line?

August 5, 2018 | Drawings | ggplot2, Google, R, Rstats | @aschinchon

She dreams in colour, she dreams in red, can’t find a better man (Better Man, Pearl Jam)

Today I bring another experiment based on the Quick, Draw! data from Google, one of my luckiest discoveries of recent times. Quick, Draw! is a web game developed by Google that can be played on a computer, tablet or mobile phone. You are asked to draw something (for example, a bird) and you have just 20 seconds to do it. You win if a machine, trained with a neural network, guesses what you are drawing. The best way to understand how it works is to play it here. Google published data on about 50 million drawings across 345 categories, contributed by players of the game from all over the world. The datasets are in ndjson format (newline-delimited JSON). In my previous post I analyzed one of these datasets and showed a way to parse the drawings and represent them with ggplot.

On this occasion I analyze the simplest drawing that Google can ask you for: a line. The dataset, which is called lines.ndjson, can be found here and contains more than 143,000 lines drawn by people from about 170 countries. Most of these drawings come from the United States (45.4%), the United Kingdom (7.5%), Canada (3.6%), Germany (3.5%) and the Russian Federation (2.3%).

Let’s try to understand how humans draw lines. Specifically, in which direction do we draw them: horizontally, to the right or to the left? Vertically, upward or downward? This analysis is inspired by two great articles I read recently:

How do you draw a circle? by Quartz, an amazing analysis which shows how cultural circumstances strongly determine the way in which we draw circles.
City Street Orientations around the World by Geoff Boeing, an awesome analysis and data visualization which gave me the idea of doing polar graphs to show my results.

There are some technical details about this experiment that I would like to highlight:

I parse the dataset using the fromJSON function from the rjson package.
I use the purrr package to fit a linear regression to the points defining the line of each drawing.
I convert the summary of each regression into a data frame using the tidy function from the broom package.
I use the slope of the regression to obtain the angle that describes the line (depending on where the line starts, I add pi to the arctangent of the slope).
I represent the frequency of angles in polar coordinates, dividing the circle into sectors of 30 degrees as follows: 345°-15°, 15°-45°, 45°-75°, 75°-105°, …, 315°-345°. For example, horizontal lines drawn from left to right fall into the 345°-15° category.
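
The angle and binning steps in the list above can be sketched as follows. This is a minimal illustration in plain Python, not the R code of the experiment (which relies on purrr and broom). The function names, the treatment of perfectly vertical strokes and the convention that y grows upward are all assumptions of mine.

```python
import math

def fit_slope(xs, ys):
    """Least-squares slope of y on x; None when the stroke is vertical."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def line_angle(xs, ys):
    """Direction of a stroke in degrees, in [0, 360)."""
    slope = fit_slope(xs, ys)
    if slope is None:
        # Assumption: y grows upward, so increasing y means 90 degrees.
        angle = math.pi / 2 if ys[-1] > ys[0] else 3 * math.pi / 2
    else:
        angle = math.atan(slope)
        if xs[-1] < xs[0]:
            # The stroke runs toward the left: add pi to the arctangent.
            angle += math.pi
    return math.degrees(angle) % 360

def sector(angle_deg, width=30):
    """Label of the sector centred on a multiple of `width` degrees."""
    index = int((angle_deg + width / 2) // width)
    centre = (index * width) % 360
    low = (centre - width / 2) % 360
    high = (centre + width / 2) % 360
    return f"{low:g}°-{high:g}°"

print(sector(line_angle([0, 10, 20], [5, 5, 5])))  # horizontal, left to right
print(sector(line_angle([20, 10, 0], [5, 5, 5])))  # horizontal, right to left
```

Note that screen coordinates usually have y growing downward, so which vertical sector means "bottom-up" depends on how the data are oriented before the regression.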

This is how we draw lines, analysing the entire dataset without distinguishing by country:

The fact seems clear: an average human who plays Quick, Draw! draws a line horizontally from left to right with a probability of 59%. I have to admit that I expected a majority of horizontal left-to-right lines, but not as overwhelming as the plot shows. Maybe my prior is far from reality because I am left-handed and would draw it another way. Remember as well that this average human probably comes from the United States.

Are there differences by country? Yes, and they are very interesting. I removed all countries with fewer than 150 drawings. Taking this into account, these are the four countries where the most people draw vertical bottom-up lines:

And these are the countries where the most people draw horizontal right-to-left lines:

We’ve seen that on average, 59% of lines are drawn from left to right. This figure reaches more than 75% in the following countries:

And where do people draw more oblique lines? Here:

Surprisingly, very few lines are drawn downward, which I find quite intriguing.

Some thoughts (let me know yours):

Humans prefer drawing horizontal lines from left to right everywhere.
When drawing vertical lines, we clearly prefer the bottom-up movement to the opposite one; maybe the device configuration or the layout of the application encourages this behaviour.
Arabic and Hebrew are written from right to left: this fact seems to have a significant influence on the way people draw lines.

You can find the code of this experiment here.

Exploring The Quick, Draw! Dataset With R: The Mona Lisa

July 1, 2018 | Drawings, Image Processing | gganimate, ggplot2, Google, R, Rstats | @aschinchon

All that noise, and all that sound, all those places I have found (Speed of Sound, Coldplay)

Some days ago, my friend Jorge showed me one of the coolest datasets I’ve ever seen: the Google Quick, Draw! dataset. On its GitHub page you can find a detailed description of the data. Briefly, it contains around 50 million drawings by people from around the world in .ndjson format. In this experiment, I used the simplified version of the drawings, in which strokes are simplified and resampled with 1-pixel spacing. Drawings are also aligned to the top-left corner and scaled to have a maximum value of 255. All of this makes the data easier to manage and to plot.
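
To make the alignment and scaling concrete, here is a rough Python sketch of those two steps only (the stroke simplification and resampling are left out). It is not Google's preprocessing code: the function name and the representation of a drawing as a list of (xs, ys) pairs are assumptions chosen for illustration.

```python
def align_and_scale(strokes, max_value=255):
    """Shift a drawing to the top-left corner and scale it uniformly.

    `strokes` is a list of (xs, ys) pairs, one pair per stroke.
    """
    all_x = [x for xs, _ in strokes for x in xs]
    all_y = [y for _, ys in strokes for y in ys]
    min_x, min_y = min(all_x), min(all_y)
    # One scale factor for both axes keeps the aspect ratio.
    extent = max(max(all_x) - min_x, max(all_y) - min_y)
    scale = max_value / extent if extent else 1
    return [
        ([round((x - min_x) * scale) for x in xs],
         [round((y - min_y) * scale) for y in ys])
        for xs, ys in strokes
    ]

raw = [([10, 60], [20, 20]), ([10, 110], [20, 70])]
print(align_and_scale(raw))
```

After the transformation the smallest coordinate is 0 and the largest is 255, so every drawing fits the same square regardless of the device it was drawn on.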

Since .ndjson files may be very large, I used the LaF package to access random lines of the file rather than reading it completely. I wrote a script to explore The Mona Lisa.ndjson, a file which contains more than 120,000 drawings that Google’s TensorFlow engine recognized as being The Mona Lisa. It is quite funny to see them. With this script you can:

Reproduce a random single drawing
Create a 9×9 mosaic of random drawings
Create an animation simulating the way the drawing was created

I use the ggplot2 package to render the drawings and David Robinson’s gganimate package to create the animations.
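
For readers who want to see the idea of sampling a few records from a large ndjson file without loading it all, here is a small Python sketch. It uses reservoir sampling over a single pass, which is a different technique from the random line access that LaF provides, and the three records below are toy data that only mimic the field names of the real files (word, countrycode, drawing).

```python
import io
import json
import random

def sample_lines(lines, k, seed=None):
    """Reservoir sampling: keep k lines chosen uniformly from a stream."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample

def parse_strokes(line):
    """Turn one ndjson record into a list of (xs, ys) strokes."""
    record = json.loads(line)
    return [(xs, ys) for xs, ys in record["drawing"]]

# Toy stand-in for a file opened with open(path); the values are made up.
toy_file = io.StringIO(
    '{"word": "line", "countrycode": "ES", "drawing": [[[0, 255], [3, 0]]]}\n'
    '{"word": "line", "countrycode": "US", "drawing": [[[0, 255], [0, 0]]]}\n'
    '{"word": "line", "countrycode": "DE", "drawing": [[[255, 0], [9, 12]]]}\n'
)

for line in sample_lines(toy_file, k=2, seed=1):
    print(parse_strokes(line))
```

Only k lines are held in memory at any time, so the same function works on files that are too large to read completely.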

This is an example of a single drawing:

This is an example of a 3×3 mosaic:

This is an example of an animation:

If you want to try it yourself, you can find the code here.

Note: to work with gganimate, I downloaded the portable version and pointed to it with the Sys.setenv command, as explained here.

How Much Money Should Machines Earn?

June 17, 2018 | Highcharts, The World We Live In | DataViz, jobs, machines, OpenData, R, robots, Rstats, wages | @aschinchon

Every inch of sky’s got a star
Every inch of skin’s got a scar
(Everything Now, Arcade Fire)

I think that a very good way to get started with R is to build an interactive visualization of some open data, because you will train many important skills of a data scientist: loading, cleaning, transforming and combining data, and producing a suitable visualization. Making it interactive will also give you an idea of the power of R, because you will realise that you can indirectly drive other programming languages such as JavaScript.

That’s precisely what I’ve done today. I combined two interesting datasets:

The probability of computerisation of 702 detailed occupations, obtained by Carl Benedikt Frey and Michael A. Osborne from the University of Oxford, using a Gaussian process classifier and published in this paper in 2013.
Job statistics (employment, median annual wages and typical education needed for entry) from the US Bureau of Labor Statistics, available here.

Apart from using dplyr to manipulate the data and highcharter to build the visualization, I used the tabulizer package to extract the table of computerisation probabilities from the PDF: it makes this task extremely easy.

This is the resulting plot:

If you want to examine it in depth, here you have a full size version.

These are some of my insights (the corresponding figures are obtained directly from the dataset):

There is a moderate negative correlation between wages and probability of computerisation.
Around 45% of US jobs are threatened by machines (they have a computerisation probability higher than 80%): half of them do not require formal education for entry.
In fact, 78% of jobs which do not require formal education for entry are threatened by machines, while 0% of those which require a master’s degree are.
Teachers are absolutely irreplaceable (0% are threatened by machines) but they earn 2.2% less than the average wage (unfortunately, I’m afraid this phenomenon occurs in many other countries as well).
Don’t study to become a librarian or archivist: it seems a bad way to invest your time.
Mathematicians will survive the machines.
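
As a small illustration of how a figure such as the share of threatened jobs can be computed, here is a Python sketch. It assumes that the share is weighted by employment and that "threatened" means a probability above 0.8; the rows are toy values invented for the example, not figures from the Frey and Osborne paper or from the Bureau of Labor Statistics.

```python
def threatened_share(rows, threshold=0.8):
    """Fraction of total employment in occupations above the threshold.

    Each row is (occupation, employment, probability of computerisation).
    """
    total = sum(employment for _, employment, _ in rows)
    at_risk = sum(employment for _, employment, p in rows if p > threshold)
    return at_risk / total

# Toy data, made up for illustration only.
toy_jobs = [
    ("occupation A", 1000, 0.95),
    ("occupation B", 3000, 0.10),
    ("occupation C", 1000, 0.85),
]

print(threatened_share(toy_jobs))
```

Filtering the rows by education level before calling the function would give the per-group percentages in the same way.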

The code of this experiment is available here.

Coloring Sudokus

June 1, 2018 | Curiosities, Drawings, Games | colourlovers, ggplot2, R, Rstats, sudokus | @aschinchon

Someday you will find me
caught beneath the landslide
(Champagne Supernova, Oasis)

I recently read a book called Snowflake Seashell Star: Colouring Adventures in Numberland by Alex Bellos and Edmund Harriss, which is full of mathematical patterns to be coloured. All the images are truly appealing and attract anyone who looks at them, regardless of age, gender, education or political orientation. This book demonstrates how maths is an astonishing way to reach beauty.

Among my favourite patterns are tridokus, a sophisticated colored version of sudokus. Coloring a sudoku is simple: once it is solved, it is enough to assign a color to each number (from 1 to 9). If you superimpose three colored sudokus in which no cell shares both position and color with another, again using nine colors, the resulting image is a tridoku:

There is something attractive in a tridoku due to the balance of colors, but they also seem quite messy: they are charmingly unbalanced. I wrote a script to generalize the concept to n-dokus. The idea is the same: superimpose n sudokus without cells sharing color and position (I call them disjoint sudokus) using just nine different colors. I didn’t prove it, but I think the maximum number of sudokus that can be superimposed under these constraints is 9. This is a complete series from 1-doku to 9-doku (click on any image to enlarge):

I am a big fan of the colourlovers package. These tridokus are colored with some of my favourite palettes from there:

Just two technical things to highlight:

There is a package called sudoku that generates sudokus (of course!). I use it to obtain the first solved sudoku, which forms the base.
Subsequent sudokus are obtained from this one through two operations: first interchanging groups of columns (there are three groups: columns 1 to 3, 4 to 6 and 7 to 9) and then interchanging columns within each group.
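
The two column operations can be tried out with the following Python sketch. It is not the script of the experiment: the base grid comes from a simple shifting pattern instead of the sudoku package, the same reordering is applied inside every column group to keep the example short, and all function names are made up for illustration.

```python
def base_sudoku():
    """A solved sudoku built from a simple shifting pattern."""
    return [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
            for r in range(9)]

def permute_columns(grid, group_order, within_order):
    """Reorder the three column groups, then the columns inside each group."""
    columns = [3 * g + offset for g in group_order for offset in within_order]
    return [[row[c] for c in columns] for row in grid]

def is_valid(grid):
    """Check that every row, column and 3x3 box holds the digits 1 to 9."""
    digits = set(range(1, 10))
    rows = all(set(row) == digits for row in grid)
    cols = all({grid[r][c] for r in range(9)} == digits for c in range(9))
    boxes = all(
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)} == digits
        for br in (0, 3, 6) for bc in (0, 3, 6)
    )
    return rows and cols and boxes

def are_disjoint(a, b):
    """True when no cell holds the same number in both grids."""
    return all(a[r][c] != b[r][c] for r in range(9) for c in range(9))

first = base_sudoku()
second = permute_columns(first, group_order=(1, 2, 0), within_order=(1, 2, 0))
third = permute_columns(first, group_order=(2, 0, 1), within_order=(2, 0, 1))

print(is_valid(second), is_valid(third))
print(are_disjoint(first, second), are_disjoint(second, third))
```

Because each row of a sudoku contains nine different numbers, moving every column to a different position is enough to make the new grid disjoint from the original one.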

You can find the code here: make your own colored n-dokus!

The Pleasing Ratio Project

May 15, 2018 | Games, Highcharts, Shiny | highcharter, R, Rstats, shinydashboard, testing | @aschinchon

Music is a world within itself, with a language we all understand (Sir Duke, Stevie Wonder)

This serious man on the left is Gustav Theodor Fechner, a German philosopher, physicist and experimental psychologist who lived between 1801 and 1887. To be honest, I know almost nothing about his life or work except one thing: in the 1860s he carried out a thought-provoking experiment. I find it interesting for two important reasons: he called into question something widely established, and he obtained the experimental data himself.

Fechner’s experiment was simple: he presented just ten rectangles to 82 students. Then he asked each of them to choose the most pleasing one and obtained revealing results, which I will not explain here since that would bias my experiment. You can find more information about the original experiment here.

I have built a project inspired by Fechner’s, which I call The Pleasing Ratio Project. Once you enter the App, you will see two rectangles. Both have the same area; they differ only in their length-to-width ratios. You will then be asked to select the one that seems most pleasing to you. You can do it as many times as you want (all responses are completely anonymous). Every game pits two ratios against each other, each of which can range from 1 to 3.5. In the Results section you will find the percentage of games won by each ratio. In the future, I will officially name the one with the highest percentage The Most Pleasing Ratio of the World.
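
The mechanics of a game can be sketched in a few lines of Python. This is an illustration and not the Shiny code of the App: the function names and the way votes are stored, as (winner, loser) pairs of ratios, are assumptions of mine.

```python
import math

def rectangle(ratio, area=1.0):
    """Length and width of a rectangle with the given ratio and area."""
    width = math.sqrt(area / ratio)
    length = ratio * width
    return length, width

def win_percentages(games):
    """Percentage of games won by each ratio.

    `games` is an iterable of (winner_ratio, loser_ratio) pairs.
    """
    wins, played = {}, {}
    for winner, loser in games:
        for ratio in (winner, loser):
            played[ratio] = played.get(ratio, 0) + 1
        wins[winner] = wins.get(winner, 0) + 1
    return {ratio: 100 * wins.get(ratio, 0) / played[ratio] for ratio in played}

print(rectangle(2.0, area=8.0))
print(win_percentages([(1.5, 1.0), (1.5, 2.0), (2.0, 1.0)]))
```

Fixing the area and deriving the sides from the ratio ensures that the two rectangles of a game differ in shape only, not in size.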

Although my experiment is entirely inspired by Fechner’s, there is an important difference: I can explore a bigger set of ratios by doing an A/B test. This makes it a bit richer.

The experiment also has some interesting technical features:

the use of the shinydashboard package to arrange the App
the use of the shinyjs package to add JavaScript that refreshes the page when the user chooses to play again
saving votes in a text file
reading that file to visualize the results

Will I obtain the same results as Fechner? This is a living project whose results will change over time, so you can check it regularly.

The code of the project is available on GitHub. Thanks a lot for your collaboration!

Pencil Scribbles

April 17, 2018 | Drawings | ggplot2, imager, R, Rstats, TSP | @aschinchon

With the bombs that the braggarts drop, the women of Cádiz make themselves ringlets (Palma y corona, Carmen Linares)

This time I draw Franky again using an algorithm to solve the Travelling Salesman Problem, as I did in my last post. On this occasion, instead of making just one single-line drawing, I overlap many of them (250, to be precise), each of them sampling 400 points from the original image (in my previous post I sampled 8,000 points). The last difference is that I don’t convert the image to pure black and white with the threshold function: now I use the grayscale value of each pixel to weight the sample.
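
Weighting the sample by gray level can be illustrated with this short Python sketch. It is not the imager-based R code of the experiment: the function name is made up, the image is a toy 2x2 grid, and I assume that gray values lie between 0 (black) and 1 (white) and that sampling with replacement is acceptable.

```python
import random

def sample_dark_pixels(gray, n, seed=None):
    """Sample n pixel coordinates, darker pixels being more likely.

    `gray` is a 2D list of values in [0, 1], where 0 is black and 1 is white.
    At least one pixel must be darker than pure white.
    """
    rng = random.Random(seed)
    coords = [(r, c) for r, row in enumerate(gray) for c in range(len(row))]
    weights = [1 - gray[r][c] for r, c in coords]
    return rng.choices(coords, weights=weights, k=n)

toy_image = [
    [1.0, 0.0],
    [1.0, 0.5],
]
print(sample_dark_pixels(toy_image, n=10, seed=42))
```

White pixels get weight zero and never appear in the sample, while mid-gray pixels appear less often than black ones, which is what gives the overlapping paths their shading.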

Once again, I use the ggplot2 package and its magical geom_path to generate the image. The pencil effect is obtained by giving a very high transparency to the lines. This is the result:

I love it when someone else experiments with my experiments, as Mara Averick did:

💼 single-line aRt w/ @aschinchon:
👨‍🎨 "The Travelling Salesman Portrait" https://t.co/MZqXeCleuU #rstats
(you'll never guess which are mine!) pic.twitter.com/CxXzZh7YRR

— Mara Averick (@dataandme) April 5, 2018

Or Erik-Jan van Kesteren:

An image as a travelling salesman problem using #rstats. How did I do it? I didn't: I just cloned a #GitHub repo (https://t.co/u7GDUzsJRu) and put in my own picture. Really nice, follow his blog (https://t.co/rmOqmmsd6T) 👍 pic.twitter.com/zLYXokR6U7

— Erik-Jan van Kesteren (@ejvankesteren) April 12, 2018

You can do it as well, since you will find the code here. Please let me know about your own creations if you do. You can find me on Twitter or by email.

P.S.: Although it may seem otherwise, I’m not obsessed with Frankenstein 🙂

The Travelling Salesman Portrait

April 4, 2018 | Drawings | ggplot2, imager, R, Rstats, TSP | @aschinchon

I have noticed even people who claim everything is predestined, and that we can do nothing to change it, look before they cross the road (Stephen Hawking)

Imagine a salesman and a set of cities. The salesman has to visit each one of the cities, starting from a certain one and returning to the same city. The challenge is to find the route which minimizes the total length of the trip. This is the Travelling Salesman Problem (TSP), one of the most intensively studied questions in computational mathematics. Since you can find a huge number of articles about the TSP on the Internet, I will not give more details about it here.

In this experiment I apply a heuristic algorithm for solving the TSP to draw a portrait. The idea is pretty simple:

Load a photo
Convert it to black and white
Choose a sample of black points
Solve the TSP to calculate a route among the points
Plot the route

The result is a single-line drawing of the image that you loaded. To solve the TSP I used the arbitrary insertion heuristic (Rosenkrantz et al. 1977), which is quite efficient.
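
To give an idea of how arbitrary insertion works, here is a compact Python sketch. It is a plain textbook-style version written for illustration, not the solver used in the experiment: cities are taken in a random order and each one is inserted where it lengthens the current tour the least. The function names and the seed parameter are assumptions of mine, and the code expects at least one point.

```python
import math
import random

def tour_length(points, tour):
    """Length of the closed route that visits `tour` in order."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def arbitrary_insertion(points, seed=0):
    """Build a tour by inserting cities, in arbitrary order, at the cheapest spot."""
    rng = random.Random(seed)
    remaining = list(range(len(points)))
    rng.shuffle(remaining)
    tour = [remaining.pop()]
    while remaining:
        city = remaining.pop()
        best_pos, best_cost = 0, math.inf
        for i in range(len(tour)):
            a = points[tour[i]]
            b = points[tour[(i + 1) % len(tour)]]
            # Extra length caused by visiting `city` between a and b.
            cost = (math.dist(a, points[city]) + math.dist(points[city], b)
                    - math.dist(a, b))
            if cost < best_cost:
                best_pos, best_cost = i + 1, cost
        tour.insert(best_pos, city)
    return tour

square = [(0, 0), (1, 1), (1, 0), (0, 1)]
route = arbitrary_insertion(square, seed=1)
print(route, tour_length(square, route))
```

For a portrait, the points would be the sampled black pixels, and plotting them in the order given by the returned tour produces the single continuous line.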

To illustrate the idea, I have again used this image of Frankenstein (I used it before in this other experiment). This is the result:

You can find the code here.

Mandalas Colored

March 11, 2018 | Drawings | colourlovers, ggplot2, R, Rstats, tesselations, Voronoi | @aschinchon

Squeeze my hand tight, for a bright star is slipping through my fingers (Coda Flamenca, Extremoduro)

I have the privilege of being a teacher at ESTALMAT, a project run by the Spanish Royal Academy of Sciences that tries to detect, guide and continuously stimulate, over two academic years, the exceptional mathematical talent of students aged 12-13. Some weeks ago I gave a class there about the importance of programming. I tried to convince them that learning R or Python is a good investment that always pays off: it will make them enjoy mathematics more and let them see things with their own eyes. The main part of my class was a workshop about Voronoi tessellations in R. We started by drawing points on a circle and finished by drawing mandalas like these ones. You can find the details of the workshop here (in Spanish). It was a wonderful experience to see the faces of the students while they generated their own mandalas.

In that case all the mandalas were empty, ready to be printed and coloured, as my 7-year-old daughter does. In this experiment I colour them. These are the changes I have made to my previous code:

Remove the external segments which intersect the boundary of the enclosing rectangle
Convert the tessellation into a list of polygons with the tile.list function
Use the colourlovers package to fill the polygons with beautiful colour palettes

This is an example of the result:

By changing three simple parameters (iter, points and radius) you can obtain completely different images (click on any image to see its full-size version):

You can find details of these parameters in my previous post. I cannot resist showing more examples:

You can find the code here. Enjoy.

Mandalas

February 14, 2018 | Drawings, Fractals | deldir, ggplot2, R, Rstats, Voronoi | @aschinchon

Mathematics is a place where you can do things which you can’t do in the real world (Marcus Du Sautoy, mathematician)

From time to time I have a look at some of my previous posts: it’s like seeing them through another’s eyes. One of my first posts was this one, where I drew fractals using the Multiple Reduction Copy Machine (MRCM) algorithm. At that time I was not clever enough to write efficient code able to generate deep fractals. Now I am pretty sure I could do it using ggplot, and I had started to do so when I came across the idea of mixing this kind of fractal pattern with Voronoi tessellations, which I have explored in some of my previous posts, like this one. Mixing both techniques, the mandalas appeared.

I will not explain in depth the mathematics behind these patterns. I will just give a brief explanation:

I start by obtaining n equidistant points on a unit circle centered at (0,0)
I repeat the process with all these points, again obtaining n points around each of them; the radius is scaled by a factor
I discard the previous (parent) n points

I repeat these steps iteratively. If I start with n points and iterate k times, at the end I obtain n^k points. After that, I calculate their Voronoi tessellation, which I plot with ggplot.
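
The point generation can be sketched in Python as follows (the Voronoi tessellation itself is not included). This is an illustrative version and not the R code of the post: the function name and the default scaling factor of 0.5 are assumptions of mine.

```python
import math

def mandala_points(n, k, factor=0.5):
    """Generate n**k points by repeatedly placing n points around each parent."""
    points = [(0.0, 0.0)]  # the centre acts as the parent of the first circle
    radius = 1.0
    for _ in range(k):
        # Children replace their parents, so parents are discarded at each step.
        points = [
            (cx + radius * math.cos(2 * math.pi * i / n),
             cy + radius * math.sin(2 * math.pi * i / n))
            for cx, cy in points
            for i in range(n)
        ]
        radius *= factor
    return points

points = mandala_points(n=5, k=3)
print(len(points))
```

With n = 5 and k = 3 the list holds 5^3 = 125 points, which illustrates how quickly the count grows with the number of iterations.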

This is an example:

Some others:

You can find the code here. Enjoy it.

Fatal Journeys: Visualizing the Horror

January 22, 2018 | Highcharts, Maps, The World We Live In, Time Series | highcharter, leaflet, migrants, R, Rstats | @aschinchon

In war, truth is the first casualty (Aeschylus)

I am not a pessimistic person. On the contrary, I always try to look on the bright side of life. I also believe that living conditions are now better than years ago, as these plots show. But reducing the complexity of our world to just six graphs is dangerously simplistic. Our world is quite far from being a fair place, and one example is the immigration drama.

Last year there were 934 incidents around the world involving people looking for a better life, in which more than 5,300 people lost their lives or went missing, 60% of them in the Mediterranean. Around 8 out of 100 were children.

The Missing Migrants Project tracks deaths of migrants, including refugees and asylum-seekers, who have died or gone missing along mixed migration routes worldwide. You can find a huge number of figures, plots and information about this scourge on their website. You can also download a historical dataset there with information about all these fatal journeys, including location, number of dead or missing people and information source, from 2015 until today.

In this experiment I read the dataset and make some plots using highcharter; you can find a link to the R code at the end of the post.

This is the evolution of the number of dead or missing migrants in the process of migration towards an international destination, from January 2015 to December 2017:

The Mediterranean is the zone with the most incidents. To see it more clearly, this plot compares the Mediterranean with the rest of the world, grouping the previous zones:

Is there any pattern in the time series of Mediterranean incidents? To see it, I have done a LOESS decomposition of the time series:

Good news: the trend has been decreasing for the last 12 months. Regarding the seasonal component, incidents increase in April and May. Why? I don’t know.

This is a map of the location of all incidents in 2017. Clicking on markers you will find information about each incident:

Each of us should try to make our world a better place. I don’t really know how to do it, but I will try to run some experiments during this year to show that we have tons of work ahead of us. Meanwhile, I hope this experiment helps to give visibility to this humanitarian disaster. If someone wants to use the code, the complete project is available on GitHub.

Fronkonstin

Experiments in R