Tag Archives: maps

How to Find Equidistant Coordinates Between Two Locations on Earth

Here’s to the ones who dream
foolish, as they may seem
(The Fools Who Dream, ‘La La Land’ OST)

One of the key points of The Meeting Point Locator is to obtain the great circle orthogonal to the bearing defined by any two given locations on Earth. A great circle is the intersection of the sphere with a plane that passes through the center of the sphere; you can think of it as a meridian-like circle that does not have to pass through the poles. The great circle orthogonal to the direction defined by two given locations is the one that passes through all the points equidistant from both of them (at least, that is what I call the orthogonal great circle). This was my first approach to obtain it:

  • Get the midpoint of the two initial locations; let’s call it m
  • Calculate the direction (bearing angle) between the initial locations; let’s call it α
  • Obtain a point very close to m (only 1 meter away) along bearing α+90; let’s call it m'
  • Calculate the great circle that passes through m and m'

This is the code I used in this first approach:

library(dplyr)
library(ggmap)
library(geosphere)
library(leaflet)
library(ggplot2)
library(scales)
library(extrafont)
windowsFonts(Garamond=windowsFont("Garamond"))

#Starting places
place1="Madrid, Spain"
place2="Toledo, Spain"

# Call to Google Maps API to obtain coordinates of Starting places
p1=geocode(place1, output = "latlon")
p2=geocode(place2, output = "latlon")

#Midpoint of p1 and p2
mid=midPoint(p1, p2)

#Direction between p1 and p2
bea=bearingRhumb(p1, p2)

# Great circle between midpoint and 1-meter separated point with bearing bea+90
points=greatCircle(destPoint(p=mid, b=bea+90, d=1), mid, n=100)

# Compute the distance from each point on the circle to the two input locations
data.frame(dist2p1=apply(points, 1, function (x) distGeo(p1, x)),
           dist2p2=apply(points, 1, function (x) distGeo(p2, x))) %>% 
  cbind(points) -> points

opts=theme(
  panel.background = element_rect(fill="gray90"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="white", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text = element_text(colour="gray25", size=6, family = "Garamond"),
  axis.title = element_text(size=10, colour="gray10", family = "Garamond"),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 14, colour="gray10", family = "Garamond"),
  plot.subtitle = element_text(size = 10, colour="gray20", family = "Garamond"))

ggplot(points, aes(x=dist2p1, y=dist2p2), guide=FALSE)+
  geom_abline(intercept = 0, slope = 1, colour = "red", alpha=.25)+
  geom_point(colour="blue", fill="blue", shape=21, alpha=.8, size=1)+
  scale_x_continuous(label=scientific_format())+
  scale_y_continuous(label=scientific_format())+
  labs(title=paste(place1,"and" ,place2, sep=" "),
       subtitle="Equidistant points (2nd approach)",
       x=paste("Distance to" ,place1, "(Km)", sep=" "),
       y=paste("Distance to" ,place2, "(Km)", sep=" "))+opts

#Map
points %>% 
  leaflet() %>% 
  addTiles(urlTemplate = "https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png") %>% 
  addCircleMarkers(
    lng=points$lon, lat=points$lat,
    radius = 6,
    color = "blue",
    stroke = FALSE, fillOpacity = 0.5) %>% 
  addCircleMarkers(
    lng=c(p1$lon, p2$lon), lat=c(p1$lat, p2$lat),
    radius = 6,
    color = "red",
    stroke = FALSE, fillOpacity = 0.5)

I was pretty sure that all the points of this last great circle must be equidistant from the initial locations, but I was wrong. When the starting points are close enough, everything goes well. This is an example with Madrid and Toledo (only 67 kilometers apart) as starting points. The following plot shows the distance to Madrid and to Toledo of 100 points on the great circle obtained as described above:


This map also shows these 100 points (in blue) as well as the starting ones (in red):

Quite convincing. But this is what happens when I choose Tokyo and New York (10,873 km apart) as the starting points:


And the map:

To be honest, I do not know why this happens but, given how well the method works with close starting points, the final solution was simple: bring the starting points closer while preserving the original midpoint. This was my second (and definitive) try:


And the map:

Mission accomplished. The final code:

library(dplyr)
library(ggmap)
library(geosphere)
library(leaflet)
library(ggplot2)
library(scales)
library(extrafont)
windowsFonts(Garamond=windowsFont("Garamond"))

# Starting places
place1="Tokyo, Japan"
place2="New York, USA"

# Call to Google Maps API to obtain coordinates of Starting places
p1=geocode(place1, output = "latlon")
p2=geocode(place2, output = "latlon")

# Midpoint of p1 and p2
mid=midPoint(p1, p2)
# Distance between p1 and p2
dist=distGeo(p1, p2)
# A simple piece of code to bring the starting points closer preserving the original midpoint
x=p1
y=p2
while(dist>1000000)
{
  x=midPoint(mid, x)
  y=midPoint(mid, y)
  dist=distGeo(x, y)
}
# Direction between resulting (close) points
bea=bearingRhumb(x, y)
# Great circle between midpoint and 1-meter separated point with bearing bea+90
points=greatCircle(destPoint(p=mid, b=bea+90, d=1), mid, n=100)

# Compute the distance from each point on the circle to the two input locations
data.frame(dist2p1=apply(points, 1, function (x) distGeo(p1, x)),
           dist2p2=apply(points, 1, function (x) distGeo(p2, x))) %>% 
  cbind(points) -> points

opts=theme(
  panel.background = element_rect(fill="gray90"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="white", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text = element_text(colour="gray25", size=6, family = "Garamond"),
  axis.title = element_text(size=10, colour="gray10", family = "Garamond"),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 14, colour="gray10", family = "Garamond"),
  plot.subtitle = element_text(size = 10, colour="gray20", family = "Garamond"))

ggplot(points, aes(x=dist2p1, y=dist2p2), guide=FALSE)+
  geom_abline(intercept = 0, slope = 1, colour = "red", alpha=.25)+
  geom_point(colour="blue", fill="blue", shape=21, alpha=.8, size=1)+
  scale_x_continuous(label=scientific_format())+
  scale_y_continuous(label=scientific_format())+
  labs(title=paste(place1,"and" ,place2, sep=" "),
       subtitle="Equidistant points (2nd approach)",
       x=paste("Distance to" ,place1, "(Km)", sep=" "),
       y=paste("Distance to" ,place2, "(Km)", sep=" "))+opts

points %>% 
  leaflet() %>% 
  addTiles(urlTemplate = "https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png") %>% 
  addCircleMarkers(
    lng=points$lon, lat=points$lat,
    radius = 6,
    color = "blue",
    stroke = FALSE, fillOpacity = 0.5) %>% 
  addCircleMarkers(
    lng=c(p1$lon, p2$lon), lat=c(p1$lat, p2$lat),
    radius = 6,
    color = "red",
    stroke = FALSE, fillOpacity = 0.5)
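
As a quick check that the 100 points obtained this way are indeed (almost) equidistant from both cities, here is a minimal sketch that reproduces the geometric part of the code above with hardcoded approximate coordinates for Tokyo and New York (so it does not need the Google geocoder) and prints the largest difference between the two distances:

library(geosphere)
p1=c(139.69, 35.69)   # Tokyo (lon, lat), approximate
p2=c(-74.01, 40.71)   # New York (lon, lat), approximate
mid=midPoint(p1, p2)
dist=distGeo(p1, p2)
x=p1
y=p2
# Bring the points closer while preserving the midpoint, as in the final code above
while(dist>1000000)
{
  x=midPoint(mid, x)
  y=midPoint(mid, y)
  dist=distGeo(x, y)
}
bea=bearingRhumb(x, y)
points=greatCircle(destPoint(p=mid, b=bea+90, d=1), mid, n=100)
# Largest asymmetry (in km) between the distances from each point to the two cities;
# it should be tiny compared with the roughly 10,900 km separating them
max(abs(apply(points, 1, function(q) distGeo(p1, q) - distGeo(p2, q))))/1000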

A Segmentation Of The World According To Migration Flows ft. Leaflet

Up in the sky you just feel fine, there is no running out of time and you never cross a line (Up In The Sky, 77 Bombay Street)

In this post I analyze two datasets from Enigma:

  • Migration flows: every 10 years since 1960, the World Bank estimates migration flows worldwide (267,960 rows)
  • World population: values and percentages of population for each nation, beginning in 1960, from the World Bank’s Health, Nutrition and Population project (4,168,185 rows)

Since the second dataset is very large, I load it into R with the fread function of the data.table package, which is extremely fast. To filter the datasets, I also use dplyr and the pipes of the magrittr package (my life changed when I discovered them).

To build a comparable indicator across countries, I divide migration flows (from and to each country) by the mean population of each decade. I do this because migration flows are aggregated by decade since 1960. For example, during the first decade of the 21st century, Argentina received 1,537,850 immigrants, which represents 3.99% of the country’s mean population in that decade. In the same period, immigration to Burundi represented only 0.67% of its mean population.
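
As a toy illustration, this is essentially the arithmetic behind the Argentina figure (the mean population below is approximate, back-derived from the 3.99% quoted above, so take it only as an example):

inmigrants = 1537850    # immigrants received by Argentina during 2000-2009 (from the text)
avg_pop    = 38500000   # approximate mean population of Argentina in that decade (assumed)
round(100*inmigrants/avg_pop, 2)
# [1] 3.99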

What happened in the whole world in that decade? Around 166 million people moved to other countries, which represents 2.58% of the mean population of the world. I use this figure to divide countries into four groups (a compact sketch of the grouping rule follows the list):

  • Isolated: countries with both % of immigrants and % of emigrants under 2.58%
  • Emitter: countries with % of immigrants under 2.58% and % of emigrants over 2.58%
  • Receiver: countries with % of immigrants over 2.58% and % of emigrants under 2.58%
  • Transit: countries with both % of immigrants and % of emigrants over 2.58%
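
For reference, the grouping rule above can be written compactly with dplyr::case_when (a sketch that assumes the columns p.of.inmigrants and p.of.migrants, the 2.58% threshold expressed as a fraction, and a dplyr version that includes case_when; the full code below assigns the groups row by row instead and handles boundary cases slightly differently):

library(dplyr)
classify = function(p_in, p_out, threshold) {
  case_when(
    p_in <  threshold & p_out <  threshold ~ "Isolated",
    p_in <  threshold & p_out >= threshold ~ "Emitter",
    p_in >= threshold & p_out <  threshold ~ "Receiver",
    TRUE                                   ~ "Transit")
}
classify(c(0.01, 0.01, 0.05, 0.05), c(0.01, 0.05, 0.01, 0.05), 0.0258)
# "Isolated" "Emitter" "Receiver" "Transit"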

To create the map I use the leaflet package, as I did in my previous post. The shapefile of the world can be downloaded here. This is how the world looks according to this segmentation:

Migration Flows

Some conclusions:

  • There are just sixteen receiver countries: United Arab Emirates, Argentina, Australia, Bhutan, Botswana, Costa Rica, Djibouti, Spain, Gabon, The Gambia, Libya, Qatar, Rwanda, Saudi Arabia, United States and Venezuela
  • China and India (the two most populous countries in the world) are isolated
  • Transit countries are concentrated in the northern hemisphere and most of them are located in cold latitudes
  • There are six emitter countries with more than 30% of emigrants between 2000 and 2009: Guyana, Tonga, Tuvalu, Jamaica, Bosnia and Herzegovina and Albania

This is the code you need to reproduce the map:

library(data.table)
library(dplyr) 
library(leaflet)
library(rgdal)
library(RColorBrewer)
setwd("YOU WORKING DIRECTORY HERE")
populflows = read.csv(file="enigma-org.worldbank.migration-remittances.migrants.migration-flow-c57405e33412118c8757b1052e8a1490.csv", stringsAsFactors=FALSE)
population = fread("enigma-org.worldbank.hnp.data-eaa31d1a34fadb52da9d809cc3bec954.csv")
# Population
population %>% 
  filter(indicator_name=="Population, total") %>% 
  as.data.frame %>% 
  mutate(decade=(year-year%%10)) %>% 
  group_by(country_name, country_code, decade) %>% 
  summarise(avg_pop=mean(value)) -> population2
# Inmigrants by country
populflows %>% filter(!is.na(total_migrants)) %>% 
  group_by(migration_year, destination_country) %>% 
  summarise(inmigrants = sum(total_migrants))  %>% 
  merge(population2, by.x = c("destination_country", "migration_year"), by.y = c("country_name", "decade"))  %>% 
  mutate(p_inmigrants=inmigrants/avg_pop) -> inmigrants
# Migrants by country
populflows %>% filter(!is.na(total_migrants)) %>% 
  group_by(migration_year, country_of_origin) %>% 
  summarise(migrants = sum(total_migrants)) %>%  
  merge(population2, by.x = c("country_of_origin", "migration_year"), by.y = c("country_name", "decade"))  %>%
  mutate(p_migrants=migrants/avg_pop) -> migrants
# Join of data sets
migrants %>% 
  merge(inmigrants, by.x = c("country_code", "migration_year"), by.y = c("country_code", "migration_year")) %>%
  filter(migration_year==2000) %>% 
  select(country_of_origin, country_code, avg_pop.x, migrants, p_migrants, inmigrants, p_inmigrants) %>% 
  plyr::rename(., c("country_of_origin"="Country", 
                    "country_code"="Country.code", 
                    "avg_pop.x"="Population.mean",
                    "migrants"="Total.migrants",
                    "p_migrants"="p.of.migrants",
                    "inmigrants"="Total.inmigrants",
                    "p_inmigrants"="p.of.inmigrants")) -> populflows2000
# Threshold to create groups
populflows2000 %>% 
  summarise(x=sum(Total.migrants), y=sum(Total.inmigrants), z=sum(Population.mean)) %>% 
  mutate(m=y/z) %>% 
  select(m)  %>% 
  as.numeric -> avg
# Segmentation
populflows2000$Group="Receiver"
populflows2000[populflows2000$p.of.migrants>avg & populflows2000$p.of.inmigrants>avg, "Group"]="Transit"
populflows2000[populflows2000$p.of.migrants<avg & populflows2000$p.of.inmigrants<avg, "Group"]="Isolated"
populflows2000[populflows2000$p.of.migrants>avg & populflows2000$p.of.inmigrants<avg, "Group"]="Emitter"
#Loading shapefile from http://data.okfn.org/data/datasets/geo-boundaries-world-110m 
countries=readOGR("json/countries.geojson", "OGRGeoJSON") 
# Join shapefile and enigma information 
joined=merge(countries, populflows2000, by.x="wb_a3", by.y="Country.code", all=FALSE, sort = FALSE) 
joined$Group=as.factor(joined$Group) 
# To define one color by segment 
factpal=colorFactor(brewer.pal(4, "Dark2"), joined$Group) 
leaflet(joined) %>%
  addPolygons(stroke = TRUE, color="white", weight=1, smoothFactor = 0.2, fillOpacity = .8, fillColor = ~factpal(Group)) %>%
  addTiles() %>%
  addLegend(pal = factpal, values=c("Emitter", "Isolated", "Receiver", "Transit"))

Silhouettes

Romeo, Juliet, balcony in silhouette, makin o’s with her cigarette, it’s juliet (Flapper Girl, The Lumineers)

Two weeks ago I published this post, for which I designed two different visualizations. In the end, I decided to place the words on the map of the United States. The discarded visualization was this other one, where I placed the words over the silhouette of each state:

States In Two Words v1

I do not want to set this chart aside because I really like it and also because I think it is a nice example of the possibilities one has when working with R.

Here is the code. It replaces the fragment headed by “Visualization” in the original post:

library(ggplot2)
library(maps)
library(grid)       # viewports, grid.text and gpar live here
library(gridExtra)
library(extrafont)
opt=theme(legend.position="none",
             panel.background = element_blank(),
             panel.grid = element_blank(),
             axis.ticks=element_blank(),
             axis.title=element_blank(),
             axis.text =element_blank(),
             plot.title = element_text(size = 28))
vplayout=function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
# Open the jpeg device first, then start a new grid page on it
jpeg(filename = "States In Two Words.jpeg", width = 1200, height = 600, quality = 100)
grid.newpage()
pushViewport(viewport(layout = grid.layout(6, 8)))
for (i in 1:nrow(table))
{
  wd=subset(words, State==as.character(table$"State name"[i]))
  p=ggplot() + geom_polygon( data=subset(map_data("state"), region==tolower(table$"State name"[i])), aes(x=long, y=lat, group = group), colour="white", fill="gold", alpha=0.6, linetype=0 )+opt
  print(p, vp = vplayout(floor((i-1)/8)+1, i%%8+(i%%8==0)*8))
  txt=paste(as.character(table$"State name"[i]),"\n is", wd$word1,"\n and", wd$word2, sep=" ")
  grid.text(txt, gp=gpar(font=1, fontsize=16, col="midnightblue", fontfamily="Humor Sans"), vp = viewport(layout.pos.row = floor((i-1)/8)+1, layout.pos.col = i%%8+(i%%8==0)*8))
}
dev.off()

The United States In Two Words

Sweet home Alabama, Where the skies are so blue; Sweet home Alabama, Lord, I’m coming home to you (Sweet home Alabama, Lynyrd Skynyrd)

This is the second post I have written to show the abilities of the twitteR package, and also the second post I have written for KDnuggets. In this case my goal is to get an insight into what people tweet about American states. To do this, I look for tweets containing the exact phrase “[STATE NAME] is” for every state. Once I have the set of tweets for each state, I do some simple text mining: cleaning, standardizing, removing stop words and crossing the rest with these sentiment lexicons. Then I choose the two most common words to describe each state. You can read the original post here. This is the visualization I produced to show the result of the algorithm:

States In Two Words v2

Since the right side of the map is a little bit messy, in the original post you can also see a table with the pair of words describing each state. This is just an experiment to show how to use and combine some interesting tools of R. If you don’t like what Twitter says about your state, don’t take it too seriously.

This is the code I wrote for this experiment:

# Do this if you have not registered your R app in Twitter
library(twitteR)
library(RCurl)
setwd("YOUR-WORKING-DIRECTORY-HERE")
if (!file.exists('cacert.pem'))
{
  download.file(url = 'http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
}
requestURL="https://api.twitter.com/oauth/request_token"
accessURL="https://api.twitter.com/oauth/access_token"
authURL="https://api.twitter.com/oauth/authorize"
consumerKey = "YOUR-CONSUMER_KEY-HERE"
consumerSecret = "YOUR-CONSUMER-SECRET-HERE"
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=requestURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo=system.file("CurlSSL", "cacert.pem", package="RCurl"))
save(Cred, file="twitter authentification.Rdata")
# Start here if you have already your twitter authentification.Rdata file
library(twitteR)
library(RCurl)
library(XML)
load("twitter authentification.Rdata")
registerTwitterOAuth(Cred)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
#Read state names from wikipedia
webpage=getURL("http://simple.wikipedia.org/wiki/List_of_U.S._states")
table=readHTMLTable(webpage, which=1)
table=table[!(table$"State name" %in% c("Alaska", "Hawaii")), ]
#Extract tweets for each state
results=data.frame()
for (i in 1:nrow(table))
{
  tweets=searchTwitter(searchString=paste("'\"", table$"State name"[i], " is\"'",sep=""), n=200, lang="en")
  tweets.df=twListToDF(tweets)
  results=rbind(cbind(table$"State name"[i], tweets.df), results)
}
results=results[,c(1,2)]
colnames(results)=c("State", "Text")
library(tm)
#Lexicons
pos = scan('positive-words.txt',  what='character', comment.char=';')
neg = scan('negative-words.txt',  what='character', comment.char=';')
posneg=c(pos,neg)
results$Text=tolower(results$Text)
results$Text=gsub("[[:punct:]]", " ", results$Text)
# Extract most important words for each state
words=data.frame(Abbreviation=character(0), State=character(0), word1=character(0), word2=character(0), word3=character(0), word4=character(0))
for (i in 1:nrow(table))
{
  doc=subset(results, State==as.character(table$"State name"[i]))
  doc.vec=VectorSource(doc[,2])
  doc.corpus=Corpus(doc.vec)
  stopwords=c(stopwords("english"), tolower(unlist(strsplit(as.character(table$"State name"), " "))), "like")
  doc.corpus=tm_map(doc.corpus, removeWords, stopwords)
  TDM=TermDocumentMatrix(doc.corpus)
  TDM=TDM[Reduce(intersect, list(rownames(TDM),posneg)),]
  v=sort(rowSums(as.matrix(TDM)), decreasing=TRUE)
  words=rbind(words, data.frame(Abbreviation=as.character(table$"Abbreviation"[i]), State=as.character(table$"State name"[i]),
                                   word1=attr(head(v, 4),"names")[1],
                                   word2=attr(head(v, 4),"names")[2],
                                   word3=attr(head(v, 4),"names")[3],
                                   word4=attr(head(v, 4),"names")[4]))
}
# Visualization
require("sqldf")
statecoords=as.data.frame(cbind(x=state.center$x, y=state.center$y, abb=state.abb))
#To make names of right side readable
texts=sqldf("SELECT a.abb,
            CASE WHEN a.abb IN ('DE', 'NJ', 'RI', 'NH') THEN a.x+1.7
            WHEN a.abb IN ('CT', 'MA') THEN a.x-0.5  ELSE a.x END as x,
            CASE WHEN a.abb IN ('CT', 'VA', 'NY') THEN a.y-0.4 ELSE a.y END as y,
            b.word1, b.word2 FROM statecoords a INNER JOIN words b ON a.abb=b.Abbreviation")
texts$col=rgb(sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),max=255)
library(maps)
jpeg(filename = "States In Two Words v2.jpeg", width = 1200, height = 600, quality = 100)
map("state", interior = FALSE, col="gray40", fill=FALSE)
map("state", boundary = FALSE, col="gray", add = TRUE)
text(x=as.numeric(as.character(texts$x)), y=as.numeric(as.character(texts$y)), apply(texts[,4:5] , 1 , paste , collapse = "\n" ), cex=1, family="Humor Sans", col=texts$col)
dev.off()

How Do Cities Feel?

If you are lost and feel alone, circumnavigate the globe (For You, Coldplay)

You cannot consider yourself an R-blogger until you have done a Twitter analysis using the twitteR package. Everybody knows it. So here I go.

Inspired by the fabulous work of Jonathan Harris, I decided to compare the emotions of people living (or tweeting, in this case) in different cities. My plan was to analyse tweets generated in different locations of the USA and the UK with one thing in common: all of them must contain the string “I FEEL”. These are the main steps I followed:

  • Locate the cities I want to analyze using the world cities database of the maps package
  • Download tweets around these locations using the searchTwitter function of the twitteR package
  • Cross the tweets with positive and negative word lists and calculate a simple score for each tweet as number of positive words – number of negative words (see the toy sketch after this list)
  • Calculate how many tweets have a non-zero score; since these tweets put some emotion into words, I call them sentimental tweets
  • Represent the cities in a bubble chart where the x-axis is the percentage of sentimental tweets, the y-axis is the average score and the size of the bubble is the population
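
The toy sketch below illustrates the per-tweet score described in the third step, using a tiny made-up two-word lexicon (the real analysis uses the opinion lexicon linked in the code further down):

library(stringr)
pos.words = c("good")
neg.words = c("sad")
sentence  = "I feel good, not sad, really good"
# Clean, lowercase and split into words, then count lexicon matches
words = unlist(str_split(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+"))
sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
# [1] 1   (two positive matches minus one negative match)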

This is the result of my experiment:

HowDoCitiesFeel3

These are my conclusions (please, do not take them too seriously):

  • USA cities seem to have better vibrations and are more sentimental than UK ones
  • The capital city is the happiest one in both countries
  • San Francisco (USA) is the most sentimental city of the analysis; on the other hand, Liverpool (UK) is the coldest one
  • The more sentimental, the better vibrations

From my point of view, this analysis has some important limitations:

  • It strongly depends on particular events (e.g. the local football team winning a championship)
  • I have no idea what kind of people are behind the tweets
  • In my experience, searchTwitter only works well for a small number of tweets (no more than 300); for larger requests, it tends to return a malformed JSON response error from the server

Anyway, I hope it will serve as a starting point for other analyses in the future. At least, I learned interesting things about R while doing it.

Here you have the code:

library(twitteR)
library(RCurl)
library(maps)
library(plyr)
library(stringr)
library(bitops)
library(scales)
#Register
if (!file.exists('cacert.pem'))
{
  download.file(url = 'http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
}
requestURL="https://api.twitter.com/oauth/request_token"
accessURL="https://api.twitter.com/oauth/access_token"
authURL="https://api.twitter.com/oauth/authorize"
consumerKey = "YOUR CONSUMER KEY HERE"
consumerSecret = "YOUR CONSUMER SECRET HERE"
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=requestURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo=system.file("CurlSSL", "cacert.pem", package="RCurl"))
#Save credentials
save(Cred, file="twitter authentification.Rdata")
load("twitter authentification.Rdata")
registerTwitterOAuth(Cred)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
#Cities to analyze
cities=data.frame(
  CITY=c('Edinburgh', 'London', 'Glasgow', 'Birmingham', 'Liverpool', 'Manchester',
         'New York', 'Washington', 'Las Vegas', 'San Francisco', 'Chicago','Los Angeles'),
  COUNTRY=c("UK", "UK", "UK", "UK", "UK", "UK", "USA", "USA", "USA", "USA", "USA", "USA"))
data(world.cities)
cities2=world.cities[which(!is.na(match(
str_trim(paste(world.cities$name, world.cities$country.etc, sep=",")),
str_trim(paste(cities$CITY, cities$COUNTRY, sep=","))
))),]
cities2$SEARCH=paste(cities2$lat, cities2$long, "10mi", sep = ",")
cities2$CITY=cities2$name
#Download tweets
tweets=data.frame()
for (i in 1:nrow(cities2))
{
  tw=searchTwitter("I FEEL", n=400, geocode=cities2[i,]$SEARCH)
  tweets=rbind(merge(cities[i,], twListToDF(tw),all=TRUE), tweets)
}
#Save tweets
write.csv(tweets, file="tweets.csv", row.names=FALSE)
#Import csv file
city.tweets=read.csv("tweets.csv")
#Download lexicon from http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
hu.liu.pos = scan('lexicon/positive-words.txt',  what='character', comment.char=';')
hu.liu.neg = scan('lexicon/negative-words.txt',  what='character', comment.char=';')
#Function to clean and score tweets
score.sentiment=function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  scores=laply(sentences, function(sentence, pos.word, neg.words) {
    sentence=gsub('[[:punct:]]','',sentence)
    sentence=gsub('[[:cntrl:]]','',sentence)
    sentence=gsub('\\d+','',sentence)
    sentence=tolower(sentence)
    word.list=str_split(sentence, '\\s+')
    words=unlist(word.list)
    pos.matches=match(words, pos.words)
    neg.matches=match(words, neg.words)
    pos.matches=!is.na(pos.matches)
    neg.matches=!is.na(neg.matches)
    score=sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress=.progress)
  scores.df=data.frame(score=scores, text=sentences)
  return(scores.df)
}
cities.scores=score.sentiment(city.tweets[1:nrow(city.tweets),], hu.liu.pos, hu.liu.neg, .progress='text')
cities.scores$pos2=apply(cities.scores, 1, function(x) regexpr(",",x[2])[1]-1)
cities.scores$CITY=apply(cities.scores, 1, function(x) substr(x[2], 1, x[3]))
cities.scores=merge(x=cities.scores, y=cities, by='CITY')
df1=aggregate(cities.scores["score"], by=cities.scores[c("CITY")], FUN=length)
names(df1)=c("CITY", "TWEETS")
cities.scores2=cities.scores[abs(cities.scores$score)>0,]
df2=aggregate(cities.scores2["score"], by=cities.scores2[c("CITY")], FUN=length)
names(df2)=c("CITY", "TWEETS.SENT")
df3=aggregate(cities.scores2["score"], by=cities.scores2[c("CITY")], FUN=mean)
names(df3)=c("CITY", "TWEETS.SENT.SCORING")
#Data frame with results
df.result=join_all(list(df1,df2,df3,cities2), by = 'CITY', type='full')
#Plot results
radius <- sqrt(df.result$pop/pi)
symbols(100*df.result$TWEETS.SENT/df.result$TWEETS, df.result$TWEETS.SENT.SCORING, circles=radius,
        inches=0.85, fg="white", bg="gold", xlab="Sentimental Tweets", ylab="Scoring Of Sentimental Tweets (Average)",
        main="How Do Cities Feel?")
text(100*df.result$TWEETS.SENT/df.result$TWEETS, df.result$TWEETS.SENT.SCORING, paste(df.result$CITY, df.result$country.etc, sep="-"), cex=1, col="gray50")