Twitter + GeoJSON + d3.js + Rob Ford = Fun with data visualizations

Using Twitter, GeoJSON (via GDAL) and d3 to visualize data about Rob Ford

(Note: Gists have been turned into links to avoid too many round trips to github.com, which can be slow at times)

As a citizen of Toronto, the past few weeks (and months) have been interesting to say the least. Our mayor, Rob Ford, has made headlines for his various lewd remarks, football follies, and drunken stupors. With an election coming up in 2014, I was curious as to how the rest of Toronto felt about our mayor.

The Task

Collect data from Twitter, making sure we only use tweets from people in Toronto. Plot those tweets on an electoral map of Toronto’s wards, colour coding the wards based on how they voted in the 2010 Mayoral Election.

The Tools

  1. Repustate’s API (for data mining Twitter and extracting sentiment, of course!)
  2. Shapefiles for the City of Toronto including boundaries for the wards
  3. GDAL
  4. d3.js
  5. jQuery

I’m impatient, just show me the final product …

Alright, here’s the final visualization of Toronto’s opinion of Rob Ford. Take a look and then come back here and find out more about how we accomplished this.

The Process

For those still with me, let’s go through the step-by-step of how this visualization was accomplished.

Step 1: Get the raw data from Twitter

This is a given – if we want to visualize data, first we need to get some! To do so, we used Repustate to create our rules and monitor Twitter for new data. You can monitor social media data using Repustate’s web interface, but this is a project for hackers, so let’s use the Social API to do this. Here’s the code (you first have to get an API key):

https://gist.github.com/mostrovsky/7762449

Alright, so while Repustate gets our data, we can get the geo visualization aspect of this started. We’ll check back in on the data – for now, let’s delve into the world of geographic information systems and analytics.

Step 2: Get the shapefile

What’s a shapefile? Good question! Paraphrasing Wikipedia, a shapefile is a file format that contains vector information describing the layout and/or location of geographic features, like cities, states, provinces, rivers, lakes, mountains, or even neighbourhoods. Thanks to a recent movement to open up data to the public, it’s not *too* hard to get your hands on shapefiles anymore. To get the shapefiles for the City of Toronto, we went to the City’s OpenData website, which has quite a bit of cool data to play with. The shapefile we need, the one that shows Toronto divided up into its electoral wards, is right here. Download it!

Step 3: Convert the shapefile to GeoJSON

So we have our shapefile, but it’s in a vector format that’s not exactly ready to be imported into a web page. The next step is to convert this file into a web-friendly format. To do this, we need to convert the data in the shapefile into a JSON-inspired format called GeoJSON. GeoJSON looks just like normal JSON (it is normal JSON), but it has a spec that defines what a valid GeoJSON object must contain. Luckily, there’s some open source software that will do this for us: the Geospatial Data Abstraction Library, or GDAL. GDAL has *tons* of cool stuff in it (take a look when you have time), but for now we’re just going to use the ogr2ogr command, which takes a shapefile in and spits out GeoJSON. Armed with GDAL and our shapefile, let’s convert the data into GeoJSON:

ogr2ogr \
  -f GeoJSON \
  -t_srs EPSG:4326 \
  toronto.json \
  tcl3_icitw.shp

This new file, toronto.json, is a GeoJSON file that contains the co-ordinates for drawing the polygons representing the 44 wards of Toronto. Let’s take a look at that one-liner just so you know what’s going on here:

  1. ogr2ogr is the name of the conversion utility we’re using; it comes with GDAL
  2. -f GeoJSON specifies the format of the final output
  3. -t_srs EPSG:4326 is the *very* key step: it reprojects/translates the output into the familiar lat/long (WGS 84) co-ordinate system. If you omit this flag (in fact, try omitting it), you’ll see the output stays in the shapefile’s original projection and will require some work from us on the front end, which is unnecessary.
  4. toronto.json is the name of the output file
  5. tcl3_icitw.shp is the input shapefile

Alright, GeoJSON file done.
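Before moving on, it’s worth a quick sanity check that the conversion worked. Here’s a minimal Python sketch that loads toronto.json and peeks at the first ward (the property key names depend on the shapefile’s attribute table, so treat the prints as illustrative):

import json

# Load the GeoJSON produced by ogr2ogr
with open("toronto.json") as f:
    wards = json.load(f)

print(len(wards["features"]))       # should be 44, one feature per ward
first = wards["features"][0]
print(first["properties"])          # ward attributes (key names depend on the shapefile)
print(first["geometry"]["type"])    # "Polygon" or "MultiPolygon"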

Step 4: Download the data

Assuming enough time has passed, Repustate will have found some data for us and we can retrieve it via the Social API:

https://gist.github.com/mostrovsky/7762900

The data comes down as a CSV. It’ll remain an exercise for the reader to store the data in a format you can query and retrieve later, but let’s just assume we’ve stored our data. Included in this data dump will be the latitude and longitude of each tweet(er). How do we know if this tweet(er) is in Toronto? There are a few ways, actually.

We could use PostGIS, an unbelievably amazing extension for PostgreSQL that allows you to do all sorts of interesting queries with geographic data. If we went this route, we could load the original shapefile into a PostGIS enabled database, then for each tweet, use the ST_Contains method to check which tweet(er)s are in Toronto. Another way to do this is to iterate over your dataset and for each one, do a quick point in polygon calculation in the language of your choice. The polygon data can be retrieved from the GeoJSON file we created. Either way you go, it’s not too hard to determine which data originated from Toronto, given specified lat/long co-ordinates.
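If you’d rather not set up PostGIS, here’s a rough Python sketch of the point-in-polygon route, using a plain ray-casting test against the ward outlines in toronto.json. It’s a simplification (it only checks outer rings and treats lat/long as a flat plane, which is fine at city scale), so consider it a starting point rather than production code:

import json

def point_in_ring(lng, lat, ring):
    # Ray-casting test: is (lng, lat) inside this polygon ring?
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        x1, y1 = ring[i][0], ring[i][1]
        x2, y2 = ring[j][0], ring[j][1]
        if ((y1 > lat) != (y2 > lat)) and \
           (lng < (x2 - x1) * (lat - y1) / (y2 - y1) + x1):
            inside = not inside
        j = i
    return inside

def find_ward(lng, lat, wards):
    # Return the properties of the ward containing the point, or None
    for feature in wards["features"]:
        geom = feature["geometry"]
        polys = [geom["coordinates"]] if geom["type"] == "Polygon" else geom["coordinates"]
        for poly in polys:
            if point_in_ring(lng, lat, poly[0]):   # poly[0] is the outer ring
                return feature["properties"]
    return None

with open("toronto.json") as f:
    wards = json.load(f)

print(find_ward(-79.2827, 43.7884, wards))   # the ward containing a sample tweet, or None

For each tweet, a lookup like this tells you both whether it falls inside Toronto at all and which ward it belongs to.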

Step 5: Visualize the data

Let’s see how this data looks. To plot the GeoJSON data and each of the data points from our Twitter data source, we use d3.js. For those not familiar with d3.js (Data-Driven Documents): it’s a JavaScript library that helps you bind data to graphical representations in pure HTML/SVG. It’s a great library for visualizing any data, really, and it makes creating unique, customized maps pretty easy. Best of all, since the output is SVG, you’re pretty much guaranteed that your visuals will look identical on all browsers (well, IE 8 and below don’t support SVG) and mobile devices. Let’s dive into d3.js. We need to plot 2 things:

  1. The map of Toronto, with dividers for each of the wards
  2. The individual tweets, colour coded for sentiment and placed in the correct lat/long location.

The data for the map comes from the GeoJSON file we created earlier. We’re going to tell d3 to load it and for each ward, we’re going to draw a polygon (the “path” node in SVG) by following the path co-ordinates in the GeoJSON file. Here’s the code:

https://gist.github.com/mostrovsky/7776778

Path elements have an attribute, “d”, which describes how the path should be drawn. You can read more about that here. So for each ward, we create a path node and set the “d” attribute to contain the lat/long co-ordinates that, when connected, form the boundaries of the ward. You’ll see we also have custom fill colouring based on the 2010 Mayoral election – that’s just an extra touch we added. You can play around with the thickness of the bordering lines. We’re also adding a mouseover event to show a little tooltip about the ward in question when the user hovers over it.

Time to add the data points. We’re going to separate them by date, so we can add a slider later to show how the data changes over time. We’re also including the sentiment information that Repustate automatically calculates for each piece of social data. Here’s an example of how one complete data point would look:

{
 "lat": 43.788458853157117, 
 "mm": 10,
 "lng": -79.282781549043008,
 "dd": 28,
 "sentiment": "pos"
}

Now to plot these on our map, we again load a JSON file and for each point, we bind a “circle” object, positioned at the correct lat/long co-ordinates. Here’s the code:

https://gist.github.com/mostrovsky/7776799

At this point, if you tried to render the data on screen, you wouldn’t see a thing, just a blank white SVG. That’s because you need to deal with map projections. If you think about a map of the world, there’s no “true” way to render it on a flat, 2D surface like your laptop screen or iPad. This is where projections come in: a projection is the process of placing curved geographic data onto a flat plane. There are many ways to do this, and you can see a list of available projections here; for our example, we used the Albers projection. The last thing we have to do is scale and transform our SVG, and we’re done. Here’s the code for projecting, scaling and transforming our map:

https://gist.github.com/mostrovsky/7776853

Those values are not random – they came about from some trial & error. I’m not sure how to do this properly or in an algorithmic way.

Please take a look at the finished product and of course, if you or your company is looking into doing this sort of thing with your own data set, contact us!

Building a better TripAdvisor

(This is a guest post by Sarah Harmon. Sarah is currently a PhD student at UC Santa Cruz studying artificial intelligence and natural language processing. Her website is NeuroGirl.com)

Click here to see a demo of Sarah’s work in action.

Using Repustate’s API to get the most out of TripAdvisor customer reviews

I’m a big traveler, so I often check online ratings, such as those on TripAdvisor, to decide which local hotel or restaurant is worth my time.  In this post, I use Repustate’s text categorization API to analyze the sentiment of hotel reviews, and – in so doing – work towards a better online hotel rating system.

Five-star scales aren’t good enough

The five-star rating scale is a standard ranking system for many products and services.  It’s a fast way for consumers to check out the quality of something before paying the price, but it’s not always accurate.  Five-star ratings are inherently subjective, don’t let raters say all they want to say, and force a generic labelling for a complex experience.

Asking people to write text reviews solves a few of these problems.  Instead of relying on the five-star scale, we ask people to highlight the most memorable parts of their experience.  Still, who has the time to read hundreds of reviews to get a true sense of what a hotel is like?  Even the small sample of reviews shown on the first page could be unhelpful, unreliable or cleverly submitted by the hotel itself to make themselves look good. What we need is a way to summarize the review content – and ideally, we’d like a summary that’s specific to our own values.

Making a personalized hotel ranking system

Let’s take a look at a website that uses star ratings all the time: TripAdvisor, a travel resource that features an incredible wealth of user-reviewed hotels.

[Screenshot: the TripAdvisor front page]

TripAdvisor uses AJAX-based pagination, which means that if we’d like to access that wealth of data, normal methods of screen-scraping won’t work.  I decided to use CasperJS, a navigation and scripting utility built on top of PhantomJS, to easily scrape the TripAdvisor website with JavaScript.

Here’s a simplified example of how you can use CasperJS to view the star ratings for every listed hotel from New York:

In a similar fashion, I used Python and CasperJS to retrieve a sample of 100 hotels each from five major locations – Italy, Spain, Thailand, the US, and the UK – along with the top 400 English reviews for each hotel as listed on TripAdvisor.  To ensure that each stored review was in English, I relied on Python’s Natural Language Toolkit. Finally, I called on the Repustate API to analyze each review using six hotel-specific categories: food, price, location, accommodations, amenities, and staff.
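NLTK supports a few ways of filtering by language; one lightweight approach (not necessarily the exact recipe used here) is to count how many of a review’s tokens are English stopwords versus the stopwords of other languages:

from nltk.corpus import stopwords            # requires: nltk.download('stopwords')
from nltk.tokenize import wordpunct_tokenize

def looks_english(text):
    # Crude language check: does English claim the most stopword hits?
    tokens = set(w.lower() for w in wordpunct_tokenize(text))
    scores = {}
    for lang in stopwords.fileids():         # 'english', 'french', 'italian', ...
        scores[lang] = len(tokens & set(stopwords.words(lang)))
    return max(scores, key=scores.get) == "english"

print(looks_english("The staff were lovely and the room had a great view."))  # True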

[Screenshot: a sample of the HotelRater demo]

Check out the results in this “HotelRater” demo.  Select a location and a niche you care about, and you’ll see hotels organized in order of highest sentiment for that category.  To generate those results, the sentiment scores for each hotel across each category were averaged, and then placed on a scale from 1 to 100.  (I chose to take a mean of the sentiment values because it’s a value that’s easy to calculate and understand.)  The TripAdvisor five-star rating is shown for comparison.  You can also click on the hotels listed to see how Repustate categorized each of their reviews.
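To make that scoring step concrete, here’s a rough sketch of the averaging and rescaling (the sentiment range of -1 to 1 and the exact rescaling formula are illustrative assumptions, not Repustate’s actual output format):

def rescale(value, lo=-1.0, hi=1.0):
    # Map a sentiment score in [lo, hi] onto a 1-100 scale
    return 1 + 99 * (value - lo) / (hi - lo)

def category_score(reviews, category):
    # Mean sentiment for one category across a hotel's reviews, on a 1-100 scale
    scores = [r[category] for r in reviews if category in r]
    if not scores:
        return None
    return round(rescale(sum(scores) / len(scores)), 1)

# Hypothetical per-review category scores in [-1, 1]
reviews = [{"food": 0.8, "staff": 0.4}, {"food": 0.2, "staff": -0.1}, {"food": 0.6}]
print(category_score(reviews, "food"))    # ~76.9
print(category_score(reviews, "staff"))   # ~57.9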

When I started putting the app I built into practice, I could suddenly make sense of TripAdvisor’s abundance of data.  While a hotel might have only a four-star rating on average, its customers were generally very happy with key aspects, such as its food, staff, and location.  Desirable hotels popped up in my search results that I might never have even seen or considered because of their lower average star rating on TripAdvisor.  The listed sentiment scores also helped to differentiate hotels that I would previously have had trouble sorting through because they all shared the same 4.5-star rating.

This demo isn’t a replacement for TripAdvisor by any means, since there’s hardly enough stored data or options included to assist you in your quest for the perfect hotel on your vacation.  That said, it’s a positive step towards a new ranking system that’s aligned with our individual values.  We can quickly see how people are really feeling, without condensing their more specific thoughts into a blanket statement.

I’d give that five stars any day.

What’s the secret to great customer service? Don’t be a jerk.

One of the most frustrating aspects of dealing with larger companies is the customer service experience. Whether it’s being caught in a game of customer service hot potato (“Please wait while we transfer you for the 8th time”) or being told “I’m sorry, we can’t do that, company policy”, it can permanently ruin a customer’s relationship with a company. While larger companies perhaps feel strict policies and procedures provide a consistent experience and therefore make their business easier to administer in the long run, Repustate has never adopted a one-size-fits-all approach to customer service. Each customer is different, so let’s treat them that way.

Examples of good customer service

The one rule we do hold constant is this: Don’t be a jerk. It’s kind of our version of Google’s (in)famous “Don’t be evil” motto. Here are a few examples of “Don’t be a jerk” in practice:

1) I wouldn’t have believed it until I saw it for myself, but every so often, we get customers who sign up for Repustate’s priced plans, use the plan for the month, then contact us and say they’d like a refund for that month. Crazy, I know, and we would be entirely correct and fair if we said “Are you out of your mind? That’s like getting a bucket of chicken from KFC and returning a bucket full of bones and asking for your money back”. But you know what – we always refund their money. It’s just not worth our time to argue and go back and forth. Also, we’re hoping that our leniency and understanding will result in these customers coming back one day. In our experience, about 50% of the time they do come back at a later date as permanent, paying customers.

2) We often get students contacting us wanting to use Repustate for their research papers. The free plan we offer only provides 1000 API calls per month and often these students need hundreds of thousands of calls to complete their project. We always give students a free plan with unlimited usage for a few weeks on condition that they A) cite Repustate in the paper/project and B) send us the final paper or a link to the site. We like to think it keeps Repustate in good standing with the karma gods, but more seriously, these students will graduate one day, go work somewhere and might recommend Repustate to their employers.

3) Prompt email responses. Sounds like a no-brainer, but Repustate does not send out “Thank you for contacting Repustate” emails. I hate getting those because it gets your hopes up that someone replied, and then you check the contents, and you get sad 🙁 We just reply promptly, always same day, usually within the hour.

4) Honour your existing customers. We recently underwent some price plan changes and modified the account quotas. Now, we could have gone to all of our existing customers and said “Too bad, new quotas in place” but we didn’t. We continue to honour some of our older plans because that’s the right thing to do.

To sum up, just be nice, it’s not that hard. It’ll pay off in the long run, if not the short run as well.

Where does social media stand in the world of healthcare?

A new study from business services firm PricewaterhouseCoopers has found that pharmaceutical and healthcare brands lag behind other businesses when it comes to taking advantage of growth opportunities available through social media channels.

The study found the industry’s executives are behind in social media use when compared to their customers. Out of 124 executives interviewed, half expressed worries about how they were going to integrate social media into their business strategy. Those leaders also weren’t sure how to prove the return on investment, according to the study.

Out of 1,060 consumers that PwC polled, about 42 percent read health-related user reviews on social media sites like Facebook and Twitter. Thirty-two percent of those polled said they accessed information which concerned the health experiences of friends and family. Twenty-nine percent looked for social media users who had had an illness similar to their own, and 24 percent looked at videos and photos uploaded from people currently suffering from a similar illness.

Kelly Barnes, PwC’s US health industries leader, said, “The power of social media for health organizations is in listening and engaging with consumers on their terms. Savvy adopters are viewing social media as a business strategy, not just a marketing tool.”

A few more key stats from the survey included:

- 28% of users supported health-related causes on the web
- 24% uploaded comments giving details about their own health status
- 16% posted reviews of medication
- 15% mentioned health insurers

The study showed that health brands would see huge benefits from creating and monitoring social media channels: 43% of users said they would be likely to share positive experiences about a brand of medication they used, and 38 percent said they would share negative opinions, giving anyone interested a transparent view of how effective a medication is.

Seventy percent of the social media users expected a response from an inquiry on a healthcare company’s social media channels to come back within 24 hours, and 66 percent expected the same response time for a complaint about goods or services.

_________

Social media users had one of their first digital interactions with the medical world in February 2009, when Henry Ford Hospital in Detroit, MI became one of the first hospitals to allow a procedure to be live-tweeted from within the operating room. The feed served as a sort of real-time textbook: other doctors, medical students and anyone else who was curious could follow along as surgeons gave updates on a kidney surgery that removed a cancerous tumour.

While not as directly profitable as a healthcare brand’s marketing use of social media, events like these can generate excitement for the hospital or medical organization, especially when it looks to raise money during fundraising campaigns or attract new patients.

Ever wonder how to do language detection?

A core function that any text analytics package needs is language detection. By language detection, we refer to the following problem:

“Given an arbitrary piece of text of arbitrary length, determine in which language the text was written.”

Might sound simple for a human, assuming you know a thing or two about languages, but we’re talking about computers here. How does one automate this process? Before I dive into the solution, if you want to see this in the wild, go to translate.google.com and check out how Google does it. As soon as you start typing, Google guesses which language you’re typing in. How does it know?

The first thing you’ll need is a large corpus (read: millions of words) in the languages you’re interested in detecting. Now, the words can’t just be random, they should be structured sentences. There are many sources for this, but the easiest is probably Wikipedia. You can download the entire Wikipedia corpus in the language of your choosing. Might take a while, but it’s worth it because the more text you have, the higher the accuracy you’ll achieve.

Next step is to generate n-grams over this corpus. An n-gram is a phrase or collection of words “n” long. So a unigram (1-gram) is one word, a bigram (2-gram) is two words, a trigram (3-gram) is three words, and so on. You probably only need to generate n-grams where n <= 3; anything more will probably be overkill. How do you generate n-grams? Well, using the Repustate API of course. There are other n-gram generators on the internet, just google around. The benefit of using Repustate’s is that ours is blazingly fast, even when you take the network latency into account. Now, as you generate n-grams, you need to store them in a set-like structure. We want a set rather than a list because sets only store unique items and they are much faster for lookups than lists.
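If you’d rather generate them locally, word n-grams are only a few lines of Python (a sketch, with simple whitespace tokenization standing in for a real tokenizer):

def ngrams(text, max_n=3):
    # Yield every word n-gram of the text for n = 1..max_n
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

print(list(ngrams("I love Repustate")))
# ['i', 'love', 'repustate', 'i love', 'love repustate', 'i love repustate']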

I recommend using a bloom filter to store the n-grams. Bloom filters are awesome data structures, learn to use them and love them. OK, all of our n-grams (there will be millions of them per language) are stored in a bloom filter, one filter for each language.
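To show the idea, here’s a toy bloom filter in pure Python – a bit array plus a handful of bit positions derived from an md5 digest. It’s only meant to illustrate how membership testing works; for real workloads you’d reach for an existing library:

import hashlib

class BloomFilter(object):
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from two 64-bit halves of an md5 digest
        digest = hashlib.md5(item.encode("utf-8")).hexdigest()
        h1, h2 = int(digest[:16], 16), int(digest[16:], 16)
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

english = BloomFilter()
english.add("i love")
print("i love" in english)      # True
print("ich liebe" in english)   # False (almost certainly -- false positives are possible)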

Next, we take the text for which we want to detect the language, and generate n-grams over it. Just for kicks, let’s generate n-grams for the sentence “I love Repustate”:

Unigrams: I, love, Repustate

Bigrams: I love, love Repustate

Trigrams: I love Repustate

Simple, right? Now, for each n-gram above, check whether it exists in each of the bloom filters we created before. This is why using as large a corpus as possible is preferable: the more n-grams, the higher the chance of a positive match. The bloom filter that returns the highest number of matches tells you which language you’re dealing with.
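Putting it all together, the scoring step looks something like the sketch below. For readability it uses plain Python sets as the per-language stores; in practice you’d swap in the per-language bloom filters built above, since the membership test (“in”) works the same way:

def ngrams(text, max_n=3):
    # Same helper as before: all word n-grams for n = 1..max_n
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def detect_language(text, language_stores):
    # Return the language whose n-gram store matches the text best
    candidates = ngrams(text)
    scores = {lang: sum(1 for g in candidates if g in store)
              for lang, store in language_stores.items()}
    return max(scores, key=scores.get)

# Tiny stand-in corpora; in reality each store holds millions of n-grams from Wikipedia
stores = {
    "english": set(ngrams("i love text analytics and i love maps")),
    "french": set(ngrams("j'aime l'analyse de texte et j'aime les cartes")),
}
print(detect_language("I love Repustate", stores))  # 'english'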

Repustate has done all the heavy lifting already and if there’s enough demand (basically, if one person asks), we’ll add language detection to our free text analytics API.

 

What people talk about on Facebook

A lot of interesting things … and some not so interesting things.

I’ve been reading a lot of Facebook status updates over the past week or two, trying to get a good grip on the kinds of things people talk about on Facebook and how Repustate can leverage this avalanche of data. While there are a lot of useful nuggets of information, there’s also a lot of nonsense. Here’s a quick summary for those who don’t want to read thousands of other people’s status messages:

    • A lot of people (presumably younger ones) hate their lives, their moms, or both. They also can’t see how their life could be any worse than it already is.
    • A lot of people pray to God/Jesus/Jebus for all sorts of things including: doing well on a test, good weather, and for a boy/girl to reciprocate their feelings
    • Playing off the same religious theme (and people on Facebook sure are preachy!), many people took part in a viral meme where they “Wish Heaven had a phone” so they could send a message to a loved one who’s passed. In fact, of 200K Facebook messages analyzed, more than 9K wished Heaven had a phone. Perhaps AT&T is missing out on a huge new customer base.
    • Many male users are fed up with their female counterparts abusing their financial generosity. Of course, they don’t state it in such terms, but that’s the gist of things.
    • Many female users are fed up with being emotionally abused and “played” by their male counterparts.
    • And an alarmingly large number of people have tried to sell their younger siblings on Facebook (e.g. Does anybody want to buy my younger brother? He’s so annoying!!!!1111)

There you have it. I have scoured the depths of social networking and filtered out the noise so you don’t have to.

You know what, Twitter can actually be useful

Despite the noise and nonsensical hash tags, Twitter actually does contain some useful information. You just need the tools to find it.

When Repustate first opened its doors, we saw a lot of content coming from established media sources. The overwhelming majority of content being pushed through our API by our developer partners came from sites like the New York Times and TechCrunch. But for the social marketer out there, the true value lies in analyzing the user generated mediums, namely Facebook and Twitter.

Some brands use their Twitter feeds as complements to their RSS feed (wrong!), while some celebrities use it to connect with their fan base and establish a more personal connection (right!). But from my personal experience with Twitter, both as a user and as a lurker, its actual use and its potential utility were two completely different things.

By and large, Twitter is a ghetto of content. Annoying chain-letter-like tweets that play on some hash tag subculture (e.g. #thingsyouhateinthemorning) dominate the public stream. Throw in some slang and poor spelling, and you have a breeding ground for incomprehensible dialogue.

But lo and behold, if you’re searching for very specific items, even amongst the sea of crap, you will find some gold. And that’s what we’ve uncovered the past few weeks. Twitter users occasionally produce nuggets of information that a social media marketer would love to know. Complaints about products, feature requests, product comparisons, and opinion mining from their social network are quite commonplace on Twitter. It’s quite stunning and eye opening to see that Twitter is evolving into a focus group that is ripe for dissection.

The ability to create strategic and highly targeted direct marketing campaigns is greatly increased as a result of the information that people volunteer about themselves and the products they buy. The next step for Repustate is to extract the value, organize it, and present it in such a way that the transition from discovery to action is as quick and seamless as possible.

How I almost got to skip my exam and get an A+ (and maybe $1 million)

I coulda been famous, I coulda been a contender.

A PDF came across our Twitter feed today (http://www.scribd.com/doc/35539144/pnp12pt – courtesy of @arnaudsj) and it reminded me of my favourite moment in undergrad. For those who think the PDF is tl;dr – I’ll summarize quickly. The researcher who authored the paper believes he has found a proof to a very difficult problem in computer science for which there is a $1 million prize. For 30-40 years, the best minds in computer science and math have tried to solve this problem. Keep that in mind as I tell you my story.

In my 3rd year of computer science at York University in Toronto, my algorithms professor, Eric Ruppert, was teaching us about P and NP. He told us about the $1 million prize for anyone who could either prove or disprove P == NP. Then he threw this carrot out there: “And if anyone can solve it, they get an automatic A+ and get to skip the exam.” I was getting about a C+ at the time and did not look forward to the exam (with good reason – only 3 people passed it!).

So I went home that night and tried to solve the problem. I don’t remember the details, but I came up with an algorithm that ran in polynomial time and could minimize the time needed to traverse a graph (the Travelling Salesman Problem). I was so excited, I emailed my professor telling him I’d be in his office the next day with good news.

The next day, I strode into his office, like Caesar returning from a victorious battle, expecting to accept my reward and for the crowds of computer science groupies to throw their bras at me. My professor laughed when I told him what I thought I had come up with. So he set about finding a hole in my algorithm. A few minutes went by, and he still couldn’t find one. I saw that he was getting a little nervous. After what seemed like an hour, but was probably only 5 minutes, he *finally* came up with a scenario where my algorithm failed to find the correct solution, and my dreams were dashed. No A+, no automatic deferral of the final exam, and no $1 million prize. But I almost reached the summit of algorithmic proficiency. Alas, like George Costanza, I flew too close to the sun on wings of pastrami.