Using Twitter, GeoJSON (via GDAL) and d3 to visualize data about Rob Ford
(Note: Gists turned into links so as to avoid too many roundtrips to github.com which is slow sometimes)
As a citizen of Toronto, the past few weeks (and months) have been interesting to say the least. Our mayor, Rob Ford, has made headlines for his various lewd remarks, football follies, and drunken stupors. With an election coming up in 2014, I was curious as to how the rest of Toronto felt about our mayor.
Collect data from Twitter, making sure we only use tweets from people in Toronto. Plot those tweets on an electoral map of Toronto’s wards, colour coding the wards based on how they voted in the 2010 Mayoral Election.
- Repustate’s API (for data mining Twitter and extracting sentiment, of course!)
- Shapefiles for the City of Toronto including boundaries for the wards
I’m impatient, just show me the final product …
Alright, here’s the final visualization of Toronto’s opinion of Rob Ford. Take a look and then come back here and find out more about how we accomplished this.
For those still with me, let’s go through the step-by-step of how this visualization was accomplished.
Step 1: Get the raw data from Twitter
This is a given – if we want to visualize data, first we need to get some! To do so, we used Repustate to create our rules and monitor Twitter for new data. You can monitor social media data using Repustate’s web interface, but this is a project for hackers, so let’s use the Social API to do this. Here’s the code (you first have to get an API key):
Alright, so while Repustate gets our data, we can get the geo visualization aspect of this started. We’ll check back in on the data – for now, let’s delve into the world of geographic information systems and analytics.
Step 2: Get the shapefile
What’s a shapefile? Good question! Paraphrasing Wikipedia, shapefiles are a file format that contain vector information describing the layout and/or location of geographic features, like cities, states, provinces, rivers, lakes, mountains, or even neighbourhoods. Thanks to a recent movement to open up data to the public, it’s not *too* hard to get your hands on shape files anymore. To get the shapefiles for the City of Toronto, we went to the City’s OpenData website, which has quite a bit of cool data there to play with. The shapefile we need, the one that shows Toronto divided up into it’s electoral wards, is right here. Download it!
Step 3: Convert the shapefile to GeoJSON
So we have our shapefile, but it’s in a vector format that’s not exactly ready to be imported into a web page. The next step is to convert this file into a web-friendly format. To do this, we need to convert the data in the shapefile into a JSON-inspired format, called GeoJSON. GeoJSON looks just like normal JSON (it is normal JSON), but it has a spec that defines what a valid GeoJSON object must contain. Luckily, there’s some open source software that will do this for us: the Geospatial Data Abstraction Library or GDAL. GDAL has *tons* of cool stuff in it, take a look when you have time, but for now, we’re just going to use the ogr2ogr command that takes a shapefile in and spits out GeoJSON. Armed with GDAL and our shapefile, let’s convert the data into GeoJSON:
This new file, toronto.json, is a GeoJSON file that contains the co-ordinates for drawing the polygons representing the 44 wards of Toronto. Let’s take a look at that one liner just so you know what’s going on here:
- ogr2ogr is the name of the utility we’re using; it comes with GDAL
- Specifying the format of the final output (-f option)
- This step is *very* key, we’re making sure the output is projected/translated into the familiar lat/long co-ordinate system. If you omit this (in fact, try omitting this flag), you’ll see the output will require some work from us on the front end, which is unnecessary.
- The name of the output file
- The input shapefile
Alright, GeoJSON file done.
Step 4: Download the data
Assuming enough time has passed, Repustate will have found some data for us and we can retrieve it via the Social API:
The data comes down as a CSV. It’ll remain an exercise for the reader to store the data in a format that you can query and retrieve thereafter, but let’s just assume we’ve stored our data. Included in this data dump will be the latitude and longitude of each tweet(er). How do we know if this tweet(er) is in Toronto? There’s a few ways actually.
We could use PostGIS, an unbelievably amazing extension for PostgreSQL that allows you to do all sorts of interesting queries with geographic data. If we went this route, we could load the original shapefile into a PostGIS enabled database, then for each tweet, use the ST_Contains method to check which tweet(er)s are in Toronto. Another way to do this is to iterate over your dataset and for each one, do a quick point in polygon calculation in the language of your choice. The polygon data can be retrieved from the GeoJSON file we created. Either way you go, it’s not too hard to determine which data originated from Toronto, given specified lat/long co-ordinates.
Step 5: Visualize the data
- The map of Toronto, with dividers for each of the wards
- The individual tweets, colour coded for sentiment and placed in the correct lat/long location.
The data for the map comes from the GeoJSON file we created earlier. We’re going to tell d3 to load it and for each ward, we’re going to draw a polygon (the “path” node in SVG) by following the path co-ordinates in the GeoJSON file. Here’s the code:
Path elements have an attribute, “d”, which describes how the path should be drawn. You can read more about that here. So for each ward, we create a path node, set the “d” attribute to contain the lat/long co-ordinates that when connected, form the boundaries of the ward. You’ll see we also have custom fill colouring based on the 2010 Mayoral election – that’s just an extra touch we added. You can play around with the thickness of the bordering lines. We’re also adding a mouseover event to show a little tooltip about the ward in question when the user hovers over it. Time to add the data points. We’re going to separate them by date, so we can add a slider later to show how the data changes over time. We’re also including sentiment information that Repustate automatically calculates for each piece of social data. Here’s an example of how one complete data point would look:
Now to plot these on our map, we again load a JSON file and for each point, we bind a “circle” object, positioned at the correct lat/long co-ordinates. Here’s the code:
At this point, if you tried to render the data on screen, you wouldn’t see a thing, just a blank white SVG. That’s because you need to deal with the concept of map projections. If you think about the map of the world, there’s no “true” way to render it on a flat, 2D surface, like your laptop screen or iPad. This has given rise to the concept of projections, which is the process of placing curved data onto a flat plane. There are many ways to do this and you can see a list here of available projections; for our example, we used the Albers projection. Last thing we have to do is scale and transform our SVG and we’re done. Here’s the code for projecting, scaling and transforming our map:
Those values are not random – they came about from some trial & error. I’m not sure how to do this properly or in algorithmic way.
Please take a look at the finished product and of course, if you or your company is looking into doing this sort of thing with your own data set, contact us!