How properly abstracted out code saved our asses.
One of the biggest challenges of this new age of information is Big Data. It’s not just a marketing term; it’s a real concern. If you want to handle tons of data, how do you read, write, and search your data in a quick and scalable fashion? Thankfully, the open source community has sprung forth and numerous options exist today. This is a cautionary tale of why your application layer should be robust enough to work with any Big Data solution.
Like finding a needle in a Haystack
When we started rolling out Repustate’s new backend, we knew we’d be dealing with a lot of data, specifically tweets. That means we’d need something that could write very quickly, but also read quickly. We also needed something that could scale well and would allow us to shard the data, if we needed to. Since Repustate is built on Django, we decided to go with Haystack (pip install django-haystack).
Haystack takes your Django models and adds a layer of natural language search on top for “free”. I say free, but it requires a bit of work. Haystack is awesome in that it’s relatively easy to change which backend you want to use for search. The options are Whoosh, Solr, or Xapian. Having used Xapian before, we went with that.
Haystack works by taking the data from your Postgres table, inserting it into the desired search backend and then re-indexing the search index. The problem is that Postgres can be quite slow to write, especially with transactions. Our worker pools were backlogged with data that needed to be written because we just couldn’t write quickly enough. I’m sure we could have tinkered with some Postgres settings to make writing a bit faster, but it seemed like we could find a better solution quicker. And so we did.
A Solr Eclipse (That’s two Java jokes for the price of one)
Because Django Haystack is written so well, we just had to make a few changes in our Haystack settings, run “update_index” and poof, our Postgres data was now in a Solr index. I’m not the biggest fan of Solr, but this was beautiful, for a while. Searching is a breeze, replication is trivial, and Jetty is a world class server. Java is tried & tested, so it can handle scale. But we again ran into a problem with writing and updating our index. Solr takes a long time to update indices once you get into the millions and millions of documents. We played around with RAM usage and tweaked our Solr schema, but nothing could achieve the performance we needed. So we took another stab at a backend solution: MongoDB.
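For a sense of what “a few changes in our settings” means, here is a rough sketch of a backend swap in a Haystack 2.x-style settings file. This is not our actual configuration: the setting layout depends on your Haystack version, and the path and URL are placeholders I made up for illustration.

```python
# settings.py (sketch only; Haystack 2.x-style layout, values are assumptions)

# Before: Xapian, via the third-party xapian-haystack package.
# HAYSTACK_CONNECTIONS = {
#     "default": {
#         "ENGINE": "xapian_backend.XapianEngine",
#         "PATH": "/path/to/xapian_index",  # hypothetical index location
#     },
# }

# After: Solr. Only the settings change; models and search indexes stay put.
HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.solr_backend.SolrEngine",
        "URL": "http://127.0.0.1:8983/solr",  # assumed local Solr instance
    },
}
```

With the settings switched, the index is normally repopulated by running Haystack’s management command: python manage.py update_index.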
A huMONGOus success.
Tl;dr: MongoDB kicks some serious ass. Writes are fast, reads are blindingly fast and replica sets are super easy to set up and work with. Repustate now processes the Twitter firehose with ease, making data available for analysis almost as fast as it’s being created on Twitter. But the key to all of this tinkering and iteration was abstraction.
First, because of Haystack, we were able to play around with Xapian and then Solr within minutes, because it abstracts out the nitty-gritty and exposes one common search & index interface.
Replacing Haystack with MongoDB again took no more than a few hours because of how we wrote our code. In Repustate, users create Data sources. For example, the Twitter account of Charlie Sheen is a data source. Internally, we have a Postgres table and corresponding Django model for each Data source. A source has a search function that takes several parameters to limit searches. In our Django/Python code, a search would look like this:
>>> import datetime
>>> source = Datasource.objects.get(pk=10)
>>> source.search('all', since=datetime.datetime(2011, 10, 1), until=None, gender='M')
The search method knows which backend it’s using, so every time we’ve switched out backends, we’ve just had to update this one method. The entire Repustate code base relies on this one search method to retrieve the correct data for the correct data source for the given user. The search above retrieves all tweets or Facebook messages since October 1st that were written by males. No mention of MongoDB, no mention of Solr, no mention of any specific backend. Just a simple interface.
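To make the idea concrete, here is a minimal sketch of that kind of interface in plain Python. It is not Repustate’s actual code: the InMemoryBackend class, the document fields, and the meaning I give to the first argument (a message kind, where 'all' means no filtering) are all assumptions for illustration. The point is only that callers talk to search() and never to the backend.

```python
import datetime


class InMemoryBackend:
    """Hypothetical stand-in for a real backend (Xapian, Solr, MongoDB...)."""

    def __init__(self, documents):
        self.documents = documents

    def query(self, source_id, kind=None, since=None, until=None, gender=None):
        # Keep only documents that pass every filter that was supplied.
        results = []
        for doc in self.documents:
            if doc["source_id"] != source_id:
                continue
            if kind is not None and doc["kind"] != kind:
                continue
            if since is not None and doc["created"] < since:
                continue
            if until is not None and doc["created"] > until:
                continue
            if gender is not None and doc["gender"] != gender:
                continue
            results.append(doc)
        return results


class Datasource:
    """Plain-Python stand-in for the Django model described above."""

    def __init__(self, pk, backend):
        self.pk = pk
        self._backend = backend  # the only place a backend is referenced

    def search(self, kind, since=None, until=None, gender=None):
        # Assumption: 'all' means "do not filter by message kind".
        backend_kind = None if kind == "all" else kind
        return self._backend.query(
            self.pk, kind=backend_kind, since=since, until=until, gender=gender
        )


docs = [
    {"source_id": 10, "kind": "tweet", "gender": "M", "text": "winning",
     "created": datetime.datetime(2011, 10, 5)},
    {"source_id": 10, "kind": "tweet", "gender": "M", "text": "too old",
     "created": datetime.datetime(2011, 9, 1)},
    {"source_id": 10, "kind": "tweet", "gender": "F", "text": "wrong gender",
     "created": datetime.datetime(2011, 10, 7)},
]
source = Datasource(pk=10, backend=InMemoryBackend(docs))
results = source.search("all", since=datetime.datetime(2011, 10, 1),
                        until=None, gender="M")
```

Swapping backends then means writing another class with the same query() signature and handing it to the data source; nothing that calls search() has to change.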
Your Comp Sci classes teach you something after all.
By properly refactoring our code, we were able to save ourselves from a big headache as we kept tinkering with our backend. This also has the added benefit of keeping our unit tests intact, as the same assertions still apply. Furthermore, we abstracted out the notion of getting a connection to a database, so even though the search backend kept changing, we only had to update the code in one place.
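The “one place” for connections can be as small as a single function that everything else calls. The sketch below shows one way to do it; every name in it (register_backend, get_search_connection, ACTIVE_BACKEND, FakeConnection) is hypothetical, and the fake connection class stands in for whatever client object your real backend gives you.

```python
from typing import Callable, Dict

# Registry of connection factories, keyed by backend name.
_FACTORIES: Dict[str, Callable[[], object]] = {}
# Connections that have already been created, so each is built only once.
_CONNECTIONS: Dict[str, object] = {}


def register_backend(name, factory):
    """Associate a backend name with a zero-argument connection factory."""
    _FACTORIES[name] = factory


def get_connection(name):
    """Return the cached connection for `name`, creating it on first use."""
    if name not in _CONNECTIONS:
        _CONNECTIONS[name] = _FACTORIES[name]()
    return _CONNECTIONS[name]


class FakeConnection:
    """Placeholder for a real client object; it only remembers its label."""

    def __init__(self, label):
        self.label = label


register_backend("solr", lambda: FakeConnection("solr"))
register_backend("mongodb", lambda: FakeConnection("mongodb"))

# The single line to edit when the backend changes (assumed setting name).
ACTIVE_BACKEND = "mongodb"


def get_search_connection():
    """What the rest of the code base calls; it never names a backend."""
    return get_connection(ACTIVE_BACKEND)
```

Unit tests can register a fake backend the same way, which is a big part of why the same assertions keep working after a swap.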
And now, time to enjoy some MongoDB-supported social media analytics.