Repustate – social media analytics redux

Making social media analytics easier than ever.

A new beginning

Repustate has relaunched with a new & improved user interface to help marketers, researchers and those who are just so darn curious about social media. This new interface augments our already popular API, which has been helping our developer partners make heads or tails of the streams of data they encounter.

What can you do with the new Repustate

There’s a truckload of information being generated every day. How do you keep tabs on it? How do you know who has a question about your product and who your biggest advocates are? How do you assess whether or not your social media campaigns are having their intended effect? Repustate makes this all automatic and easy by doing the following:

    • Automatically categorizing tweets & Facebook messages by their sentiment and semantic meaning
    • Aggregating data from different data sources with different filtering criteria
    • Applying filters through a simple drag-and-drop interface to view your data through specific dimensions
    • Making exporting your data and reporting in PowerPoint or PDF a one-click cinch
    • Generating beautiful reports in HTML5 and sharing them with your social network

On the horizon
Without question, the most requested feature has been natural language analytics in other languages. We’re working on this feverishly to allow our partners to analyze social media across the globe. Stay tuned for more information.

Lastly, we’ll be adding tutorial videos to our site to help our customers see the many ways they can use Repustate to expand their services to their clients.

Until next time.

A case study in abstraction

How properly abstracted out code saved our asses.

One of the biggest challenges of this new age of information is Big Data. It’s not just a marketing term; it’s a real concern. If you want to handle tons of data, how do you read, write, and search your data in a quick and scalable fashion? Thankfully, the open source community has stepped up, and numerous options exist today. This is a cautionary tale of why your application layer should be robust enough to work with any Big Data solution.

Like finding a needle in a Haystack

When we started rolling out Repustate’s new backend, we knew we’d be dealing with a lot of data, specifically tweets. That means we’d need something that could write very quickly, but also read quickly. We also needed something that could scale well and would allow us to shard the data, if we needed to. Since Repustate is built on Django, we decided to go with Haystack (pip install django-haystack).

Haystack takes your Django models and adds a layer of natural language search on top for “free”. I say free, but it requires a bit of work. Haystack is awesome in that it’s relatively easy to change which backend you want to use for search. The options are Whoosh, Solr, or Xapian. Having used Xapian before, we went with that.

Haystack works by taking the data from your Postgres table, inserting it into the desired search backend and then re-indexing the search index. The problem is that Postgres can be quite slow to write, especially with transactions. Our worker pools were backlogged with data that needed to be written because we just couldn’t write quickly enough. I’m sure we could have tinkered with some Postgres settings to make writing a bit faster, but it seemed like we could find a better solution quicker. And so we did.

A Solr Eclipse (That’s two Java jokes for the price of one)

Because Django Haystack is written so well, we just had to make a few changes in our Haystack settings and call “update_index()” and poof, our Postgres data was now in a Solr database. We’re not the biggest fans of Solr, but this was beautiful – for a while. Searching is a breeze, replication is trivial, and Jetty is a world class server. Java is tried & tested, so it can handle scale. But we again ran into a problem with writing and updating our index. Solr takes a long time to update indices once you get into the millions & millions of documents. We played around with RAM usage and tweaked our Solr schema, but nothing could achieve the performance we needed. So we took another stab at a backend solution: MongoDB.
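To give a flavour of how small that change is: in modern versions of django-haystack, the backend lives in one settings dictionary. This is a sketch based on the django-haystack 2.x docs (the URL and the exact engine path for your Xapian or Solr install may differ):

```python
# settings.py (sketch): swapping search backends is a one-dict change.
# The ENGINE path below is django-haystack's bundled Solr engine; before
# the switch, this entry would have pointed at the Xapian engine instead.
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.solr_backend.SolrEngine',
        'URL': 'http://127.0.0.1:8983/solr',  # assumed local Solr/Jetty
    },
}
```

After changing the setting, re-populating the index is just `./manage.py update_index` – the application code never has to know which backend is behind it.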

A huMONGOus success.

Tl;dr: MongoDB kicks some serious ass. Writes are fast, reads are blindingly fast, and replica sets are super easy to set up and work with. Repustate now processes the Twitter firehose with ease, making data available for analysis almost as fast as it’s being created on Twitter. But the key to all of this tinkering and iteration was abstraction.

First, because of Haystack, we were able to play around with Xapian and then Solr within minutes; it abstracted out the nitty-gritty and exposed one common search & index interface.

To replace Haystack with MongoDB, again it took no more than a few hours because of how we wrote our code. In Repustate, users create data sources. For example, the Twitter account of Charlie Sheen is a data source. Internally, we have a Postgres table and corresponding Django model for each data source. A source has a search function that takes several parameters to limit searches. In our Django/Python code, a search looks like this:

>>> source = Datasource.objects.get(pk=10)
>>> source.search('all', since=datetime.datetime(2011, 10, 1), until=None, gender='M')

The search method knows which backend it’s using, so every time we’ve switched out backends, we just have to update this one method. The entire Repustate code base relies on this one search method to retrieve the correct data for the correct data source for the given user. The search above retrieves all tweets or Facebook messages since October 1st that were written by males. No mention of Mongo, no mention of Solr, no mention of any specific backend. Just a simple interface.
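The pattern above can be sketched in plain Python. This is not Repustate’s actual code – the class and method names are illustrative – but it shows the shape of the abstraction: callers only ever touch `search()`, and the backend is one swappable attribute.

```python
# A minimal sketch of the "one search method" pattern (illustrative names).
import datetime

class SolrBackend:
    def query(self, kind, **filters):
        # Real code would issue a Solr query here.
        return ['solr result for %s' % kind]

class MongoBackend:
    def query(self, kind, **filters):
        # Real code would run a MongoDB find() here.
        return ['mongo result for %s' % kind]

class Datasource:
    # Switching storage engines means changing only this one attribute;
    # every caller of search() is untouched.
    backend = MongoBackend()

    def search(self, kind, since=None, until=None, gender=None):
        # Callers never see the backend; they just get results back.
        return self.backend.query(kind, since=since, until=until, gender=gender)

source = Datasource()
results = source.search('all', since=datetime.datetime(2011, 10, 1), gender='M')
```

Because the unit tests assert only on what `search()` returns, they keep passing no matter which backend class is plugged in.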

Your Comp Sci classes teach you something after all.

By properly refactoring our code, we were able to save ourselves from a big headache as we kept tinkering with our backend. This also had the added benefit of keeping our unit tests intact, as the same assertions still applied. Furthermore, we abstracted out the notion of getting a connection to a database, so even though the search backend kept changing, we only had to update the code in one place.
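That connection abstraction can be sketched the same way. Again, these names are assumptions for illustration, not the real code; the point is that the rest of the code base asks a single factory for “a connection” and never imports a driver directly.

```python
# Sketch of a single connection factory (illustrative; real code would
# return a pymongo.MongoClient or a Solr client here).
def get_connection(backend='mongo'):
    if backend == 'mongo':
        # e.g. return pymongo.MongoClient(host, port)
        return {'backend': 'mongo'}
    elif backend == 'solr':
        # e.g. return a pysolr-style client pointed at Jetty
        return {'backend': 'solr'}
    raise ValueError('unknown backend: %r' % backend)

conn = get_connection()
```

When the search backend changed from Solr to MongoDB, only the body of `get_connection()` had to change.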

And now, time to enjoy some MongoDB-supported social media analytics.