From Python to Go: migrating our entire API

The tl;dr Summary

If you want to know the whole story, read on. But for the impatient out there, here’s the executive summary:

  • We migrated our entire API stack from Python (first Django, then Falcon) to Go, reducing the mean response time of an API call from 100ms to 10ms
  • We reduced the number of EC2 instances required by 85%
  • Because Go compiles to a single static binary and because Go 1.5 makes cross compilation a breeze, we can now ship a self-hosted version of Repustate that is identical to the one we host for customers. (Previously we shipped virtual machine appliances to customers which was a support nightmare)
  • Due to the similarity between Python and Go, we were able to quickly re-purpose our unit tests written in nose to fit the structure that Go requires with just a few simple sed scripts.


Repustate provides text analytics services to small businesses, large enterprises and government organizations the world over. As the company has grown, so too has the strain on our servers. We process anywhere from 500 million to 1 billion pieces of text EACH day. Text comes in the form of tweets, news articles, blog comments, customer feedback forms and anything else our customers send our way. This text can be in any of the 9 languages we support, so there’s that to consider as well, since some languages tend to be more verbose than others (ahem, Arabic).

Text analytics is tough to do at scale since you can’t really leverage caching as much as you could in, say, serving static content on the web. Seldom do we analyze the exact same piece of text twice so we don’t bother maintaining any caches – which means each and every request we get is purely dynamic.

But the key insight is that much of text analysis can be done in parallel. Consider the task of running text through a part-of-speech tagger. For the most part, part-of-speech tagging algorithms use some sort of probabilistic modelling to determine the most likely tag for a word. But these probability models don’t cross sentence boundaries; the grammatical structure of one sentence doesn’t affect another. This means that given a large block of text, we can split it up into sentences and then analyze each sentence in parallel. The same strategy can be employed for sentiment as well.

So what’s wrong with Python?

The first version of our API was built with Django because, well, everyone knew Django, our site already ran on Django, so why not? And it worked. We got a prototype up and running and then built on top of that. We were able to get a profitable business up and running just on Django (and an old version at that – we were still on 1.3 when 1.6 was already out!).

But there’s a lot of overhead in each Django request/response cycle. As our API grew in usage, so too did our reliability issues and our Amazon bill. We decided to look at other Python alternatives, and Flask came up. It’s lightweight and almost ready-made for APIs, but then we came across Falcon. We liked Falcon because it was optimized right off the bat using Cython. Simple benchmarks showed that it was *much* faster than Django, and we liked how it enforced clean REST principles. As a bonus, our existing tests could be ported over quite easily, so we didn’t lose any time there.

Falcon proved to be a great stopgap. Our mean response time fell, and the number of outages and support issues fell, too. I’d recommend Falcon to anyone building an API in Python today.

The performance, while better than Django’s, still couldn’t keep up with our demand. In particular, Python is a world of pain for doing concurrency. We were on Python 2.7 so we didn’t check out the new asyncio package in Python 3, but even then, you still have the GIL to worry about. Also, Falcon still didn’t solve one other major pain point: self-hosted deployment.

Python does not lend itself to being packaged up neatly and distributed the way Java or C does. Many of our customers run Repustate within their own networks for privacy & security reasons. Up to this point, we had been deploying our entire stack as a virtual appliance that works with either VMware or VirtualBox. That was an OK solution, but it was clunky. Updates were a pain, support was a pain (“how do I know the IP address of my virtual machine?”) and so on. If we could provide Repustate as a single, installable binary that was the exact same code base as our public API, we’d have the best of both worlds. This ideal solution also had to be even faster than our Python version on Falcon, which meant leveraging the idea that text analytics lends itself to concurrent processing.

Go get gopher

Taking a step back in our story – our Arabic engine was written in this fancy new (at the time) language called Go. Here’s the blog post where we talk about our experience migrating that code base to Go, but suffice it to say, we were quite happy with it. The ideal solution was staring us right in the face – we had to port everything to Go.

Go met all of our criteria:

  • faster than Python
  • compiles to a single binary
  • can be deployed to any operating system (and since Go 1.5, very easily at that)
  • makes concurrency trivial to reason about

As an added bonus, the layout of a Go test suite looks pretty similar to our nose tests. Test function headers were simple enough to migrate over e.g.:

def test_my_function():

becomes this:

func TestMyFunction(t *testing.T) {

With a couple of replacements of “=” to “:=” and single quotes to double quotes, we had Go-ready tests.
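
The mechanical part of that rewrite is easy to script. Here’s a rough sketch in Python of the kind of find-and-replace pass we mean (the real thing was a few sed one-liners, and this toy version only handles the trivial cases – anything subtle still needs a human):

    import re
    import sys

    def port_test_source(source):
        # def test_my_function():  ->  func TestMyFunction(t *testing.T) {
        def to_go_header(match):
            name = "".join(part.title() for part in match.group(1).split("_"))
            return "func Test%s(t *testing.T) {" % name

        source = re.sub(r"def test_(\w+)\(\):", to_go_header, source)
        source = source.replace(" = ", " := ")  # assignment -> short variable declaration
        source = source.replace("'", '"')       # single quotes -> double quotes
        return source

    if __name__ == "__main__":
        sys.stdout.write(port_test_source(sys.stdin.read()))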

Because goroutines and channels are so easy to work with, we were able to finally realize our dream of analyzing text in parallel. On a beefy machine with, say, 16 cores, we could just blast our way through text by chunking large pieces into smaller ones and then reconstituting the results on the other end, e.g.:

    chunks := s.Chunks(tws)
    channel := make(chan *ChunkScoreResult, len(chunks))
    for _, chunk := range chunks {
        go s.ScoreChunk(chunk, custom, channel)
    }

    // Now loop until all goroutines have finished.
    chunkScoreResults := make([]*ChunkScoreResult, len(chunks))
    var r *ChunkScoreResult
    for i := 0; i < len(chunks); i++ {
        r = <-channel
        chunkScoreResults[i] = r
    }

This code snippet shows us taking a slice of chunks of text, “Scoring” them using goroutines, and then collecting the results by reading from the channel one by one. Each ChunkScoreResult contains an “order” attribute which allows us to re-order things once we’re done. Pretty simple.

The entire port took about 3 months and resulted in several improvements unrelated to performance, since the port forced the team to go through the Python code again. As an aside, it’s always a good idea, time permitting, to go back and look at some of your old code. You’d be surprised at how bad it can be. The old “what the heck was I thinking when I wrote this?” sentiment was felt by all.

We now have one code base for all of our customers that compiles to a single binary. No more virtual appliances. Our deployment process is just a matter of downloading the latest version of our binary.

Concluding remarks

The one thing writing code in a language like Go does is make you very aware of how memory works. Languages like Python or Ruby often seduce you into ignoring what’s going on under the hood because it’s just so easy to do pretty complex things, but languages like Go and C don’t hide that. If you’re not accustomed to that way of thinking, it takes some getting used to (how will the memory be allocated? Am I creating too much garbage? When does the garbage collector kick in?), but it makes your software run that much more smoothly and, to be honest, makes you a better Python programmer, too.

Go isn’t perfect and there’s no shortage of blogs out there that can point out what’s wrong with the language. But if you write Go as it is intended to be written, and leverage its strengths, the results are fantastic.

Go – duplicate symbols for architecture x86_64

This is a short blog piece really intended for fellow Go developers who stumble upon the dreaded “duplicate symbols” error.

Currently, some of Repustate’s Go code uses cgo to talk to various C libraries. It’s a stopgap until we finish porting all of the C code to pure Go. While writing some tests, we hit this error:

“ld: 1 duplicate symbol for architecture x86_64”

(note: if you had more than 1 duplicate, it would tell you exactly how many)

What does this mean? It means we’re trying to link the same symbol name (in our case, a method) from two (or more) different source files. The fix was easy: rename one of the methods by updating the header file and the source file (.c or .cpp), and then update any references to the symbol in your Go code, if it’s referenced there directly.

Smooth sailing from here on in.

Named entity recognition in Russian

Named entity recognition is now available in Russian via the Repustate API. Combined with Russian sentiment analysis, customers can now do full text analytics in Russian with Repustate.

Repustate is happy to add Russian named entity recognition to the solutions already available in English & Arabic. But like every language, Russian has its nuances, which made named entity recognition a bit tougher than, say, English.

Consider the following sentence:
Путин дал Обаме новую ядерную соглашение

In English, this says:
Putin gave Obama the new nuclear agreement

This is how Barack Obama is written in Russian:
Барак Обама

Notice that in our sentence, “Obama” is spelled with a different suffix. That’s because in Russian, nouns (even proper nouns) are declined based on their grammatical case. In this example, Obama is in what’s called the “dative” case, meaning the noun is the recipient of something. English has no concept of declining nouns this way; English only changes a noun’s suffix for pluralization.

So Repustate has to know how to stem proper nouns as well in order to properly identify “Обаме” as Barack Obama during the contextual expansion phase.
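
As a toy illustration of the idea (emphatically not Repustate’s actual stemmer – the suffix table below is a made-up fragment), mapping an inflected form back to its dictionary form before the entity lookup might look like this:

    # Toy example only: a real Russian lemmatizer handles far more cases,
    # genders and exceptions than this made-up suffix table does.
    KNOWN_ENTITIES = {"Обама": "Barack Obama"}

    # Map a few case endings back to the nominative for -а surnames
    # (dative "Обаме", accusative "Обаму", genitive "Обамы" -> "Обама").
    CASE_SUFFIXES = {"е": "а", "у": "а", "ы": "а"}

    def naive_nominative(token):
        if token not in KNOWN_ENTITIES and token[-1] in CASE_SUFFIXES:
            return token[:-1] + CASE_SUFFIXES[token[-1]]
        return token

    print(KNOWN_ENTITIES.get(naive_nominative("Обаме")))  # -> Barack Obama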

These are the sorts of problems we have solved so you don’t have to. Try out the Semantic API and let us know what you think.

Russian sentiment analysis: our newest language

Russian sentiment analysis is finally here – by popular demand

We’re often asked to implement text analytics in particular languages by customers, but no language has received as many requests as Russian. Some requests date back to 3 years ago!

We’re happy to announce that Russian sentiment analysis is now open to all to use. Repustate has been testing Russian text analytics in a private beta with select customers and the results have been great. Now all of our customers can analyze Russian using our API.

Russian semantic analysis, including named entity extraction and theme classification, will soon be available as well, completing the loop on full-blown text analytics in Russian.

Try out Repustate’s Russian sentiment analysis now on the Repustate API demo page.

быть здоровым! (Be healthy!)

Dashboard design: it’s all about communication

Dashboard design is about how best to communicate the stories the underlying data is trying to tell you.

Every good B2B product these days has to provide some sort of dashboard to let its customers see their data from a higher level. Dashboards should provide a narrative of what’s happening under the hood – so why do we make people do all the work? Repustate’s old dashboard for social media analytics flat out sucked. It was clunky, ugly, wasted valuable screen real estate on non-essential items, and had very low engagement levels as a result. So we decided to redesign it.

Thank you, Stephen Few

The first step was to get a copy of Stephen Few’s book, Information Dashboard Design. Published in 2006, it covers a variety of topics related to dashboard design and presenting information. Although some of the screen shots in the book are dated, the same principles that applied then still apply now. Here are the main takeaways we got from this great book:

  1. No clutter – remove any elements that don’t add value
  2. Use colours to communicate a specific idea, not just for fun
  3. Place the most important graphs/numbers/conclusions at the top
  4. Provide feedback where possible – make the dashboard tell a story

If you read those 4 things, you’re probably thinking, “Well, duh, that’s obvious.” And yet, take a look at the dashboards you use every day; they’re loaded with visual noise that is irrelevant, poorly placed, or difficult to make heads or tails of. Here’s a screen shot of Repustate’s new dashboard:

[Image: Repustate’s new dashboard]

Here’s a run down of various features and attributes of this new design:

  1. Minimal “chrome” (visual clutter) on the page. The sidebar can slide out, but it’s closed by default, and each of the menu items is also a hover menu, so you never have to open the menu itself to see what each item means.
  2. The Repustate logo is in the sidebar menu and is visible only when the menu is expanded. At this point, does a user really need to be reminded of what they’re using? They know. Hide your logos.
  3. Top left area of the page provides a common action – download the raw data. No wasted space.
  4. The most important information for our users is the sentiment trend over time, so that goes at the top. We colour-coded the “panels” according to the sentiment and also show a background image (thumbs up, thumbs down, question mark) depending on the value. The visuals align with the data itself.
  5. Summary statements tell the user exactly what’s happening – don’t make them guess. We have a set of rules that determines which sentences appear when. Within seconds, our users know exactly what’s going on – their dashboard is communicating with them directly. You wouldn’t believe how much engagement increases as a result.
  6. Easy to read graphs & lists. Minimal clutter and minimal labelling. If your graphs and charts are done properly, the labelling of data points should be unnecessary.

In less than 20 seconds, any executive will be able to glean exactly what’s going on. With more drilling down and clicking some links, they can get deeper insights, but the point is that to get the “executive summary”, they shouldn’t have to. We’re telling them if they’re doing well or not. We’re telling them what’s changed since last time. Ask yourself what information your dashboards could communicate, and then make them do so. Your customers will love you for it.

SurveyMonkey analytics supercharged with Semantic & Sentiment analysis

SurveyMonkey API + Repustate text analytics = insightful open ended responses

We recently worked with a new entrant into the healthy snack foods business who wanted to understand the market they were getting into. Specifically, they wanted to know the following:

  1. Which foods do people currently eat as their “healthy snack”?
  2. Which brands do consumers think of when they hear the word “snack”?
  3. Was there anything about the current selection of snack foods that consumers didn’t like?
  4. If you were having friends or family over for a casual get together (summer BBQ, catching up etc.) what kinds of snacks would you serve?

With these goals in mind, a survey was created using SurveyMonkey and distributed to the new entrant’s target market via their newsletter. (Protip: before you launch a product, build up a mailing list. It works to 1) validate your idea and 2) tighten the feedback loop on product decisions.) A telemarketing service was also employed to call people and ask them the same questions. Those responses were transcribed and sent to Repustate so the same analysis could be performed.

OK, so that’s the easy part; thousands of responses were collected. But the responses were what the market research industry calls “open ended”, meaning they were free-form text as opposed to a multiple-choice list. The reason was that this brand didn’t want to introduce any bias into the survey by prompting respondents with possible answers. For example, take question #2 from above. If presented with a list of snacks, a respondent might say to themselves, “Oh yeah, I forgot about brand X,” and check it off as a snack they think of, when in reality that brand’s product had slipped off their radar. Open-ended responses test how deeply ingrained a particular idea, concept or, in this case, brand is within a person’s consciousness.

But open-ended responses pose a challenge – how do you data mine them en masse and aggregate the results into something meaningful? If you have a few hundred responses to read, maybe you hire a few interns. But what about when you have tens of thousands? That’s where Repustate comes in.

Use the APIs, Luke

Fortunately, SurveyMonkey has a pretty simple-to-use API. Combined with Repustate’s even easier-to-use API, you can go from open-ended response to data-mined text in seconds. Here’s a code snippet that provides a good blueprint for how one can marry these two APIs together. While some details have been omitted, it should be relatively straightforward to adapt it to your needs:
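
(The endpoint paths, parameter names and response fields in this sketch are illustrative assumptions rather than the exact SurveyMonkey or Repustate calls – check both APIs’ documentation before adapting it.)

    import requests

    SURVEYMONKEY_TOKEN = "your-surveymonkey-token"  # placeholder
    REPUSTATE_API_KEY = "your-repustate-api-key"    # placeholder

    def open_ended_answers(survey_id):
        """Pull the free-form text answers for one survey (hypothetical endpoint/fields)."""
        resp = requests.get(
            "https://api.surveymonkey.com/v3/surveys/%s/responses/bulk" % survey_id,
            headers={"Authorization": "Bearer %s" % SURVEYMONKEY_TOKEN},
        )
        resp.raise_for_status()
        for response in resp.json().get("data", []):
            for answer in response.get("answers", []):
                if answer.get("text"):
                    yield answer["text"]

    def extract_entities(text):
        """Run one answer through a named-entities call (hypothetical Repustate path)."""
        resp = requests.post(
            "https://api.repustate.com/v3/%s/entities.json" % REPUSTATE_API_KEY,
            data={"text": text, "lang": "en"},
        )
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        for answer in open_ended_answers("123456789"):
            result = extract_entities(answer)
            # Store (answer, result) in your backend of choice here.
            print(answer, result.get("entities", {}))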

So with very few lines of Python code, we’ve grabbed the open-ended responses, processed them through the named entities API call, and can store the results in our backend of choice. Let’s take a look at a sample response and see how Repustate categorized it.

Q: If you were having friends or family over for a casual get together (summer BBQ, catching up etc.) what kinds of snacks would you serve?

A: I usually put veggies out, like carrots, celery, cucumbers etc. etc. and maybe some dip like hummus and crackers.

Running that response through the Repustate API yields this information:

  "themes": [
  "entities": {
    "celery": "food.vegetable",
    "crackers": "food.other",
    "carrots": "food.vegetable",
    "cucumbers": "food.fruit",
    "hummus": "food.other"
  "status": "OK",
  "expansions": {}

Armed with this analysis, we then aggregated the results to see which categories of food and which brands were mentioned most frequently. This helped our client understand who they were competing against.


As it turns out, it was plain old vegetables that were the biggest competition for this new entrant, which is a double-edged sword. On the one hand, it means they don’t have to spend marketing dollars competing with an entrenched incumbent who dominates most of the shelf space in supermarkets. On the other hand, it’s a troubling place to be because vegetables are well known, cheap, and viewed as healthy (obviously).

We’re fortunate to be living in a time when so much data is at our disposal, ready to be sliced & diced. We’re also cursed because there’s so much of it! We need the right tools and a clear mind to handle these sorts of problems, but it’s possible.

If you think your company could benefit from this sort of semantic analysis, we’d love to help so contact us.

Moving from freemium to premium

The freemium business model has suited Repustate well to a point, but now it’s time to transition to a fully paid service.

When Repustate launched all that time ago, it was a completely free service. We didn’t take your money even if you offered. The reason was we wanted more customer data to improve our various language models and felt giving software away for free in exchange for the data was a good bargain for both sides.

As our products matured and as our user base grew, it was time to flip the monetary switch and start charging – but still offering a free “Starter” plan as well as a free demo to try out our sentiment & semantic analysis engine.

As we’ve grown and as our SEO has improved, we’ve received a lot more interest from “tire kickers”: people who just want to play around, not really interested in buying. And that was fine by us because, again, we got their data so we could see how to improve our engines. But recently, the abuse of our Starter plan has gotten to the point where it’s no longer worth our while. People are creating throwaway 10-minute email accounts to sign up, activate their accounts, and then use our API.

While one could argue that if people aren’t willing to pay, maybe the product isn’t that good, the extremes people are going to in order to use Repustate’s API for free tell us that we do have a good product and that charging everyone is perfectly reasonable.

As a result, we will be removing the Starter plan. From now on, all accounts must be created with a valid credit card. We’ll probably offer a money-back trial period, say 14 days, but other than that, customers must commit to payment on Day 0. We will also bump up the quotas on all of our plans to make the value proposition all the better.

Any accounts currently on the Starter plan will be allowed to remain on it. If you have any questions about this change and how it affects you, please contact us anytime.

Beware the lure of crowdsourced data

Crowdsourced data can often be inconsistent, messy or downright wrong

We all like something for nothing, that’s why open source software is so popular. (It’s also why the Pirate Bay exists). But sometimes things that seem too good to be true are just that.

Repustate is in the text analytics game, which means we need lots and lots of data to model certain characteristics of written text. We need common words, grammar constructs, human-annotated corpora of text etc. to make our various language models work as quickly and as well as they do.

We recently embarked on the next phase of our text analytics adventure: semantic analysis. Semantic analysis is the process of taking arbitrary text and assigning meaning to the individual, relevant components. For example, being able to identify “apple” as a fruit in the sentence “I went apple picking yesterday” but to identify “Apple” the company in “I can’t wait for the new Apple product announcement”. (Note: even though I used title case in the latter example, casing should not matter.)

To be able to accomplish this task, we need a few things:

1) A list of every possible person/place/business/thing we care about and the classification each belongs to

2) A corpus of text (or corpora) that will allow us to disambiguate terms based on context. In other words, if we see the word “banana” near the word “apple”, we can safely assume we’re talking about fruits and not computers.
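
As a toy illustration of that second requirement (nothing like a production model), context-based disambiguation can be as simple as counting which sense’s cue words appear near the ambiguous term:

    # Toy word-sense disambiguation only -- real models use large corpora and
    # probabilities, not a handful of hand-picked cue words per sense.
    CUE_WORDS = {
        "apple (fruit)":   {"banana", "picking", "orchard", "pie"},
        "Apple (company)": {"iphone", "announcement", "product", "mac"},
    }

    def best_sense(sentence):
        tokens = set(sentence.lower().split())
        # Pick the sense whose cue words overlap the sentence the most.
        return max(CUE_WORDS, key=lambda sense: len(CUE_WORDS[sense] & tokens))

    print(best_sense("I went apple picking yesterday"))                       # apple (fruit)
    print(best_sense("I can't wait for the new Apple product announcement"))  # Apple (company)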

Since we’re not Google, we don’t have access to every person’s search history and resulting click throughs (although their n-gram data is useful in some applications). So we have to be clever.

Anyone who’s done work in text analysis will have heard of Freebase. Freebase is a crowdsourced repository of facts. Kind volunteers have contributed lists of data and tagged meta information about them. For example, you can look up all the models made by a particular automotive manufacturer, like Audi. You can see a list of musicians (hundreds of thousands, actually), movie stars, TV actors or types of food.

It’s tempting to use data like Freebase. It seems like someone did all the work for you. But once you dig inside, you realize it’s tons of junk, all the way down.

For example, under the Food category, you’ll see the name of each US state. I didn’t realize I could eat Alaska. Under book authors, you’ll see any athlete who’s ever “written” an autobiography. I highly doubt Michael Jordan wrote his own book, but there it is. LeBron James, NBA all-star for the Miami Heat, is listed as a movie actor.

The list goes on and on. While Freebase definitely does lend itself to being a good starting point, ultimately you’re on your own to come up with a better list of entities either through some mechanical turking or being more clever :)

By the way, if you’d like to see the end result of Repustate’s curation process, head on over to the Semantic API and try it out.

Introducing Semantic Analysis

Repustate is announcing today the release of its new product: semantic analysis. Combined with sentiment analysis, Repustate provides any organization, from startup to Fortune 50 enterprise, all the necessary tools they need to conduct in-depth text analytics. For the impatient, head on over to the semantic analysis docs page to get started.

Text analytics: the story so far

Until this point, Repustate has been concerned with analyzing text structurally. Part-of-speech tagging, grammatical analysis, even sentiment analysis are really all about the structure of the text: the order in which words come, the use of conjunctions, adjectives or adverbs to denote sentiment. All of this is a great first step in understanding the content around you – but it’s just that, a first step.

Today we’re proud and excited to announce Semantic Analysis by Repustate. We consider this release to be the biggest product release in Repustate’s history and the one that we’re most proud of (although Arabic sentiment analysis was a doozy as well!)

Semantic analysis explained

Repustate can determine the subject matter of any piece of text. We know that a tweet saying “I love shooting hoops with my friends” has to do with sports, namely, basketball. Using Repustate’s semantic analysis API you can now determine the theme or subject matter of any tweet, comment or blog post.

But beyond just identifying the subject matter of a piece of text, Repustate can dig deeper and understand each and every key entity in the text and disambiguate based on context.

Named entity recognition

Repustate’s semantic analysis tool extracts each and every entity in a piece of text and tells you the context. Repustate knows that the term “Obama” refers to “Barack Obama”, the President of the United States. Repustate knows that in the sentence “I can’t wait to see the new Scorsese film”, Scorsese refers to “Martin Scorsese” the director. With very little context (and sometimes no context at all), Repustate knows exactly what an arbitrary piece of text is talking about. Take the following examples:

  1. Obama.
  2. Obama is the President.
  3. Obama is the First Lady.

Here we have three instances of the term “Obama” being used in different contexts. In the first example, there is no context at all, just the name “Obama”. Repustate will use its internal probability model to determine that the most likely usage of this term is the name “Barack Obama”, hence an API call will return “Barack Obama”. Similarly, in the second example, the word “President” acts as a hint for the single term “Obama” and again, the API call will return “Barack Obama”. But what about the third example?

Here, Repustate is smart enough to see the phrase “First Lady”. This tells Repustate to select ‘Michelle Obama’ instead of Barack. Pretty neat, huh?
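
If you want to try this yourself, a loop like the one below will do it. Note that the endpoint path and response shape here are assumptions for illustration, not the documented call – consult the semantic analysis docs for the real thing:

    import requests

    REPUSTATE_API_KEY = "your-repustate-api-key"  # placeholder

    # Hypothetical entities endpoint; see the semantic analysis docs for the exact call.
    for text in ["Obama.", "Obama is the President.", "Obama is the First Lady."]:
        resp = requests.post(
            "https://api.repustate.com/v3/%s/entities.json" % REPUSTATE_API_KEY,
            data={"text": text, "lang": "en"},
        )
        resp.raise_for_status()
        # Expect 'Barack Obama' for the first two and 'Michelle Obama' for the third.
        print(text, "->", resp.json().get("entities", {}))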

Semantic analysis in your language

As with every other feature Repustate offers, no language takes a back seat, which is why semantic analysis will be available in every language Repustate supports. Currently only English is publicly available, but we’re rolling out the other languages in the coming weeks.

Semantic analysis by the numbers

Repustate currently has over 5.5 million entities, including people, places, brands, companies and ideas in its ontology. There are over 500 categorizations of entities, and over 30 themes with which to classify a piece of text’s subject matter. And an infinite number of ways to use Repustate to transform your text analysis.

Head on over to the Semantic Analysis Tour to see more.


Using python requests – whose encoding is it anyway?

Python requests encoding – using the Python requests module might give you surprising results

If you do anything remotely related to HTTP and use Python, then you’re probably using requests, the amazing library by Kenneth Reitz. It’s to Python HTTP programming what jQuery is to JavaScript DOM manipulation – once you use it, you wonder how you ever did without it.

But there’s a subtle issue regarding encodings that tripped us up. A customer told us that some Chinese web pages were coming back garbled when using the clean-html API call we provide. Here’s the URL:

In the HTML on these pages, the charset is gb2312, an encoding that came out of China and is used for the Simplified Chinese character set. However, many web servers do not send this as the charset in the response headers (due to the programmers, not the web server itself). As a result, requests defaults to ISO-8859-1 as the encoding when the response doesn’t declare a charset. This is done in accordance with RFC 2616. The upshot is that the Chinese text in the web page doesn’t get decoded properly when you access the text of the response, and so what you see is garbled characters.

Here’s the response headers for the above URL:

curl -I
HTTP/1.1 200 OK

Content-Type: text/html
Vary: Accept-Encoding
X-Powered-By: schi_v1.02
Server: nginx
Date: Mon, 17 Feb 2014 15:54:28 GMT
Last-Modified: Sat, 08 Feb 2014 03:56:49 GMT
Expires: Mon, 17 Feb 2014 15:56:28 GMT
Cache-Control: max-age=120
Content-Length: 133944
X-Cache: HIT from

There is a thread on the GitHub repository for requests that explains why they do this – requests shouldn’t be about HTML, the argument goes, it’s about HTTP, so if a server doesn’t respond with the proper charset declaration, it’s up to the client (or the developer) to figure out what to do. That’s a reasonable position to take, but it poses an interesting question: when common use or expectations go against the official spec, whose side does one take? Do you tell developers to put on their big boy and girl pants and deal with it, or do you acquiesce and just do what most people expect/want?

Specs be damned, make it easy for people

I believe it was Alex Payne, the Twitter API lead at the time, who was asked why Twitter includes the version of the API in the URL rather than in a request header, as would be more RESTful. His paraphrased response (because I can’t find the quote) was that Twitter’s goal was to get as many people using the API as possible, and setting headers was beyond the skill level of many developers, whereas including the version in the URL is dead simple. (We at Repustate do the same thing; our APIs are versioned via the URL. It’s simpler and more transparent.)

Now, the odd thing about requests is that the response object has an attribute called apparent_encoding which does correctly guess the charset based on the content of the response. It’s just not applied automatically because the response header takes precedence.

We ended up patching requests so that the apparent_encoding attribute is what gets used when the response headers don’t declare a charset – but this is not the default behaviour of the package.
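
If you’d rather not carry a patched copy of requests around, a minimal sketch of the same idea at the call site looks like this:

    import requests

    def fetch_text(url):
        resp = requests.get(url)
        declared = resp.headers.get("Content-Type", "")
        if "charset" not in declared.lower():
            # No charset in the headers: requests falls back to ISO-8859-1 per
            # RFC 2616, so prefer the encoding guessed from the body instead.
            resp.encoding = resp.apparent_encoding
        return resp.text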

I can’t say I necessarily disagree with the choices the maintainers of requests have made. I’m not sure if there is a right answer because if you write your code to be user friendly in direct opposition to a published spec, you will almost certainly raise the ire of someone who *does* expect things to work to spec. Damned if you do, damned if you don’t.