Introducing Repustate Sync for distributed deployment

Keeping your Repustate data in sync used to be a pain point – not anymore.

Since we launched the Repustate Server nearly three years ago, the biggest complaint has always been keeping the Servers in sync across an entire cluster. Previously we resorted to using databases to keep all the data synced up, but that placed too much of a burden on our customers. Today, that burden is lifted and Repustate Sync just works.

Distributed deployments with Repustate are now easier than ever with Repustate Sync. All your customized rules and filters are automatically available on all your servers.

To get the latest version of Repustate, which contains Repustate Sync, head on over to your account and download the newest Repustate version.

What’s great about Repustate Sync is that it works even if one of your peer nodes goes down and then comes back online. Each Repustate Sync instance stores a queue of transactions that were not synced properly, so whenever a peer is brought back online, everyone is brought up to date immediately.

The only configuration needed is the IP address and port that each peer in your Repustate cluster is running on.

For more information about the Repustate Server or Repustate Sync, head on over to the Repustate Server documentation area.

From Python to Go: migrating our entire API

The tl;dr Summary

If you want to know the whole story, read on. But for the impatient out there, here’s the executive summary:

  • We migrated our entire API stack from Python (first Django, then Falcon) to Go, reducing the mean response time of an API call from 100ms to 10ms
  • We reduced the number of EC2 instances required by 85%
  • Because Go compiles to a single static binary and because Go 1.5 makes cross-compilation a breeze, we can now ship a self-hosted version of Repustate that is identical to the one we host for customers. (Previously we shipped virtual machine appliances to customers, which was a support nightmare.)
  • Due to the similarity between Python and Go, we were able to quickly re-purpose our unit tests written in nose to fit the structure that Go requires with just a few simple sed scripts.

Background

Repustate provides text analytics services to small businesses, large enterprises and government organizations the world over. As the company has grown, so too has the strain on our servers. We process anywhere from 500 million to 1 billion pieces of text EACH day. Text comes in the form of tweets, news articles, blog comments, customer feedback forms and anything else our customers send our way. This text can be in any of the 9 languages we support, which is another consideration, since some languages tend to be more verbose than others (ahem, Arabic).

Text analytics is tough to do at scale since you can’t really leverage caching as much as you could in, say, serving static content on the web. Seldom do we analyze the exact same piece of text twice so we don’t bother maintaining any caches – which means each and every request we get is purely dynamic.

But the key insight about analyzing text is that much of the work can be done in parallel. Consider the task of running text through a part of speech tagger. For the most part, part of speech tagging algorithms use some sort of probabilistic modelling to determine the most likely tag for a word. But these probability models don’t cross sentence boundaries; the grammatical structure of one sentence doesn’t affect another. This means that given a large block of text, we can split it up into sentences and then analyze each sentence in parallel. The same strategy can be employed for sentiment as well.

So what’s wrong with Python?

The first version of our API was written in Django because, well, everyone knew Django and our site runs on Django, so why not. And it worked. We got a prototype up and running and then built on top of that. We were able to get a profitable business up and running just on Django (and an old version at that – we were still on 1.3 when 1.6 was already out!).

But there’s a lot of overhead in each Django request/response cycle. As our API grew in usage, so too did reliability issues and our Amazon bill. We decided to look at other Python alternatives and Flask came up. It’s lightweight and almost ready-made for APIs, but then we came across Falcon. We liked Falcon because it was optimized right off the bat using Cython. Simple benchmarks showed that it was *much* faster than Django, and we liked how it enforced clean REST principles. As a bonus, our existing tests could be ported over quite easily, so we didn’t lose any time there.

Falcon proved to be a great stopgap. Our mean response time fell, and the number of outages and support issues fell, too. I’d recommend Falcon to anyone building an API in Python today.

The performance, while better than Django, still couldn’t keep up with our demand. In particular, Python is a world of pain for doing concurrency. We were on Python 2.7 so we didn’t check out the new asyncio package in Python 3, but even then, you still have the GIL to worry about. Also, Falcon still didn’t solve one other major pain point: self-hosted deployment.

Python does not lend itself to being packaged up neatly and distributed the way Java or C does. Many of our customers run Repustate within their own networks for privacy & security reasons. Up to this point, we had been deploying our entire stack as a virtual appliance that could work with either VMware or VirtualBox. This was an OK solution, but it was clunky. Updates were a pain, support was a pain (“how do I know the IP address of my virtual machine?”) and so on. If we could provide Repustate as a single, installable binary built from the exact same code base as our public API, then we’d have the best of both worlds. This ideal solution also had to be even faster than our Python version on Falcon, which meant leveraging the fact that text analytics lends itself to concurrent processing.

Go get gopher

Taking a step back in our story – our Arabic engine was written in this fancy new (at the time) language called Go. We wrote a blog post about our experience migrating that code base to Go; suffice it to say, we were quite happy with it. The ideal solution was staring us right in the face – we had to port everything to Go.

Go met all of our criteria:

  • faster than Python
  • compiles to a single binary
  • could be deployed onto any operating system (and since Go 1.5, very easily at that)
  • makes concurrency trivial to reason about

As an added bonus, the layout of a Go test suite looks pretty similar to our nose tests. Test function headers were simple enough to migrate over, e.g. this:

def test_my_function():

becomes this:

func TestMyFunction(t *testing.T) {

With a couple of replacements of “=” to “:=” and single quotes to double quotes, we had Go-ready tests.
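
The actual migration used a few sed one-liners; purely to make the transformation concrete, here is the same mechanical rewrite sketched in Python (the helper below is illustrative only, not one of our migration scripts):

    import re

    def nose_line_to_go(line):
        # def test_my_function():  ->  func TestMyFunction(t *testing.T) {
        m = re.match(r"def test_(\w+)\(\):", line.strip())
        if m:
            name = "".join(part.title() for part in m.group(1).split("_"))
            return "func Test%s(t *testing.T) {" % name
        # "=" becomes ":=" (crudely, skipping comparisons) and single quotes become double quotes
        line = re.sub(r"(?<![=!<>:])=(?!=)", ":=", line)
        return line.replace("'", '"')

    print(nose_line_to_go("def test_my_function():"))
    # func TestMyFunction(t *testing.T) {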

Because goroutines and channels are so easy to work with, we were able to finally realize our dream of analyzing text in parallel. On a beefy machine with, say, 16 cores, we could just blast our way through text by chunking large pieces into smaller ones and then reconstituting the results on the other end, e.g.:

    // Fan out: score each chunk in its own goroutine, writing results
    // to a buffered channel so no goroutine blocks on send.
    chunks := s.Chunks(tws)
    channel := make(chan *ChunkScoreResult, len(chunks))
    for _, chunk := range chunks {
        go s.ScoreChunk(chunk, custom, channel)
    }

    // Now loop until all goroutines have finished.
    chunkScoreResults := make([]*ChunkScoreResult, len(chunks))
    var r *ChunkScoreResult
    for i := 0; i < len(chunks); i++ {
        r = <-channel
        chunkScoreResults[i] = r
    }

This code snippet shows us taking a slice of chunks of text, “scoring” them using goroutines, and then collecting the results by reading from the channel one by one. Each ChunkScoreResult contains an “order” attribute which allows us to re-order things once we’re done. Pretty simple.

The entire port took about three months and resulted in several improvements unrelated to performance, since the team was required to go through the Python code again with fresh eyes. As an aside, it’s always a good idea, time permitting, to go back and look at some of your old code. You’d be surprised at how bad it can be. The old “what the heck was I thinking when I wrote this” sentiment was felt by all.

We now have one code base for all of our customers that compiles to a single binary. No more virtual appliances. Our deployment process is just a matter of downloading the latest version of our binary.

Concluding remarks

The one thing writing code in a language like Go does is make you very aware of how memory works. Languages like Python or Ruby often seduce you into ignoring what’s going on under the hood because it’s just so easy to do pretty complex things, but languages like Go and C don’t hide that. If you’re not used to that way of thinking, it takes some getting used to (how will the memory be allocated? Am I creating too much garbage? When does the garbage collector kick in?), but it makes your software run that much more smoothly and, to be honest, makes you a better Python programmer, too.

Go isn’t perfect and there’s no shortage of blogs out there that can point out what’s wrong with the language. But if you write Go as it is intended to be written, and leverage its strengths, the results are fantastic.

Go – duplicate symbols for architecture x86_64

This is a short blog piece, really intended for fellow Go developers who stumble upon the same dreaded “duplicate symbols” error.

Currently, some of Repustate’s Go code uses cgo to talk to various C libraries. It’s a stopgap until we finish porting all of the C code to pure Go. While writing some tests, we hit this error:

“ld: 1 duplicate symbol for architecture x86_64”

(Note: if you had more than one duplicate, it would tell you exactly how many.)

What does this mean? It means we’re trying to link the same symbol name (in our case, a method) from two (or more) different source files. The fix was easy: rename one of the methods by updating the header file, the source file (.c or .cpp), and lastly any references to the symbol in your Go code, if it is referenced there directly.

Smooth sailing from here on in.

Moving from freemium to premium

The freemium business model has suited Repustate well to a point, but now it’s time to transition to a fully paid service.

When Repustate launched all that time ago, it was a completely free service. We didn’t take your money even if you offered. The reason was we wanted more customer data to improve our various language models and felt giving software away for free in exchange for the data was a good bargain for both sides.

As our products matured and as our user base grew, it was time to flip the monetary switch and start charging – but still offering a free “Starter” plan as well as a free demo to try out our sentiment & semantic analysis engine.

As we’ve grown and as our SEO has improved, we’ve received a lot more interest from “tire kickers” – people who just want to play around and aren’t really interested in buying. And that was fine by us because, again, we got their data and could see how to improve our engines. But recently, abuse of our Starter plan has gotten to the point where it’s no longer worth our while. People are creating those 10-minute throwaway accounts just to sign up, activate them, and then use our API.

While one could argue that if people aren’t willing to pay, maybe the product isn’t that good, the extremes people go to in order to use Repustate’s API for free tell us that we do have a good product and that charging everyone is perfectly reasonable.

As a result, we will be removing the Starter plan. From now on, all accounts must be created with a valid credit card. We’ll probably offer a money-back trial period, say 14 days, but other than that, customers must commit to payment on Day 0. We will also bump up the quotas on all of our plans to make the value proposition all the better.

Any accounts currently on the Starter plan will be allowed to remain on it. If you have any questions about this change and how it affects you, please contact us anytime.

Beware the lure of crowdsourced data

Crowdsourced data can often be inconsistent, messy or downright wrong

We all like something for nothing; that’s why open source software is so popular. (It’s also why the Pirate Bay exists.) But sometimes things that seem too good to be true are just that.

Repustate is in the text analytics game, which means we need lots and lots of data to model certain characteristics of written text. We need common words, grammar constructs, human-annotated corpora of text etc. to make our various language models work as quickly and as well as they do.

We recently embarked on the next phase of our text analytics adventure: semantic analysis. Semantic analysis is the process of taking arbitrary text and assigning meaning to the individual, relevant components. For example, being able to identify “apple” as a fruit in the sentence “I went apple picking yesterday” but “Apple” the company in “I can’t wait for the new Apple product announcement”. (Note: even though I used title case in the latter example, casing should not matter.)

To be able to accomplish this task, we need a few things:

1) A list of every possible person/place/business/thing we care about and the classification it belongs to

2) A corpus of text (or corpora) that will allow us to disambiguate terms based on context. In other words, if we see the word “banana” near the word “apple”, we can safely assume we’re talking about fruits and not computers.

Since we’re not Google, we don’t have access to everyone’s search history and the resulting click-throughs (although their n-gram data is useful in some applications). So we have to be clever.
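
As a toy illustration of requirement 2 above (and emphatically not our production engine), this is roughly what a disambiguation corpus buys you – counts of which words co-occur with an ambiguous term, so that “banana” or “picking” showing up near “apple” points to the fruit rather than the company:

    from collections import Counter, defaultdict

    # Tiny stand-in corpus; the real thing would be millions of sentences.
    corpus = [
        "i went apple picking and ate a banana afterwards",
        "apple announced a new iphone and its stock jumped",
    ]

    # For every word, count the other words that appear in the same sentence.
    cooccurrence = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for i, word in enumerate(words):
            cooccurrence[word].update(words[:i] + words[i + 1:])

    # "apple" now has context counts covering both senses, e.g. "picking",
    # "banana" (fruit) and "iphone", "stock" (company).
    print(cooccurrence["apple"].most_common(5))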

Anyone who’s done work in text analysis will have heard of Freebase. Freebase is a crowdsourced repository of facts. Kind volunteers have contributed lists of data and tagged meta information about them. For example, you can look up all the models made by a particular automotive manufacturer, like Audi. You can see a list of musicians (hundreds of thousands, actually), movie stars, TV actors or types of food.

It’s tempting to use data like Freebase. It seems like someone has done all the work for you. But once you dig inside, you realize it’s tons of junk, all the way down.

For example, under the Food category, you’ll see the name of each US state. I didn’t realize I could eat Alaska. Under book authors, you’ll see any athlete who’s ever “written” an autobiography. I highly doubt Michael Jordan wrote his own book, but there it is. LeBron James, NBA all-star for the Miami Heat, is listed as a movie actor.

The list goes on and on. While Freebase definitely lends itself to being a good starting point, ultimately you’re on your own to come up with a better list of entities, either through some mechanical turking or by being more clever 🙂

By the way, if you’d like to see the end result of Repustate’s curation process, head on over to the Semantic API and try it out.

Introducing Semantic Analysis

Repustate is announcing today the release of its new product: semantic analysis. Combined with sentiment analysis, Repustate provides any organization, from startup to Fortune 50 enterprise, with all the tools it needs to conduct in-depth text analytics. For the impatient, head on over to the semantic analysis docs page to get started.

Text analytics: the story so far

Until this point, Repustate has been concerned with analyzing text structurally. Part of speech tagging, grammatical analysis, even sentiment analysis – it’s really all about the structure of the text: the order in which words come, the use of conjunctions, adjectives or adverbs to denote sentiment. All of this is a great first step in understanding the content around you – but it’s just that, a first step.

Today we’re proud and excited to announce Semantic Analysis by Repustate. We consider this release to be the biggest product release in Repustate’s history and the one that we’re most proud of (although Arabic sentiment analysis was a doozy as well!)

Semantic analysis explained

Repustate can determine the subject matter of any piece of text. We know that a tweet saying “I love shooting hoops with my friends” has to do with sports, namely, basketball. Using Repustate’s semantic analysis API you can now determine the theme or subject matter of any tweet, comment or blog post.

But beyond just identifying the subject matter of a piece of text, Repustate can dig deeper and understand each and every key entity in the text and disambiguate based on context.

Named entity recognition

Repustate’s semantic analysis tool extracts each and every named entity in a piece of text and tells you the context. Repustate knows that the term “Obama” refers to “Barack Obama”, the President of the United States. Repustate knows that in the sentence “I can’t wait to see the new Scorsese film”, Scorsese refers to “Martin Scorsese” the director. With very little context (and sometimes no context at all), Repustate knows exactly what an arbitrary piece of text is talking about. Take the following examples:

  1. Obama.
  2. Obama is the President.
  3. Obama is the First Lady.

Here we have three instances where the term “Obama” is being used in different contexts. In the first example, there is no context, just the name ‘Obama’. Repustate will use its internal probability model to determine that the most likely usage of this term is the name ‘Barack Obama’, hence an API call will return ‘Barack Obama’. Similarly, in the second example, the word “President” acts as a hint for the single term ‘Obama’ and again, the API call will return ‘Barack Obama’. But what about the third example?

Here, Repustate is smart enough to see the phrase “First Lady”. This tells Repustate to select ‘Michelle Obama’ instead of Barack. Pretty neat, huh?
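
To make the mechanics more concrete, here’s a toy sketch of that idea – a prior probability plus context hints – in Python. It is not Repustate’s actual model; the priors, hint words and scoring below are made up purely for illustration:

    import re

    # Hypothetical candidates for the surface form "Obama", each with a made-up
    # prior probability and a few hint words that pull the decision its way.
    CANDIDATES = {
        "obama": [
            {"entity": "Barack Obama",   "prior": 0.9, "hints": {"president"}},
            {"entity": "Michelle Obama", "prior": 0.1, "hints": {"first", "lady"}},
        ],
    }

    def disambiguate(term, sentence):
        context = set(re.findall(r"\w+", sentence.lower()))
        # Score each candidate by its prior plus a bonus per hint word present.
        best = max(CANDIDATES[term.lower()],
                   key=lambda c: c["prior"] + len(c["hints"] & context))
        return best["entity"]

    print(disambiguate("Obama", "Obama."))                    # Barack Obama
    print(disambiguate("Obama", "Obama is the President."))   # Barack Obama
    print(disambiguate("Obama", "Obama is the First Lady."))  # Michelle Obama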

Semantic analysis in your language

As with every other feature Repustate offers, no language takes a back seat, and that’s why semantic analysis will be available in every language Repustate supports. Currently only English is publicly available, but we’re rolling out every other language in the coming weeks.

Semantic analysis by the numbers

Repustate currently has over 5.5 million entities in its ontology, including people, places, brands, companies and ideas. There are over 500 categorizations of entities, and over 30 themes with which to classify a piece of text’s subject matter. And an infinite number of ways to use Repustate to transform your text analysis.

Head on over to the Semantic Analysis Tour to see more.


Using python requests – whose encoding is it anyway?

Python requests encoding – using the Python requests module might give you surprising results

If you do anything remotely related to HTTP in Python, then you’re probably using requests, the amazing library by Kenneth Reitz. It’s to Python HTTP programming what jQuery is to JavaScript DOM manipulation – once you use it, you wonder how you ever did without it.

But there’s a subtle issue with regard to encodings that tripped us up. A customer told us that some Chinese web pages were coming back garbled when using the clean-html API call we provide. Here’s the URL:

http://finance.sina.com.cn/china/20140208/111618150293.shtml

In the HTML of these pages, the declared charset is gb2312, an encoding that came out of China and is used for Simplified Chinese characters. However, many web servers do not send this charset in the response headers (due to the programmers, not the web server itself). When the response doesn’t contain a charset, requests defaults to ISO-8859-1 as the encoding, in accordance with RFC 2616. The upshot is that the Chinese text in the web page doesn’t get decoded properly when you access the text of the response, and so what you see is garbled characters.

Here are the response headers for the above URL:

curl -I http://finance.sina.com.cn/china/20140208/111618150293.shtml
HTTP/1.1 200 OK

Content-Type: text/html
Vary: Accept-Encoding
X-Powered-By: schi_v1.02
Server: nginx
Date: Mon, 17 Feb 2014 15:54:28 GMT
Last-Modified: Sat, 08 Feb 2014 03:56:49 GMT
Expires: Mon, 17 Feb 2014 15:56:28 GMT
Cache-Control: max-age=120
Content-Length: 133944
X-Cache: HIT from 236-41.D07071951.sina.com.cn

There is a thread on the GitHub repository for requests that explains why they do this – requests shouldn’t be about HTML, the argument goes, it’s about HTTP, so if a server doesn’t respond with the proper charset declaration, it’s up to the client (or the developer) to figure out what to do. That’s a reasonable position to take, but it poses an interesting question: when “common” use or expectations go against the official spec, whose side do you take? Do you tell developers to put on their big boy and girl pants and deal with it, or do you acquiesce and just do what most people expect/want?

Specs be damned, make it easy for people

I believe it was Alex Payne, Twitter’s API lead at the time, who was asked why Twitter includes the version of the API in the URL rather than in a request header, as would be more RESTful. His paraphrased response (because I can’t find the quote) was that Twitter’s goal was to get as many people using the API as possible, and setting headers was beyond the skill level of many developers, whereas including the version in the URL is dead simple. (We at Repustate do the same thing; our APIs are versioned via the URL. It’s simpler and more transparent.)

Now the odd thing about requests is that each response has an attribute called apparent_encoding which does correctly guess the charset based on the content of the response. It’s just not automatically applied, because the response header takes precedence.
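
Here’s a minimal sketch of the behaviour using the page above; the exact guess apparent_encoding makes depends on the character-detection library behind it:

    import requests

    url = "http://finance.sina.com.cn/china/20140208/111618150293.shtml"
    response = requests.get(url)

    # No charset in the Content-Type header, so requests falls back to the
    # RFC 2616 default and .text comes out garbled.
    print(response.encoding)           # ISO-8859-1
    print(response.apparent_encoding)  # a guess based on the body, e.g. GB2312

    # Applying the guessed encoding before touching .text fixes the output.
    response.encoding = response.apparent_encoding
    readable_html = response.text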

We ended up patching requests so that the apparent_encoding attribute is what gets used when the response sets no charset – but this is not the default behaviour of the package.

I can’t say I necessarily disagree with the choices the maintainers of requests have made. I’m not sure if there is a right answer because if you write your code to be user friendly in direct opposition to a published spec, you will almost certainly raise the ire of someone who *does* expect things to work to spec. Damned if you do, damned if you don’t.

Social sentiment analysis is missing something

Sentiment analysis by itself doesn’t suffice

An interesting article by Seth Grimes caught our eye this week. Seth is one of the few voices of reason in the world of text analytics who I feel “gets it”. His views on sentiment’s strengths and shortcomings align almost perfectly with Repustate’s general philosophy.

In the article, Seth states that simply relying on a number denoting sentiment, or a label like “positive” or “negative”, is too coarse a measurement and doesn’t carry any meaning with it. By doing so, you risk overlooking deeper insights that are hidden beneath the high-level sentiment score. We couldn’t agree more, and that’s why Repustate supports categorizations.

Sentiment by itself is meaningless; sentiment analysis scoped to a particular business need or product feature is where the true value lies. Categorizing your social data by features of your service (e.g. price, selection, quality) first and THEN applying sentiment analysis is the way to go. In the article, Seth proceeds to list a few “emotional” categories (promoter/detractor, angry, happy etc.) that quite frankly I would ignore. These categories are too touchy-feely, hard to really disambiguate at a machine learning level, and don’t tie closely to actual business processes or features. For instance, if someone is a detractor, what is it that is causing them to be a detractor? Was it the service they received? If so, then customer service is the category you want, and the negative polarity of the text in question gives you invaluable insight. The fact that someone is being negative about your business means almost by definition they are a detractor.

Repustate provides our customers with the ability to create their own categories according to the definitions that they create. Each customer is different, each business is different, hence the need for customized categories. Once you have your categories, sentiment analysis becomes much more insightful and valuable to your business.

Sharing large data structures across processes in Python

At Repustate, many of the data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact, for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how to keep things fast for the client as well as light as possible on the server. We also distribute our software as virtual machines to some customers, so memory usage has to be light because we can’t control how much memory our customers will allocate to the VMs they deploy.

To summarize, here’s our checklist of requirements:

  1. Low memory footprint
  2. Can be shared amongst multiple processes with no issues (read only)
  3. Very fast access
  4. Easy to update (write) out of process

So our first attempt was to store the models on disk in MongoDB and to load them into memory as Python dictionaries. This worked and satisfied #3 and #4 but failed #1 and #2. This is how Repustate operated for a while, but memory usage kept growing and it became unsustainable. Python dictionaries are not memory efficient, and it was too expensive for each Apache process to hold its own copy since we were not sharing the data between processes.

One night I was complaining about our dilemma and a friend of mine, who happens to be a great developer at Red Hat, said these three words: “memory mapped file”. Of course! In fact, Repustate already uses memory mapped files elsewhere, but I had completely forgotten about them. That solves half the problem – it meets requirement #2. But what format does the memory mapped file take? Thankfully computer science has already solved all the world’s problems, and the perfect data structure was already out there: tries.

Tries (pronounced “trees” for some reason, and not “try’s”), AKA radix trees or prefix trees, are a data structure that lends itself to objects that need string keys. Wikipedia has a better write-up, but long story short, tries are great for the type of models Repustate uses.

I found the marisa-trie package, a Python wrapper around a C++ implementation of a MARISA trie. “MARISA” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What’s great about MARISA tries is that the storage mechanism really shrinks how much memory you need. The author of the Python wrapper claimed a 50-100X reduction in size – our experience is similar.

What’s great about the marisa-trie package is that the underlying trie structure can be written to disk and then read back in via a memory-mapped object. With a memory-mapped MARISA trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation.
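
Here’s a minimal sketch of the pattern using the marisa_trie Python module (the file name and value format are just examples, and the calls reflect the package’s documented API at the time – check its docs for the details):

    import marisa_trie

    # Build the model once, out of process, and write it to disk.
    # RecordTrie packs each value using a struct format string ("<f" = one float).
    data = [(u"good", (0.9,)), (u"bad", (-0.8,)), (u"okay", (0.1,))]
    marisa_trie.RecordTrie("<f", data).save("sentiment.marisa")

    # In each web process, memory-map the same file instead of loading a copy;
    # the OS shares the pages, so the footprint stays low and lookups stay fast.
    model = marisa_trie.RecordTrie("<f")
    model.mmap("sentiment.marisa")

    print(model[u"good"])  # read-only lookup of the packed values for the key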

Next time you’re in need of sharing large amounts of data, give memory mapped tries a chance.


Chinese POS Tagger (and other languages)

Need an Arabic part of speech tagger (AKA an Arabic POS Tagger)? How about German or Italian? You’re in luck – Repustate’s internal POS taggers have been opened up via our API to give our developers the ability to slice and dice multilingual text the way we do.

The documentation for the POS tagger API call outlines all you need to know to get started. With this new API call you get:

  • English POS tagger
  • German POS tagger
  • Chinese POS tagger
  • French POS tagger
  • Italian POS tagger
  • Spanish POS tagger
  • Arabic POS tagger

Beyond this, we’ve unified the tag set you get back from the various POS taggers so that you only have to write your handling code once for all languages (a quick sketch below shows the idea). The complete tag set includes nouns, adjectives, verbs, adverbs, punctuation marks, conjunctions and prepositions. Give it a shot and let us know what you think.
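
To show what writing the code once looks like, here’s a hypothetical sketch in Python – the endpoint URL, parameter names and response shape are placeholders rather than the documented API, so see the POS tagger docs for the real call:

    import requests

    API_URL = "https://api.example.com/pos-tag"   # placeholder endpoint
    API_KEY = "your-api-key"                      # placeholder key

    def nouns(text, lang):
        # Assumed response shape: a list of {"word": ..., "tag": ...} objects.
        resp = requests.post(API_URL, data={"key": API_KEY, "text": text, "lang": lang})
        return [t["word"] for t in resp.json() if t["tag"] == "noun"]

    # One function, any supported language -- the unified tag set does the rest.
    print(nouns("The food was excellent", "en"))
    print(nouns("Das Essen war ausgezeichnet", "de"))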