Introducing Repustate Sync for distributed deployment

Keeping your Repustate data in sync used to be a pain point – not anymore.

Since we launched the Repustate Server nearly three years ago, the biggest complaint has always been keeping the Servers in sync across the entire cluster. Previously we had resorted to using databases to keep all the data synced up, but that placed too much of a burden on our customers. Today, that burden is lifted and Repustate Sync just works.

Distributed deployments with Repustate are now easier than ever with Repustate Sync. All your customized rules and filters are automatically available on all your servers.

To get the latest version of Repustate, which contains Repustate Sync, head on over to your account and download the newest Repustate version.

What’s great about Repustate Sync is that it keeps working even if one of your peer nodes goes down and then comes back online. Each Repustate Sync instance stores a queue of transactions that were not synced successfully, so whenever a peer is brought back online, everyone is brought up to date immediately.

The only configuration that is needed is the IP addresses and ports that each peer in your Repustate cluster is running on.

For more information about the Repustate Server or Repustate Sync, head on over to the Repustate Server documentation area.

Named entity recognition in Russian

Named entity recognition is now available in Russian via the Repustate API. Combined with Russian sentiment analysis, this means customers can now do full text analytics in Russian with Repustate.

Repustate is happy to add Russian named entity recognition to the solutions already available in English and Arabic. But like any language, Russian has its nuances, which made named entity recognition a bit tougher than, say, English.

Consider the following sentence:
Путин дал Обаме новое ядерное соглашение

In English, this says:
Putin gave Obama the new nuclear agreement

This is how Barack Obama is written in Russian:
Барак Обама

Notice that in our sentence, “Obama” is spelled with a different suffix. That’s because in Russian, nouns (even proper nouns) are declined based on their grammatical case. In this example, Obama is in what’s called the “dative” case, meaning the noun is the recipient of something. English has no concept of declining nouns this way; the only suffix change English requires is for pluralization.

So Repustate has to know how to stem proper nouns as well in order to properly identify “Обаме” as Barack Obama during the contextual expansion phase.
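
To make that concrete, here’s a toy illustration (not our actual pipeline, which stems forms rather than enumerating them) of mapping case-inflected surface forms back to a canonical entity before lookup:

# Toy example only: a real system stems/lemmatizes instead of listing every form.
CANONICAL = {
    "Обама": "Барак Обама",   # nominative
    "Обамы": "Барак Обама",   # genitive
    "Обаме": "Барак Обама",   # dative, as in the sample sentence
    "Обаму": "Барак Обама",   # accusative
}

def resolve_entity(token):
    """Map an inflected proper noun to its canonical entity, if we know it."""
    return CANONICAL.get(token)

print(resolve_entity("Обаме"))  # -> Барак Обама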

These are the sorts of problems we have solved so you don’t have to. Try out the Semantic API and let us know what you think.

Beware the lure of crowdsourced data

Crowdsourced data can often be inconsistent, messy or downright wrong

We all like something for nothing; that’s why open source software is so popular. (It’s also why the Pirate Bay exists.) But sometimes things that seem too good to be true are just that.

Repustate is in the text analytics game, which means we need lots and lots of data to model certain characteristics of written text. We need common words, grammar constructs, human-annotated corpora of text and so on to make our various language models work as quickly and as well as they do.

We recently embarked on the next phase of our text analytics adventure: semantic analysis. Semantic analysis is the process of taking arbitrary text and assigning meaning to the individual, relevant components. For example, being able to identify “apple” as a fruit in the sentence “I went apple picking yesterday” but to identify “Apple” the company when saying “I can’t wait for the new Apple product announcement” (note: even though I used title case in the latter example, casing should not matter).

To be able to accomplish this task, we need a few things:

1) A list of every possible person/place/business/thing we care about and the classification each belongs to

2) A corpus of text (or corpora) that will allow us to disambiguate terms based on context. In other words, if we see the word “banana” near the word “apple”, we can safely assume we’re talking about fruits and not computers (a toy version of this idea is sketched below).
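
Here’s that toy version (the word lists and the overlap scoring are illustrative only, not our production approach):

# Score each candidate sense of "apple" by how many of its typical context words
# appear in the surrounding text. The word lists below are made up for illustration.
CONTEXT = {
    "fruit":   {"banana", "picking", "orchard", "pie", "juice"},
    "company": {"iphone", "announcement", "product", "store", "mac"},
}

def disambiguate_apple(sentence):
    """Pick the sense whose typical context words overlap the sentence the most."""
    words = set(sentence.lower().split())
    return max(CONTEXT, key=lambda sense: len(CONTEXT[sense] & words))

print(disambiguate_apple("I went apple picking yesterday"))                       # -> fruit
print(disambiguate_apple("I can't wait for the new Apple product announcement"))  # -> company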

Since we’re not Google, we don’t have access to every person’s search history and the resulting click-throughs (although their n-gram data is useful in some applications). So we have to be clever.

Anyone who’s done work in text analysis will have heard of Freebase. Freebase is a crowdsourced repository of facts. Kind volunteers have contributed lists of data and tagged metadata about them. For example, you can look up all the models of a particular automotive manufacturer, like Audi. You can see a list of musicians (hundreds of thousands, actually), movie stars, TV actors or types of food.

It’s tempting to use data like Freebase. It seems like someone did all the work for you. But once you dig inside, you realize it’s tons of junk, all the way down.

For example, under the Food category, you’ll see the name of each US state. I didn’t realize I could eat Alaska. Under book authors, you’ll see any athlete who’s ever “written” an autobiography. I highly doubt Michael Jordan wrote his own book, but there it is. LeBron James, NBA all-star for the Miami Heat, is listed as a movie actor.

The list goes on and on. While Freebase definitely lends itself to being a good starting point, ultimately you’re on your own to come up with a better list of entities, either through some mechanical turking or by being more clever 🙂

By the way, if you’d like to see the end result of Repustate’s curation process, head on over to the Semantic API and try it out.

Sharing large data structures across processes in Python

At Repustate, many of the data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact, for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how to keep things fast for the client while staying as light as possible on the server. We also distribute our software as virtual machines to some customers, so memory usage has to be light because we can’t control how much memory our customers will allocate to the VMs they deploy.

To summarize, here’s our checklist of requirements:

  1. Low memory footprint
  2. Can be shared amongst multiple processes with no issues (read only)
  3. Very fast access
  4. Easy to update (write) out of process

So our first attempt was to store the models on disk in MongoDB and load them into memory as Python dictionaries. This worked and satisfied #3 and #4 but failed #1 and #2. This is how Repustate operated for a while, but memory usage kept growing and it became unsustainable. Python dictionaries are not memory efficient, and it was too expensive for each Apache process to hold its own copy since we were not sharing the data between processes.

One night I was complaining about our dilemma and a friend of mine, who happens to be a great developer at Red Hat, said these three words: “memory mapped file”. Of course! In fact, Repustate already uses memory mapped files, but I had completely forgotten about them. So that solves half my problem – it meets requirement #2. But what format does the memory mapped file take? Thankfully computer science has already solved all the world’s problems, and the perfect data structure was already out there: tries.

Tries (pronounced “trees”, from “retrieval”, and not “try’s”), AKA radix trees AKA prefix trees, are a data structure that lends itself to mappings with string keys. Wikipedia has a better write-up, but long story short, tries are great for the type of models Repustate uses.

I found the marisa-trie package, a Python wrapper around a C++ implementation of a MARISA trie. “MARISA” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What’s great about MARISA tries is that the storage mechanism really shrinks how much memory you need. The author of the Python package claimed a 50-100x reduction in size – our experience is similar.

What’s great about the marisa-trie package is that the underlying trie structure can be written to disk and then read back in via a memory-mapped object. With a memory-mapped marisa trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation.
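
Here’s a minimal sketch of that setup using marisa-trie’s RecordTrie and its save()/mmap() methods (the file name, keys and struct format are just examples, not our actual model layout):

import marisa_trie

# Build the model once, out of process, and write it to disk.
# "<f" packs each value as a single little-endian float.
keys = ["good", "great", "terrible"]
values = [(1.0,), (2.0,), (-2.0,)]
trie = marisa_trie.RecordTrie("<f", zip(keys, values))
trie.save("model.marisa")

# In each serving process, memory-map the same file instead of loading it;
# the OS shares one copy of the pages across every process that maps them.
model = marisa_trie.RecordTrie("<f")
model.mmap("model.marisa")
print(model["great"])  # -> [(2.0,)]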

Next time you’re in need of sharing large amounts of data, give memory mapped tries a chance.


Chinese POS Tagger (and other languages)

Need an Arabic part of speech tagger (AKA an Arabic POS Tagger)? How about German or Italian? You’re in luck – Repustate’s internal POS taggers have been opened up via our API to give our developers the ability to slice and dice multilingual text the way we do.

The documentation for the POS tagger API call outlines all you need to know to get started. With this new API call you get:

  • English POS tagger
  • German POS tagger
  • Chinese POS tagger
  • French POS tagger
  • Italian POS tagger
  • Spanish POS tagger
  • Arabic POS tagger

Beyond this, we’ve unified the tag set you get from the various POS taggers so that you only have to write code once to handle all languages. The complete tag set includes nouns, adjectives, verbs, adverbs, punctuation marks, conjunctions and prepositions. Give it a shot and let us know what you think.

Segmenting Twitter hashtags

Segmenting Twitter hashtags to gain insight

Twitter hashtags are everywhere these days, and the ability to data mine them is an important one. The problem with hashtags is that they are one long string composed of a few smaller ones. If we can segment the long hashtag into its individual words, we can gain some extra insight into the context of the tweet and maybe determine its sentiment as a result.

So what to do? How do we solve the problem of the long, single string?

Use the probabilities, Luke

As we did with Chinese sentiment, we had to rely on conditional probabilities to determine the most likely words in a string of characters. Put differently, you’re trying to answer the question: “If the previous word was X, what are the odds the next word is Y?” To answer this, you need to build up a probability model using some tagged data. We grabbed the most common bigrams from Google’s n-gram database and then, using the frequencies listed, constructed a probability model.

To better understand why we need the probabilities, let’s take a look at a concrete example. Take the following hashtag: #vacationfallout. There are two possible segmentations here, [“vacation”, “fallout”] or [“vacation”, “fall”, “out”]. So how do we know which to use? We examine the probability that the string “fallout” comes after “vacation”. This probability, as we know from our model, is higher than the probability of the words “fall” and “out” coming after “vacation”, so that’s the segmentation we go with.

Now of course, since we’re dealing with probabilities, we might be wrong. Perhaps the author did intend for that hashtag to mean [“vacation”, “fall”, “out”]. But we learn to live with the fact that we’ll be wrong sometimes; the key is that we’ll be wrong much less often than we’re right.
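
Here’s a stripped-down sketch of the idea (the counts are made up; the real model is built from the n-gram frequencies described above):

# Toy counts; in production these come from a large unigram/bigram corpus.
UNIGRAMS = {"vacation": 1000, "fall": 800, "out": 5000, "fallout": 300}
BIGRAMS = {("vacation", "fallout"): 25, ("vacation", "fall"): 4, ("fall", "out"): 40}

def bigram_prob(prev, word):
    """P(word | prev), with a small floor so unseen pairs aren't impossible."""
    return BIGRAMS.get((prev, word), 0.01) / float(UNIGRAMS.get(prev, 1))

def best_segmentation(text, prev="<s>"):
    """Return (probability, words) for the most likely split of `text`."""
    if not text:
        return 1.0, []
    best_prob, best_words = 0.0, None
    for i in range(1, len(text) + 1):
        word = text[:i]
        if word not in UNIGRAMS:
            continue
        tail_prob, tail_words = best_segmentation(text[i:], word)
        prob = bigram_prob(prev, word) * tail_prob
        if tail_words is not None and prob > best_prob:
            best_prob, best_words = prob, [word] + tail_words
    return best_prob, best_words

print(best_segmentation("vacationfallout"))  # -> (..., ['vacation', 'fallout'])

The naive recursion above recomputes the same suffixes over and over, which is exactly where memoization, covered next, comes in.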

Memoizing to increase performance

Since the Repustate API is hit pretty heavily, we still need to be concerned with performance. The first step we take is to determine if there is a hashtag to expand. We do this using a simple regular expression. The next step, once we’ve determined there is a hashtag present, is to expand it into its individual words. To make things go a bit faster, we memoize the functions we use so that when we encounter the same patterns, and we do, we won’t waste time calculating things from scratch each time. Here’s the sort of decorator we use to memoize in Python:
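
(This is a minimal sketch rather than the exact implementation; it caches on the function’s positional arguments, which is all the hashtag expander needs.)

import functools

def memoize(func):
    """Cache results keyed on the function's positional arguments."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoize
def segment_hashtag(hashtag):
    # The expensive, probability-based segmentation would happen here.
    return hashtag.lstrip("#")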

Using Python’s AST parser to translate text queries to regular expressions

Python’s AST (Abstract Syntax Tree) module is pretty darn useful

We recently introduced a new set of API calls for our enterprise customers (all customers will soon have access to these APIs) that allow you to create customized rules for categorizing text. For example, let’s say you want to classify Twitter data into one of two categories: Photography or Photoshop. If it has to do with photography, that’s one category; if it has to do with Photoshop, that’s the other.

So we begin by listing out some boolean rules as to how we want to match our text. We can use the OR operator, we can use AND, we can use negation by placing a “-” (dash or hyphen) before a token and we can use parentheses to group pieces of logic together. Here are our definitions for the two categories:

Photography: photo OR camera OR picture
Photoshop: “Photoshop” -photo -shop

The first rule states that if a tweet has at least one of the words “photo”, “camera” or “picture”, then classify it as being in the Photography category. The second rule states that if it has the word “Photoshop” and does not contain the words “photo” and “shop” by themselves, then this piece of text falls under the Photoshop category. You’ll notice there’s an implicit AND operator wherever we use whitespace to separate tokens.

Now one approach would be to take your text, tokenize it into a bag of words, and then go one by one through each of the boolean clauses and see which matches. But that’s way too slow. We can do this much faster using regular expressions.

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

The hilarity of the quote above notwithstanding, this problem is ready-made for regular expressions because we can compile our rules once at startup time and then just iterate over them for each piece of text. But how do you convert the category definitions above into a regular expression, using negative lookaheads/lookbehinds and all that other regexp goodness? We used Python’s AST module.

AST to the rescue

Thinking back to your days in CS, you’ll remember that an expression like 2 + 2 can be parsed and converted into a tree, where the binary operator ‘+’ is the parent and it has two child nodes, namely ‘2’ and ‘2’. If we take our boolean expressions and replace OR with ‘+’ and the AND operator (whitespace) with a ‘*’, we can feed our text into the Python ast module like so:
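
(The snippet below is a simplified reconstruction of that step; the rewritten rule string is shown by hand rather than generated.)

import ast

# '"Photoshop" -photo -shop'  ->  OR becomes "+", the implicit AND becomes "*",
# and the leading "-" is already valid Python unary minus.
rule = '"Photoshop" * -photo * -shop'

tree = ast.parse(rule, mode="eval")
print(ast.dump(tree.body))  # a tree of BinOp/UnaryOp/Name/Constant nodes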

The “process” method, shown below, is what then traverses the tree and emits the necessary regular expression text:
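
(Again, a simplified reconstruction rather than our exact production code; it only handles the node types our rule language produces, and it uses ast.Constant, which on Pythons older than 3.8 would be ast.Str.)

def process(node):
    """Walk the AST and emit the equivalent regular expression fragment."""
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        # OR: alternation between the two operands
        return "%s|%s" % (process(node.left), process(node.right))
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mult):
        # AND: both operands must match somewhere, so chain lookaheads
        return "(?=.*%s)(?=.*%s)" % (process(node.left), process(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        # NOT: the operand must not appear anywhere in the text
        return "^((?!%s).)*$" % process(node.operand)
    if isinstance(node, ast.Name):
        return r"\b%s\b" % node.id
    if isinstance(node, ast.Constant):  # quoted terms such as "Photoshop"
        return r"\b%s\b" % node.value
    raise ValueError("unsupported node: %r" % node)

print(process(ast.parse("photo + camera + picture", mode="eval").body))
# -> \bphoto\b|\bcamera\b|\bpicture\b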

(I’ve simplified things and omitted some helper methods, but what you see here is the heart of the algorithm.) So the final regular expression for the two rules above would look like this:

Photography: ‘photo|camera|picture’
Photoshop: ‘(?=.*(?=.*\bPhotoshop\b).*^((?!photo).)*$).*^((?!shop).)*$’

That second rule in particular is a doozy because it uses lookarounds, which are a pain in the butt to derive manually.

The AST module emits a tree where each node has a type. So when traversing the tree, we just have to check which type of node we’re dealing with and proceed accordingly. If it’s a binary operator, for example the OR operation, we know we have to put a pipe (i.e. “|”) between the two operands to form an “OR” in regular expression syntax. Similarly, AND and NOT are processed accordingly, and since it’s a tree, we can do this recursively. Neat.

(More documentation on the AST module can be found here.)

The final product is a very fast regular expression which allows us to categorize text quickly and accurately. Next post, I’m going to talk about semantic categorization (e.g. Tag all pieces of text that have to do with football or baseball under the “Sports” category) so stay tuned!

The Repustate Promoter Score (RPS) explained

Repustate’s new metric

One of the newest and most popular features of Repustate’s Social Media Monitoring platform has been the addition of a metric we call the Repustate Promoter Score, or RPS for short. The RPS provides a simple score, from 0 to 10, that indicates the overall sentiment people have about a particular topic. This could be a person, a brand, a product, anything.

The RPS is calculated as follows: take the total number of positive mentions and the total number of negative mentions, compute the Wilson confidence interval using those two numbers, and then multiply by 10 to get a score between 0 and 10. That’s a mouthful, so let’s dig a bit deeper.

How it works

The total number of positive and negative mentions is easy enough to understand, but what’s this about a Wilson confidence interval? Well, it’s a formula that tries to determine how confident you can be about a particular outcome given the data. To put this in social terms, let’s say we have 1 tweet about the Apple iPhone and that tweet is positive. How confident are we that everyone is positive about the iPhone? Not very, because we only have 1 piece of data. But let’s say we have 1000 tweets, and 950 are positive. Now our confidence level rises, and that’s the thinking behind using the Wilson confidence interval.

Some examples

Here are some examples of Repustate Promoter Scores given a number of positive and negative mentions:

Positive: 1, Negative: 0, RPS: 2
Positive: 10, Negative: 0, RPS: 7
Positive: 10, Negative: 10, RPS: 3
Positive: 0, Negative: 100, RPS: 0
Positive: 100, Negative: 5, RPS: 9
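
Here’s a short sketch that reproduces the example scores above, assuming the score is the lower bound of the Wilson score interval at 95% confidence (z = 1.96) scaled to 0-10; the exact confidence level is an assumption, but it matches the table:

from math import sqrt

def promoter_score(positive, negative, z=1.96):
    """Lower bound of the Wilson score interval, scaled to 0-10."""
    n = positive + negative
    if n == 0:
        return 0
    p = positive / float(n)
    lower = (p + z * z / (2 * n) - z * sqrt((p * (1 - p) + z * z / (4 * n)) / n)) / (1 + z * z / n)
    return int(round(lower * 10))

for pos, neg in [(1, 0), (10, 0), (10, 10), (0, 100), (100, 5)]:
    print(pos, neg, promoter_score(pos, neg))  # -> 2, 7, 3, 0, 9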

When you create a new data source, Repustate calculates the RPS on an ongoing basis so you can see how the RPS fluctuates over time in your dashboard.

Arabic sentiment analysis – now 200x faster

A few months ago we began the task of migrating our Arabic sentiment analysis engine from a Python/Cython implementation to a Go implementation. The reason: speed. Go makes asynchronous programming and concurrency a cinch, and that’s where we were able to realize some crazy speed boosts.

Our English language sentiment analysis engine can analyze about 500 documents / second. Our Arabic sentiment engine, in Python, did about 2 documents / second. As a result, we never allowed our customers to use our bulk sentiment API call with Arabic text because it would be too slow. Until now.

Now that our engine is in Go, we’ve opened up bulk scoring in Arabic for everyone. You can now score 100 Arabic documents per API call. We’ll gradually increase the limit as time goes on – we’re just testing the waters for now.

We believe our solution was already the most accurate Arabic sentiment engine available – now we’re quite sure we have the fastest one out there, too.

So go give it a shot and let us know what you think.

Profiling Go programs with pprof

There’s an old blog post floating around on the Go Blog with instructions on how to profile your code using pprof – except it’s out of date and might confuse some people. Here’s an up-to-date (as of Go 1.1) way to profile your code.

1. import runtime/pprof
2. Where you want the CPU profiling to start, you write:

pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()

This has the effect of starting the profiler and then stopping it once the containing function (often your “main” function) exits. The argument you pass to StartCPUProfile, “f”, is an open file to which the profile information will be written. You can create this file by writing:

f, err := os.Create("my_profile.file")

3. OK, we have our profiling code set up. Now, build your binary. Let’s say my Go file is called “foo.go”; I would run “go build foo.go”. This will create a binary called “foo”.
4. Run your binary (e.g. ./foo)
5. Now run “go tool pprof foo my_profile.file” and you will be in the pprof interactive prompt. Here you can issue commands like top10 or top50; read up on pprof to find out all the commands you can enter. A complete minimal example is sketched below.
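
Putting the steps together, a minimal example might look like this (the file name and the busy-work loop are just placeholders for your own code):

package main

import (
    "fmt"
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    // Step 2: open the file the profile will be written to, then start profiling.
    f, err := os.Create("my_profile.file")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
    }
    defer pprof.StopCPUProfile()

    // Stand-in for the real work you want to profile.
    total := 0
    for i := 0; i < 50000000; i++ {
        total += i % 7
    }
    fmt.Println(total)
}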