Named entity recognition is now available in Russian via the Repustate API. Combined with Russian sentiment analysis, customers can now do full text analytics in Russian with Repustate.
Repustate is happy to add Russian named entity recognition to the solutions already available in English and Arabic. But like all languages, Russian has nuances that make named entity recognition a bit tougher than in, say, English.
Consider the following sentence:
Путин дал Обаме новую ядерную соглашение
In English, this says:
Putin gave Obama the new nuclear agreement
This is how Barack Obama is written in Russian:
Барак Обама
Notice that in our sentence, “Obama” is spelled with a different suffix at the end. That’s because in Russian, nouns (even proper nouns) are declined according to their grammatical case. In this example, Obama is in what’s called the “dative” case, meaning the noun is the recipient of something. English does not inflect nouns this way; apart from the possessive, it only changes a noun’s suffix for pluralization.
So Repustate has to know how to stem proper nouns as well in order to properly identify “Обаме” as Barack Obama during the contextual expansion phase.
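To make the stemming step concrete, here is a minimal sketch in Python. It assumes a tiny hand-written list of case endings and a two-entry entity table, both invented for illustration. A real system would need a proper morphological analyzer, and nothing here reflects Repustate’s actual implementation.

```python
# Toy suffix stripper for Russian proper nouns. The ending list and the
# entity table are illustrative assumptions, not Repustate's data.
CASE_ENDINGS = ("ой", "ою", "ом", "а", "ы", "е", "у")  # longer endings first

KNOWN_ENTITIES = {
    "обам": "Barack Obama",
    "путин": "Vladimir Putin",
}


def stem(word: str) -> str:
    """Lower-case the word and strip one common case ending, if present."""
    lowered = word.lower()
    for ending in CASE_ENDINGS:
        # Keep at least three characters so short words are not destroyed.
        if lowered.endswith(ending) and len(lowered) - len(ending) >= 3:
            return lowered[: -len(ending)]
    return lowered


def resolve(word: str):
    """Map an inflected surface form to a canonical entity, or None."""
    return KNOWN_ENTITIES.get(stem(word))


for form in ("Обама", "Обаме", "Путин", "Путину"):
    print(form, "->", resolve(form))
```

Both the nominative “Обама” and the dative “Обаме” reduce to the same stem, which is what lets a single table entry cover every case form.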
These are the sorts of problems we have solved so you don’t have to. Try out the Semantic API and let us know what you think.
Russian sentiment analysis is finally here – by popular demand
We’re often asked by customers to implement text analytics in particular languages, but no language has received as many requests as Russian. Some requests date back three years!
We’re happy to announce that Russian sentiment analysis is now open to all to use. Repustate has been testing Russian text analytics in a private beta with select customers and the results have been great. Now all of our customers can analyze Russian using our API.
Russian semantic analysis, including named entity extraction and theme classification, will soon be available as well, completing the loop on full-blown text analytics in Russian.
Dashboard design is how best to communicate the stories the underlying data is trying to tell you.
Every good B2B product these days has to provide some sort of dashboard to let its customers see into the data from a higher view. Dashboards should provide a narrative of what’s happening under the hood – so why do we make people do all the work? Repustate’s old dashboard for social media analytics flat out sucked. It was clunky, ugly, wasted valuable screen real estate on non-essential items and had very low engagement levels as a result. So we decided to redesign it.
Thank you, Stephen Few
The first step was to get a copy of Stephen Few’s book, Information Dashboard Design. Published in 2006, it covers a variety of topics related to dashboard design and presenting information. Although some of the screen shots in the book are dated, the same principles that applied then still apply now. Here are the main takeaways we got from this great book:
No clutter – remove any elements that don’t add value
Use colours to communicate a specific idea, not just for fun
Place the most important graphs/numbers/conclusions at the top
Provide feedback where possible – make the dashboard tell a story
If you read those 4 things, you’re probably thinking, “Well, duh, that’s obvious.” And yet, take a look at the dashboards you use every day; they’re loaded with visual noise that is irrelevant, poorly placed, or difficult to make heads or tails of. Here’s a screen shot of Repustate’s new dashboard:
Here’s a rundown of various features and attributes of this new design:
Minimal “chrome” (visual clutter) on the page. The side bar can slide out, but by default is closed and each of the menu items is also a hover menu so you never have to open the menu itself to see what each item means.
The Repustate logo is in the sidebar menu and is visible only when expanded. At this point, does a user really need to be reminded of what they’re using? They know. Hide your logos.
Top left area of the page provides a common action – download the raw data. No wasted space.
Most important information for our users is sentiment trend over time – put that at the top. We colour coded the “panels” depending on the sentiment and also have a background image (thumb up, thumb down, question mark) depending on the value. The visuals align with the data itself.
Summary statements tell the user exactly what’s happening – don’t make them guess. We have a bunch of rules set up that determine which sentences appear when. Within seconds, our users know exactly what’s going on – their dashboard is communicating with them directly. You can’t believe how much engagement increases as a result.
Easy to read graphs & lists. Minimal clutter and minimal labelling. If your graphs and charts are done properly, the labelling of data points should be unnecessary.
In less than 20 seconds, any executive will be able to glean from this exactly what’s going on. By drilling down and clicking some links, they can get deeper insights, but the point is that to get the “executive summary”, they shouldn’t have to. We’re telling them if they’re doing well or not. We’re telling them what’s changed since last time. Ask yourself what information your dashboards can communicate and then make them do so. Your customers will love you for it.
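The rule-driven summary statements described in the list above can be sketched roughly as follows. This is only an illustration of the idea: the `Snapshot` fields, the rules and the 5-point threshold are all hypothetical, not the rules the dashboard actually uses.

```python
from typing import Callable, List, NamedTuple


class Snapshot(NamedTuple):
    positive_share: float       # fraction of positive mentions this period
    prev_positive_share: float  # same fraction for the previous period
    mentions: int               # total mentions this period


class Rule(NamedTuple):
    applies: Callable[[Snapshot], bool]
    render: Callable[[Snapshot], str]


# Hypothetical rules; the 5-point threshold is an arbitrary choice.
RULES = [
    Rule(
        lambda s: s.mentions == 0,
        lambda s: "No mentions were collected in this period.",
    ),
    Rule(
        lambda s: s.mentions > 0
        and s.positive_share - s.prev_positive_share >= 0.05,
        lambda s: f"Positive sentiment rose to {s.positive_share:.0%}, "
                  f"up from {s.prev_positive_share:.0%}.",
    ),
    Rule(
        lambda s: s.mentions > 0
        and s.prev_positive_share - s.positive_share >= 0.05,
        lambda s: f"Positive sentiment fell to {s.positive_share:.0%}, "
                  f"down from {s.prev_positive_share:.0%}.",
    ),
]


def summarize(snapshot: Snapshot) -> List[str]:
    """Return every sentence whose rule matches, or a neutral fallback."""
    sentences = [rule.render(snapshot) for rule in RULES if rule.applies(snapshot)]
    return sentences or ["Sentiment is roughly unchanged since the last period."]


print(summarize(Snapshot(positive_share=0.62, prev_positive_share=0.50, mentions=120)))
```

Keeping each rule as a condition plus a template makes it cheap to add new sentences without touching the rendering code.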
SurveyMonkey API + Repustate text analytics = insightful open ended responses
We recently worked with a new entrant into the healthy snack foods business who wanted to understand the market they were getting into. Specifically, they wanted to know the following:
Which foods do people currently eat as their “healthy snack”
Which brands do consumers think of when they hear the word “snack”
Was there anything about the current selection of snack foods that consumers didn’t like?
If you were having friends or family over for a casual get together (summer BBQ, catching up etc.) what kinds of snacks would you serve?
With these goals in mind, a survey was created using SurveyMonkey and distributed to the new entrant’s target market via their newsletter (Protip: before you launch a product, build up a mailing list. It works to 1) validate your idea and 2) tighten the feedback loop on product decisions). A telemarketing service was also employed to call people and ask them the same questions. These responses were transcribed and sent to Repustate so the same analysis could be performed.
OK so that’s the easy part; thousands of responses were collected. But the responses were what is referred to in the market research industry as “open ended”, meaning they were just free form text as opposed to a multiple choice list. The reason was that this brand didn’t want to introduce any bias into the survey by prompting the respondents with possible answers. For example, take question #2 from above. If presented with a list of snacks, the respondent might in their head say “Oh yeah, I forgot about brand X” and check it off as being a snack they think of, when in reality, that brand’s product had slipped off their radar. Open ended responses test how deeply ingrained a particular idea, concept or in this case, brand, is within a person’s consciousness.
But having open ended responses poses a challenge – how do you data mine them en masse and aggregate the results to come up with something meaningful? If you have a few hundred responses to read, maybe you hire a few interns. But what about when you have tens of thousands? That’s where Repustate comes in.
Use the APIs, Luke
Fortunately, SurveyMonkey has a pretty simple to use API. Combined with Repustate’s even easier to use API, you can go from open ended response to data mined text in seconds. Here’s a code snippet that provides a good blueprint for how one can marry these two APIs together. While some details have been omitted, it should be relatively straightforward as to how you can adapt it to suit your needs:
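A minimal sketch of such a blueprint, in Python. To avoid guessing at either vendor’s endpoints or response formats, the two API calls are passed in as callables; `fetch_responses`, `extract_entities` and `store` are hypothetical names, and the `fake_*` functions are offline stand-ins rather than real SurveyMonkey or Repustate clients.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Entities = Dict[str, List[str]]  # assumed shape: category -> matched terms


def mine_open_ended(
    fetch_responses: Callable[[], Iterable[str]],
    extract_entities: Callable[[str], Entities],
    store: Callable[[str, Entities], None],
) -> int:
    """Pull each free-form answer, tag its entities, and persist the result."""
    processed = 0
    for answer in fetch_responses():
        text = answer.strip()
        if not text:
            continue  # skipped or blank answers carry nothing to mine
        store(text, extract_entities(text))
        processed += 1
    return processed


# Stand-ins so the sketch runs offline; swap in real API clients here.
def fake_fetch() -> List[str]:
    return ["I usually put veggies out, like carrots and celery.", "   "]


def fake_extract(text: str) -> Entities:
    vocabulary = {"carrots": "vegetable", "celery": "vegetable"}  # made up
    found: Entities = {}
    for word, category in vocabulary.items():
        if word in text.lower():
            found.setdefault(category, []).append(word)
    return found


results: List[Tuple[str, Entities]] = []
count = mine_open_ended(fake_fetch, fake_extract, lambda t, e: results.append((t, e)))
print(count, results)
```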
So with very few lines of Python code, we’ve grabbed the open ended responses, processed them through the named entities API call, and can store the results in our backend of choice. Let’s take a look at a sample response and see how Repustate categorized it.
Q: If you were having friends or family over for a casual get together (summer BBQ, catching up etc.) what kinds of snacks would you serve?
A: I usually put veggies out, like carrots, celery, cucumbers etc. etc. and maybe some dip like hummus and crackers.
Running that response through the Repustate API yields this information:
Armed with this analysis, we then aggregated the results to see which categories of food, and which brands were being mentioned the most frequently. This helped our client understand who they were competing against.
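The aggregation step can be as simple as a pair of counters. The sketch below assumes each tagged response is a mapping from category to the entities found, which is a hypothetical shape chosen for illustration, and the sample data is made up.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_mentions(
    tagged_responses: Iterable[Dict[str, List[str]]],
) -> Tuple[Counter, Counter]:
    """Tally how often each category and each entity is mentioned."""
    categories: Counter = Counter()
    entities: Counter = Counter()
    for tagged in tagged_responses:
        for category, names in tagged.items():
            categories[category] += len(names)
            entities.update(names)
    return categories, entities


# Made-up tagged responses, only to show the shape of the input.
sample = [
    {"vegetable": ["carrots", "celery"], "dip": ["hummus"]},
    {"vegetable": ["carrots"]},
]
categories, entities = count_mentions(sample)
print(categories.most_common(1))
print(entities.most_common(1))
```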
As it turns out, it was plain old vegetables that were the biggest competition to this new entrant, which is a double-edged sword. On the one hand, it means they don’t have to spend the marketing dollars to compete with an entrenched incumbent who dominates most of the shelf space in supermarkets. On the other hand, it’s a troubling place to be in because vegetables are well known, cheap, and are viewed as healthy (obviously).
We’re fortunate to be living in a time when so much data is at our disposal, ready to be sliced & diced. We’re also cursed because there’s so much of it! We need the right tools and a clear mind to handle these sorts of problems, but it’s possible.
The freemium business model has suited Repustate well to a point, but now it’s time to transition to a fully paid service.
When Repustate launched all that time ago, it was a completely free service. We didn’t take your money even if you offered. The reason was we wanted more customer data to improve our various language models and felt giving software away for free in exchange for the data was a good bargain for both sides.
As our products matured and as our user base grew, it was time to flip the monetary switch and start charging – but still offering a free “Starter” plan as well as a free demo to try out our sentiment & semantic analysis engine.
As we’ve grown and as our SEO has improved, we’ve received a lot more interest from “tire kickers”: people who just want to play around and are not really interested in buying. That was fine by us because, again, we got their data so we could see how to improve our engines. But recently, the abuse of our Starter plan has reached the point where this is no longer worth our while. People are creating those 10-minute throwaway accounts to sign up, activate their accounts, and then use our API.
While one could argue that if people aren’t willing to pay, maybe the product isn’t that good, the extremes people are going to in order to use Repustate’s API for free tell us that we do have a good product and that charging everyone is perfectly reasonable.
As a result, we will be removing the Starter plan. From now on, all accounts must be created with a valid credit card. We’ll probably offer a money-back trial period, say 14 days, but other than that, customers must commit to payment on Day 0. We will also bump up the quotas for all of our plans to make the value proposition all the better.
Any accounts currently on the Starter plan will be allowed to remain so. If you have any questions about this change and how it affects you, please contact us anytime.
Crowdsourced data can often be inconsistent, messy or downright wrong
We all like something for nothing, that’s why open source software is so popular. (It’s also why the Pirate Bay exists). But sometimes things that seem too good to be true are just that.
Repustate is in the text analytics game, which means we need lots and lots of data to model certain characteristics of written text. We need common words, grammar constructs, human-annotated corpora of text etc. to make our various language models work as quickly and as well as they do.
We recently embarked on the next phase of our text analytics adventure: semantic analysis. Semantic analysis is the process of taking arbitrary text and assigning meaning to the individual, relevant components. For example, being able to identify “apple” as a fruit in the sentence “I went apple picking yesterday” but to identify “Apple” the company when saying “I can’t wait for the new Apple product announcement” (note: even though I used title case for the latter example, casing should not matter).
To be able to accomplish this task, we need a few things:
1) List of every possible person/place/business/thing we care about and the classification they belong to
2) A corpus of text (or corpora) that will allow us to disambiguate terms based on context. In other words, if we see the word “banana” near the word “apple”, we can safely assume we’re talking about fruits and not computers.
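As a rough illustration of how those two ingredients fit together, the Python sketch below scores each candidate sense of a term by how many of its typical context words appear nearby. The sense list and context words are invented for the example. In practice they would be learned from a corpus rather than typed in by hand.

```python
import re
from typing import Optional

# Hypothetical sense inventory: term -> {sense: words that tend to co-occur}.
SENSE_CONTEXTS = {
    "apple": {
        "Apple (fruit)": {"banana", "picking", "orchard", "pie", "juice"},
        "Apple (company)": {"iphone", "product", "announcement", "mac", "stock"},
    }
}


def disambiguate(term: str, text: str) -> Optional[str]:
    """Pick the sense whose context words overlap most with the text."""
    senses = SENSE_CONTEXTS.get(term.lower())
    if not senses:
        return None
    # Lower-casing everything means the casing of "Apple" does not matter.
    tokens = set(re.findall(r"\w+", text.lower())) - {term.lower()}
    scores = {sense: len(tokens & context) for sense, context in senses.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # no evidence, no guess


print(disambiguate("apple", "I went apple picking yesterday"))
print(disambiguate("apple", "I can't wait for the new Apple product announcement"))
```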
Since we’re not Google, we don’t have access to every person’s search history and resulting click throughs (although their n-gram data is useful in some applications). So we have to be clever.
Anyone who’s done work in text analysis will have heard of Freebase. Freebase is a crowdsourced repository of facts. Kind volunteers have contributed lists of data and tagged meta information about them. For example, you can look up all models of a particular automotive manufacturer, like Audi. You can see a list of musicians (hundreds of thousands actually), movie stars, TV actors or types of food.
It’s tempting to use data like Freebase. It seems like someone did all the work for you. But once you dig inside, you realize it’s tons of junk, all the way down.
For example, under the Food category, you’ll see the name of each US state. I didn’t realize I could eat Alaska. Under book authors, you’ll see any athlete who’s ever “written” an autobiography. I highly doubt Michael Jordan wrote his own book, but there it is. LeBron James, NBA all-star for the Miami Heat, is listed as a movie actor.
The list goes on and on. While Freebase definitely does lend itself to being a good starting point, ultimately you’re on your own to come up with a better list of entities, either through some mechanical turking or by being more clever.
By the way, if you’d like to see the end result of Repustate’s curation process, head on over to the Semantic API and try it out.
Repustate is announcing today the release of its new product: semantic analysis. Combined with sentiment analysis, Repustate provides any organization, from startup to Fortune 50 enterprise, all the necessary tools they need to conduct in-depth text analytics. For the impatient, head on over to the semantic analysis docs page to get started.
Text analytics: the story so far
Until this point, Repustate has been concerned with analyzing text structurally. Part of speech tagging, grammatical analysis, even sentiment analysis is really all about the structure of the text. The order in which words come, the use of conjunctions, adjectives or adverbs to denote any sentiment. All of this is a great first step in understanding the content around you – but it’s just that, a first step.
Today we’re proud and excited to announce Semantic Analysis by Repustate. We consider this release to be the biggest product release in Repustate’s history and the one that we’re most proud of (although Arabic sentiment analysis was a doozy as well!)
Semantic analysis explained
Repustate can determine the subject matter of any piece of text. We know that a tweet saying “I love shooting hoops with my friends” has to do with sports, namely, basketball. Using Repustate’s semantic analysis API you can now determine the theme or subject matter of any tweet, comment or blog post.
But beyond just identifying the subject matter of a piece of text, Repustate can dig deeper and understand each and every key entity in the text and disambiguate based on context.
Named entity recognition
Repustate’s semantic analysis tool extracts each and every one of these entities and tells you the context. Repustate knows that the term “Obama” refers to “Barack Obama”, the President of the United States. Repustate knows that in the sentence “I can’t wait to see the new Scorsese film”, Scorsese refers to “Martin Scorsese” the director. With very little context (and sometimes no context at all), Repustate knows exactly what an arbitrary piece of text is talking about. Take the following examples:
Obama
Obama is the President.
Obama is the First Lady.
Here we have three instances where the term “Obama” is being used in different contexts. In the first example, there is no context, just the name ‘Obama’. Repustate will use its internal probability model to determine the most likely usage for this term is in the name ‘Barack Obama’, hence an API call will return ‘Barack Obama’. Similarly, in the second example, the word “President” acts as a hint to the single term ‘Obama’ and again, the API call will return ‘Barack Obama’. But what about the third example?
Here, Repustate is smart enough to see the phrase “First Lady”. This tells Repustate to select ‘Michelle Obama’ instead of Barack. Pretty neat, huh?
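The behaviour described above, falling back to the most likely entity when there is no context and letting a hint override it, can be sketched in a few lines of Python. The probabilities and hint phrases are invented for illustration; they are not Repustate’s model.

```python
from typing import Optional

# term -> list of (entity, prior probability, context hints); values are made up.
CANDIDATES = {
    "obama": [
        ("Barack Obama", 0.9, {"president", "senator"}),
        ("Michelle Obama", 0.1, {"first lady"}),
    ]
}


def resolve_entity(term: str, text: str) -> Optional[str]:
    """Prefer candidates whose hints appear in the text, else use the prior."""
    lowered = text.lower()
    candidates = CANDIDATES.get(term.lower(), [])
    hinted = [c for c in candidates if any(hint in lowered for hint in c[2])]
    pool = hinted or candidates  # no hint found: fall back to every candidate
    if not pool:
        return None
    return max(pool, key=lambda candidate: candidate[1])[0]


for sentence in ("Obama", "Obama is the President.", "Obama is the First Lady."):
    print(sentence, "->", resolve_entity("Obama", sentence))
```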
Semantic analysis in your language
As with every other feature Repustate offers, no language takes a back seat, and that’s why semantic analysis is available in every language Repustate supports. Currently only English is publicly available but we’re rolling out every other language in the coming weeks.
Semantic analysis by the numbers
Repustate currently has over 5.5 million entities, including people, places, brands, companies and ideas in its ontology. There are over 500 categorizations of entities, and over 30 themes with which to classify a piece of text’s subject matter. And an infinite number of ways to use Repustate to transform your text analysis.
Python requests encoding – using the Python requests module might give you surprising results
But there’s a subtle issue with regard to encodings that tripped us up. A customer told us that some Chinese web pages were coming back garbled when using the clean-html API call we provide. Here’s the URL: http://finance.sina.com.cn/china/20140208/111618150293.shtml
In the HTML on these pages, the charset is gb2312, an encoding that came out of China for the Simplified Chinese character set. However, many web servers do not send this charset in the response headers (the fault of the programmers, not the web server itself). As a result, requests defaults to ISO-8859-1 as the encoding when the response’s Content-Type header doesn’t contain a charset. This is done in accordance with RFC 2616. The upshot is that the Chinese text in the web page doesn’t get decoded properly when you access the decoded content of the response (response.text), and so what you see is garbled characters.
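The garbling is easy to reproduce without any network access. This small Python sketch encodes a two-character Chinese string as gb2312, the way such a server would send it, and then decodes the same bytes both ways; the sample string is arbitrary.

```python
original = "中文"  # arbitrary sample text ("Chinese language")

page_bytes = original.encode("gb2312")      # bytes as the server sends them
garbled = page_bytes.decode("iso-8859-1")   # what the ISO-8859-1 default yields
correct = page_bytes.decode("gb2312")       # what the page's charset yields

print(repr(garbled))
print(repr(correct))
```

ISO-8859-1 maps every byte to some character, so the wrong decode never raises an error. It silently produces two Latin characters per Chinese character, which is why the problem is easy to miss.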
Here are the response headers for the above URL:
curl -I http://finance.sina.com.cn/china/20140208/111618150293.shtml
HTTP/1.1 200 OK
Date: Mon, 17 Feb 2014 15:54:28 GMT
Last-Modified: Sat, 08 Feb 2014 03:56:49 GMT
Expires: Mon, 17 Feb 2014 15:56:28 GMT
X-Cache: HIT from 236-41.D07071951.sina.com.cn
There is a thread on the GitHub repository for requests that explains why they do this – requests shouldn’t be about HTML, the argument goes, it’s about HTTP, so if a server doesn’t respond with the proper charset declaration, it’s up to the client (or the developer) to figure out what to do. That’s a reasonable position to take, but it poses an interesting question: when “common” use or expectations go against the official spec, whose side does one take? Do you tell developers to put on their big boy and girl pants and deal with it, or do you acquiesce and just do what most people expect/want?
Specs be damned, make it easy for people
I believe it was Alex Payne, Twitter’s API lead at the time, who was asked why Twitter includes the version of the API in the URL rather than in a request header, as is more RESTful. His response, paraphrased because I can’t find the quote, was that Twitter’s goal was to get as many people using the API as possible, and setting headers was beyond the skill level of many developers, whereas including it in the URL is dead simple. (We at Repustate do the same thing; our APIs are versioned via the URL. It’s simpler and more transparent.)
Now the odd thing about requests is that the response object has an attribute called apparent_encoding which does correctly guess the charset based on the content of the response. It’s just not applied automatically, because the encoding derived from the response header takes precedence.
We ended up patching requests so that apparent_encoding is used whenever the response headers do not declare a charset; this is not the default behaviour of the package.
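The same effect can be had without patching the library, by overriding the encoding on each response. The following is a sketch of that alternative rather than the patch itself; `declared_charset`, `choose_encoding` and `fetch_text` are hypothetical helper names, and only `requests.get`, `response.headers`, `response.encoding`, `response.apparent_encoding` and `response.text` belong to requests.

```python
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header, if any."""
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.strip().lower() == "charset" and value.strip():
            return value.strip().strip("\"'")
    return None


def choose_encoding(content_type: str, apparent_encoding: Optional[str]) -> Optional[str]:
    """Prefer the charset the server declared; otherwise use the guess."""
    return declared_charset(content_type) or apparent_encoding


def fetch_text(url: str) -> str:
    """Fetch a page and decode it with the declared or the guessed charset."""
    import requests  # third-party; imported here so the helpers stay stdlib-only

    response = requests.get(url, timeout=10)
    response.encoding = choose_encoding(
        response.headers.get("Content-Type", ""), response.apparent_encoding
    )
    return response.text
```

Note that apparent_encoding is a statistical guess made from the response body, so it can be wrong on short or mixed-language pages; that is one reason to keep trusting an explicit header when there is one.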
I can’t say I necessarily disagree with the choices the maintainers of requests have made. I’m not sure if there is a right answer because if you write your code to be user friendly in direct opposition to a published spec, you will almost certainly raise the ire of someone who *does* expect things to work to spec. Damned if you do, damned if you don’t.
An interesting article by Seth Grimes caught our eye this week. Seth is one of the few voices of reason in the world of text analytics that I feel “gets it”. His views on sentiment’s strengths and weaknesses, advantages and shortcomings align quite perfectly with Repustate’s general philosophy.
In the article, Seth states that simply relying on a number denoting sentiment or a label like “positive” or “negative” is too coarse a measurement and doesn’t carry any meaning with it. By doing so, you risk overlooking deeper insights that are hidden beneath the high level sentiment score. Couldn’t agree more with this and that’s why Repustate supports categorizations.
Sentiment by itself is meaningless; sentiment analysis scoped to a particular business need or product feature is where true value lies. Categorizing your social data by features of your service (e.g. price, selection, quality) first and THEN applying sentiment analysis is the way to go. In the article, Seth proceeds to list a few “emotional” categories (promoter/detractor, angry, happy etc.) that quite frankly I would ignore. These categories are too touchy-feely, hard to really disambiguate at a machine learning level and don’t tie closely to actual business processes/features. For instance, if someone is a detractor, what is it that is causing them to be a detractor? Was it the service they received? If so, then customerservice is the category you want, and the negative polarity of the text in question gives you invaluable insights. The fact that someone is being negative about your business means almost by definition that they are a detractor.
Repustate provides our customers with the ability to create their own categories according to the definitions that they create. Each customer is different, each business is different, hence the need for customized categories. Once you have your categories, sentiment analysis becomes much more insightful and valuable to your business.
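To illustrate the “categorize first, then score” idea, here is a deliberately naive Python sketch. The keyword lists and the tiny sentiment lexicon are hypothetical stand-ins for customer-defined categories and a real sentiment engine.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Hypothetical customer-defined categories and a toy sentiment lexicon.
CATEGORY_KEYWORDS = {
    "price": {"price", "expensive", "cheap", "cost"},
    "customerservice": {"service", "staff", "support", "rude"},
}
POSITIVE = {"great", "friendly", "cheap", "love"}
NEGATIVE = {"rude", "expensive", "slow", "terrible"}


def polarity(tokens: Set[str]) -> int:
    """Positive words minus negative words; a crude stand-in for a real model."""
    return len(tokens & POSITIVE) - len(tokens & NEGATIVE)


def sentiment_by_category(texts: Iterable[str]) -> Dict[str, int]:
    """Assign each text to its categories first, then add up its polarity."""
    totals: Dict[str, int] = defaultdict(int)
    for text in texts:
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        score = polarity(tokens)
        for category, keywords in CATEGORY_KEYWORDS.items():
            if tokens & keywords:
                totals[category] += score
    return dict(totals)


print(sentiment_by_category(["The staff were rude and slow", "Great price, really cheap"]))
```

The per-category totals answer “negative about what?”, which a single overall score cannot.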
At Repustate, many of the data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our particular case, our dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact for a given HTTP request, 4 or 5 models might be accessed, each doing 20-30 lookups. So the problem we face is how to keep things fast for the client while keeping the server as light as possible. We also distribute our software as virtual machines to some customers, so memory usage has to be light because we can’t control how much memory our customers will allocate to the VMs they deploy.
To summarize, here’s our checklist of requirements:
1) Low memory footprint
2) Can be shared amongst multiple processes with no issues (read only)
3) Very fast access
4) Easy to update (write) out of process
So our first attempt was to store the models on disk in MongoDB and to load them into memory as Python dictionaries. This worked and satisfied #3 and #4 but failed #1 and #2. This is how Repustate operated for a while, but memory usage kept growing and it became unsustainable. Python dictionaries are not memory efficient. And it was too expensive for each Apache process to hold its own copy, since we were not sharing the data between processes.
One night I was complaining about our dilemma and a friend of mine, who happens to be a great developer at Red Hat, said these three words: “memory mapped file”. Of course! In fact, Repustate already uses memory mapped files but I completely forgot about this. So that solves half my problem – it meets requirement #2. But what format does the memory mapped file take? Thankfully computer science has already solved all the world’s problems and the perfect data structure was already out there: tries.
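For readers who haven’t used one, this is roughly what a read-only memory mapped file looks like with Python’s standard mmap module. The file name and contents are placeholders; the point is that the operating system pages the file in on demand and can share those pages between every process that maps the same file.

```python
import mmap
import os
import tempfile


def find_in_mapped_file(path: str, needle: bytes) -> int:
    """Return the byte offset of needle in the file via a read-only mapping, or -1."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view.find(needle)


# Write a small stand-in "model" file, then search it through the mapping.
model_path = os.path.join(tempfile.mkdtemp(), "model.bin")
with open(model_path, "wb") as handle:
    handle.write(b"positive\nnegative\nneutral\n")

print(find_in_mapped_file(model_path, b"negative"))
```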
Tries (pronounced “trees” for some reason and not “try’s”), AKA prefix trees (radix trees are a compressed variant), are a data structure that lends itself to lookups by string key. Wikipedia has a better write-up, but long story short, tries are great for the type of models Repustate uses.
I found this package, marisa-trie, which is a Python wrapper around a C++ implementation of a MARISA trie. “Marisa” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What’s great about MARISA tries is that the storage mechanism really shrinks how much memory you need. The author of the Python package claims a 50-100X reduction in size compared with a Python dictionary – our experience is similar.
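For intuition, here is a bare-bones prefix tree in pure Python. It is a teaching sketch only: keys that share a prefix share nodes, which is where the space saving comes from, but a node-per-character Python implementation like this one is far less compact than an optimized library.

```python
from typing import Any, Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.value: Any = None
        self.is_key = False  # True when a stored key ends at this node


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, key: str, value: Any) -> None:
        node = self.root
        for char in key:
            node = node.children.setdefault(char, TrieNode())
        node.is_key, node.value = True, value

    def _walk(self, prefix: str) -> Optional[TrieNode]:
        node = self.root
        for char in prefix:
            node = node.children.get(char)
            if node is None:
                return None
        return node

    def get(self, key: str, default: Any = None) -> Any:
        node = self._walk(key)
        return node.value if node is not None and node.is_key else default

    def keys_with_prefix(self, prefix: str) -> List[str]:
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [(prefix, start)]
        while stack:
            key, node = stack.pop()
            if node.is_key:
                found.append(key)
            stack.extend((key + char, child) for char, child in node.children.items())
        return sorted(found)


trie = Trie()
for word, score in (("app", 2), ("apple", 1), ("banana", 3)):
    trie.insert(word, score)
print(trie.get("apple"), trie.keys_with_prefix("app"))
```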
What’s great about the marisa trie package is that the underlying trie structure can be written to disk and then read in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation.
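A minimal sketch of that write-then-mmap workflow with the marisa-trie package follows. The word scores, the `"<d"` record format (one double per key) and the helper names are assumptions made for the example; they are not Repustate’s actual model layout.

```python
import os
import tempfile
from typing import Dict, Optional


def build_model(scores: Dict[str, float], path: str) -> None:
    """Write a {word: score} mapping to disk as a MARISA record trie."""
    import marisa_trie  # third-party: pip install marisa-trie

    records = ((word, (score,)) for word, score in scores.items())
    marisa_trie.RecordTrie("<d", records).save(path)


def open_model(path: str):
    """Memory-map a saved trie instead of loading it into process memory."""
    import marisa_trie

    return marisa_trie.RecordTrie("<d").mmap(path)


def lookup(model, word: str, default: Optional[float] = None) -> Optional[float]:
    """Return the stored score for word, or default when the key is absent."""
    if word in model:
        return model[word][0][0]  # a RecordTrie returns a list of tuples per key
    return default
```

A writer process would call `build_model(...)` whenever the data changes, and each web worker would call `open_model(path)` once at start-up and then use `lookup(model, "great")` per request. Because every worker maps the same file, the model’s pages can be shared rather than copied per process.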
Next time you’re in need of sharing large amounts of data, give memory mapped tries a chance.