Beware the lure of crowdsourced data

Crowdsourced data can often be inconsistent, messy or downright wrong

We all like something for nothing; that’s why open source software is so popular. (It’s also why the Pirate Bay exists.) But sometimes things that seem too good to be true are just that.

Repustate is in the text analytics game, which means we need lots and lots of data to model certain characteristics of written text. We need common words, grammar constructs, human-annotated corpora of text, etc. to make our various language models work as quickly and as well as they do.

We recently embarked on the next phase of our text analytics adventure: semantic analysis. Semantic analysis is the process of taking arbitrary text and assigning meaning to the individual, relevant components. For example, being able to identify “apple” as a fruit in the sentence “I went apple picking yesterday” but to identify “Apple” the company in “I can’t wait for the new Apple product announcement” (note: even though I used title case for the latter example, casing should not matter).

To be able to accomplish this task, we need a few things:

1) A list of every possible person/place/business/thing we care about and the classification each one belongs to

2) A corpus of text (or corpora) that will allow us to disambiguate terms based on context. In other words, if we see the word “banana” near the word “apple”, we can safely assume we’re talking about fruits and not computers.
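To make the second point concrete, here is a minimal sketch of context-based disambiguation. It is not how Repustate does it; the `COOCCURRENCE` table, its counts and the `disambiguate` function are all made up for illustration, standing in for statistics you would gather from a real annotated corpus.

```python
from collections import Counter

# Hypothetical co-occurrence counts: how often each context word appeared
# near each sense of "apple" in some annotated corpus. The numbers are made up.
COOCCURRENCE = {
    "apple (fruit)": Counter({"banana": 40, "picking": 25, "pie": 30}),
    "Apple (company)": Counter({"product": 50, "announcement": 35, "iphone": 60}),
}


def disambiguate(context_words, cooccurrence=COOCCURRENCE):
    """Return the sense whose known neighbours best match the context."""
    # Counter returns 0 for words it has never seen, so unknown words are ignored.
    scores = {
        sense: sum(counts[word.lower()] for word in context_words)
        for sense, counts in cooccurrence.items()
    }
    return max(scores, key=scores.get)


print(disambiguate("I bought a banana and an apple".split()))
print(disambiguate("new Apple product announcement".split()))
```

Lowercasing every word before the lookup is what lets casing not matter, and with no evidence at all the function simply falls back to the first sense in the table.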

Since we’re not Google, we don’t have access to every person’s search history and resulting click throughs (although their n-gram data is useful in some applications). So we have to be clever.

Anyone who’s done work in text analysis will have heard of Freebase. Freebase is a crowdsourced repository of facts. Kind volunteers have contributed lists of data and tagged meta information about them. For example, you can look up all the models of a particular automotive manufacturer, like Audi. You can see a list of musicians (hundreds of thousands actually), movie stars, TV actors or types of food.

It’s tempting to use data like Freebase. It seems like someone did all the work for you. But once you dig inside, you realize it’s tons of junk, all the way down.

For example, under the Food category, you’ll see the name of each US state. I didn’t realize I could eat Alaska. Under book authors, you’ll see any athlete who’s ever “written” an autobiography. I highly doubt Michael Jordan wrote his own book, but there it is. LeBron James, NBA all-star for the Miami Heat, is listed as a movie actor.

The list goes on and on. While Freebase definitely does lend itself to being a good starting point, ultimately you’re on your own to come up with a better list of entities, either through some mechanical turking or by being more clever 🙂
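One cheap way to start that cleanup is to compare the crowdsourced rows against a small list you trust and send every disagreement to a human for review. The sketch below assumes exactly that; the rows, the `TRUSTED_PRIMARY_TYPE` table and the `flag_suspect_rows` function are hypothetical and are not real Freebase output or Repustate’s actual curation process.

```python
# Hypothetical crowdsourced rows of (entity, category); not real Freebase output.
CROWDSOURCED = [
    ("Alaska", "food"),
    ("Banana", "food"),
    ("Michael Jordan", "book author"),
    ("LeBron James", "movie actor"),
]

# A small hand-curated list of what each entity primarily is (also made up).
TRUSTED_PRIMARY_TYPE = {
    "Alaska": "us state",
    "Banana": "food",
    "Michael Jordan": "athlete",
    "LeBron James": "athlete",
}


def flag_suspect_rows(rows, trusted):
    """Return rows whose category disagrees with the trusted primary type."""
    return [
        (entity, category)
        for entity, category in rows
        if entity in trusted and trusted[entity] != category
    ]


for entity, category in flag_suspect_rows(CROWDSOURCED, TRUSTED_PRIMARY_TYPE):
    print(f"review: {entity!r} listed as {category!r}")
```

A flagged row is not necessarily wrong (an athlete can also be an author), which is why this only builds a review queue rather than deleting anything.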

By the way, if you’d like to see the end result of Repustate’s curation process, head on over to the Semantic API and try it out.

Introducing Semantic Analysis

Repustate is announcing today the release of its new product: semantic analysis. Combined with sentiment analysis, Repustate provides any organization, from startup to Fortune 50 enterprise, all the necessary tools they need to conduct in-depth text analytics. For the impatient, head on over to the semantic analysis docs page to get started.

Text analytics: the story so far

Until this point, Repustate has been concerned with analyzing text structurally. Part of speech tagging, grammatical analysis and even sentiment analysis are really all about the structure of the text: the order in which words come, and the use of conjunctions, adjectives or adverbs to denote any sentiment. All of this is a great first step in understanding the content around you, but it’s just that: a first step.

Today we’re proud and excited to announce Semantic Analysis by Repustate. We consider this release to be the biggest product release in Repustate’s history and the one that we’re most proud of (although Arabic sentiment analysis was a doozy as well!).

Semantic analysis explained

Repustate can determine the subject matter of any piece of text. We know that a tweet saying “I love shooting hoops with my friends” has to do with sports, namely, basketball. Using Repustate’s semantic analysis API you can now determine the theme or subject matter of any tweet, comment or blog post.

But beyond just identifying the subject matter of a piece of text, Repustate can dig deeper and understand each and every key entity in the text and disambiguate based on context.

Named entity recognition

Repustate’s semantic analysis tool extracts each and every one of these entities and tells you the context. Repustate knows that the term “Obama” refers to “Barack Obama”, the President of the United States. Repustate knows that in the sentence “I can’t wait to see the new Scorsese film”, Scorsese refers to “Martin Scorsese” the director. With very little context (and sometimes no context at all), Repustate knows exactly what an arbitrary piece of text is talking about. Take the following examples:

  1. Obama.
  2. Obama is the President.
  3. Obama is the First Lady.

Here we have three instances where the term “Obama” is being used in different contexts. In the first example, there is no context, just the name ‘Obama’. Repustate will use its internal probability model to determine that the most likely usage of this term is in the name ‘Barack Obama’, hence an API call will return ‘Barack Obama’. Similarly, in the second example, the word “President” acts as a hint for the single term ‘Obama’ and again, the API call will return ‘Barack Obama’. But what about the third example?

Here, Repustate is smart enough to see the phrase “First Lady”. This tells Repustate to select ‘Michelle Obama’ instead of Barack. Pretty neat, huh?
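For the curious, the three examples can be reproduced with a toy scoring scheme: start from a prior for each candidate and let hint phrases shift the score. This is only a sketch in the style of naive Bayes, not Repustate’s internal probability model; the `PRIOR` and `HINT_LIKELIHOOD` numbers and the `resolve` function are invented for illustration.

```python
import math

# Hypothetical prior: how likely each entity is when we only see "Obama".
PRIOR = {"Barack Obama": 0.9, "Michelle Obama": 0.1}

# Hypothetical likelihoods of seeing a hint phrase near each entity.
HINT_LIKELIHOOD = {
    "president": {"Barack Obama": 0.30, "Michelle Obama": 0.01},
    "first lady": {"Barack Obama": 0.005, "Michelle Obama": 0.40},
}


def resolve(text):
    """Pick the entity with the highest log prior plus log hint likelihoods."""
    lowered = text.lower()
    scores = {entity: math.log(p) for entity, p in PRIOR.items()}
    for hint, likelihoods in HINT_LIKELIHOOD.items():
        if hint in lowered:  # simple substring match, good enough for a sketch
            for entity in scores:
                scores[entity] += math.log(likelihoods[entity])
    return max(scores, key=scores.get)


for sentence in ["Obama.", "Obama is the President.", "Obama is the First Lady."]:
    print(sentence, "->", resolve(sentence))
```

With no hints present only the prior counts, which is why the bare name resolves to the more common entity; a strong enough hint such as “First Lady” outweighs that prior.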

Semantic analysis in your language

As with every other feature Repustate offers, no language takes a back seat, and that’s why semantic analysis is available in every language Repustate supports. Currently only English is publicly available but we’re rolling out every other language in the coming weeks.

Semantic analysis by the numbers

Repustate currently has over 5.5 million entities, including people, places, brands, companies and ideas in its ontology. There are over 500 categorizations of entities, and over 30 themes with which to classify a piece of text’s subject matter. And an infinite number of ways to use Repustate to transform your text analysis.

Head on over to the Semantic Analysis Tour to see more.