Introducing Repustate Sync for distributed deployment

Keeping your Repustate data in sync used to be a pain point – not anymore.

Since we launched the Repustate Server nearly three years ago, the biggest complaint has always been keeping the Servers in-sync across the entire cluster. While previously we had resorted to using databases to keep all data synced up, it placed too much of a burden on our customers. Today, that burden is lifted and Repustate Sync just works.

Distributed deployments with Repustate are now easier than ever with Repustate Sync. All your customized rules and filters are automatically available on all your servers.

To get the latest version of Repustate, which contains Repustate Sync, head on over to your account and download the newest Repustate version.

What’s great about Repustate Sync is it works even if one of your peer nodes goes down and then comes back up online. Each Repustate Sync instance stores a queue of transactions that were not synced properly, so whenever peers are brought back online, everyone is brought up to date immediately.

The only configuration that is needed is the IP addresses and ports that each peer in your Repustate cluster is running on.

For more information about the Repustate Server or Repustate Sync, head on over to the Repustate Server documentation area.

Russian sentiment analysis: our newest language

Russian sentiment analysis is finally here – by popular demand

We’re often asked to implement text analytics in particular languages by customers, but no language has received as many requests as Russian. Some requests date back to 3 years ago!

We’re happy to announce that Russian sentiment analysis is now open to all to use. Repustate has been testing Russian text analytics in a private beta with select customers and the results have been great. Now all of our customers can analyze Russian using our API.

Russian semantic analysis, including named entity extraction and theme classification, will soon be available as well, completing the loop on full-blown text analytics in Russian.

Try out Repustate’s Russian sentiment analysis now on the Repustate API demo page.

быть здоровым!

SurveyMonkey analytics supercharged with Semantic & Sentiment analysis

SurveyMonkey API + Repustate text analytics = insightful open ended responses

We recently worked with a new entrant into the healthy snack foods business who wanted to understand the market they were getting into. Specifically, they wanted to know the following:

  1. Which foods do people currently eat as their “healthy snack”
  2. Which brands do consumers think of when they hear the word “snack”
  3. Was there anything about the current selection of snack foods that consumers didn’t like?
  4. If you were having friends or family over for a casual get together (summer BBQ, catching up etc.) what kinds of snacks would you serve?

Armed with these goals in mind, a survey was created using SurveyMonkey and distributed to the new entrant’s target market via their newsletter (Protip: before you launch a product, build up a mailing list. It works to 1) validate your idea and 2) tighten the feedback loop on product decisions). A telemarketing service was also employed to call people and ask them the same questions. These responses were transcribed and sent to Repustate so the same analysis could be performed.

OK so that’s the easy part; thousands of responses were collected. But the responses were what is referred to in the market research industry as “open ended” meaning they were just free form text as opposed to a multiple choice list. The reason being was this brand didn’t want to introduce any bias into the survey by prompting the respondents with possible answers. For example, take question #2 from above. If presented with a list of snacks, the respondent might in their head say “Oh yeah, I forgot about brand X” and check them off as being a snack they think of, when in reality, that brand’s product had slipped off their radar. Open ended responses test how deeply ingrained a particular idea, concept or in this case, brand, is within a person’s consciousness.

But having open ended responses poses a challenge – how do you data mine them en masse and aggregate the results to come up with something meaningful? If you have a few hundred responses to read, maybe you hire a few interns. But what about when you have ten’s of thousands? That’s where Repustate comes in.

Use the APIs, Luke

Fortunately, SurveyMonkey has a pretty simple to use API. Combined with Repustate’s even easier to use API, you can go from open ended response to data mined text in seconds. Here’s a code snippet that provides a good blueprint for how one can marry these two APIs together. While some details have been omitted, it should be relatively straightforward as to how you can adapt it to suit your needs:

So with very few lines of Python code, we’ve grabbed the open ended responses, processed them through the named entities API call, and can store the results in our backend of choice. Let’s take a look at a sample response and see  how Repustate categorized it.

Q: If you were having friends or family over for a casual get together (summer BBQ, catching up etc.) what kinds of snacks would you serve?

A: I usually put veggies out, like carrots, celery, cucumbers etc. etc. and maybe some dip like hummus and crackers.

Running that response through the Repustate API yields this information:

{
  "themes": [
    "food"
  ],
  "entities": {
    "celery": "food.vegetable",
    "crackers": "food.other",
    "carrots": "food.vegetable",
    "cucumbers": "food.fruit",
    "hummus": "food.other"
  },
  "status": "OK",
  "expansions": {}
}

Armed with this analysis, we then aggregated the results to see which categories of food, and which brands were being mentioned the most frequently. This helped our client understand who they were competing against.

Conclusions

As it turns out, it was plain old vegetables that were the biggest competition to this new entrant, which is a double edged sword. On the one hand, it means they don’ have to spent the marketing dollars to compete with an entrenched incumbent who dominates most of the shelf space in supermarkets. On the other hand, it’s a troubling place to be in because vegetables are well known, cheap, and are viewed as healthy (obviously).

We’re fortunate to be living in a time when so much data is at our disposal, ready to be sliced & diced. We’re also cursed because there’s so much of it! We need the right tools and a clear mind to handle these sorts of problems, but it’s possible.

If you think your company could benefit from this sort of semantic analysis, we’d love to help so contact us.

Moving from freemium to premium

The freemium business model has suited Repustate well to a point, but now it’s time to transition to a fully paid service.

When Repustate launched all that time ago, it was a completely free service. We didn’t take your money even if you offered. The reason was we wanted more customer data to improve our various language models and felt giving software away for free in exchange for the data was a good bargain for both sides.

As our products matured and as our user base grew, it was time to flip the monetary switch and start charging – but still offering a free “Starter” plan as well as a free demo to try out our sentiment & semantic analysis engine.

As we’ve grown and as our SEO has improved, we’ve received a lot more interest from “tire kickers”. People who just want to play around, not really interested in buying. And that was fine by us because again, we got their data so we could see how to improve our engines. But most recently, the abuse of our Starter plan has got to the point where this is no longer worth our while. People are creating those 10 minute throwaway accounts to sign up, activate their accounts, and then use our API.

While one could argue that if people aren’t willing to pay, maybe the product isn’t that good – the extremes people are going to in order to use Repustate’s API for free tells us that we do have a good product and charging everyone is perfectly reasonable.

As a result, we will be removing the Starter plan. From now on, all accounts must be created with a valid credit card. We’ll probably offer a money-back trial period, say 14 days, but other than that, customers must be committing to payment on Day 0. We will also bump up the quotas for all of our plans to make the value proposition all the better.

Any plans currently on Starter accounts will be allowed to remain so. If you have any questions about this change and how it affects you, please contact us anytime.

Introducing Semantic Analysis

Repustate is announcing today the release of its new product: semantic analysis. Combined with sentiment analysis, Repustate provides any organization, from startup to Fortune 50 enterprise, all the necessary tools they need to conduct in-depth text analytics. For the impatient, head on over to the semantic analysis docs page to get started.

Text analytics: the story so far

Until this point, Repustate has been concerned with analyzing text structurally. Part of speech tagging, grammatical analysis, even sentiment analysis is really all about the structure of the text. The order in which words come, the use of conjunctions, adjectives or adverbs to denote any sentiment. All of this is a great first step in understanding the content around you – but it’s just that, a first step.

Today we’re proud and excited to announce Semantic Analysis by Repustate. We consider this release to be the biggest product release in Repustate’s history and the one that we’re most proud of (although Arabic sentiment analysis was a doozy as well!)

Semantic analysis explained

Repustate can determine the subject matter of any piece of text. We know that a tweet saying “I love shooting hoops with my friends” has to do with sports, namely, basketball. Using Repustate’s semantic analysis API you can now determine the theme or subject matter of any tweet, comment or blog post.

But beyond just identifying the subject matter of a piece of text, Repustate can dig deeper and understand each and every key entity in the text and disambiguate based on context.

Named entity recognition

Repustate’s semantic analysis tool extracts each and every one of these entities and tells you the context. Repustate knows that the term “Obama” refers to “Barack Obama”, the President of the United States. Repustate knows that in the sentence “I can’t wait to see the new Scorsese film”, Scorsese refers to “Martin Scorsese” the director. With very little context (and sometimes no context at all), Repustate knows exactly what an arbitrary piece of text is talking about. Take the following examples:

  1. Obama.
  2. Obama is the President.
  3. Obama is the First Lady.

Here we have three instances where the term “Obama” is being used in different contexts. In the first example, there is no context, just the name ‘Obama’. Repustate will use its internal probability model to determine the most likely usage for this term is in the name ‘Barack Obama’, hence an API call will return ‘Barack Obama’. Similarly, in the second example, the word “President” acts as a hint to the single term ‘Obama’ and again, the API call will return ‘Barack Obama’. But what about the third example?

Here, Repustate is smart enough to see the phrase “First Lady”. This tells Repustate to select ‘Michelle Obama’ instead of Barack. Pretty neat, huh?

Semantic analysis in your language

Like every other feature Repustate offers, no language takes a back seat and that’s why semantic analysis is available in every language Repustate supports. Currently only English is publicly available but we’re rolling out every other language in the coming weeks.

Semantic analysis by the numbers

Repustate currently has over 5.5 million entities, including people, places, brands, companies and ideas in its ontology. There are over 500 categorizations of entities, and over 30 themes with which to classify a piece of text’s subject matter. And an infinite number of ways to use Repustate to transform your text analysis.

Head on over to the Semantic Analysis Tour to see more.

 

Social sentiment analysis is missing something

Sentiment analysis by itself doesn’t suffice

An interesting article by Seth Grimes caught our eye this week. Seth is one of the few voices of reason in the world of text analytics that I feel “gets it”. His views on sentiment’s strengths and weaknesses, advantages and shortcomings align quite perfectly with Repustate’s general philosophy.

In the article, Seth states that simply getting relying on a number denoting sentiment or a label like “positive” or “negative” is too coarse a measurement and doesn’t carry any meaning with it. By doing so, you risk overlooking deeper insights that are hidden beneath the high level sentiment score. Couldn’t agree more with this and that’s why Repustate supports categorizations.

Sentiment by itself is meaningless; sentiment analysis scoped to a particular business need or product feature etc. is where true value lies. Categorizing your social data by features of your service (e.g. price, selection, quality) first and THEN applying sentiment analysis is the way to go. In the article, Seth proceeds to list a few “emotional” ones (promoter/detractor, angry, happy etc). that quite frankly I would ignore. These categories are too touchy-feely, hard to really disambiguate at a machine learning level and don’t tie closely to actual business processes/features. For instance, if someone is a detractor, what is that is causing them to be a detractor? Was it the service they received? If so, then customer service is the category you want and negative polarity of the text in question gives you invaluable insights. The fact that someone is being negative about your business means almost by definition they are detractors.

Repustate provides our customers with the ability to create their own categories according to the definitions that they create. Each customer is different, each business is different, hence the need for customized categories. Once you have your categories, sentiment analysis becomes much more insightful and valuable to your business.

Twitter + GeoJSON + d3.js + Rob Ford = Fun with data visualizations

Using Twitter, GeoJSON (via GDAL) and d3 to visualize data about Rob Ford

(Note: Gists turned into links so as to avoid too many roundtrips to github.com which is slow sometimes)

As a citizen of Toronto, the past few weeks (and months) have been interesting to say the least. Our mayor, Rob Ford, has made headlines for his various lewd remarks, football follies, and drunken stupors. With an election coming up in 2014, I was curious as to how the rest of Toronto felt about our mayor.

The Task

Collect data from Twitter, making sure we only use tweets from people in Toronto. Plot those tweets on an electoral map of Toronto’s wards, colour coding the wards based on how they voted in the 2010 Mayoral Election.

The Tools

  1. Repustate’s API (for data mining Twitter and extracting sentiment, of course!)
  2. Shapefiles for the City of Toronto including boundaries for the wards
  3. GDAL
  4. d3.js
  5. jQuery

I’m impatient, just show me the final product …

Alright, here’s the final visualization of Toronto’s opinion of Rob Ford. Take a look and then come back here and find out more about how we accomplished this.

The Process

For those still with me, let’s go through the step-by-step of how this visualization was accomplished.

Step 1: Get the raw data from Twitter

This is a given – if we want to visualize data, first we need to get some! To do so, we used Repustate to create our rules and monitor Twitter for new data. You can monitor social media data using Repustate’s web interface, but this is a project for hackers, so let’s use the Social API to do this. Here’s the code (you first have to get an API key):

https://gist.github.com/mostrovsky/7762449

Alright, so while Repustate gets our data, we can get the geo visualization aspect of this started. We’ll check back in on the data – for now, let’s delve into the world of geographic information systems and analytics.

Step 2: Get the shapefile

What’s a shapefile? Good question! Paraphrasing Wikipedia, shapefiles are a file format that contain vector information describing the layout and/or location of geographic features, like cities, states, provinces, rivers, lakes, mountains, or even neighbourhoods. Thanks to a recent movement to open up data to the public, it’s not *too* hard to get your hands on shape files anymore. To get the shapefiles for the City of Toronto, we went to the City’s OpenData website, which has quite a bit of cool data there to play with. The shapefile we need, the one that shows Toronto divided up into it’s electoral wards, is right here. Download it!

Step 3: Convert the shapefile to GeoJSON

So we have our shapefile, but it’s in a vector format that’s not exactly ready to be imported into a web page. The next step is to convert this file into a web-friendly format. To do this, we need to convert the data in the shapefile into a JSON-inspired format, called GeoJSON. GeoJSON looks just like normal JSON (it is normal JSON), but it has a spec that defines what a valid GeoJSON object must contain. Luckily, there’s some open source software that will do this for us: the Geospatial Data Abstraction Library or GDAL. GDAL has *tons* of cool stuff in it, take a look when you have time, but for now, we’re just going to use the ogr2ogr command that takes a shapefile in and spits out GeoJSON. Armed with GDAL and our shapefile, let’s convert the data into GeoJSON:

ogr2ogr 
  -f GeoJSON 
  -t_srs EPSG:4326
  toronto.json 
  tcl3_icitw.shp

This new file, toronto.json, is a GeoJSON file that contains the co-ordinates for drawing the polygons representing the 44 wards of Toronto. Let’s take a look at that one liner just so you know what’s going on here:

  1. ogr2ogr is the name of the utility we’re using; it comes with GDAL
  2. Specifying the format of the final output (-f option)
  3. This step is *very* key, we’re making sure the output is projected/translated into the familiar lat/long co-ordinate system. If you omit this (in fact, try omitting this flag), you’ll see the output will require some work from us on the front end, which is unnecessary.
  4. The name of the output file
  5. The input shapefile

Alright, GeoJSON file done.

Step 4: Download the data

Assuming enough time has passed, Repustate will have found some data for us and we can retrieve it via the Social API:

https://gist.github.com/mostrovsky/7762900

The data comes down as a CSV. It’ll remain an exercise for the reader to store the data in a format that you can query and retrieve thereafter, but let’s just assume we’ve stored our data. Included in this data dump will be the latitude and longitude of each tweet(er). How do we know if this tweet(er) is in Toronto? There’s a few ways actually.

We could use PostGIS, an unbelievably amazing extension for PostgreSQL that allows you to do all sorts of interesting queries with geographic data. If we went this route, we could load the original shapefile into a PostGIS enabled database, then for each tweet, use the ST_Contains method to check which tweet(er)s are in Toronto. Another way to do this is to iterate over your dataset and for each one, do a quick point in polygon calculation in the language of your choice. The polygon data can be retrieved from the GeoJSON file we created. Either way you go, it’s not too hard to determine which data originated from Toronto, given specified lat/long co-ordinates.

Step 5: Visualize the data

Let’s see how this data looks. To plot the GeoJSON data and each of the data points from our Twitter data source, we use d3.js For those not familiar with d3.js (Data Driven Documents) – it’s a Javascript library that helps you bind data to graphical representations in pure HTML/SVG. It’s a great library to use if you want to visualize any data really, but it makes creating unique and customized maps pretty easy. And best of all, since the output is SVG, you’re pretty much guaranteed that your visuals will look identical on all browsers (well, IE 8 and down don’t support SVG) and mobile devices. Let’s dive into d3.js. We need to plot 2 things:

  1. The map of Toronto, with dividers for each of the wards
  2. The individual tweets, colour coded for sentiment and placed in the correct lat/long location.

The data for the map comes from the GeoJSON file we created earlier. We’re going to tell d3 to load it and for each ward, we’re going to draw a polygon (the “path” node in SVG) by following the path co-ordinates in the GeoJSON file. Here’s the code:

https://gist.github.com/mostrovsky/7776778

Path elements have an attribute, “d”, which describes how the path should be drawn. You can read more about that here. So for each ward, we create a path node, set the “d” attribute to contain the lat/long co-ordinates that when connected, form the boundaries of the ward. You’ll see we also have custom fill colouring based on the 2010 Mayoral election – that’s just an extra touch we added. You can play around with the thickness of the bordering lines. We’re also adding a mouseover event to show a little tooltip about the ward in question when the user hovers over it. Time to add the data points. We’re going to separate them by date, so we can add a slider later to show how the data changes over time. We’re also including sentiment information that Repustate automatically calculates for each piece of social data. Here’s an example of how one complete data point would look:

{
 "lat": 43.788458853157117, 
 "mm": 10,
 "lng": -79.282781549043008,
 "dd": 28,
 "sentiment": "pos"
}

Now to plot these on our map, we again load a JSON file and for each point, we bind a “circle” object, positioned at the correct lat/long co-ordinates. Here’s the code:

https://gist.github.com/mostrovsky/7776799

At this point, if you tried to render the data on screen, you wouldn’t see a thing, just a blank white SVG. That’s because you need to deal with the concept of map projections. If you think about the map of the world, there’s no “true” way to render it on a flat, 2D surface, like your laptop screen or iPad. This has given rise to the concept of projections, which is the process of placing curved data onto a flat plane. There are many ways to do this and you can see a list here of available projections; for our example, we used the Albers projection. Last thing we have to do is scale and transform our SVG and we’re done. Here’s the code for projecting, scaling and transforming our map:

https://gist.github.com/mostrovsky/7776853

Those values are not random – they came about from some trial & error. I’m not sure how to do this properly or in algorithmic way.

Please take a look at the finished product and of course, if you or your company is looking into doing this sort of thing with your own data set, contact us!

Building a better TripAdvisor

(This is a guest post by Sarah Harmon. Sarah is currently a PhD student at UC Santa Cruz studying artificial intelligence and natural language processing. Her website is NeuroGirl.com)

Click here to see a demo of Sarah’s work  in action.

Using Repustate’s API to to get the most out of TripAdvisor customer reviews

I’m a big traveler, so I often check online ratings, such as those on TripAdvisor, to decide which local hotel or restaurant is worth my time.  In this post, I use Repustate’s text categorization API to analyze the sentiment of hotel reviews, and – in so doing – work towards a better online hotel rating system.

Five-star scales aren’t good enough

The five-star rating scale is a standard ranking system for many products and services.  It’s a fast way for consumers to check out the quality of something before paying the price, but it’s not always accurate.  Five-star ratings are inherently subjective, don’t let raters say all they want to say, and force a generic labelling for a complex experience.

Asking people to write text reviews solves a few of these problems.  Instead of relying on the five-star scale, we ask people to highlight the most memorable parts of their experience.  Still, who has the time to read hundreds of reviews to get a true sense of what a hotel is like?  Even the small sample of reviews shown on the first page could be unhelpful, unreliable or cleverly submitted by the hotel itself to make themselves look good. What we need is a way to summarize the review content – and ideally, we’d like a summary that’s specific to our own values.

Making a personalized hotel ranking system

Let’s take a look at a website that uses star ratings all the time: TripAdvisor, a travel resource that features an incredible wealth of user-reviewed hotels.

trip advisor front page

TripAdvisor uses AJAX-based pagination, which means that if we’d like to access that wealth of data, normal methods of screen-scraping won’t work.  I decided to use CasperJS, a fork of PhantomJS, to easily scrape the TripAdvisor website with JavaScript.

Here’s a simplified example of how you can use CasperJS to view the star ratings for every listed hotel from New York:

In a similar fashion, I used Python and CasperJS to retrieve a sample of 100 hotels each from five major locations – Italy, Spain, Thailand, the US, and the UK – and retrieved the top 400 English reviews from each hotel as listed on TripAdvisor.  To ensure that each stored review was in English, I relied on Python’s Natural Language Toolkit. Finally, I called on the Repustate API to analyze each review using six hotel-specific categories: food, price, location, accommodations, amenities, and staff.

sample of hotel rater

Check out the results in this “HotelRater” demo.  Select a location and a niche you care about, and you’ll see hotels organized in order of highest sentiment for that category.  To generate those results, the sentiment scores for each hotel across each category were averaged, and then placed on a scale from 1 to 100.  (I chose to take a mean of the sentiment values because it’s a value that’s easy to calculate and understand.)  The TripAdvisor five-star rating is shown for comparison.  You can also click on the hotels listed to see how Repustate categorized each of their reviews.

When I started putting the app I built into practice, I could suddenly make sense out of TripAdvisor’s abundance of data.  While a hotel might have a four star review on average, the customers were generally very happy with key aspects, such as its food, staff, and location.  Desirable hotels popped up in my search results that I might never have even seen or considered because of their lower average star rating on TripAdvisor.  The listed sentiment scores also helped to differentiate hotels, which I would have previously had trouble sorting through because they all shared the same 4.5 star rating.

This demo isn’t a replacement for TripAdvisor by any means, since there’s hardly enough stored data or options included to assist you in your quest for the perfect hotel on your vacation.  That said, it’s a positive step towards a new ranking system that’s aligned with our individual values.  We can quickly see how people are really feeling, without condensing their more specific thoughts into a blanket statement.

I’d give that five stars any day.

Sentiment Analysis Accuracy – Goal or Distraction?

Everything you need to know about sentiment analysis accuracy

When potential customers approach us, one of the first questions we’re asked is “How accurate is your sentiment analysis engine?”. Well as any good MBA graduate would tell you, the correct answer is “It depends.” That might sound like a cop-out and a way to avoid answering the question, but in fact, it’s the most accurate response one can give.

Let’s see why. Take this sentence:

“It is sunny today.”

Positive or negative? Most would perhaps say positive – Repustate would score this as being neutral as no opinion or intent is being stated, merely a fact. For those who would argue that this sentence is positive, let’s tweak this sentence:

“It is sunny today, my crops are getting dried out and will cause me to go bankrupt.”

Well, that changes things doesn’t it? Put into a greater context and the polarity (positive or negative sentiment) of a phrase can change dramatically. That’s why it is difficult to achieve 100% accuracy with sentiment.

Let’s take a look at another example. From a review of a horror movie:

“It will make you quake in fear.”

Positive or negative? Well, for a horror movie, this is a positive because horror movies are supposed to be scary. But if we were talking about watching a graphic video about torture, then the sentiment is negative. Again, context matters here more than the actual words themselves.

What about when we introduce conjunctions into the mix:

I thought the movie was great, but the popcorn was stale and too salty.

The first part of the sentence is positive, but the second part is negative. So what is the sentiment of this sentence? At Repustate, we would say “It depends.” The sentiment for the movie is positive; the sentiment surrounding the popcorn is negative. The question is: which topic are you interested in analyzing?

There is no one TRUE sentiment engine

As the previous examples have demonstrated, sentiment analysis is tricky and highly contextual. As a result, any benchmarks that companies post have to be taken with a pinch of salt. Repustate’s approach is the following:

  1. Make our global model as flexible as possible to catch as many cases as possible without being too inconsistent
  2. Allow customers the ability to load their own custom, domain-specific, definitions via our API (e.g. Make the phrase “quake in fear” positive for movie reviews)
  3. Allow sentiment to be scored in the context of only one topic, again, via our API

When shopping for sentiment analysis solutions, ask potential vendors about these points. Make sure whichever solution you end up going with can be tailored to your specific domain set because chances are your data has unique attributes and characteristics that might be overlooked or incorrectly accounted for a by a one-size-fits-all sentiment engine.

Or you can save yourself a lot of time and use Repustate’s flexible sentiment analysis engine.

Chinese sentiment analysis now available

Chinese sentiment analysis is now part of the Repustate API

We are very proud to announce our new Chinese sentiment analysis engine. Based on the same engine that we used to create our world-leading Arabic sentiment engine, the Chinese sentiment analysis engine is blazingly fast and accurate.

For the impatient ones, if you want to try it out, you can use our online demo here. If you’re interested in how our Chinese sentiment analysis works, read on!

Conditional Random Fields

Unlike English or Latin-based languages, Chinese (simplified) doesn’t necessarily disambiguate words using whitespace. For example the following string of symbols is a completely normal sentence in Chinese:

团购分量比较一般,不过肉多,而且是和两个女生,所以基本都能吃饱。 猪手香肠无得讲,的确系一般餐厅做唔出的味道,其他就比较一般啦。 后来和朋友们正价去吃> 了一次,感觉分量比团购多,希望商家以后能一视同仁啦。

(For those who don’t read Chinese, this is a review of a restaurant). Now you’ll see a few white spaces here & there but there’s actually many more words being expressed than there are separated tokens. So how do we know where one word (or idea) begins and the next ends?

We use a technique called conditional random fields which uses probabilistic models to infer what the meaning of a particular glyph (character) is given the glyphs around it. With a large enough pre-tagged corpus of Chinese text, Repustate can achieve almost 100% perfection in identifying the individual words or ideas being expressed in a long chain of Chinese glyphs.

Part of speech tagging & sentiment

Now that we know which words are being used, we can apply part of speech tagging (nouns, verbs, adjectives etc.) to help construct a grammatical overview of a piece of text. This then allows us to perform sentiment analysis using our proprietary engine. It’s the same engine that powers our Arabic sentiment analysis. Sentiment analysis uses a combination of probabilistic models, a dictionary of terms or phrases which connote sentiment as well as hand-tuned heuristics that are language specific. All of this is done in a split second so you can still analyze hundreds of Chinese documents in one HTTP request using the Repustate API.

Try it out!

Create your free account and try out Repustate’s new Chinese sentiment engine today.