Introducing Repustate Sync for distributed deployment

Keeping your Repustate data in sync used to be a pain point – not anymore.

Since we launched the Repustate Server nearly three years ago, the biggest complaint has always been keeping the Servers in sync across an entire cluster. Previously we resorted to using databases to keep all the data synced up, but that placed too much of a burden on our customers. Today, that burden is lifted and Repustate Sync just works.

Distributed deployments with Repustate are now easier than ever with Repustate Sync. All your customized rules and filters are automatically available on all your servers.

To get the latest version of Repustate, which contains Repustate Sync, head on over to your account and download the newest Repustate version.

What’s great about Repustate Sync is that it keeps working even if one of your peer nodes goes down and later comes back online. Each Repustate Sync instance stores a queue of transactions that were not synced successfully, so whenever a peer is brought back online, it is brought up to date immediately.

The only configuration needed is the IP address and port that each peer in your Repustate cluster is listening on.
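
To make that concrete, here is a minimal sketch of the retry-queue idea in Go. It is purely illustrative and not Repustate Sync’s actual implementation; the peer addresses, the Transaction type and the /sync endpoint are all made up for the example.

    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "time"
    )

    // Transaction is a hypothetical record of a rule or filter change
    // that needs to be replicated to every peer in the cluster.
    type Transaction struct {
        ID      int
        Payload []byte
    }

    // Peer is another node in the cluster, identified by the IP:port
    // pair from the configuration, plus a queue of unsynced transactions.
    type Peer struct {
        Addr    string
        pending []Transaction
    }

    // Sync tries to flush the pending queue. Anything that fails to send
    // stays queued, so a peer that was down is caught up as soon as it
    // comes back online.
    func (p *Peer) Sync() {
        remaining := p.pending[:0]
        for _, tx := range p.pending {
            url := fmt.Sprintf("http://%s/sync", p.Addr)
            resp, err := http.Post(url, "application/octet-stream", bytes.NewReader(tx.Payload))
            if err != nil {
                remaining = append(remaining, tx) // peer unreachable; keep for next pass
                continue
            }
            resp.Body.Close()
        }
        p.pending = remaining
    }

    func main() {
        // In the real system, rule and filter changes would be appended
        // to each peer's pending queue as they happen.
        peers := []*Peer{{Addr: "10.0.0.2:9000"}, {Addr: "10.0.0.3:9000"}}
        for {
            for _, p := range peers {
                p.Sync()
            }
            time.Sleep(5 * time.Second) // periodic retry keeps late peers up to date
        }
    }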

For more information about the Repustate Server or Repustate Sync, head on over to the Repustate Server documentation area.

From Python to Go: migrating our entire API

The tl;dr Summary

If you want to know the whole story, read on. But for the impatient out there, here’s the executive summary:

  • We migrated our entire API stack from Python (first Django, then Falcon) to Go, reducing the mean response time of an API call from 100ms to 10ms
  • We reduced the number of EC2 instances required by 85%
  • Because Go compiles to a single static binary and because Go 1.5 makes cross-compilation a breeze, we can now ship a self-hosted version of Repustate that is identical to the one we host for customers. (Previously we shipped virtual machine appliances to customers, which was a support nightmare.)
  • Due to the similarity between our Python test suite and Go’s test layout, we were able to quickly re-purpose our unit tests written in nose to fit the structure that Go requires with just a few simple sed scripts.

Background

Repustate provides text analytics services to small businesses, large enterprises and government organizations the world over. As the company has grown, so too has the strain on our servers. We process anywhere from 500 million to 1 billion pieces of text EACH day. Text comes in the form of tweets, news articles, blog comments, customer feedback forms and anything else our customers send our way. This text can be in any of the 9 languages we support, so there’s that to consider as well, since some languages tend to be more verbose than others (ahem, Arabic).

Text analytics is tough to do at scale since you can’t really leverage caching as much as you could in, say, serving static content on the web. Seldom do we analyze the exact same piece of text twice so we don’t bother maintaining any caches – which means each and every request we get is purely dynamic.

But the key insight when analyzing text is that much of the work can be done in parallel. Consider the task of running text through a part-of-speech tagger. For the most part, part-of-speech tagging algorithms use some sort of probabilistic modelling to determine the most likely tag for a word. But these probability models don’t cross sentence boundaries; the grammatical structure of one sentence doesn’t affect another. This means that given a large block of text, we can split it up into sentences and then analyze each sentence in parallel. The same strategy can be employed for sentiment as well.

So what’s wrong with Python?

Our first version of the API was in Django because, well, everyone knew Django and our site runs on Django, so why not. And it worked. We got a prototype up and running and then built on top of that. We were able to get a profitable business running just on Django (and an old version at that: we were still using 1.3 when even 1.6 was out!).

But there’s a lot of overhead in each Django request/response cycle. As our API grew in usage, so too did reliability issues and our Amazon bill. We decided to look at other Python alternatives, and Flask came up. It’s lightweight and almost ready-made for APIs, but then we came across Falcon. We liked Falcon because it was optimized right off the bat using Cython. Simple benchmarks showed that it was *much* faster than Django, and we liked how it enforced clean REST principles. As a bonus, our existing tests could be ported over quite easily, so we didn’t lose any time there.

Falcon proved to be a great stopgap. Our mean response time fell, and the number of outages and support issues fell, too. I’d recommend Falcon to anyone building an API in Python today.

The performance, while better than Django’s, still couldn’t keep up with our demand. In particular, Python is a world of pain for doing concurrency. We were on Python 2.7, so we didn’t check out the new asyncio package in Python 3, but even then, you still have the GIL to worry about. Falcon also didn’t solve one other major pain point: self-hosted deployment.

Python does not lend itself to being packaged up neatly and distributed the way Java or C does. Many of our customers run Repustate within their own networks for privacy & security reasons. Up to this point, we had been deploying our entire stack as a virtual appliance that works with either VMware or VirtualBox. It was an OK solution, but it was clunky. Updates were a pain, support was a pain (“how do I know the IP address of my virtual machine?”) and so on. If we could provide Repustate as a single, installable binary built from the exact same code base as our public API, we’d have the best of both worlds. This ideal solution also had to be even faster than our Python version on Falcon, which meant leveraging the fact that text analytics lends itself to concurrent processing.

Go get gopher

Taking a step back in our story: our Arabic engine was written in this fancy new (at the time) language called Go. We wrote a blog post about our experience migrating that code base to Go, but suffice to say, we were quite happy with it. The ideal solution was staring us right in the face: we had to port everything to Go.

Go met all of our criteria:

  • faster than Python
  • compiles to a single binary
  • could be deployed on any operating system (and since Go 1.5, cross-compilation is a breeze; see the example just after this list)
  • makes concurrency trivial to reason about
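
For example, building a Linux binary from a Mac development machine is a one-liner (the output name repustate-api is just a placeholder here):

    GOOS=linux GOARCH=amd64 go build -o repustate-api .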

As an added bonus, the layout of a Go test suite looks pretty similar to our nose tests. Test function headers were simple enough to migrate over. For example, this:

def test_my_function():

becomes this:

func TestMyFunction(t *testing.T) {

With a couple of replacements of “=” to “:=” and single quotes to double quotes, we had Go-ready tests.
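
Put together, a converted test ends up looking something like this (ScoreSentence and the assertion are hypothetical, just to show the shape):

    func TestScoreSentence(t *testing.T) {
        // was: score = score_sentence('I love this')
        score := ScoreSentence("I love this")
        if score <= 0 {
            t.Errorf("expected a positive score, got %f", score)
        }
    }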

Because goroutines and channels are so easy to work with, we were able to finally realize our dream of analyzing text in parallel. On a beefy machine with, say, 16 cores, we can just blast our way through text by chunking large pieces of text into smaller ones and then reconstituting the results on the other end, e.g.:

    // Split the text into chunks and score each one in its own goroutine.
    chunks := s.Chunks(tws)
    channel := make(chan *ChunkScoreResult, len(chunks))
    for _, chunk := range chunks {
        go s.ScoreChunk(chunk, custom, channel)
    }

    // Now loop until all goroutines have finished.
    chunkScoreResults := make([]*ChunkScoreResult, len(chunks))
    var r *ChunkScoreResult
    for i := 0; i < len(chunks); i++ {
        r = <-channel
        chunkScoreResults[i] = r
    }

This code snippet shows us taking a slice of chunks of text, “scoring” them using goroutines, and then collecting the results by reading from the channel one by one. Each ChunkScoreResult contains an “order” attribute which allows us to re-order things once we’re done. Pretty simple.
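
The re-ordering step itself is just a sort on that attribute, along these lines (the Order field name is our guess from the description above; sort is the standard library package):

    // byOrder sorts chunk results back into their original position in the text.
    type byOrder []*ChunkScoreResult

    func (b byOrder) Len() int           { return len(b) }
    func (b byOrder) Swap(i, j int)      { b[i], b[j] = b[j], b[i] }
    func (b byOrder) Less(i, j int) bool { return b[i].Order < b[j].Order }

    // Results arrive in whatever order the goroutines finish, so sort
    // them back into document order before stitching the text together.
    sort.Sort(byOrder(chunkScoreResults))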

The entire port took about 3 months and resulted in several improvements unrelated to performance, since the team had to go through the Python code again anyway. As an aside, it’s always a good idea, time permitting, to go back and look at some of your old code. You’d be surprised at how bad it can be. The old “what the heck was I thinking when I wrote this” sentiment was felt by all.

We now have one code base for all of our customers that compiles to a single binary. No more virtual appliances. Our deployment process is just a matter of downloading the latest version of our binary.

Concluding remarks

The one thing writing code in a language like Go does is make you very aware of how memory works. Writing software in languages like Python or Ruby often seduces you into ignoring what’s going on under the hood because it’s just so easy to do pretty complex things, but languages like Go and C don’t hide that. If you’re not used to that way of thinking, it takes some adjustment (how will the memory be allocated? Am I creating too much garbage? When does the garbage collector kick in?), but it makes your software run that much more smoothly and, to be honest, makes you a better Python programmer, too.

Go isn’t perfect and there’s no shortage of blogs out there that can point out what’s wrong with the language. But if you write Go as it is intended to be written, and leverage its strengths, the results are fantastic.

Go – duplicate symbols for architecture x86_64

This is a short blog piece, really intended for fellow Go developers who stumble upon the same dreaded “duplicate symbols” error.

Currently, some of Repustate’s Go code uses cgo to talk to various C libraries. It’s a stopgap until we finish porting all of the C code to pure Go. While writing some tests, we hit this error:

“ld: 1 duplicate symbol for architecture x86_64”

(note: if you had more than 1 duplicate, it would tell you exactly how many)

What does this mean? Well, it means we’re trying to link the same symbol name (in our case, a function) from two (or more) different source files. The fix was easy: rename one of the functions by updating the header file and the source file (the .c or .cpp file), and lastly update any references to the symbol in your Go code, if it is referenced there directly.
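
As a contrived illustration (none of these names are from our actual code), suppose both tokenizer.c and stemmer.c defined a function called normalize. After renaming the copy in stemmer.c to stem_normalize in both the header and the source, the Go side that calls it changes too:

    /*
    #include <stdlib.h>
    #include "stemmer.h" // hypothetical header; now declares char *stem_normalize(char *);
    */
    import "C"

    import "unsafe"

    // stemWord calls the renamed C function. Before the rename this was
    // C.normalize, which collided with normalize() defined in tokenizer.c.
    func stemWord(w string) string {
        cs := C.CString(w)
        defer C.free(unsafe.Pointer(cs))
        return C.GoString(C.stem_normalize(cs))
    }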

Smooth sailing from here on in.

Named entity recognition in Russian

Named entity recognition is now available in Russian via the Repustate API. Combined with Russian sentiment analysis, customers can now do full text analytics in Russian with Repustate.

Repustate is happy to add Russian named entity recognition to the solutions already available in English & Arabic. But like every language, Russian has its nuances, and they made named entity recognition a bit tougher than, say, English.

Consider the following sentence:
Путин дал Обаме новое ядерное соглашение

In English, this says:
Putin gave Obama the new nuclear agreement

This is how Barack Obama is written in Russian:
Барак Обама

Notice that in our sentence, “Obama” is spelled with a different suffix at the end. That’s because in Russian, nouns (even proper nouns) are declined based on their grammatical case. In this example, Obama is in what’s called the “dative” case, meaning the noun is the recipient of something. In English, there is no concept of declining nouns this way; English only changes a noun’s suffix for pluralization.

So Repustate has to know how to stem proper nouns as well in order to properly identify “Обаме” as Barack Obama during the contextual expansion phase.
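
As a toy sketch of the idea in Go (nothing like our production models; the lookup table and the suffix list are deliberately tiny and hypothetical):

    import "strings"

    // canonical maps bare stems to a canonical entity name. In production
    // this lookup would be backed by real morphological models, not a map.
    var canonical = map[string]string{
        "Обам":  "Барак Обама",
        "Путин": "Владимир Путин",
    }

    // caseEndings is a (very incomplete) list of Russian case suffixes.
    var caseEndings = []string{"е", "а", "у", "ом", "ой"}

    // resolveEntity tries the word as-is, then strips likely case endings
    // and looks the remaining stem up.
    func resolveEntity(word string) (string, bool) {
        if name, ok := canonical[word]; ok {
            return name, true
        }
        for _, suffix := range caseEndings {
            if strings.HasSuffix(word, suffix) {
                if name, ok := canonical[strings.TrimSuffix(word, suffix)]; ok {
                    return name, true
                }
            }
        }
        return "", false
    }

    // resolveEntity("Обаме") returns "Барак Обама": stripping the dative
    // ending "е" leaves the stem "Обам".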

These are the sorts of problems we have solved so you don’t have to. Try out the Semantic API and let us know what you think.

Russian sentiment analysis: our newest language

Russian sentiment analysis is finally here – by popular demand

We’re often asked by customers to implement text analytics in particular languages, but no language has received as many requests as Russian. Some requests date back three years!

We’re happy to announce that Russian sentiment analysis is now open to all to use. Repustate has been testing Russian text analytics in a private beta with select customers and the results have been great. Now all of our customers can analyze Russian using our API.

Russian semantic analysis, including named entity extraction and theme classification, will soon be available as well, completing the loop on full-blown text analytics in Russian.

Try out Repustate’s Russian sentiment analysis now on the Repustate API demo page.

быть здоровым! (Be healthy!)

Moving from freemium to premium

The freemium business model has suited Repustate well up to a point, but now it’s time to transition to a fully paid service.

When Repustate launched all that time ago, it was a completely free service. We didn’t take your money even if you offered. The reason was that we wanted more customer data to improve our various language models, and we felt giving software away for free in exchange for that data was a good bargain for both sides.

As our products matured and our user base grew, it was time to flip the monetary switch and start charging – while still offering a free “Starter” plan as well as a free demo to try out our sentiment & semantic analysis engine.

As we’ve grown and as our SEO has improved, we’ve received a lot more interest from “tire kickers”: people who just want to play around and aren’t really interested in buying. And that was fine by us because, again, we got their data, so we could see how to improve our engines. But recently, abuse of our Starter plan has gotten to the point where it is no longer worth our while. People are creating 10-minute throwaway email accounts to sign up, activate their accounts, and then use our API.

While one could argue that if people aren’t willing to pay, maybe the product isn’t that good, the extremes people go to in order to use Repustate’s API for free tell us that we do have a good product and that charging everyone is perfectly reasonable.

As a result, we will be removing the Starter plan. From now on, all accounts must be created with a valid credit card. We’ll probably offer a money-back trial period, say 14 days, but other than that, customers must commit to payment from day one. We will also bump up the quotas on all of our plans to make the value proposition all the better.

Any accounts currently on the Starter plan will be allowed to stay on it. If you have any questions about this change and how it affects you, please contact us anytime.

What does Facebook’s IPO flop mean for other social media start-ups?

After many months of speculation, Facebook’s highly anticipated IPO didn’t go exactly as planned. The company could previously boast a valuation of over $100 billion when its shares were trading in the 30s for most of 2011 on SecondMarket, an American marketplace for trading assets that aren’t easily converted into cash (such as shares in a company before it goes public). Shares were priced at $38 for the company’s initial public offering, but slid down and kept going down. They hit a new low on Monday at $26.90, down 2.96% from the day before, valuing the company at more than $25 billion less than at its IPO.

Time will tell if the company’s value will bounce back, but one of the main questions for Silicon Valley is whether or not Facebook’s flop will make investors hesitant to invest in other start-ups at the super-high valuations easily given to social media companies in the past.

Several start-ups anticipated consequences ahead of Facebook’s IPO and raised venture capital ahead of time. Quora, a question and answer service founded by Facebook alumni, raised $50 million at a valuation of $400 million and Pinterest raised $100 million at a valuation of $1.5 billion.

But many others didn’t. One company aiming to raise venture capital at a valuation of $4 billion is Square. Run by Twitter co-founder Jack Dorsey, the San Francisco company is trying to re-invent payments through mobile devices. Valued at only $1 billion a year ago, when it received a $100 million investment, the company has seen its value grow at a pace similar to Facebook’s, and it is now seeking a $4 billion valuation in its next round of funding.

Another hot start-up, Asana, a workplace collaboration network run by Facebook co-founder Dustin Moskovitz, is looking for funding in the $20 million to $30 million range at a valuation of $250 million. Online retailers Just Fabulous and Ideeli were looking for $30 to $50 million at valuations of $300 to $500 million.

Venture capitalists will generally invest in companies they think will be worth three to five times more than their first investment by the time the company announces its IPO.

So that means Square, with its current $4 billion valuation, would have to be worth at least $12 billion a few years down the road, which could prove difficult since the business operates on slim margins against many strong competitors in the industry.

High valuations also leave Silicon Valley start-ups with another problem: a very limited pool of potential buyers. Only the biggest players in town, including Apple, Facebook, Google or Microsoft, could buy companies with 10-figure valuations.

But not everyone thinks the start-ups will take a big hit with their valuations. According to earlier media reports, Norwest investor Sergio Monsalve said, “When you look a year from now, two years from now, I’m not sure you’re going to say prices came down at high-quality companies.”

So it looks like investors and companies will have to wait and see if Facebook’s IPO flop was an isolated event or if it will create a ripple effect in the industry.

To learn more about the numbers, check out this interesting chart explaining Facebook’s IPO slide:

http://news.cnet.com/8301-1023_3-57443723-93/heres-the-chart-that-explains-facebooks-ipo-mess/

Should businesses backup their social media data?

Many of us have long heard the well-repeated mantra to “back up our files.” And still many of us have cursed ourselves for forgetting to back up files and invariably losing a document once in a while when our computer freezes. But what about backing up social media data? While it might seem like a tedious task for personal accounts, it makes sense for businesses that depend on their social media activity for part of their company’s growth.

Any number of things can happen. If a company has a highly engaged Facebook page with a large following, it could be bad for business if that page gets shut down for some reason, or if a bug deletes their followers on Twitter. A lot of valuable data could be lost, and it could take a large amount of time and resources to re-create an online community that took years to build. Backing up social media exchanges can also be key for businesses that find themselves needing documents for legal reasons. Or, on a simpler note, an Instagram picture uploaded a year ago or a tweet sent last month might need to be accessed quickly and easily.

There are a wide range of backup options available for businesses on social media sites. Some of these options include:

TweetBackup is a service powered by Backupify. It creates a daily backup and also asks that you follow @tweetbackup on Twitter. For a cost ($5 per month), Backupify will also back up data from five different social media accounts, including Facebook, Twitter, LinkedIn and Flickr.

BackUpMyTweets does exactly what the name suggests and is free if you tweet about the service. Businesses can also pay $12 a year to download their Twitter data for analysis. To back up Facebook data, users can download a copy by going to their profile and clicking on “Account Settings,” and then “Download a copy of your Facebook data,” which will reflect data at that particular point in time.

SocialSafe: There is both a free version and a Pro version, which costs around $7 for the year. The backup happens automatically and covers Facebook, Twitter, LinkedIn, Google+ and Instagram photos. Most importantly for small businesses, the service can back up a Facebook business page, including wall posts, notes, active fans, etc., with a search that works by date, person, photo or wall post. What’s interesting is that while most services will only back up activity from the time the user subscribes, SocialSafe will retrieve as much old data as the Facebook, Twitter and LinkedIn APIs (etc.) will allow.

Businesses with a Twitter following can export a list of their followers to a .csv (Excel) file, which can give them valuable information including a follower’s name, Twitter handle, when they joined Twitter, their location, the number of followers they have, how many people they follow, whether or not they follow the business back, and how many times they’ve tweeted since joining Twitter. Business owners can also easily identify their most important and influential followers and categorize them by geography and level of activity, meaning a company can schedule its tweets for the time of day when those followers are most likely to see them… very valuable indeed!

Canadian newcomers actively using social media

 “Immigrants are said to be highly motivated to understand events of the new society in addition to those within their minority circle and news of events in their home countries.  According to this ‘need to be informed’ explanation, the immigrant consumers are likely to spend more time with media than the majority.” [1]

While online users can find social media websites useful for keeping in touch with friends, family or for seeking new job opportunities, newcomers to Canada can find the networks a lifeline for cultural integration into their new society. This week I decided to explore these somewhat overlooked but important users of social media.

Ask an immigrant what their primary concern is when moving to a new country and many will tell you it’s finding a new job, and one that is similar to the work they did in their home country. Since 68 per cent of companies would hire a candidate based on their online profile on a social networking site, according to social media monitoring service Reppler, it’s extremely important for immigrants to create the right profile online when seeking that new job in the new country.

One indicator of how newcomers are flocking to the digital world is the website LoonLounge.com, an online community exclusively dedicated to connecting immigrants to services and groups that can help them settle. It has over 80,000 Canadian newcomer members and can be a perfect springboard for immigrants beginning to use other social networking sites that are popular in Canada but might not be in their native countries.

The site aims to “improve the Canadian immigration process for the millions of people involved: applicants waiting in the queue, new immigrants adjusting to life in Canada, Canadian employers waiting for skilled workers to arrive, and the many people around the world who dream of one day making Canada their home.”

With all this in mind, I decided to talk to a Canadian newcomer who is currently using social media channels to settle into a new Canuck way of life. Sandhya Ranjit is a former manager of corporate communications from Bangalore, India. She immigrated to Canada two years ago and is focused on trying to find similar work in her field.

Ranjit said that in the social media arena, she has found the most success with LinkedIn for her job hunt. Two years ago, when she was working in India, she said, no one around her was on LinkedIn, which launched in May 2003. She only created her profile after coming to Canada.

She doesn’t have a Facebook account and isn’t active on Twitter, but she likes LinkedIn for its professional feel and for the fact that it suggests other users she might want to connect with and notifies her of discussions relevant to her field that she might want to join.

Ranjit said there can be a hesitation among people of her culture to become very visible in the social media world, so they might have an initial fear of creating public profiles on channels like Twitter. While she is very comfortable writing emails because they are a one-on-one interaction, she said a social network makes the user more visible, which some people from her culture might hesitate to do because it’s not something they normally do. But she said she has to do it because it will help her with her job search.

“We don’t come out; socially we’re not very active,” she said, adding that in Canada she hasn’t met anyone who is uncomfortable with creating social media profiles, especially the more professionally-oriented ones. “When you’re using LinkedIn for professional purposes, it’s a little different. You’re not talking about hobbies like on Facebook,” she said.

“My friends, most who are here, are also active. I’ve met so many immigrants on LinkedIn, the learning curve is the hesitation to use it,” she added.

She said that now everyone she knows in India is active on social media sites like Twitter and Facebook, but the idea of making oneself visible professionally isn’t as popular, because of the worry that an employer might see a user’s profile and not like how openly visible their employee is. But she said more people might be opening up to the idea, especially if they limit the information they post.

Ranjit said she likes to use LinkedIn because she can connect to hundreds of people, which would be impossible to do in real life in the same amount of time. She said when she finally talks to them or meets them, she feels like she already knows them, which really helps with her networking efforts in trying to find a great job.

[1] Wei Na Lee and David K. Tse, “Changing Media Consumption in a New Home: Acculturation Patterns Among Hong Kong Immigrants to Canada.”

Where does social media stand in the world of healthcare?

A new study from business services firm PricewaterhouseCoopers has found that pharmaceutical and healthcare brands lag behind other businesses when it comes to taking advantage of the growth opportunities available through social media channels.

The study found the industry’s executives are behind in social media use when compared to their customers. Out of 124 executives interviewed, half expressed worries about how they were going to integrate social media into their business strategy. Those leaders also weren’t sure how to prove the return on investment, according to the study.

Out of 1,060 consumers that PwC polled, about 42 percent read health-related user reviews on social media sites like Facebook and Twitter. Thirty-two percent of those polled said they accessed information which concerned the health experiences of friends and family. Twenty-nine percent looked for social media users who had had an illness similar to their own, and 24 percent looked at videos and photos uploaded from people currently suffering from a similar illness.

Kelly Barnes, PwC’s US health industries leader, said, “The power of social media for health organizations is in listening and engaging with consumers on their terms. Savvy adopters are viewing social media as a business strategy, not just a marketing tool.”

A few more key stats from the survey included:

-28% of users supported health-related causes on the web

-24% uploaded comments giving details about their own health status

-16% posted reviews of medication

-15% mentioned health insurers.

The study showed that health brands would see huge benefits from creating and monitoring social media channels, because 43 percent of users said they would be likely to share positive experiences about a brand of medication they used, and 38 percent said they would share negative opinions, giving anyone interested a transparent view of a medication’s effectiveness.

Seventy percent of the social media users expected a response from an inquiry on a healthcare company’s social media channels to come back within 24 hours, and 66 percent expected the same response time for a complaint about goods or services.

_________

Social media users had one of their first digital interactions with the medical world in February 2009, when Henry Ford Hospital in Detroit, MI became one of the first hospitals to allow a procedure to be live-tweeted from within the operating room. Used as a sort of real-time textbook, the feed let other doctors, medical students and anyone else who was curious follow along as surgeons gave updates on a kidney surgery to remove a cancerous tumour.

While not as directly profitable as a healthcare brand’s own social media marketing, such events can generate excitement for the hospital or medical organization, especially when it looks to raise money during fundraising campaigns or attract new patients.