What makes Google Instant so good (and the copy-cats not so good)?

There’s more to Google Instant than Ajax. It’s the n-grams, stupid.

With the release of Google Instant, developers everywhere have sprung into action, building their own version of Instant for their favourite web service. This avalanche of development was undoubtedly spurred on by the job offer one developer received from YouTube on the strength of his YouTube Instant work. But what many of the copy-cats fail to recognize is that an asynchronous results page is not what makes Google Instant effective; it’s Google’s ability to predict your search. And how does Google do that? N-grams, my friend, n-grams.

What’s an n-gram? It’s simply a sequence of ‘n’ consecutive tokens (here, words). For example, a tri-gram (or 3-gram) would be “how are you” or “las vegas vacation”. A bi-gram would be “hey there” or “justin bieber”. Google stores every single n-gram it can get its hands on (via Google Search, Gmail, Google Voice, Google Voice Search, etc.) in order to improve its accuracy in predicting a human’s search intent. In fact, Google published a list of the most commonly used n-grams in its corpus a while back.
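To make the definition concrete, here is a minimal sketch in Python that pulls the n-grams out of a string. The function name `ngrams` and the whitespace tokenization are our own assumptions for illustration; this is not how Google or the Repustate API does it.

```python
def ngrams(text, n):
    """Return every run of n consecutive whitespace-separated tokens in text."""
    # Assumption: lowercase and split on whitespace; real tokenizers do more.
    tokens = text.lower().split()
    # Slide a window of width n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams("las vegas vacation deals", 2))
# ['las vegas', 'vegas vacation', 'vacation deals']
print(ngrams("las vegas vacation deals", 3))
# ['las vegas vacation', 'vegas vacation deals']
```

If the text has fewer than n tokens, the function simply returns an empty list.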
So what does this all mean to you, the enterprising developer? Repustate has added to its API a call to generate n-grams. With this call, you can generate your own n-grams over any data set or any web page you like and come up with your own prediction algorithms. For a quick demo, head over to our home page, choose “Generate n-grams” from the drop-down in the demo box on the right, enter a URL, and you’ll get a quick sampling of our n-gram generator.
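As a rough idea of what a home-grown prediction algorithm could look like, here is a small Python sketch that counts bi-grams in a handful of past queries and suggests the most likely next word. Everything in it (the function names `build_model` and `predict_next`, the sample queries) is hypothetical, and it runs on local data rather than calling the Repustate API.

```python
from collections import Counter, defaultdict


def build_model(queries):
    """Map each word to a Counter of the words seen immediately after it."""
    following = defaultdict(Counter)
    for query in queries:
        tokens = query.lower().split()
        # Each adjacent pair of tokens is one bi-gram.
        for current, nxt in zip(tokens, tokens[1:]):
            following[current][nxt] += 1
    return following


def predict_next(model, word, k=3):
    """Return up to k of the most frequent followers of word, most common first."""
    followers = model.get(word.lower(), Counter())
    return [w for w, _ in followers.most_common(k)]


# Made-up sample data standing in for a real query log.
queries = [
    "las vegas vacation",
    "las vegas hotels",
    "las vegas vacation deals",
    "las palmas weather",
]
model = build_model(queries)
print(predict_next(model, "las"))    # ['vegas', 'palmas']
print(predict_next(model, "vegas"))  # ['vacation', 'hotels']
```

A bigger corpus and longer n-grams would give better guesses; the idea stays the same.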

Or better yet, sign up for our API (it’s free and provides unlimited use) and start having at it. If you have any ideas or suggestions, please let us know here.

Two new updates to our API

Repustate’s API welcomes a new member to the family this week.

Our API will see two new updates this week. The first is to our ever-popular “clean-html” API call. We’ve beefed it up to handle more cases and to be more resilient when handling odd web pages.

The next update is something we’re really happy about: our newest API call, ngrams. An n-gram is a sequence of ‘n’ consecutive tokens. For example, a bi-gram is two tokens, such as “I like”. A tri-gram would be “I like Repustate”. You can count as high as you like, but in English you rarely go above 5-grams.

What’s the importance of n-grams? They let us see the frequencies and commonalities which occur in written text, which of course is crucial to our cause. You know when you type in a Google search and it just happens to know what you’re looking for? That’s because it’s returning the most common n-grams people have typed. Google has the world’s largest collection of n-grams; Repustate is trying to get there!
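To show what “seeing frequencies” means in practice, here is a small Python sketch that counts how often each n-gram appears in a piece of text. The helper name `ngram_counts` and the sample sentence are ours, purely for illustration, and the tokenization is a naive whitespace split rather than whatever the ngrams call does internally.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count how often each n-gram of whitespace-separated tokens occurs in text."""
    # Assumption: lowercase and split on whitespace only.
    tokens = text.lower().split()
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


sample = "I like Repustate and I like n-grams"
counts = ngram_counts(sample, 2)
print(counts.most_common(1))  # [('i like', 2)]
```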

As usual, let us know what you want to see or what you don’t like. We listen to everyone.