Sharing large data structures across processes in Python

At Repustate, many of the data models we use in our text analysis can be represented as simple key-value pairs, or dictionaries in Python lingo. In our case the dictionaries are massive, a few hundred MB each, and they need to be accessed constantly. In fact, a single HTTP request might touch 4 or 5 models, each doing 20-30 lookups. So the problem we face is how to keep things fast for the client while keeping the server as light as possible. We also distribute our software as virtual machines to some customers, so memory usage has to stay low because we can’t control how much memory our customers will allocate to the VMs they deploy.

To summarize, here’s our checklist of requirements:

  1. Low memory footprint
  2. Can be shared amongst multiple processes with no issues (read only)
  3. Very fast access
  4. Easy to update (write) out of process

Our first attempt was to store the models on disk in MongoDB and load them into memory as Python dictionaries. This worked and satisfied #3 and #4 but failed #1 and #2. This is how Repustate operated for a while, but memory usage kept growing and it became unsustainable. Python dictionaries are not memory efficient, and because we were not sharing the data between processes, every Apache process needed its own copy, which was too expensive.

One night I was complaining about our dilemma and a friend of mine, who happens to be a great developer at Red Hat, said these three words: “memory mapped file”. Of course! In fact, Repustate already uses memory mapped files, but I had completely forgotten about them. So that solves half my problem: it meets requirement #2. But what format should the memory mapped file take? Thankfully computer science has already solved all the world’s problems and the perfect data structure was already out there: tries.
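Before getting to tries, here’s a minimal sketch of the memory mapped half of the idea using only Python’s standard library. The fixed-width record layout, the 16-byte key limit and the function names are assumptions made up for illustration, not the format Repustate uses. Each worker process would call open_table on the same file, and since the mapping is read only the operating system can back all of them with the same pages.

```python
import mmap
import struct

# Hypothetical record layout: 16-byte key (NUL padded) followed by a float64 score.
RECORD = struct.Struct("<16sd")


def write_table(path, scores):
    """Write key/score pairs sorted by encoded key so readers can binary search."""
    rows = sorted((key.encode("utf-8"), value) for key, value in scores.items())
    with open(path, "wb") as f:
        for key, value in rows:
            if len(key) > 16:
                raise ValueError("key does not fit the 16-byte slot: %r" % key)
            f.write(RECORD.pack(key, value))


def open_table(path):
    """Map a non-empty table file read-only; the mapping outlives the file handle."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lookup(table, key, default=None):
    """Binary search the mapped records, touching only the pages it needs."""
    target = key.encode("utf-8")
    lo, hi = 0, len(table) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        raw_key, value = RECORD.unpack_from(table, mid * RECORD.size)
        found = raw_key.rstrip(b"\0")
        if found == target:
            return value
        if found < target:
            lo = mid + 1
        else:
            hi = mid
    return default
```

A sorted flat file like this is shareable, but it wastes space on padding and repeats common prefixes in every key, which is where a smarter on-disk structure comes in.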

Tries (pronounced “trees” for some reason and not “try’s”), AKA prefix trees, are a data structure that lends itself to collections that need string keys; radix trees are a compressed variant of the same idea. Wikipedia has a better write-up, but long story short, tries are great for the type of models Repustate uses.
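To make the idea concrete, here’s a toy trie built from nested Python dictionaries. It is only a sketch of the general data structure: it is not compressed, it is not how MARISA stores anything, and the class and method names are invented for this example. The point is just that keys sharing a prefix share the nodes for that prefix.

```python
class Trie:
    """A plain, uncompressed trie mapping string keys to values."""

    _END = object()  # sentinel dict key marking where a stored value lives

    def __init__(self):
        self.root = {}

    def insert(self, key, value):
        """Walk or create one node per character, then store the value."""
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[self._END] = value

    def get(self, key, default=None):
        """Follow the key character by character; bail out on a missing edge."""
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return default
        return node.get(self._END, default)

    def keys_with_prefix(self, prefix):
        """Return every stored key that starts with the given prefix, sorted."""
        node = self.root
        for ch in prefix:
            node = node.get(ch)
            if node is None:
                return []
        found = []
        stack = [(prefix, node)]
        while stack:
            path, current = stack.pop()
            for ch, child in current.items():
                if ch is self._END:
                    found.append(path)
                else:
                    stack.append((path + ch, child))
        return sorted(found)
```

For example, after inserting “great”, “greatly” and “grim”, the three keys share the nodes for “gr”, and keys_with_prefix("great") returns the first two.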

I found this package, marisa-trie, which is a Python wrapper around a C++ implementation of a MARISA trie. “MARISA” is an acronym for Matching Algorithm with Recursively Implemented StorAge. What’s great about marisa tries is that the storage format really shrinks how much memory you need. The author of the Python package claims a 50-100X reduction in size, and our experience is similar.

Better still, the underlying trie structure can be written to disk and then read back in via a memory mapped object. With a memory mapped marisa trie, all of our requirements are now met. Our server’s memory usage went down dramatically, by about 40%, and our performance was unchanged from when we used Python’s dictionary implementation.
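Here’s a rough sketch of that write-then-mmap workflow using the marisa-trie package’s RecordTrie. The file name, the single float per key and the helper names are all assumptions for illustration, not Repustate’s actual model format. The import is guarded only so the snippet still loads on a machine without the package installed.

```python
import os
import tempfile

try:
    import marisa_trie  # third-party: pip install marisa-trie
except ImportError:
    marisa_trie = None  # the helpers below cannot run without the package

SCORE_FORMAT = "<d"  # assumed payload: one float64 score per key


def build_model(path, scores):
    """Writer side: pack key/score pairs into a trie and save it to disk."""
    items = [(key, (value,)) for key, value in scores.items()]
    marisa_trie.RecordTrie(SCORE_FORMAT, items).save(path)


def open_model(path):
    """Reader side: memory-map the saved trie instead of loading a private copy."""
    trie = marisa_trie.RecordTrie(SCORE_FORMAT)
    trie.mmap(path)
    return trie


def score(model, word, default=0.0):
    """RecordTrie lookups return a list of tuples; take the first score if any."""
    if word in model:
        return model[word][0][0]
    return default


def demo():
    """Build a tiny model in a temp directory, map it, and look up two words."""
    path = os.path.join(tempfile.mkdtemp(), "sentiment.marisa")
    build_model(path, {"great": 0.9, "grim": -0.6})
    model = open_model(path)
    return score(model, "great"), score(model, "unknown")
```

In a setup like ours, something like build_model would run out of process whenever a model changes, and each web server process would call open_model on the same file.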

Next time you’re in need of sharing large amounts of data, give memory mapped tries a chance.


Chinese POS Tagger (and other languages)

Need an Arabic part of speech tagger (AKA an Arabic POS Tagger)? How about German or Italian? You’re in luck – Repustate’s internal POS taggers have been opened up via our API to give our developers the ability to slice and dice multilingual text the way we do.

The documentation for the POS tagger API call outlines all you need to know to get started. With this new API call you get:

  • English POS tagger
  • German POS tagger
  • Chinese POS tagger
  • French POS tagger
  • Italian POS tagger
  • Spanish POS tagger
  • Arabic POS tagger

Beyond this, we’ve unified the tag set you get from the various POS taggers so that you only have to write code once to handle all languages. The complete tag set includes nouns, adjectives, verbs, adverbs, punctuation marks, conjunctions and prepositions. Give it a shot and let us know what you think.