Our sentiment analysis relies on several models to extract the features used to determine sentiment. These models are stored as Python dictionaries, and they're big: about 40MB each, with 14 models in total. Doing the math, that's 560MB of data that must be available to a process. Not all of that data is used on every request, though; only some of the keys in these dictionaries are needed for any particular piece of text being analyzed.
With all this in mind, we recently hit a performance snag with our sentiment engine. The naive approach was to load the models at server startup by referencing them in the settings.py file of our Django app (Repustate's API is Django under the hood). The downside was that it took about 40 seconds to start the API server after every code change. The upside was that the code was easy to reason about, because we were just using plain Python dictionaries.
But our memory usage, as you can imagine, skyrocketed. We use Apache's mpm_worker module to run our API servers, and each process gets its own copy of those models. Our machines have a good amount of RAM, but not enough, and we started to see heavy thrashing and swapping. Performance degraded as time went on. We needed a better solution.
Enter the ModelProxy
We came up with the notion of a ModelProxy. This Python class would mimic a normal Python dictionary as much as possible in order to minimize changes to our code (e.g. my_model_proxy['some_key'] would still work), but under the hood, the special "magic" methods would access a separate data store.
We took our models and flattened them out, storing the flat key/value pairs in MongoDB. What does a flattened dictionary look like? Imagine you have a dictionary with the following structure:
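The original example isn't reproduced here, but a dictionary with the shape the next paragraph discusses (a nested key 'd' under 'c') might look like this; the values are illustrative:

```python
# Illustrative nested model -- key names match the discussion below.
model = {
    'a': 1,
    'b': 2,
    'c': {
        'd': 3,
        'e': 4,
    },
}
```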
A flattened version would look like this:
As you can see, we combined the keys 'c' and 'd' into a new key 'c.d'. Applied recursively, this converts a dictionary that is several levels deep into a flat, one-level dictionary of simple key/value pairs. How you combine the individual keys to form the new "composite" key is up to you and may be influenced by your underlying data store: MongoDB doesn't allow '.' in key names, so we used '//' as our separator. Here's the code to flatten a Python dictionary. You'll notice in the code that we also hash the keys; again, that's to remove any '.' from the key names, which our keys do contain.
So now we have a flattened dictionary. We push these key/value pairs into a MongoDB collection, giving us a really fast, process-independent data store. The last piece is the aforementioned ModelProxy. The proxy provides an interface similar to a Python dictionary's, but whenever __getitem__ is called via the index notation, i.e. some_dict[some_key], we convert that into a query against our MongoDB collection. Using the dictionary from the first example above, model_proxy['c']['d'] works exactly as before; each lookup simply becomes a MongoDB query instead of an in-memory access.
You'll notice that when __getitem__ is called on a key whose value is another dictionary, the return value is a new ModelProxy instance. This is how we're able to traverse an otherwise multi-level dictionary without changing any notation or Python code. Here's the code for the ModelProxy. You'll also see some helper functions in there to facilitate fetching the values for multiple keys in a single query.
The one downside is that we had to introduce a value method to actually retrieve the value from MongoDB, but other than that, we were able to mimic a Python dictionary perfectly (or at least as much as we needed to).