How to manage large data structures in Python

Our sentiment analysis relies on various models to do feature extraction when determining sentiment. These models are stored as Python dictionaries, and they're big: about 40MB each, with 14 models in total. Doing the math, that's 560MB of data that must be available to a process. Now, not all of the data is used for each request; only some of the keys in these dictionaries are needed for any particular piece of text being analyzed.

With all this in mind, we recently hit a performance snag with our sentiment engine. The naive approach was to load these models at server startup by referencing them in a module of our Django app. (Repustate's API is Django under the hood.) The downside was that it took a while, about 40 seconds, to start up the API server each time we made a change to the code. The upside was that our code was easy to reason about, because we were just using plain Python dictionaries.

But our memory usage, as you can imagine, skyrocketed. We run our API servers under Apache's worker MPM, and each process gets its own copy of those models. Our machines have a good amount of RAM, but not enough, and we started to see lots of thrashing and swapping. Performance degraded as time went on. We needed a better solution.

Enter the ModelProxy

We came up with the notion of a ModelProxy. This Python class would mimic a normal Python dictionary as much as possible in order to minimize changes to our code (e.g. my_model_proxy['some_key'] would still work), but under the hood, the special "magic" methods would access a separate data store.

We took our models and flattened them out, storing the flat key/value pairs in MongoDB. What does a flattened dictionary look like? Imagine you have a dictionary with the following structure:

{'a':'b', 'c':{'d':'e'}}

A flattened version would look like this:

{'a':'b', 'c.d':'e'}

As you can see, we combined the keys 'c' and 'd' to make a new composite key 'c.d'. Applying this recursively converts an arbitrarily nested dictionary into a flat, simple key/value dictionary of only one level. How you combine the individual keys to form the new "composite" key is up to you and might be influenced by your underlying data store: MongoDB doesn't allow '.' in key names, so we used '//' as our separator. In our flattening code we also hash the keys; again, that's to remove any '.' characters, which our keys do contain.
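A minimal sketch of such a flattening routine follows. It assumes md5 for the key hashing and uses the '//' separator described above; the function and variable names are illustrative, not Repustate's actual code.

```python
import hashlib

SEP = '//'

def hash_key(key):
    # Hashing guarantees no '.' (which MongoDB disallows in key names)
    # ever appears inside an individual key.
    return hashlib.md5(key.encode('utf-8')).hexdigest()

def flatten(d, parent=''):
    """Recursively flatten a nested dict into a single-level dict."""
    flat = {}
    for key, value in d.items():
        hashed = hash_key(key)
        composite = parent + SEP + hashed if parent else hashed
        if isinstance(value, dict):
            # Recurse, carrying the composite key as the new prefix.
            flat.update(flatten(value, composite))
        else:
            flat[composite] = value
    return flat

flat = flatten({'a': 'b', 'c': {'d': 'e'}})
```

The result has two entries: one for the hashed key 'a' and one for the composite hashed key 'c' // 'd', with the original leaf values 'b' and 'e'.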

So now we have a flattened dictionary. We push these key/value pairs into a MongoDB collection, and now we have a really fast, process-independent data store. The last bit is the aforementioned ModelProxy. The proxy provides a similar interface to a Python dictionary, but whenever __getitem__ is called via the index notation, i.e. some_dict[some_key], we convert that into a query on our MongoDB collection. If we use the dictionary from the first example above, this is how you'd use the model proxy:

>>> proxy['a'].value()
'b'
>>> proxy['c']
<ModelProxy instance>
>>> proxy['c']['d'].value()
'e'

You'll notice that when we call __getitem__ with a key whose value is another dictionary, the return value is a new ModelProxy instance. This is how we're able to traverse an otherwise multilevel dictionary without having to change any notation or Python code. The ModelProxy also has some helper functions to facilitate getting the values for multiple keys all in one query.
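A minimal sketch of a ModelProxy along these lines is below. For clarity it is backed by a plain dict of flattened keys standing in for the MongoDB collection, and key hashing is omitted; in production each lookup would become a find_one query and keys would be hashed the same way they were at flattening time. Names are illustrative.

```python
class ModelProxy(object):
    """Dictionary-like proxy over a flat key/value store.

    `store` is any mapping of flattened composite keys to values; in
    production it would be a MongoDB collection queried per lookup.
    """
    SEP = '//'

    def __init__(self, store, prefix=''):
        self._store = store
        self._prefix = prefix

    def __getitem__(self, key):
        # Build up the composite key; return a new proxy so nested
        # indexing (proxy['c']['d']) works without touching the store.
        composite = self._prefix + self.SEP + key if self._prefix else key
        return ModelProxy(self._store, composite)

    def value(self):
        # Only here do we actually hit the underlying data store.
        return self._store[self._prefix]

    def values(self, *keys):
        # Helper: fetch several sibling keys at once (one query in MongoDB).
        return [self[k].value() for k in keys]

proxy = ModelProxy({'a': 'b', 'c//d': 'e'})
print(proxy['a'].value())        # 'b'
print(proxy['c']['d'].value())   # 'e'
```

Because __getitem__ never queries the store, traversal is free; the cost is paid only when value() is called, which is why that extra method is needed.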

The one downside is that we had to introduce the value method to actually retrieve the value from MongoDB, but other than that, we were able to mimic a Python dictionary perfectly (or as closely as we needed to).

A very simple example of using "expect" on Unix

Oftentimes you have to interact with programs that require passwords or some other input from the user. For security purposes, some programs will not read from stdin, so you have to be creative. Enter "expect". Expect is a program written in Tcl that allows you to script a conversation you'd have with any number of programs. There are lots of examples on the web, but I wanted to put up a really simple one just to get the picture across of how it can be used.

Let's consider this simple Python program:

input = raw_input("Enter:")
print "You entered %s" % input

Running this program will print "Enter:" to the command line and wait for the user to type something and hit Enter. It will then print out what the user entered: a simple echo program. Let's drive this programmatically using expect (aside: yes, I know you don't need expect to do something like this).

Here's our expect script, assuming the Python program above is saved as echo.py (the filename is just for this example):

#!/usr/bin/expect
spawn python echo.py
expect "Enter:"
send "Repustater\r"
expect eof

Line by line:

  1. The shebang indicates which interpreter should be used to execute the script that follows.
  2. spawn starts a new program. So we're running our Python program as we normally would.
  3. OK, here's the magic: we're saying that we expect the string "Enter:" to be printed by the spawned program.
  4. Now we send the string "Repustater" to the waiting Python process (watch that carriage return, \r; we need it, otherwise Python will just keep waiting).
  5. This line means to wait for the program we spawned to finish before we exit the expect script.
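As the aside above hinted, a simple echo program like this one can also be driven without expect, for instance from Python itself with the standard subprocess module, because this child happily reads stdin. A sketch (the child program is ported to Python 3 syntax here):

```python
import subprocess
import sys

# The echo program from above, in Python 3 syntax for this sketch.
child_code = 's = input("Enter:")\nprint("You entered %s" % s)'

# Feed "Repustater" on stdin and capture everything the child prints.
result = subprocess.run(
    [sys.executable, "-c", child_code],
    input="Repustater\n",
    capture_output=True,
    text=True,
)
print(result.stdout)
```

This only works because the child reads from stdin; programs that insist on a real terminal, such as password prompts, are exactly where expect earns its keep.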

There you have it: a really simple intro to using expect to expect what you expect. There is so much more you can do with expect, and it's left to the reader to go out and discover all its nifty features. It really comes in handy when you want to interact with programs that require passwords but, for security purposes, won't read from stdin.