My first motivation for investigating Redis came from an optimization challenge at work. Our network of websites covers track and field and cross-country, and we have a database of 13 million performance records; naturally, our site lets visitors view up-to-date rankings. Currently we use Sphinx for this, but Sphinx is geared toward full-text search and was quite slow at producing rankings (we’ve since optimized Sphinx further, which I hope to write about later). Producing rankings per education level, gender, season, and event doesn’t require full-text search; it doesn’t require any sort of search at all. Instead, I decided to partition our performance IDs across several (about 60) sets and use zsets to order them by performance score. More about the specifics at the end.

Redis seemed a great fit, but in initial tests I quickly discovered that pumping in 13 million records’ worth of rankings takes more than 8GB of RAM. So I began looking at the Redis code and found that sets whose members are all integers are encoded as intsets for up to 512 entries (the default limit) and as hash tables after that. I then had an idea to allow sets to contain both “encodings” simultaneously; read about that here.
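You can watch the conversion happen from redis-cli; the same switch to a hash table occurs once a set grows past set-max-intset-entries integer members (the key and member names here are made up):

    redis> SADD smallset 1 2 3
    (integer) 3
    redis> OBJECT ENCODING smallset
    "intset"
    redis> SADD smallset not-an-integer
    (integer) 1
    redis> OBJECT ENCODING smallset
    "hashtable"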

Using intsets proves to be both faster and less memory-hungry. So if you need to create sets of pure integers, raise the “set-max-intset-entries” value to an appropriate level and have at it.
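For example, in redis.conf (the number below is just a placeholder; pick something comfortably above your largest expected set):

    # default is 512; an all-integer set larger than this is converted to a hash table
    set-max-intset-entries 5000000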

I only know this through testing, which I did using the Redis 2.4 branch against my own code, where sets can simultaneously be encoded using both methods. If I had tested against 2.4 with a much higher set-max-intset-entries value, I’m certain I would have seen similar results.

To test, I launched either vanilla 2.4 or my own version of redis-server, generated a file of Redis commands, and piped the file in through hiredis, using the same input file for both builds.
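The input file is just plain Redis commands, one per line. As a rough equivalent to piping it through hiredis, you can pipe the same file straight into redis-cli (the filename here is made up):

    $ cat ranking-commands.txt | redis-cli > /dev/null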

The summary:

  • The commands completed in about an hour with hash table sets
  • They completed in about 20 minutes with intsets
  • Memory utilization with intset encoding was also about half that of the hash table run

Intset encoding proved to be about twice as good in terms of both memory and SADD speed. I must say that I didn’t fully benchmark set intersection under both encodings; it may be that hash tables are faster there, but I doubt it.
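For context, the intersections I care about are the ranking queries themselves. A sketch of one, with hypothetical key names: intersect the relevant partition sets, then pull the real scores in from the zset. ZINTERSTORE treats a plain set as if every member had a score of 1, so the WEIGHTS clause keeps only the zset’s scores.

    # performance IDs matching every filter
    SINTERSTORE tmp:candidates state:5 season:2 level:1 gender:1

    # order those IDs by their score in the scores zset
    ZINTERSTORE tmp:ranked 2 tmp:candidates scores WEIGHTS 0 1

    # top 50 (use ZREVRANGE if higher scores should rank first)
    ZRANGE tmp:ranked 0 49 WITHSCORES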

My test input file was created with the redis/flood.php script from here. Each of 5 million integers is added to the following (a few sample lines appear after the list):

  • 1 of 50 state sets
  • 1 of 4 season sets
  • 1 of 2 level sets
  • 1 of 2 gender sets
  • a zset with a score
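With made-up key names and scores, the generated lines for a single performance ID look roughly like this:

    SADD state:5 1234567
    SADD season:2 1234567
    SADD level:1 1234567
    SADD gender:1 1234567
    ZADD scores 642.7 1234567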

Hope someone finds this useful.