Had an idea for reducing the memory footprint of redis sets. A redis set may be encoded as an intset (if the set contains nothing but integers) or a hash table. But if a set contains integers and other values, you lose out on the benefits of intsets (speed and memory efficiency), and in some cases you have to wait while redis converts your intset to a hash table. The conversion may not be a big penalty, but it all depends on the order in which you add items.
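
To see the conversion in action, here's a minimal sketch using the hiredis client against a local Redis server (the key name `myset`, the element count, and the use of hiredis are all just assumptions for illustration). After adding only integers, OBJECT ENCODING reports intset; one non-integer member flips the whole set to hashtable on the builds this post is describing.

```c
#include <stdio.h>
#include <hiredis/hiredis.h>

int main(void) {
    /* assumes a Redis server listening on localhost:6379 */
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (c == NULL || c->err) {
        fprintf(stderr, "connection error\n");
        return 1;
    }

    redisReply *r = redisCommand(c, "DEL myset");
    freeReplyObject(r);

    /* add a handful of integers: the set stays intset-encoded */
    for (int i = 0; i < 100; i++) {
        r = redisCommand(c, "SADD myset %d", i);
        freeReplyObject(r);
    }
    r = redisCommand(c, "OBJECT ENCODING myset");
    printf("after integers only: %s\n", r->str);   /* expect "intset" */
    freeReplyObject(r);

    /* a single non-integer forces a conversion of the entire set */
    r = redisCommand(c, "SADD myset %s", "hello");
    freeReplyObject(r);
    r = redisCommand(c, "OBJECT ENCODING myset");
    printf("after one string:    %s\n", r->str);   /* expect "hashtable" */
    freeReplyObject(r);

    redisFree(c);
    return 0;
}
```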

I also noticed that intsets can be encoded in different sizes, which brings me to my idea: what if each “set” could contain data encoded in all possible formats? Or put another way, what if each value were stored in the most efficient encoding available? 16-bit intsets for integers that fit in 16 bits, 32-bit or 64-bit intsets for larger integers, and hash tables for non-integers. Surely there will be better encoding formats/data structures in the future, so it makes sense to prepare for them. Yes, the logic in t_set.c would have to change quite a bit, but there aren’t more than 700 lines of code there at the moment.
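
To make the idea concrete, here's a minimal sketch of the routing decision; the names (enc_t, pick_encoding) are hypothetical and not from t_set.c. Each incoming value is classified into the narrowest intset width that can hold it, falling through to the hash table for anything that isn't an integer:

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* hypothetical per-value encodings for a "hybrid" set */
typedef enum { ENC_INT16, ENC_INT32, ENC_INT64, ENC_HASHTABLE } enc_t;

/* classify a value into the smallest encoding that can hold it */
static enc_t pick_encoding(const char *val) {
    char *end;
    errno = 0;
    long long n = strtoll(val, &end, 10);
    if (errno || *val == '\0' || *end != '\0')
        return ENC_HASHTABLE;               /* not a representable integer */
    if (n >= INT16_MIN && n <= INT16_MAX) return ENC_INT16;
    if (n >= INT32_MIN && n <= INT32_MAX) return ENC_INT32;
    return ENC_INT64;
}

int main(void) {
    const char *samples[] = { "42", "70000", "9999999999", "hello" };
    const char *names[]   = { "int16", "int32", "int64", "hashtable" };
    for (size_t i = 0; i < sizeof(samples)/sizeof(samples[0]); i++)
        printf("%-12s -> %s\n", samples[i], names[pick_encoding(samples[i])]);
    return 0;
}
```

The real work, of course, is everything around that decision: per-encoding membership tests, iteration across the sub-structures, and promotion rules when a value outgrows its width. That's where most of the churn in t_set.c would come from.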

My instincts tell me these efforts will yield measurable results. However, I’m already anticipating some push-back from the other contributors for making sets “less pure”. If my solution results in better memory usage, less code, and a speedup, I’ll still be happy.

So … onward! This past week I began coding the changes. Nothing to commit yet.