Friday 25 February, 2011

Versioned Entity Caching for high read/write on Google App Engine Python


The usage of memcache to lower the load on the App Engine Datastore is well known. Here is an approach to versioned caching of datastore entities in App Engine. The Model is designed primarily for cases where an entity is updated very frequently. The idea for this kind of entity came up while working on the multiuser chat room for App Engine using the Channel API.

Expectations

The Model is designed with the following factors in mind:

  • Very frequent read/write access to entities
  • Effective usage of memcache to reduce the number of datastore operations
  • Strong consistency between the datastore and memcache


Design Plan

Each entity carries an internal version number, which is incremented every time the entity is updated.


from google.appengine.ext import db


#   The maximum difference in revisions acceptable at any instant
#   between the memcached value and the datastore value. The higher it
#   is, the more data you stand to lose when memcache goes down, but
#   the fewer datastore operations you perform. The lower it is, the
#   more consistent your datastore and memcache are, at the cost of
#   more datastore operations. A value of 4~6 works well in practice.
FAULT_TOLERANCE = 4

class GlobalVersionedCachingModel(db.Model):
    """
    The Model versions its data internally, with a prime focus on very
    high read/write rates and consistency.
    Every entity carries two version numbers: one for the datastore and
    one for memcache. When the entity is updated, the update happens in
    memcache only and the memcached version number increases. Once this
    number gets ahead of the datastore version number by a certain
    amount, called the "fault tolerance", the datastore entity is
    sync'd with the memcache entity.
    """
    
    _db_version = db.IntegerProperty (default=0, required=True)
    _cache_version = db.IntegerProperty (default=0, required=True)
    _fault_tolerance = db.IntegerProperty(default = FAULT_TOLERANCE)
    

Approach #1

Initially, both versions are set to 0 when the entity is created. _fault_tolerance is the maximum allowed difference between the cache version and the datastore version. Since the entity is primarily read from and written to memcache, and only later flushed to the datastore, the datastore version number can lag behind the memcache version number. When the difference between the two reaches the fault tolerance, the datastore is updated with the memcache state.

The downside of this approach is that when memcache goes down, the entity has to be fetched from the datastore. In the worst case, that entity lags the actual entity by up to _fault_tolerance versions. The fault tolerance can be decreased to improve consistency between the memcached entity and the datastore entity, but that results in more datastore read/write operations.
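
To make the drift concrete, here is a minimal sketch of how the two version numbers move, assuming a hypothetical ChatRoom subclass of GlobalVersionedCachingModel and the default fault tolerance of 4:


room = ChatRoom (key_name='lobby')   # hypothetical subclass
room.put ()      # _cache_version=1; the very first put also writes to the datastore

for _ in range (3):
    room.put ()  # versions 2-4 live in memcache only; the datastore stays at 1

room.put ()      # version 5: 5 - 1 >= 4, so the datastore is sync'd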


Approach #2

In this approach, instead of writing into the datastore directly, we enqueue a task and let the task queue do the write for us. The good part of this approach is that we can keep smaller values of _fault_tolerance and still expect fast processing, since the request no longer blocks on the datastore write. This approach is ideal where strong consistency is required between the entities and latency must also be minimal.
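
A minimal sketch of this variant, assuming the deferred library is enabled in app.yaml (QueuedVersionedCachingModel and _sync_to_db are illustrative names, and the serialize/deserialize helpers from the Methods section below are reused):


from google.appengine.api import memcache
from google.appengine.ext import deferred


def _sync_to_db (keyname):
    # Runs on the task queue: pull the latest state from memcache
    # and push it into the datastore
    entity = deserialize_entities (memcache.get (keyname))
    if entity is not None:
        entity.update_to_db ()


class QueuedVersionedCachingModel (GlobalVersionedCachingModel):

    def put (self):
        self._cache_version += 1
        self.update_cache ()
        if self._cache_version - self._db_version >= self._fault_tolerance or \
                                                self._cache_version == 1:
            # Enqueue the datastore write instead of blocking on it
            deferred.defer (_sync_to_db, self.keyname)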


Methods
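
The methods below use serialize_entities and deserialize_entities helpers that are not defined in this post. A common implementation, along the lines of Nick Johnson's "Efficient model memcaching" recipe, round-trips entities through protocol buffers:


from google.appengine.datastore import entity_pb
from google.appengine.ext import db


def serialize_entities (models):
    if models is None:
        return None
    elif isinstance (models, db.Model):
        # A single entity: encode it as a protocol buffer string
        return db.model_to_protobuf (models).Encode ()
    else:
        return [serialize_entities (m) for m in models]


def deserialize_entities (data):
    if data is None:
        return None
    elif isinstance (data, str):
        return db.model_from_protobuf (entity_pb.EntityProto (data))
    else:
        return [deserialize_entities (d) for d in data]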


import logging

from google.appengine.api import datastore
from google.appengine.api import memcache
from google.appengine.ext import db


def get2 (keys, **kwargs):
    """
    Fetches entities from memcache first, falls back to the datastore
    for the keys that were not cached, and re-populates memcache
    afterwards. Results are not guaranteed to be in the order of keys.
    """
    keys, multiple = datastore.NormalizeAndTypeCheckKeys (keys)
    cached = memcache.get_multi (map (str, keys))
    ret = map (deserialize_entities, cached.values ())
    # Only hit the datastore for the keys that were missing from memcache
    keys_to_fetch = [key for key in keys if cached.get (str (key)) is None]
    fetched = db.get (keys_to_fetch)
    memcache_to_set = dict (zip (map (str, keys_to_fetch),
                                 map (serialize_entities, fetched)))
    ret.extend (fetched)
    memcache.set_multi (memcache_to_set)
    if multiple:
        return ret
    if len (ret) > 0:
        return ret[0]

class GlobalVersionedCachingModel(db.Model):
    """
    The Model versions its data internally, with a prime focus on very
    high read/write rates and consistency.
    Every entity carries two version numbers: one for the datastore and
    one for memcache. When the entity is updated, the update happens in
    memcache only and the memcached version number increases. Once this
    number gets ahead of the datastore version number by a certain
    amount, called the "fault tolerance", the datastore entity is
    sync'd with the memcache entity.
    """
    
    _db_version = db.IntegerProperty (default=0, required=True)
    _cache_version = db.IntegerProperty (default=0, required=True)
    _fault_tolerance = db.IntegerProperty(default = FAULT_TOLERANCE)
    created = db.DateTimeProperty (auto_now_add=True)
    updated = db.DateTimeProperty (auto_now=True)
    
    @property
    def keyname (self):
        return str (self.key ())

    def remove_from_cache (self, update_db=False):
        """
        Removes the cached instance of the entity. If update_db is True,
        then updates the datastore before removing from cache so that no data
        is lost.
        """
        if update_db:
            self.update_to_db()
        memcache.delete(self.keyname)
    
    def update_to_db (self):
        """
        Updates the current state of the entity from memcache to the datastore
        """
        self._db_version = self._cache_version
        logging.info ('About to write into db. Key: %s' % self.keyname)
        self.update_cache ()
        return super (GlobalVersionedCachingModel, self).put ()
    
    def update_cache (self):
        """
        Updates the memcache entry for this entity
        """
        memcache.set (self.keyname, serialize_entities (self))

    def put (self):
        self._cache_version += 1
        self.update_cache ()
        # Write through to the datastore on the very first put, and
        # thereafter whenever memcache has drifted _fault_tolerance
        # versions ahead of the datastore
        if self._cache_version - self._db_version >= self._fault_tolerance or \
                                                self._cache_version == 1:
            self.update_to_db ()
    
    def delete (self):
        self.remove_from_cache()
        return super (GlobalVersionedCachingModel, self).delete ()
    
    @classmethod
    def get_by_key_name (cls, key_names, parent=None, **kwargs):
        try:
            parent = db._coerce_to_key (parent)
        except db.BadKeyError, e:
            raise db.BadArgumentError (str (e))
        rpc = datastore.GetRpcFromKwargs (kwargs)
        key_names, multiple = datastore.NormalizeAndTypeCheck (key_names, basestring)
        keys = [datastore.Key.from_path (cls.kind (), name, parent=parent) for name in key_names]
        if multiple:
            return get2 (keys)
        else:
            return get2 (keys[0], rpc=rpc)
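
Putting it all together, a typical read-or-create cycle could look like this (Room is a hypothetical subclass and 'lobby' an arbitrary key name):


class Room (GlobalVersionedCachingModel):
    members = db.StringListProperty ()


room = Room.get_by_key_name ('lobby') or Room (key_name='lobby')
room.members.append ('alice')
room.put ()   # memcache is updated every time; the datastore only on drift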
    
    
    

