Friday, August 03, 2012

Second Level Caching


Why do we need caching?
Database operations are very expensive. A typical query first has to be transmitted over the network. The database then has to create a query plan for it and execute it. The result has to be transmitted back over the network before the client can process it. Most databases cache results, and some cache queries, which avoids re-executing a query and sometimes eliminates the need to build a query plan. However, how much this helps depends largely on the number of unique queries, and the network overhead is still there.

The natural and obvious answer is to cache query results at the client end of the database connection. Hibernate provides two levels of caching: the first-level cache (L1) and the second-level cache (L2). The first-level cache is the default cache, and every Hibernate session has its own L1 cache. The L1 cache keeps the number of generated queries to a minimum and defers updates until the session is flushed (typically at transaction commit), thus reducing network traffic.

The second-level (L2) cache is optional and disabled by default. It is associated with the session factory of a Hibernate application. Objects held in the second-level cache are available to all Hibernate sessions created from that session factory, which avoids network traffic when the same objects are requested by different sessions. The L2 cache can hold entities as well as query results. Depending on the environment, it can be distributed, replicated, or local.
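The lookup order described above can be sketched with plain Java maps. This is a toy model and not Hibernate's implementation: ToyFactory, ToySession, and the lambda standing in for the database are hypothetical names introduced only for illustration, and the sketch is single-threaded.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class TwoLevelCacheSketch {
    /** Shared by every session opened from the same factory (plays the role of L2). */
    static class ToyFactory {
        final Map<Long, String> secondLevel = new HashMap<>();
        final Function<Long, String> database; // hypothetical stand-in for a real query
        int databaseHits = 0;

        ToyFactory(Function<Long, String> database) {
            this.database = database;
        }

        ToySession openSession() {
            return new ToySession(this);
        }
    }

    /** Private to one unit of work (plays the role of L1). */
    static class ToySession {
        private final Map<Long, String> firstLevel = new HashMap<>();
        private final ToyFactory factory;

        ToySession(ToyFactory factory) {
            this.factory = factory;
        }

        String get(long id) {
            // 1. Look in the session-local cache.
            String value = firstLevel.get(id);
            if (value != null) {
                return value;
            }
            // 2. Look in the cache shared through the factory.
            value = factory.secondLevel.get(id);
            if (value == null) {
                // 3. Only now go to the "database", and remember the result for other sessions.
                factory.databaseHits++;
                value = factory.database.apply(id);
                factory.secondLevel.put(id, value);
            }
            firstLevel.put(id, value);
            return value;
        }
    }

    public static void main(String[] args) {
        ToyFactory factory = new ToyFactory(id -> "row-" + id);
        ToySession first = factory.openSession();
        first.get(1L);   // database hit
        first.get(1L);   // served by the first session's L1
        ToySession second = factory.openSession();
        second.get(1L);  // served by the shared L2
        System.out.println("database hits: " + factory.databaseHits); // prints "database hits: 1"
    }
}
```

Three reads across two sessions cost a single trip to the data source, which is the saving the L2 cache is meant to provide.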

Caching Strategies
The following four caching strategies are available:

Read-only - This strategy is useful for data that is read frequently but never updated. This is by far the simplest and best-performing cache strategy.

Read/write - Caches data that is sometimes updated while maintaining the semantics of "read committed" isolation level. If the database is set to "repeatable read", this concurrency strategy almost maintains the semantics. Repeatable read isolation is compromised in the case of concurrent writes.

Nonstrict read/write - Caches data that is sometimes updated without ever locking the cache. If concurrent access to an item is possible, this concurrency strategy makes no guarantee that the item returned from the cache is the latest version available in the database. Configure your cache timeout accordingly!

Transactional - This is a fully transactional cache that may be used only in a JTA environment.

Distributed vs Replication vs Local cache modes


Replication caching
A replicated cache is a clustered, fault-tolerant cache where data is fully replicated to every node in the cluster. The best part of a replicated cache is its access speed. Since the data is replicated to each cluster node, it is available for use without any waiting. This is referred to as "zero latency access," and is perfect for situations in which an application requires the highest possible speed in its data access. Each cluster node (JVM) accesses the data from its own memory. In contrast, updating a replicated cache requires pushing the new version of the data to all other cluster nodes, so writes become more expensive as the cluster grows. Since data is replicated to all machines, adding servers does not increase aggregate cache capacity.

Replication caching with invalidation
Invalidation is a clustered mode that does not actually share any data at all, but simply aims to remove data that may be stale from remote caches. This cache mode only makes sense if you have another, permanent store for your data, such as a database, and are only using the L2 cache as an optimization in a read-heavy system, to avoid hitting the database every time you need some state. If a cache is configured for invalidation rather than replication, every time data is changed in a cache, the other caches in the cluster receive a message informing them that their data is now stale and should be evicted from memory.
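The difference between replication and invalidation comes down to what a write sends to the other nodes. The sketch below models a cluster as a list of in-memory maps; Cluster, Node, and Mode are hypothetical names for illustration, and a real provider would send these messages over the network rather than loop over local objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ClusterModesSketch {
    enum Mode { REPLICATION, INVALIDATION }

    /** One JVM in the cluster, with its own in-memory cache. */
    static class Node {
        final Map<String, String> local = new HashMap<>();
    }

    static class Cluster {
        final List<Node> nodes = new ArrayList<>();
        final Mode mode;

        Cluster(Mode mode, int size) {
            this.mode = mode;
            for (int i = 0; i < size; i++) {
                nodes.add(new Node());
            }
        }

        /** Writes on one node, then notifies every other node. */
        void put(int writer, String key, String value) {
            nodes.get(writer).local.put(key, value);
            for (int i = 0; i < nodes.size(); i++) {
                if (i == writer) {
                    continue;
                }
                if (mode == Mode.REPLICATION) {
                    nodes.get(i).local.put(key, value); // ship the new value itself
                } else {
                    nodes.get(i).local.remove(key);     // only evict the stale copy
                }
            }
        }

        /** Reads always use the node's own memory; null means "load it from the database". */
        String get(int reader, String key) {
            return nodes.get(reader).local.get(key);
        }
    }

    public static void main(String[] args) {
        Cluster replicated = new Cluster(Mode.REPLICATION, 3);
        replicated.put(0, "user:1", "Ann");
        System.out.println("replication, node 2 sees: " + replicated.get(2, "user:1"));

        Cluster invalidating = new Cluster(Mode.INVALIDATION, 3);
        invalidating.nodes.get(2).local.put("user:1", "Ann"); // node 2 cached an old read
        invalidating.put(0, "user:1", "Anna");                // node 0 changes the data
        System.out.println("invalidation, node 2 sees: " + invalidating.get(2, "user:1"));
    }
}
```

In replication mode node 2 sees the new value immediately. In invalidation mode node 2 gets null, so it goes back to the database on its next read.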

Distributed caching
A distributed cache is a clustered, fault-tolerant cache that has linear scalability. Data is partitioned among the nodes of the cluster. For fault tolerance, partitioned caches can be configured to keep each piece of data on one or more unique nodes within the cluster.

The data in a distributed cache is spread over the servers so that each piece of cached data is the responsibility of a single server (or of at most n nodes when n copies are configured). This means that the size of the cache and the processing power associated with the management of the cache can grow linearly with the size of the cluster. Also, if at least one backup copy of the data is kept, any single cluster node can fail without loss of data.

An update results in as many remote calls as there are nodes holding a copy of the updated data, while a get operation anywhere in the cluster results in at most one remote call.
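The cost model above can be checked with a small partitioned-cache sketch. It assumes the simplest possible placement rule: the key's hash modulo the node count picks the primary owner, and the following nodes hold the backups. Real data grids typically use consistent hashing instead, so treat the class and its fields as hypothetical illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PartitionedCacheSketch {
    final List<Map<String, String>> nodes = new ArrayList<>();
    final int copies;     // primary plus backups
    int remoteCalls = 0;  // calls that had to leave the calling node

    PartitionedCacheSketch(int nodeCount, int copies) {
        if (copies < 1 || copies > nodeCount) {
            throw new IllegalArgumentException("copies must be between 1 and nodeCount");
        }
        this.copies = copies;
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
    }

    /** Assumed placement rule: hash picks the primary, the next nodes hold backups. */
    List<Integer> owners(String key) {
        int primary = Math.floorMod(key.hashCode(), nodes.size());
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            result.add((primary + i) % nodes.size());
        }
        return result;
    }

    /** An update touches every owner; each owner other than the caller costs a remote call. */
    void put(int caller, String key, String value) {
        for (int owner : owners(key)) {
            if (owner != caller) {
                remoteCalls++;
            }
            nodes.get(owner).put(key, value);
        }
    }

    /** A read is local on an owner, otherwise one remote call to the primary. */
    String get(int caller, String key) {
        List<Integer> keyOwners = owners(key);
        if (keyOwners.contains(caller)) {
            return nodes.get(caller).get(key);
        }
        remoteCalls++;
        return nodes.get(keyOwners.get(0)).get(key);
    }

    public static void main(String[] args) {
        PartitionedCacheSketch cache = new PartitionedCacheSketch(4, 2);
        cache.put(0, "a", "value");
        System.out.println("owners of a: " + cache.owners("a"));
        System.out.println("remote calls after put: " + cache.remoteCalls);
        cache.get(3, "a");
        System.out.println("remote calls after get: " + cache.remoteCalls);
    }
}
```

With four nodes and two copies, only two of the four nodes store the entry, which is why adding nodes adds capacity in this mode.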

Local mode
Local caches don't join a cluster and don't communicate with other caches in a cluster.

L2 Cache Implementations

Ehcache
Ehcache is an open source second-level caching framework that integrates easily with Hibernate. It provides replicated caching over JGroups, RMI, or JMS, and distributed caching through a Terracotta server. Ehcache also has an enterprise edition, which comes with BigMemory (big data + in-memory), the Terracotta Server Array for distributed caching, and a few other features.

Infinispan
Infinispan is an open source distributed data grid platform from JBoss. Its cache interface is based on JSR-107. While Infinispan can run in local and replication modes, its distributed mode is more powerful: each data entry is copied to only a fixed number of replicas (as opposed to replication mode, where every entry is copied to all nodes, which limits cache size regardless of the number of nodes). This provides resilience to server failures as well as scalability.

Hazelcast
Hazelcast provides a distributed second-level cache for Hibernate entities, collections, and queries. Hazelcast has two implementations of the Hibernate second-level cache: one for Hibernate versions before 3.3 and one for Hibernate 3.3.x.

Oracle Coherence
Oracle Coherence is often regarded as one of the fastest caching frameworks available, and it supports all of the cache modes described above. However, it is not open source.

Query caching
Unlike entities, Hibernate query results are not cached by default. Every HQL, JPQL, SQL, and criteria query hits the database the first time it runs. Once the query cache is enabled, query results are stored in a query cache region. When a query returns entities, the region holds the query, its parameters, and only the identifiers of the entities in the result set (all non-identifier columns are discarded). When a query returns scalar values, the actual result is held in the query cache region. Since only identifier values are held for entity queries, the second-level cache must also be configured for the entities returned in the result set; otherwise Hibernate hits the database once per identifier, making performance even worse. Hibernate also maintains an internal update timestamps region (UpdateTimestampsCache), which records when the tables behind cached queries were last updated and is used to invalidate stale query cache entries. Most queries do not benefit from the query cache unless they are run frequently and the entities in their result sets rarely change.
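The interplay between the query cache, the entity cache, and the update timestamp can be sketched in plain Java. This is a toy model under stated assumptions, not Hibernate code: a TreeMap stands in for the database table, a logical counter replaces real timestamps, and all class, field, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class QueryCacheSketch {
    /** What the query cache keeps for an entity query: identifiers only, plus when it was cached. */
    static class CachedResult {
        final List<Long> ids;
        final long cachedAt;

        CachedResult(List<Long> ids, long cachedAt) {
            this.ids = ids;
            this.cachedAt = cachedAt;
        }
    }

    final Map<Long, String> table = new TreeMap<>();              // stand-in for the database table
    final Map<Long, String> entityCache = new HashMap<>();        // second-level entity cache
    final Map<String, CachedResult> queryCache = new HashMap<>(); // query cache region
    long clock = 0;            // logical clock instead of wall-clock time
    long lastTableUpdate = 0;  // plays the role of the update timestamps region
    int queryExecutions = 0;   // queries that really ran against the table
    int rowLoads = 0;          // per-identifier loads caused by entity cache misses

    void save(long id, String name) {
        table.put(id, name);
        entityCache.put(id, name);
        lastTableUpdate = ++clock; // any write makes older query results stale
    }

    List<String> findByPrefix(String prefix) {
        String key = "name like " + prefix + "%";
        CachedResult cached = queryCache.get(key);
        if (cached == null || cached.cachedAt < lastTableUpdate) {
            queryExecutions++;
            List<Long> ids = new ArrayList<>();
            for (Map.Entry<Long, String> row : table.entrySet()) {
                if (row.getValue().startsWith(prefix)) {
                    ids.add(row.getKey());
                    entityCache.put(row.getKey(), row.getValue());
                }
            }
            cached = new CachedResult(ids, ++clock);
            queryCache.put(key, cached);
        }
        // Resolve each cached identifier through the entity cache.
        List<String> result = new ArrayList<>();
        for (Long id : cached.ids) {
            String entity = entityCache.get(id);
            if (entity == null) {
                rowLoads++; // one extra database hit per identifier
                entity = table.get(id);
                entityCache.put(id, entity);
            }
            result.add(entity);
        }
        return result;
    }

    public static void main(String[] args) {
        QueryCacheSketch db = new QueryCacheSketch();
        db.save(1, "Ann");
        db.save(2, "Bob");
        db.save(3, "Amy");
        db.findByPrefix("A");
        db.findByPrefix("A"); // answered from the query cache
        System.out.println("query executions: " + db.queryExecutions);
        db.entityCache.clear(); // simulate entities that are not in the second-level cache
        db.findByPrefix("A");
        System.out.println("per-identifier loads: " + db.rowLoads);
    }
}
```

The last call shows the trap described above: the query itself is served from the cache, yet each identifier still costs a separate load because the entities are missing from the entity cache.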

Do we really need caching?
While second-level caching improves performance more often than not, it is not the first option for optimizing performance. A badly designed object model and bad queries cannot be fixed by caching. The L2 cache helps in cases where the same data is frequently accessed and rarely modified. There is no point in caching data that is not accessed frequently, or at least not accessed again before the cache entry expires. In that case we are unnecessarily holding on to memory without any significant increase in performance, and in some cases even degrading it. When data is modified by an external source, the L2 cache may not be aware of the change and may still hold stale data. Such situations can be dealt with by using Hibernate's Session.lock() method and lock modes before any updates to data in the session, to make sure the latest data is used. However, this may result in a database hit every time Session.lock() is executed, bypassing the cache. Also, the L2 cache has its own management overhead, especially in a distributed environment.

It is important to benchmark an application with and without caching, measuring the number of put/get operations per second for the same data.
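A rough way to get such a number is to time a fixed count of lookups against the same keys, once directly and once through a cache. The sketch below is only a starting point: slowLoad is a hypothetical CPU-bound stand-in for a database round trip, and a hand-rolled timing loop like this ignores JIT warm-up, so a proper benchmark harness and the real data source should be used for any decision.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

class CacheBenchmarkSketch {
    /** Runs the given number of lookups over a repeating key set and returns lookups per second. */
    static double getsPerSecond(LongFunction<String> lookup, int operations, int distinctKeys) {
        long sink = 0; // consume every result so the calls cannot be optimized away
        long start = System.nanoTime();
        for (int i = 0; i < operations; i++) {
            sink += lookup.apply(i % distinctKeys).length();
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        if (sink < 0) {
            throw new IllegalStateException("unreachable, keeps sink alive");
        }
        return operations * 1_000_000_000.0 / elapsedNanos;
    }

    /** Hypothetical slow data source: burns CPU to imitate the cost of a round trip. */
    static String slowLoad(long id) {
        long hash = id;
        for (int i = 0; i < 2_000; i++) {
            hash = hash * 31 + i;
        }
        return "row-" + id + "-" + (hash & 0xF);
    }

    /** Wraps a loader with a simple unbounded in-memory cache. */
    static LongFunction<String> cached(LongFunction<String> loader) {
        Map<Long, String> cache = new HashMap<>();
        return id -> cache.computeIfAbsent(id, key -> loader.apply(key));
    }

    public static void main(String[] args) {
        int operations = 200_000;
        int distinctKeys = 100;
        double direct = getsPerSecond(CacheBenchmarkSketch::slowLoad, operations, distinctKeys);
        double viaCache = getsPerSecond(cached(CacheBenchmarkSketch::slowLoad), operations, distinctKeys);
        System.out.printf("without cache: %.0f gets/s%n", direct);
        System.out.printf("with cache:    %.0f gets/s%n", viaCache);
    }
}
```

Varying distinctKeys relative to operations shows how the hit ratio drives the benefit: when almost every key is new, the cached run is no faster and only costs memory.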
