TSG is currently working with an Alfresco insurance client that has up to 17,000 users on a mission-critical system where reliability and performance are paramount. As part of our performance tuning, the client is considering turning off the Hazelcast component to improve performance at the cost of reduced fault tolerance. This post presents the options for other Alfresco clients’ consideration.
Alfresco Clustering – What is high availability?
Alfresco, like any other highly available system, supports clustering for both redundancy and high availability. The Alfresco documentation describes a basic cluster as follows:
This scenario shows a single repository database and content store, and two Tomcat nodes/web servers on two separate machines accessing the content simultaneously. The configuration does not guard against the content store or database failure, but allows multiple servers to share the web load, and provides redundancy in case of a server failure. Each server has local indexes (in the local content store).
This is the simplest cluster to set up and is ideal for small-scale installations. A cluster consisting of two or more machines working together provides a higher level of availability, reliability, and scalability than can be obtained from a single node.
A hardware load balancer balances the web requests among multiple servers. The load balancer must support ‘sticky’ sessions so that each client always connects to the same server during the session. The content store and database will reside on separate servers, which allows us to use alternative means for content store and database replication.
Alfresco Clustering – Load Balancer and High Availability
The key to understanding high availability for Alfresco is to understand what happens when an Alfresco node goes down for any reason (hardware, network, connections, and so on). Alfresco leverages Hazelcast to replicate session information and various node/property caches from one node to the others in the cluster. In a true high availability environment, since a user’s session information is held on all servers, any request from the user could be processed by any node in the cluster.
That said, Alfresco recommends “sticky” sessions so that each user’s requests are processed by the same node for the life of the session. When a node fails, the load balancer redirects the user to another node, where the replicated session information already exists. Without that session information on the new node, the user would have to log in again and would lose any context that was stored in the session.
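The failover behavior described above can be sketched with a toy model. This is only an illustration under our own assumptions: `StickyLoadBalancer`, its `replicate_sessions` flag, and the node names are hypothetical and are not part of Alfresco, Hazelcast, or any real load balancer.

```python
class StickyLoadBalancer:
    """Toy load balancer that pins each user to one node (sticky sessions)."""

    def __init__(self, nodes, replicate_sessions):
        # node name -> {user: session dict}; stands in for per-node session storage
        self.sessions = {name: {} for name in nodes}
        self.alive = set(nodes)
        self.pinned = {}  # user -> node name currently serving that user
        self.replicate_sessions = replicate_sessions

    def _pick_node(self, user):
        node = self.pinned.get(user)
        if node not in self.alive:
            # First request, or the pinned node failed: pick a surviving node.
            node = sorted(self.alive)[0]
            self.pinned[user] = node
        return node

    def request(self, user):
        """Serve one request; return (node, had_to_log_in)."""
        node = self._pick_node(user)
        store = self.sessions[node]
        had_to_log_in = user not in store
        if had_to_log_in:
            store[user] = {"requests": 0}
        store[user]["requests"] += 1
        if self.replicate_sessions:
            # Hypothetical stand-in for session replication to the other nodes.
            for other in self.alive - {node}:
                self.sessions[other][user] = dict(store[user])
        return node, had_to_log_in

    def fail(self, node):
        self.alive.discard(node)


if __name__ == "__main__":
    for replicate in (True, False):
        lb = StickyLoadBalancer(["node-a", "node-b"], replicate_sessions=replicate)
        lb.request("alice")   # initial login on node-a
        lb.fail("node-a")     # the node serving alice goes down
        print(replicate, lb.request("alice"))
```

With replication on, the second node already holds the session and the user carries on; with replication off, the same failover forces a new login.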
Alfresco Clustering – Performance versus High Availability
With clustering turned on, it isn’t difficult to imagine how the Alfresco node would process a request.
- Receive Request from User
- Read Session information
- Process request
- Write Session information
- Hazelcast replicates the cached information to other nodes for high availability
- Return control to user
During our performance review, we found that Hazelcast writes were fairly chatty between nodes and, in some cases, added a delay of 1-2 seconds when running with 2,000+ simultaneous users.
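The request flow listed above can be written out as a small sketch to show where the replication step sits relative to the response. This models the imagined flow only, not Alfresco's actual request handling; `handle_request` and `extra_peer_writes` are hypothetical names, and no timings are implied.

```python
def handle_request(session, peers, replicate):
    """Walk through one request; return (steps taken, writes sent to peers)."""
    steps = ["receive request", "read session"]
    # "Processing" here is just a counter update on the session.
    session["hits"] = session.get("hits", 0) + 1
    steps += ["process request", "write session"]

    peer_writes = 0
    if replicate:
        # Replication happens before control returns to the user,
        # so every peer write sits in the request's critical path.
        for peer in peers:
            peer.update(session)
            peer_writes += 1
        steps.append("replicate to peers")

    steps.append("return control to user")
    return steps, peer_writes


def extra_peer_writes(requests, nodes):
    """Peer writes added by replication: one per request per other node."""
    return requests * (nodes - 1)


if __name__ == "__main__":
    steps, writes = handle_request({}, peers=[{}], replicate=True)
    print(steps, writes)
    print(extra_peer_writes(requests=2000, nodes=2))
```

In this model, each request on an N-node cluster adds N-1 inter-node writes before the user gets a response, which is one way to picture the chattiness we observed.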
Alfresco Clustering – Sticky Sessions and removing Hazelcast
For our client, the decision was to favor day-to-day performance by removing the dependency on Hazelcast clustering for some of Alfresco’s caches. Using “invalidating” caches rather than fully distributed caches improved response times by 1-2 seconds and lowered overall CPU utilization during peak usage. In the event that a node goes down:
- The user would be bounced from wherever they were in the application and would have to log in again. In this case, our client had SSO and Kerberos, so there was no actual login screen during an outage. The user would just see a full page refresh while being logged in to the other Alfresco server, and would then be taken back to where they were, because the client uses TSG’s OpenContent Management System, which persists the user’s current context in the browser.
- Any updates that hadn’t been fully saved before the server went down (property updates, annotations, and so on) would have to be redone.
For the client, these steps are the same ones a user would face after a network disruption or PC issue. Given the low probability of a node failing and the simple steps required for a user to re-establish a session, the client chose to go without Hazelcast.
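To make the difference between “invalidating” and fully distributed caches more concrete, here is a minimal conceptual sketch. It is not how Alfresco or Hazelcast implement these modes; the `Node` class, `build_cluster` helper, mode labels, and the byte counter are assumptions made purely for illustration.

```python
class Node:
    """Toy cluster node with a local cache in front of a shared database."""

    def __init__(self, name, database, mode):
        self.name = name
        self.database = database  # shared dict standing in for the repository DB
        self.mode = mode          # "fully-distributed" or "invalidating" (our labels)
        self.cache = {}
        self.peers = []
        self.bytes_sent = 0       # rough measure of inter-node traffic

    def read(self, key):
        if key not in self.cache:
            # Cache miss: reload from the shared database.
            self.cache[key] = self.database[key]
        return self.cache[key]

    def write(self, key, value):
        self.database[key] = value
        self.cache[key] = value
        for peer in self.peers:
            if self.mode == "fully-distributed":
                # Push the full value to every peer.
                peer.cache[key] = value
                self.bytes_sent += len(key) + len(value)
            else:
                # Only tell peers to drop their stale copy.
                peer.cache.pop(key, None)
                self.bytes_sent += len(key)


def build_cluster(mode, size=2):
    database = {}
    nodes = [Node(f"node-{i}", database, mode) for i in range(size)]
    for node in nodes:
        node.peers = [p for p in nodes if p is not node]
    return nodes


if __name__ == "__main__":
    for mode in ("fully-distributed", "invalidating"):
        a, b = build_cluster(mode)
        a.write("doc-1", "title=Policy")
        print(mode, b.read("doc-1"), a.bytes_sent)
```

Both modes leave the peer with a correct value on its next read. The invalidating mode sends only the key between nodes, and the peer pays for a database read the next time it needs the entry.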
High Availability with Alfresco can come with some performance degradation. Clients need to analyze their risk of a fault and recovery versus the performance improvements. There are a variety of caches and combinations of caches that can be configured to help balance the decisions between performance and being highly available.