Several of our clients have moved from running on-premises systems to the Amazon Web Services cloud. One of the most significant high-availability benefits of AWS comes from leveraging database high availability with AWS’ Relational Database Service (RDS) and file system high availability with Amazon S3 or EFS (Elastic File System). This post describes how to achieve a high-availability architecture for Alfresco solutions.
Defining High Availability
For a high availability system, the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are as close to zero (0) seconds as possible. RTO measures how quickly the ECM system can be brought back online, i.e. how much time was lost due to the outage; RPO is a measure of how often the system is backed up and answers the question: how much data will be lost from an outage? For high availability systems, we also need to consider the geography of the business. Maintaining a zero RTO and RPO regardless of where data or users reside is very expensive and may entail an unnecessarily complex solution that does not make business sense. How close you get to zero depends on how critical the data is and how much budget is available to achieve high availability.
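As a concrete illustration of the two metrics, consider hypothetical numbers (not from any particular deployment): if backups run every hour and a full restore takes 30 minutes, the worst-case figures fall out directly.

```python
from datetime import timedelta

# Hypothetical backup/restore figures -- substitute your own measurements.
backup_interval = timedelta(hours=1)      # a backup is taken every hour
restore_duration = timedelta(minutes=30)  # time to bring the system back up

# RPO: in the worst case, a failure occurs just before the next backup runs,
# so up to one full backup interval of data is lost.
worst_case_rpo = backup_interval

# RTO: the time from the outage until the system is usable again.
worst_case_rto = restore_duration

print(f"Worst-case RPO: {worst_case_rpo}")
print(f"Worst-case RTO: {worst_case_rto}")
```

Shrinking either number means spending more: continuous replication drives RPO toward zero, and automated failover drives RTO toward zero, which is exactly the trade-off discussed below.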
This post focuses on a high availability implementation within a single AWS Region. While understanding the need for a high availability system is simple, evaluating the factors and criteria that define high availability can be a multi-faceted discussion full of cost and architecture trade-offs. (Note: not all AWS services are available in all Regions.)
If you need a multi-region, active-active solution with near-zero downtime, AWS is releasing services in 2018 to support critical systems. More detail is available in this re:Invent 2017 session, How to Design a Multi-Region Active-Active Architecture (ARC319). The session is a long presentation and discussion, so we’ll save you some time by listing its conclusions here with short commentary.
- Avoid synchronous replication & simultaneous deployments as much as possible [new services to support it coming in 2018, but it is still complex. See an RDS example here.]
- Design applications for idempotency & eventual consistency as much as possible [critical functions should be consistent, non/near-critical provide for eventual consistency]
- Closely monitor replication & code sync delays [have an orderly process to hand off rolling updates across regions]
- Have push buttons ready to switch traffic for tenants [to different Regions] [tenants being groups of users, i.e. customers, business areas/groups, etc.]
- Make High-Level Metrics monitoring systems also Multi-Region [the ability to identify what happened before a region went down by looking at data that was sent to a different region]
- It is an involved exercise. It requires careful planning and design
- However, various AWS services make implementation much easier by doing undifferentiated heavy lifting for our customers [services coming in 2018]
- For companies with extremely high availability requirements and/or a geographically distributed user base, the benefits of a multi-region active-active architecture can be profound
- In such cases, consider designing applications for multi-region active-active implementation from day one
In the Beginning
Before designing a high-availability architecture with our clients, we sit down and discuss the classifications of data managed by the system: what is critical, what business processes are impacted by downtime, what is the current architecture, and where are its weaknesses. This discussion generally follows the outline of our Health Check engagements. (See ECM Health Check.)
Common Characteristics of Alfresco High-Availability Architectures
For Alfresco, high availability systems running on AWS share common characteristics: clustered web, repository, and index servers. Common region-wide AWS high-availability services include RDS, S3, and EFS.
In the discussion below, we’ll describe how AWS supports the high availability of each common architecture tier.
When deploying an Alfresco solution, we start with a 3-tier web architecture: web application, repository (services), and database. The web application tier is where Alfresco Share, Alfresco Admin, and TSG’s OpenContent Management Suite run. It also includes supporting applications such as Tomcat and Apache.
AWS Elastic Load Balancer (ELB) provides the means to route user sessions to a multi-server web tier. There are a few caveats with Alfresco web clients out-of-the-box, which you can read about here. Alfresco’s user sessions are persisted at the repository tier, which generally requires a user to be routed from the same web server to the same Alfresco server on the repository tier. This can be done by setting old-style sticky sessions at the ELB layer or by using the JSESSIONID cookie.
Either way, this creates an issue for users if the web server they are connected to goes down. The ELB routes them to the next healthy server, but their session is not known to that server, so they are asked to log in again. Adding a session caching solution to the web tier resolves this issue; users transition smoothly across the high-availability cluster.
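As a sketch of the cookie-based stickiness option, a Classic ELB can be told to key session affinity off the application’s own JSESSIONID cookie rather than issuing its own duration-based cookie. The following is a hypothetical CloudFormation fragment (resource name, subnets, and ports are placeholders); verify the properties against the current AWS documentation for your environment.

```yaml
WebLoadBalancer:
  Type: AWS::ElasticLoadBalancing::LoadBalancer
  Properties:
    Subnets:                     # one subnet per Availability Zone (placeholders)
      - subnet-aaaa1111
      - subnet-bbbb2222
    Listeners:
      - LoadBalancerPort: '80'
        InstancePort: '8080'     # Tomcat port on the web tier (placeholder)
        Protocol: HTTP
        PolicyNames:
          - AlfrescoStickiness
    # Application-controlled stickiness: the ELB follows the JSESSIONID
    # cookie that Tomcat already issues for the user's session.
    AppCookieStickinessPolicy:
      - PolicyName: AlfrescoStickiness
        CookieName: JSESSIONID
```

With application-controlled stickiness, affinity lasts as long as the application session itself, which matches how Alfresco manages sessions at the repository tier.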
The repository tier is the heart of the Alfresco system. With its modern architecture, Alfresco is ready out-of-the-box for clustering. It natively works across AWS Availability Zones (AZs), provided the security group port is open and the instances live within the same VPC. For more specific product updates in release 5.2.3, read here.
TSG’s OpenContent web services deployment also resides in the repository tier. It is deployed as an embedded subsystem within Alfresco. More details are available on the OC product page here.
If Alfresco is using S3 as a content store, the S3 Connector AMP is deployed and configured at the repository tier. This ensures each instance of Alfresco, regardless of which AZ it is in, has access to the same regional S3 bucket. If one AZ in the Region goes down, content storage and access are unaffected for the Alfresco instances running in other AZs.
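A minimal alfresco-global.properties sketch for pointing the content store at S3, assuming the S3 Connector AMP is installed. The bucket name and region below are placeholders, and the property names have changed across connector versions, so check the connector documentation for your release:

```properties
# Point the Alfresco content store at a regional S3 bucket (placeholder values).
s3.bucketName=my-alfresco-content
s3.bucketLocation=us-east-1
# Explicit keys are shown for illustration; on EC2, an instance role
# is generally preferable to embedding credentials in a properties file.
s3.accessKey=REPLACE_ME
s3.secretKey=REPLACE_ME
```

Because the bucket is a regional service, every Alfresco instance in the cluster reads and writes the same content regardless of which AZ it runs in.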
Out of the box, Alfresco deploys Solr for the best search experience. Release 5.2.3 is compatible with both Solr 4 and Solr 6, as described here. While Solr 4 is the default, Alfresco recommends using Solr 6 if possible, as it offers additional capabilities and is needed to use sharding for larger repositories. This report from Alfresco describes a high-availability option for sharding: Alfresco report on Solr Index DR POC results.
Since Alfresco permits deploying Solr indexing on the same server as the repository as well as on separate servers, the high-availability (HA) architecture will necessarily differ. If deployed on the same server, it simply piggybacks on the repository tier’s HA design. If on a separate server, the HA design depends largely on whether the index is sharded. If it is not sharded, the HA design is similar to the one found in the Alfresco AWS Quick Start Reference Architecture.
In the Reference Architecture scenario, the index server is managed within an AWS auto-scaling (AS) group. The AS group will add or remove servers based on metrics such as CPU usage. When every Availability Zone in a region is up and running, each AZ has its own index server and each index server has its own copy of the data. In an AZ failure scenario, the AS group may need to add index servers in the remaining AZs to handle the traffic load, but there is no need to copy data as it is already there.
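The index-server scaling described above can be sketched as an Auto Scaling group spanning the Region’s AZs with a CPU-based target-tracking policy. This is a hypothetical CloudFormation fragment (the launch configuration, subnet IDs, sizes, and threshold are placeholders to adapt to your deployment):

```yaml
IndexServerGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: '2'                 # keep at least one index server per AZ
    MaxSize: '4'
    VPCZoneIdentifier:           # one subnet per Availability Zone (placeholders)
      - subnet-aaaa1111
      - subnet-bbbb2222
    LaunchConfigurationName: !Ref IndexLaunchConfig   # placeholder resource

IndexCpuScalingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref IndexServerGroup
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 60.0          # add servers when average CPU exceeds 60%
```

Because the group spans multiple AZs, the loss of one AZ automatically triggers replacement capacity in the surviving AZs.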
RDS gives you automated failover between a Main database and a synchronously replicated Standby database in another Availability Zone within the same AWS Region. Additional Read Replicas can be created to ensure a copy of the database exists in a third Availability Zone or even in another AWS Region. To keep tabs on the replication lag between the Main, Standby, and Read Replicas, an AWS CloudWatch alarm can be configured to send a notification if the lag exceeds a specified threshold. In addition to write latency, there are numerous other RDS metrics that can be monitored. Alfresco supports several RDS database configurations, including:
- SQL Server
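The failover and monitoring pieces described above can be sketched in CloudFormation: a Multi-AZ primary instance plus an alarm on a read replica’s ReplicaLag metric. Instance identifiers, the SNS topic, and thresholds below are placeholders; treat this as a sketch rather than a production template.

```yaml
AlfrescoDatabase:
  Type: AWS::RDS::DBInstance
  Properties:
    Engine: mysql                # any Alfresco-supported RDS engine
    DBInstanceClass: db.m5.large
    AllocatedStorage: '100'
    MultiAZ: true                # synchronous standby in a second AZ
    MasterUsername: alfresco
    MasterUserPassword: REPLACE_ME   # placeholder; use a secrets store in practice

ReplicaLagAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/RDS
    MetricName: ReplicaLag       # seconds a read replica is behind its source
    Dimensions:
      - Name: DBInstanceIdentifier
        Value: alfresco-read-replica   # placeholder replica identifier
    Statistic: Average
    Period: 60
    EvaluationPeriods: 3
    Threshold: 30                # notify if lag exceeds 30 seconds
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref OpsNotificationTopic      # placeholder SNS topic
```

Note that ReplicaLag applies to asynchronous read replicas; the Multi-AZ standby replicates synchronously and fails over automatically without application involvement.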
AWS provides several storage options; the two that are redundant across a Region are S3 (object store) and EFS (file store). The main difference is that while an S3 bucket is automatically replicated across the Availability Zones (AZs) in a Region, with EFS you choose the AZs in which to create mount targets for the file system. The behavior differences between EFS and S3 are significant. EFS simply behaves as a shared file server for Alfresco, while S3 in combination with Alfresco’s S3 Connector allows usage of multiple storage tiers, cross-region replication, AWS KMS for encryption, and several other benefits highlighted here.
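If EFS is chosen instead of S3, each Alfresco server mounts the same file system over NFS as its content store directory. A hypothetical /etc/fstab entry might look like the following (the file-system ID, region, and mount point are placeholders; mount options follow AWS’ recommended NFSv4.1 settings and should be checked against the current EFS documentation):

```
# /etc/fstab -- shared EFS content store (fs-12345678 is a placeholder ID)
fs-12345678.efs.us-east-1.amazonaws.com:/  /mnt/alf_data  nfs4  nfsvers=4.1,hard,timeo=600,retrans=2,_netdev  0  0
```

Every repository server in the cluster mounts the same file system, so an instance in any AZ with a mount target sees the same content.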
AWS also provides an EBS (block store) service. EBS volumes are not multi-AZ, but they can be copied to other AZs and Regions by taking snapshots. This Alfresco Solr Sharding white paper illustrates the process for persisting Solr index shard data across multiple AZs. This post describes how to copy an EBS volume to another Region.
This post has surveyed the most common architecture used to deploy a high availability Alfresco system on AWS. There are many additional AWS services that support a high availability environment; you can read more about them in our Alfresco on AWS white paper.
Let us know your thoughts in the comments below.
Looking for more information?
- Alfresco documentation on Back Up and Restore
- Alfresco DevCon 2018: From Zero to Hero: Backing Up Alfresco (video) (slides)
- Alfresco One on the AWS Cloud – Reference Architecture
- Alfresco Solr Index High Availability Disaster Recovery POC
AWS references – guidelines and steps for implementing backup processes
- White papers
- AWS Projects