Safety First – Backup Concepts of Oracle E-Business Suite Environments for Oracle Cloud Infrastructure
Johannes Michler, PROMATIS Group, Ettlingen (TechnologyRegion Karlsruhe)
Oracle Cloud Infrastructure (OCI) is used by many of our customers to run their Oracle E-Business Suite workload. Especially when production systems, and not just development and test systems, run on OCI, a solid concept is required for backing up (and restoring) the environment in the event of a disaster, be it due to user or system errors. Let’s take a closer look at the available options.
Basic Concepts and Terminology – RPO and RTO
The two most important notions when designing a backup strategy are the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). The RPO is the maximum amount of data you are willing to lose in case of a disaster. For example, an RPO of 30 minutes means that you never want to lose more than the last 30 minutes of transactions when a disaster happens. The RTO is the maximum time it may take to recover from a disaster and have an instance up and running again. The Oracle documentation on the database, especially the “high availability overview”, provides more details on this (see [1]).
OCI Levels of Isolation
Oracle Cloud Infrastructure provides three levels of isolation that help to protect against failures (see the OCI documentation):
- Region: A region is a highly isolated part of Oracle Cloud Infrastructure that is located in one geographical area, e.g. EU-Frankfurt or US-WEST (Phoenix). The separation across regions provides protection even against large-scale events such as natural disasters.
- Availability domain: A region consists of one or more availability domains; larger regions such as Frankfurt or Phoenix have three. These are isolated data centers and are thus highly unlikely to fail simultaneously. Because availability domains do not share infrastructure such as power or cooling, or the internal availability domain network, a failure in one availability domain is unlikely to impact the availability of the other domains within the same region.
- Fault domain: This is a partition within one data center. If compute instances are placed in different fault domains, the failure of a physical box in Fault Domain 1 does not impact a physical box (and its compute instances) in Fault Domain 2.
Block, Object and File Storage
Oracle Cloud Infrastructure offers three major types of storage with different advantages and disadvantages. All three types are also relevant for backups:
- Block storage is attached to a compute instance as either a boot volume or an additional volume, usually through iSCSI. While its performance is very high, block storage resides in just one availability domain (the one where the compute instance is running). Block volumes can be conveniently backed up at specified times through backup policies, and these backups can also be copied to object storage in other regions. It is even possible to replicate block volumes across regions with a delay of usually not more than 30 minutes (see [2]). All data of E-Business Suite instances (apps tier and DB tier) is usually located on block volumes (see [3]).
- Object storage can be viewed as a “web service” that helps store and retrieve particularly large objects. As referenced in [4], this service applies “per region” and is highly durable due to the automatic storage of several copies across multiple availability domains.
- File storage provides an NFS mount point that can be attached to multiple compute instances, if needed also across availability domains (or even regions if they are connected). Although the service is a “per AD” service, the file storage can reside in AD1 while a compute instance in AD2 conveniently accesses it (cf. [5]). The service provides snapshot functionality and stores multiple (durable) copies of all data (within one AD).
Objectives of this discussion
In the remainder of this blog, I will describe strategies that help achieve both an RPO and an RTO (for production systems) of approx. 30 minutes each. The strategy should be able to handle the outage of one availability domain, but does not need to cover “regional outages”.
I will describe three different scenarios in this article, since each calls for a different procedure:
- Handling of development instances
- Handling of conference room pilot or other testing instances
- Handling of production instances
Development Instances
Typically, development instances do not need a “full-blown” backup. In case of a disaster, one can usually just create a new development instance as a fresh copy of the production system. If all developers work according to “common best practices”, all of their source code is stored in a source code management system and can easily be reinstalled on the new development system.
In reality, however, this is not always the case. Especially with PL/SQL and APEX development, it is common practice to develop directly in the database and only occasionally transfer the source code into a version control system, often only when moving from development to testing.
To cover this scenario, it is advisable to also carry out “some backups” of development systems. Usually, however, it suffices to back up the following content:
- (Custom) database objects such as packages, procedures or views: These can easily be backed up using expdp with an “XX%” name filter, run at e.g. 8 a.m., noon and 4 p.m. The result can be written to file storage in a different availability domain and restored from there after a new development instance has been created.
- APEX applications can be backed up using SQLcl (see [6]). This can again be run multiple times per day, and the result can be written to a file storage location.
- XML Publisher Reports are stored in the xdo_lobs table. This table can be dumped regularly using expdp.
We put all three exports into a shell script that also cleans up old data after 4 weeks, providing a safety net for developers who do not immediately check in their source code to Git.
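As a rough illustration of what such a script has to do, the following Python sketch generates an expdp parameter file with the “XX%” name filter and purges exports older than four weeks. It is not the shell script we actually use: the directory object DEV_BACKUP_DIR, the APPS schema, the file naming and the selected object types are assumptions made for this example and need to be adapted to the environment at hand.

```python
from pathlib import Path

RETENTION_DAYS = 28  # "old data" is cleaned up after 4 weeks


def build_parfile(stamp: str) -> str:
    """Return expdp parameter-file text for a metadata-only export of XX% objects."""
    lines = [
        # Hypothetical Oracle directory object pointing to the file storage mount.
        "DIRECTORY=DEV_BACKUP_DIR",
        f"DUMPFILE=xx_objects_{stamp}.dmp",
        f"LOGFILE=xx_objects_{stamp}.log",
        # Assumption: the custom objects live in the APPS schema.
        "SCHEMAS=APPS",
        # Only the source code (DDL) is needed, no table data.
        "CONTENT=METADATA_ONLY",
        # One INCLUDE filter per object type, each restricted to names starting with XX.
        "INCLUDE=PACKAGE:\"LIKE 'XX%'\"",
        "INCLUDE=PROCEDURE:\"LIKE 'XX%'\"",
        "INCLUDE=VIEW:\"LIKE 'XX%'\"",
    ]
    return "\n".join(lines) + "\n"


def purge_old_exports(backup_dir: Path, now: float,
                      retention_days: int = RETENTION_DAYS) -> list:
    """Delete .dmp/.log files older than the retention period; return their names."""
    cutoff = now - retention_days * 86400  # seconds per day
    removed = []
    for path in sorted(backup_dir.iterdir()):
        is_export = path.is_file() and path.suffix in {".dmp", ".log"}
        if is_export and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

The generated text can be saved to a file and handed to Data Pump through its PARFILE parameter, which avoids shell quoting problems with the LIKE clause. The purge function would be called with the current time (e.g. `time.time()`) after each export run.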
This results in an RPO of 4 hours (for source code only) and an RTO of 2-12 hours (create a new P2T copy and restore the latest source code). The cost is minimal: storing approx. 1 GB in file storage.
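The 4-hour figure follows directly from the example export schedule (8 a.m., noon, 4 p.m.). Here is a minimal sketch of the arithmetic, assuming the exports run exactly on the hour; the function name is made up for this illustration:

```python
def worst_case_gap_hours(export_hours, wrap_around=True):
    """Return the longest time (in hours) between two consecutive exports.

    With wrap_around=True the gap from the last export of the day to the
    first export of the next day is included as well.
    """
    hours = sorted(export_hours)
    gaps = [later - earlier for earlier, later in zip(hours, hours[1:])]
    if wrap_around:
        gaps.append(24 - hours[-1] + hours[0])
    if not gaps:
        raise ValueError("need at least two exports when wrap_around is False")
    return max(gaps)


schedule = [8, 12, 16]  # 8 a.m., noon, 4 p.m.
print(worst_case_gap_hours(schedule, wrap_around=False))  # 4
print(worst_case_gap_hours(schedule))                     # 16
```

Between the first and the last export of the day, the largest gap is 4 hours. Changes made after the 4 p.m. run are only captured at 8 a.m. the next morning, so the 4-hour RPO holds under the assumption that development takes place during regular working hours.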