
DoD IL5 Multi-Region Active-Active Architecture on OCI: Reference Design

by Lydia Chan | Nov 19, 2025 | Multi Region Active-Active

Introduction: Why Multi-Region Architecture Matters

When a major AWS outage took down multiple services a few weeks ago, it reminded everyone that even the most trusted cloud platforms are not immune to regional failure.

For federal systems, downtime is more than a disruption – it’s a mission risk.

The incident also highlighted a broader assumption across the cloud community — that features like auto-scaling and redundant instances automatically translate to true resilience. In reality, those mechanisms are designed for elasticity and high availability within a single region, not for surviving a regional-level failure or control-plane outage.

That’s what makes multi-region architecture so vital – and calls for rigorous, mission-aligned engineering to ensure uninterrupted operations. To explore how 2i successfully implemented the IL5-compliant solution for the Department of Defense (DoD) on Oracle Cloud Infrastructure (OCI), we sat down with Ken, a Senior Cloud Engineer who helped design and implement the system.

Interviewer: Ken, tell me a little about your role at 2i and what you do day-to-day.

Ken: My title is Senior Cloud Engineer, but that can cover a lot of ground. In short, I turn government requirements into working, secure systems.

For this project, my focus was on implementing the government’s request for regional failover—essentially making sure that if one OCI region goes down, operations can continue somewhere else. That meant taking our single-region setup and figuring out what had to change to make it multi-region without breaking existing features.

Designing a Multi-Region Active-Active Architecture on OCI

Interviewer: Can you describe the solution 2i developed and its architecture for our readers? How does traffic actually flow between regions?

Ken: Technically speaking, the solution we built is a multi-region, active-active architecture on OCI. The primary region is Ashburn, and Phoenix acts as the secondary region, but both IL5-authorized OCI GovCloud regions are live at all times.

If Ashburn ever experiences a regional-level outage—network isolation, service degradation, or a control plane issue—traffic automatically fails over to Phoenix, and users can keep working with minimal downtime.

At the front of the architecture, we use Akamai Global Traffic Manager as our global entry point. GTM continuously runs health checks against our services in both regions. If those checks start failing beyond a defined threshold, GTM updates DNS responses so new sessions get directed to Phoenix instead of Ashburn.

The tuning on those checks is really important—too aggressive and you get traffic flapping, too conservative and users feel downtime.
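The tuning trade-off Ken describes can be sketched as a hysteresis rule: fail over only after several consecutive failed checks, and fail back only after a longer run of passing checks. This is a minimal Python sketch, not Akamai GTM's actual logic; the thresholds and region names are illustrative, not production values.

```python
# Hypothetical sketch of health-check failover with hysteresis: a region
# is marked unhealthy only after N consecutive failures, and healthy
# again only after a longer run of successes, which prevents traffic
# "flapping" back and forth between regions.

class RegionHealth:
    def __init__(self, fail_threshold=3, recover_threshold=5):
        self.fail_threshold = fail_threshold        # failed checks before failover
        self.recover_threshold = recover_threshold  # passing checks before failback
        self.consecutive_failures = 0
        self.consecutive_successes = 0
        self.healthy = True

    def record_check(self, passed: bool) -> None:
        if passed:
            self.consecutive_failures = 0
            self.consecutive_successes += 1
            if not self.healthy and self.consecutive_successes >= self.recover_threshold:
                self.healthy = True
        else:
            self.consecutive_successes = 0
            self.consecutive_failures += 1
            if self.healthy and self.consecutive_failures >= self.fail_threshold:
                self.healthy = False

def pick_region(primary: RegionHealth, secondary: RegionHealth) -> str:
    """DNS answer: prefer the primary region while it is healthy."""
    if primary.healthy:
        return "ashburn"
    if secondary.healthy:
        return "phoenix"
    return "ashburn"  # both failing: keep the last-known answer
```

Note that the recovery threshold is deliberately higher than the failure threshold: a recovering region must prove itself for longer than it took to be declared down.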

Implementing GTM at this scale was unusual for a DoD system — it’s not something you normally see in government cloud architectures. That made tuning even more challenging because we were breaking new ground in how global traffic was managed across classified and restricted environments.

Both regions host identical stacks of core services, which is why we call it a hot-hot configuration. It’s not just compute; identity, container images, host-level security tooling, and secrets have to be replicated so authentication and deployments continue to work no matter which region you land in.

Another benefit of GTM is that it can make routing decisions based on geography. So if a user is physically closer to Phoenix, they can default there, which reduces latency and improves throughput.

That means the extra capacity isn’t just sitting idle waiting for a disaster — it actively improves the experience.

So from the user’s perspective, nothing really changes. But under the hood, there’s a lot of complexity that lets them keep working even if an entire OCI region is having a bad day.

Overcoming Failover and Failback Challenges in Active-Active

Interviewer: What ended up being the hardest technical challenge you had to overcome for DoD?

Ken: Honestly, tuning Global Traffic Manager to make sure failover and failback were invisible was one of the biggest challenges.

DNS-based routing sounds simple, but you have to get health checks just right. If they’re too sensitive, traffic flaps back and forth. If they’re too tolerant, users experience downtime.

Failback is even trickier. Because GTM is DNS-based, you can't just slam traffic back into a recovering region the moment it comes up.

When Ashburn recovers, cached DNS records can cause inconsistent behavior if traffic shifts back too quickly. We had to model timing windows, TTL behavior, and session persistence to make sure users didn't get split across regions mid-operation, and we spent a lot of time testing that behavior.
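The TTL reasoning above can be reduced to a simple timing rule: after DNS is repointed to the recovered region, resolvers may keep serving the cached failover answer for up to the record's TTL, so the secondary must stay live for at least that window. A minimal sketch; the 1.5x safety margin and helper names are assumptions, not values from the project.

```python
# Hypothetical sketch of the failback timing window: both regions must
# keep serving until every resolver-cached DNS answer could have expired.

def failback_drain_window(ttl_seconds: int, safety_margin: float = 1.5) -> int:
    """Seconds the secondary region must keep serving after DNS failback."""
    return int(ttl_seconds * safety_margin)

def safe_to_retire_secondary(repoint_time: float, now: float, ttl_seconds: int) -> bool:
    """True once all cached failover answers should have expired."""
    return now - repoint_time >= failback_drain_window(ttl_seconds)
```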

Another major challenge was Artifactory tokens. Because of licensing limits, we didn’t have access to higher-tier features that make multi-region token management easier. So we built our own.

Normally, government systems rotate credentials every 90 days. We automated token rotation every two to four hours—fully self-healing, with retries if something fails. It exposed other issues for a while, kind of like playing whack-a-mole, but in the end we came out with a really strong, resilient process.
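The self-healing rotation Ken describes can be sketched as a retry-with-backoff loop. This is a hypothetical sketch, not the project's code; `issue_token` and `distribute_token` stand in for the real Artifactory and pipeline calls.

```python
# Hypothetical sketch of self-healing token rotation: mint a fresh
# token, push it out, and retry with backoff if any step fails.

import time

def rotate_with_retries(issue_token, distribute_token,
                        max_attempts: int = 5, backoff_seconds: float = 30.0):
    """Try to mint and distribute a fresh token, retrying on failure."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            token = issue_token()        # e.g. call the Artifactory token API
            distribute_token(token)      # e.g. push to both regions' secret stores
            return token
        except Exception as err:         # sketch only; narrow this in real code
            last_error = err
            time.sleep(backoff_seconds * attempt)  # linear backoff between tries
    raise RuntimeError("token rotation failed after retries") from last_error
```

In practice a scheduler (Jenkins, cron, or similar) would invoke this every two to four hours, matching the rotation interval described above.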

Interviewer: How did you handle replication and synchronization across regions, especially for things like DISA STIGs and FedRAMP High baselines?

Ken: Not every service needs to be in multiple regions. We started by identifying which common services were essential to mission operations.

JFrog Artifactory was one of them – that’s where we store all our binaries, container images, Helm charts, basically the building blocks for everything we deploy. If that’s down, no one can push new software or updates.

Trend Micro was another critical piece because it handles host-level security — scanning for vulnerabilities and protecting systems in real time. You can’t afford to have that go dark in one region.

Some components can run independently in each region. Others—like credentials and secrets—need to exist in both. We built new Jenkins pipelines using the OCI API to replicate those securely.
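The replicate-securely step can be pictured as a diff-and-copy: compare what exists in each region and write anything missing or stale to the secondary. A minimal sketch of the idea only; `put_secret` is a placeholder for the real OCI vault write, and the dict-based stores are illustrative.

```python
# Hypothetical sketch of cross-region secret replication: plan the
# delta between regions, then apply it to the secondary.

def plan_replication(primary: dict, secondary: dict) -> list:
    """Return (name, value) pairs that must be written to the secondary region."""
    return [(name, value)
            for name, value in sorted(primary.items())
            if secondary.get(name) != value]

def replicate(primary: dict, secondary: dict, put_secret) -> int:
    """Apply the plan; returns the number of secrets written."""
    plan = plan_replication(primary, secondary)
    for name, value in plan:
        put_secret(name, value)  # in practice, an OCI vault write in the other region
    return len(plan)
```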

It took a lot of iterations. We’d sketch ideas on a whiteboard, test them, break things, fix them, and repeat until it was rock solid. But once we got it right, we knew those critical services would keep running even if a whole region went offline.

Automating and Testing for Resilience

Interviewer: What role did automation tools like Terraform or NiFi play?

Ken: Terraform was key for deployment—it’s stricter now about which resources belong in which region, so we don’t accidentally deploy something where it shouldn’t go. Apache NiFi still handles routing traffic, and Active Directory remains the identity source for users, so logins keep working no matter which region they hit.
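One way to picture the guardrail Ken describes (Terraform being stricter about which resources belong in which region) is a pre-apply placement check. This is a minimal Python sketch of the idea, not the project's Terraform; the service names and allowed regions are illustrative.

```python
# Hypothetical sketch of a region-placement guardrail: every resource
# declares a region, and a pre-apply check rejects anything outside
# that resource's allowed set.

ALLOWED_REGIONS = {
    "artifactory": {"us-ashburn-1", "us-phoenix-1"},  # replicated in both regions
    "legacy-batch": {"us-ashburn-1"},                 # single-region by design
}

def validate_placement(resources: list) -> list:
    """Return human-readable errors for resources placed in the wrong region."""
    errors = []
    for name, region in resources:
        allowed = ALLOWED_REGIONS.get(name, set())
        if region not in allowed:
            errors.append(f"{name} may not be deployed in {region}")
    return errors
```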

For monitoring, we set up Slack webhooks and email alerts to identify anomalies early and take corrective action before they affect operations. If we see degradation, we can even trigger a manual failover to Phoenix before things escalate.
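A degradation alert of the kind Ken mentions might be wired up like this. The webhook URL and message format here are assumptions; Slack incoming webhooks do accept a simple JSON payload with a `text` field.

```python
# Hypothetical sketch of the Slack alerting hook: build a payload for a
# degradation event and POST it to the configured incoming-webhook URL.

import json
import urllib.request

def build_alert(region: str, service: str, detail: str) -> dict:
    return {"text": f":warning: {service} degraded in {region}: {detail}"}

def post_alert(webhook_url: str, payload: dict) -> None:
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget; add retries in production
```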

Interviewer: How did you test or prove that failover worked?

Ken: We set up live demos for the customer. First we showed everything running in Ashburn. Then we simulated a complete outage of that region and, within a few minutes, demonstrated the same functionality running in Phoenix.

From a user perspective, it was seamless. That’s the ultimate metric—continuity you don’t even notice.

Keeping Operations Reliable

Interviewer: Once everything was live, how do you keep it running smoothly?

Ken: We’re now in the operations and sustainment phase. Our job is to make sure the infrastructure stays healthy—patches, updates, continuous monitoring.

It’s business as usual, but our systems alert us constantly, so we’re always aware of what’s happening. That keeps everyone sharp without needing big, formal failover drills. We “own what we build,” and that culture keeps reliability high.

Interviewer: So what happens now if a whole region actually goes down?

Ken: With the new design, traffic automatically switches through Akamai GTM, and users barely notice.

Before, with a single region, everything would have been offline until it came back. Maybe a few internal tools would still run, but the main mission systems would be unavailable. This design eliminates that risk.

Balancing Cost and Continuity

Interviewer: There’s obviously a cost to running active-active. How do you think about that trade-off?

Ken: It costs more—you're maintaining two live environments—but it's worth it. There are alternatives, like hot-warm, where you limit the amount of standby infrastructure to keep costs down while still allowing failover in a reasonable timeframe.

The AWS outage showed what can happen when you don’t invest in regional resilience. Cloud providers offer the building blocks, but regional fault tolerance is our responsibility. That’s the essence of the shared-responsibility model.

Interviewer: How did the customer respond once everything was implemented?

Ken: The customer demonstrations were overwhelmingly positive, and the government's feedback was great. During the walkthroughs they didn't have many questions—they just wanted to see each requirement checked off. But you could tell they were impressed that failover worked so smoothly.

A lot of that credit goes to our leadership. They got us started early, so by the time deadlines came around, the hard work was done. We were just fine-tuning details.

Interviewer: Are there plans to expand to other regions later?

Ken: Possibly. At this point, we've moved out of the initial build-out phase and into operations and sustainment of the new multi-region offering, where we make performance improvements and fix any bugs that are discovered.

The OCI government realm already includes Chicago, and we built the system so it can scale easily. Because everything is deployed with Terraform, adding a new region is mostly a matter of updating configuration files with region-specific values. It's modular by design.

Building a Culture of Collaboration

Interviewer: Looking back, what stands out most about the project?

Ken: The collaboration. People like Jacob make it easy to learn and experiment. You can throw out ideas, whiteboard solutions, and nobody shoots them down. No question is a dumb question. It’s a smart, supportive team. And that makes even the toughest challenges fun.

I joined 2i in May 2024, and this project has easily been one of the most rewarding things I’ve worked on—technically challenging, meaningful to the mission, and something I’m proud to support long-term.

Interviewer: Thanks, Ken. Any final thoughts?

Ken: Just that resilience doesn’t happen by accident. You have to plan for it, design for it, and test it constantly. It’s been great seeing 2i lead that effort and show what’s possible when you build with continuity in mind.

Conclusion: OCI IL5 Multi-Region Active-Active & Mission Continuity

What 2i’s team accomplished for the DoD goes beyond implementing a multi-region design on Oracle Cloud Infrastructure. It demonstrates what’s possible when resilience is treated as a core architectural principle, not an afterthought.

The project underscores a larger truth about modern mission systems: real continuity extends across clouds. It’s no longer enough to rely on dynamic scaling or redundancy within a single region or platform.

Mission assurance depends on cloud-to-cloud resilience — the ability to sustain operations even when part of the ecosystem is degraded or offline.

For federal agencies, that’s the future of modernization: architectures that anticipate failure, absorb it, and keep the mission moving forward.

Federal missions cannot rely on assumptions. Redundancy inside one region is not enough to survive a control-plane outage or region-wide disruption. Connect with 2i to build the level of resilience your mission actually requires — automated, tested, and multi-region by design.

Talk to 2i Engineers
Read More 2i Insights

Copyright 2025 Ikeda Innovations LLC. All rights reserved.