The IOD Migration: 1,000 VMs, Four Failure Modes, and the Migration That Almost Didn't Happen

May 2026

Every migration story has a hero workload — the one that proves the pattern, gets the presentation at the all-hands, and makes everyone feel good about the direction. For the CCoE, that was Project CT. Clean, contained, successful.

This is not that story.

This is the story of IOD — Instance on Demand — the platform that nearly broke the migration model before it got started. One thousand virtual machines. Eight years of organic growth. Hand-operated monitoring. Four known failure modes that no single team owned, and a Bangalore-based support team watching institutional knowledge walk out the door as key members exited.

IOD was the kind of migration that gets assigned to the team that can absorb the risk because they have the deepest bench. In our case, that team was the CCoE. We had a bench of exactly three people.

What We Were Dealing With

IOD was ITOM's platform for delivering test and certify infrastructure, product-on-demand instances, and database-on-demand services. It had evolved over nearly a decade of acquisitions, reorganizations, and "temporary" solutions that became permanent. The virtualization layer was a mix of vSphere clusters that had been configured by engineers who had long since left the company. The monitoring was a dashboard that someone had built and forgotten about. The support model was: when something broke, whoever noticed first would SSH in and fix it.

The environment broke down into three workload categories:

Test and certify (60-70% of the environment). ITOM's engineering teams needed isolated environments to validate builds, run integration tests, and certify releases. These environments had bursty utilization patterns — idle for days, then slammed for hours before a release deadline.

Product on demand (20%). Customer-facing instances of ITOM products that ran on IOD infrastructure. These had the highest uptime requirements and the most complex dependency chains.

Database on demand (10%). Oracle RAC instances that product teams could provision on demand. The DBA team was already stretched thin keeping production databases running and had no bandwidth for migration planning.

The four failure modes were well-known but unowned:

1. Storage contention during peak test cycles would cause cascading failures across unrelated tenants

2. Network misconfigurations after VMware cluster changes would silently orphan VMs

3. Backup failures that went undetected for weeks because there was no centralized monitoring

4. Certificate expirations that took down services because no one owned the certificate lifecycle

Each of these had been documented in post-mortems. Each had a workaround that someone knew. And each had a different owner — or no owner at all.
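As an illustration of what owning the fourth failure mode could look like, here is a minimal sketch of a certificate expiry check. It is not what IOD ran. The inventory dict, the service names, and the 30-day threshold are hypothetical, and the date strings are assumed to be in the `notAfter` format that Python's `ssl` module reports for a peer certificate.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp."""
    # cert_time_to_seconds parses strings such as "Jun  1 12:00:00 2026 GMT"
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def expiring_soon(inventory: dict[str, str], now: datetime,
                  warn_days: int = 30) -> list[str]:
    """Names of services whose certificate expires within `warn_days`."""
    return sorted(
        name
        for name, not_after in inventory.items()
        if days_until_expiry(not_after, now) <= warn_days
    )


if __name__ == "__main__":
    # Hypothetical inventory: service name -> notAfter string
    inventory = {
        "svc-a": "Jun  1 12:00:00 2026 GMT",
        "svc-b": "Dec  1 12:00:00 2026 GMT",
    }
    today = datetime(2026, 5, 15, tzinfo=timezone.utc)
    print(expiring_soon(inventory, today))
```

The check itself is trivial. What the sketch cannot supply is the inventory and someone accountable for acting on the list, which was the actual gap.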

The First Principle: Services Over Workloads

The ITOM Cloud and SaaS Strategy document laid out a clear vision: migrate to AWS, adopt cloud-native operations, reduce the operational burden on the Bangalore team. But the document was aspirational. The execution required answering a harder question: *What exactly are we migrating?*

The instinct was to define the migration in terms of infrastructure: X VMs, Y terabytes of storage, Z databases. This is what lift-and-shift consultants sell. But IOD was not a migration that could be solved by moving VMs to EC2 and calling it a day. The problem was not the virtualization layer. The problem was the operational model that had grown around it.

So we flipped the framing. Instead of asking "what workloads need to move to AWS?", we asked "what services does IOD provide, and which of those do product groups actually need?"

This reframing changed everything.

ITOM did not need a 100% translation of IOD to AWS. They needed:

  • Test environments that could be provisioned on demand and torn down when done
  • Product instances that met their availability SLA
  • Database instances with predictable performance
  • Monitoring that actually told them when something was broken
  • A support model that did not depend on three people in Bangalore remembering tribal knowledge

The technology to deliver these services on AWS existed. The challenge was organizational: the IOD platform had been running the same way for so long that the service and the infrastructure had become indistinguishable in people's minds.

The Coordination Problem

The migration had three stakeholder groups, each with different priorities:

Herender's team (R&D) wanted test environments migrated first, with minimal disruption to the release cycle. They could tolerate short windows of downtime around release deadlines but not during them.

Ganapati's team (Operations) wanted triage procedures established before any migration happened. They were the ones getting paged when things broke, and they had no interest in inheriting a new set of failure modes on AWS.

The DBA team wanted nothing to do with the migration. They were already underwater keeping Oracle RAC running and viewed the AWS migration as a distraction from their real job.

My role was to coordinate across these groups without direct authority over any of them. The CCoE had no org chart power over ITOM. We had whatever credibility we had earned from the CT migration and whatever goodwill we could build through showing up, listening, and solving problems that were in no one's job description.

The approach was iterative. We started with the test and certify environments: the lowest risk and the highest tolerance for disruption. We established a landing zone in Frankfurt using AWS Service Catalog with a hub-and-spoke design that could be replicated for other regions. We set up basic monitoring with Prometheus and Grafana, which was already more than IOD had. We documented every issue we found and published the lessons learned so the next wave would go faster.

The first ten VMs took two months. The next hundred took six weeks. The pattern started to hold.
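Put as a rate, the jump is easier to see. The sketch below only restates the two figures above. Reading "two months" as two average months is my assumption, and nothing here projects a pace for the remaining VMs.

```python
WEEKS_PER_MONTH = 52 / 12  # assumption: an average month, about 4.33 weeks


def vms_per_week(vm_count: int, weeks: float) -> float:
    """Average migration throughput over a wave."""
    return vm_count / weeks


first_wave = vms_per_week(10, 2 * WEEKS_PER_MONTH)  # ten VMs in two months
second_wave = vms_per_week(100, 6)                  # one hundred VMs in six weeks
speedup = second_wave / first_wave

print(f"{first_wave:.1f} -> {second_wave:.1f} VMs/week ({speedup:.0f}x)")
# prints "1.2 -> 16.7 VMs/week (14x)"
```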

What Made It Hard

The technology was straightforward. The politics were not.

The IOD platform had been running for so long that it had accumulated institutional dependencies that had nothing to do with technology. There were teams that depended on IOD instances that had been provisioned years ago and forgotten. There were compliance requirements that assumed the IOD operational model. There were budget lines that would need to be renegotiated if IOD went away.

The hardest conversation was with the Bangalore team. They knew the platform better than anyone, and they were watching their expertise become obsolete. The message I had to deliver was not "we are migrating to AWS." It was "we are migrating to AWS, your knowledge of how IOD actually works is the difference between this succeeding and failing, and we need you to document everything you know before you leave."

Some of them did. Some of them had already checked out. The ones who stayed made the migration possible.

What We Learned

Define the outcome before you define the migration. The trap of lift-and-shift is that it measures success by what moved, not by what improved. IOD needed better service delivery, not a different data center. AWS was the means, not the end.

The hardest migrations are the ones no one wants to own. IOD was not anyone's priority. It was everyone's problem. The CCoE was able to make progress because we were willing to take on work that fell through the cracks (the triage, the documentation, the coordination) and because we had built enough credibility to ask hard questions without being dismissed.

Institutional knowledge is a single point of failure. The Bangalore team's exits created a knowledge vacuum that could not be filled by documentation alone. The migration was as much about preserving tribal knowledge as it was about moving workloads. We should have started the documentation process earlier and made it a formal part of the transition plan.

Prioritize services over workloads. This was the guiding principle that made the IOD migration tractable. By focusing on what the business actually needed from the platform, we could make targeted migration decisions instead of trying to boil the ocean. The test environments went first because they had the highest tolerance for experimentation. The database workloads went last because they had the highest risk and the least organizational bandwidth.

Epilogue

I don't know how many of the 1,000 VMs ultimately made it to AWS. I left Micro Focus before the migration was complete. But I know that the pattern we established (prioritize services, build the landing zone, iterate, document, and let the organizational pressure build over time) outlasted my involvement.

IOD taught me something that CT never could: that the ceiling on any migration is not technical. It is organizational. The hardest thing to move is not the workload. It is the institutional memory, the unwritten procedures, and the accumulated trust that keeps the lights on.

Everything else is just infrastructure.

*Built on a home lab, powered by local models, and owned by Andrew Katana.*

---

Sources & Acknowledgments: This narrative draws on the same source material as the CCoE series: my private OneNote journal (Nov 2020–Jul 2022), the ITOM Cloud and SaaS Strategy document, the Mid-year-master.pptx review (June 2022) that tracked IOD migration progress, and the CCoE narrative extracted from the Gary archive.
