The Path of Cloud

A post like this would usually be titled the path to cloud.  Based on recent conversations, though, cloud is already leveraged in most organizations.  Thus, we are in the cloud, not moving to it.  But what is the best way to weave these pockets of cloud together?  I hear questions about the nature of Platform as a Service, and about public cloud versus private cloud versus hybrid cloud.  Sometimes discussions turn to how to rein in rogue public cloud use or how to provide seamless access to multiple Software as a Service offerings.  The success of those efforts is the end state.  They should not be goals for the beginning of the journey.

As a former athlete and forever competitor, I have learned that developing any skill begins with the basics.  All training focuses on ensuring form is correct before moving on to complex movements or thoughts.  This is one hundred percent true of cloud as well.  The purpose of enterprise cloud is for IT to become a service broker, or IT as a Service.  To achieve this, we need to automate everything IT has been trying to do for the last 30 years.

These are my guidelines for a successful path of cloud:

1.    Solidify virtualization

If your organization is not comfortable with virtualization, this will be a large barrier to cloud.  Ensure the standard architecture for virtualization can support all workloads in both performance and capacity.  Implement reference architectures across main data centers and remote sites.  Reduce the variables of physical hardware, software versions, patch levels, and virtual machine templates to a handful of approved types.  Remove the process of negotiating for resources with application owners and developers.  Leverage templates only.
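
To make the “handful of approved types” idea concrete, here is a minimal sketch of a constrained template catalog with a simple validation gate.  The template names and sizes are hypothetical, not tied to any particular product.

```python
# Illustrative sketch: a small catalog of approved VM templates and a
# validation gate that refuses anything outside it. Names and sizes are
# made up for the example.
APPROVED_TEMPLATES = {
    "linux-small":  {"vcpu": 2, "ram_gb": 8,  "disk_gb": 60},
    "linux-large":  {"vcpu": 8, "ram_gb": 32, "disk_gb": 200},
    "windows-std":  {"vcpu": 4, "ram_gb": 16, "disk_gb": 100},
}

def validate_request(template_name: str) -> dict:
    """Reject anything that is not one of the handful of approved types."""
    if template_name not in APPROVED_TEMPLATES:
        raise ValueError(
            f"'{template_name}' is not an approved template; "
            f"choose from {sorted(APPROVED_TEMPLATES)}"
        )
    return APPROVED_TEMPLATES[template_name]

print(validate_request("linux-small"))
```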

2.    Implement cloud management and monitoring

Once virtualization is standardized across the enterprise, performance and capacity management is key to ensuring internal and external customer satisfaction.  Older monitoring tools were not built for cloud agility and hypervisor capabilities.  Capacity management is also different for virtualized environments, as the varied workloads and peaks running on the same hardware are difficult to account for with manual methods.

This step does not mean all previous tools are discarded.  Instead, the ones providing unique data should be retained and rolled into the main analysis engine.  Where features overlap, however, it is an opportunity to prune the solutions in the environment down to a manageable number.  This also lowers CapEx and OpEx for IT by not maintaining support on redundant software.
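
On the capacity point above, here is a tiny sketch of why manual, average-based methods fall short once workloads share hardware: averages hide coincident peaks.  All figures are invented.

```python
# Illustrative sketch: capacity planning that accounts for coincident peaks
# rather than averages. Host and workload figures are made up.
from statistics import mean

HOST_CAPACITY_GHZ = 64.0

# Hourly CPU demand (GHz) sampled for three workloads sharing one host.
workloads = {
    "web":   [4, 6, 18, 20, 9, 5],
    "batch": [2, 2, 2, 22, 24, 3],
    "db":    [8, 9, 10, 12, 11, 9],
}

# Summing per-workload averages hides the hours when peaks coincide.
avg_demand  = sum(mean(samples) for samples in workloads.values())
peak_demand = max(sum(hour) for hour in zip(*workloads.values()))

print(f"average-based demand:   {avg_demand:.1f} GHz")
print(f"coincident peak demand: {peak_demand:.1f} GHz")
print(f"headroom at peak:       {HOST_CAPACITY_GHZ - peak_demand:.1f} GHz")
```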

3.    Implement configuration management

The solution required for this is not unique to cloud environments.  The important requirements are that all hypervisors and operating systems are supported.  The configuration management system should tie into cloud management and monitoring for a complete root cause picture when diving into issues.
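
As a rough illustration of what configuration management adds, here is a minimal drift-detection sketch that compares what a host reports against a documented baseline.  The setting names and values are hypothetical.

```python
# Illustrative sketch: comparing a host's reported settings against the
# documented baseline to flag drift. Keys and values are hypothetical.
baseline = {"ntp": "pool.corp.example", "ssh_root_login": "no", "log_level": "info"}

def find_drift(actual: dict) -> dict:
    """Return settings that differ from, or are missing in, the baseline."""
    return {
        key: {"expected": expected, "actual": actual.get(key, "<missing>")}
        for key, expected in baseline.items()
        if actual.get(key) != expected
    }

reported = {"ntp": "pool.corp.example", "ssh_root_login": "yes"}
print(find_drift(reported))
```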

4.    Implement log aggregation and analysis

This toolset is also not unique to cloud environments but should support all standardized infrastructure and applications.  Integration with the management and monitoring tool is ideal for root cause analysis.  Understanding system and application logs is the only way to know the underpinnings of any environment.
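
Much of the value of aggregation is simply getting every source into one time-ordered view around an incident.  A minimal sketch, with entirely made-up log entries:

```python
# Illustrative sketch: pulling log lines from several sources into one
# time-ordered stream around an incident window. Log content is invented.
from datetime import datetime, timedelta

incident = datetime(2015, 6, 1, 2, 14)
window = timedelta(minutes=5)

logs = [
    ("hypervisor", datetime(2015, 6, 1, 2, 12), "vmnic2 link down"),
    ("guest-os",   datetime(2015, 6, 1, 2, 13), "NFS server not responding"),
    ("app",        datetime(2015, 6, 1, 2, 14), "database connection timeout"),
    ("app",        datetime(2015, 5, 31, 9, 0), "routine maintenance complete"),
]

# Keep only entries near the incident, ordered by time across all sources.
relevant = sorted(
    (ts, source, message)
    for source, ts, message in logs
    if abs(ts - incident) <= window
)
for ts, source, message in relevant:
    print(f"{ts:%H:%M} [{source}] {message}")
```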

5.    Implement cost analysis

The main purpose of cost analysis is not to charge back to the business units for their overall environment consumption.  Chargeback is a great feature, but more importantly, IT must understand its spending.  IT must also strive to become an innovation center instead of the cost center it is today.  Recognizing the least expensive location for workloads and reinforcing workload sizing is critical to success in the later stages of cloud.
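
To illustrate “recognizing the least expensive location,” here is a small sketch comparing one workload’s monthly cost across two candidate locations.  The rates are invented placeholders, not real pricing.

```python
# Illustrative sketch: comparing the monthly cost of one workload across
# candidate locations. All rates are invented placeholders.
RATES = {                      # (per vCPU, per GB RAM, per GB disk) per month
    "on-prem":      (18.0, 4.0, 0.10),
    "public-cloud": (25.0, 3.5, 0.08),
}

def monthly_cost(location: str, vcpu: int, ram_gb: int, disk_gb: int) -> float:
    cpu_rate, ram_rate, disk_rate = RATES[location]
    return vcpu * cpu_rate + ram_gb * ram_rate + disk_gb * disk_rate

workload = {"vcpu": 4, "ram_gb": 16, "disk_gb": 200}
for location in RATES:
    print(f"{location}: ${monthly_cost(location, **workload):.2f}/month")
```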

Once these five steps are complete, or at least well underway, then it is time to automate.  Cloud is automating the request and delivery of a service.  If a bare operating system is deployed, this is Infrastructure as a Service.  If it is a development platform, a web server, a database server, etc., then it is Platform as a Service.  If the service is a ready-to-use application, this is Software as a Service.

The reason I am providing these definitions is that I have found, to my surprise, that they differ across the industry.  The delineation of these terms is not to require folks to adhere to them, but instead to show the complexity of each.  Each service builds on the one before it.  Thus, when I am asked where to start with automation, I always suggest Infrastructure as a Service because it is the simplest to provide.  Simple is not to be confused with easy.  When integrating requests with approvals, ticketing systems, IP management systems, networks, security, global DNS, operating systems, backups, monitoring, change management databases, and other tools, automation is a big step.  Do not let this list of integrations deter you.  It is the end state.  The beginning is self-service access to basic virtual machines.
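
As a sketch of that beginning state, the following shows a self-service request that does nothing more than validate against an approved catalog and record the order; the integrations listed above would be layered on later.  Everything here (function names, templates, statuses) is hypothetical.

```python
# Illustrative sketch: the "beginning" of IaaS automation -- a self-service
# request that only validates against the approved catalog and records the
# order. Approvals, ticketing, IPAM, DNS, backups, monitoring, and CMDB
# integrations would come in later phases. All names are hypothetical.
import uuid

APPROVED_TEMPLATES = {"linux-small", "linux-large", "windows-std"}

def request_vm(requester: str, template: str) -> dict:
    if template not in APPROVED_TEMPLATES:
        raise ValueError(f"{template} is not in the self-service catalog")
    return {
        "id": str(uuid.uuid4()),
        "requester": requester,
        "template": template,
        "status": "queued-for-deployment",
    }

print(request_vm("app-team", "linux-small"))
```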

Keep in mind that the rewards of cloud are vast.  The new measurement for an IT organization is not percent virtualized or time to deploy a template.  It is not cost per GB of storage or the number of network ports used in a data center.  Innovation percentage versus “keeping the lights on” is the new measurement.  Early adopter organizations are reaching 50% to 75% on the innovation side of the equation.  This is only accomplished through the path of cloud.

Documentation. The Necessary Evil.

Documentation might not be evil, per se, but I have not heard the words, "Oh my gosh!  I get to document my implementation.  That is so exciting!"  Or "I love filling out tickets, change control forms, CMDBs, etc.!"  If you are out there and do feel this way, you are certainly a rare breed and could probably find a job anywhere, at any time.  For the rest of us, documentation is an often-dreaded afterthought, once the great fun of designing and deployment is over.

However, documentation is very important.  Without it, infrastructures, applications, and security function as a black box.  The environment is a mystery to anyone who did not implement the solution from the virtual ground up.  While some might consider this job security, it is not the best way to run an enterprise.  At this point, you might be thinking of the many times documentation would have been useful to you.  But, let me supply some examples nonetheless.

A prime example is the datacenter move.  Whether the move is due to a merger or acquisition, or it is simply time to relocate to a better-suited facility, knowing the ins and outs of the infrastructure and applications is critical to a successful migration.  So many times the project starts with the question, “What is running over there?” followed by, “Who is using it?”  Once those questions have been answered satisfactorily, discovering the physical and logical layouts of said datacenter and applications must be tackled.  Generally, the project launches into an investigative phase that lasts months or years, while the actual move usually takes no longer than a few scheduled weekends.  Of course, these scheduled outage windows do not guarantee success.  Surprises occur, and the IT organization spends thirty hours with all hands on deck delving into the system looking for the cause of the issue.  If the answer is not found, the application must be rolled back to the original datacenter.  Sometimes the entire migration project never moves past the discovery phase because the task is too daunting, and the enterprise is left stuck in a costly datacenter on aging and insecure infrastructure.

Another important situation is the moment disaster strikes.  Maybe a single system or application failed.  Maybe a natural disaster or power event affects the entire datacenter.  Regardless, production workloads must be brought back online quickly.  But how do we do this?  We might have backups to restore (fingers crossed that the restore actually works).  Or data could be replicated to a secondary or tertiary site, waiting to be powered on as systems.  Even if we succeed in bringing an individual server online, that only gets the environment to a crash-consistent state.  The application might not yet be functional.  In addition to assessing the health of the environment, we must identify where IPs and host names are set, what the effect is if those values are changed, what inter-system communication occurs and how, what startup dependencies exist internally or externally, and so on.  Stress is already high in a disaster; think how peaceful it would be to have the recovery work completely documented.
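
As one example of what documented recovery work enables, startup dependencies written down as data can be turned directly into a recovery boot order.  The system names below are examples only.

```python
# Illustrative sketch: documented startup dependencies turned into a
# recovery boot order via a topological sort. System names are examples.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# "system": {systems it depends on}
dependencies = {
    "database": set(),
    "app-tier": {"database"},
    "web-tier": {"app-tier"},
    "dns":      set(),
    "email":    {"dns", "database"},
}

boot_order = list(TopologicalSorter(dependencies).static_order())
print("Recovery start order:", boot_order)
```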

The above examples were extreme.  However, other very important needs for documentation arise, no matter how trivial they seem at first glance.  It might be time to promote an application into production that developers have tinkered with for the last six months in a test environment.  Go-live is in a week.  How is that accomplished without thorough documentation?  An application might already be in production, but it is time to integrate it with another system, enhance the security, or just upgrade the application.  Without proper documentation of the system configuration, well, the discovery work will far surpass the originally planned project.

I could provide statistics about how much money companies can save in datacenter migrations and consolidations alone (millions of dollars, often just in power savings), or about the likelihood of a large enterprise sustaining a profitable business model after failing to recover from a disaster (in the low tens of percent).  But it is more interesting to share a really irritating experience I had because I did not provide detailed and clear instructions for an upgrade I was leading.

Years ago, when Active Directory integrated DNS was only four years old, I joined a company that still used BIND on UNIX servers for internal DNS alongside an Active Directory forest and Windows DHCP.  After convincing everyone that AD-integrated DNS was secure and reliable, I migrated the company to AD DNS during a Windows 2003 forest-wide upgrade.  I performed the upgrades at the main datacenter and then provided instructions to my other team members to replace the Windows 2000 domain controllers in the field.  I basically said, “Check the box to install DNS and choose AD integrated.”

A few weeks after the transition, support calls started rolling in that e-mail wasn’t working.  We all know e-mail is the most important application in a company, regardless of where it is tiered on the official application inventory.  So, this was a big deal.  It turned out e-mail was malfunctioning because the MX record had mysteriously disappeared from the forward lookup zone.  I quickly re-entered the record manually in the management console and e-mail was back online.  However, this kept happening every week or two.  My credibility was on the line, not to mention I was not getting enough sleep during my on-call rotation because e-mail kept going down at a global company.

After reading through many, many AD log files, I finally pieced together that this happened whenever domain controllers were replaced in remote locations.  In speaking with the person who touched the last DC, I discovered he had created the Forward and Reverse Lookup zones manually after the promotion instead of letting them replicate on their own (as AD-integrated DNS is designed to function).  The A and PTR records would remain, but other useful records in AD DNS, like the MX records, would not reappear.  You can imagine my fury at the revelation!  But it could all have been avoided if I had explicitly documented the replacement steps for the administrator instead of telling them in passing, “Install AD Integrated DNS.”

Why am I writing about documentation in the cloud era?  This is the age of third-generation applications, and mobile and social apps are the only types we wish to develop.  Self-healing infrastructures are available, and environments automatically deploy more systems to support the user load as thresholds are reached.  This is all true, but it does not excuse us from documenting.  One cannot automate something until the process is defined.  And how is the process defined?  It is documented; otherwise it is ambiguous and open to interpretation.  This is true for any deployment to the hybrid cloud, as many options exist.  Explicit documentation ensures the system configuration is clear.
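
A minimal sketch of what that could look like: a deployment step captured as structured data, so the same record serves as human-readable documentation and as input to automation.  The field names and values are hypothetical (and yes, the example borrows from my DNS story above).

```python
# Illustrative sketch: a deployment step written down as structured data,
# so one record doubles as documentation and as automation input.
# Field names and values are hypothetical.
from dataclasses import dataclass

@dataclass
class DeploymentStep:
    name: str
    target: str
    action: str
    rollback: str
    notes: str = ""

runbook = [
    DeploymentStep(
        name="replace-dc",
        target="remote-site-dc01",
        action="promote domain controller; select AD-integrated DNS and "
               "let the zones replicate on their own",
        rollback="demote the new DC and re-point clients to the old DC",
        notes="do NOT create Forward/Reverse Lookup zones manually",
    ),
]

for step in runbook:
    print(f"{step.name}: {step.action}")
```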

I recently visited with a few customers who confirmed my suspicion that new services such as self-service provisioning, automation, and network and security virtualization will cause issues without proper documentation.  Once an application is brought online and runs in the cloud, the infrastructure and security teams must reverse engineer the implementation to ensure they understand the building-block components, the configuration, and the overall support mechanisms required.  I wouldn’t go so far as to say the cloud era brings more work to these teams, but it does leave them in a precarious position for day-two operations and support.

During a presentation about cloud-native applications, I recently asked a co-worker how developer and application promotion automation includes documentation.  He thought for a moment and said, “That is a really good question that I don’t know the answer to.”  He is one of the brightest minds I know on this subject, so if he is not aware of a solution, it might not exist.  And we must resolve this soon.  Otherwise, we will continue to suffer uncharted applications and datacenters.  As folks retire or move on, the knowledge of how systems function will be lost.  Future workers will start from the beginning; they will reverse engineer the systems that generate millions, if not billions, of dollars in revenue.  They will not be able to focus on innovation that helps the company’s profits or on implementing emerging technologies.  And this hurts everyone – the individual and the company alike.