Dynamically Right-Sizing Your Cloud Infrastructure

by Chris Gutmanis on January 24, 2020

The benefits of cloud service providers such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform cannot be overstated. With the click of a button (or execution of a script), a user can create anything from a simple web site to hundreds of servers, each with hundreds of cores of CPU power. This allows development resources to be created and scaled on-demand, giving development teams access to bleeding-edge technologies without having to wait for a dedicated infrastructure team to design, request, and build the supporting infrastructure. However, as we will see, with great power comes great fiscal responsibility.

A common objection to cloud migrations, and a common motivation for rolling cloud systems back to traditional on-premises infrastructure, is the persistent cost involved. In an on-prem setup, machines are right-sized up front to account for bursts of resource demand; that is, they are sized to handle some level of peak usage and sit idle when not needed. A high initial cost is paid, but after that investment, costs drop significantly as the servers largely run in maintenance mode. In this case, right-sized infrastructure is tied directly to the absolute peak system demands, regardless of how often those resources are actually required.

In the case of a cloud subscription, the up-front cost is a mere fraction of the up-front investment in physical infrastructure, but the costs persist for the life of the systems: many machines are billed with a formula along the lines of base machine cost * hours of operation. This recurring spend is arguably the most common cost-related objection to cloud providers, since physical systems are a one-time expense (until the systems need to be upgraded). That's not to say significant savings in cloud systems are impossible. Cloud services today provide on-demand scaling of resources, allowing you to ensure your systems are not only right-sized for peak demands, but also scaled down when demands are lower. There are many tactics one can use to manage cloud costs effectively; this article dives into two possible ways to alter one or both of the multipliers in that formula (base machine cost and hours of operation) to reduce compute costs. Doing so does require insight into how and, perhaps more importantly, when your systems are being used.

Example 1: Scaling Vertically

You are using Platform-as-a-Service (PaaS) managed services for SQL databases. One such database is used internally for handling batch processing operations. Investigation into the activity of the resource shows a heartbeat-like graph indicating extreme write-heavy activity for about one hour every day at 4AM corresponding to when a file import is processed. The rest of the time, usage never climbs above 10%. In the on-prem world, the server would be built to be able to handle the spikes and sit idle the rest of the day; however, taking this same approach with managed resources would result in substantial unnecessary costs for the 23 hours per day that 90% of the available resources are unused.
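To put rough numbers on this, here is a minimal sketch of the base machine cost * hours of operation formula applied to this database. The hourly rates, the 30-day month, and the assumption that the premium tier is only needed for about 1.5 hours per day are all hypothetical values chosen for illustration; they are not real provider prices.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: a flat 30-day billing month


def monthly_cost(premium_rate, standard_rate, premium_hours_per_day):
    """Monthly cost when the premium tier runs for part of each day
    and the standard tier runs for the remaining hours."""
    standard_hours_per_day = HOURS_PER_DAY - premium_hours_per_day
    daily_cost = (premium_rate * premium_hours_per_day
                  + standard_rate * standard_hours_per_day)
    return daily_cost * DAYS_PER_MONTH


# Hypothetical hourly rates, not taken from any provider's price list.
PREMIUM_RATE = 2.00
STANDARD_RATE = 0.25

always_premium = monthly_cost(PREMIUM_RATE, STANDARD_RATE, 24)
scheduled = monthly_cost(PREMIUM_RATE, STANDARD_RATE, 1.5)

print(f"Premium tier around the clock: {always_premium:.2f}")
print(f"Premium tier only around the import: {scheduled:.2f}")
```

With these made-up rates, sizing for the peak around the clock comes to 1440.00 per month, while running the premium tier only around the import comes to 258.75. Real savings depend entirely on your provider's actual pricing and on whether the lower tier can comfortably carry the off-peak load.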

In this case, system demand follows a very predictable pattern by time of day, so we can take proactive action to supply resources as needed. The actual implementation can vary based on cloud provider and personal preference; examples include Automation accounts or Functions in Azure, Lambda functions in AWS, or an orchestrator machine running scheduled tasks in any environment.

The basic logical flow would read as follows:

At 3:50 AM (10 minutes before the daily spike), scale the database server up to the premium tier.

At 5:10 AM (10 minutes after the spike ends), verify that usage has dropped to normal levels; if so, scale down to the standard tier.

If usage does not ramp up or drop down in the expected window, notify relevant teams.
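The flow above can be sketched as a small decision function that a scheduled job might call every few minutes. This is only an illustration of the logic, not a provider integration: the function name plan_action, the tier labels, the 4:30 AM ramp-up check, and the returned action strings are all hypothetical, and the actual resize and notification calls would depend on your cloud provider's API.

```python
from datetime import time

SCALE_UP_AT = time(3, 50)    # 10 minutes before the daily spike
SPIKE_END = time(5, 0)       # the import runs for about one hour from 4 AM
SCALE_DOWN_AT = time(5, 10)  # 10 minutes after the spike ends
RAMP_CHECK_AT = time(4, 30)  # assumption: usage should be high by now
NORMAL_USAGE_PCT = 10.0      # off-peak usage never climbs above 10%


def plan_action(now, tier, usage_pct):
    """Return 'scale_up', 'scale_down', 'notify', or 'none'.

    now       -- current time of day (datetime.time)
    tier      -- 'standard' or 'premium' (hypothetical labels)
    usage_pct -- current resource usage as a percentage
    """
    if SCALE_UP_AT <= now < SCALE_DOWN_AT:
        if tier == "standard":
            return "scale_up"
        # Already on premium: flag the case where the spike never arrived.
        if RAMP_CHECK_AT <= now < SPIKE_END and usage_pct <= NORMAL_USAGE_PCT:
            return "notify"
        return "none"
    # Outside the window: only scale down once usage is back to normal.
    if tier == "premium":
        return "scale_down" if usage_pct <= NORMAL_USAGE_PCT else "notify"
    return "none"


print(plan_action(time(3, 50), "standard", 5.0))
print(plan_action(time(5, 10), "premium", 8.0))
print(plan_action(time(5, 10), "premium", 80.0))
```

Keeping the decision separate from the side effects makes the schedule easy to test without touching a real database, and the same function can sit behind whichever scheduler you prefer.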

Example 2: Scaling Horizontally

You are operating a set of load-balanced Infrastructure-as-a-Service (IaaS) Virtual Machines hosting a consumer-facing web application. During off-peak hours, two machines are enough to handle incoming traffic while allowing for failover if necessary. During the workday, traffic gradually ramps up, requiring the resources of one to three additional machines, depending on the day and time of day. This usage graph has much more variance than the previous example, with some days experiencing higher load than others, but the load never arrives as an immediate spike.

In this scenario, we can be more reactive with respect to our resizing and make use of threshold-based metrics to tell us when we need to scale out or in. Again, there are a multitude of ways to accomplish this; for example, one cloud monitor that triggers a 'scale out' event when machines begin to see 60% CPU load, and a second monitor that sends the 'scale in' event when load has stayed below 40% for more than five minutes.

The logical flow here would read as follows:

If total CPU exceeds 60%, add another machine to the set. 

If the VM pool contains more than two machines and total usage drops below 40% for more than five minutes, remove one machine from the pool. Report any anomalies to relevant teams.
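As a rough sketch, the two rules above reduce to a single function that maps the current pool size and CPU reading to the next pool size. The function name next_pool_size and its parameters are hypothetical, and the upper limit of five machines is an assumption derived from the scenario (two baseline machines plus up to three additional ones). Anomaly reporting and the actual provisioning calls are left out.

```python
MIN_MACHINES = 2                 # enough for off-peak traffic plus failover
MAX_MACHINES = 5                 # assumption: baseline of two plus up to three more
SCALE_OUT_CPU_PCT = 60.0
SCALE_IN_CPU_PCT = 40.0
SCALE_IN_HOLD_SECONDS = 5 * 60   # load must stay low for more than five minutes


def next_pool_size(size, cpu_pct, seconds_below_scale_in):
    """Return the pool size the load balancer set should move to.

    size                   -- current number of machines in the pool
    cpu_pct                -- current CPU load across the pool, in percent
    seconds_below_scale_in -- how long cpu_pct has stayed under 40%
    """
    if cpu_pct > SCALE_OUT_CPU_PCT:
        # Scale out by one machine, but never past the assumed cap.
        return min(size + 1, MAX_MACHINES)
    if (size > MIN_MACHINES
            and cpu_pct < SCALE_IN_CPU_PCT
            and seconds_below_scale_in > SCALE_IN_HOLD_SECONDS):
        # Scale in by one machine, never below the two-machine floor.
        return size - 1
    return size


print(next_pool_size(2, 65.0, 0))    # load above 60%: grow the pool
print(next_pool_size(3, 35.0, 301))  # quiet for over five minutes: shrink
print(next_pool_size(2, 10.0, 900))  # already at the minimum: unchanged
```

The gap between the 60% and 40% thresholds, together with the five-minute hold, keeps the pool from flapping between sizes when load hovers near a single cutoff.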

As you can see, this example makes no assumptions about time of day and may upsize the pool for 3 hours beginning at 7AM on Monday but only need to scale out for 2 hours at 1:30PM on Wednesday.

Complex Scaling Strategies in the Real World

In all likelihood, you will see benefit from a mix and match of these (and other) cost-management strategies in your cloud environment. Oftentimes a single piece of the solution architecture will serve as a candidate for multiple scaling strategies; for example, in an internal build system, one may seek to scale up an individual agent to allocate more power to resource-intensive jobs, while at the same time scaling out to create more build agents and allow parallelization of builds. This is the principle behind many build orchestration tools which provide the ability to spawn ephemeral build agents in response to demand. Similarly, a tax preparation service may need to create significantly more instances of their webserver farms around tax day to handle higher levels of internet traffic, while also vertically scaling to increase the processing power of any given database server.

Conclusion

Like development itself, a cloud migration is forever a work in progress. Simply lifting and shifting your systems from physical hardware to managed solutions provides immense benefits in ease of use and reliability; however, best practices for things like right-sizing your systems have changed, and there are significant benefits to constantly evaluating your system architecture and looking for areas of improvement. By routinely monitoring how and when your systems are being most heavily used, you can implement solutions that get the most out of your infrastructure as well as your budget.

