Data center users today have little visibility into the performance characteristics and power consumption of the software and hardware components of the data center; and the service providers at data centers have little visibility into user application performance requirements and critical metrics such as response time and throughput. This article discusses approaches to reduce this information gap.
Society relies on an ever-larger number of cloud services for running computing applications, ranging from multimedia streaming to file storage. A key incentive for using the cloud is the ability to try out new services without investing in new hardware. As a result, essentially everyone is becoming a user of the cloud and, hence, of the data centers that host the cloud infrastructure.
Following this trend, the number of developers and researchers working on data center and cloud computing problems is also growing. Developers and researchers go beyond the typical user and seek to optimize the applications and systems they run on the cloud. A challenge with this optimization goal, however, is that cloud providers do not expose most of the performance and power consumption characteristics of the data center's software and hardware components to applications or users. As a result, optimization at any layer (e.g., optimizing an application or an OS) requires either reverse engineering to discover information about the other layers or coarse assumptions.
A recent reverse engineering example was conducted on Amazon Web Services (AWS), where the goal was to identify the best performing instance for sorting 100 TB of data.[1] After evaluating dozens of instance types and spending a sizable amount of funds, the researchers optimized their application. However, as the performance of a virtual machine (VM) instance on the cloud typically depends on the utilization of various components of the cloud infrastructure (servers, storage, network, etc.), it is unclear whether such an experiment is repeatable. This is because, one, the cloud enables consolidation of hardware, and, two, even without consolidation, shared resource usage (e.g., of network and storage) means that one load's performance can be strongly impacted by other loads running at the same time.
Another outcome of the information gap among users, providers, and resources is coarse-grained pricing and performance options. For example, in a typical cloud setting it is possible to determine how many cycles an application took to complete, but it is not easy to determine its underlying resource usage, which determines power and cooling needs and thus impacts the overall cost.
In my research lab, in collaboration with the Massachusetts Open Cloud (MOC, a public cloud project led by Boston University), we are working on developing a monitoring infrastructure that can expose rich information about all layers of the cloud (facility, network, hardware, OS, middleware, application, and user layers) to all the other layers. In this way, we hope to reduce the need for costly reverse engineering or coarse-grained assumptions.
Detailed information flow across the layers of the cloud is also key to developing intelligent, adaptive, and realistic performance and energy optimization techniques. For example, if we want to regulate data center power consumption while maintaining service level agreements (SLAs), we believe that it will be essential to be able to identify application-specific power/performance characteristics, and then determine the best power management technique to be applied under SLA constraints.
This article describes a data center monitoring system that is able to collect and organize data coming from multiple layers of a cloud data center. The monitoring system can be used for a number of applications, including security/vulnerability analysis or integration of data centers into smart grid programs. In this article, I will elaborate on the energy cost management opportunities enabled by the monitoring infrastructure, such as “data center demand response.”
A number of monitoring systems already exist. Amazon CloudWatch monitors users' virtual resources, but does not provide information on physical resources. On the academic research side, GMonE and PCMONS are examples of proposed cloud monitoring systems.[2,3] Some methods focus on providing advanced capabilities to providers without exposing information to end users or researchers. Typically, existing methods either cater to specific end-user applications or require more invasive implementations, such as installing agents on each user VM.
Our design of MOC’s monitoring architecture assumes a standard Infrastructure-as-a-Service (IaaS) cloud setup that is composed of switches, storage, and servers on the physical layer, and the virtual layer managed by OpenStack. The MOC is running in the Massachusetts Green High Performance Computing Center. Figure 1 shows the main components of our monitoring architecture. Our architecture is composed of four layers (from left to right in the figure):
- Data collection layer, where monitoring information from different layers is collected in a scalable, low-overhead way.
- Data retention and consolidation layer, where collected information is accumulated in a time-series database (InfluxDB).
- Services layer, which includes services such as alerting and metering, as well as privacy-preserving API services that expose monitoring data to users.
- Advanced monitoring applications layer, where a wide variety of additional services that utilize the monitoring data operate.
The monitoring infrastructure makes use of existing powerful system software such as Ceilometer, LogStash, and Sensu, as shown in Figure 1. We integrate these telemetry services into scalable storage systems such as MongoDB and ElasticSearch. We use the time-series database InfluxDB for maintaining the information collected by Sensu, which is an alerting system. We also consolidate various types of data collected by Ceilometer and LogStash in InfluxDB.
The multidimensional time-series database enables formulating complex queries, such as queries that expose the states of data center components to cloud users, or queries that let cloud administrators understand the impact of changes made in the physical layer on application performance. The third layer, the services layer, is where data collected from the different hardware and software layers is used to perform various MOC services, such as “metering” for accurate assessment of resource use, debugging services, or vulnerability/anomaly detection. In addition to these fundamental services, our MOC monitoring system also provides a Security API (which provides networking and packet information from users willing to supply such data) and a monitoring API that exposes the collected data through a user-friendly interface.
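To make cross-layer queries possible, data points from different layers must be tagged consistently before they land in InfluxDB. The sketch below shows how per-layer samples could be consolidated into InfluxDB's line protocol; the measurement and tag names are illustrative, not the actual MOC schema.

```python
# Sketch: consolidating per-layer metrics into InfluxDB line-protocol strings.
# Measurement and tag names here are illustrative, not the actual MOC schema.

def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one data point as an InfluxDB line-protocol string."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

# A physical-layer power sample and a virtual-layer CPU sample, tagged so
# that later queries can join them on the hosting server.
points = [
    to_line_protocol("server_power", {"host": "moc-node-17"},
                     {"watts": 212.5}, 1456000000000000000),
    to_line_protocol("vm_cpu", {"host": "moc-node-17", "vm": "vm-42"},
                     {"util": 0.83}, 1456000000000000000),
]
for p in points:
    print(p)
```

Because both samples carry the same `host` tag, a single query can correlate a VM's utilization with the physical server's power draw.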
One of the end goals of the MOC monitoring system is to provide a set of applications/services to cloud users, researchers, and administrators. For example, we are developing privacy-preserving APIs that expose to end users the state of the physical resources hosting their VMs. These APIs will enable users to automate achieving stable and repeatable performance. We are also working on correlating and exposing virtual-layer performance and physical-layer power utilization data to cloud administrators. Exposing such data will enable more efficient energy/cost management strategies and the participation of data centers in advanced energy market programs.
CASE STUDY: REGULATION SERVICE RESERVES
In a 2014 article, “Data Centers in the Smart Grid” (Circuit Cellar 286), I discussed mechanisms and policies for data centers to participate in demand response programs. Demand-side power capacity reserve provision is emerging as a major sustainable solution for integrating the increasing amount of intermittent renewables into the grid. For example, PJM, the largest US Independent System Operator (ISO), has been pursuing such programs since 2006.
Using the data collected from our MOC monitoring infrastructure, my research team simulated a policy that my students and I recently proposed for enabling dynamic data center “regulation service reserves” (RSR) participation.[4] This policy enables the data center to track a power signal given by the ISO while guaranteeing reasonable workload performance.
Many control knobs (virtual resource limits, voltage and frequency scaling, etc.) can be used for adjusting the power per server in a data center. In this simulation, we used the power and performance data collected from MOC servers when changing the number of vCPUs allocated to a given application. We ran each application alone on a server under a large set of possible configurations, and measured the power consumption via our monitoring infrastructure along with application performance.
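Profiling each application offline under every configuration yields a lookup table that a runtime policy can consult. The sketch below shows the idea: given a per-application profile of (power, runtime) per vCPU count, pick the configuration closest to a power target that still meets a runtime SLA. The profile numbers are purely illustrative, not measured MOC values.

```python
# Sketch of configuration selection from offline profiles.
# profile: vCPU count -> (avg_power_watts, runtime_seconds); values illustrative.
profile = {1: (120, 400), 2: (150, 220), 4: (190, 130), 8: (240, 90)}

def pick_config(power_target, sla_runtime):
    """Among configs meeting the SLA runtime, pick the one whose average
    power is closest to the target set by the power management policy."""
    feasible = [(v, p, t) for v, (p, t) in profile.items() if t <= sla_runtime]
    if not feasible:
        return None  # no configuration can meet the SLA
    return min(feasible, key=lambda c: abs(c[1] - power_target))[0]

print(pick_config(power_target=160, sla_runtime=250))  # -> 2 (vCPUs)
```

A power-tracking policy can call such a selector every control interval, trading vCPU allocation against power while keeping each job inside its SLA.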
Power consumption of a target MOC server in the low-power (sleep) state is 40 W, and waking up a server from sleep takes close to 2 minutes. We assumed that data center utilization is around 50%, which is a typical practical scenario. As the control policy scales with the size of the cloud data center, we conducted experiments assuming a cluster of 100 servers in the data center participating in RSR.
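With these parameters, the cluster's usable regulation band can be estimated from the per-server power range. The sketch below computes it; the 40 W sleep power and the 100-server, 50%-utilization setup come from the text, while the active-server dynamic range (120-240 W) is an assumed placeholder.

```python
# Sketch of the cluster-level power band available for RSR tracking.
# P_SLEEP, N, and util come from the case study; P_MIN/P_MAX are assumed.
N = 100          # servers participating in RSR
util = 0.5       # fraction of servers active
P_SLEEP = 40.0   # watts per sleeping server
P_MIN, P_MAX = 120.0, 240.0  # assumed active-server dynamic power range

active = int(N * util)
p_floor = active * P_MIN + (N - active) * P_SLEEP  # all active servers throttled
p_ceil = active * P_MAX + (N - active) * P_SLEEP   # all active servers at full power
p_avg = (p_floor + p_ceil) / 2                     # bid this as the average draw

# Symmetric regulation capacity: how far the cluster can deviate either way
# from its average while tracking the ISO's signal.
capacity = min(p_avg - p_floor, p_ceil - p_avg)
print(f"avg {p_avg/1000:.1f} kW, regulation capacity {capacity/1000:.1f} kW "
      f"({100 * capacity / p_avg:.1f}% of average)")
```

Under these assumed numbers the band works out to roughly 27% of the average draw, in the same ballpark as the 24.5-30.5% capacities reported below.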
Figure 2 provides results of RSR provision when servicing two types of workload in the MOC: bodytrack and freqmine (both from PARSEC suite). We select these two jobs because they have very different profiles in both power consumption and performance: freqmine has a higher maximal power consumption value, and thus a larger dynamic power range than bodytrack, and its running time is several times longer as well. The figure shows that over 1 hour (during which time a number of jobs of a given type arrive at the cluster), our 100-server cluster is capable of tracking the RSR signal very accurately for both types of workloads.
More than 90% of the incoming bodytrack jobs in this example can be serviced with a degradation ratio smaller than 2.5× the fastest running time (fastest time refers to no waiting in job queues and running with the highest number of vCPUs). For freqmine, the 90th-percentile degradation is close to 1× the fastest running time. Freqmine fares better than bodytrack because of its longer runtime, which makes the job's waiting time in the queue relatively small, so the overall degradation is smaller.
We also computed the potential credits received by our cluster for RSR participation. The hourly electrical cost is a function of average power consumed, but the demand-side receives credits as a function of the regulation capacity provided. In these experiments, we were able to achieve 24.5% regulation capacity for bodytrack, and 30.5% capacity for freqmine, both of which are highly promising, considering these would translate to a similar percentage of savings in today’s advanced power markets.
CASE STUDY: PEAK SHAVING
The electricity bill of large industrial and commercial customers such as data centers is composed of two parts: the charge for the total energy consumed and the charge for the peak power within the billing period (weekly, monthly, or annually). Market operators put extra charges on peak power to avoid power shortages during on-peak usage periods. In addition, as the power infrastructure for a data center has to be built for the peak power requirement, reducing peak power also helps reduce this one-time infrastructure cost.
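The two-part bill structure is easy to make concrete. The sketch below uses assumed rates (not an actual tariff) and shows that a trace with one brief spike can cost more than a flat trace with higher total energy, which is exactly what motivates peak shaving.

```python
# Sketch of a two-part electricity bill: energy charge plus peak-demand
# charge. Rates are illustrative placeholders, not an actual tariff.
ENERGY_RATE = 0.10  # $ per kWh
DEMAND_RATE = 15.0  # $ per kW of billing-period peak

def monthly_bill(hourly_kw):
    """hourly_kw: list of average power draws (kW), one per hour."""
    energy_charge = ENERGY_RATE * sum(hourly_kw)  # kWh, since dt = 1 h
    demand_charge = DEMAND_RATE * max(hourly_kw)
    return energy_charge + demand_charge

flat = [100.0] * 720            # steady 100 kW for a 720-hour month
spiky = [90.0] * 719 + [200.0]  # less total energy, one 200 kW peak hour

print(monthly_bill(flat), monthly_bill(spiky))  # spiky costs more
```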
Peak shaving caps the power usage of a data center within an upper bound. It is usually implemented by using energy storage devices to modulate the power consumption during the off-peak and on-peak periods. Server-level dynamic power capping along with job scheduling techniques (e.g., load shedding and shifting) is another possible solution for peak shaving, when jobs serviced in the data center are delay-tolerant.
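The energy-storage variant of peak shaving can be sketched as a simple greedy rule: discharge the battery when demand exceeds the cap, recharge when there is headroom. The demand trace and battery size below are illustrative; the 70% cap mirrors the 30% shaving target in the case study.

```python
# Sketch of battery-based peak shaving with a greedy charge/discharge rule.
# Demand trace and battery capacity are illustrative.
def shave(demand_kw, cap_kw, battery_kwh, dt_h=1.0):
    """Return the grid power trace after capping demand with a battery."""
    charge = battery_kwh  # start with a full battery
    grid = []
    for d in demand_kw:
        if d > cap_kw:  # on-peak: discharge to hold the grid draw at the cap
            used = min(d - cap_kw, charge / dt_h)
            charge -= used * dt_h
            grid.append(d - used)
        else:           # off-peak: recharge without exceeding the cap
            refill = min(cap_kw - d, (battery_kwh - charge) / dt_h)
            charge += refill * dt_h
            grid.append(d + refill)
    return grid

demand = [60, 70, 100, 120, 90, 60]  # kW over six hours; peak is 120 kW
print(shave(demand, cap_kw=84, battery_kwh=60))  # peak capped at 84 kW (70%)
```

If the battery is undersized for the on-peak energy, the cap is violated; in practice the capping and job-scheduling techniques mentioned above handle the residual.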
Similar to the case study for RSR, my team simulated peak shaving using the data collected from our MOC monitoring infrastructure. We focused on simulating server-level capping because such control policies for server power management rely heavily on accurate monitoring.
We simulated a peak shaving policy we introduced recently.[5] Figure 3 shows highly promising results for 30% peak shaving (i.e., the new upper bound of the power consumption is limited to 70% of the original peak).
This article has discussed the architecture of a scalable, multilayer monitoring infrastructure for MOC, an open regional public cloud, and presented motivating examples that leverage the collected data for advanced performance and energy management methods. Our implementation is ongoing, yet the first results show the substantial benefits of the proposed infrastructure for developing techniques that improve cloud and data center sustainability.
I believe our monitoring infrastructure will enable optimization of hardware and software components at all layers, including application, OS, hypervisor/cloud, and physical infrastructure, and support a number of diverse research projects. We plan to make the monitoring infrastructure available to users and researchers of MOC in the future.
Author’s Note: I would like to acknowledge our Boston University “monitoring team” at Massachusetts Open Cloud, who created and built the infrastructure described in this article: Dr. Ata Turk, Hao Chen, Ozan Tuncer, Hua Li, Qingqing Li, and Professor Orran Krieger.
[1] M. Conley, A. Vahdat, and G. Porter, “Achieving Cost-Efficient, Data-Intensive Computing in the Cloud,” in Proceedings of the Sixth ACM Symposium on Cloud Computing (SoCC ’15), 2015.
[2] J. Montes, A. Sanchez, B. Memishi, M. S. Perez, and G. Antoniu, “GMonE: A Complete Approach to Cloud Monitoring,” Future Generation Computer Systems, 2013.
[3] S. A. de Chaves, R. B. Uriarte, and C. B. Westphall, “Toward an Architecture for Monitoring Private Clouds,” IEEE Communications Magazine, 2011.
[4] H. Chen, M. C. Caramanis, and A. K. Coskun, “The Data Center as a Grid Load Stabilizer,” in Asia and South Pacific Design Automation Conference (ASP-DAC), 2014.
[5] H. Chen, M. C. Caramanis, and A. K. Coskun, “Reducing the Data Center Electricity Costs Through Participation in Smart Grid Programs,” in IEEE International Green Computing Conference (IGCC), 2014.
PUBLISHED IN CIRCUIT CELLAR MAGAZINE • MARCH 2016 #308